Using BeautifulSoup to parse HTML and extract press briefing URLs


Converting HTML text into a data object

A webpage is just a text file in HTML format. And HTML-formatted text is ultimately just text. So, let's write our own HTML from scratch, without worrying yet about "the Web":

htmltxt = "<p>Hello World</p>"

The point of HTML parsing is to be able to efficiently extract the text values in an HTML document (e.g. Hello World) apart from the HTML markup (e.g. <p></p>).

We'll start out by using Beautiful Soup, one of Python's most popular HTML-parsing libraries.

Importing the BeautifulSoup constructor function

This is the standard import statement for using Beautiful Soup:

from bs4 import BeautifulSoup

The BeautifulSoup constructor function takes in two string arguments:

  1. The HTML string to be parsed.
  2. Optionally, the name of a parser. Without getting into the background of why there are multiple implementations of HTML parsing, for our purposes, we will always be using 'lxml'.

So, let's parse some HTML:

from bs4 import BeautifulSoup
htmltxt = "<p>Hello World</p>"
soup = BeautifulSoup(htmltxt, 'lxml')

The "soup" object

What is soup? As always, use the type() function to inspect an unknown object:

type(soup)
# bs4.BeautifulSoup

OK, at least we know that soup is not just plain text. The more complicated answer is that soup is now an object with far more methods and complexity than a plain Python string. However, this complexity is worth diving into, because the BeautifulSoup-type object has methods specifically designed for working efficiently with HTML.
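
For example, the soup object has a prettify() method, which returns the parsed document as indented HTML; it also reveals that the lxml parser has wrapped our fragment in <html> and <body> tags:

print(soup.prettify())
# <html>
#  <body>
#   <p>
#    Hello World
#   </p>
#  </body>
# </html>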

Extracting text from soup

The BeautifulSoup object has a text attribute that returns the plain text of an HTML string, sans the tags. Given our simple soup of <p>Hello World</p>, the text attribute returns:

soup.text
# 'Hello World'

Let's try a more complicated HTML string:

soup = BeautifulSoup("""<h1>Hello</h1><p>World</p>""", 'lxml')
soup.text
# 'HelloWorld'

And here's an HTML string that contains a URL:

mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>"""

soup = BeautifulSoup(mytxt, 'lxml')
soup.text
# 'Hello World\nThis is a link'

Basically, the BeautifulSoup object's text attribute returns a string stripped of any HTML tags and metadata.
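
Note that text simply concatenates the text nodes, which is why we got 'HelloWorld' with no space earlier. An alternative worth knowing about is the get_text() method, which accepts a separator string and a strip keyword argument:

soup = BeautifulSoup("""<h1>Hello</h1><p>World</p>""", 'lxml')
soup.get_text(' ', strip=True)
# 'Hello World'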

Finding a tag with find()

Generally, we don't want to just spit out all of the tag-stripped text of an HTML document. Usually, we want to extract text from just a few specific elements.

Let's re-use our "complicated" HTML string from above:

mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>
"""

It contains 3 HTML tags:

  1. A headline, <h1>
  2. A paragraph, <p>
  3. Within that paragraph, a hyperlink, <a>

To find the first element by tag, we use the BeautifulSoup object's find() method, which takes a tag's name as the first argument:

soup = BeautifulSoup(mytxt, 'lxml')
soup.find('a')
# <a href="http://example.com">link</a>

Again, use type() to figure out what exactly is being returned:

type(soup.find('a'))
# bs4.element.Tag

What's the difference between a Tag and a BeautifulSoup object? Roughly speaking, a BeautifulSoup object represents the whole parsed document, while a Tag represents a single element; what's important to us is their similarities. A Tag object also has a text attribute:

soup.find('a').text
# 'link'

Try find() with the other tags:

soup.find('p')
# <p>This is a <a href="http://example.com">link</a></p>
soup.find('p').text
# 'This is a link'
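
One caveat worth noting (an aside, not part of the exercise): when find() can't locate a matching tag, it returns None instead of raising an error. Calling .text on that None result is a common source of confusing AttributeErrors:

soup.find('table')
# returns None, since our snippet contains no <table> tag
soup.find('table').text
# AttributeError: 'NoneType' object has no attribute 'text'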

Extracting attributes from a tag with attrs

For the White House press briefings – and other HTML-parsing exercises – we want more than just the rendered text of the HTML. We'll want some of the meta attributes of the HTML, such as the href values for link tags.

The Tag object has the attrs attribute, which returns a dictionary of key-value pairs. Let's start from the top:

from bs4 import BeautifulSoup
mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>
"""
soup = BeautifulSoup(mytxt, 'lxml')
mylink = soup.find('a')

To extract the value of the href attribute from the mylink object, use attrs:

type(mylink.attrs)
# dict
mylink.attrs
# {'href': 'http://example.com'}
mylink.attrs['href']
# 'http://example.com'

What about the other tags in our HTML snippet? They have no attributes and thus will have empty dictionaries for their attrs attributes:

soup.find('h1').attrs
# {}
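
Since attrs is a plain dict, asking it for a key that doesn't exist raises a KeyError. When an attribute might be missing, the dict's get() method (a general Python idiom, not anything specific to Beautiful Soup) returns None instead:

soup.find('h1').attrs['href']
# KeyError: 'href'
soup.find('h1').attrs.get('href')
# None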

Finding multiple elements with find_all

OK, let's step up the complexity; what if there are multiple <a> tags from which we want to extract href and text values? We use the find_all() method, which returns a collection of elements:

moretxt = """
<p>Visit the <a href='http://www.nytimes.com'>New York Times</a></p>
<p>Visit the <a href='http://www.wsj.com'>Wall Street Journal</a></p>
"""
soup = BeautifulSoup(moretxt, 'lxml')
tags = soup.find_all('a')
type(tags)
# bs4.element.ResultSet

A ResultSet acts very much like other Python sequences, such as a list:

len(tags)
# 2
tags[0].text
# 'New York Times'
tags[0].attrs['href']
# 'http://www.nytimes.com'
for t in tags:
    print(t.text, t.attrs['href'])
# New York Times http://www.nytimes.com
# Wall Street Journal http://www.wsj.com

However, be careful not to treat the ResultSet as if it were a Tag. Try to understand why the following doesn't make much sense (never mind that it results in an error):

tags.attrs['href']
# AttributeError: 'ResultSet' object has no attribute 'attrs'

HTML attributes exist at a per-tag level. What would you expect attrs to return for a collection of tags? The designers of BeautifulSoup can't know either, hence the error message.

If what you want is the href value for each of the tags, then you have to do it the old-fashioned way with a for-loop:

hrefs = []
for t in tags:
    hrefs.append(t.attrs['href'])
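
Or, as a one-line list comprehension that does the same thing:

hrefs = [t.attrs['href'] for t in tags]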

Finding nested elements

What happens when there is more than one "group" of link tags that we want? In the snippet below, the <a> tags we care about are nested within <h1> tags:

evenmoretxt = """
<h1><a href="http://www.a.com">Awesome</a></h1>
<h1><a href="http://www.b.com">Really Awesome</a></h1>

<div><a href="http://na.com">Ignore me</a></div>
<div><a href="http://127.0.0.1">Ignore me again</a></div>
"""

soup = BeautifulSoup(evenmoretxt, 'lxml')

First, we can collect all of the <h1> tags using find_all():

heds = soup.find_all('h1')

Each of the members of heds is a Tag object, and each Tag object has a find() method, which we can use to select just the nested <a> tag:

links = []
for h in heds:
    a = h.find('a')
    links.append(a)

Or, more concisely:

links = []
for h in heds:
    links.append(h.find('a'))
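
As an aside, Beautiful Soup also supports CSS selectors through its select() method, which can express this kind of nesting in a single call. Here's a sketch of the same extraction:

links = soup.select('h1 a')
# [<a href="http://www.a.com">Awesome</a>,
#  <a href="http://www.b.com">Really Awesome</a>]

The selector 'h1 a' means: every <a> tag that is a descendant of an <h1> tag.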

Real world example.com

Parsing our own hand-constructed HTML is not much fun. So let's get a "real" HTML document from the web.

This part should be familiar:

import requests
resp = requests.get('http://www.example.com')
txt = resp.text

Whether txt contains a hand-constructed string or something that came from the Web doesn't matter when we're working with Beautiful Soup; we only care about converting a string into a BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'lxml')

Look at the webpage at http://www.example.com/. Inspect its source. Then see if you can write the Python code that extracts:

  1. The number of <p> tags.
  2. The text in the first <p> tag
  3. The length of the text of the first <h1> tag
  4. The text of the first (and only) <a> tag
  5. The href of the first <a> tag

My answers are below:

# 1. The number of `<p>` tags.
len(soup.find_all('p'))
# 2. The text in the first `<p>` tag
soup.find_all('p')[0].text
# 3. The length of the text of the first `<h1>` tag
len(soup.find('h1').text)
# 4. The text of the first `<a>` tag
soup.find('a').text
# 5. The href of the first `<a>` tag
soup.find('a').attrs['href']

Extracting individual press briefing URLs from the White House press briefings list

Now see if you can extract each press briefing URL from this sample White House press briefings page:

http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html

[Image: screenshot of the White House press briefings landing page (wh-press-briefings-landing.page.jpg)]

Examining the source HTML behind each press briefing link

Let's look at that first URL.

Its text is:

Press Briefing by Press Secretary Jay Carney, 12/6/2013

Its href value is:

https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013

If you inspect the source and search for the specific tag, you'll find this HTML:

  <div class="views-field views-field-title">        <h3 class="field-content"><a href="https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013">Press Briefing by Press Secretary Jay Carney, 12/6/2013</a></h3>  </div>  </div>

For this page, a link is more than just an <a> tag; it's nested within several other tags. Here's a pretty-formatted version of that one link and its parent tags:

<div class="views-field views-field-title">
  <h3 class="field-content">
      <a href="https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013">
          Press Briefing by Press Secretary Jay Carney, 12/6/2013
      </a>
  </h3>
</div>

Processing the press briefings page as soup

Let's turn this convoluted HTML into soup. See if you can remember the steps for downloading the webpage and converting it to a soup object well enough to type them from memory:

import requests
from bs4 import BeautifulSoup
url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

There are 10 press briefings per page, but it should be evident that there are more than 10 link tags. That's easy enough to find out:

len(soup.find_all('a'))
# 263

So how do we get just the URLs for the actual press briefings? From the HTML that we inspected previously, we want <a> tags that are nested within <h3> tags.

So let's find and count the number of h3 tags:

len(soup.find_all('h3'))
# 10

Hey, what a coincidence – there are exactly as many h3 tags as links to press briefings. This is just a lucky result of how the White House webdevs decided to build this page.

Here's one way to extract all the URLs of the nested link tags into a list:

urls = []
for h in soup.find_all('h3'):
    a = h.find('a')
    urls.append(a.attrs['href'])

Here's a more concise – albeit harder to read – version:

urls = []
for h in soup.find_all('h3'):
    urls.append(h.find('a').attrs['href'])

Either way, this is the contents of urls:

['https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013',
 'https://www.whitehouse.gov/the-press-office/2013/12/05/daily-briefing-press-secretary-1252013',
 'https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-',
 'https://www.whitehouse.gov/the-press-office/2013/12/04/press-briefing-press-secretary-1232013',
 'https://www.whitehouse.gov/the-press-office/2013/12/02/press-briefing-press-secretary-jay-carney-1222013',
 'https://www.whitehouse.gov/the-press-office/2013/11/26/press-gaggle-principal-deputy-press-secretary-josh-earnest-los-angeles-c',
 'https://www.whitehouse.gov/the-press-office/2013/11/25/press-gaggle-principal-deputy-press-secretary-josh-earnest-aboard-air-fo',
 'https://www.whitehouse.gov/the-press-office/2013/11/22/daily-briefing-press-secretary-112213',
 'https://www.whitehouse.gov/the-press-office/2013/11/21/briefing-principal-deputy-press-secretary-josh-earnest-112113',
 'https://www.whitehouse.gov/the-press-office/2013/11/20/press-briefing-press-secretary-jay-carney-11192013']
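
As with the nested-elements example earlier, the same extraction can also be sketched with select() and a list comprehension; the result should be identical to the loops above:

urls = [a.attrs['href'] for a in soup.select('h3 a')]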

All together

To extract the URLs from the canned sample webpage, here's all the code:

import requests
from bs4 import BeautifulSoup
url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

urls = []
for h in soup.find_all('h3'):
    a = h.find('a')
    urls.append(a.attrs['href'])

Now all we have to do is repeat this for every page of press briefings listings…
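
As a rough sketch of that next step, and assuming the other sample pages follow the same page-N.html naming pattern as the URL above (an assumption based on this one sample URL, not something verified here), the loop might look like:

import requests
from bs4 import BeautifulSoup

base = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-%d.html'
all_urls = []
for pagenum in range(1, 51):  # hypothetical page range
    resp = requests.get(base % pagenum)
    soup = BeautifulSoup(resp.text, 'lxml')
    for h in soup.find_all('h3'):
        all_urls.append(h.find('a').attrs['href'])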
