Extracting absolute URLs from White House press briefings listings

Before we can download each press briefing, we need to extract their URLs from each of the downloaded index pages.

Extracting URLs from the downloaded index pages

Let's assume you've downloaded all the White House press briefings from the first exercise, which means you have an index-pages directory with files 0.html through 162.html.

Let's practice extracting the press briefing URLs from just one of those files.

First, we open up the saved webpage and read it as text – note that we don't use requests yet because we're not downloading the page:

from os.path import join
from bs4 import BeautifulSoup
INDEX_PAGES_DIR = 'index-pages'
# get the text
some_filename = join(INDEX_PAGES_DIR, '150.html')
with open(some_filename, 'r') as rf:
    txt = rf.read()
# parse the HTML
soup = BeautifulSoup(txt, 'lxml')
# extract the URLs
for h in soup.find_all('h3'):
    a = h.find('a')
    print(a.attrs['href'])

You should get output that looks like this:

/the-press-office/briefing-presidents-upcoming-trip-saudi-arabia-egypt-germany-and-france
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-29-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-28-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5272009
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-26-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-22-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-21-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-20-09
/the-press-office/briefing-white-house-press-secretary-robert-gibbs-51909
/the-press-office/briefing-secretary-state-hillary-clinton-humanitarian-aid-pakistan

What gives? None of those look like actual URLs, just partial URLs.

To get the absolute URL, we have to prepend the White House website domain, e.g.:

'https://www.whitehouse.gov' + '/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-29-09'
# 'https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-29-09'

That seems inconvenient…

Relative vs. absolute URLs

In the previous lesson, we used a canned webpage:

http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html

Because of the way I saved this particular webpage via the browser, its URLs are preserved as absolute:

https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013

On the actual White House press releases site:

https://www.whitehouse.gov/briefing-room/press-briefings

– the URLs are relative:

/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013

This is a standard convention – you can read about it in the HTML spec – and a web browser will assume that the base of a relative URL (in this case, https://www.whitehouse.gov) is the same as that of the page on which the relative URL is found.

However, we're not using a web browser. Our Python script has no sense of where a webpage came from – it's just reading a text file and parsing the HTML.

So we need to explicitly provide the base URL, and then join it to the relative URL.

Using urllib.parse.urljoin()

Your first instinct may be to concatenate the base part of the URL and its relative URLs manually, as you would two ordinary strings:

BASE_URL = 'https://www.whitehouse.gov'
url = BASE_URL + '/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013'

However, just as with file system paths – which is why we use os.path.join() instead of manual concatenation – gluing strings together by hand can lead to subtle errors, such as a doubled slash (spot the extra forward-slash below):

BASE_URL = 'https://www.whitehouse.gov/'
url = BASE_URL + '/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013'
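# 'https://www.whitehouse.gov//the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013' (note the double slash)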

And so, this is a task that we delegate to one of Python's standard libraries: urllib.parse.urljoin:

from urllib.parse import urljoin
BASE_URL = 'https://www.whitehouse.gov/'
url = urljoin(BASE_URL, '/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013')
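# 'https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013'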

It's not evident in the above example, but one convenience of urljoin() is that it will properly resolve the relative path whether the base is just the site's domain or the absolute URL of the webpage that the relative URL came from.

In other words, the two urljoin() calls below produce the same URL:

from urllib.parse import urljoin
URL_A = 'http://www.example.com'
URL_B = 'http://www.example.com/some/fun/page.html'
urljoin(URL_A, '/helpme')
# http://www.example.com/helpme
urljoin(URL_B, '/helpme')
# http://www.example.com/helpme

The importance of this becomes evident when your automated script needs to resolve URLs but you don't know, at the time you write the program, exactly which website it will be visiting – e.g. a script that extracts URLs from multiple websites.
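For example, here's a minimal sketch of such a helper (resolve_hrefs() is a hypothetical name, and the page URL and HTML snippet are made up) that works no matter which site a page came from, because urljoin() derives the base from the page's own URL:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def resolve_hrefs(page_url, html_text):
    # parse the HTML, then resolve every <a href> against the page's own URL
    soup = BeautifulSoup(html_text, 'lxml')
    return [urljoin(page_url, a.attrs['href'])
            for a in soup.find_all('a') if 'href' in a.attrs]

resolve_hrefs('http://www.example.com/news/index.html', '<a href="story.html">A story</a>')
# ['http://www.example.com/news/story.html']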

Extracting and resolving absolute URLs

Going back to the beginning, and applying what we now know:

from bs4 import BeautifulSoup
from os.path import join
from urllib.parse import urljoin
WH_BASE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings/'
INDEX_PAGES_DIR = 'index-pages'
# get the text
some_filename = join(INDEX_PAGES_DIR, '150.html')
with open(some_filename, 'r') as rf:
    txt = rf.read()
# parse the HTML 
soup = BeautifulSoup(txt, 'lxml')
# extract the URLs
for h in soup.find_all('h3'):
    a = h.find('a')
    url = urljoin(WH_BASE_URL, a.attrs['href'])
    print(url)

And the result:

https://www.whitehouse.gov/the-press-office/briefing-presidents-upcoming-trip-saudi-arabia-egypt-germany-and-france
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-29-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-28-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5272009
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-26-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-22-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-21-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-5-20-09
https://www.whitehouse.gov/the-press-office/briefing-white-house-press-secretary-robert-gibbs-51909
https://www.whitehouse.gov/the-press-office/briefing-secretary-state-hillary-clinton-humanitarian-aid-pakistan

Use glob() to get a list of files

OK, now that we know how to pull the URLs out of an individual page, let's repeat the process across all of the downloaded pages.

All of the files in the index-pages directory have an extension of .html. You'd think there wouldn't be any other kinds of files in that directory, but sometimes your operating system will add hidden metadata files. So, to be safe, we don't want all of the files in index-pages, just the ones that have an extension of .html.

There are several ways to get a list of filenames that match a wildcard pattern; I like using glob(), which is part of the glob module. Here's how to use it, in excruciating step-by-step detail:

from glob import glob
from os.path import join
INDEX_PAGES_DIR = 'index-pages'
gp = join(INDEX_PAGES_DIR, '*.html')
index_pages_filenames = glob(gp)
type(index_pages_filenames)
# list
len(index_pages_filenames)
# 162
index_pages_filenames[42]
# 'index-pages/136.html'

Slightly more condensed:

from glob import glob
from os.path import join
INDEX_PAGES_DIR = 'index-pages'
ip_fnames = glob(join(INDEX_PAGES_DIR, '*.html'))
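Note that glob() doesn't return the filenames in any guaranteed order. That doesn't matter for this exercise, since we just need to visit every file, but if you ever want the pages in numeric order, here's a minimal sketch – it assumes, as in this exercise, that each filename is just a page number followed by .html:

from glob import glob
from os.path import basename, join, splitext
INDEX_PAGES_DIR = 'index-pages'
ip_fnames = glob(join(INDEX_PAGES_DIR, '*.html'))
# sort by the numeric part of each filename, e.g. 'index-pages/150.html' -> 150
ip_fnames = sorted(ip_fnames, key=lambda fn: int(splitext(basename(fn))[0]))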

All together

To loop through all of the downloaded pages in index-pages and to extract their absolute URLs, here's the code:

from bs4 import BeautifulSoup
from glob import glob
from os.path import join
from urllib.parse import urljoin
INDEX_PAGES_DIR = 'index-pages'
WH_BASE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings/'
ip_fnames = glob(join(INDEX_PAGES_DIR, '*.html'))
for fname in ip_fnames:
    # get the text
    with open(fname, 'r') as rf:
        txt = rf.read()
    # parse the HTML 
    soup = BeautifulSoup(txt, 'lxml')
    # extract the URLs
    for h in soup.find_all('h3'):
        a = h.find('a')
        url = urljoin(WH_BASE_URL, a.attrs['href'])
        print(url)

After executing this script, you should simply see a list of URLs – around 1,600 or more. In the next lesson, we'll put together everything we've learned so far and finally download every press briefing page as its own file.
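If you'd rather gather the URLs into a list (say, to count them, or to hand them to the downloading code in the next lesson) instead of just printing them, here's a minimal variation on the loop above; the briefing_urls name is just for illustration:

from bs4 import BeautifulSoup
from glob import glob
from os.path import join
from urllib.parse import urljoin
INDEX_PAGES_DIR = 'index-pages'
WH_BASE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings/'
briefing_urls = []
for fname in glob(join(INDEX_PAGES_DIR, '*.html')):
    # read and parse each saved index page
    with open(fname, 'r') as rf:
        soup = BeautifulSoup(rf.read(), 'lxml')
    # collect the absolute URL of each listed briefing
    for h in soup.find_all('h3'):
        a = h.find('a')
        briefing_urls.append(urljoin(WH_BASE_URL, a.attrs['href']))
print(len(briefing_urls))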
