Downloading and saving each and every White House press briefing

Basically, mass web-scraping and file-management, all in one exercise.

Making a "safe" filename

It was easy to come up with filenames for each of the index page URLs – the number of each page could serve as the "unique id".
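For instance, here's a minimal sketch of that naming convention, assuming the pages were saved by page number into the index-pages/ subdirectory that the code later in this article reads from:

from os.path import join

INDEX_PAGES_DIR = 'index-pages'
# e.g. the 5th index page gets saved as index-pages/5.html
page_num = 5
fname = join(INDEX_PAGES_DIR, str(page_num) + '.html')
# 'index-pages/5.html'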

But what about each individual press briefing URL? Not only are there 1,600+ of them, but the directory structure used by whitehouse.gov has changed a few times – which is not unusual over a span of 8 or so years.

All the press briefing URLs start out like this:

  https://www.whitehouse.gov/the-press-office

So let's examine the latter parts of their paths. From what I can tell, the White House website has had at least 2 changes in how it structured its URLs. Early URLs were all just stuffed into the /the-press-office path:
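  https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7262010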

And some of them get awfully close to being the same path:

At some point, the presidential web developers realized that they were going to have some name collisions, so URLs are now prefaced with YYYY/MM/DD paths – which is really nice when we need to extract the date of a given URL.
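For example, given the dated path used in the example below, the date segments can be pulled out with a regular expression. This is just a sketch, not something the final script needs:

import re

href = '/the-press-office/2016/04/02/weekly-address-securing-world-nuclear-terrorism'
# the YYYY/MM/DD segments are easy to match with a pattern
match = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', href)
if match:
    year, month, day = match.groups()
    print(year, month, day)
# 2016 04 02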

However, it makes things complicated when saving to disk. Ideally, we want to preserve the directory structure in the URL when saving it to our briefs/ subdirectory, e.g.

briefs/2016/04/02/weekly-address-securing-world-nuclear-terrorism.html

But your operating system will not let you save a file under a name that refers to subdirectories that don't yet exist. To save a file at the filename above, we first have to create all of the intermediate subdirectories, e.g. briefs/2016/04/02
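If we did want to preserve the directory structure, os.makedirs() can create all of the intermediate directories in one call. Here's a sketch of that approach, which this article otherwise sidesteps by flattening the path into a single filename:

from os import makedirs
from os.path import dirname

full_fname = 'briefs/2016/04/02/weekly-address-securing-world-nuclear-terrorism.html'
# create briefs/2016/04/02 (and any missing parents), if it doesn't exist yet
makedirs(dirname(full_fname), exist_ok=True)
with open(full_fname, 'w') as wf:
    wf.write('<html>...</html>')  # placeholder content for the sketch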

Replacing slashes with hyphens using replace()

For now, the easiest way to create a filename from the relative URLs is to simply replace all of the slash characters, /, with hyphens (or any character of your choice).

Here it is in isolation:

"/the/path/to/fun/".replace('/', '-')
# '-the-path-to-fun-'

I like throwing in the strip() method, which takes an optional string argument specifying the characters to remove from the beginning and end of a string:

"/the/path/to/fun/".replace('/', '-').strip('-')
# 'the-path-to-fun'

This works, too:

"/the/path/to/fun/".strip('/').replace('/', '-')
# 'the-path-to-fun'
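Applied to one of the actual briefing paths (the first one in the sample output below), and assuming the href comes through as a relative path, the transformation looks like this:

href = '/the-press-office/press-briefing-press-secretary-robert-gibbs-7262010'
fn = href.replace('/', '-').strip('-') + '.html'
# 'the-press-office-press-briefing-press-secretary-robert-gibbs-7262010.html'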

All together

Dry-run

The following code simply prints the extracted URLs and the filenames that will be created. Run it to make sure it prints sensible URLs and file paths:

from glob import glob
from bs4 import BeautifulSoup
from os.path import join
from urllib.parse import urljoin

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'
# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))

# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
        soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)

Note that adding the .html to the end of the saved filename isn't necessary…

    fn = href.replace('/', '-').strip('-') + '.html'

…but we do it just so that it's easier to recognize the file as HTML on our own file system.

You should get output that looks like this:

Downloading from...
 https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7262010
Saving to...
 briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7262010.html
Downloading from...
 https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7222010
Saving to...
 briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7222010.html

Actual downloading

Now construct the script so that it does the following:

  1. Actually creates the briefs subdirectory
  2. Uses requests to actually fetch the individual press briefing webpages
  3. Actually saves the downloaded webpage to the appropriate file in briefs

This should just take a few more lines of code. Here's my answer:

from bs4 import BeautifulSoup
from glob import glob
from os import makedirs
from os.path import join
from urllib.parse import urljoin
import requests

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'
makedirs(BRIEFS_DIR, exist_ok=True)
# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))

# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
        soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        resp = requests.get(url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)
        with open(full_fname, 'w') as wf:
            wf.write(resp.text)