It was easy to come up with filenames for each of the index page URLs – the number of each page could serve as the "unique id":
https://www.whitehouse.gov/briefing-room/weekly-address?page=1
index-pages/1.html
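Just to make that mapping concrete, here is a minimal sketch (page_num and local_fname are hypothetical names, not part of the scraping script):

page_num = 1
url = 'https://www.whitehouse.gov/briefing-room/weekly-address?page={}'.format(page_num)
local_fname = 'index-pages/{}.html'.format(page_num)
# index-pages/1.html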
But what about each individual press briefing URL? Not only are there 1,600+ of them, but the directory structure used by whitehouse.gov has changed a few times – which is not unusual over a span of 8 or so years.
All the press briefing URLs start out like this:
https://www.whitehouse.gov/the-press-office
So let's examine the latter parts of their paths. From what I can tell, the White House website has had at least 2 changes in how it structured its URLs. Early URLs were all just stuffed into the /the-press-office path, and some of them came awfully close to being the same path.
At some point, the presidential web developers realized that they were going to have some name collisions, so URLs are now prefaced with YYYY/MM/DD paths – which is really nice for when we need to extract the date of a given URL.
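For example, here's a minimal sketch of pulling the date out of one of these URLs with a regular expression – the example URL is reconstructed from the file path shown just below:

import re
url = 'https://www.whitehouse.gov/the-press-office/2016/04/02/weekly-address-securing-world-nuclear-terrorism'
match = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', url)
if match:
    year, month, day = match.groups()
    # ('2016', '04', '02')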
However, it makes things complicated when saving to disk. Ideally, we want to preserve the directory structure in the URL when saving it to our briefs/
subdirectory, e.g.
briefs/2016/04/02/weekly-address-securing-world-nuclear-terrorism.html
But your operating system will not let you save a file whose name refers to yet-uncreated subdirectories. To save a file at the given filename above, we have to first create all of the intermediary subdirectories, e.g. briefs/2016/04/02.
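If we did want to preserve that directory structure, os.makedirs() with exist_ok=True creates all of the intermediate directories in one call – a minimal sketch:

from os import makedirs
# creates briefs/, briefs/2016/, briefs/2016/04/, and briefs/2016/04/02/ as needed
makedirs('briefs/2016/04/02', exist_ok=True)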
For now, the easiest way to create a filename from the relative URLs is to simply replace all of the slash characters, /, with hyphens (or any character of your choice).
Here it is in isolation:
"/the/path/to/fun/".replace('/', '-')
# '-the-path-to-fun-'
I like throwing in the strip() method, which takes an optional string argument and removes those characters from the beginning and end of a string:
"/the/path/to/fun/".replace('/', '-').strip('-')
# 'the-path-to-fun'
This works, too:
"/the/path/to/fun/".strip('/').replace('/', '-')
# 'the-path-to-fun'
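Applied to one of the actual relative hrefs – this one reconstructed from the date-prefixed example above – the result looks like this:

href = '/the-press-office/2016/04/02/weekly-address-securing-world-nuclear-terrorism'
href.replace('/', '-').strip('-')
# 'the-press-office-2016-04-02-weekly-address-securing-world-nuclear-terrorism'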
The following code simply prints the extracted URLs and the filenames that will be created. Run it to make sure it prints sensible URLs and file paths:
from glob import glob
from bs4 import BeautifulSoup
from os.path import join
from urllib.parse import urljoin

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'

# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))
# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
    soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)
Note that adding the .html to the end of the saved filename isn't necessary…

fn = href.replace('/', '-').strip('-') + '.html'

…but we do it just so that it's easier to recognize the file as HTML on our own file system.
You should get output that looks like this:
Downloading from...
https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7262010
Saving to...
briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7262010.html
Downloading from...
https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7222010
Saving to...
briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7222010.html
Now construct the script so that it does the following:

- creates the briefs subdirectory if it doesn't already exist
- uses requests to actually fetch the individual press briefing webpages
- saves each fetched page to the briefs subdirectory
This should just take a few more lines of code. Here's my answer:
from bs4 import BeautifulSoup
from glob import glob
from os import makedirs
from os.path import join
from urllib.parse import urljoin
import requests

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'
makedirs(BRIEFS_DIR, exist_ok=True)

# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))
# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
    soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        resp = requests.get(url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)
        with open(full_fname, 'w') as wf:
            wf.write(resp.text)
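A small design note: resp.text is the response body decoded to a string (requests guesses the encoding from the response headers), which is what we want when writing HTML out with open(full_fname, 'w'); if we were saving binary files such as images, we'd use resp.content and open the file in 'wb' mode instead.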