It was easy to come up with filenames for each of the index page URLs – the number of each page could serve as the "unique id":
https://www.whitehouse.gov/briefing-room/weekly-address?page=1
index-pages/1.html
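Just to make that mapping concrete, here is a minimal sketch (page_num and local_fname are hypothetical names, not part of the scraping script):

page_num = 1
url = 'https://www.whitehouse.gov/briefing-room/weekly-address?page={}'.format(page_num)
local_fname = 'index-pages/{}.html'.format(page_num)
# index-pages/1.html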
But what about each individual press briefing URL? Not only are there 1,600+ of them, but the directory structure used by whitehouse.gov has changed a few times – which is not unusual over a span of 8 or so years.
All the press briefing URLs start out like this:
https://www.whitehouse.gov/the-press-office
So let's examine the latter parts of their paths. From what I can tell, the White House website has had at least 2 changes in how it structured its URLs. Early URLs were all just stuffed into the /the-press-office path, and some of them came awfully close to being the same path.
At some point, the presidential web developers realized that they were going to have some name collisions, so URLs are now prefaced with YYYY/MM/DD paths – which is really nice for when we need to extract the date of a given URL.
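For example, here's a minimal sketch of pulling the date out of one of these URLs with a regular expression – the example URL is reconstructed from the file path shown just below:

import re
url = 'https://www.whitehouse.gov/the-press-office/2016/04/02/weekly-address-securing-world-nuclear-terrorism'
match = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', url)
if match:
    year, month, day = match.groups()
    # ('2016', '04', '02')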
However, it makes things complicated when saving to disk. Ideally, we want to preserve the directory structure in the URL when saving it to our briefs/
subdirectory, e.g.
briefs/2016/04/02/weekly-address-securing-world-nuclear-terrorism.html
But your operating system will not let you save a file whose name refers to yet-uncreated subdirectories. To save a file at the given filename above, we have to first create all of the intermediary subdirectories, e.g. briefs/2016/04/02.
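If we did want to preserve that directory structure, os.makedirs() with exist_ok=True creates all of the intermediate directories in one call – a minimal sketch:

from os import makedirs
# creates briefs/, briefs/2016/, briefs/2016/04/, and briefs/2016/04/02/ as needed
makedirs('briefs/2016/04/02', exist_ok=True)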
For now, the easiest way to create a filename from the relative URLs is to simply replace all of the slash characters, /, with hyphens (or any character of your choice).
Here it is in isolation:
"/the/path/to/fun/".replace('/', '-')
# '-the-path-to-fun-'
I like throwing in the strip() method, which takes an optional string argument and removes those characters from the beginning and end of a string:
"/the/path/to/fun/".replace('/', '-').strip('-')
# 'the-path-to-fun'
This works, too:
"/the/path/to/fun/".strip('/').replace('/', '-')
# 'the-path-to-fun'
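Applied to one of the actual relative hrefs – this one reconstructed from the date-prefixed example above – the result looks like this:

href = '/the-press-office/2016/04/02/weekly-address-securing-world-nuclear-terrorism'
href.replace('/', '-').strip('-')
# 'the-press-office-2016-04-02-weekly-address-securing-world-nuclear-terrorism'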
The following code simply prints the extracted URLs and the filenames that will be created. Run it to make sure it prints sensible URLs and file paths:
from glob import glob
from bs4 import BeautifulSoup
from os.path import join
from urllib.parse import urljoin

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'

# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))
# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
    soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)
Note that adding the .html to the end of the saved filename isn't necessary…

fn = href.replace('/', '-').strip('-') + '.html'

…but we do it just so that it's easier to recognize the file as HTML on our own file system.
You should get output that looks like this:
Downloading from...
https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7262010
Saving to...
briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7262010.html
Downloading from...
https://www.whitehouse.gov/the-press-office/press-briefing-press-secretary-robert-gibbs-7222010
Saving to...
briefs/the-press-office-press-briefing-press-secretary-robert-gibbs-7222010.html
Now construct the script so that it does the following:

- creates the briefs subdirectory if it doesn't already exist
- uses requests to actually fetch the individual press briefing webpages
- saves each fetched page to the briefs subdirectory
This should just take a few more lines of code. Here's my answer:
from bs4 import BeautifulSoup
from glob import glob
from os import makedirs
from os.path import join
from urllib.parse import urljoin
import requests

WH_PB_HOME_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings'
INDEX_PAGES_DIR = 'index-pages'
BRIEFS_DIR = 'briefs'
makedirs(BRIEFS_DIR, exist_ok=True)

# Gather up all the index pages
html_filenames = glob(join(INDEX_PAGES_DIR, '*.html'))
# for each index page
for hname in html_filenames:
    # open and parse it as HTML
    with open(hname, 'r') as rf:
        txt = rf.read()
    soup = BeautifulSoup(txt, 'lxml')
    # now select for <a> tags nested within <h3> tags
    for hed in soup.find_all('h3'):
        href = hed.find('a').attrs['href']
        url = urljoin(WH_PB_HOME_URL, href)
        print("Downloading from...\n", url)
        resp = requests.get(url)
        # create a filename from the relative href path
        fn = href.replace('/', '-').strip('-') + '.html'
        full_fname = join(BRIEFS_DIR, fn)
        print("Saving to...\n", full_fname)
        with open(full_fname, 'w') as wf:
            wf.write(resp.text)
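A small design note: resp.text is the response body decoded to a string (requests guesses the encoding from the response headers), which is what we want when writing HTML out with open(full_fname, 'w'); if we were saving binary files such as images, we'd use resp.content and open the file in 'wb' mode instead.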