The first step in downloading and parsing White House press briefings is to download and parse the list of press briefings.
This series of lessons is meant to give an overview of all the parts that go into the task that is commonly referred to as “web scraping”, including understanding how URLs work and how HTML parsing is different from the text parsing we’ve done so far.
As in previous lessons, we start out as if we knew nothing more than how to deal with plain text, which works just fine when it comes to URLs themselves. The batch download of 100+ webpages is pretty simple, so we’ll also practice some best practices in Python programming and data organization.
This is the first part of a series of lessons for doing web-scraping and a little text-mining. Our dataset here is all the press releases currently hosted on the White House's website:
https://www.whitehouse.gov/briefing-room/press-briefings
In this lesson, we won't be downloading each press release page, yet. Instead, we will be downloading and saving each of the index pages that lists the press releases. That allows us in the next lessons to try out text/HTML parsing without hitting up the White House webserver every time.
The landing page for the White House press briefings can be found here:
https://www.whitehouse.gov/briefing-room/press-briefings
The page features the annoyance of JavaScript-powered-infinite scroll, in which the only way to get to the next page of press briefings is to keep scrolling down, down, down the page.
Luckily, the website does have affordances for old-fashioned navigation, which I only remember from last year when we scraped the site.
To get to the 50th page of the press briefings, just take the landing page URL and append page=50
as its query string:
https://www.whitehouse.gov/briefing-room/press-briefings?page=50
You can read more about query strings on Wikipedia, but basically, they're the key-value pairs (e.g. page=50
) that are separated from the URL by a question mark. You've seen them in YouTube URLs:
https://www.youtube.com/watch?v=LyONt_ZH_aw&t=4s
The key-value pairs in that YouTube URL are:
v=LyONt_ZH_aw
- the v
key specifies the YouTube video ID to play.t=4s
- the t
key specifies the number of seconds to start playing from.Query strings are a way for a website to add more options to how a URL is interpreted. In the case of the White House press briefings, the page
attribute is used to specify which "page" in the back log of entries to show. As users of the websites, we can jump from page to page just by modifying that part of the URL.
Later in this guide, we'll cover how to use Python to handle URLs and query strings more generally – for now, it's enough to know that a URL query string refers to any text that is on the right side of a delimiting question mark:
https://www.example.com?hello=42&world=apples
Before we consider the "correct" way to do this, let's just do it the most obvious way. Open up your browser and assign large numbers to the query string's page
parameter. I like to start off with just a medium value, such as 100
, just to make sure things are working as expected:
https://www.whitehouse.gov/briefing-room/press-briefings?page=100
As of April 2016, the dates of the briefings on page 100 of President Obama's press briefings page are in the September 2011 range. So page 100 is more than halfway there.
I like to pick an extreme number – e.g. page=1000
just to see how the site behaves. Does it return a 404 error? Does it return any kind of warning or error to the user, such as, "Sorry, you've reached the end of our content"? Or is there no error, just a blank page of entries? The most annoying situation is when you get to the purported "end" of a website's content and it keeps repeatedly showing you the very last page because it doesn't know what to do when you've gone beyond its page limits – it's difficult to design an automated web crawler that won't just keep on going and going and going…
Let's see how the White House press briefings reacts to a request for page 1000:
https://www.whitehouse.gov/briefing-room/press-briefings?page=1000
The site returns a normal webpage, except with nothing in its content area:
It doesn't take long to eventually find the last page of entries – for me, in April 2016, it was page 161. In another month or so, that last page number will increase. So in the long run, it's best to code a solution that can detect when it's reached the final page of entries. But for now, let's just plan on hardcoding a loop that increments from page 0
to 161
.
This part should be old hat. Import the Requests library and do a get()
request for one of the pages:
import requests
url = 'https://www.whitehouse.gov/briefing-room/press-briefings?page=0'
resp = requests.get(url)
Then pick a place (i.e think of a filename) to save the text
contents of the webserver response to disk. Given that this is page number 0
, it makes sense to save it as a file named 0.html:
with open("0.html", "w") as wf:
wf.write(resp.text)
If you were in a hurry and had to get all those lists of press briefings right now, you could be forgiven for doing this:
import requests
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page='
MAX_PAGE_NUM = 162
for pagenum in range(0, MAX_PAGE_NUM):
url = THE_URL + str(pagenum)
print("Downloading", url)
resp = requests.get(url)
fname = str(pagenum) + '.html'
print("Saving to", fname)
with open(fname, "w") as wf:
wf.write(resp.text)
The White House site is pretty fast, so you'd see your script's output fly pretty quickly:
Downloading https://www.whitehouse.gov/briefing-room/press-briefings?page=0
Saving to 0.html
Downloading https://www.whitehouse.gov/briefing-room/press-briefings?page=1
Saving to 1.html
Downloading https://www.whitehouse.gov/briefing-room/press-briefings?page=2
Saving to 2.html
...
Downloading https://www.whitehouse.gov/briefing-room/press-briefings?page=160
Saving to 160.html
Downloading https://www.whitehouse.gov/briefing-room/press-briefings?page=161
Saving to 161.html
Actually, this particular task is straightforward enough that there's not much more to do here. However, since this web-crawl job is so easy, it's a good chance to slow down and practice more advanced techniques and good habits that are necessary for more difficult tasks.
Our bigger job is to not just to collect the pages that list press briefing headlines – we want to collect each press briefing page itself, and then analyze the text of each press briefing. That last task becomes fairly annoying if we make no effort to keep the press briefing list pages from the individual press briefing pages, nevermind the problem of keeping our Python script files separate from the thousands of webpages we'll be downloading.
The rest of this guide will touch on patterns and techniques that will, to be honest, seem like programming for programming's sake, if not clinically anal-retentive. But they'll make more sense as the programming work gets more complicated, so it's best to start practicing them for small tasks.
So instead of just downloading the webpages right into the local directory, let's create a subdirectory named index-pages, and then adjust our script to save the files there.
The least-effort approach would simply be to change the line where the fname
variable is assigned:
# ...
resp = requests.get(url)
fname = "index-pages/" + str(pagenum) + '.html'
print("Saving to", fname)
#...
This little fix ends up creating several problems, the most prominent being that the script will fail unless a directory named index-pages
already exists…which it probably doesn't at this point.
makedirs()
to create a directoryIf you're in a hurry, you might create the index-pages
subdirectory using your operating system and a combination of pointing-and-clicking. But let's try it the programmatic way – it will be a little inconvenient at first, but it's a habit that will pay off very quickly. Typos in file path names are the cause of a huge number of bugs, and they become much more frequent (and harder to track down) if you create path names outside of your programming environment.
Python's os module provides a function named makedirs()
We provide it with a (string) name of the directory we hope to create and it will do it:
from os import makedirs
makedirs('index-pages')
One of the problems with the makedirs()
function is that it will throw an error if a directory or file already exists at the specified path:
# for the second time
makedirs('index-pages')
Here's what it looks like in my interactive recording:
However, by passing in a second named argument, exist_ok=True
, we can prevent the makedirs()
function from barking at us just because we didn't know whether a directory already existed:
makedirs('index-pages', exist_ok=True)
It may seem strange that we write extra code to prevent an error message instead of just deleting the makedirs()
call so that it doesn't run again. But in production environments, it's not a good idea to add/remove setup code – including the code that sets up subdirectories – on a manual basis.
The web-crawling code we've written may seem simple enough to just paste into ipython and run as a one-off…but, that's because it's a small job. For most real, complex tasks, the web-crawling/scraping code will be saved in a Python script. And it may have to be run again, over and over again. It's not fun to have to go in and remove the makedirs()
calls by hand – or to add it back in if we've decided to move our work to another directory or new computer entirely.
The exist_ok
argument makes it so that we can use a makedirs()
call and let it do its job – or to stay silent if the job is already done. When we're dealing with much bigger code complexity, that "silence is golden" mentality is nice when there's generally no problem with a subdirectory already existing.
You can try this other exercise that uses the makedirs() function idempotently. Otherwise, it's enough to remember its basic functionality and exist_ok
argument.
It's best to add the makedirs()
call early in the script, so that the index-pages
directory is created before anything else happens:
import requests
from os import makedirs
makedirs('index-pages', exist_ok=True)
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page='
MAX_PAGE_NUM = 162
for pagenum in range(0, MAX_PAGE_NUM):
url = THE_URL + str(pagenum)
# ...
fname = "index-pages/" + str(pagenum) + '.html'
index-pages
to a variableHow hard is it to remember that the string 'index-pages'
represents the subdirectory that we save downloaded pages into? Not hard at all, but only because our code is just a few lines.
Generally, when you have some kind of constant value – i.e. the name of a directory that you plan on referring to on several different occasions – it's best to assign it to a variable. And for variables on relatively constant values – are we really going to rename the index-pages
directory later on? (I hope not!) – the Python style guide recommends using all-caps letters and underscores for the variable name:
INDEX_PAGES_DIR = 'index-pages'
a = INDEX_PAGES_DIR + '/' + 'hello.html'
b = INDEX_PAGES_DIR + '/' + 'world.html'
The snippet above probably didn't seem particularly economical. After all, typing INDEX_PAGES_DIR
requires typing several more characters than index-pages
.
However, one main advantage of using variable names is that when you mistype a variable name, Python will immediately throw an error:
from os import makedirs
INDEX_PAGES_DIR = 'index-pages'
makedirs(INDEXPAGES_DIR, exist_ok=True)
NameError: name 'INDEXPAGES_DIR' is not defined
Getting an error message is usually unpleasant. But at least you were notified of a very fundamental error before the rest of your program did their thing.
Compare that to just trying to type 'index-pages'
manually:
from os import makedirs
makedirs('index_pages', exist_ok=True)
Do you see the typo? That makedirs()
call isn't going to throw an error because it assumes you know what you're doing, that you really want a subdirectory named index_pages
instead of index-pages
. But what if you don't? How far along will this subtle naming error go undetected? And how much are you going to regret not spending the two seconds it takes to assign a variable name, compared to the potential hours of debugging the kind of error that heavily depends on the acuity of your eyesight?
This use of code is going to seem really unnecessary.
Let's say we have a subdirectory named index-pages
and we want to save a file named 42.html
inside of it. What will that file's path name be? Well, if you're on a Unix-like system – including Mac OS X and Linux – that file path is simply:
index-pages/42.html
And to dynamically generate that path, just add the index-pages directory name, plus a forward-slash, plus the base filename together:
INDEX_PAGES_DIR = 'index-pages'
for i in range(30, 50):
bn = str(i) + '.html'
fname = INDEX_PAGES_DIR + '/' + bn
That seems straightforward. But what if you have a friend who is on a PC? And that PC uses Windows? And Windows uses its own version of a file path delimiter – which is the backwards-slash, \
? Now your simple script could end up in a major headache for anyone running it on a different platform.
So let's import the os.path submodule and use its join() function to create file pathnames. Try it out in ipython – make sure you understand that all it does it takes in arguments and returns a string.
from os.path import join
join('index-pages', '42.html')
# 'index-pages/42.html'
join("this", "is", "a", "really", "nested", "file.txt")
# 'this/is/a/really/nested/file.txt'
It doesn't seem like much – in fact, it seems not much different than the Python string object's own join() method…but the key difference is that it knows how to formulate the appropriate path string based on the operating system from which your program is being executed.
But even if you plan to never, ever write code that will be executed by a Windows machine, in day-to-day programming, it is not uncommon to make this kind of typo:
a = 'somedirectory/'
b = '/myfile.txt'
fullpath = a + b
– so, let os.path.join()
do that work for us:
import requests
from os import makedirs
from os.path import join
makedirs('index-pages', exist_ok=True)
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page='
MAX_PAGE_NUM = 162
for pagenum in range(0, MAX_PAGE_NUM):
url = THE_URL + str(pagenum)
# ...
fname = join("index-pages", str(pagenum) + '.html')
In many of my early examples, when we need to join a literal string pattern – such as:
"http://www.example.com?person="
– to a set of values, such as:
names = ["larry", "curly", "moe"]
– I'll sometimes show the most "common sense" way of accomplishing the task:
names = ["larry", "curly", "moe"]
for n in names:
url = "http://www.example.com?person=" + n
However, as you've started to surmise by now, "common sense" does not always scale.
Python has a confusing number of ways to format strings…I'm not even sure which PEP I'm supposed to link to (this one?).
For now, I advise using the string's format() method. You can read about the syntax and examples in the official documentation…though I find those docs to be confusing. However, the site pyformat.info – yes, Python string formatting is so ubiquitous and so confusing that someone's made a specialty site just for that topic – is a pretty readable reference.
For our purposes, instead of typecasting the loop integer (i.e. pagenum
) to a string and then adding '.html'
:
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page='
# ...
for pagenum in range(0, MAX_PAGENUM):
url = THE_URL + str(pagenum)
fname = join("index-pages", str(pagenum) + '.html')
– try this instead:
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page={}'
# ...
for pagenum in range(0, MAX_PAGENUM):
url = THE_URL.format(pagenum)
fname = join("index-pages", '{}.html'.format(pagenum))
Again, it doesn't save us much in terms of verbosity or readability of code. But in more complex scenarios, it will prove to be much more maintainable.
Using string formatting to generate the URL https://www.whitehouse.gov/briefing-room/press-briefings?page=42 is actually not recommended. Yes, all URLs are just strings. But the structure of a URL follows a specification, which means that there are URL-parsing/generating libraries that we can use.
In particular, the part of a URL that is prone to be affected by human error is the query string. The White House press briefings URLs have a very simple query string:
page=42
But query strings can get very complicated. Consider Google's Street View API, which uses URL query strings as a way to provide Street View imagery via URL.
At a minimum, the Street View API's query string requires a size
parameter to specify the width and height of imagery:
https://maps.googleapis.com/maps/api/streetview?size=600x300
Because the default location is in the middle of an ocean, however, the API returns a blank image:
img src="https://maps.googleapis.com/maps/api/streetview?size=600x300" alt='street view google api'>
So we can provide a location
parameter; here's what Stanford,CA
looks like:
https://maps.googleapis.com/maps/api/streetview?size=600x300&location=Stanford,CA
And if we want to turn around, we can supply a value – 30
(degrees) – for the heading
parameter:
https://maps.googleapis.com/maps/api/streetview?size=600x300&location=Stanford,CA&heading=30
So let's try to abstract the URL pattern using the string's format()
method with named arguments:
URL_ENDPOINT = 'https://maps.googleapis.com/maps/api/streetview'
QUERY_STRING = 'size={size}&location={location}&heading={heading}'
q_str = QUERY_STRING.format(size='400x400', location='New York', heading=0)
url = URL_ENDPOINT + '?' + q_str
print(url)
# https://maps.googleapis.com/maps/api/streetview?size=400x400&location=New York&heading=0
This is not an illogical approach…though if you know a bit about the URL spec, you know that there are a lot of caveats involved – for example, URLs aren't supposed to have space characters in them, so officially, New York
, won't fly – though most modern browsers know how to deal with that.
The bigger issue is, besides the overall tediousness of writing out the format()
call, is properly delimiting (using &
) the key/value pairs. Can you see the typos in the example below?
URL_ENDPOINT = 'https://maps.googleapis.com/maps/api/streetview'
QUERY_STRING = '?size={size}location={location}&heading={heading}'
q_str = QUERY_STRING.format(size='400x400', location='Chicago', heading="")
url = URL_ENDPOINT + '?' + q_str
print(url)
The Requests library earns its subtitle: HTML for Humans. When we want to get()
a URL that requires a query string, we can supply a dictionary of key-value pairs to the params
argument.
In the case of Google Street View:
import requests
URL_ENDPOINT = 'https://maps.googleapis.com/maps/api/streetview'
resp = requests.get(URL_ENDPOINT, params={'size': '600x300', 'location': 'Stanford, CA'})
print(resp.url)
# https://maps.googleapis.com/maps/api/streetview?location=Stanford%2C+CA&size=600x300
Note several of its conveniences:
+
and %2C
, respectively.&
?
delimiter between URL and query stringThis is such a useful feature of the Requests library that you should use it whenever possible, even for simple query strings as in the case of the White House press briefings pages:
import requests
URL_ENDPOINT = 'https://www.whitehouse.gov/briefing-room/press-briefings'
# ...
# ...
for pagenum in range(0, MAX_PAGENUM):
resp = requests.get(URL_ENDPOINT, params={'page': pagenum})
print("Downloaded", resp.url)
Note: If we use requests.get()
in the most straightforward way, we don't actually get a reference to the URL it formulates until the get()
action is completed and we have a Response object, i.e. resp
and resp.url
.
That's a minor inconvenience…it's nice to know the URL before the remote request is made in some debugging situations. But it's not a big deal here. If you want more details on ways to generate URL query strings with Python, check out this lesson.
This isn't a terrible approach, but all the little shortcuts, especially not creating a separate directory to save all the downloaded files, will bite us in the near term:
import requests
THE_URL = 'https://www.whitehouse.gov/briefing-room/press-briefings?page='
MAX_PAGE_NUM = 162
for pagenum in range(0, MAX_PAGE_NUM):
url = THE_URL + str(pagenum)
print("Downloading", url)
resp = requests.get(url)
fname = str(pagenum) + '.html'
print("Saving to", fname)
with open(fname, "w") as wf:
wf.write(resp.text)
The script below does virtually the same thing as before, but it creates a subfolder to save the files, and uses a lot of Python functions to keep the programmer (i.e. us) from having to be extra careful with typos.
In practice, you'll find yourself having a mix of emotions between "just-get-er-done" and "oh-wait-will-this-hurt-me?" – the snippet below isn't always best practice. But it's a good idea to at least be mindful of the difference between "It works!" and "It could be better":
import requests
from os import makedirs
from os.path import join
URL_ENDPOINT = 'https://www.whitehouse.gov/briefing-room/press-briefings'
MAX_PAGE_NUM = 162
INDEX_PAGES_DIR = 'index-pages'
makedirs(INDEX_PAGES_DIR, exist_ok=True)
for pagenum in range(0, MAX_PAGE_NUM):
resp = requests.get(URL_ENDPOINT, params={'page': pagenum})
print("Downloaded", resp.url)
fname = join(INDEX_PAGES_DIR, '{}.html'.format(pagenum))
print("Saving to", fname)
with open(fname, "w") as wf:
wf.write(resp.text)