A walkthrough of HTML scraping and regexes

Texas and HTML parsing and regexing and religion

Sample code

The following section contains scripts that all tackle the same homework assignment: Scrape and Count Webpages

They don't solve the exact requirements; they're meant to show different ways to think about the problem, or at least to get you more acquainted with the web-scraping libraries and techniques.

White House press briefings word counts

Some assorted scripts for downloading and extracting data from the WH press briefings can be found in the site repo:


You can read them if you want, but they contain snippets of other features, including drafts from when I was figuring out the easiest way to introduce the somewhat esoteric syntax of matplotlib.

Here's a script to count mentions of ISIS/ISIL and graph it with a minimal amount of matplotlib code:


Some of the inspiration for this is drawn from NPR's excellent time-series analysis, The Fleeting Obsessions Of The White House Press Corps.
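As a rough sketch of the counting-and-plotting idea (not the linked script), assuming the briefings have already been downloaded as plain-text files whose names sort chronologically, e.g. briefings/2017-01-11.txt; that layout is my own invention:

from glob import glob
import re
import matplotlib.pyplot as plt

counts = []
for fname in sorted(glob('briefings/*.txt')):
    with open(fname) as f:
        # count every mention of ISIS or ISIL in this briefing
        counts.append(len(re.findall(r'ISI[SL]', f.read())))

# the minimal-matplotlib part: one call to plot, one to show
plt.plot(counts)
plt.title('ISIS/ISIL mentions per press briefing')
plt.show()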

Religion-testing Texas in one big script

Here's the religion-filtering-Texas-death-row script on a Gist; I've tried to clean it up, though it may be abstracted to the point where it's hard to untangle how things work. I also decided to make it spit out colorful text:


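For what it's worth, the "colorful text" part doesn't require a special library; terminals recognize ANSI escape codes. A bare-bones sketch (the green-for-hit, red-for-miss scheme here is my own, not necessarily the Gist's):

# ANSI escape codes: the terminal interprets these as color on/off switches
GREEN = '\033[92m'
RED = '\033[91m'
RESET = '\033[0m'

def colorize(text, matched):
    # green for a hit, red for a miss
    return (GREEN if matched else RED) + text + RESET

print(colorize('heaven', True))
print(colorize('weather', False))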
On the site's GitHub repo:


I wrote a few scripts for the death row pages that are meant to be run independently of each other.

Fetch and re-organize

I had hoped to do a quick lesson on how to build a different/better death penalty site, but then got caught up in all the different ways there are to just download the Texas site.

Here's the root folder for scripts that work in tandem.

The hard part isn't downloading the files. It's organizing them on your end. My approach was to save individual inmate files by their Texas ID number. So, for example, given this URL:


Instead of saving it locally at:


I create a new filename based on Mr. Jacobs' Texas system ID of 872:


And, for that matter, I save his biographical information at:

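Here's a sketch of that renaming idea; the folder layout and the helper name are my own assumptions, and the repo's actual code is more involved:

# download an inmate page and save it under an ID-based filename,
# e.g. mirror/last-words/872.html (hypothetical layout)
import os
import requests

def save_inmate_page(url, inmate_id, subdir='last-words'):
    dirname = os.path.join('mirror', subdir)
    os.makedirs(dirname, exist_ok=True)
    dest = os.path.join(dirname, '{}.html'.format(inmate_id))
    resp = requests.get(url)
    with open(dest, 'w') as f:
        f.write(resp.text)
    return dest

# e.g. for Mr. Jacobs, whatever his pages' unwieldy URLs are:
# save_inmate_page(jacobs_last_words_url, 872)
# save_inmate_page(jacobs_bio_url, 872, subdir='bios')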

Class notes: Let's focus on last words of Texas death penalty inmates

The index page of offenders can be found here:


An individual inmate's last words page can be found here:

http://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html
Who was Adam Ward? His biographical information is on a wholly separate page:


A simple task (finding out whether Adam Ward was religious enough to mention religion in his last words) ends up being a microcosm of the pain and promise of web scraping...or rather, of being able to programmatically bring together data.
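The "mentions religion" test itself can be as simple as a regex, once the last words are in hand. The word list below is my own illustrative guess, not the assignment's official one:

import re

# a toy pattern: any of these words counts as "mentioning religion"
RELIGION_PAT = re.compile(r'\b(god|jesus|christ|lord|heaven|pray\w*|faith)\b',
                          re.IGNORECASE)

def mentions_religion(text):
    return RELIGION_PAT.search(text) is not None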

The site's GitHub repo has various files stashed away; I'll link to them later:


So the code in this "mirror-tdcj-dp-site" repo is meant to collect and "glue together" the Texas death penalty site in such a way that we can then transform its data into a different site of our own. One that has, for example, faces on the index view. Or one that focuses on education level and age of the convicts, or on other datapoints that are currently buried in the official design.

This is a long way of saying: the "mirror-tdcj-dp-site" repo contains way more code than needed to do a simple scrape job. But you might find some insights from it.

Class example

Some code to fetch and parse a single page:

from bs4 import BeautifulSoup
import requests

SOURCE_URL = 'http://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html'

# download and parse the page
resp = requests.get(SOURCE_URL)
soup = BeautifulSoup(resp.text, 'lxml')

# get the <p> tag whose text is exactly the "Last Statement:" label
el = soup.find('p', text='Last Statement:')
# all subsequent sibling <p> tags should contain the statement itself;
# join them with newlines instead of mashing them together
lastwords = "\n".join(p.text for p in el.find_next_siblings('p'))

# All done with extracting last words...let's try to get the name
offel = soup.find('p', text='Offender:')
# by the page's layout, the next sibling <p> tag contains the name
offender = offel.find_next_sibling('p').text
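Putting the pieces together with the toy mentions_religion() helper sketched earlier (assuming both snippets have been run in the same session):

# prints the offender's name and whether the statement matched the pattern
print(offender, mentions_religion(lastwords))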