The following section contains scripts that all tackle the same homework problem: Scrape and Count Webpages.
They don't meet the exact requirements; they are meant to show different ways to think about the problem, or at least to get you better acquainted with the web-scraping libraries and techniques.
Some assorted, integrated scripts for downloading and extracting data from the WH press briefings can be found at the site repo:
You can read them if you want, but they contain snippets of other features, including my drafts from figuring out the easiest way to introduce the somewhat esoteric syntax of matplotlib.
Some of the inspiration for this is drawn from NPR's excellent time-series analysis, The Fleeting Obsessions Of The White House Press Corps.
Here's the religion-filtering-Texas-death-row script on a Gist; I've tried to clean it up to the point where it might be hard to untangle how things work from the abstractions. Also I decided to make it spit out colorful text:
I wrote a few scripts that are meant to just be run independently of each other in the context of the death row pages.
mydata/lastwordpages. You can then glob these files when evaluating for religiosity, rather than continually pinging the Texas live site.
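As a minimal sketch of the globbing step, assuming the pages were saved into `mydata/lastwordpages` with an `.html` extension (the extension and the `saved_pages` helper name are assumptions, not something taken from the repo):

```python
from pathlib import Path


def saved_pages(dirname="mydata/lastwordpages"):
    """Return the locally saved HTML files in dirname, sorted by filename.

    The '*.html' pattern is an assumption about how the files were named.
    """
    return sorted(Path(dirname).glob("*.html"))
```

Each returned path can then be read with `path.read_text()` and parsed locally, so the Texas site is only hit once per page.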
'Last Statement:', not just match its value exactly.
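One way to match a label that merely contains 'Last Statement:' is to pass a compiled regular expression instead of a plain string, since an exact string match can miss a label with stray whitespace around it. This is a sketch, not the repo's code: the sample HTML and the `extract_last_words` name are made up for illustration, and `string=` is the newer name for Beautiful Soup's `text=` argument.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a saved page; the real markup may differ.
SAMPLE_HTML = """
<p class="bold"> Last Statement: </p>
<p>First paragraph.</p>
<p>Second paragraph.</p>
"""


def extract_last_words(html):
    """Return the text of every <p> after the 'Last Statement:' label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # A regex is searched within the tag's string, so extra whitespace is tolerated
    label = soup.find("p", string=re.compile(r"Last Statement:"))
    if label is None:
        return None
    paragraphs = label.find_next_siblings("p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```

Returning `None` when the label is missing keeps a single odd page from crashing a loop over hundreds of files.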
I had hoped to do a quick lesson on how to build a different/better death penalty site, but then I got caught up in all the different ways there are to just download the Texas site.
The hard part isn't downloading the files. It's organizing them on your end. My approach was to save individual inmate files by their Texas ID number. So, for example, given this URL:
Instead of saving it locally at:
I create a new filename based on Mr. Jacobs' Texas system ID of 872:
And for that matter, saving his biographical information at:
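The ID-based naming scheme described above can be sketched roughly as follows. The directory layout (one subdirectory per page type under `mydata`) and the `local_path` and `save_page` helpers are hypothetical, chosen only to illustrate the idea of keying every file on the inmate's ID:

```python
from pathlib import Path

# Hypothetical layout: one subdirectory per page type, one file per inmate ID
DATA_DIR = Path("mydata")


def local_path(inmate_id, page_type, data_dir=DATA_DIR):
    """Build an ID-based path such as mydata/lastwordpages/872.html."""
    return Path(data_dir) / page_type / f"{inmate_id}.html"


def save_page(html, inmate_id, page_type, data_dir=DATA_DIR):
    """Write html to its ID-based path, creating directories as needed."""
    dest = local_path(inmate_id, page_type, data_dir)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(html, encoding="utf-8")
    return dest
```

Because both the last-words page and the biographical page end up named after the same ID, joining them later is just a matter of matching filenames.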
The index page of offenders can be found here:
An individual inmate's last words page can be found here:
Who was Adam Ward? His biographical information is on a wholly separate page:
A simple task, finding out whether Adam Ward mentioned religion in his last words, ends up being a microcosm of the pain and promise of web-scraping, or rather, of being able to programmatically bring together data.
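Once the last words are available as plain text, the "mentions religion" check itself can be as crude as a keyword search. The sketch below is only an illustration: the term list is a made-up assumption, and a real analysis would need a more careful definition of religiosity.

```python
import re

# Hypothetical keyword list, for illustration only
RELIGIOUS_TERMS = ("god", "jesus", "lord", "christ", "allah",
                   "pray", "prayer", "heaven")

# Word boundaries on both sides so "Christopher" does not count as "christ"
PATTERN = re.compile(r"\b(" + "|".join(RELIGIOUS_TERMS) + r")\b", re.IGNORECASE)


def religious_terms_in(text):
    """Return the matched terms, lowercased, in order of appearance."""
    return [m.lower() for m in PATTERN.findall(text)]


def mentions_religion(text):
    """True if any term from the keyword list appears in text."""
    return PATTERN.search(text) is not None
```

Returning the matched terms, and not just a yes/no, makes it easier to spot false positives when reviewing the results by hand.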
The site github repo has various files stashed away – I'll link to them later:
So the code in this "mirror-tdcj-dp-site" repo is meant to collect and "glue together" the Texas death penalty site in such a way that we can then transform its data into a different site of our own. One that has, for example, faces on the index view. Or one that focuses on education level and age of the convicts, or on other datapoints that are currently buried in the official design.
This is a long way of saying: the "mirror-tdcj-dp-site" repo contains way more code than needed to do a simple scrape job. But you might find some insights from it.
Some code to fetch and parse a single page:
```python
from bs4 import BeautifulSoup
import requests

SOURCE_URL = 'http://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html'
resp = requests.get(SOURCE_URL)
soup = BeautifulSoup(resp.text, 'lxml')

# get the node with the Last Statement content
el = soup.find('p', text='Last Statement:')
# all other subsequent sibling nodes should have the right text
lastwords = ""
for p in el.find_next_siblings('p'):
    lastwords += p.text

# All done with extracting last words...let's try to get their name
offel = soup.find('p', text="Offender:")
# by definition, the next sibling p tag contains the name
offender = offel.find_next_sibling('p').text
print(offender)
print(lastwords)
```