Lecture 2016-04-04

An overview of text and text-like documents and turning them into data

Due Wednesday, 2016-04-06 23:59 (but try to have it done before class so you can ask questions)

0. Read the following and be prepared to discuss

1. Aggregate and inspect healthcare data breach records

In an email to dun@stanford.edu, send a free-form letter (text, attachments, etc.):

2. Aggregate the job losses as recorded by the California WARN Act

Write a Python script that:

Note: you will probably want to have the pdfplumber library for Python installed, which you should be able to do by running the following command at your Terminal/system shell:

  pip install pdfplumber

(Please email me if you have problems with this)

Here are some code snippets showing how to use pdfplumber with the California WARN Act PDFs:
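As a minimal sketch of the idea: open the PDF, pull the table off each page, and total a numeric column. The filename and the column index here are assumptions for illustration; check the layout of the actual WARN PDF you download.

```python
def extract_warn_rows(path):
    """Collect table rows from every page of a WARN PDF, skipping each page's header row."""
    import pdfplumber  # third-party; install with `pip install pdfplumber`
    rows = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()  # each row comes back as a list of strings
            if table:
                rows.extend(table[1:])
    return rows

def sum_layoffs(rows, col):
    """Total an integer column, skipping blank or non-numeric cells."""
    total = 0
    for row in rows:
        cell = (row[col] or "").strip()
        if cell.replace(",", "").isdigit():  # tolerate "1,200"-style numbers
            total += int(cell.replace(",", ""))
    return total

# Hypothetical usage -- the filename and column index depend on the actual PDF:
# rows = extract_warn_rows("ca-warn.pdf")
# print(sum_layoffs(rows, col=3))
```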


Send me another email with:

3. Download and process a data file from another state's WARN Act

Google for another state that posts its WARN Act notices. Convert one of its PDFs into CSV and total up the job losses in that PDF. Or, better yet, find a state that reports the data in a better format than CSV.
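The conversion step can be sketched like this, assuming you already have the table rows extracted (e.g. via pdfplumber's `extract_table()`). The header names and the sample row are made up for illustration; use whatever columns the state's notices actually contain.

```python
import csv

def rows_to_csv(rows, header, out_path):
    """Write extracted table rows out as a CSV file with the given header."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical header and a made-up row, just for illustration:
header = ["company", "city", "notice_date", "employees_affected"]
rows = [["Acme Corp", "Fresno", "2016-03-01", "120"]]
rows_to_csv(rows, header, "warn.csv")
```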

Send me another email with:


Example answers for the number of "innocents" stopped by NYPD Stop and Frisk

The code here shows how to count "innocents": basically, filter for all stops in which both the 'arstmade' and 'sumissue' columns are not equal to 'Y':
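That filter can be sketched as follows. The inline records are made up for illustration; with the real data you would build the same kind of dicts with `csv.DictReader` over the NYPD's stop-and-frisk CSV.

```python
def count_innocent(rows):
    """Count stops where no arrest was made and no summons was issued."""
    return sum(1 for r in rows
               if r["arstmade"] != "Y" and r["sumissue"] != "Y")

# Made-up records for illustration:
rows = [
    {"arstmade": "N", "sumissue": "N"},  # innocent stop
    {"arstmade": "Y", "sumissue": "N"},  # arrest made
    {"arstmade": "N", "sumissue": "Y"},  # summons issued
]
print(count_innocent(rows))  # -> 1
```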

Warmup exercise: Counting the innocent stops by the NYPD

That 'arstmade' records whether an arrest was made, and 'sumissue' whether a summons was issued, is something we learn from the data schema provided by the NYPD.

The problem with PDFs

The Wall Street Journal has mass-downloaded the tens of thousands of scanned PDFs of Secretary Clinton's emails and uploaded them to a viewer (which has links to the zip files):

image wsj-clinton-emails.jpg

You can see their GitHub repo of Python scraping code here: https://github.com/wsjdata/clinton-email-cruncher.

Governor Sarah Palin's emails

Back when Governor Palin was running for VP, her emails were also FOIAed. And they, too, were printed out, redacted, and then scanned back in digitally.

Here's what the extracted text of a digital scan looks like, thanks to the inaccuracies of optical character recognition (OCR) technology:

Best wishes, Louis

Kelli Adams-Jenkinson
Assistant to Ambassador Louis Susman
24 Grosvenor Square
London W1A 1AE

Compare it to the original scanned text:

image clinton-ocr-state.clov.jpg

BTW, here's the full PDF of that email. See if you can spot the other OCR errors:


Examples of PDFs/emails in which digital files were uploaded

Can you see the problem with Blagojevich's document? If not, that's the problem (and an answer to why agencies usually upload scans rather than the original digital files):

image blago-non-redacts.png

Having data our way

Extracting messy text and turning it into structured data and machine-readable formats (e.g. CSV, JSON) will generally be the hardest problem we face. However, even when the data already comes as non-scanned PDFs…or seemingly perfect HTML files (which we can "web-scrape")…or even as raw CSVs…there's still a lot of work needed to wrangle the data.

Medical records breach data

California WARN Act PDFs

These are tables…but they're PDFs. Why do organizations make PDFs when they're so hard to parse?


Pfizer Doctor Payments

Even a nice-looking website like this Pfizer site has inherent problems:

image pfizer-pharma.png

Via the New York Times: Data on Fees to Doctors Is Called Hard to Parse, April 12, 2010