In an email to dun@stanford.edu, send a free-form letter (text, attachments, etc.):
Write a Python script that converts one of the California WARN Act PDFs into a CSV and totals up the job losses listed in it.
Note: you will probably want to have the pdfplumber library for Python installed…which you should be able to do by running the following command at your Terminal/system shell:
pip install pdfplumber
(Please email me if you have problems with this)
Here are some code snippets on how to use pdfplumber with the California WARN Act PDFs:
https://gist.github.com/dannguyen/4263ceb99365def49fba78df59a8d23c
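To give a sense of the workflow, here's a minimal sketch of converting one of those PDFs into a CSV with pdfplumber. The filename is a placeholder, and the table layout of your particular PDF may require adjusting how extract_table() is used:

import csv
import pdfplumber

# Open the downloaded WARN report (placeholder filename) and write every
# table row pdfplumber finds, page by page, into a CSV file.
with pdfplumber.open("ca-warn-report.pdf") as pdf:
    with open("ca-warn-report.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for page in pdf.pages:
            table = page.extract_table()  # a list of rows, or None
            if table:
                writer.writerows(table)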
Send me another email with:
Google for another state that posts its WARN Act notices. Convert one of its PDFs into CSV and total up the job losses in that PDF. Or, better yet, find a state that reports the data in a better format than PDF.
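Once you have a CSV, totaling the job losses is a short loop. A minimal sketch, assuming the relevant column is headed "Employees Affected" (check your file's actual header):

import csv

total = 0
with open("warn-report.csv", newline="") as infile:
    for row in csv.DictReader(infile):
        # Cells may contain commas or be blank, so clean them up first
        value = row["Employees Affected"].replace(",", "").strip()
        if value.isdigit():
            total += int(value)
print("Total job losses:", total)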
Send me another email with:
Warmup exercise: Counting the innocent stops by the NYPD

The code here shows how to count "innocents": basically, filter for all stops in which both the 'arstmade' and 'sumissue' fields are not equal to 'Y'. This is something we learn from the data schema provided by the NYPD:
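A minimal sketch of that filter in Python, assuming you've downloaded the stops data as a CSV (the filename here is a placeholder; 'arstmade' and 'sumissue' are the column names from the NYPD's schema):

import csv

innocents = 0
with open("sqf-2015.csv", newline="") as infile:
    for row in csv.DictReader(infile):
        # An "innocent" stop: no arrest made and no summons issued
        if row["arstmade"] != "Y" and row["sumissue"] != "Y":
            innocents += 1
print("Innocent stops:", innocents)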
The Wall Street Journal has mass-downloaded the tens of thousands of scanned PDFs of Secretary Clinton's emails and uploaded them to a viewer (which has links to the zip files):
You can see their GitHub repo of Python scraping code here: https://github.com/wsjdata/clinton-email-cruncher.
Back when Governor Palin was running for VP, her emails were also FOIAed. And they, too, were printed out, redacted, and then scanned back in digitally.
Here's what the extracted text of a digital scan looks like, thanks to the inaccuracies of optical character recognition (OCR) technology:
Best wishes, Louis
Kelli Adams-Jenkinson
Assistant to Ambassador Louis Susman
24 Grosvenor Square
London W1A 1AE
020-7894-0214
adamskc3@state.clov
Compare it to the scanned text:
BTW, here's the full PDF of that email. See if you can spot the other mistranslations:
http://stash.compjour.org/samples/pdfs/clinton-emails/HRCEmail_DecWeb/C05788935.pdf
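If you want to see the OCR output for yourself, extracting the text layer takes just a few lines with pdfplumber (assuming you've saved the PDF above locally):

import pdfplumber

# Print the embedded (OCR-derived) text of every page
with pdfplumber.open("C05788935.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())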
Can you see the problem with Blagojevich's document? If not, that's the problem (and an answer to why digital scans are usually uploaded):
Extracting messy text and turning it into structured data and machine-readable formats (e.g. CSV, JSON) will generally be the hardest problem we face. However, even when the data already comes as non-scanned PDFs…or seemingly perfect HTML files (which we can "web-scrape")…or even as raw CSVs…there's still a lot of work involved in wrangling the data.
Seen in my doctor's office: pic.twitter.com/LhXkjKvhQd
— Steven Bellovin (@SteveBellovin) March 23, 2016
These are tables…but they're PDFs. Why do organizations make PDFs when they're so hard to parse?
http://www.edd.ca.gov/jobs_and_training/Layoff_Services_WARN.htm
Even a nice-looking website like this Pfizer site has inherent problems:
Via the New York Times: Data on Fees to Doctors Is Called Hard to Parse, April 12, 2010