Due Wednesday, 2016-04-06 23:59 (but try to have it done before class so you can ask questions)
0. Read the following and be prepared to discuss
- Pharmaceutical Company Payments to Physicians: Early Experiences With Disclosure Laws in Vermont and Minnesota, via JAMA, March 21, 2007
- Psychiatrists Top List in Drug Maker Gifts, via New York Times, June 6, 2007
- Researchers Fail to Reveal Full Drug Pay, via New York Times, June 8, 2008
- Data on Fees to Doctors Is Called Hard to Parse, via New York Times, April 12, 2010
- Docs on Pharma Payroll Have Blemished Records, Limited Credentials, via ProPublica, October 10, 2010
1. Aggregate and inspect healthcare data breach records
In an email to (firstname.lastname@example.org), send a free-form letter (text, attachments, etc.) containing:
- The total number of individuals affected.
- A chart showing the number of individuals affected by year (see the sketch after this list).
- An attempt at ascertaining whether or not electronic records are more prone to breaches than paper records.
- A paragraph or so explaining why it is difficult to ascertain any paper vs. electronic trend, in terms of pure numbers. This includes thinking about the limitations of the data and how it was collected and recorded.
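Here's a minimal sketch of the totals-and-chart part, assuming you've saved the breach records as a CSV file. The filename and the column names ("Individuals Affected", "Breach Submission Date") are assumptions based on the HHS breach portal's export, so check them against your actual file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed filename and column names -- verify against the actual export
df = pd.read_csv("breach_report.csv")

# Total number of individuals affected across all records
print("Total individuals affected:", df["Individuals Affected"].sum())

# Sum the individuals affected per year, then chart the result
df["year"] = pd.to_datetime(df["Breach Submission Date"]).dt.year
by_year = df.groupby("year")["Individuals Affected"].sum()
by_year.plot(kind="bar", title="Individuals affected by year")
plt.savefig("breaches_by_year.png")
```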
2. Aggregate the job losses as recorded by the California WARN Act
Write a Python script that:
- Downloads all of the PDF WARN reports for California, which can be found here: http://www.edd.ca.gov/jobs_and_training/Layoff_Services_WARN.htm
- Extracts the tabular data from each PDF and serializes it as CSV (i.e. saves it as a CSV file). You can save it all to one CSV file, or make a CSV file for each PDF. Note that you're going to run into a data "uncleanliness" issue, in which the columns aren't correctly translated across each PDF.
- Use Excel (or you can do it in Python, though you'll have to deal with various text/typecasting errors) to open that file and aggregate the total number of jobs lost.
Note: you will probably want to have the pdfplumber library for Python installed…which you should be able to do by running the following command at your Terminal/system shell:
pip install pdfplumber
(Please email me if you have problems with this)
Here are some code snippets on how to use pdfplumber with the California WARN Act PDFs:
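The snippet below is a minimal sketch rather than a finished answer: the report URL is a placeholder (pull the real links from the EDD page above), and it assumes pdfplumber can detect one table per page, which won't hold cleanly across every report:

```python
import csv
import requests
import pdfplumber

# Placeholder URL -- substitute a real report link from the EDD WARN page
url = "http://www.edd.ca.gov/jobs_and_training/warn/example-report.pdf"

# Download the PDF to disk
with open("warn-report.pdf", "wb") as f:
    f.write(requests.get(url).content)

# Extract each page's table and append the rows to one CSV file
with pdfplumber.open("warn-report.pdf") as pdf, \
     open("warn-report.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for page in pdf.pages:
        table = page.extract_table()
        if table:  # some pages may have no detectable table
            writer.writerows(table)
```

Because the column detection varies from PDF to PDF, expect to clean up stray headers and misaligned cells in the CSV before you total anything.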
Send me another email with:
- Your Python code for downloading the PDFs and converting them to CSV/text
- the CSV file(s) you created
- the total number of California job losses you calculated
3. Download and process a data file from another state's WARN Act
Google for another state that posts its WARN act notices. Convert one of its PDFs into CSV and total up the job losses in that PDF. Or, better yet, find a state that reports the data in a friendlier format than PDF (a sketch for that case follows).
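If the state posts a spreadsheet instead of a PDF, the job gets much easier. This sketch uses a hypothetical URL and column name, so substitute the real ones from whatever state you find:

```python
import pandas as pd
import requests

# Hypothetical URL and column name -- substitute the state's actual
# WARN spreadsheet link and its jobs-affected column
url = "http://example.gov/warn/warn-report.xlsx"
with open("state-warn.xlsx", "wb") as f:
    f.write(requests.get(url).content)

df = pd.read_excel("state-warn.xlsx")  # may require: pip install openpyxl
print("Total jobs affected:", df["Employees Affected"].sum())
```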
Send me another email with:
- Your relevant Python code (for downloading the data)
- The CSV file you created, if any
- The total number of jobs listed in that particular report
Warmup exercise: Counting the innocent stops by the NYPD
The code here shows how to count "innocents": basically, filter for all stops in which neither an arrest was made nor a summons issued, i.e. both the 'arstmade' and 'sumissued' fields are not equal to 'Y'. This is something we learn from the data schema provided by the NYPD.
Example answers to the number of "innocents" stopped by NYPD Stop and Frisk:
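Here's a minimal sketch, assuming the stop records are in a CSV and that the arrest and summons columns are named 'arstmade' and 'sumissued' with 'Y'/'N' values; check the NYPD data dictionary for your file's exact schema:

```python
import csv

# Count "innocent" stops: no arrest made and no summons issued.
# The column names and 'Y'/'N' values are assumptions -- verify them
# against the NYPD data dictionary.
innocent = 0
total = 0
with open("sqf-data.csv") as f:
    for row in csv.DictReader(f):
        total += 1
        if row["arstmade"] != "Y" and row["sumissued"] != "Y":
            innocent += 1

print(innocent, "innocent stops out of", total)
```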
The problem with PDFs
The Wall Street Journal has mass-downloaded tens of thousands of scanned PDFs of Secretary Clinton's emails and uploaded them to a viewer (which has links to the zip files):
You can see their Github repo of Python scraping code here: https://github.com/wsjdata/clinton-email-cruncher.
Governor Sarah Palin's emails
Back when Governor Palin was running for VP, her emails were also FOIAed. And they, too, were printed out, redacted, and then scanned back in digitally.
Here's what the extracted text of a digital scan looks like, thanks to the inaccuracies of optical character recognition (OCR) technology:
Best wishes, Louis
Assistant to Ambassador Louis Susman
24 Grosvenor Square
London W1A 1AE
Compare it to the text as it appears in the scanned image:
BTW, here's the full PDF of that email. See if you can spot the other mistranslations:
Examples of PDFs/emails in which digital files were uploaded
Can you see the problem with Blagojevich's document? If not, that's the problem (and an answer to why digital scans are usually uploaded):
Having data our way
Extracting messy text and turning it into structured data and machine-readable formats (e.g. CSV, JSON) will generally be the hardest problem we face. However, even when the data already comes as non-scanned PDFs…or seemingly perfect HTML files (which we can "web-scrape")…or even as raw CSVs…there's still a lot of work needed to wrangle the data.
Medical records breach data
California WARN Act PDFs
These are tables…but they're PDFs. Why do organizations make PDFs when they're so hard to parse?
Pfizer Doctor Payments
Even a nice-looking website like this Pfizer site has inherent problems:
Via the New York Times: Data on Fees to Doctors Is Called Hard to Parse, April 12, 2010