The code for this lesson can be found here on the site GitHub repo.
Check out the homepage for the CPSC Recalls Application Program Interface (API) to see the full documentation. But here's the main takeaway:
The base endpoint is this:
http://www.saferproducts.gov/RestWebServices/Recall
To request the data in JSON format, we supply a query string of format=json. The following URL will retrieve all of the available recall data in JSON format:
http://www.saferproducts.gov/RestWebServices/Recall?format=json
However, that's a pretty big file. The API (see the PDF manual for the complete info) allows us to do simple text filtering on various fields.
The field most relevant to us is RecallDateStart. To request only the recalls that began in 2016, in JSON format:
http://www.saferproducts.gov/RestWebServices/Recall?format=json&RecallDateStart=2016
However, I don't think there's a way to do more sophisticated date filtering, such as recalls that started between two given dates. Consult the manual if you wish, but year-by-year filtering is good enough for now.
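If year-by-year filtering is the plan, a minimal sketch of building one request URL per year might look like this. It uses only the standard library's urllib.parse.urlencode and downloads nothing; the helper name recall_url and the particular years are made up for illustration.

```python
from urllib.parse import urlencode

ENDPOINT = 'http://www.saferproducts.gov/RestWebServices/Recall'

def recall_url(year):
    """Build the URL for recalls filtered by the given year (hypothetical helper)."""
    query = urlencode({'format': 'json', 'RecallDateStart': year})
    return ENDPOINT + '?' + query

# One URL per year of interest; the years here are arbitrary examples
urls = [recall_url(y) for y in (2014, 2015, 2016)]
for u in urls:
    print(u)
```

Building the query string from a dictionary, rather than gluing text together by hand, keeps the parameter names and values properly encoded.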
For the purposes of this exercise, feel free to use the following URL, which contains data as I've downloaded and stashed on my own server:
http://stash.compjour.org/samples/cpsc/recalls201604.json
This lesson isn't really an exercise in interacting with a live API. But if you want to, you can hit up the live API for fun:
http://www.saferproducts.gov/RestWebServices/Recall?format=json&RecallDateStart=2016
You know how this works:
import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'
resp = requests.get(URL)
txt = resp.text
Note: if you are hitting up the live API, make sure to generate the query string using the params argument of requests.get():
import requests
ENDPOINT = 'http://www.saferproducts.gov/RestWebServices/Recall'
resp = requests.get(ENDPOINT,
                    params={'format': 'json', 'RecallDateStart': 2016})
txt = resp.text
Either way, txt is one giant character string.
type(txt)
# str
len(txt) # i.e. number of characters
# 298351
But we want to turn txt into a data object. Here's how to do it:
import json
datarows = json.loads(txt)
type(datarows)
# list
len(datarows) # number of members
# 99
I'll leave it to you to inspect each data object:
row = datarows[0]
type(row)
# dict
datarows ends up being a bad label. We aren't dealing with rows, as in spreadsheet rows, but with data structured as dictionaries. Inspect any of the members of datarows:
datarows[7]
{'ConsumerContact': 'Fisher-Price at 800-432-5437 from 9 a.m. to 6 p.m. ET Monday through Friday, or online at www.service.mattel.com and click on Recalls & Safety Alerts for more information. ',
'Description': "This recall includes three models of the Fisher-Price cradle swings: CHM84 Soothing Savanna Cradle 'n Swing, CMR40 Sweet Surroundings Cradle 'n Swing, and CMR43 Sweet Surroundings Butterfly Friends Cradle 'n Swing. The swings have two different swinging motions - rocking side-to-side, or swinging head-to-toe, and six different swing speeds from low to high. The product number is located on the seat under the pad. ",
'Hazards': [{'HazardTypeID': '67521',
'Name': 'When the seat peg is not fully engaged the seat can fall unexpectedly, posing a risk of injury to the child.'}],
'Images': [{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'},
{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCMR40PR3Z.jpg'},
{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCMR43PR3Z.jpg'},
{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/Inset1.jpg'},
{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/Inset2.jpg'}],
'Inconjunctions': [],
'Injuries': [{'Name': 'Fisher-Price has received two reports of a seat peg coming out from the seat, causing the seat to fall. No injuries have been reported.'}],
'LastPublishDate': '2016-04-14T00:00:00',
'ManufacturerCountries': [{'Country': 'Mexico'}],
'Manufacturers': [],
'ProductUPCs': [],
'Products': [{'CategoryID': '67570',
'Description': '',
'Model': '',
'Name': 'Cradle ‘n Swings',
'NumberOfUnits': 'About 34,000',
'Type': 'Portable Baby Swings'}],
'RecallDate': '2016-04-14T00:00:00',
'RecallID': 6705,
'RecallNumber': '16143',
'Remedies': [{'Name': 'Consumers should immediately stop using the recalled cradle swing and contact Fisher-Price for revised assembly instructions.'}],
'Retailers': [{'CompanyID': '0',
'Name': 'buybuyBaby, Target and other stores nationwide and online at Amazon.com and other websites from November 2015 through March 2016 for about $170.'}],
'Title': 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard',
'URL': 'http://www.cpsc.gov/en/Recalls/2016/Fisher-Price-Recalls-Infant-Cradle-Swings/'}
I'll leave it to you to inspect the different elements in the dictionary:
row = datarows[7]
row['Title']
# 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard'
row['RecallDate']
# '2016-04-14T00:00:00'
images = row['Images']
type(images)
# list
len(images)
# 5
type(images[0])
# dict
images[0]
# {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'}
images[0]['URL']
# 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'
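Because each record is a plain dictionary, filtering can also happen after the download. The sketch below picks out recalls whose RecallDate falls between two dates, the kind of date-range query the API itself may not support. The sample records are trimmed stand-ins: the first mirrors the real record shown above, the second is invented, and recalls_between is a hypothetical helper name.

```python
from datetime import datetime

# Trimmed sample records; the second one is made up for illustration
sample_rows = [
    {'Title': 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard',
     'RecallDate': '2016-04-14T00:00:00',
     'Images': [{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/Inset1.jpg'}]},
    {'Title': 'Hypothetical Recall of Example Widgets',
     'RecallDate': '2016-01-05T00:00:00',
     'Images': []},
]

def recalls_between(rows, start, end):
    """Return the records whose RecallDate is within [start, end], inclusive."""
    selected = []
    for row in rows:
        # Timestamps in the data look like '2016-04-14T00:00:00'
        recall_date = datetime.strptime(row['RecallDate'], '%Y-%m-%dT%H:%M:%S')
        if start <= recall_date <= end:
            selected.append(row)
    return selected

april = recalls_between(sample_rows, datetime(2016, 4, 1), datetime(2016, 4, 30))
for row in april:
    # Each member of 'Images' is itself a dictionary with a 'URL' key
    image_urls = [img['URL'] for img in row['Images']]
    print(row['Title'], len(image_urls))
```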
The JSON data format is not much different than any other data format we've worked with; it follows a specification, and thus, people have written libraries to make it easy to turn JSON-formatted text into data objects, and vice versa.
The Python standard library includes a module named json.
Here's a quick tutorial from last year: An introduction to data serialization and Python Requests. And here's a quiz.
Basically, JSON is a nice way to serialize nested dictionaries and lists into text, something that can't easily be done in CSV. If you understand how to get around lists and dictionaries, you'll be fine.
(Here's review material for lists and for dictionaries)
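To make the "and vice versa" part concrete, here is a small round trip with the standard json module. The nested structure is a made-up miniature of a recall record, not real data.

```python
import json

# A made-up miniature record: a dictionary holding a list of dictionaries
record = {
    'Title': 'Example recall',
    'Images': [{'URL': 'http://example.com/a.jpg'},
               {'URL': 'http://example.com/b.jpg'}],
    'Manufacturers': [],
}

text = json.dumps(record)      # data object -> JSON-formatted string
restored = json.loads(text)    # JSON-formatted string -> data object

print(type(text))
print(restored == record)
print(restored['Images'][1]['URL'])
```

A flat CSV row has no natural place for a list of image dictionaries inside a single record, which is why JSON is the more comfortable fit for this data.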
Remember how to make a simple Flask app? Review the previous lessons, then see if you can create app.py from memory:
from flask import Flask
app = Flask(__name__)

@app.route("/")
def homepage():
    htmlstr = "<h1>Hello world!</h1>"
    return htmlstr

if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)
Throw that app.py into some new directory for now. Then run it, as we've done before:
OK, making a Flask app is easy enough, but how do we download the CPSC recalls data for its use? Basically, just put the download code right into app.py, preferably before the view functions (i.e. homepage()). To make things a little nicer, we'll wrap up our download code into its own function, get_data():
def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows
And then inside the homepage() view function, we'll call get_data() and use it to print the number of recall records:
@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    return htmlstr
This is what that simple page looks like:
But printing the number of records isn't very exciting. Let's try to print out each record. For now, let's just print each record as plaintext – because that's how the data came to us.
This is just a for loop:
@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += "<p>{record}</p>".format(record=d)
    return htmlstr
And this is why we don't print out raw data as plain text – it's basically incomprehensible:
We saw earlier that each record is a dictionary. So let's just print out things we want to see from each dictionary, not each entire dictionary:
@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += "<p>{title} -- {url}</p>".format(title=d['Title'], url=d['URL'])
    return htmlstr
That looks better:
And of course, with HTML, we generally want the URLs to be clickable hyperlinks. One more iteration on our simple app:
@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr
That's much more Web-ish:
Here's what the code for the entire app looks like:
from flask import Flask
import json
import requests

URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows

app = Flask(__name__)

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr

if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)
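One caveat with building HTML through string formatting: text from the data is pasted into the page as-is, so a title containing characters such as & or < could produce malformed markup. Here is a minimal sketch of one way to guard against that, using the standard library's html.escape. The record_link helper and the sample record are hypothetical and are not part of the lesson's app.

```python
from html import escape

def record_link(record):
    """Render one recall record as a paragraph containing a hyperlink."""
    return '<p><a href="{url}">{title}</a></p>'.format(
        url=escape(record['URL'], quote=True),   # safe inside the href attribute
        title=escape(record['Title']))           # safe as visible link text

# Made-up record with characters that are special in HTML
sample = {'Title': 'Toys & Games <Recall>', 'URL': 'http://example.com/?a=1&b=2'}
print(record_link(sample))
# <p><a href="http://example.com/?a=1&amp;b=2">Toys &amp; Games &lt;Recall&gt;</a></p>
```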
We're almost done here. But before moving on, it's worth doing some logistical cleanup.
Our app.py is getting messy looking. For starters, those HTML strings clutter up the view function – but we'll deal with that in the next lesson on Jinja templates.
Let's deal with the fact that our data-fetching code is intermingled with our web-app code.
Here's the code that defines the data-fetching function:
import json
import requests

URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows
Any reason that needs to be inside of app.py? Nope. So let's move it to its own file.
In the same directory as app.py, create a new file named datafoo.py. It should contain the data-fetching code as defined above:
import json
import requests

URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows
Then trim app.py to look like this, and take note of the second from/import statement:
from flask import Flask
from datafoo import get_data

app = Flask(__name__)
datarows = get_data()

@app.route("/")
def homepage():
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr

if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)
One more subtlety: I've put the following line:
datarows = get_data()
in the global scope, as opposed to just within the scope of the homepage function:
@app.route("/")
def homepage():
    datarows = get_data()
This means it runs only once, when the app starts. For development purposes, get_data() doesn't repeat its slow download each time we refresh the homepage, only when we restart the app.
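The difference between the two placements can be seen without Flask at all. In this sketch, fake_get_data is a stand-in for the slow download and simply counts how many times it is called. All of the names are invented for the demonstration.

```python
call_count = 0

def fake_get_data():
    """Stand-in for the slow download; counts its own calls."""
    global call_count
    call_count += 1
    return [{'Title': 'Example recall'}]

# Placement 1: fetch inside the view, so every page load triggers a download
def homepage_fetch_each_time():
    rows = fake_get_data()
    return len(rows)

for _ in range(3):          # simulate three page refreshes
    homepage_fetch_each_time()
calls_when_inside = call_count

# Placement 2: fetch once at module level, then reuse the result
call_count = 0
datarows = fake_get_data()

def homepage_reuse():
    return len(datarows)

for _ in range(3):          # simulate three page refreshes
    homepage_reuse()
calls_when_global = call_count

print(calls_when_inside, calls_when_global)
# 3 1
```

The trade-off is that the data stays as it was at startup until the app is restarted.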
Shut down the localhost webserver, then re-start it. Even with the code separated, everything should work the same.
This lesson was meant as a quick review of how to make a Flask app, a quick introduction to JSON (basically, dictionaries and lists, in string format), and a demonstration of a key software engineering concept: separation of concerns – in this case, the data-fetching code from the web-app code.
Our web apps will no longer be just app.py. In subsequent lessons, we'll focus on separating out the messy HTML-generating view code from app.py.