Importing the CPSC Recall data into our Flask app (1/6)

This article is part of a sequence:
Introduction to Simple News Apps based on CPSC Recall Data
Learning to make a news app by trying to make a better Recalls page

Objectives

  • Review basic Flask app construction
  • Review importing and parsing of JSON
  • Understand the structure of the CPSC Recalls data

Structure of the API

Check out the homepage for the CPSC Recalls Application Program Interface (API) to see the full documentation. But here's the main takeaway:

The base endpoint is this:

  http://www.saferproducts.gov/RestWebServices/Recall

To request the data in JSON format, we supply a query string of format=json. The following URL will retrieve all of the available recall data in JSON format:

http://www.saferproducts.gov/RestWebServices/Recall?format=json

Filtering by RecallDateStart

However, that's a pretty big file. The API (see the PDF manual for the complete info) allows us to do simple text filtering on various fields.

The field most relevant to us is RecallDateStart. To request only the recalls that began in 2016, in JSON format, use:

http://www.saferproducts.gov/RestWebServices/Recall?format=json&RecallDateStart=2016

However, I don't think there's a way to do more sophisticated date filtering, such as recalls that started between two given dates. Consult the manual if you wish, but year-by-year filtering is good enough for now.
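
Year-by-year filtering is easy to script. Here is a minimal sketch that builds one request URL per year from the endpoint and the two query parameters described above. The helper name recall_url is made up for illustration, and the sketch only constructs the URLs; it does not contact the API.

```python
from urllib.parse import urlencode

# Base endpoint quoted earlier in this article.
ENDPOINT = 'http://www.saferproducts.gov/RestWebServices/Recall'


def recall_url(year):
    """Build a JSON request URL filtered by RecallDateStart (hypothetical helper)."""
    query = urlencode({'format': 'json', 'RecallDateStart': year})
    return ENDPOINT + '?' + query


# One URL per year we might care about
urls = [recall_url(y) for y in (2014, 2015, 2016)]
for u in urls:
    print(u)
```

The last URL printed is the same 2016 URL shown above.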

Using canned data

For the purposes of this exercise, feel free to use the following URL, which points to a copy of the data that I've downloaded and stashed on my own server:

http://stash.compjour.org/samples/cpsc/recalls201604.json

This lesson isn't really an exercise in interacting with a live API. But if you want to, you can hit up the live API for fun:

http://www.saferproducts.gov/RestWebServices/Recall?format=json&RecallDateStart=2016

Downloading the data with requests

You know how this works:

import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'
resp = requests.get(URL)
txt = resp.text

Note: if you are hitting up the live API, make sure to generate the query string using the requests.get() params argument:

import requests
ENDPOINT = 'http://www.saferproducts.gov/RestWebServices/Recall'
resp = requests.get(ENDPOINT, 
                    params={'format':'json', 'RecallDateStart': 2016})
txt = resp.text

Deserializing JSON

Either way, txt is a big giant character string.

type(txt)
# str
len(txt)  # i.e. number of characters
# 298351

But we want to turn txt into a data object. Here's how to do it:

import json
datarows = json.loads(txt)
type(datarows)
# list
len(datarows) # number of members
# 99

I'll leave it to you to inspect each data object:

row = datarows[0]
type(row)
# dict
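
If you don't have the downloaded text handy, the same steps can be tried on a tiny stand-in. This sketch uses a hand-written JSON string (not real CPSC data) to show that json.loads() turns a JSON array of objects into a list of dictionaries, and that text which isn't JSON raises json.JSONDecodeError.

```python
import json

# A tiny hand-written stand-in for the downloaded text (not real CPSC data).
sample_txt = (
    '[{"RecallID": 1, "Title": "Example A", "Images": []},'
    ' {"RecallID": 2, "Title": "Example B",'
    '  "Images": [{"URL": "http://example.com/b.jpg"}]}]'
)

sample_rows = json.loads(sample_txt)

print(type(sample_rows).__name__)      # list
print(len(sample_rows))                # 2
print(type(sample_rows[0]).__name__)   # dict
print(sorted(sample_rows[0].keys()))   # ['Images', 'RecallID', 'Title']

# Text that is not JSON (for example, an HTML error page) raises an exception.
try:
    json.loads('<html>Service unavailable</html>')
except json.JSONDecodeError:
    print('That text was not JSON')
```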

The structure of each recall record

datarows ends up being a bad label. We aren't dealing with rows, as in spreadsheet rows, but data that is structured as a dictionary. Inspect any of the members of datarows:

datarows[7]
{'ConsumerContact': 'Fisher-Price at 800-432-5437 from 9 a.m. to 6 p.m. ET Monday through Friday, or online at www.service.mattel.com and click on Recalls & Safety Alerts for more information. ',
 'Description': "This recall includes three models of the Fisher-Price cradle swings: CHM84 Soothing Savanna Cradle 'n Swing, CMR40 Sweet Surroundings Cradle 'n Swing, and CMR43 Sweet Surroundings Butterfly Friends Cradle 'n Swing. The swings have two different swinging motions - rocking side-to-side, or swinging head-to-toe, and six different swing speeds from low to high. The product number is located on the seat under the pad. ",
 'Hazards': [{'HazardTypeID': '67521',
   'Name': 'When the seat peg is not fully engaged the seat can fall unexpectedly, posing a risk of injury to the child.'}],
 'Images': [{'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'},
  {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCMR40PR3Z.jpg'},
  {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCMR43PR3Z.jpg'},
  {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/Inset1.jpg'},
  {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/Inset2.jpg'}],
 'Inconjunctions': [],
 'Injuries': [{'Name': 'Fisher-Price has received two reports of a seat peg coming out from the seat, causing the seat to fall. No injuries have been reported.'}],
 'LastPublishDate': '2016-04-14T00:00:00',
 'ManufacturerCountries': [{'Country': 'Mexico'}],
 'Manufacturers': [],
 'ProductUPCs': [],
 'Products': [{'CategoryID': '67570',
   'Description': '',
   'Model': '',
   'Name': 'Cradle ‘n Swings',
   'NumberOfUnits': 'About 34,000',
   'Type': 'Portable Baby Swings'}],
 'RecallDate': '2016-04-14T00:00:00',
 'RecallID': 6705,
 'RecallNumber': '16143',
 'Remedies': [{'Name': 'Consumers should immediately stop using the recalled cradle swing and contact Fisher-Price for revised assembly instructions.'}],
 'Retailers': [{'CompanyID': '0',
   'Name': 'buybuyBaby, Target and other stores nationwide and online at Amazon.com and other websites from November 2015 through March 2016 for about $170.'}],
 'Title': 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard',
 'URL': 'http://www.cpsc.gov/en/Recalls/2016/Fisher-Price-Recalls-Infant-Cradle-Swings/'}

I'll leave it to you to inspect the different elements in the dictionary:

row = datarows[7]
row['Title']
# 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard'
row['RecallDate']
# '2016-04-14T00:00:00'
images = row['Images']
type(images)
# list
len(images)
# 5
type(images[0])
# dict
images[0]
# {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'}
images[0]['URL']
# 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'
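
Once you know the shape of a record, it's handy to flatten the nested parts you care about. The sketch below works on a trimmed copy of the record shown above, so it runs without downloading anything. The summarize helper is a made-up name, and the date format string is an assumption based on the RecallDate values seen in this sample.

```python
from datetime import datetime

# Trimmed copy of the record shown above; only the fields used here.
record = {
    'Title': 'Fisher-Price Recalls Infant Cradle Swings Due to Fall Hazard',
    'RecallDate': '2016-04-14T00:00:00',
    'Images': [
        {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCHM84PR2Z.jpg'},
        {'URL': 'http://www.cpsc.gov/Global/Images/Recall/2016/16143/FPCMR40PR3Z.jpg'},
    ],
    'ManufacturerCountries': [{'Country': 'Mexico'}],
}


def summarize(rec):
    """Flatten the nested fields we care about (hypothetical helper)."""
    # Assumes dates look like '2016-04-14T00:00:00', as in the sample data
    recall_date = datetime.strptime(rec['RecallDate'], '%Y-%m-%dT%H:%M:%S')
    return {
        'title': rec['Title'],
        'date': recall_date.date(),
        # .get() with a default guards against a record missing the key
        'image_urls': [img['URL'] for img in rec.get('Images', [])],
        'countries': [c['Country'] for c in rec.get('ManufacturerCountries', [])],
    }


summary = summarize(record)
print(summary['date'])             # 2016-04-14
print(len(summary['image_urls']))  # 2
print(summary['countries'])        # ['Mexico']
```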

More about JSON

The JSON data format is not much different from any other data format we've worked with; it follows a specification, and thus, people have written libraries to make it easy to turn JSON-formatted text into data objects, and vice versa.

The Python standard library includes a module named json.

Here's a quick tutorial from last year: An introduction to data serialization and Python Requests. And here's a quiz.

Basically, JSON is a nice way to serialize nested dictionaries and lists into text, something that can't easily be done in CSV. If you understand how to get around lists and dictionaries, you'll be fine.

(Here's review material for lists and for dictionaries)
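
To see the "and vice versa" part in action, here is a short round trip with made-up data: json.dumps() serializes a nested dictionary to text, and json.loads() turns that text back into an equal dictionary.

```python
import json

# Made-up record with a list of dictionaries nested inside a dictionary
record = {
    'Title': 'Example recall',
    'Images': [{'URL': 'http://example.com/1.jpg'},
               {'URL': 'http://example.com/2.jpg'}],
}

text = json.dumps(record, indent=2)   # dict -> str
print(type(text).__name__)            # str

restored = json.loads(text)           # str -> dict
print(restored == record)             # True
print(restored['Images'][1]['URL'])   # http://example.com/2.jpg
```

A single CSV row has no natural place for that Images list, which is why nested data like this tends to be shipped as JSON.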

Creating a Flask app

Remember how to make a simple Flask app? Review the previous lessons, then see if you can create app.py from memory:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def homepage():
    htmlstr = "<h1>Hello world!</h1>"
    return htmlstr


if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)

Throw that app.py into some new directory for now. Then run it as we've done before (for example, with python app.py from inside that directory).

Downloading data into our Flask app

OK, making a Flask app is easy enough, but how do we download the CPSC recalls data for its use? Basically, just put the download code right into app.py – preferably before writing the view functions (i.e. homepage()). To make things a little nicer, we'll wrap up our download code into its own function – get_data():

import json
import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows
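
If you'd rather not hit the network every time, one option is to keep your own canned copy on disk. This is only a sketch of that idea and not part of the lesson's app: load_or_fetch is a hypothetical helper, and the demo passes in a fake fetcher so it runs offline. In real use, the fetcher could be something like lambda: requests.get(URL).text.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return parsed JSON from `path`, calling `fetch()` only if the file is missing.

    `fetch` is any zero-argument function that returns JSON text.
    """
    if not os.path.exists(path):
        with open(path, 'w', encoding='utf-8') as f:
            f.write(fetch())
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Demo with fake fetchers so the sketch runs without network access.
cache_path = os.path.join(tempfile.mkdtemp(), 'recalls.json')
rows = load_or_fetch(cache_path, lambda: '[{"Title": "Example recall"}]')
rows_again = load_or_fetch(cache_path, lambda: '[]')  # ignored: file already exists
print(rows == rows_again)  # True
```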

Printing the number of downloaded records

And then inside the homepage() view function, we'll call get_data() and use it to print the number of recall records:

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    return htmlstr

This is what that simple page looks like:

image hello-99-recalls.png

Printing each record

But printing the number of records isn't very exciting. Let's try to print out each record. For now, let's just print each record as plaintext – because that's how the data came to us.

This is just a for loop:

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += "<p>{record}</p>".format(record=d)
    return htmlstr

And this is why we don't print out raw data as plain text – it's basically incomprehensible:

image hello-99-recalls-eachone.png

Selectively printing attributes from each record

We saw earlier that each record is a dictionary. So let's just print out things we want to see from each dictionary, not each entire dictionary:

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += "<p>{title} -- {url}</p>".format(title=d['Title'], url=d['URL'])
    return htmlstr

That looks better:

image hello-99-recalls-eachone-title.png

And of course, with HTML, we generally want the URLs to be clickable hyperlinks. One more iteration on our simple app:

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr

That's much more Web-ish:

image hello-99-recalls-eachone-hyperlinked.png
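
One caveat with building HTML out of raw strings: recall text can contain characters such as & or < (the ConsumerContact field in the record shown earlier has an &), and those have special meaning in HTML. A hedged sketch of one way to handle that, using the standard library's html.escape(), is below. The render_recalls function is a made-up name that pulls the string-building out of the view so it can be tried without Flask, and the sample record is invented.

```python
from html import escape


def render_recalls(datarows):
    """Return the homepage HTML for a list of recall dictionaries (hypothetical helper)."""
    parts = ["<h1>Hello world!</h1>"]
    parts.append("<p>There have been {num} recalls</p>".format(num=len(datarows)))
    for d in datarows:
        # Escape both values so characters like & and < can't break the markup
        parts.append('<p><a href="{url}">{title}</a></p>'.format(
            url=escape(d['URL'], quote=True),
            title=escape(d['Title'])))
    return "".join(parts)


# Invented record with characters that need escaping
sample = [{'Title': 'Swings & <Things>', 'URL': 'http://example.com/?a=1&b=2'}]
print(render_recalls(sample))
```

Inside the Flask view, homepage() could then simply return render_recalls(datarows).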

All together

Here's what the code for the entire app looks like:

from flask import Flask
import json
import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows

app = Flask(__name__)

@app.route("/")
def homepage():
    datarows = get_data()
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr

if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)

Separating the app and data code

We're almost done here. But before moving on, it's worth doing some logistical cleanup.

Our app.py is getting messy looking. For starters, those HTML strings clutter up the view function – but we'll deal with that in the next lesson on Jinja templates.

Let's deal with the fact that our data-fetching code is intermingled with our web-app code.

Here's the code that defines the data-fetching function:

import json
import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows

Any reason that needs to be inside of app.py? Nope. So let's move it to its own file.

Creating datafoo.py

In the same directory as app.py, create a new file named datafoo.py. It should contain the data-fetching code as defined above:

import json
import requests
URL = 'http://stash.compjour.org/samples/cpsc/recalls201604.json'

def get_data():
    resp = requests.get(URL)
    datarows = json.loads(resp.text)
    return datarows

Then trim app.py to look like this – and take note of the second from/import statement:

from flask import Flask
from datafoo import get_data

app = Flask(__name__)
datarows = get_data()

@app.route("/")
def homepage():
    htmlstr = "<h1>Hello world!</h1>"
    htmlstr += "<p>There have been {num} recalls</p>".format(num=len(datarows))
    for d in datarows:
        htmlstr += """<p><a href="{url}">{title}</a></p>""".format(title=d['Title'], url=d['URL'])
    return htmlstr

if __name__ == '__main__':
    app.run(use_reloader=True, debug=True)

One more subtlety: I've put the following line:

datarows = get_data()

– in the global scope – as opposed to just within the scope of the homepage function:

@app.route("/")
def homepage():
    datarows = get_data()

This means get_data() runs only once, when the app starts. For development purposes, that spares us the slow download each time we refresh the homepage; the data is fetched again only when we restart the app.

Shut down the localhost webserver, then re-start it. Even with the code separated, everything should work the same.
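
If you'd prefer to keep the call inside homepage() and still download only once, memoizing the function is one alternative. This is a minimal sketch, not the approach this lesson uses: get_data_cached and call_count are hypothetical names, and the function body fakes the download so the example runs offline.

```python
from functools import lru_cache

call_count = 0  # counts how many times the "download" actually runs


@lru_cache(maxsize=1)
def get_data_cached():
    """Stand-in for get_data(); a real version would call requests.get()."""
    global call_count
    call_count += 1
    return [{'Title': 'Example recall', 'URL': 'http://example.com/'}]


first = get_data_cached()
second = get_data_cached()   # served from the cache, no second "download"
print(call_count)            # 1
print(first is second)       # True
```

Note that every caller gets the same list object back, so code that modifies it changes what later callers see.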

Conclusion

This lesson was meant as a quick review of how to make a Flask app, a quick introduction to JSON (basically, dictionaries and lists, in string format), and a demonstration of a key software engineering concept: separation of concerns – in this case, the data-fetching code from the web-app code.

Our web apps will no longer be just app.py. In subsequent lessons, we'll focus on separating out the messy HTML-generating view code from app.py.
