Downloading the data

The Sunlight Foundation compiles legislator data in this convenient CSV file:

http://unitedstates.sunlightfoundation.com/legislators/legislators.csv

In this step, we're just downloading the data. We don't need to understand its full contents, other than that it's a text file. If you're interested in perusing its contents – which we'll do in the next lesson – I've uploaded it to Gist.

Using Requests to download from a URL

This should be familiar to you; try running it in interactive Python to make sure you grok all the steps:

import requests
url = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
resp = requests.get(url)

Saving data to disk

Remember that in programming, requesting data from a remote URL is a separate step from saving it to your computer. Once the requests.get() call above finishes, and assuming that it worked, we can save the contents of the response object, resp, to a file named legislators.csv:

data_filename = 'legislators.csv'
with open(data_filename, 'w') as wf:
    wf.write(resp.text)

At the end of this process, you should have a file named legislators.csv saved in the current working directory.

Writing fetch.py

This data-fetching routine is going to be used in our main application. So let's save it in a Python file named fetch.py, which allows us to run it from the command line:

$ python fetch.py

So, create a new file named fetch.py, and start writing code.

We can use the code from the previous example, but we need to add some print() statements… otherwise the Python code in fetch.py will run silently. Modify it to look like this:

import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

print("Downloading from", DATA_URL)
resp = requests.get(DATA_URL)

with open(DATA_FILENAME, 'w') as wf:
    msg = "Saving {numchars} to: {filename}"
    print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
    wf.write(resp.text)

This is what it looks like when you run python fetch.py in your system shell:

Separating code into functions, away from the "main" action

Sometimes when running a Python script, we don't want to necessarily execute everything in it. Sometimes we want to just store variable and functions ina script file, and let other scripts refer to those variables and functions. In fact, we've already been doing that in most of our code when we use the import statement.

This subsection is a segue for explaining the Python idiom of __name__ and '__main__' – and to just push you through a Python convention for organizing code. If you already know it, you can skip this section. If you don't know it, you probably won't totally understand it, but it's enough to know that the pattern exists.

What is `name` and `'main'`?

Back when we were learning how to make a basic Flask app, we saw – and blindly repeated – this following snippet of code in app.py:

if __name__ == '__main__':
    app.run()

What does that if-statement imply? More specifically, what is that __name__ variable, and what kind of object does it refer to? And why do we care if it is equivalent to '__main__'?

First, the fact that __name__ is being tested to see if it is equal to the string value '__main__' implies that __name__ itself is a string. Jump into ipython yourself and test it out:

Yep, that __name__ object is a string. And in ipython, its value is indeed equal to '__main__'.

However, the value of __name__ changes when you import an external file into iPython.

Make a throwaway file named fooey.py. In that file, write and save the following Python code.

print("Inside foo.py, the name is:", __name__)

Run it from your command-line interpreter:

$ python foo.py

The output should be this:

Inside foo.py, the name is: __main__

Now, launch ipython and import the foo.py file with this statement:

import foo

The code inside of foo.py should immediately execute…but note the change in __name__ value:

Inside foo.py, the name is: foo

If this isn't clear enough, here's a screencast of me:

Running python foo.py from the command-line – i.e. reading foo.py with the Python interpreter.
Jumping into the ipython shell and importing foo.py

Recall the contents of foo.py; it's a simple print statement:

print("Inside foo.py, the name is:", __name__)

When we execute a Python file directly – i.e. python foo.py – the code in that script has a top level context. Or, for the sake of Python convention, its __name__ is '__main__'.

However, if foo.py is imported by another Python script or process, such as jumping into the ipython shell and running import foo, then the context, or rather the __name__, under which foo.py operates is now 'foo'. Not '__main__'.

If you want a more official explanation, you can check these links:

Via the [Python documentation]((https://docs.python.org/3/library/main.html):

A module can discover whether or not it is running in the main scope by checking its own __name__, which allows a common idiom for conditionally executing code in a module when it is run as a script or with python -m but not when it is imported:
if __name__ == "__main__":
   # execute only if run as a script
   main()

Refactoring fetch.py

If that __name__ and '__main__' business was hard to swallow, let's see how it applies to our fetch.py.

Recall that fetch.py contains this:

import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

print("Downloading from", DATA_URL)
resp = requests.get(DATA_URL)

with open(DATA_FILENAME, 'w') as wf:
    msg = "Saving {numchars} to: {filename}"
    print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
    wf.write(resp.text)

We can run fetch.py via the command-line interpreter like so:

$ python fetch.py

Which results in this output:

Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv

The same thing happens if you go into the ipython shell and run this Python command:

import fetch

So what does it mean to refactor it to consider the top-level, i.e. '__main__' scope?

First, it means separating the code that defines the variables from the code that does the work. Typically, we set the latter code off in its own function.

In the example below, the code that downloads/saves the data file is in a function named fetch_data(). Refactor fetch.py to look the same:

import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

def fetch_data():
    print("Downloading from", DATA_URL)
    resp = requests.get(DATA_URL)

    with open(DATA_FILENAME, 'w') as wf:
        msg = "Saving {numchars} to: {filename}"
        print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
        wf.write(resp.text)

So, one difference is that if we now attempt to execute the script via the command line:

$ python fetch.py

…nothing will happen. That's because while the fetch_data() function has been defined, it is never executed.

Jump into ipython and import fetch.py:

import fetch

Again, nothing will happen.

But now, at the ipython shell prompt, execute the fetch.fetch_data() function:

fetch.fetch_data()

And you'll get the following output:

Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv

Or, if you prefer the more Pythonic way of importing modules and their function names into the space:

from fetch import fetch_data
fetch_data()

Defining the top-level `'main'` routine

OK, so basically we've deferred the fetching data routine to only run when the fetch.fetch_data() function is executed.

However, we miss the convenience of getting that functionality when running fetch.py from the command line:

$ python fetch.py

How can we get back that convenience?

That's what the if __name__ == '__main__': branch checks. Remember how when we ran python foo.py from the command-line, it told us its __name__ was '__main__'? Here's how we use that insight in fetch.py – note the only change is the addition of two lines at the bottom:

if __name__ == '__main__':
    fetch_data()

The full fetch.py file:

import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

def fetch_data():
    print("Downloading from", DATA_URL)
    resp = requests.get(DATA_URL)

    with open(DATA_FILENAME, 'w') as wf:
        msg = "Saving {numchars} to: {filename}"
        print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
        wf.write(resp.text)

if __name__ == '__main__':
    fetch_data()

Now running from the command-line will, by default, execute the code in that __main__ branch:

$ python fetch.py

Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv

Caching data files

We're almost done; let's add one more feature to our fetching routine: the logic for not downloading the file when we already have it.

The slowest part of fetch.py, by an order of magnitude, is the time it takes to download legislators.csv from the Sunlight Foundation's remote server. Because the composition of Congress changes relatively slowly, why do we need to keep redownloading it over and over?

The Python standard library contains the os.path module which contains a set of helpful file-related functions, including the exists() function which accepts an argument for a string, and then returns True or False based on whether that string represents a filename that exists.

You can try in ipython. Assuming you're running ipython in the same directory to which you've been running fetch.py:

from os.path import exists
exists("booobooo123.blahblah")
# False
exists("fetch.py")
# True
exists("legislators.csv")
# True

See if you can figure out how to include it in fetch.py yourself before checking my answer below.

fetch.py with a conditional check for file existence

This is what my fetch.py looks like:

from os.path import exists
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

def fetch_data():
    if exists(DATA_FILENAME):
        print("Already downloaded", DATA_FILENAME)
    else:
        print("Downloading from", DATA_URL)
        resp = requests.get(DATA_URL)
        with open(DATA_FILENAME, 'w') as wf:
            msg = "Saving {numchars} to: {fname}"
            print(msg.format(numchars=len(resp.text), fname=DATA_FILENAME))
            wf.write(resp.text)

if __name__ == '__main__':
    fetch_data()

Let's see if it actually works. Delete the legislators.csv file. Then run fetch.py from the command-line twice. The first time, it should print out the messages telling us that it's downloading from a URL. The second time, it should tell us that the file has already been downloaded and, more importantly, the program should exit almost instantaneously:

Add a read_data() function

OK OK, one more function, the necessity of which won't make sense until the next lesson.

Currently, our fetch_data() function in fetch.py does not return anything. Instead, it does the work of downloading and saving a file, and notifying us (via print()) that the download has happened…but at the end of the function, nothing is returned. Which is always the case when there is no return statement.

However, other parts of our code base may want access to the raw text data that is in DATA_FILENAME, i.e. legislators.csv…but those parts of the code base shouldn't care about the fetch_data() function…they don't want to worry about downloading data. They just want to get the raw text.

So let's create a little "helper" (also known as a "convenience") function. We'll call it read_data(), and whenever someone calls it, it does the following:

runs fetch_data() to ensure that data has been downloaded and that the file exists
opens the file at DATA_FILENAME, reads it, and simply returns its contents as a big string.

Here's the function definition that you can put into fetch.py:

def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        txt = rf.read()
    return txt

Or, more concisely:

def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        return rf.read()

Pop open your ipython shell. Then let's test out the difference between fetch_data() and read_data():

from fetch import fetch_data, read_data
f_result = fetch_data()
type(f_result)
# NoneType
f_result
# None

r_result = read_data()
type(r_result)
# str
len(r_result)
# 268767
r_result[0:50]
# 'title,firstname,middlename,lastname,name_suffix,ni'
len(r_result.splitlines())

Here's an interactive demonstration of what that looks like:

All together

The final version of fetch.py before moving on:

from os.path import exists
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'

def fetch_data():
    if exists(DATA_FILENAME):
        print("Already downloaded", DATA_FILENAME)
    else:
        print("Downloading from", DATA_URL)
        resp = requests.get(DATA_URL)
        with open(DATA_FILENAME, 'w') as wf:
            msg = "Saving {numchars} to: {fname}"
            print(msg.format(numchars=len(resp.text), fname=DATA_FILENAME))
            wf.write(resp.text)

def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        return rf.read()

if __name__ == '__main__':
    fetch_data()

Fetching the Congressmember data to make a birthday Flask app

Summary

Objectives

Downloading the data

Using Requests to download from a URL

Saving data to disk

Writing fetch.py

Separating code into functions, away from the "main" action

What is `name` and `'main'`?

Refactoring fetch.py

Defining the top-level `'main'` routine

Caching data files

fetch.py with a conditional check for file existence

Add a read_data() function

All together

Fetching the Congressmember data to make a birthday Flask app

Summary

Objectives

Downloading the data

Using Requests to download from a URL

Saving data to disk

Writing fetch.py

Separating code into functions, away from the "main" action

What is __name__ and '__main__'?

Refactoring fetch.py

Defining the top-level '__main__' routine

Caching data files

fetch.py with a conditional check for file existence

Add a read_data() function

All together

What is `name` and `'main'`?

Defining the top-level `'main'` routine