Even though we intend to build a Flask app, our first set of scripts is going to have nothing to do with the Flask library. Instead, we need to do the work of just getting the data and organizing it. These are steps that we can perform – and then put into their own scripts – before we get to the Flask-app making part.
The Sunlight Foundation compiles legislator data in this convenient CSV file:
http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
In this step, we're just downloading the data. We don't need to understand its full contents, other than that it's a text file. If you're interested in perusing its contents – which we'll do in the next lesson – I've uploaded it to Gist.
This should be familiar to you; try running it in interactive Python to make sure you grok all the steps:
import requests
url = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
resp = requests.get(url)
Remember that in programming, requesting data from a remote URL is a separate step from saving it to your computer. Once the requests.get() call above finishes, and assuming that it worked, we can save the contents of the response object, resp, to a file named legislators.csv:
data_filename = 'legislators.csv'
with open(data_filename, 'w') as wf:
    wf.write(resp.text)
At the end of this process, you should have a file named legislators.csv saved in the current working directory.
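That holds only if the request actually succeeded. As written, a failed request (say, a 404 error page) would still be saved under the name legislators.csv. Here is a minimal, more defensive sketch of the same two steps. The function names download_text() and save_text(), the 30-second timeout, and the explicit UTF-8 encoding are assumptions made for illustration; they are not part of the lesson's code:

```python
import requests


def download_text(url, timeout=30):
    """Request url and return the response body as text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    quietly handing back an error page.
    """
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text


def save_text(text, filename):
    """Write text to filename; return the number of characters written."""
    # An explicit encoding avoids depending on the platform default.
    with open(filename, 'w', encoding='utf-8') as wf:
        return wf.write(text)


# Hypothetical usage, mirroring the two separate steps described above:
# text = download_text('http://unitedstates.sunlightfoundation.com/legislators/legislators.csv')
# save_text(text, 'legislators.csv')
```

Keeping the download and the save in separate functions mirrors the point above: fetching and writing are two distinct steps, and either one can fail on its own.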
This data-fetching routine is going to be used in our main application. So let's save it in a Python file named fetch.py, which allows us to run it from the command line:
$ python fetch.py
So, create a new file named fetch.py, and start writing code.
We can use the code from the previous example, but we need to add some print() statements; otherwise the Python code in fetch.py will run silently. Modify it to look like this:
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
print("Downloading from", DATA_URL)
resp = requests.get(DATA_URL)
with open(DATA_FILENAME, 'w') as wf:
    msg = "Saving {numchars} to: {filename}"
    print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
    wf.write(resp.text)
When you run python fetch.py in your system shell, it should print one message as the download starts and another as the file is saved.
Sometimes when running a Python script, we don't necessarily want to execute everything in it. Sometimes we just want to store variables and functions in a script file, and let other scripts refer to those variables and functions. In fact, we've already been doing that in most of our code when we use the import statement.
This subsection is a segue for explaining the Python idiom of __name__ and '__main__', and for nudging you through a Python convention for organizing code. If you already know it, you can skip this section. If you don't know it, you probably won't totally understand it, but it's enough to know that the pattern exists.
__name__ and '__main__'?
Back when we were learning how to make a basic Flask app, we saw (and blindly repeated) the following snippet of code in app.py:
if __name__ == '__main__':
    app.run()
What does that if-statement imply? More specifically, what is that __name__ variable, and what kind of object does it refer to? And why do we care if it is equivalent to '__main__'?
First, the fact that __name__ is being tested to see if it is equal to the string value '__main__' implies that __name__ itself is a string. Jump into ipython yourself and test it out.
Yep, that __name__ object is a string. And in ipython, its value is indeed equal to '__main__'.
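If you don't have ipython handy, the same check can be sketched in a couple of lines of plain Python (the variable name name_type is just for illustration):

```python
# __name__ is defined automatically in every module and interactive session.
name_type = type(__name__)

print(name_type)                  # <class 'str'>
print(isinstance(__name__, str))  # True
```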
However, the value of __name__ is different inside a file that you import into ipython.
Make a throwaway file named foo.py. In that file, write and save the following Python code.
print("Inside foo.py, the name is:", __name__)
Run it from your command-line interpreter:
$ python foo.py
The output should be this:
Inside foo.py, the name is: __main__
Now, launch ipython and import the foo.py file with this statement:
import foo
The code inside of foo.py should immediately execute, but note the change in the __name__ value:
Inside foo.py, the name is: foo
If this isn't clear enough, here's a screencast of me running python foo.py from the command-line, i.e. reading foo.py with the Python interpreter.
Recall the contents of foo.py; it's a simple print statement:
print("Inside foo.py, the name is:", __name__)
When we execute a Python file directly, i.e. python foo.py, the code in that script has a top-level context. In Python's terms, its __name__ is '__main__'.
However, if foo.py is imported by another Python script or process, such as jumping into the ipython shell and running import foo, then the context, or rather the __name__, under which foo.py operates is now 'foo'. Not '__main__'.
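To see both cases side by side without switching between the shell and ipython, here is a rough sketch that writes a throwaway module, runs it directly, and then imports it. The module name name_demo is made up for this demonstration (so that it doesn't clobber your foo.py), and the whole thing lives in a temporary directory:

```python
import contextlib
import importlib
import io
import subprocess
import sys
import tempfile
from pathlib import Path

# One-line module, equivalent in spirit to foo.py.
SOURCE = 'print("the name is:", __name__)\n'

with tempfile.TemporaryDirectory() as tmpdir:
    Path(tmpdir, 'name_demo.py').write_text(SOURCE)

    # 1. Execute the file directly, like `python name_demo.py`.
    completed = subprocess.run(
        [sys.executable, 'name_demo.py'],
        cwd=tmpdir, capture_output=True, text=True, check=True,
    )
    direct_output = completed.stdout.strip()

    # 2. Import the same file as a module, capturing what it prints.
    sys.path.insert(0, tmpdir)
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            importlib.import_module('name_demo')
    finally:
        sys.path.remove(tmpdir)
    imported_output = captured.getvalue().strip()

print(direct_output)    # the name is: __main__
print(imported_output)  # the name is: name_demo
```

The file's contents never change; only the way it is loaded does, and that is what __name__ reports.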
If you want a more official explanation, here it is via the [Python documentation](https://docs.python.org/3/library/main.html):
A module can discover whether or not it is running in the main scope by checking its own __name__, which allows a common idiom for conditionally executing code in a module when it is run as a script or with python -m but not when it is imported:
    if __name__ == "__main__":
        # execute only if run as a script
        main()
If that __name__ and '__main__' business was hard to swallow, let's see how it applies to our fetch.py.
Recall that fetch.py contains this:
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
print("Downloading from", DATA_URL)
resp = requests.get(DATA_URL)
with open(DATA_FILENAME, 'w') as wf:
    msg = "Saving {numchars} to: {filename}"
    print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
    wf.write(resp.text)
We can run fetch.py via the command-line interpreter like so:
$ python fetch.py
Which results in this output:
Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv
The same thing happens if you go into the ipython shell and run this Python command:
import fetch
So what does it mean to refactor it to consider the top-level, i.e. '__main__', scope?
First, it means separating the code that defines the variables from the code that does the work. Typically, we set the latter code off in its own function.
In the example below, the code that downloads/saves the data file is in a function named fetch_data(). Refactor fetch.py to look the same:
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
def fetch_data():
    print("Downloading from", DATA_URL)
    resp = requests.get(DATA_URL)
    with open(DATA_FILENAME, 'w') as wf:
        msg = "Saving {numchars} to: {filename}"
        print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
        wf.write(resp.text)
So, one difference is that if we now attempt to execute the script via the command line:
$ python fetch.py
…nothing will happen. That's because while the fetch_data() function has been defined, it is never executed.
Jump into ipython and import fetch.py:
import fetch
Again, nothing will happen.
But now, at the ipython shell prompt, execute the fetch.fetch_data() function:
fetch.fetch_data()
And you'll get the following output:
Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv
Or, if you prefer to import the function name directly into your namespace:
from fetch import fetch_data
fetch_data()
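Both import styles reach the same function object; they differ only in which name ends up in your namespace. A small sketch of that idea, using the standard library's os.path module instead of fetch.py so that it runs anywhere:

```python
import os.path               # binds the name `os`; call it as os.path.exists(...)
from os.path import exists   # binds the name `exists` directly

# Both names refer to the very same function object.
same_function = os.path.exists is exists
print(same_function)  # True
```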
'__main__' routine
OK, so basically we've deferred the data-fetching routine so that it runs only when the fetch.fetch_data() function is executed.
However, we miss the convenience of getting that functionality when running fetch.py from the command line:
$ python fetch.py
How can we get back that convenience?
That's what the if __name__ == '__main__': branch checks. Remember how when we ran python foo.py from the command-line, it told us its __name__ was '__main__'? Here's how we use that insight in fetch.py. Note that the only change is the addition of two lines at the bottom:
if __name__ == '__main__':
    fetch_data()
The full fetch.py file:
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
def fetch_data():
    print("Downloading from", DATA_URL)
    resp = requests.get(DATA_URL)
    with open(DATA_FILENAME, 'w') as wf:
        msg = "Saving {numchars} to: {filename}"
        print(msg.format(numchars=len(resp.text), filename=DATA_FILENAME))
        wf.write(resp.text)

if __name__ == '__main__':
    fetch_data()
Now running from the command-line will, by default, execute the code in that __main__ branch:
$ python fetch.py
Downloading from http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
Saving 268767 to: legislators.csv
We're almost done; let's add one more feature to our fetching routine: the logic for not downloading the file when we already have it.
The slowest part of fetch.py, by an order of magnitude, is the time it takes to download legislators.csv from the Sunlight Foundation's remote server. Because the composition of Congress changes relatively slowly, why do we need to keep redownloading it over and over?
The Python standard library contains the os.path module, which provides a set of helpful file-related functions. Among them is exists(), which accepts a path as a string and returns True or False based on whether that path exists on disk.
You can try it in ipython, assuming you're running ipython in the same directory in which you've been running fetch.py:
from os.path import exists
exists("booobooo123.blahblah")
# False
exists("fetch.py")
# True
exists("legislators.csv")
# True
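One detail worth knowing: exists() also returns True for directories, not just files. If that distinction ever matters, os.path.isfile() is the stricter check. A small sketch using a temporary directory (the file name sample.csv and its contents are made up):

```python
import tempfile
from os.path import exists, isfile, join

with tempfile.TemporaryDirectory() as tmpdir:
    sample_path = join(tmpdir, 'sample.csv')
    with open(sample_path, 'w') as wf:
        wf.write('a,b\n')

    dir_exists = exists(tmpdir)          # True: the directory is there
    dir_is_file = isfile(tmpdir)         # False: it is not a regular file
    file_exists = exists(sample_path)    # True
    file_is_file = isfile(sample_path)   # True

# Once the temporary directory is cleaned up, the path no longer exists.
gone = not exists(sample_path)
print(dir_exists, dir_is_file, file_exists, file_is_file, gone)
# True False True True True
```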
See if you can figure out how to include it in fetch.py yourself before checking my answer below.
This is what my fetch.py looks like:
from os.path import exists
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
def fetch_data():
    if exists(DATA_FILENAME):
        print("Already downloaded", DATA_FILENAME)
    else:
        print("Downloading from", DATA_URL)
        resp = requests.get(DATA_URL)
        with open(DATA_FILENAME, 'w') as wf:
            msg = "Saving {numchars} to: {fname}"
            print(msg.format(numchars=len(resp.text), fname=DATA_FILENAME))
            wf.write(resp.text)

if __name__ == '__main__':
    fetch_data()
Let's see if it actually works. Delete the legislators.csv file. Then run fetch.py from the command-line twice. The first time, it should print out the messages telling us that it's downloading from a URL. The second time, it should tell us that the file has already been downloaded and, more importantly, the program should exit almost instantaneously.
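If you'd rather verify the skip-if-exists logic without touching the network, here is a sketch that swaps the real download for a stand-in. The names fake_download() and fetch_once(), and the pretend file contents, are hypothetical and exist only for this demonstration:

```python
import tempfile
from os.path import exists, join

download_log = []  # one entry is appended per simulated download


def fake_download():
    """Stand-in for the requests.get() step; records each call."""
    download_log.append('downloaded')
    return 'pretend,csv,data\n'


def fetch_once(filename):
    """Download to filename only if it doesn't exist yet."""
    if exists(filename):
        return 'already downloaded'
    text = fake_download()
    with open(filename, 'w') as wf:
        wf.write(text)
    return 'downloaded'


with tempfile.TemporaryDirectory() as tmpdir:
    target = join(tmpdir, 'legislators.csv')
    first = fetch_once(target)
    second = fetch_once(target)

print(first)              # downloaded
print(second)             # already downloaded
print(len(download_log))  # 1
```

The second call never reaches the download step, which is exactly why the real script exits so quickly on its second run.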
OK OK, one more function, the necessity of which won't make sense until the next lesson.
Currently, our fetch_data() function in fetch.py does not return anything. Instead, it does the work of downloading and saving a file, and notifying us (via print()) that the download has happened. But at the end of the function, nothing is returned (strictly speaking, Python hands back None), which is always the case when there is no return statement.
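A tiny sketch of that behavior, using a made-up function named announce():

```python
def announce():
    """Does some work (printing) but has no return statement."""
    print("Doing some work...")


result = announce()
print(result)          # None
print(result is None)  # True
```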
However, other parts of our code base may want access to the raw text data that is in DATA_FILENAME, i.e. legislators.csv. Those parts of the code base shouldn't care about the fetch_data() function; they don't want to worry about downloading data. They just want to get the raw text.
So let's create a little "helper" (also known as a "convenience") function. We'll call it read_data(), and whenever someone calls it, it does the following:
- Calls fetch_data() to ensure that data has been downloaded and that the file exists
- Opens DATA_FILENAME, reads it, and simply returns its contents as a big string.
Here's the function definition that you can put into fetch.py:
def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        txt = rf.read()
    return txt
Or, more concisely:
def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        return rf.read()
Pop open your ipython shell. Then let's test out the difference between fetch_data() and read_data():
from fetch import fetch_data, read_data
f_result = fetch_data()
type(f_result)
# NoneType
f_result
# None
r_result = read_data()
type(r_result)
# str
len(r_result)
# 268767
r_result[0:50]
# 'title,firstname,middlename,lastname,name_suffix,ni'
len(r_result.splitlines())
The final version of fetch.py before moving on:
from os.path import exists
import requests
DATA_URL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
DATA_FILENAME = 'legislators.csv'
def fetch_data():
    if exists(DATA_FILENAME):
        print("Already downloaded", DATA_FILENAME)
    else:
        print("Downloading from", DATA_URL)
        resp = requests.get(DATA_URL)
        with open(DATA_FILENAME, 'w') as wf:
            msg = "Saving {numchars} to: {fname}"
            print(msg.format(numchars=len(resp.text), fname=DATA_FILENAME))
            wf.write(resp.text)

def read_data():
    fetch_data()
    with open(DATA_FILENAME, 'r') as rf:
        return rf.read()

if __name__ == '__main__':
    fetch_data()