Due: 2016-04-20 (Wednesday)
Send scripts/answers by email: dun@stanford.edu
You can also upload them to GitHub if you wish, and just send me the links.
Run through the following lessons – at the end, you should have a briefs
subdirectory with 1,600+ files.
Lesson: Scraping the White House Press Briefings
Then, do a search for "ISIS" or "ISIL", as a string.
Use glob() to get a list of filenames:
from glob import glob
from os.path import join
BRIEFS_DIR = 'briefs'
filenames = glob(join(BRIEFS_DIR, '*.html'))
Given a filename, check whether either ISIS or ISIL appears anywhere in the file:
with open(fname, 'r') as rf:
    txt = rf.read()
if 'ISIS' in txt or 'ISIL' in txt:
    print(fname, "mentions ISIS/ISIL")
Put the two snippets together and you have a crude script that will print out the filename of any file that contains either of those strings. Among the files that mention them, find what seems to be the earliest file/reference to ISIS/ISIL.
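Here is one way the two snippets might be combined. This is a rough sketch, not the only way to do it: the function name find_mentions is made up for illustration, and the sorted() call only puts the earliest briefing first if your filenames happen to start with a sortable date, which depends on how you named the files in the scraping lesson.

```python
from glob import glob
from os.path import join

BRIEFS_DIR = 'briefs'


def find_mentions(briefs_dir=BRIEFS_DIR):
    """Return the sorted filenames of briefings containing 'ISIS' or 'ISIL'."""
    matches = []
    # Loop over every saved briefing page in the directory
    for fname in glob(join(briefs_dir, '*.html')):
        with open(fname, 'r') as rf:
            txt = rf.read()
        # Plain substring test, exactly as in the snippet above
        if 'ISIS' in txt or 'ISIL' in txt:
            matches.append(fname)
    # Alphabetical order; chronological only if names begin with a date
    return sorted(matches)


if __name__ == '__main__':
    for fname in find_mentions():
        print(fname, "mentions ISIS/ISIL")
```

Returning a list instead of printing inside the loop makes it easier to sort the results or count them afterwards.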
However, are we sure they always spell "ISIS" or "ISIL" in all caps? We don't know for sure, so let's upper-case the entire text of the file before doing the comparison:
with open(fname, 'r') as rf:
    txt = rf.read()
# Normalize case so that 'Isis' or 'isil' would also be caught
txt = txt.upper()
if 'ISIS' in txt or 'ISIL' in txt:
    print(fname, "mentions ISIS/ISIL")
However, you'll find that you'll get a lot of false positives, e.g.
'ISIS' in 'WE ARE IN CRISIS MODE'
# True
This is where you'll need to use regular expressions. Here's a nice lesson on them before I write up my own summary:
Automate the Boring Stuff with Python: Pattern Matching with Regular Expressions
The "pattern" that we want to match is 'ISIS' or 'ISIL' as a standalone word, i.e. not part of a longer word such as 'CRISIS'.
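Regex is its own syntax, but here's one way to test it. The \b marks a word boundary, so the pattern only matches when ISIS is not attached to other letters: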
import re
re.search(r'\bISIS\b', 'WE ARE IN CRISIS MODE')
# None
re.search(r'\bISIS\b', 'WE WERE ATTACKED BY ISIS.')
# <_sre.SRE_Match object; span=(20, 24), match='ISIS'>
You can throw the regex search in the conditional:
import re
with open(fname, 'r') as rf:
    txt = rf.read()
txt = txt.upper()
if re.search(r'\bISIS\b', txt) or re.search(r'\bISIL\b', txt):
    print(fname, "mentions ISIS/ISIL")
This also works, and is a more adroit use of regex, since the character class [LS] matches either L or S in the last position:
import re
with open(fname, 'r') as rf:
    txt = rf.read()
txt = txt.upper()
if re.search(r'\bISI[LS]\b', txt):
    print(fname, "mentions ISIS/ISIL")
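As an alternative to upper-casing the whole file first, you could let the regex ignore case for you. The sketch below assumes that approach; ISIS_PATTERN and mentions_isis are names I made up for illustration, while re.compile() and the re.IGNORECASE flag are part of the standard re module.

```python
import re

# Hypothetical names; the pattern is the same \bISI[LS]\b used above,
# compiled once with IGNORECASE so no .upper() call is needed.
ISIS_PATTERN = re.compile(r'\bISI[LS]\b', re.IGNORECASE)


def mentions_isis(txt):
    """Return True if txt contains ISIS or ISIL as a standalone word."""
    return ISIS_PATTERN.search(txt) is not None


print(mentions_isis('We are in crisis mode'))       # False
print(mentions_isis('We were attacked by Isil.'))   # True
```

Keep in mind that a case-insensitive match will also catch "Isis" used as a name, so skim the matches rather than trusting the count blindly.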
The execution list is here:
http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
This is a simpler version of the White House briefings, but you can approach it in the same fashion:
Try using HTML parsing on each file to extract text from the relevant tags, e.g. paragraph tags.
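If you want a starting point for the paragraph-tag idea, here is a minimal sketch that uses only the standard library's html.parser module. The class and function names (ParagraphText, extract_paragraphs) are made up for illustration, and it assumes the text you care about really does sit inside p tags, which you should confirm by looking at the page source. A third-party parser would likely be shorter, but this one runs without installing anything.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are inside a <p> tag
        self.paragraphs = []    # one string per paragraph seen so far

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here
        if self.in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return a list of non-empty paragraph strings from an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]


sample = '<h1>Title</h1><p>Hello <b>world</b>.</p><p>Second paragraph</p>'
print(extract_paragraphs(sample))
# ['Hello world.', 'Second paragraph']
```

Once you have the paragraph text as plain strings, the same string or regex searches from the briefings exercise apply.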
http://www.supremecourt.gov/oral_arguments/argument_transcript/2015