Due: 2016-04-20 (Wednesday)
Send scripts/answers by email: firstname.lastname@example.org
You can also upload them to GitHub if you wish, and just send me the links.
Run through the following lessons – at the end, you should have a
briefs subdirectory with 1,600+ files.
Then, do a search for "ISIS" or "ISIL", as a string.
Use glob() to get a list of filenames:
    from glob import glob
    from os.path import join
    BRIEFS_DIR = 'briefs'
    filenames = glob(join(BRIEFS_DIR, '*.html'))
Given a filename, find whether either ISIS or ISIL exists anywhere in the file:
    with open(fname, 'r') as rf:
        txt = rf.read()
        if 'ISIS' in txt or 'ISIL' in txt:
            print(fname, "mentions ISIS/ISIL")
Put the two snippets together and you have a crude script that will print out the filename of any file that contains either of those strings. Among the files that mention either term, find what seems to be the earliest file that references ISIS/ISIL.
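Putting the two snippets together might look something like the sketch below. The function name files_mentioning and its parameters are made up for illustration and are not part of the assignment. The sorted() call only gives alphabetical order, which is chronological only if the filenames happen to start with a date; that is an assumption about the briefs files, so check it against your own directory.

```python
from glob import glob
from os.path import join

BRIEFS_DIR = 'briefs'


def files_mentioning(terms, dirname=BRIEFS_DIR):
    """Return the .html files in dirname whose text contains any of terms.

    Hypothetical helper: plain substring matching, case-sensitive.
    """
    matches = []
    # sorted() gives alphabetical order; whether that is also date order
    # depends on how the files are named (an assumption, not a given).
    for fname in sorted(glob(join(dirname, '*.html'))):
        with open(fname, 'r') as rf:
            txt = rf.read()
        if any(term in txt for term in terms):
            matches.append(fname)
    return matches


if __name__ == '__main__':
    for fname in files_mentioning(['ISIS', 'ISIL']):
        print(fname, "mentions ISIS/ISIL")
```

Returning a list instead of printing right away makes it easier to inspect the first few matches when hunting for the earliest mention.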
However…are we sure that they always spell "ISIS" or "ISIL" in all-caps? We don't know for sure, so let's upper-case the entire text string of the file before doing the comparison:
    with open(fname, 'r') as rf:
        txt = rf.read()
        txt = txt.upper()
        if 'ISIS' in txt or 'ISIL' in txt:
            print(fname, "mentions ISIS/ISIL")
However, you'll find that you'll get a lot of false positives, e.g.
    'ISIS' in 'WE ARE IN CRISIS MODE'
    # True
This is where you'll need to use regular expressions. Here's a nice lesson on them before I write up my own summary:
The "pattern" that we want to match is 'ISIS' or 'ISIL' in which those strings are standalone, i.e. not part of a longer word.
Regex is its own syntax, but here's one way to test it:
    import re
    re.search(r'\bISIS\b', 'WE ARE IN CRISIS MODE')
    # None
    re.search(r'\bISIS\b', 'WE WERE ATTACKED BY ISIS.')
    # <_sre.SRE_Match object; span=(20, 24), match='ISIS'>
You can throw the regex search in the conditional:
    import re
    with open(fname, 'r') as rf:
        txt = rf.read()
        txt = txt.upper()
        if re.search(r'\bISIS\b', txt) or re.search(r'\bISIL\b', txt):
            print(fname, "mentions ISIS/ISIL")
This also works, and is a more adroit use of regex:
    import re
    with open(fname, 'r') as rf:
        txt = rf.read()
        txt = txt.upper()
        if re.search(r'\bISI[LS]\b', txt):
            print(fname, "mentions ISIS/ISIL")
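As a possible variation on the snippet above, the pattern can be compiled once and made case-insensitive with the re.IGNORECASE flag, which stands in for the txt.upper() step. The names ISIS_PATTERN, mentions_isis and count_mentions are invented for this sketch. Note that ignoring case, like upper-casing, also matches "Isis" used as an ordinary name, so some false positives may remain.

```python
import re

# \b marks a word boundary, so CRISIS does not match.
# [LS] matches either final letter: ISIL or ISIS.
# re.IGNORECASE replaces the explicit txt.upper() step.
ISIS_PATTERN = re.compile(r'\bISI[LS]\b', re.IGNORECASE)


def mentions_isis(txt):
    """Return True if txt contains ISIS or ISIL as a standalone word."""
    return ISIS_PATTERN.search(txt) is not None


def count_mentions(txt):
    """Return how many standalone ISIS/ISIL occurrences txt contains."""
    return len(ISIS_PATTERN.findall(txt))
```

Counting matches per file, rather than only testing for one, is one way to tell a passing mention from a briefing that is mostly about the topic.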
The execution list is here:
This is a simpler version of the White House briefings, but you can approach it in the same fashion:
Try to use HTML parsing on each file to extract text from the relevant tags…e.g. paragraph tags.
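One way to pull out paragraph text, sketched here with only the standard library's html.parser module. The class name ParagraphText and the helper paragraph_text are made up for this example, and a third-party HTML parsing library would work just as well. It assumes the relevant text lives in p tags, which you should confirm by looking at a few of the files.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = None  # list of text chunks while inside a <p>, else None

    def _flush(self):
        # Store the current paragraph, collapsing runs of whitespace.
        if self._buf is not None:
            text = ' '.join(''.join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'p':
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def paragraph_text(html):
    """Return a list of paragraph strings extracted from an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a final paragraph that was never closed
    return parser.paragraphs
```

Searching only the joined paragraph text, for example '\n'.join(paragraph_text(txt)), keeps words in the page title, navigation, or scripts from triggering a match.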