Mueller Report and "Data Science" (part 1 in a series)

By Dustin Miller


A note before I begin: I'm writing this article to help everyone—including non-technical readers—understand the power and value of using "data science" to learn something from a jumble of information. Developers: I'm hopeful there's enough here that is of interest to you. If you're familiar with NLP and Python, you can probably skip the narrative and focus on the code. 🤓

Updates

2:16am CDT, April 23, 2019
I've fixed a timezone conversion issue with the timeline.
2:35pm CDT, April 23, 2019
Updated some formatting

The term "data science" can mean different things to different people. To folks not normally involved in statistical analysis, it can even evoke thoughts of "black magic". Me personally? I like to use a different definition.

"Data Science" is the study of taking information of any kind—text, images, sound—and figuring out how to learn more from its individual parts.

Sure, it's a simplification. But it works for me.

Today, I'm going to use one particular aspect of data science: "Natural Language Processing", or NLP.

Using NLP, we can learn more about a document's text (the information) by looking more closely at its words and phrases (the individual parts). More specifically: their parts of speech, the grammar rules that connect multiple words in a phrase, the frequency with which words appear in the text, and even whether those words and phrases have a special meaning that we can learn even more from.

What kind of meaning? Well, some words and phrases represent something more conceptual. Like dates, for example. Or people. Or locations. I'll show you what I mean. What can we learn about the phrases in this text:

On March 3, 2009, Dustin Miller was in Seattle for an event at Microsoft headquarters.

First, I'm going to set up some code that I'll use to extract the concepts that are hidden in that text using NLP:

Initial setup

I'm writing the code for this article using a programming language called Python. This is the initial setup I'll use to extract information from the text shown above using a Python library called "spaCy".

In [459]:
%matplotlib inline
import spacy
from spacy import displacy
In [380]:
# Load a "model" of statistical information about the English language
nlp = spacy.load("en_core_web_md")
In [931]:
# Tell spaCy to do its magic, and extract all the knowledge it can from the
# individual parts (words, phrases) of the information (the sentence itself).
demo = nlp("On March 3, 2009, Dustin Miller was in Seattle for an event at Microsoft headquarters.")

The first peek into the power of NLP

What did spaCy learn from its statistical analysis of that sentence? First, the conceptual phrases:

In [935]:
displacy.render(demo, style='ent')
On March 3, 2009 DATE , Dustin Miller PERSON was in Seattle GPE for an event at Microsoft ORG headquarters.

The term of art used in NLP circles to describe this extraction of conceptual phrases is "Named Entity Recognition" (NER). spaCy's statistical knowledge of the English language—learned from vast quantities of text from blogs, news sites, talk show transcripts, and more—allowed it to recognize that some phrases in that sentence have a greater meaning.

In fact, it can recognize all sorts of things:

Entity Type Description
PERSON People, including fictional.
NORP Nationalities or religious or political groups.
FAC Buildings, airports, highways, bridges, etc.
ORG Companies, agencies, institutions, etc.
GPE Countries, cities, states.
LOC Non-GPE locations, mountain ranges, bodies of water.
PRODUCT Objects, vehicles, foods, etc. (Not services.)
EVENT Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART Titles of books, songs, etc.
LAW Named documents made into laws.
LANGUAGE Any named language.
DATE Absolute or relative dates or periods.
TIME Times smaller than a day.
PERCENT Percentage, including "%".
MONEY Monetary values, including unit.
QUANTITY Measurements, as of weight or distance.
ORDINAL “first”, “second”, etc.
CARDINAL Numerals that do not fall under another type.
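
Each of these entities is also available programmatically, not just through the displacy visualization. As a minimal sketch (my own addition, not part of the original analysis), this loops over the entities spaCy found in the demo sentence and prints each one with its label and spaCy's built-in explanation of that label:

In [ ]:
# Sketch: list each recognized entity, its label, and spaCy's description
# of that label (spacy.explain turns "GPE" into "Countries, cities, states").
for ent in demo.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))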

It's important to keep one phrase in mind: "statistical analysis". A phrase that is statistically likely to refer to, say, a city will usually be tagged as one—but "likely" isn't "always". That means there can be mistakes. For example:

In [154]:
displacy.render(nlp("A man named Washington was also in town on the 4th."), style='ent')
A man named Washington GPE was also in town on the 4th ORDINAL .

In cases like this, statistical analysis can do better by looking past individual words to the context around them.

In [155]:
displacy.render(nlp("A man named George Washington also in town on the 4th of July."), style='ent')
A man named George Washington PERSON also in town on the 4th of July DATE .

But wait, there's more!

Named Entity Recognition can certainly let us learn more from a body of text, but there's more that can be done with the statistical analysis of Natural Language Processing. For example, remember having to create sentence diagrams back in school? What if they could look like this:

In [164]:
displacy.render(nlp("I wrote this article."))
[displacy renders an arc diagram here: I (PRON), wrote (VERB), this (DET), article (NOUN), with arcs labeled nsubj, det, and dobj]

spaCy's model of the English language—built on several statistical analyses developed by data scientists and linguists—is able to make predictions about grammar. Above, the diagram shows that I is a PRONoun; wrote is a VERB, this is a DETerminer; and article is a NOUN. It also demonstrates the relationship between those words. The word I is the noun subject (nsubj) of the verb wrote, and article is its direct object (dobj)—which is modified by the determiner (det) this.
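
If you'd rather see that same information as plain text instead of a diagram, a quick sketch (again, my own illustration) prints each token's part of speech, its dependency relation, and the word it attaches to:

In [ ]:
# Sketch: part-of-speech tag, dependency label, and syntactic head for
# each token in the example sentence.
for token in nlp("I wrote this article."):
    print(token.text, token.pos_, token.dep_, token.head.text)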

Okay, so what can it tell us about the Mueller Report?

The report is filled with names, locations, dates and events. By using NLP to analyze the text, we can extract targeted bits of information. The first question I set out to answer:

Can one create a timeline of events or references made by the Mueller Report using NLP to identify sentences with dates?

But there was a problem.

Dirty Data

The Justice Department released the Mueller Report to the public on April 18, 2019. The format of this release? A document that had been written in electronic format, securely redacted in electronic format, printed out on paper, and scanned as a collection of images in a PDF.

Normally, PDF files store the actual text of a document. That means you can highlight sections of it for copying, search the text for words of interest, or, if you have a vision impairment, use assistive technology—like a screen reader or Braille printer—to "read" the report to you.

However, the Mueller Report PDF did not have the actual text of the document—just pictures of each page. We need to find a way to transform those images into text. A later article will go over the approach I took when I originally embarked on an effort to translate the PDF images into text. I learned a lot from that endeavor, and still have more to learn. More relevant to this article, though: it was taking too much time, and I wanted to get into analysis.

Floating around the internet was a PDF that had been run through Adobe Acrobat Pro's tool to perform "Optical Character Recognition" (OCR) on the images—thus translating them back into text.

It's not perfect. It's not even close to perfect. Perfect, as a destination, can be seen if you squint your eyes and focus real hard. Many bits of text were badly translated by the OCR process. At some point, someone will release a fully corrected copy of the report's text. If that doesn't happen soon, I will set up a site to crowdsource that effort. Until then, I'll be working with that OCR'd PDF.

Setting up the code

First, I need to load up some tools that I'll use to extract the text that Adobe Acrobat Pro was able to recognize in the original report.

Import libraries

  • fitz: used to read the PDF and extract HTML (a web page) for each page. This gets the text out, but also preserves the formatting/positioning of that text. You'll see why that's useful soon.
  • html: used to help convert some of the special character codes normally used in HTML (like &amp; and &#39;) back into normal text (& and ')
  • re: stands for "regular expression", and lets me search the HTML for things like "any two digits in a row"
  • bs4 (BeautifulSoup): it's, like, the best name ever for this library. It takes the soup of code in (among other things) an HTML document and turns it into a format that can be navigated and modified in Python
  • tqdm: used to show a progress bar during the extraction process
In [ ]:
import fitz # for reading the PDF as an HTML document
import html # for converting some special characters
import re

from bs4 import BeautifulSoup
from tqdm import tqdm

Preprocess each page in the PDF

As I mentioned, fitz will read the OCR'd PDF and convert each page into a web page. The benefit is that the positioning and formatting of the text is preserved. I'll use that to my advantage.

Using BeautifulSoup, I'll identify the text that is positioned in the top 49 "points" of the page (about the top two-thirds of an inch) and remove it. That gets rid of the header at the top of each page. We don't need to analyze that text; it's the same on every page, and is not relevant content.

Data science is messy: During later analysis, I identified some words that were mistakenly recognized as a PERSON. This pre-processing step includes some hand-crafted cleanup to work around those issues.

Note to any developers reading: this code is written verbosely for the purposes of this article.

In [360]:
"""
Pre-processes each PDF page and returns only the needed text
"""


def preprocess_page(pdf, page_num):
    extracted_html = pdf[page_num].getText("html")
    parsed_html = BeautifulSoup(extracted_html, 'html5lib')

    # remove the image overlays from fitz
    for img in parsed_html.find_all("img"):
        img.extract()

    # remove the text from the top 49pts (headers) of the original scan
    for _h in parsed_html.find_all(style=re.compile(r"top:[0-4][0-9]pt")):
        _h.extract()

    # extract all the remaining text without any HTML markup
    text = parsed_html.get_text()

    # convert ' and other entities
    text = html.unescape(text)

    # aggressively remove whitespace
    text = " ".join(text.split())

    # look for possessives with an extra space and remove that space
    text = re.sub(r"\s+'s\b", "'s", text)

    # It's "comey", not "Corney"
    text = text.replace("Corney","Comey")

    # Lowercase email
    text = text.replace("Email", "email")

    return text
In [361]:
# The file name; the file itself is in the same directory as this code
file = "mueller_report_ocr.pdf"

# Okay, fitz: open that file!
pdf = fitz.open(file)
In [362]:
# Create an empty list to store the pre-processed text from each page
pages = []

# Starting on page 8 (after the table of contents), pre-process each
# page, and store the resulting text in the "pages" list. Also, use
# "tqdm" for a snazzy progress bar.
for i in tqdm(range(8,pdf.pageCount)):
    page = preprocess_page(pdf, i)
    pages.append(page)

# Merge all the pages together
allpages = " ".join(pages)
100%|██████████| 440/440 [00:20<00:00, 33.31it/s]

Check the work

I'll check to make sure the pages list has the 440 pages I extracted, and sample some of the text to see how it looks.

In [364]:
len(pages) # how many pages are in the list?
Out[364]:
440
In [518]:
pages[7][:200] # first 200 characters of the eighth pre-processed page (index 7)
Out[518]:
"National Security Agency-that concluded with high confidence that Russia had intervened in the election through a variety of means to assist Trump's candidacy and harm Clinton's. A declassified versio"

Analyze the text

So far, so good.

By default, spaCy won't process text with more than one million characters; the limit is there mostly to guard against excessive memory use. The Mueller Report is a little over that, so I'll raise nlp.max_length to match. My computer has enough memory to handle the full text in one pass, so I didn't feel it was necessary to do the parsing in batches. Your mileage may vary, but this isn't really "big data" type stuff we're dealing with, so...

In [366]:
len(allpages) # How many characters are in the full text?
Out[366]:
1189235
In [519]:
nlp.max_length = len(allpages)
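
If your machine can't comfortably handle the whole report in one pass, the parsing can also be done in batches. Here's a rough sketch of that alternative using spaCy's nlp.pipe (my own illustration; the rest of this article parses everything at once):

In [ ]:
# Alternative sketch (not used below): parse the report page-by-page with
# nlp.pipe instead of one giant string. Memory use stays lower, at the cost
# of working with a list of Doc objects rather than a single Doc.
docs = list(nlp.pipe(pages, batch_size=20))
total_words = sum(len(d) for d in docs)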

Time for spaCy to do its magic on the Mueller Report. This may take a while if you're running this notebook on your own computer.

In [382]:
doc = nlp(allpages)

Check the work

In [383]:
print("spaCy parsed {} words in {} sentences.".format(len(doc),len(list(doc.sents))))
spaCy parsed 230410 words in 13422 sentences.

Time to do some data science

Now that spaCy has done the heavy lifting, I'll explore some interesting things that can be learned. First, let's get a list of every word or phrase that spaCy identified as a PERSON.

In Natural Language Processing tasks in particular, exploring the data is a good starting point. Even if you already know what you want to learn, the act of exploration can often reveal something interesting.

So let's explore named entities.

doc.ents is a collection of all the "entities" found during the Named Entity Recognition phase of the analysis. Each entity (ent below) has a label_ indicating what type of entity it is, as well as the text of the entity itself. I'll create a new list of these PERSON entities, and call that list people.

In [384]:
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
In [385]:
print("spaCy found {} possible references to people in the report.".format(len(people)))
spaCy found 8689 possible references to people in the report.

People

I'll import the Counter module from collections, which lets me count how many times a given item appears in a list of items. I'll use this to show the number of times a given PERSON appears in the text.

In [386]:
from collections import Counter, OrderedDict

people_freq = Counter(people)
In [387]:
people_freq.most_common(10) # top ten people. Oh, Cohen.
Out[387]:
[('Cohen', 615),
 ('Flynn', 504),
 ('Comey', 502),
 ('Trump', 397),
 ('McGahn', 367),
 ('Papadopoulos', 213),
 ('Kushner', 172),
 ('Kislyak', 170),
 ('Sessions', 166),
 ('Clinton', 137)]

You may have noticed that these are all last names. The report often refers to people by their last name when their full name was used recently in the text. There are PERSON entities with full names; they just don't appear as often, given the writing style used by the preparers of the report. Let's get a look at the PERSON entities that have more than one word:

In [431]:
fullnames = [ent.text for ent in doc.ents if ent.label_ == "PERSON" and len(ent.text.split()) > 1]
fullnames_freq = Counter(fullnames)
In [432]:
fullnames_freq.most_common(10)
Out[432]:
[('Trump Jr.', 87),
 ('Paul Manafort', 71),
 ('Michael Cohen', 65),
 ('Hillary Clinton', 57),
 ('Donald Trump', 52),
 ('James B. Comey', 42),
 ('McGahn 3/8/18', 41),
 ('Jared Kushner', 36),
 ('Michael Flynn', 27),
 ('George Papadopoulos', 25)]

There's a weird-looking "person" in that list, isn't there? This is a natural side effect of the statistical analysis of text. Sometimes the predictions made by those statistics aren't exactly what you're expecting.

Can we do better?

What if there were a way to replace those last-name-only mentions with their full names?

spaCy supports custom pipeline components and extension attributes, which would let me hook in after NER runs and extend the results with my own custom data. However, while writing this article, I ran into a problem with that approach, so I opted for another one.

My approach for this part of the series: take the existing list of matched people and, where a replacement is available (i.e., a full name), use the replacement instead. Also, several words identified as a PERSON aren't actually people, so I'll create a list of those and exclude them from the results.

NOTE: You'll notice "Smith" in the replacements. The last name "Smith" is also referenced in a footnote entry, but doesn't refer to "Peter Smith". For the purposes of this overall analysis, I've opted to ignore this small discrepancy.

NOTE: In part 2 of this series, I'll show how I "walked back" in the document to find the full name that is the most likely replacement for those occasions where only a last name is tagged as a PERSON. In NLP circles, this kind of task is related to what's known as "coreference resolution" or "entity disambiguation".
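
To give a flavor of what that could look like, here's a rough sketch of one possible approach (an illustration of my own, not necessarily what part 2 will use): when an entity is a lone last name, walk back through the multi-word PERSON names already seen and pick the most recent one that ends with it.

In [ ]:
# Illustrative sketch only: resolve a lone last name to the most recent
# multi-word PERSON entity that ends with that last name. This article
# instead uses the hand-built "replacements" dictionary shown below.
def resolve_people(doc):
    resolved = []
    full_names_seen = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        name = ent.text
        if len(name.split()) > 1:
            full_names_seen.append(name)
            resolved.append(name)
        else:
            # Walk back through the full names already seen for one that
            # ends with this last name; fall back to the name as-is.
            match = next((full for full in reversed(full_names_seen)
                          if full.split()[-1] == name), name)
            resolved.append(match)
    return resolved

For this article, though, I'll stick with the simpler hand-built mapping.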

Data science is messy. 90% of the effort is thankless, boring, tedious stuff. "Data cleansing" is the euphemistic name given to this drudgery.

Here we go…

Replacements and exclusions

In [439]:
replacements = {
    "Cohen": "Michael Cohen",
    "Flynn": "Michael Flynn",
    "McGahn": "Don McGahn",
    "Sessions": "Jeff Sessions",
    "Kushner": "Jared Kushner",
    "Kislyak": "Sergey Kislyak",
    "Clinton": "Hillary Clinton",
    "Manafort": "Paul Manafort",
    "Dmitriev": "Kirill Dmitriev",
    "Putin": "Vladimir Putin",
    "Trump Jr.": "Donald Trump, Jr.",
    "Nader": "George Nader",
    "Lewandowski": "Corey Lewandowski",
    "Comey": "James B. Comey",
    "James Comey": "James B. Comey",
    "Hicks": "Hope Hicks",
    "Sater": "Felix Sater",
    "Rosenstein": "Rod Rosenstein",
    "Kilimnik": "Konstantin Kilimnik",
    "Papadopoulos": "George Papadopoulos",
    "Trump": "Donald Trump",
    "Bannon": "Steve Bannon",
    "Prince": "Erik Prince",
    "Gates": "Rick Gates",
    "Mueller": "Robert Mueller",
    "Page": "Carter Page",
    "Gordon": "J.D. Gordon",
    "Simes": "Dimitri Simes",
    "Porter": "Rob Porter",
    "McGahn 3/8/18": "Don McGahn",
    "Coats": "Dan Coats",
    "Gerson": "Rick Gerson",
    "S. Miller": "Stephen Miller",
    "McCabe": "Andrew McCabe",
    "Klokov": "Dmitri Klokov",
    "Aven": "Pitr Aven",
    "Christie": "Chris Christie",
    "Giuliani": "Rudy Giuliani",
    "Hunt": "Jody Hunt",
    "Kelly": "John Kelly",
    "Smith": "Peter Smith"

}
exclusions = ['Cong','Doc','Jr.']

# Swap in full names where a replacement exists; skip known false positives
people = [replacements.get(ent.text, ent.text)
          for ent in doc.ents
          if ent.label_ == "PERSON" and ent.text not in exclusions]
In [444]:
people_freq = Counter(people)
people_freq.most_common(10) # top ten people. Oh, Cohen.
Out[444]:
[('Michael Cohen', 680),
 ('James B. Comey', 554),
 ('Michael Flynn', 531),
 ('Donald Trump', 449),
 ('Don McGahn', 411),
 ('George Papadopoulos', 238),
 ('Jared Kushner', 208),
 ('Hillary Clinton', 194),
 ('Sergey Kislyak', 186),
 ('Jeff Sessions', 185)]

Visualizing the people in the report

There's a tried-and-true way to get a visual overview of how often a given word appears in a body of text: a "word cloud". It may be a cliché, but it does the job, at least for readers without a vision impairment. I'm looking into contributing to the word cloud library so it can generate a more accessible image. Until then, and with apologies, the word cloud below isn't an accessible image.

I'll bring in some libraries to show a word cloud of the most common names appearing in the report:

In [453]:
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

Show me the word cloud!

Here it is…a word cloud of the 40 most frequently mentioned people in the Mueller Report:

In [486]:
people_dict = OrderedDict(people_freq.most_common(40))

wc = WordCloud(background_color="white", scale=3)
wc.generate_from_frequencies(people_dict)
plt.figure(figsize=(16,9))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Data science can seem like a game

Okay, these next few things are just exploratory. They aren't necessarily going to reveal anything of importance, but then again, they might. That's a core part of data science: exploration. When you aren't sure what questions you want to ask of your data, sometimes exploring different views of the information can inspire you.

Verbs first

When spaCy analyzed the text of the report, it also analyzed its grammar and sentence structure. For example, the verbs used as the root verb of each sentence, converted to the base form of that verb (the "lemma"):

In [516]:
verbs = [token.lemma_ for token in doc
         if token.pos_ == "VERB" and token.dep_ == "ROOT" and not token.is_stop]
verbs_freq = Counter(verbs)
verbs_dict = OrderedDict(verbs_freq.most_common())

wc = WordCloud(background_color="white", scale=3, stopwords=STOPWORDS)
wc.generate_from_frequencies(verbs_dict)
plt.figure(figsize=(16,9))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Noun chunks

We can also get a quick overview of "noun chunks", or nouns plus the adjectives and articles used to describe them. I've reduced them all to lower case to avoid duplicate phrases that differ only in capitalization.

In [517]:
noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks if len(chunk.text.split()) > 1]
noun_chunks_freq = Counter(noun_chunks)
noun_chunks_dict = OrderedDict(noun_chunks_freq.most_common())

wc = WordCloud(background_color="white", scale=3)
wc.generate_from_frequencies(noun_chunks_dict)
plt.figure(figsize=(16,9))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()