Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.1/2, June 2020
QUANTUM CRITICISM: AN ANALYSIS OF POLITICAL
NEWS REPORTING
Ashwini Badgujar1, Sheng Chen1, Pezanne Khambatta1, Tuethu Tran1,
Andrew Wang1, Kai Yu1, Paul Intrevado2 and David Guy Brizan1
1 Department of Computer Science, University of San Francisco, San Francisco, CA, USA
2 Department of Mathematics and Data Science, University of San Francisco, San Francisco, CA, USA
ABSTRACT
In this project, we continuously collect data from the RSS feeds of traditional news sources. We apply
several pre-trained implementations of named entity recognition (NER) tools, quantifying the success of
each implementation. We also perform sentiment analysis of each news article at the document, paragraph
and sentence level, with the goal of creating a corpus of tagged news articles that is made available to the
public through a web interface. We show how the data in this corpus could be used to identify bias in news
reporting, and also establish different quantifiable publishing patterns of left-leaning and right-leaning
news organisations.
KEYWORDS
Content Analysis, Named Entity Recognition, Sentiment Analysis, Politics, News
1. INTRODUCTION
Many of us implicitly believe that the news we consume is an important summary of the events
germane to our lives. Regardless of how we divide ourselves—by demographics, political
leaning, profession or other socioeconomic schism—we rely on trusted individual journalists and
the news organizations to which they belong to distill stories and provide unbiased context.
There are several organizations that attempt to address this need. USAFacts.org is a non-profit
organization and website which offers a non-partisan portrait of the US population, its
government’s finances, and government’s impact on society. Similar sites and outlets have had
the same mission, perhaps most prominently MIT’s Data USA and the US government’s
data.gov. These efforts, however, largely deal with quarterly or bi-annual government reports,
excluding day-to-day news analysis about business, politics, etc.
More timely news on these excluded topics can typically be found reported by private news organizations, often funded by a subscription or ad-based model. There is, however, a subset of articles that are freely available to the public. News producers often promote selected articles through their Really Simple Syndication (RSS) feeds, consumed by phone or web applications such as Feedly, NewsBlur and FlowReader, among others.
News organizations should be a reflection of the populations they represent. Yet, despite ease of
access to news articles through RSS feeds, we find a dearth of resources supporting the analysis
of said news articles, e.g., how the news is reported or how it may be affecting our lives over
DOI:10.5121/mlaij.2020.7201
time. For example, observing climate change denial, one journalist from Vox, David Roberts, has
named the current American philosophical divide “tribal epistemology,” specifically discussing
the tribalism of information through the news [1]. While his presentation is compelling, the idea
of tribal epistemology is largely delivered without an analysis of the news from sources which he
critiques. Roberts’s lack of analysis could be the result of having no facile manner to find and
analyse daily news articles from multiple sources in a single corpus.
In our survey of existing news corpora (Section 2), we find existing corpora lacking in one or
more aspects, including cost, availability, coverage and/or analysis. We therefore create our own
corpus, Quantum Criticism, to address these issues. Specifics of the tools and approaches we use
to build our corpus are discussed in Section 3. We discuss the performance of our tools in Section
4. We aspire for our corpus to be used by journalists and academic researchers to establish trends, identify differences, and effect change in news reporting and its interpretation. In
Section 5, we demonstrate two ways in which our corpus can be used to uncover potential media
bias.
2. RELATED WORK
We begin the Related Work section by highlighting existing corpora that have some coverage or
analysis limitation, discussed in Section 2.1. Section 2.2 briefly reviews common tasks in natural
language processing, as well as some of the available tools for accomplishing those tasks. Lastly,
in Section 2.3, we explore several use-cases of existing news corpora.
2.1. Corpora
There are several uses for a news-based corpus. One is language modelling: journalists and news organizations can be barometers of when a word gets introduced
to a language. Another important use of news-based corpora is the derivation of larger social
patterns from individual units of reporting.
The consumers of a news corpus must regard journalists and news organisations as imperfect
messengers. As far back as 1950, White [2] demonstrated that the news we read is frequently
collated by a set of “gate keepers” who filter candidate events. These gate keepers may have
biases based on ideological (liberal or conservative) leanings, race or gender [3], economic
interdependence [4] and geopolitical affiliation [5], likely only some of the many factors
influencing a news story’s selection. One use of a properly constructed corpus could be the
unearthing of selection bias or other biases.
Selection bias may be the result of the choices of not only the specific journalists but also the
news organizations and their owners [6]. In a large-scale study based on articles from the GDELT database, [7] lays out the constraints under which news organizations operate and quantifies their selection bias.
Prior to building our Quantum Criticism corpus, we considered a number of other corpora
assembled from news articles, all appearing online. The Linguistic Data Consortium (LDC) has
an extensive collection, including the New York Times corpus [8], which we use for validation of
our tools. (Details in Section 4.) This corpus contains 1.8 million news articles from the New
York Times over a period of more than 10 years, covering stories as diverse as political news and
restaurant reviews. Articles are provided in an XML format, with the majority of the articles
tagged for named entities—persons, places, organizations, titles and topics—so that these named
entities are consistent across articles.
The LDC also offers the North American News Corpus [9], assembled from varied sources,
including the New York Times, the Los Angeles Times, the Wall Street Journal and others. The
primary goals of this corpus are support for information retrieval and language modelling, so the
count of “words”—almost 350 million tokens—is more important than the number of articles.
Also offered by the LDC is the Treebank corpus [10], often called the Penn Treebank, which has
been an important and enduring language modelling resource; see [11] for an early use of this
corpus, and [12] for a more recent implementation.
Collectively, the LDC corpora and their like are excellent resources for news generated from a
discrete number of sources during a particular period of time. Because of their volume of articles
and tokens, and because they are mostly written in Standard American English, they are ideal for
building language models from the period during which they were collected. However, we find
the aforementioned corpora broadly lacking in a number of areas, chiefly, in their static nature:
these corpora do not continuously collect new articles. Depending on the research being
conducted, researchers may require current articles as well as historic ones. We also find flaws in
the tagging of the articles in the New York Times Annotated Corpus, but leave the full treatment
of this to Section 4. Finally, we find that processing these articles requires a non-trivial cost and
effort. Finding articles in which a particular person, place or organisation is mentioned requires a
search through a considerable number of articles, for which there are no additional tags.
In contrast to the offerings by the LDC, the Global Database of Events, Language and Tone [13],
known as GDELT, has a dizzying array of tools for searching and analysing its corpus. With public, no-cost access to articles from 1979 to the present, albeit offered at a 48-hour delay, and a commitment to the continued collection of news from a wide variety of sources, GDELT's offerings have enabled insightful analyses, some of which are explored herein.
One criticism of GDELT by Ward et al. [14] is that the collection effort has been optimized for
volume of news articles and speed of analysis through automated techniques, sacrificing the
careful curation of articles. This results in the improper classification of articles, erring mostly
toward false positives, i.e., presenting more news articles as related to an event than is warranted.
In terms of implementation, our Quantum Criticism corpus is closest to the News on the Web
(NOW) Corpus, itself a public-facing version of the Corpus of Contemporary American English
[15]. As of the time of this writing, this corpus reports containing 8.7 billion words from a
number of American English sources, including such varied sources as the Wall Street Journal
and tigerdroppings.com, the student newspaper of Louisiana State University. While the diversity
of our Quantum Criticism corpus is not as extensive as what we find in the NOW Corpus, our
initial version of the Quantum Criticism corpus contains one non-American English source and
allows the user to specify the source(s) for a query. We believe the power of our search and
presentation makes our corpus a better analysis tool.
2.2. Overview of NLP Tools
We analyse news articles in two ways: through named-entity recognition and sentiment analysis.
Our search tool exposes the results of these analyses simultaneously. In free-form text, named-entity recognition (NER) seeks to locate and classify the names of (among other entities) people,
organisations and locations. Although there are other possible categories of named entities, we
selected these three classes based on available resources and commonality of model outputs.
Three powerful and oft-used NER tools include BERT (Bidirectional Encoder Representations
from Transformers, [16]), which uses BIO tagging, CoreNLP [17], which offers both IO and BIO
tagging, and spaCy [18], which employs IOB tagging.
2.3. Use Cases of Corpora
Using corpora and NLP tools, we can discover the biases of a journalist, a news organisation or
the target audience of the news. The effects of these biases can shape the political and sociological lives of a population. We see some interesting examples of these effects. While work by
Rafail and McCarthy [19] stops short of the claim that some news organizations made the Tea
Party—a small, right-leaning movement—a political force, there may be ample evidence to draw
such a conclusion. The suggestion is that the news media simplified the message of the party so that it could be consumed by a wider audience, and amplified coverage of the party's events beyond what the size of its supporter base would normally warrant.
A more pernicious effect may be seen in the coverage of the Persian Gulf “Crisis” and
subsequent war of the early 1990s [20]. Here, the media was focused on stories which, among
other effects, made readers inclined to favour military rather than diplomatic paths. In turn, this
had an effect on the political leadership of the time. The authors also find an interesting effect
wherein the selection bias for stories was proportional to public interest in such stories.
Interestingly, work by Soroka et al. [21] suggests the opposite effect may also be at work: the "strength" of sentiment in social media reactions differs from that of news media coverage for some economic news. As a result, the contexts and degree to which public opinion affects
news coverage or vice versa deserve additional study.
Systematic analysis of media coverage often involves framing the content from the point-of-view
of the reader. A paper by An and Gower [22] discusses five frames (attribution of responsibility,
human interest, etc.) and two “responsible parties” (individuals vs. organizations) in coverage of
crises, finding that some frames are more common than others. Similarly, Trumbo [23] examines
the differing reactions of scientists and politicians to climate change. While analysis approaches
tend to focus on the content produced, work by Ribeiro et al. [24] examines the political leanings
and demographics of the target audience through the advertising associated with the content. We
see this kind of side-channel investigation as promising, especially if applied systematically to a
large set of data.
Our Quantum Criticism corpus is designed with these types of analysis in mind. We tag each
article for named entities and sentiment and expose this corpus to the public. We expect this
corpus to have multiple purposes, including sociological research on influential people and
organizations, “framing” news articles and assigning responsible parties, and the detection of
selection bias and other biases in a media organization’s coverage. We provide details on how
each element of our pipeline is built, and quantify the performance using well-established
metrics. We conclude by validating the tools employed and discussing two use cases for our
corpus.
3. CORPUS AND DATA PROCESSING
The data used for our Quantum Criticism effort was collected, managed, and processed using a
proprietary system designed to scrape, parse, store and analyse the content of news articles from a
variety of sources. Several sentiment and named entity recognition tools were run against the
collected news articles. We also implemented a custom entity resolution algorithm, providing a
rich data set upon which to explore several hypotheses. A pictorial summary of the ingestion,
analysis and storage pipeline is shown in Figure 1.
Figure 1: A Summary of the Ingestion, Analysis and Storage Pipeline
3.1. News Scraper
Several custom web scrapers were created for retrieving news articles from various online news
organizations. All web scrapers were run every two hours to retrieve articles from the following
five news sites: the Atlantic, the British Broadcasting Corporation (BBC) News, Fox News, the
New York Times and Slate Magazine. Web scrapers continue to run every two hours in
perpetuity, scraping additional news articles. Collectively, the web scrapers used each news
organization’s RSS feed as input, storing the scraped output into a custom database. Article
URLs were used for disambiguation; where two scraped articles shared a URL, the most recently
retrieved article replaced previous versions of articles.
As of July 2020, we have collected a total of 150,000 news articles from nine media organizations.
Figure 2 depicts the number of cumulative articles scraped for each news organization over time.
Even though we began scraping Fox News articles regularly four months later than the other news sources, the number of articles scraped rose quickly, and Fox News is now the organization with the most scraped articles. Given that the news scrapers run at regularly scheduled two-hour intervals for all news organizations, this suggests that Fox News updates its RSS feed with new articles far more often than the others, and that the Atlantic updates its RSS feed far less frequently than the others.
Figure 2: Cumulative Quantity of Articles Scraped by News Organization
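As an illustration of this step, the following minimal sketch shows how such a scraper might be structured, assuming the feedparser and newspaper3k packages and SQLite in place of our production database; the feed URLs, table layout and replace-on-duplicate-URL behaviour shown here are simplified stand-ins rather than our actual implementation.

import sqlite3
import feedparser
from newspaper import Article

# Example feed URLs; the full set of feeds and the production table layout differ.
FEEDS = {
    "BBC": "http://feeds.bbci.co.uk/news/rss.xml",
    "Fox News": "http://feeds.foxnews.com/foxnews/latest",
}

conn = sqlite3.connect("quantum_criticism.db")
conn.execute("""CREATE TABLE IF NOT EXISTS article
                (url TEXT PRIMARY KEY, media TEXT, title TEXT, body TEXT,
                 retrieved TEXT DEFAULT CURRENT_TIMESTAMP)""")

def scrape_once():
    for media, feed_url in FEEDS.items():
        for entry in feedparser.parse(feed_url).entries:
            article = Article(entry.link)
            article.download()
            article.parse()
            # Articles sharing a URL replace earlier versions, as described above.
            conn.execute("REPLACE INTO article (url, media, title, body) VALUES (?, ?, ?, ?)",
                         (entry.link, media, article.title, article.text))
    conn.commit()

scrape_once()  # in production this is scheduled every two hours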
3.2. Data and Database Management
All scraped data is stored in a MariaDB relational database. We considered a NoSQL database,
especially one focused on storing documents, such as MongoDB; however, we found that a
relational database was appropriate for the needs of this project.
We constructed many “primary” tables to support the scraped articles. The most important of
these tables are the article, media (e.g., The Atlantic, BBC, etc. representing the news
organization) and entity (a named person, location or organization) tables. To support modelling
the many-to-many relationship between article and entity, we have one "join" table (article entity). To support the work in sentiment analysis and named entity recognition, we also created tables to store the outputs of the algorithms for these tasks: for sentiment analysis, a table called "sentiment," and for named entity recognition, the "entity" table. Other tables in our schema are omitted for brevity. A schema diagram, produced courtesy of dbdiagram.io, appears as Figure 3.
Figure 3: Schema for the Quantum Criticism Database
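For illustration, a condensed sketch of the core tables might look as follows, with Python's built-in sqlite3 standing in for MariaDB; the column names are simplified stand-ins, and the authoritative schema is the one shown in Figure 3.

import sqlite3

schema = """
CREATE TABLE media   (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE article (id INTEGER PRIMARY KEY, media_id INTEGER REFERENCES media(id),
                      url TEXT UNIQUE, title TEXT, body TEXT, published TEXT);
CREATE TABLE entity  (id INTEGER PRIMARY KEY, name TEXT,
                      type TEXT CHECK (type IN ('PERSON', 'LOCATION', 'ORGANIZATION')));
-- "join" table modelling the many-to-many relationship between articles and entities
CREATE TABLE article_entity (article_id INTEGER REFERENCES article(id),
                             entity_id  INTEGER REFERENCES entity(id),
                             ner_tool   TEXT);
-- one row per article-, paragraph- or sentence-level score; finer columns are NULL for coarser levels
CREATE TABLE sentiment (article_id INTEGER REFERENCES article(id),
                        paragraph INTEGER, sentence INTEGER,
                        tool TEXT, score REAL);
"""
sqlite3.connect(":memory:").executescript(schema)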
3.3. Sentiment Analysis
For each news article, we generated a sentiment score. We employed both the VADER (Valence Aware Dictionary and sEntiment Reasoner) [25] module, as implemented in NLTK [26] in Python, and CoreNLP sentiment analysis. Sentiment scores in VADER are continuous values between -1 (very negative) and +1 (very positive), with 0 representing neutral sentiment.
Sentiment scores in CoreNLP are integer values between 0 (very negative) and 4 (very positive),
with 2 representing neutral sentiment.
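As a brief illustration of the VADER scoring described above, the following sketch scores a single sentence through NLTK; it assumes the vader_lexicon resource has been downloaded, and the separate CoreNLP invocation is not shown.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Assumes nltk.download('vader_lexicon') has been run.
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The committee praised the bipartisan bill.")
print(scores["compound"])  # continuous value in [-1, +1]; 0 is neutral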
Sentiment analysis tools were run against each sentence and each paragraph in the article, as well
as on the entire article. For example, if an article contains two paragraphs, where paragraph 1 contains two sentences and paragraph 2 contains one sentence, we calculate six different sentiment scores per sentiment analysis tool: one for each sentence (3), one for each paragraph (2), and one for the article (1). This deconstructed approach allows researchers to
associate named entities with their associated sentiment at a quantum level. This granular level of
sentiment may help disambiguate the sentiment of an article with respect to the named entities.
For example, an article from a conservative news organization may be positive overall; however, it is likely to be more critical of the more liberal politicians, organizations or causes mentioned therein, and more supportive of conservative organizations or causes. Our quantum approach to
sentiment analysis allows researchers to parse sentiment at the sentence level and associate that
sentiment with named entities, independently of the paragraph or article in the aggregate.
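The following sketch illustrates this deconstructed scoring with VADER; splitting paragraphs on blank lines is an assumption made for the example rather than a description of our parser.

from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Assumes the punkt and vader_lexicon NLTK resources have been downloaded.
analyzer = SentimentIntensityAnalyzer()

def article_scores(text):
    scores = {"article": analyzer.polarity_scores(text)["compound"],
              "paragraphs": [], "sentences": []}
    for paragraph in (p for p in text.split("\n\n") if p.strip()):
        scores["paragraphs"].append(analyzer.polarity_scores(paragraph)["compound"])
        for sentence in sent_tokenize(paragraph):
            scores["sentences"].append(analyzer.polarity_scores(sentence)["compound"])
    return scores

# Two paragraphs (two sentences plus one sentence) yield 3 + 2 + 1 = 6 scores.
demo = "The economy grew. Critics were unhappy.\n\nThe senator welcomed the report."
print(article_scores(demo))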
3.4. Named Entity Resolution
We employed eight named entity recognition (NER) models from CoreNLP, spaCy, and BERT
packages to identify PERSONs, ORGANIZATIONs and LOCATIONs in news articles. While
some models predict different NER categories, we sought only those entities which were tagged
using the algorithm depicted in Figure 4.
Figure 4: Pseudo-code for Named Entity Resolution Within a News Article
We store each NER model output individually in our database. In many articles, a named entity is
referenced by a complete name or title, and subsequently, by a shortened version. For example, a
recent New York Times Opinion article (Democrats’ Vulnerabilities? Elitism and Negativity)
first refers to politician Alexandria Ocasio-Cortez by her full name, and then subsequently as
Ocasio-Cortez. In order to connect references to the same named entity, we implemented a
custom entity resolution algorithm. Owing to the highly structured manner in which news articles are written, we expected to observe the pattern of an entity's full name followed by a partial name. Our algorithm therefore matches any name extracted from an article, as a substring, against the most recent instance of another name in the same article. Where a match occurs, the two names are determined to refer to the same entity. Such an entry is matched or created by category (PERSON, LOCATION or ORGANIZATION).
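A simplified sketch of this within-article resolution, corresponding to the pseudo-code in Figure 4 but not reproducing it exactly, might look as follows.

def resolve_entities(mentions):
    """mentions: list of (name, category) tuples in document order."""
    resolved, most_recent = [], {}  # most recently seen name per category
    for name, category in mentions:
        previous = most_recent.get(category)
        if previous and name in previous:
            # A shortened form (e.g. "Ocasio-Cortez") resolves to the most
            # recent longer name in the same category.
            resolved.append((previous, category))
        else:
            resolved.append((name, category))
            most_recent[category] = name
    return resolved

mentions = [("Alexandria Ocasio-Cortez", "PERSON"), ("Ocasio-Cortez", "PERSON")]
print(resolve_entities(mentions))
# [('Alexandria Ocasio-Cortez', 'PERSON'), ('Alexandria Ocasio-Cortez', 'PERSON')]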
This process often failed for abbreviations such as the acronym F.B.I.—in reference to the Federal Bureau of Investigation—which, with the periods removed, becomes FBI. We therefore also created custom code to query a corpus of abbreviations and associate acronyms with their full names. Only full names were stored in the database. We label each such instance of the full name
a resolved entity. The entity resolution algorithm is depicted in Figure 4. Entities are also
resolved across articles in a similar manner.
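The acronym step can be illustrated with a small lookup table; the entries shown here are examples rather than the contents of the abbreviation corpus we query.

import re

# Illustrative lookup table; the abbreviation corpus is far larger.
ABBREVIATIONS = {"FBI": "Federal Bureau of Investigation",
                 "EPA": "Environmental Protection Agency"}

def expand_acronym(name):
    key = re.sub(r"\.", "", name)          # "F.B.I." and "FBI" share the key "FBI"
    return ABBREVIATIONS.get(key, name)    # only full names are stored in the database

print(expand_acronym("F.B.I."))            # Federal Bureau of Investigation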
3.5. Web Interface
We designed and implemented a web interface for our corpus. Through this interface, a user can
specify basic search criteria for the articles, specifically: the entity name, in whole or part, of the
entity to be searched; the news source(s) to be searched from among those in our database; and
the date or date range of the articles. (See Figure 5.)
Figure 5: Search Screen for Web Interface
Additionally, advanced search criteria—not shown in Figure 5 but available on the live web
interface—allow the user to include additional filters for specific NER tools, the sentiment tools,
and/or the level of granularity (article, paragraph or sentence) to be reported. Upon executing a
successful query, a small subset of the results is displayed so that the user may perform a quick
validation. In addition, a link is provided to allow the user to download the full set of results in
comma-separated value (CSV) format. Each row in the result set contains the fields listed in
Table 1.
Table 1: Fields in the Result Set of the Web Interface

Field            Type     Notes
id               integer  Database table ID
entity           string   Full name of entity
entity id        int      Database table ID
type             Enum     PER, LOC or ORG
date             Date     Last modified date
url              String   Article's URL
NER tool         String   Name of the NER tool
paragraph        int      Possibly NULL or empty
sentence         int      Possibly NULL or empty
sentiment score  float
sentiment tool   String   Name of the model
media name       String   "Fox News", "Slate," ...
media url        String   URL of the news org.
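Once downloaded, the CSV can be processed with standard tooling. The following sketch, assuming pandas and a hypothetical file name, aggregates sentiment for one entity using the column names of Table 1.

import pandas as pd

results = pd.read_csv("quantum_criticism_results.csv")   # hypothetical downloaded file
pelosi = results[(results["entity"] == "Nancy Pelosi") & (results["type"] == "PER")]
print(pelosi.groupby("media name")["sentiment score"].mean())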
Notably absent from the columns in the result set are the contents of the article. This absence is
deliberate. While recent legal rulings have suggested that distributing content produced by third parties is permissible, we are unsure whether those rulings are the final word on this or whether they apply globally. As a result, we provide the URL to the source article, allowing the
user to download the content themselves.
4. VALIDATION
We employ well-studied tools with established performance benchmarks in our data ingestion
and processing pipeline. In this section, we describe how we evaluated the performance of those
tools. For the validations reported in this section, we used two news corpora: a historical New
York Times corpus and our Quantum Criticism corpus of scraped news articles.
4.1. Named Entity Recognition Validation
We test the efficacy of our eight NER models across three different NER tools using two
approaches. Firstly, we executed each of the models against the articles in the New York Times
Annotated Corpus with a 1st of December publication date across all 20 years covered by the
corpus. Secondly, we explore the fidelity of the NER tools by examining how well they identify
the 538 members of the U.S. Congress (Senate and House of Representatives), as identified by
Ballotpedia.
4.1.1. NER Tools vs. the LDC New York Times Corpus
To determine the fidelity of our results, we ran each NER model against the New York Times
Annotated Corpus [8], for which named entities are provided as an adjunct list. The corpus
contains 1.8 million articles from the New York Times from the years between 1987 and 2007.
While we found that the 4,713 articles from the 1st of December 1987–2007 were a sufficiently
ample volume from which to draw conclusions, we tested an additional ten months of data for the
spaCy and CoreNLP models, finding no significant deviation from the results we report here.
For each of the articles published on the 1st of December, we determined the mean (and standard
deviation) of the following in each article: the number of tokens per article: 587.2 (643.7); and
the number of named entities identified by the models per article: 31.8 (43.9). For each of the
models, we also computed the precision, recall and F1 score for each article. The BERT bert.base.multilingual.cased model generated the highest mean precision and mean F1 scores of 0.1753 and 0.2549, respectively, whereas the highest mean recall score was obtained from the CoreNLP english.all.3class.distsim.crf.ser model. We observed a consistently low F1 score
for all NER models, despite the variable number of entities identified by the classifiers. Some of
this poor performance may be explained by the models’ generation of improperly resolved
entities in the body of the article. However, we believe that this poor performance can be largely
attributed to errors in the labels of the source corpus.
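The per-article scores can be illustrated as follows; treating predictions and annotations as sets and requiring exact string matches is a simplifying assumption for the example, not a statement of our exact matching criterion.

def precision_recall_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {"Cara Buckley", "New York City", "Federal Bureau of Investigation"}
predicted = {"Cara Buckley", "New York City", "Cheryl Pickett", "Homicides"}
print(precision_recall_f1(predicted, gold))   # approximately (0.50, 0.67, 0.57)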
To confirm this hypothesis, we examined several articles from the New York Times Annotated Corpus, and found disagreement with the named entities identified in the manual tagging of the
corpus. Filtering for the named entity classes PERSON, LOCATION and ORGANIZATION in
one of these examined articles, Homicides Up in New York; Other Crimes Keep Falling, we find
only three tags from the corpus: Cara Buckley, the article’s author; New York City, the location
being reported; and the Federal Bureau of Investigation. These instances identified by the corpus
are highlighted in blue. In contrast, one of the authors, a native English speaker who has
performed several annotation tasks on other projects, identified several other named entities.
These additional named entities are highlighted in yellow. Interestingly, two of the models we
use, bert.base.cased and bert.base.multilingual.cased, added a spurious named entity label in
this case, “Homicides.” We highlight this deviation in red.
In the articles we inspected, our annotator found that his identification is closer to the results we
obtain from the NER models. Interestingly, BERT appears to exhibit a tendency toward
combining tokens to form named entities (e.g., “Ms. Pickett” with “Fort Greene” to form “fort
green pickett”), and toward listing names of people with family name first (e.g., “pickett,
cheryl”); we converted to the surname-last ordering common in American English. While we
believe the output of the BERT models is closer aligned to our expectations, it also clearly
misidentified tokens (e.g., “homicides” in the above example) as named entities and misclassified
named entities, commonly determining a person’s name to be a LOCATION instance.
Figure 6: Labels identified by the NY Times Corpus (blue) and additional labels identified by our annotator
(yellow) and spurious labels identified by BERT (red)
Although our manual evaluation of the corpus was limited in scope, it does lead us to believe that
named entities are generally under-reported by the corpus. We therefore concluded that the most
reliable metric in comparing our models to the corpus is recall. This allows us to treat each NER
system as a detector, i.e., to determine what fraction of the entities in the articles’ annotations are
identified by the NER models.
4.1.2. NER Tools vs. the Quantum Criticism Corpus
Because we were unable to label an extensive set of articles, we have no ground truth for the
performance of the NER tools against our corpus. Instead, we largely rely on the metrics of the
NER tools against the NY Times Corpus for this. However, we are interested in determining the
number or rate of misclassifications of entities.
For this, we used the 538 current members of the U.S. legislative branch (Congress) as identified
by Ballotpedia. This list includes all members of the U.S. Senate and the U.S. House of
Representatives. Of these, 372 (69%) are mentioned at least once in the articles we scraped. We
examined these mentions as a way to assess the quality of the NER tools employed.
Looking across all eight NER tools from BERT, CoreNLP and spaCy, 96.9% of all entities
resolved to the correct classification of PERSON. There were, however, some notable deviations. Two models, one from spaCy and another from CoreNLP, consistently misidentified Congresspeople as ORGANIZATION instances, at rates of 5.06% and 2.98%, respectively, as depicted in Figure 7. This behaviour may, in part, be attributed to the fact that Congresspeople often lead or participate in important organizations, and are therefore often conflated with them. For example, Nancy Pelosi, the former and current Speaker of the U.S. House of Representatives at the time of this writing, is often misidentified as an ORGANIZATION given her leadership role. Perhaps more interesting is the performance of the english.conll.4class.distsim.crf.ser model in CoreNLP, which misidentifies 7.76% of all Congresspeople.
Figure 7: (Mis)Classification of PERSON Entities by NER Model

4.2. Entity Resolution Validation
Querying our scraped articles, we sought to explore how well our proprietary entity resolution
algorithm worked to resolve the names of Congresspeople. We searched our scraped data using
each space-separated or hyphen-separated token from the full names of members of Congress
listed on Ballotpedia. These results were then manually checked to retain only valid references to
the individuals in question.
To illustrate the above, we use Nancy Pelosi, who is the most-mentioned Congressperson in our
database. Using the strings “%Nancy%” and “%Pelosi%” as our search criteria and removing
unassociated entities (e.g., “Pino Pelosi”, “Nancy Reagan”, etc.), we identified thousands of
references to 475 entities. “Nancy Pelosi” as a PERSON instance is the most common entity,
with 1,915 scraped articles. “Pelosi,” misidentified as an ORGANIZATION 371 times, is the
second most frequent occurrence. “Ms. Pelosi,” “Pelosi” and other variants are less frequent.
Figure 8 shows the top ten entities for Nancy Pelosi along with the frequency with which they
occur. We can measure the precision of our model with respect to an individual instance as the
most frequent occurrence. In this case, Nancy Pelosi (PERSON) represents 53% of all the
references to her.
Figure 8: Frequency of Top 10 Entities Associated with “Nancy Pelosi”
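The search just described can be illustrated with a query of the following form, written here against the simplified SQLite schema sketched in Section 3.2; our production queries against MariaDB differ in detail.

import sqlite3

conn = sqlite3.connect("quantum_criticism.db")
rows = conn.execute("""
    SELECT e.name, e.type, COUNT(*) AS mentions
    FROM entity e JOIN article_entity ae ON ae.entity_id = e.id
    WHERE e.name LIKE ? OR e.name LIKE ?
    GROUP BY e.name, e.type
    ORDER BY mentions DESC
    LIMIT 10""", ("%Nancy%", "%Pelosi%")).fetchall()
for name, entity_type, mentions in rows:
    print(name, entity_type, mentions)   # unrelated matches (e.g. "Nancy Reagan") are removed manually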
Building on the above analysis, we sought to examine how often and what fraction of the time the
ten most frequently-mentioned U.S. Congresspeople were correctly resolved by our ER
algorithm. Figure 8 depicts the top ten ways in which "Nancy Pelosi" is resolved. "Nancy Pelosi", however, is resolved a total of over 400 different ways, demonstrating room for
improvement. Figure 9 depicts the number of different ways in which a Congressperson is
resolved (x-axis), the cumulative sum of the number of times the name was resolved (y-axis),
with a percentage, in square brackets, indicating the fraction of references attributed to the most
frequent instance of the entity.
Figure 9: (Mis)Identification of Congresspeople
The many different entities for Congresspeople can be partly attributed to errors in NER models
that associate extraneous characters and tokens with names. As discussed above, we also
observed Congresspeople being associated with the incorrect labels of LOCATION or
ORGANIZATION. Sporadic errors in the spelling of a Congressperson’s name from source
articles also contributed to errors in this Entity Resolution step. For example, “Alexander
OcasioCortez” [sic] appears as a misspelling of the representative.
5. CASE STUDIES
To demonstrate the power of our resource, we chose several case studies. The first uses the locations mentioned by different news organisations to expose a location bias. The second takes a deep dive into the operational characteristics of media organizations. The remaining studies demonstrate how a critical event on a given day can alter the sentiment ascribed to a politician by a news organization, and how our resource provides the high level of resolution necessary to detect such changes.
5.1. Location Bias
In seeking to determine whether news organizations have a geographic reporting bias, we plotted
all named LOCATION entities and their frequencies for the Atlantic and Slate news articles
between July 2018 and June 2019. The geomap, produced using OpenHeatMap, is depicted in
Figure 10, and demonstrates that, despite having a larger volume of articles than the Atlantic,
articles found in Slate produce fewer mappable locations.
Moreover, locations referenced are concentrated on the North American coast (Eastern and
Western United States), the British Isles and Southern France. Counterintuitively, the smaller
volume of articles from the Atlantic produce a larger number and wider variety of references to
locations. The location bias for Slate is not wholly unexpected. The first sentence in its
description on Wikipedia is, “Slate is an online magazine that covers current affairs, politics, and
culture in the United States.” We find a similar pattern for the other news organizations from
which we scraped data. For example, the BBC shows a plurality of articles referencing the United
Kingdom, Ireland and the USA, with several references to former British colonies (India, Australia,
New Zealand, South Africa, etc.). These findings are a confirmation of [4].
Figure 10: Locations Referenced by The Atlantic and Slate
5.2. Analysis of Media Organizations’ Operational Characteristics
The comprehensive collection of parsed, processed and tagged articles in the Quantum Criticism
Corpus allows us to take a deep dive into the characteristics of various news organizations and
examine the polarized news-scape across several axes of analysis. Right- and left-wing media
organizations—including media organizations that either self-identify with a particular political
ideology, or those that are widely and publicly understood to be supportive of a particular
political ideology—often exhibit specific behavioural patterns, or make claims about each other’s
behavioural patterns, which we can now quantify with our Corpus.
Firstly, we explore the total number of articles published per month, per media organization via
their RSS feed. Figure 11 depicts the total number of articles published per month on the y-axis,
shown over time, since we began collecting data in June 2018. With the exception of the New
York Times, which publishes a significant quantity of articles (but still fewer than its right-leaning counterparts), left or left-of-center media outlets such as Slate, the Atlantic, Reuters, the BBC, CNN and the New Yorker all publish far fewer articles than their right-leaning counterparts, Fox News and Breitbart.
[Figure 11 plots the monthly article count for each media organisation since June 2018; annotations on the plot mark the launch of the CNN, Breitbart, Reuters and New Yorker scrapers and a period during which the Fox News scraper was offline.]
Figure 11: Monthly Mean Article Count per Media Organisation
Table 2 confirms that, independent of the peaks and troughs occurring over time, the media outlets that espouse ideologically right political viewpoints, namely Fox News and Breitbart, publish almost twice as many articles per day as the highest-publishing left-leaning media outlets. In fact, from June 01, 2020 to June 15, 2020, we scraped an astonishing 939 articles from Fox News and Breitbart alone, dwarfing the 573 articles scraped from all other news organizations combined.
Table 2: Mean Quantity of New Articles Published Per Day per Media Organization

News Organization   Mean Quantity of New Articles Published Per Day
BBC                 37.9
Breitbart News      87.3
CNN                 24.9
Fox News            105.9
New York Times      61.1
Reuters             13.1
Slate               22.5
The Atlantic        14.3
The New Yorker      45.6
In an inversion of mean article publication frequency, Table 3 shows that the left-leaning Atlantic and New York Times publish articles with the highest mean number of named entities, whereas the right-leaning Fox News publishes articles with the fewest, where named entities, as defined in our research, consist of people, places and
organizations, as detailed in Section 3.4. Note that certain media organizations for which we only recently began collecting data have been omitted from this table, including Breitbart, CNN, Reuters and The New Yorker.
Table 3: Mean Number of Entities Per Article per Media Organization

News Organization   Mean Number of Entities Per Article
BBC                 39.6
Fox News            29.8
New York Times      63.3
Slate               45.0
The Atlantic        65.4
When combined with data from Table 2, this suggests that Fox News publishes very frequently in comparison to its media counterparts, but publishes articles that are far more focused, discussing roughly half the number of people, places and/or organizations per article that the New York Times does.
To normalize this measure, the mean number of entities per article in Table 3 was divided by the mean number of sentences per article (see Table 4) to obtain the mean number of entities discussed per sentence in Table 5. The results demonstrate that although Fox News produces articles with far fewer sentences on average, each sentence discusses the highest mean number of entities. Conversely, the BBC has sentences with the fewest named entities per sentence.
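For example, using the Fox News values, 29.8 entities per article divided by 32.8 sentences per article gives approximately 0.908 entities per sentence, the value reported in Table 5.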
Table 4: Mean Number of Sentences per Article per Media Organization

News Organization   Mean Number of Sentences per Article
BBC                 51.9
Fox News            32.8
New York Times      77.5
Slate               53.0
The Atlantic        82.8
Lastly, we explore mean sentiment associated with different news organizations in Figure 12
using both CoreNLP (scoring integer values from 0 to 4 inclusive, with 0 being negative and 4
being positive) and VADER (scoring from -1 to +1 inclusive, with -1 being negative and +1
being positive). Both the New York Times and Slate have statistically significantly higher mean
sentiment per article, when using both CoreNLP and VADER to measure sentiment. VADER
reports the sentiment of these two media organizations as positive, whereas CoreNLP reports them as negative, albeit less negative (i.e., more positive) than certain other news organisations. Comparatively, Fox News is reported to have statistically significantly lower, negative sentiment.
Table 5: Mean Number of Entities per Sentence per Media Organization

News Organization   Mean Number of Entities per Sentence
BBC                 0.764
Fox News            0.908
New York Times      0.816
Slate               0.849
The Atlantic        0.799
Figure 12: Mean Sentiment per Article per News Organisation
A pattern begins to emerge here, in which left-leaning organizations tend to write less frequent yet significantly longer articles, with fewer references to named entities per sentence and higher overall sentiment. These longer articles also tend to include many more named entities in total, making them more complex. Right-leaning organizations tend to write frequent, shorter articles, with a high number of named entities per sentence and a negative mean sentiment. However, they have fewer overall named entities per article, which implies the articles are more focused on a smaller subset of ideas or connections. This recipe of shorter articles, fewer named entities per article, negative sentiment and higher frequency of publication appears to be successful, with the Fox News network being the dominant news network on television in the USA.
5.3. The August 2019 Mass Shooting in El Paso
We use a single event as an exemplar to investigate some phenomena more thoroughly. In August 2019, the city of El Paso, TX was the site of a mass shooting. All of the media outlets covering the event (the Atlantic, the BBC, Fox News, the New York Times and Slate) acknowledged that the shooter was racist and that Mexicans were the primary target. All the outlets alluded to connections between this event and two others—a mass shooting in Christchurch, New Zealand, and another in Dayton, OH—because of their timing and the motivations of the shooters.
Days after the event, the U.S. President, Donald Trump, visited El Paso. In addition to the
coverage of the shooting more broadly and the ensuing reactions from those in government and
entertainment, this visit was covered by three of the media outlets from which we collected data
(BBC, Fox News and the New York Times). We chose this visit because of its potential to
highlight the differences among the different media outlets, specifically as a larger representation
of the coverage typical of each outlet. While other media outlets covered the background
shooting and related events, we could not find coverage of this specific event (Trump’s visit) in
the RSS feeds for the Atlantic or for Slate.
Table 6: Coverage of Trump's Visit to El Paso following Mass Shooting

News Organization   Sentences   Standard Deviation of Sentiment (VADER)
BBC                 49          0.4287
Fox News            11          0.3634
New York Times      69          0.4087
We used NLTK’s sentence tokenizer to determine the number of sentences in each article. As
shown in Table 6, these resulted in very different lengths. The shortest, 11 sentences by Fox
News, might be characterised less about Trump’s actual visit and more as a description of a
contemporaneous discussion between New York City mayor Bill De Blasio and Fox News
commentator Sean Hannity, along with comments made by other political figures. By contrast the
comparable articles in the BBC and the New York Times were longer (49 and 69 sentences
respectively) and mentioned more people overall – albeit fewer per sentence, as described in
Tables 4 and 5). These articles also used the extra space to provide additional context and
background, such as Trump’s history of being unable to console following a disaster and
politicians local to the El Paso area.
Using VADER as implemented in NLTK, we examined the overall sentiment as well as the sentiment at the sentence level. The overall sentiment for each article is similar, hovering between -0.2934 and -0.1057 for the BBC, Fox News and the NY Times. Figure 13 shows how
sentiment changes for each sentence through each article, scraped chronologically. We did
however observe a notably lower standard deviation of sentiment for sentences in Fox News
articles when compared with both the BBC and the New York Times (see Table 6).
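A sketch of this per-sentence analysis, assuming the NLTK punkt and vader_lexicon resources, follows; the choice of the population standard deviation here is an illustrative assumption, since the paper does not specify which form was used.

from statistics import pstdev
from nltk import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def sentence_sentiment_profile(article_text):
    analyzer = SentimentIntensityAnalyzer()
    sentences = sent_tokenize(article_text)
    scores = [analyzer.polarity_scores(s)["compound"] for s in sentences]
    return len(sentences), pstdev(scores) if len(scores) > 1 else 0.0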
Taken together with the significantly lower overall sentiment of its articles, a profile of Fox News emerges. Its articles are designed to be read quickly, contain more assertions by its commentators and other popular figures, and are written with a focus on the negative elements of newsworthy events.
Figure 13: Sentiment per Sentence for a Representative Article per Media Organisation
5.4. Sentiment for José Serrano
Scouring left-leaning news organizations, we observed a peculiar pattern. When reporting on a left-leaning politician—in America, typically, a Democrat—the sentiment of the overall article is lower than the sentiment associated with the paragraphs in which the politician in question is referenced, which is itself lower than the sentiment of the sentence(s) in which the politician is mentioned. In sum, the more focus there is on the politician herself, the higher the sentiment. This holds for several left-leaning politicians when querying the Quantum Criticism corpus. This rule, however, is violated when a seminal event occurs. For example, when José E. Serrano, the Democrat representing the 15th district of New York, announced his retirement, the overall sentiment of the article jumped to a value higher than either the paragraph- or sentence-level sentiment (see Figure 14), a change only detectable with the sentence-level granularity provided by the Quantum Criticism corpus.
Figure 14: Sentiment for José E. Serrano at the Sentence, Paragraph and Article levels
6. CONCLUSION AND FUTURE WORK
We collected a database of news articles from five popular media organizations, placed each article in a pipeline to identify named entities and determined the sentiment (affect) associated with each named entity. We identified interesting patterns and confirmed a geographic selection bias found by other researchers. Collecting new data every two hours, our platform shows great promise for future research, and will further benefit from additional iterations.
We aspire to make this tool even more useful through the addition of news articles from
additional news sources. Because news is sometimes underreported by organizations—see
Radiolab’s Breaking Bongo [27] for one unusual case–we will also consider adding selected
tweets and other social media messages from individuals and organizations. We have already
collected hundreds of thousands of candidate tweets which we have not yet filtered for relevance
or made available. When coupled with better or customized tools for NER, sentiment and entity
resolution, we believe this project has the potential to uncover a wide range of phenomena.
The addition of one or more frameworks for coding event data, such as CAMEO, COPDAB or others, would also increase the usefulness of the tool. Such frameworks would allow comparison of the same set of events across different media outlets, communities and countries.

The authors acknowledge and thank culture critic Theodore Gioia, who originated the term Quantum Criticism and was a guide and inspiration for this work. Software Engineer Nikhil Barapatre led the effort to produce the web interface. The authors would also like to thank the Machine Learning, Artificial and Gaming Intelligence, and Computing at Scale (MAGICS) Lab at the University of San Francisco for supporting this research with mentorship and computational infrastructure.
REFERENCES
[1] Roberts, D. (2017). Donald trump and the rise of tribal epistemology. Vox Media.
[2] White, D. M. (1950). The "gate keeper": A case study in the selection of news. Journalism Bulletin, 27(4):383–390.
[3] Gruenewald, J., Pizarro, J., and Chermak, S. M. (2009). Race, gender, and the newsworthiness of homicide incidents. Journal of criminal justice, 37(3):262–272.
[4] Wu, H. D. (2000). Systemic determinants of international news coverage: A comparison of 38 countries. Journal of communication, 50(2):110–130.
[5] Huiberts, E. and Joye, S. (2018). Close, but not close enough? Audiences' reactions to domesticated distant suffering in international news coverage. Media, Culture & Society, 40(3):333–347.
[6] Baum, M. A. and Zhukov, Y. M. (2019). Media ownership and news coverage of international conflict. Political Communication, 36(1):36–63.
[7] Bourgeois, D., Rappaz, J., and Aberer, K. (2018). Selection bias in news coverage: learning it, fighting it. In Companion of the The Web Conference 2018 on The Web Conference 2018, pages 535–543. International World Wide Web Conferences Steering Committee.
[8] Sandhaus, E. (2008). The new york times annotated corpus ldc2008t19.
[9] Graff, D. (1995). North American news text corpus.
[10] Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., and Taylor, A. (1999). Treebank-3. Linguistic
Data Consortium, Philadelphia, 14.
[11] Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., and Gildea, D. (2001). Form
variation of english function words in conversation. Submitted manuscript.
[12] Kann, K., Mohananey, A., Bowman, S., and Cho, K. (2019). Neural unsupervised parsing beyond
english. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP
(DeepLo 2019), pages 209–218.
[13] Leetaru, K. and Schrodt, P. A. (2013). Gdelt: Global data on events, location, and tone, 1979–2012.
In ISA annual convention, volume 2, pages 1–49. Citeseer.
[14] Ward, M. D., Beger, A., Cutler, J., Dickenson, M., Dorff, C., and Radford, B. (2013). Comparing
gdelt and icews event data. Analysis, 21(1):267–97.
[15] Davies, M. (2010). The corpus of contemporary american english as the first reliable monitor corpus
of english. Literary and linguistic computing, 25(4):447–464.
[16] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
[17] Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., and McClosky, D. (2014). The Stanford
corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association
for computational linguistics: system demonstrations, pages 55–60.
[18] Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom
embeddings, convolutional neural networks and incremental parsing. To appear.
[19] Rafail, P. and McCarthy, J. D. (2018). Making the tea party republican: Media bias and framing in
newspapers and cable news. Social Currents, 5(5):421–437.
[20] Iyengar, S. and Simon, A. (1993). News coverage of the gulf crisis and public opinion: A study of
agenda-setting, priming, and framing. Communication research, 20(3):365–383.
[21] Soroka, S., Daku, M., Hiaeshutter-Rice, D., Guggenheim, L., and Pasek, J. (2018). Negativity and
positivity biases in economic news coverage: Traditional versus social media. Communication
Research, 45(7):1078–1098.
[22] An, S.-K. and Gower, K. K. (2009). How do the news media frame crises? a content analysis of crisis
news coverage. Public Relations Review, 35(2):107–112.
[23] Trumbo, C. (1996). Constructing climate change: claims and frames in us news coverage of an
environmental issue. Public understanding of science, 5:269–283.
[24] Ribeiro, F. N., Henrique, L., Benevenuto, F., Chakraborty, A., Kulshrestha, J., Babaei, M., and
Gummadi, K. P. (2018). Media bias monitor: Quantifying biases of social media news outlets at
large-scale. In Twelfth International AAAI Conference on Web and Social Media.
[25] Hutto, C. J. and Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of
social media text. In Eighth international AAAI conference on weblogs and social media.
[26] Loper, E. and Bird, S. (2002). Nltk: The natural language toolkit. In Proceedings of the ACL-02
Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and
Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, Stroudsburg, PA, USA.
Association for Computational Linguistics.
[27] Adler, S. “Breaking Bongo,” Radiolab, 26 November, 2019. New York City: WNYC Studios.
Available https://www.wnycstudios.org/podcasts/radiolab/articles/breaking-bongo. Accessed: 28
February, 2020.