
Fake News Detection on Social Media:

A Data Mining Perspective

Kai Shu†, Amy Sliva‡, Suhang Wang†, Jiliang Tang♮, and Huan Liu†
†Computer Science & Engineering, Arizona State University, Tempe, AZ, USA
‡Charles River Analytics, Cambridge, MA, USA
♮Computer Science & Engineering, Michigan State University, East Lansing, MI, USA

ABSTRACT

Social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of "fake news", i.e., low quality news with intentionally false information. The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research area that is attracting tremendous attention. Fake news detection on social media presents unique characteristics and challenges that make existing detection algorithms from traditional news media ineffective or not applicable. First, fake news is intentionally written to mislead readers into believing false information, which makes it difficult and nontrivial to detect based on news content; therefore, we need to include auxiliary information, such as user social engagements on social media, to help make a determination. Second, exploiting this auxiliary information is challenging in and of itself, as users' social engagements with fake news produce data that is big, incomplete, unstructured, and noisy. Because the issue of fake news detection on social media is both challenging and relevant, we conducted this survey to further facilitate research on the problem. In this survey, we present a comprehensive review of detecting fake news on social media, including fake news characterizations based on psychology and social theories, existing algorithms from a data mining perspective, and evaluation metrics and representative datasets. We also discuss related research areas, open problems, and future research directions for fake news detection on social media.

1. INTRODUCTION

As an increasing amount of our lives is spent interacting online through social media platforms, more and more people tend to seek out and consume news from social media rather than traditional news organizations. The reasons for this change in consumption behaviors are inherent in the nature of these social media platforms: (i) it is often more timely and less expensive to consume news on social media compared with traditional news media, such as newspapers or television; and (ii) it is easier to further share, comment on, and discuss the news with friends or other readers on social media. For example, 62 percent of U.S. adults got news on social media in 2016, while in 2012 only 49 percent reported seeing news on social media^1. It was also found that social media now outperforms television as the major news source^2. Despite the advantages provided by social media, the quality of news on social media is lower than that of traditional news organizations. However, because it is cheap to provide news online and much faster and easier to disseminate through social media, large volumes of fake news, i.e., news articles with intentionally false information, are produced online for a variety of purposes, such as financial and political gain. It was estimated that over 1 million tweets were related to the fake news story "Pizzagate"^3 by the end of the presidential election. Given the prevalence of this new phenomenon, "fake news" was even named the word of the year by the Macquarie Dictionary in 2016.

The extensive spread of fake news can have a serious negative impact on individuals and society. First, fake news can break the authenticity balance of the news ecosystem. For example, it is evident that the most popular fake news was even more widely spread on Facebook than the most popular authentic mainstream news during the 2016 U.S. presidential election^4. Second, fake news intentionally persuades consumers to accept biased or false beliefs. Fake news is usually manipulated by propagandists to convey political messages or influence. For example, some reports show that Russia has created fake accounts and social bots to spread false stories^5. Third, fake news changes the way people interpret and respond to real news. For example, some fake news is created just to trigger people's distrust and make them confused, impeding their ability to differentiate what is true from what is not^6. To help mitigate the negative effects caused by fake news, both to benefit the public and the news ecosystem, it is critical that we develop methods to automatically detect fake news on social media.

^1 http://www.journalism.org/2016/05/26/news-use-across-social-media-platforms-2016/
^2 http://www.bbc.com/news/uk-36528256
^3 https://en.wikipedia.org/wiki/Pizzagate_conspiracy_theory
^4 https://www.buzzfeed.com/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook?utm_term=.nrg0WA1VP0#.gjJyKapW5y
^5 http://time.com/4783932/inside-russia-social-media-war-america/
^6 https://www.nytimes.com/2016/11/28/opinion/fake-news-and-the-internet-shell-game.html?_r=0

Detecting fake news on social media poses several new and challenging research problems. Though fake news itself is not a new problem (nations or groups have been using the news media to execute propaganda or influence operations for centuries), the rise of web-generated news on social media makes fake news a more powerful force that challenges traditional journalistic norms. There are several characteristics of this problem that make it uniquely challenging for automated detection. First, fake news is intentionally written to mislead readers, which makes it nontrivial to detect simply based on news content. The content of fake news is rather diverse in terms of topics, styles, and media platforms, and fake news attempts to distort truth with diverse linguistic styles while simultaneously mocking true news. For example, fake news may cite true evidence within an incorrect context to support a non-factual claim [22]. Thus, existing hand-crafted and data-specific textual features are generally not sufficient for fake news detection. Other auxiliary information must also be applied to improve detection, such as knowledge bases and user social engagements. Second, exploiting this auxiliary information actually leads to another critical challenge: the quality of the data itself. Fake news is usually related to newly emerging, time-critical events, which may not have been properly verified by existing knowledge bases due to the lack of corroborating evidence or claims. In addition, users' social engagements with fake news produce data that is big, incomplete, unstructured, and noisy [79]. Effective methods to differentiate credible users, extract useful post features, and exploit network interactions are an open area of research and need further investigation.

In this article, we present an overview of fake news detection and discuss promising research directions. The key motivations of this survey are summarized as follows:

- Fake news on social media has been occurring for several years; however, there is no agreed upon definition of the term "fake news". To better guide the future directions of fake news detection research, appropriate clarifications are necessary.

- Social media has proved to be a powerful source for fake news dissemination. There are some emerging patterns that can be utilized for fake news detection in social media. A review of existing fake news detection methods under various social media scenarios can provide a basic understanding of the state-of-the-art fake news detection methods.

- Fake news detection on social media is still at an early stage of development, and there are many challenging issues that need further investigation. It is necessary to discuss potential research directions that can improve fake news detection and mitigation capabilities.

To facilitate research in fake news detection on social media, in this survey we will review two aspects of the fake news detection problem: characterization and detection. As shown in Figure 1, we will first describe the background of the fake news detection problem using theories and properties from psychology and social studies; then we present the detection approaches. The major contributions of this survey are summarized as follows:

- We discuss the narrow and broad definitions of fake news that cover most existing definitions in the literature, and further present the unique characteristics of fake news on social media and its implications compared with traditional media;

- We give an overview of existing fake news detection methods with a principled way to group representative methods into different categories; and

- We discuss several open issues and provide future directions of fake news detection in social media.

The remainder of this survey is organized as follows. In Section 2, we present the definition of fake news and characterize it by comparing different theories and properties in both traditional and social media. In Section 3, we formally define the fake news detection problem and summarize the methods to detect fake news. In Section 4, we discuss the datasets and evaluation metrics used by existing methods. We briefly introduce areas related to fake news detection on social media in Section 5. Finally, we discuss the open issues and future directions in Section 6 and conclude this survey in Section 7.

2. FAKE NEWS CHARACTERIZATION

In this section, we introduce the basic social and psychological theories related to fake news and discuss more advanced patterns introduced by social media. Specifically, we first discuss various definitions of fake news and differentiate related concepts that are usually misunderstood as fake news. We then describe different aspects of fake news on traditional media and the new patterns found on social media.

2.1 Definitions of Fake News

Fake news has existed for a very long time, nearly as long as news has circulated widely, since the printing press was invented in 1439^7. However, there is no agreed definition of the term "fake news". Therefore, we first discuss and compare some widely used definitions of fake news in the existing literature, and provide the definition of fake news that will be used for the remainder of this survey. A narrow definition of fake news is news articles that are intentionally and verifiably false and could mislead readers [2]. There are two key features of this definition: authenticity and intent. First, fake news includes false information that can be verified as such. Second, fake news is created with dishonest intention to mislead consumers. This definition has been widely adopted in recent studies [57; 17; 62; 41]. Broader definitions of fake news focus on either the authenticity or the intent of the news content. Some papers regard satire news as fake news since its contents are false, even though satire is often entertainment-oriented and reveals its own deceptiveness to consumers [67; 4; 37; 9]. Other literature directly treats deceptive news as fake news [66], which includes serious fabrications, hoaxes, and satires.

In this article, we use the narrow definition of fake news. Formally, we state this definition as follows:

Definition 1 (Fake News) Fake news is a news article that is intentionally and verifiably false.

^7 http://www.politico.com/magazine/story/2016/12/fake-news-history-long-violent-214535

Figure 1: Fake news on social media: from characterization to detection.

The reasons for choosing this narrow definition are threefold. First, the underlying intent of fake news provides both theoretical and practical value that enables a deeper understanding and analysis of this topic. Second, any techniques for truth verification that apply to the narrow conception of fake news can also be applied under the broader definition. Third, this definition is able to eliminate the ambiguities between fake news and related concepts that are not considered in this article. The following concepts are not fake news according to our definition: (1) satire news with proper context, which has no intent to mislead or deceive consumers and is unlikely to be misperceived as factual; (2) rumors that did not originate from news events; (3) conspiracy theories, which are difficult to verify as true or false; (4) misinformation that is created unintentionally; and (5) hoaxes that are only motivated by fun or to scam targeted individuals.

2.2 Fake News on Traditional News Media

Fake news itself is not a new problem. The media ecology of fake news has been changing over time from newsprint to radio/television and, recently, online news and social media. We denote "traditional fake news" as the fake news problem before social media had important effects on its production and dissemination. Next, we will describe several psychological and social science foundations that describe the impact of fake news at both the individual and social information ecosystem levels.

Psychological Foundations of Fake News. Humans are naturally not very good at differentiating between real and fake news. There are several psychological and cognitive theories that can explain this phenomenon and the influential power of fake news. Traditional fake news mainly targets consumers by exploiting their individual vulnerabilities. There are two major factors which make consumers naturally vulnerable to fake news: (i) Naive Realism: consumers tend to believe that their perceptions of reality are the only accurate views, while others who disagree are regarded as uninformed, irrational, or biased [92]; and (ii) Confirmation Bias: consumers prefer to receive information that confirms their existing views [58]. Due to these cognitive biases inherent in human nature, fake news can often be perceived as real by consumers. Moreover, once a misperception is formed, it is very hard to correct. Psychology studies show that correction of false information (e.g., fake news) by the presentation of true, factual information is not only unhelpful to reduce misperceptions, but sometimes may even increase misperceptions, especially among ideological groups [59].

Social Foundations of the Fake News Ecosystem. Considering the entire news consumption ecosystem, we can also describe some of the social dynamics that contribute to the proliferation of fake news. Prospect theory describes decision making as a process by which people make choices based on the relative gains and losses as compared to their current state [39; 81]. This desire for maximizing the reward of a decision applies to social gains as well, for instance, continued acceptance by others in a user's immediate social network. As described by social identity theory [76; 77] and normative influence theory [3; 40], this preference for social acceptance and affirmation is essential to a person's identity and self-esteem, making users likely to choose "socially safe" options when consuming and disseminating news information, following the norms established in the community even if the news being shared is fake news.

This rational theory of fake news interactions can be modeled from an economic game theoretical perspective [26] by formulating the news generation and consumption cycle as a two-player strategy game. For explaining fake news, we assume there are two kinds of key players in the information ecosystem: publisher and consumer. The process of news publishing is modeled as a mapping from an original signal s to a resultant news report a under an effect of distortion bias b, where b takes a value in {-1, 0, 1} indicating [left, no, right] bias on the news publishing process. Intuitively, this captures the degree to which a news article may be biased or distorted to produce fake news. The utility for the publisher stems from two perspectives: (i) short-term utility: the incentive to maximize profit, which is positively correlated with the number of consumers reached; and (ii) long-term utility: their reputation in terms of news authenticity. The utility of consumers consists of two parts: (i) information utility: obtaining true and unbiased information (usually requiring extra investment cost); and (ii) psychology utility: receiving news that satisfies their prior opinions and social needs, e.g., confirmation bias and prospect theory. Both publisher and consumer try to maximize their overall utilities in this strategy game of the news consumption process. Fake news happens when the short-term utility dominates a publisher's overall utility, psychology utility dominates the consumer's overall utility, and an equilibrium is maintained. This explains the social dynamics that lead to an information ecosystem in which fake news can thrive.
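As a rough illustration, the sketch below encodes this utility trade-off in a few lines of Python. The weight parameters and function names are our own illustrative choices, not part of the game-theoretic model in [26]; they only show how fake news becomes an equilibrium when the "wrong" terms dominate.

def publisher_utility(consumers_reached, reputation, w_short=0.5, w_long=0.5):
    # Short-term utility (profit grows with reach) plus long-term utility
    # (reputation for news authenticity).
    return w_short * consumers_reached + w_long * reputation

def consumer_utility(information_value, prior_agreement, w_info=0.5, w_psych=0.5):
    # Information utility (true, unbiased content) plus psychology utility
    # (agreement with prior opinions, cf. confirmation bias).
    return w_info * information_value + w_psych * prior_agreement

# Fake news thrives when the short-term term dominates for the publisher and
# the psychology term dominates for the consumer: a distorted report (b != 0)
# then raises both utilities at once.
print(publisher_utility(consumers_reached=1.0, reputation=0.0, w_short=0.9, w_long=0.1))
print(consumer_utility(information_value=0.0, prior_agreement=1.0, w_info=0.1, w_psych=0.9))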

2.3 Fake News on Social Media

In this subsection, we discuss some unique characteristics of fake news on social media. Specifically, we highlight the key features of fake news that are enabled by social media. Note that the aforementioned characteristics of traditional fake news are also applicable to social media.

Malicious Accounts on Social Media for Propaganda. While many users on social media are legitimate, social media users may also be malicious, and in some cases are not even real humans. The low cost of creating social media accounts also encourages malicious user accounts, such as social bots, cyborg users, and trolls. A social bot refers to a social media account that is controlled by a computer algorithm to automatically produce content and interact with humans (or other bot users) on social media [23]. Social bots can become malicious entities designed specifically with the purpose to do harm, such as manipulating and spreading fake news on social media. Studies show that social bots distorted the 2016 U.S. presidential election online discussions on a large scale [6], and that around 19 million bot accounts tweeted in support of either Trump or Clinton in the week leading up to election day^8. Trolls, real human users who aim to disrupt online communities and provoke consumers into an emotional response, also play an important role in spreading fake news on social media. For example, evidence suggests that there were 1,000 paid Russian trolls spreading fake news on Hillary Clinton^9. Trolling behaviors are highly affected by people's mood and the context of online discussions, which enables the easy dissemination of fake news among otherwise "normal" online communities [14]. The effect of trolling is to trigger people's inner negative emotions, such as anger and fear, resulting in doubt, distrust, and irrational behavior. Finally, cyborg users can spread fake news in a way that blends automated activities with human input. Usually cyborg accounts are registered by a human as a camouflage, with automated programs set up to perform activities on social media. The easy switch of functionalities between human and bot offers cyborg users unique opportunities to spread fake news [15]. In a nutshell, these highly active and partisan malicious accounts on social media become powerful sources of fake news proliferation.

^8 http://comprop.oii.ox.ac.uk/2016/11/18/resource-for-understanding-political-bots/
^9 http://www.huffingtonpost.com/entry/russian-trolls-fake-news_us_58dde6bae4b08194e3b8d5c4

Echo Chamber Effect. Social media provides a new paradigm of information creation and consumption for users. The information seeking and consumption process is changing from a mediated form (e.g., by journalists) to a more disintermediated way [19]. Consumers are selectively exposed to certain kinds of news because of the way news feeds appear on their homepages in social media, amplifying the psychological challenges to dispelling fake news identified above. For example, users on Facebook tend to follow like-minded people and thus receive news that promotes their favored existing narratives [65]. Therefore, users on social media tend to form groups containing like-minded people where they then polarize their opinions, resulting in an echo chamber effect. The echo chamber effect facilitates the process by which people consume and believe fake news due to the following psychological factors [60]: (1) social credibility, which means people are more likely to perceive a source as credible if others perceive the source as credible, especially when there is not enough information available to assess the truthfulness of the source; and (2) frequency heuristic, which means that consumers may naturally favor information they hear frequently, even if it is fake news. Studies have shown that increased exposure to an idea is enough to generate a positive opinion of it [100; 101], and in echo chambers, users continue to share and consume the same information. As a result, this echo chamber effect creates segmented, homogeneous communities with a very limited information ecosystem. Research shows that homogeneous communities become the primary driver of information diffusion, which further strengthens polarization [18].

3. FAKE NEWS DETECTION

In the previous section, we introduced the conceptual characterization of traditional fake news and fake news on social media. Based on this characterization, we further explore the problem definition and proposed approaches for fake news detection.

3.1 Problem Definition

In this subsection, we present the details of the mathematical formulation of fake news detection on social media. Specifically, we introduce the definitions of the key components of fake news and then present the formal definition of fake news detection. The basic notations are defined below:

- Let a refer to a News Article. It consists of two major components: Publisher and Content. Publisher p_a includes a set of profile features to describe the original author, such as name, domain, and age, among other attributes. Content c_a consists of a set of attributes that represent the news article and includes headline, text, image, etc.

- We also define Social News Engagements as a set of tuples E = {e_it} to represent the process of how news spreads over time among n users U = {u_1, u_2, ..., u_n} and their corresponding posts P = {p_1, p_2, ..., p_n} on social media regarding news article a. Each engagement e_it = {u_i, p_i, t} represents that a user u_i spreads news article a using post p_i at time t. Note that we set t = Null if the article a does not have any engagement yet, in which case u_i represents the publisher.

Definition 2 (Fake News Detection) Given the social news engagements E among n users for news article a, the task of fake news detection is to predict whether the news article a is a fake news piece or not, i.e., to learn a prediction function F : E -> {0, 1} such that

    F(a) = 1, if a is a piece of fake news,
           0, otherwise.                                            (1)

Note that we define fake news detection as a binary classification problem for the following reason: fake news is essentially a distortion bias on information manipulated by the publisher. According to previous research on media bias theory [26], distortion bias is usually modeled as a binary classification problem.
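To make the notation concrete, the following sketch renders the components of this formulation as plain Python data structures. The field names are our own rendering of the definitions above, and the detector is only an interface stub fixing the signature implied by Equation (1), not a working method.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class NewsArticle:
    publisher: dict   # profile features p_a: name, domain, age, ...
    content: dict     # content attributes c_a: headline, text, image, ...

@dataclass
class Engagement:
    user: str              # u_i
    post: str              # p_i
    time: Optional[float]  # t; None if the article has no engagement yet

def detect_fake_news(article: NewsArticle,
                     engagements: List[Engagement]) -> int:
    # The target prediction function F: returns 1 if fake, 0 otherwise.
    # A real system would learn F from labeled data.
    raise NotImplementedError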

Next, we propose a general data mining framework for fake news detection which includes two phases: (i) feature extraction and (ii) model construction. The feature extraction phase aims to represent news content and related auxiliary information in a formal mathematical structure, and the model construction phase further builds machine learning models to better differentiate fake news from real news based on the feature representations.

3.2 Feature Extraction

Fake news detection on traditional news media mainly relies on news content, while on social media, extra social context can be used as auxiliary information to help detect fake news. Thus, we present the details of how to extract and represent useful features from news content and social context.

3.2.1 News Content Features

News content features c_a describe the meta information related to a piece of news. A list of representative news content attributes is given below:

- Source: Author or publisher of the news article.

- Headline: Short title text that aims to catch the attention of readers and describes the main topic of the article.

- Body Text: Main text that elaborates the details of the news story; there is usually a major claim that is specifically highlighted and that shapes the angle of the publisher.

- Image/Video: Part of the body content of a news article that provides visual cues to frame the story.

Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, the news content features we consider are linguistic-based and visual-based, described in more detail below.

Linguistic-based: Since fake news pieces are intentionally created for financial or political gain rather than to report objective claims, they often contain opinionated and inflammatory language, crafted as "clickbait" (i.e., to entice users to click on the link to read the full article) or to incite confusion [13]. Thus, it is reasonable to exploit linguistic features that capture the different writing styles and sensational headlines to detect fake news. Linguistic-based features are extracted from the text content in terms of document organization at different levels, such as characters, words, sentences, and documents. In order to capture the different aspects of fake news and real news, existing work has utilized both common linguistic features and domain-specific linguistic features. Common linguistic features are often used to represent documents for various tasks in natural language processing. Typical common linguistic features are: (i) lexical features, including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words; and (ii) syntactic features, including sentence-level features, such as frequency of function words and phrases (i.e., "n-grams" and bag-of-words approaches [24]) or punctuation and parts-of-speech (POS) tagging. Domain-specific linguistic features are specifically aligned to the news domain, such as quoted words, external links, number of graphs, and the average length of graphs [62]. Moreover, other features can be specifically designed to capture deceptive cues in writing styles to differentiate fake news, such as lying-detection features [1].
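As a minimal illustration of the common lexical and syntactic features above, the sketch below extracts word and character n-gram counts with scikit-learn. The two example texts are toys, and a real system would add the POS and domain-specific features discussed above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Pope endorses candidate, sources say",
        "City reports quarterly budget figures"]

word_ngrams = CountVectorizer(ngram_range=(1, 2))               # uni- and bigrams
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))  # char trigrams

X_word = word_ngrams.fit_transform(docs)   # word-level lexical features
X_char = char_ngrams.fit_transform(docs)   # character-level features
print(X_word.shape, X_char.shape)          # (documents x vocabulary) matrices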

Visual-based: Visual cues have been shown to be an important manipulator for fake news propaganda^10. As we have characterized, fake news exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional responses from consumers. Visual-based features are extracted from visual elements (e.g., images and videos) to capture the different characteristics of fake news. Fake images have been identified based on various user-level and tweet-level hand-crafted features using a classification framework [28]. Recently, various visual and statistical features have been extracted for news verification [38]. Visual features include clarity score, coherence score, similarity distribution histogram, diversity score, and clustering score. Statistical features include count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc.

^10 https://www.wired.com/2016/12/photos-fuel-spread-fake-news/

3.2.2 Social Context Features

In addition to features related directly to the content of the news articles, additional social context features can also be derived from the user-driven social engagements of news consumption on social media platforms. Social engagements represent the news proliferation process over time, which provides useful auxiliary information to infer the veracity of news articles. Note that few papers exist in the literature that detect fake news using social context features. However, because we believe this is a critical aspect of successful fake news detection, we introduce a set of common features utilized in similar research areas, such as rumor veracity classification on social media. Generally, there are three major aspects of the social media context that we want to represent: users, generated posts, and networks. Below, we investigate how we can extract and represent social context features from these three aspects to support fake news detection.

User-based: As we mentioned in Section 2.3, fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs. Thus, capturing users' profiles and characteristics with user-based features can provide useful information for fake news detection. User-based features represent the characteristics of those users who interact with the news on social media. These features can be categorized across different levels: individual level and group level. Individual level features are extracted to infer the credibility and reliability of each user using various aspects of user demographics, such as registration age, number of followers/followees, number of tweets the user has authored, etc. [11]. Group level user features capture the overall characteristics of groups of users related to the news [99]. The assumption is that the spreaders of fake news and real news may form different communities with unique characteristics that can be depicted by group level features. Commonly used group level features come from aggregating (e.g., averaging and weighting) individual level features, such as 'percentage of verified users' and 'average number of followers' [49; 42].

Post-based: People express their emotions or opinions towards fake news through social media posts, such as skeptical opinions, sensational reactions, etc. Thus, it is reasonable to extract post-based features to help find potential fake news via reactions from the general public as expressed in posts. Post-based features focus on identifying useful information to infer the veracity of news from various aspects of relevant social media posts. These features can be categorized as post level, group level, and temporal level. Post level features generate feature values for each post. The aforementioned linguistic-based features and some embedding approaches [69] for news content can also be applied to each post. In addition, there are unique features for posts that represent the social response of the general public, such as stance, topic, and credibility. Stance features (or viewpoints) indicate the users' opinions towards the news, such as supporting, denying, etc. [37]. Topic features can be extracted using topic models, such as latent Dirichlet allocation (LDA) [49]. Credibility features for posts assess the degree of reliability [11]. Group level features aim to aggregate the feature values of all relevant posts for specific news articles using the "wisdom of crowds". For example, average credibility scores are used to evaluate the credibility of news [37]. A more comprehensive list of group-level post features can be found in [11]. Temporal level features consider the temporal variations of post level feature values [49]. Unsupervised embedding methods, such as recurrent neural networks (RNN), are utilized to capture the changes in posts over time [69; 48]. Based on the shape of the time series of various metrics of relevant posts (e.g., number of posts), mathematical features can be computed, such as SpikeM parameters [42].

Network-based: Users form different networks on social media in terms of interests, topics, and relations. As mentioned before, fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection. Network-based features are extracted by constructing specific networks among the users who published related social media posts. Different types of networks can be constructed. The stance network can be built with nodes indicating all the tweets relevant to the news and edges indicating the similarity of stances [37; 75]. Another type of network is the co-occurrence network, which is built based on user engagements by counting whether those users write posts relevant to the same news articles [69]. In addition, the friendship network indicates the following/followee structure of users who post related tweets [42]. An extension of the friendship network is the diffusion network, which tracks the trajectory of the spread of news [42], where nodes represent the users and edges represent the information diffusion paths among them. That is, a diffusion path between two users u_i and u_j exists if and only if (1) u_j follows u_i, and (2) u_j posts about a given news article only after u_i does so. After these networks are properly built, existing network metrics can be applied as feature representations. For example, degree and clustering coefficient have been used to characterize the diffusion network [42] and friendship network [42]. Other approaches learn latent node embedding features by using SVD [69] or network propagation algorithms [37].
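The sketch below illustrates this construction on toy data with networkx: it builds a diffusion network under the follow-and-post-later rule above and computes simple degree and clustering features. The engagement data and feature choices are illustrative, not those actually used in [42].

import networkx as nx

follows = [("u2", "u1"), ("u3", "u1"), ("u3", "u2")]   # (u_j, u_i): u_j follows u_i
post_time = {"u1": 0, "u2": 5, "u3": 9}                # when each user posted about the news

# Diffusion edge u_i -> u_j iff u_j follows u_i and posts after u_i does.
diffusion = nx.DiGraph()
diffusion.add_nodes_from(post_time)
for follower, followee in follows:
    if post_time[follower] > post_time[followee]:
        diffusion.add_edge(followee, follower)

features = {
    "avg_out_degree": sum(d for _, d in diffusion.out_degree()) / diffusion.number_of_nodes(),
    "avg_clustering": nx.average_clustering(diffusion.to_undirected()),
}
print(features)   # candidate network-based features for a classifier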

3.3 Model Construction

In the previous section, we introduced features extracted from different sources, i.e., news content and social context, for fake news detection. In this section, we discuss the details of the model construction process for several existing approaches. Specifically, we categorize existing methods based on their main input sources as News Content Models and Social Context Models.

3.3.1 News Content Models

In this subsection, we focus on news content models, which mainly rely on news content features and existing factual sources to classify fake news. Specifically, existing approaches can be categorized as Knowledge-based and Style-based.

Knowledge-based: Since fake news attempts to spread false claims in news content, the most straightforward means of detecting it is to check the truthfulness of the major claims in a news article to decide the news veracity. Knowledge-based approaches aim to use external sources to fact-check proposed claims in news content. The goal of fact-checking is to assign a truth value to a claim in a particular context [83]. Fact-checking has attracted increasing attention, and many efforts have been made to develop a feasible automated fact-checking system. Existing fact-checking approaches can be categorized as expert-oriented, crowdsourcing-oriented, and computational-oriented.

- Expert-oriented fact-checking heavily relies on human domain experts to investigate relevant data and documents to construct the verdicts of claim veracity, for example PolitiFact^11, Snopes^12, etc. However, expert-oriented fact-checking is an intellectually demanding and time-consuming process, which limits its potential for high efficiency and scalability.

- Crowdsourcing-oriented fact-checking exploits the "wisdom of crowds" to enable normal people to annotate news content; these annotations are then aggregated to produce an overall assessment of the news veracity. For example, Fiskkit^13 allows users to discuss and annotate the accuracy of specific parts of a news article. As another example, an anti-fake news bot named "For real" is a public account in the instant communication mobile application LINE^14, which allows people to report suspicious news content, which is then further checked by editors.

- Computational-oriented fact-checking aims to provide an automatic, scalable system to classify true and false claims. Previous computational-oriented fact-checking methods try to solve two major issues: (i) identifying check-worthy claims and (ii) discriminating the veracity of fact claims. To identify check-worthy claims, factual claims in news content are extracted that convey key statements and viewpoints, facilitating the subsequent fact-checking process [31]. Fact-checking for specific claims largely relies on external resources to determine the truthfulness of a particular claim. Two typical external sources are the open web and structured knowledge graphs. Open web sources are utilized as references that can be compared with given claims in terms of both consistency and frequency [5; 50]. Knowledge graphs are integrated from linked open data as a structured network topology, such as DBpedia and the Google Relation Extraction Corpus. Fact-checking using a knowledge graph aims to check whether the claims in news content can be inferred from existing facts in the knowledge graph [98; 16; 72].

^11 http://www.politifact.com/
^12 http://www.snopes.com/
^13 http://fiskkit.com
^14 https://grants.g0v.tw/projects/588fa7b382223f001e022944

Style-based: Fake news publishers often have malicious intent to spread distorted and misleading information and to influence large communities of consumers, which requires particular writing styles, not seen in true news articles, to appeal to and persuade a wide scope of consumers. Style-based approaches try to detect fake news by capturing these manipulators in the writing style of news content. There are two typical categories of style-based methods: Deception-oriented and Objectivity-oriented.

- Deception-oriented stylometric methods capture the deceptive statements or claims in news content. The motivation for deception detection originates from forensic psychology (i.e., the Undeutsch Hypothesis) [82], and various forensic tools, including Criteria-based Content Analysis [84] and Scientific-based Content Analysis [45], have been developed. More recently, advanced natural language processing models have been applied to spot deception from the following perspectives: deep syntax and rhetorical structure. Deep syntax models have been implemented using probabilistic context-free grammars (PCFG), with which sentences can be transformed into rules that describe the syntactic structure. Based on the PCFG, different rules can be developed for deception detection, such as unlexicalized/lexicalized production rules and grandparent rules [22]. Rhetorical structure theory can be utilized to capture the differences between deceptive and truthful sentences [68]. Deep network models, such as convolutional neural networks (CNN), have also been applied to classify fake news veracity [90].

- Objectivity-oriented approaches capture style signals that can indicate decreased objectivity of news content and thus the potential to mislead consumers, such as hyperpartisan styles and yellow journalism. Hyperpartisan styles represent extreme behavior in favor of a particular political party, which often correlates with a strong motivation to create fake news. Linguistic-based features can be applied to detect hyperpartisan articles [62]. Yellow journalism represents those articles that do not contain well-researched news, but instead rely on eye-catching headlines (i.e., clickbait) with a propensity for exaggeration, sensationalization, scare-mongering, etc. Often, news titles will summarize the major viewpoints of the article that the author wants to convey, and thus misleading and deceptive clickbait titles can serve as a good indicator for recognizing fake news articles [13].

3.3.2 Social Context Models

The nature of social media provides researchers with additional resources to supplement and enhance News Content Models. Social context models include relevant user social engagements in the analysis, capturing this auxiliary information from a variety of perspectives. We can classify existing approaches for social context modeling into two categories: Stance-based and Propagation-based. Note that very few existing fake news detection approaches have utilized social context models. Thus, we also introduce similar methods for rumor detection using social media, which have potential application for fake news detection.

Stance-based: Stance-based approaches utilize users' viewpoints from relevant post contents to infer the veracity of original news articles. The stance of users' posts can be represented either explicitly or implicitly. Explicit stances are direct expressions of emotion or opinion, such as the "thumbs up" and "thumbs down" reactions expressed in Facebook. Implicit stances can be automatically extracted from social media posts. Stance detection is the task of automatically determining from a post whether the user is in favor of, neutral toward, or against some target entity, event, or idea [53]. Previous stance classification methods mainly rely on hand-crafted linguistic or embedding features of individual posts to predict stances [53; 64]. Topic model methods, such as latent Dirichlet allocation (LDA), can be applied to learn latent stance from topics [37]. Using these methods, we can infer the news veracity based on the stance values of relevant posts. Tacchini et al. proposed to construct a bipartite network of users and Facebook posts using the "like" stance information [75]; based on this network, a semi-supervised probabilistic model was used to predict the likelihood of Facebook posts being hoaxes. Jin et al. explored topic models to learn latent viewpoint values and further exploited these viewpoints to learn the credibility of relevant posts and news content [37].

Propagation-based: Propagation-based approaches for fake news detection reason about the interrelations of relevant social media posts to predict news credibility. The basic assumption is that the credibility of a news event is highly related to the credibility of relevant social media posts. Both homogeneous and heterogeneous credibility networks can be built for the propagation process. Homogeneous credibility networks consist of a single type of entity, such as posts or events [37]. Heterogeneous credibility networks involve different types of entities, such as posts, sub-events, and events [36; 29]. Gupta et al. proposed a PageRank-like credibility propagation algorithm by encoding users' credibilities and tweets' implications on a three-layer user-tweet-event heterogeneous information network. Jin et al. proposed to include news aspects (i.e., latent sub-events), build a three-layer hierarchical network, and utilize a graph optimization framework to infer event credibilities. Recently, conflicting viewpoint relationships have been included to build a homogeneous credibility network among tweets and guide the process of evaluating their credibilities [37].
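A toy sketch of the propagation idea is given below: initial post credibility scores are iteratively smoothed over a similarity-weighted network, so that the event inherits credibility from its related posts. The graph, weights, and update rule are illustrative simplifications, not the exact algorithms of [36; 29; 37].

import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("post1", "post2", 0.9),   # edge weight ~ content/stance similarity
    ("post2", "post3", 0.6),
    ("post1", "event", 0.8),   # posts link to the news event they discuss
    ("post3", "event", 0.7),
])
prior = {"post1": 0.9, "post2": 0.2, "post3": 0.5, "event": 0.5}  # initial scores

cred = dict(prior)
for _ in range(20):            # iterate until the scores stabilize
    cred = {
        n: 0.5 * prior[n]
           + 0.5 * sum(cred[m] * g[n][m]["weight"] for m in g[n])
                 / sum(g[n][m]["weight"] for m in g[n])
        for n in g.nodes
    }
print(cred)                    # propagated credibility for posts and the event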

4. ASSESSING DETECTION EFFICACY

In this section, we discuss how to assess the performance of algorithms for fake news detection. We focus on the available datasets and evaluation metrics for this task.

4.1 Datasets

Online news can be collected from different sources, such as news agency homepages, search engines, and social media websites. However, manually determining the veracity of news is a challenging task, usually requiring annotators with domain expertise who perform careful analysis of claims and additional evidence, context, and reports from authoritative sources. Generally, news data with annotations can be gathered in the following ways: expert journalists, fact-checking websites, industry detectors, and crowdsourced workers. However, there are no agreed upon benchmark datasets for the fake news detection problem. Some publicly available datasets are listed below:

- BuzzFeedNews^15: This dataset comprises a complete sample of news published on Facebook by 9 news agencies over a week close to the 2016 U.S. election, from September 19 to 23 and September 26 and 27. Every post and linked article was fact-checked claim-by-claim by 5 BuzzFeed journalists. The dataset was further enriched in [62] by adding the linked articles, attached media, and relevant metadata. It contains 1,627 articles: 826 mainstream, 356 left-wing, and 545 right-wing articles.

- LIAR^16: This dataset is collected from the fact-checking website PolitiFact through its API [90]. It includes 12,836 human-labeled short statements, which are sampled from various contexts, such as news releases, TV or radio interviews, campaign speeches, etc. The labels for news truthfulness are fine-grained multiple classes: pants-fire, false, barely-true, half-true, mostly-true, and true.

- BS Detector^17: This dataset is collected from a browser extension called BS detector, developed for checking news veracity^18. It searches all links on a given webpage for references to unreliable sources by checking against a manually compiled list of domains. The labels are the outputs of BS detector, rather than human annotators.

- CREDBANK^19: This is a large-scale crowdsourced dataset of approximately 60 million tweets that cover 96 days starting from October 2015. All the tweets are broken down to be related to over 1,000 news events, with each event assessed for credibility by 30 annotators from Amazon Mechanical Turk [52].

^15 https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data
^16 https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
^17 https://www.kaggle.com/mrisdal/fake-news
^18 https://github.com/bs-detector/bs-detector
^19 http://compsocial.github.io/CREDBANK-data/

In Table 1, we compare these public fake news detection datasets, highlighting the features that can be extracted from each dataset.

Table 1: Comparison of Fake News Detection Datasets.

                      News Content        Social Context
    Dataset         Linguistic  Visual   User  Post  Network
    BuzzFeedNews        X
    LIAR                X
    BS Detector         X
    CREDBANK            X                 X     X      X

We can see that no existing public dataset provides all possible features of interest. In addition, these datasets have specific limitations that make them challenging to use for fake news detection. BuzzFeedNews only contains headlines and text for each news piece and covers news articles from very few news agencies. LIAR includes mostly short statements, rather than entire news content. Further, these statements are collected from various speakers, rather than news publishers, and may include some claims that are not fake news. BS Detector data is collected and annotated using a developed news veracity checking tool. As the labels have not been properly validated by human experts, any model trained on this data is really learning the parameters of BS Detector, rather than expert-annotated ground truth fake news. Finally, CREDBANK was originally collected for tweet credibility assessment, so the tweets in this dataset are not really the social engagements for specific news articles.

To address the disadvantages of existing fake news detection datasets, we have an ongoing project to develop a usable dataset for fake news detection on social media. This dataset, called FakeNewsNet^20, includes all mentioned news content and social context features with reliable ground truth fake news labels.

^20 https://github.com/KaiDMML/FakeNewsNet

4.2 Evaluation Metrics

To evaluate the performance of algorithms for the fake news detection problem, various evaluation metrics have been used. In this subsection, we review the most widely used metrics for fake news detection. Most existing approaches consider the fake news problem as a classification problem that predicts whether a news article is fake or not:

- True Positive (TP): predicted fake news pieces that are actually annotated as fake news;

- True Negative (TN): predicted true news pieces that are actually annotated as true news;

- False Negative (FN): predicted true news pieces that are actually annotated as fake news;

- False Positive (FP): predicted fake news pieces that are actually annotated as true news.

By formulating this as a classification problem, we can define the following metrics:

    Precision = |TP| / (|TP| + |FP|)                                (2)
    Recall    = |TP| / (|TP| + |FN|)                                (3)
    F1        = 2 * Precision * Recall / (Precision + Recall)       (4)
    Accuracy  = (|TP| + |TN|) / (|TP| + |TN| + |FP| + |FN|)         (5)

These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures the overall fraction of news pieces whose predicted labels match the annotated labels. Precision measures the fraction of all detected fake news that is annotated as fake news, addressing the important problem of identifying which news is fake. However, because fake news datasets are often skewed, a high precision can be easily achieved by making fewer positive predictions. Thus, recall is used to measure the sensitivity, or the fraction of annotated fake news articles that are predicted to be fake news. F1 combines precision and recall, and can provide an overall measure of prediction performance for fake news detection.
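As a practical note, all of these metrics (and the AUC introduced below) are available in scikit-learn; the following sketch computes them on toy label vectors, with 1 denoting fake news.

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 0, 0, 1, 0, 0, 0]                     # annotated labels
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]                     # hard predictions
y_score = [0.9, 0.4, 0.2, 0.1, 0.8, 0.7, 0.3, 0.2]    # scores used for ranking/AUC

print("Precision:", precision_score(y_true, y_pred))  # Eq. (2)
print("Recall:   ", recall_score(y_true, y_pred))     # Eq. (3)
print("F1:       ", f1_score(y_true, y_pred))         # Eq. (4)
print("Accuracy: ", accuracy_score(y_true, y_pred))   # Eq. (5)
print("AUC:      ", roc_auc_score(y_true, y_score))   # Eq. (8), computed from ranks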

Note that for Precision, Recall, F1, and Accuracy, the higher the value, the better the performance.

The Receiver Operating Characteristic (ROC) curve provides a way of comparing the performance of classifiers by looking at the trade-off between the False Positive Rate (FPR) and the True Positive Rate (TPR). To draw the ROC curve, we plot the FPR on the x axis and the TPR along the y axis. The ROC curve compares the performance of different classifiers by changing class distributions via a threshold. TPR and FPR are defined as follows (note that TPR is the same as recall defined above):

    TPR = |TP| / (|TP| + |FN|)                                      (6)
    FPR = |FP| / (|FP| + |TN|)                                      (7)

Based on the ROC curve, we can compute the Area Under the Curve (AUC) value, which measures the overall performance of how likely the classifier is to rank a fake news piece higher than any true news piece. Based on [30], AUC is defined as

    AUC = ( sum_{i=1}^{n0} (n0 + n1 + 1 - r_i) - n0(n0 + 1)/2 ) / (n0 * n1)    (8)

where r_i is the rank of the i-th fake news piece and n0 (n1) is the number of fake (true) news pieces. It is worth mentioning that AUC is more statistically consistent and more discriminating than accuracy [47], and it is usually applied to imbalanced classification problems, such as fake news classification, where the numbers of ground truth fake news articles and true news articles have a very imbalanced distribution.

5. RELATED AREAS

In this section, we further discuss areas that are related to the problem of fake news detection. We aim to point out the differences between these areas and fake news detection by briefly explaining the task goals and highlighting some popular methods.

5.1 Rumor Classification

A rumor can usually be defined as "a piece of circulating information whose veracity status is yet to be verified at the time of spreading" [102]. The function of a rumor is to make sense of an ambiguous situation, and its truthfulness value could be true, false, or unverified. Previous approaches for rumor analysis focus on four subtasks: rumor detection, rumor tracking, stance classification, and veracity classification [102]. Specifically, rumor detection aims to classify a piece of information as rumor or non-rumor [96; 70]; rumor tracking aims to collect and filter posts discussing specific rumors; rumor stance classification determines how each relevant post is oriented with respect to the rumor's veracity; and veracity classification attempts to predict the actual truth value of the rumor. The task most related to fake news detection is rumor veracity classification. Rumor veracity classification relies heavily on the other subtasks, requiring that stances or opinions can be extracted from relevant posts. These posts are considered important sensors for determining the veracity of the rumor. Different from rumors, which may include long-term rumors, such as conspiracy theories, as well as short-term emerging rumors, fake news refers to information related specifically to public news events that can be verified as false.

5.2 Truth Discovery

Truth discovery is the problem of detecting true facts from multiple conflicting sources [46]. Truth discovery methods do not explore the fact claims directly, but rely on a collection of contradicting sources that record the properties of objects to determine the truth value. Truth discovery aims to determine source credibility and object truthfulness at the same time. The fake news detection problem can benefit from various aspects of truth discovery approaches under different scenarios. First, the credibility of different news outlets can be modeled to infer the truthfulness of reported news. Second, relevant social media posts can also be modeled as social response sources to better determine the truthfulness of claims [56; 93]. However, there are some other issues that must be considered to apply truth discovery to fake news detection in social media scenarios. First, most existing truth discovery methods focus on handling structured input in the form of Subject-Predicate-Object (SPO) tuples, while social media data is highly unstructured and noisy. Second, truth discovery methods cannot be well applied when a fake news article is newly launched and published by only a few news outlets, because at that point there are not enough relevant social media posts to serve as additional sources.

5.3 Clickbait Detection

Clickbait is a term commonly used to describe eye-catching and teaser headlines in online media. Clickbait headlines create a so-called "curiosity gap", increasing the likelihood that readers will click the target link to satisfy their curiosity. Existing clickbait detection approaches utilize various linguistic features extracted from teaser messages, linked webpages, and tweet meta information [12; 8; 63]. Different types of clickbait have been categorized, and some of them are highly correlated with non-factual claims [7]. The underlying motivation of clickbait is usually click-through rates and the resultant advertising revenue. Thus, the body text of clickbait articles is often informally organized and poorly reasoned. This discrepancy has been used by researchers to identify the inconsistency between headlines and news contents in an attempt to detect fake news articles^21. Even though not all fake news includes clickbait headlines, specific clickbait headlines can serve as an important indicator, and various features can be utilized to help detect fake news.
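As a minimal illustration of such linguistic-feature approaches, the sketch below trains a toy clickbait-headline classifier with scikit-learn. The headlines, labels, and model choice are our own illustrative assumptions, not drawn from [12; 8; 63].

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "You won't believe what this senator said next",      # clickbait
    "10 shocking facts doctors don't want you to know",   # clickbait
    "Federal reserve raises interest rates by 0.25 points",
    "City council approves new transit budget",
]
labels = [1, 1, 0, 0]   # 1 = clickbait, 0 = not

# TF-IDF uni- and bigram features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(headlines, labels)
print(clf.predict(["You won't believe this budget vote"]))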

15
58; 59], but quantitative studies to verify these psychological
factors are rather limited. For example, the echo chamber e
ect plays an important role for fake news spreading in so-cial
ing motivation of clickbait is usually for click-through rates media. Then how to capture echo chamber e ects and how
and the resultant advertising revenue. Thus, the body text of to utilize the pattern for fake news detection in social media
clickbait articles are often informally organized and poorly
could be an interesting investigation. Moreover, in-tention
reasoned. This discrepancy has been used by researchers
to identify the inconsistency between headlines and news detection from news data is promising but limited as most
21 existing fake news research focus on detecting the
con-tents in an attempt to detect fake news articles . Even authenticity but ignore the intent aspect of fake news. In-
though not all fake news may include clickbait headlines, tention detection is very challenging as the intention is often
speci c clickbait headlines could serve as an important in- explicitly unavailable. Thus. it’s worth to explore how to use
dicator, and various features can be utilized to help detect data mining methods to validate and capture psychol-ogy
fake news. intentions.
5.4 Spammer and Bot Detection

Spammer detection on social media, which aims to capture malicious users that coordinate among themselves to launch various attacks, such as spreading ads, disseminating pornography, delivering viruses, and phishing [44], has recently attracted wide attention. Existing approaches for social spammer detection mainly rely on extracting features from user activities and social network information [35; 95; 33; 34]. In addition, the rise of social bots has increased the circulation of false information, as they automatically retweet posts without verifying the facts [23]. The major challenge brought by social bots is that they can give the false impression that information is highly popular and endorsed by many people, which enables the echo chamber effect for the propagation of fake news. Previous approaches for bot detection are based on social network information, crowdsourcing, and discriminative features [23; 55; 54]. Thus, both spammer detection and bot detection can provide insights for identifying specific malicious social media accounts, which in turn can be used for fake news detection.
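As a concrete illustration of the discriminative-feature direction, the sketch below trains a classifier on a few shallow account-level behavioral signals. The field names, feature choices, and toy accounts are assumptions made for illustration; deployed bot detectors use far richer feature sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def account_features(a: dict) -> list:
    """Shallow behavioral features from a user activity summary."""
    return [
        a["posts_per_day"],                       # posting rate
        a["retweet_ratio"],                       # fraction of posts that are retweets
        a["followers"] / max(a["following"], 1),  # follower/following ratio
        a["account_age_days"],                    # account age
        int(a["default_profile_image"]),          # unchanged default avatar
    ]

# Toy training data: (account summary, 1 = bot/spammer, 0 = human).
accounts = [
    ({"posts_per_day": 480, "retweet_ratio": 0.97, "followers": 12,
      "following": 4900, "account_age_days": 20, "default_profile_image": True}, 1),
    ({"posts_per_day": 3, "retweet_ratio": 0.20, "followers": 310,
      "following": 280, "account_age_days": 2400, "default_profile_image": False}, 0),
    ({"posts_per_day": 250, "retweet_ratio": 0.90, "followers": 45,
      "following": 5100, "account_age_days": 45, "default_profile_image": True}, 1),
    ({"posts_per_day": 8, "retweet_ratio": 0.35, "followers": 900,
      "following": 400, "account_age_days": 1800, "default_profile_image": False}, 0),
]
X = np.array([account_features(a) for a, _ in accounts])
y = np.array([label for _, label in accounts])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```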
6. OPEN ISSUES AND FUTURE RESEARCH

In this section, we present some open issues in fake news detection and future research directions. Fake news detection on social media is a newly emerging research area, so we aim to point out promising research directions from a data mining perspective. Specifically, as shown in Figure 2, we outline the research directions in four categories: Data-oriented, Feature-oriented, Model-oriented, and Application-oriented.

Figure 2: Future directions and open issues for fake news detection on social media.
Data-oriented: Data-oriented fake news research focuses on different kinds of data characteristics: dataset, temporal, and psychological. From a dataset perspective, we demonstrated that no existing benchmark dataset includes resources to extract all relevant features. A promising direction is to create a comprehensive, large-scale fake news benchmark dataset that researchers can use to facilitate further work in this area. From a temporal perspective, fake news dissemination on social media demonstrates unique temporal patterns different from true news. Along this line, one interesting problem is early fake news detection, which aims to give early alerts of fake news during the dissemination process; for example, such an approach could use only the social media posts observed within some time delay of the original post as sources for news verification [37]. Detecting fake news early can help prevent its further propagation on social media. From a psychological perspective, different aspects of fake news have been qualitatively explored in the social psychology literature [92; 58; 59], but quantitative studies to verify these psychological factors are rather limited. For example, the echo chamber effect plays an important role in fake news spreading on social media, so how to capture echo chamber effects and how to utilize such patterns for fake news detection could be an interesting investigation. Moreover, intention detection from news data is promising but underexplored, as most existing fake news research focuses on detecting authenticity while ignoring the intent behind fake news. Intention detection is very challenging because the intention is often not explicitly available, so it is worth exploring how data mining methods can validate and capture psychological intentions.
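A minimal sketch of the early-detection setup just described: restrict the observable social context to posts within a fixed delay of the earliest post, and extract features only from that prefix. The post fields and the delay value are illustrative assumptions.

```python
from datetime import datetime, timedelta

def early_window(posts: list, delay_hours: float) -> list:
    """Keep only the posts observed within `delay_hours` of the earliest post."""
    posts = sorted(posts, key=lambda p: p["timestamp"])
    cutoff = posts[0]["timestamp"] + timedelta(hours=delay_hours)
    return [p for p in posts if p["timestamp"] <= cutoff]

posts = [
    {"id": 1, "timestamp": datetime(2017, 3, 1, 9, 0), "text": "Breaking: ..."},
    {"id": 2, "timestamp": datetime(2017, 3, 1, 10, 30), "text": "RT Breaking: ..."},
    {"id": 3, "timestamp": datetime(2017, 3, 2, 8, 0), "text": "This was debunked."},
]
# Only the first two posts fall inside a 5-hour observation window.
print([p["id"] for p in early_window(posts, delay_hours=5)])  # -> [1, 2]
```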
Feature-oriented: Feature-oriented fake news research aims to determine effective features for detecting fake news from multiple data sources. We have demonstrated two major data sources: news content and social context. From a news content perspective, we introduced linguistic-based and visual-based techniques to extract features from text information. Note that linguistic-based features have been widely studied for general NLP tasks, such as text classification and clustering, and for specific applications such as author identification [32] and deception detection [22], but the underlying characteristics of fake news have not been fully understood. Moreover, embedding techniques, such as word embedding and deep neural networks, are attracting much attention for textual feature extraction and have the potential to learn better representations [90; 87; 88]. In addition, visual features extracted from images have also been shown to be important indicators for fake news [38]. However, very limited research has been done to exploit effective visual features, including traditional local and global features [61] and newly emerging deep network-based features [43; 89; 85], for the fake news detection problem. Recently, it has been shown that advanced tools can manipulate video footage of public figures [80], synthesize high-quality videos [74], etc. Thus, it becomes much more challenging and important to differentiate real and fake visual content, and more advanced visual-based features are needed for this research. From a social context perspective, we introduced user-based, post-based, and network-based features. Existing user-based features mainly focus on general user profiles, rather than differentiating account types and extracting user-specific features. Post-based features can be represented using other techniques, such as convolutional neural networks (CNN) [69], to better capture people's opinions and reactions toward fake news. Images in social media posts can also be utilized to better understand users' sentiments [91] toward news events. Network-based features are extracted to represent how different types of networks are constructed. It is important to extend this preliminary work to explore (i) how other networks can be constructed in terms of different aspects of relationships among relevant users and posts; and (ii) other advanced methods of network representation, such as network embedding [78; 86].
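As one concrete instance of embedding-based textual features, the sketch below represents a post as the average of its word vectors; the resulting vector can feed any downstream classifier. The tiny embedding table is a stand-in for real pre-trained vectors (e.g., word2vec or GloVe).

```python
import numpy as np

# Toy 4-dimensional "pre-trained" word vectors; real systems would load
# word2vec/GloVe vectors with hundreds of dimensions.
EMBED = {
    "breaking": np.array([0.9, 0.1, 0.0, 0.2]),
    "news":     np.array([0.8, 0.2, 0.1, 0.1]),
    "hoax":     np.array([0.1, 0.9, 0.7, 0.0]),
    "verified": np.array([0.2, 0.1, 0.0, 0.9]),
}
DIM = 4

def post_vector(text: str) -> np.ndarray:
    """Average embedding of the in-vocabulary words of a post."""
    tokens = [w.strip(":,.!?") for w in text.lower().split()]
    vecs = [EMBED[t] for t in tokens if t in EMBED]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

print(post_vector("BREAKING news: total hoax"))
```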
Model-oriented: Model-oriented fake news research opens the door to building more effective and practical models for fake news detection. Most previously mentioned approaches focus on extracting various features, incorporating these features into supervised classification models, such as naïve Bayes, decision trees, logistic regression, k-nearest neighbors (KNN), and support vector machines (SVM), and then selecting the classifier that performs best [62; 75; 1]. More research can be done to build more complex and effective models and to better utilize extracted features, such as aggregation methods, probabilistic methods, ensemble methods, or projection methods [73]. Specifically, we think there is promising research in the following directions. First, aggregation methods combine different feature representations into a weighted form and optimize the feature weights. Second, since fake news may commonly mix true statements with false claims, it may make more sense to predict the likelihood of fake news instead of producing a binary value; probabilistic models predict a probabilistic distribution of class labels (i.e., fake news versus true news) by assuming a generative model that pulls from the same distribution as the original feature space [25]. Third, one of the major challenges for fake news detection is that each feature, such as source credibility, news content style, or social response, has limitations when used to predict fake news on its own. Ensemble methods build a conjunction of several weak classifiers to learn a strong classifier that is more successful than any individual classifier alone; ensembles have been widely applied to various applications in the machine learning literature [20]. It may be beneficial to build ensemble models, as news content features and social context features each carry supplementary information that has the potential to boost fake news detection performance. Finally, fake news content or social context information may be noisy in the raw feature space; projection methods refer to approaches that learn projection functions to map between original feature spaces (e.g., news content features and social context features) and latent feature spaces that may be more useful for classification.
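To make the ensemble direction concrete, the sketch below trains one weak classifier on news-content features and another on social-context features, then combines them by soft voting. The feature matrices are random placeholders standing in for the extracted features discussed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
n = 200
X_content = rng.normal(size=(n, 10))   # placeholder linguistic/visual features
X_context = rng.normal(size=(n, 6))    # placeholder user/post/network features
X = np.hstack([X_content, X_context])  # columns 0-9 content, 10-15 context
y = rng.integers(0, 2, size=n)         # placeholder fake/true labels

# Each base learner sees only "its" feature block.
content_clf = make_pipeline(FunctionTransformer(lambda Z: Z[:, :10]),
                            LogisticRegression())
context_clf = make_pipeline(FunctionTransformer(lambda Z: Z[:, 10:]),
                            DecisionTreeClassifier(max_depth=3))

ensemble = VotingClassifier(estimators=[("content", content_clf),
                                        ("context", context_clf)],
                            voting="soft").fit(X, y)
print(ensemble.predict(X[:5]))
```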
Moreover, most existing approaches are supervised, which requires a pre-annotated fake news ground truth dataset to train a model. However, obtaining a reliable fake news dataset is very time- and labor-intensive, as the process often requires expert annotators to perform careful analysis of claims and additional evidence, context, and reports from authoritative sources. Thus, it is also important to consider scenarios where limited or no labeled fake news pieces are available, in which semi-supervised or unsupervised models can be applied. While the models created by supervised classification methods may be more accurate given a well-curated ground truth dataset for training, unsupervised models can be more practical because unlabeled datasets are easier to obtain.
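A minimal sketch of the semi-supervised scenario: only two news pieces are labeled (0 = true, 1 = fake, -1 = unlabeled), and scikit-learn's LabelSpreading propagates the labels over a nearest-neighbor graph in feature space. The two Gaussian clusters are placeholder features.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 5)),   # cluster of (say) true news
               rng.normal(4, 1, size=(50, 5))])  # cluster of (say) fake news
y = np.full(100, -1)      # almost everything is unlabeled
y[0], y[50] = 0, 1        # one labeled example per cluster

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
print(model.transduction_[:5], model.transduction_[50:55])  # inferred labels
```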

Application-oriented: Application-oriented fake news research encompasses work that goes beyond fake news detection itself. We propose two major directions along these lines: fake news diffusion and fake news intervention. Fake news diffusion characterizes the diffusion paths and patterns of fake news on social media sites. Some early research has shown that true information and misinformation follow different patterns when propagating in online social networks [18; 51]. Similarly, the diffusion of fake news in social media demonstrates its own characteristics that need further investigation, such as social dimensions, life cycle, and spreader identification. Social dimensions refer to the heterogeneity and weak dependency of social connections within different social communities. Users' perceptions of fake news pieces are highly affected by their like-minded friends on social media (i.e., echo chambers), while the degree differs along different social dimensions. Thus, it is worth exploring why and how different social dimensions play a role in spreading fake news in terms of different topics, such as politics, education, and sports. The fake news diffusion process also has different stages in terms of people's attention and reactions as time goes by, resulting in a unique life cycle. Research has shown that breaking news and in-depth news demonstrate different life cycles in social media [10]. Studying the life cycle of fake news will provide a deeper understanding of how particular stories "go viral" from normal public discourse. Tracking the life cycle of fake news on social media requires recording the essential trajectories of fake news diffusion in general [71], as well as further investigation of the diffusion process for specific fake news pieces, such as graph-based models and evolution-based models [27]. In addition, identifying key spreaders of fake news is crucial for mitigating the diffusion scope in social media. Note that key spreaders can be categorized along two dimensions: stance and authenticity. Along the stance dimension, spreaders can be either (i) clarifiers, who propose skeptical and opposing viewpoints toward fake news and try to clarify it; or (ii) persuaders, who spread fake news with supporting opinions to persuade others to believe it. In this sense, it is important to explore how to detect clarifiers and persuaders and better use them to control the dissemination of fake news. From an authenticity perspective, spreaders can be human, bot, or cyborg.
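To illustrate spreader identification, the sketch below builds a toy repost graph and ranks accounts by PageRank on the reversed graph, so that accounts whose posts are widely reshared score highest. The edge list is an assumption; stance labels (clarifier versus persuader) would come from a separate stance classifier.

```python
import networkx as nx

# Directed edge u -> v means v reposted a fake news piece from u.
edges = [("p1", "u2"), ("p1", "u3"), ("p1", "u4"), ("p1", "u6"),
         ("u2", "u5"), ("c1", "u3")]
G = nx.DiGraph(edges)

# Reverse the edges so that influence flows back to the original sources.
influence = nx.pagerank(G.reverse())
top_spreaders = sorted(influence, key=influence.get, reverse=True)[:3]
print(top_spreaders)  # headed by the heavily reshared account "p1"
```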
Social bots have been used to intentionally spread fake news in social media, which motivates further research to better characterize and detect malicious accounts designed for propaganda.

Finally, we also propose further research into fake news intervention, which aims to reduce the effects of fake news through proactive intervention methods that minimize its spread, or through reactive intervention methods applied after fake news goes viral. Proactive fake news intervention methods try to (i) remove malicious accounts that spread fake news, or remove the fake news itself, to isolate it from future consumers; and (ii) immunize users with true news to correct the beliefs of users who may already have been affected by fake news. There is recent research that attempts to use content-based and network-based immunization methods for misinformation intervention [94; 97]. One approach uses a multivariate Hawkes process to model both true news and fake news and to mitigate the spreading of fake news in real time [21]. The aforementioned spreader detection techniques can also be applied to target certain users (e.g., persuaders) in social media to stop them from spreading fake news, or other users (e.g., clarifiers) to maximize the spread of the corresponding true news.
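To make the point-process view concrete, the sketch below evaluates the intensity (expected posting rate) of a univariate Hawkes process with an exponential kernel: each repost momentarily raises the intensity, which then decays. The cited work [21] uses a multivariate version over both true and fake news; the parameters here are illustrative assumptions.

```python
import math

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

reposts = [0.0, 0.4, 0.5, 2.0]  # event times, e.g., hours after the first post
for t in (0.6, 1.0, 3.0):
    print(f"intensity at t = {t}: {hawkes_intensity(t, reposts):.3f}")
```

Under this view, an intervention that removes a malicious account deletes its future events, lowering the intensity and hence the expected number of subsequent reposts.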
7. CONCLUSION

With the increasing popularity of social media, more and more people consume news from social media instead of traditional news media. However, social media has also been used to spread fake news, which has strong negative impacts on individual users and broader society. In this article, we explored the fake news problem by reviewing existing literature in two phases: characterization and detection. In the characterization phase, we introduced the basic concepts and principles of fake news in both traditional media and social media. In the detection phase, we reviewed existing fake news detection approaches from a data mining perspective, including feature extraction and model construction. We also discussed the datasets, evaluation metrics, and promising future directions in fake news detection research, expanding the field to other applications.

8. ACKNOWLEDGEMENTS

This material is based upon work supported by, or in part by, the ONR grant N00014-16-1-225
REFERENCES

[1] Sadia Afroz, Michael Brennan, and Rachel Greenstadt. Detecting hoaxes, frauds, and deception in writing style online. In ISSP'12.

[2] Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016 election. Technical report, National Bureau of Economic Research, 2017.

[3] Solomon E. Asch and H. Guetzkow. Effects of group pressure upon the modification and distortion of judgments. Groups, leadership, and men, pages 222–236, 1951.

[4] Meital Balmas. When fake news becomes real: Combined exposure to multiple news sources and political attitudes of inefficacy, alienation, and cynicism. Communication Research, 41(3):430–454, 2014.

[5] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI'07.

[6] Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 US presidential election online discussion. First Monday, 21(11), 2016.

[7] Prakhar Biyani, Kostas Tsioutsiouliklis, and John Blackmer. "8 amazing secrets for getting more clicks": Detecting clickbaits in news streams using article informality. In AAAI'16.

[8] Jonas Nygaard Blom and Kenneth Reinecke Hansen. Click bait: Forward-reference as lure in online news headlines. Journal of Pragmatics, 76:87–100, 2015.

[9] Paul R. Brewer, Dannagal Goldthwaite Young, and Michelle Morreale. The impact of real news about fake news: Intertextual processes and political satire. International Journal of Public Opinion Research, 25(3):323–343, 2013.

[10] Carlos Castillo, Mohammed El-Haddad, Jurgen Pfeffer, and Matt Stempeck. Characterizing the life cycle of online news stories using social media reactions. In CSCW'14.

[11] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. Information credibility on twitter. In WWW'11.
