The History of Digital Spam
Emilio Ferrara
University of Southern California
Information Sciences Institute
Marina Del Rey, CA
[email protected]
ACM Reference Format:
Emilio Ferrara. 2019. The History of Digital Spam. In Communications of the ACM, August 2019, Vol. 62 No. 8, Pages 82-91. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3299768

arXiv:1908.06173v1 [cs.CY] 14 Aug 2019

Spam!: that's what Lorrie Faith Cranor and Brian LaMacchia exclaimed in the title of a popular call-to-action article that appeared twenty years ago in Communications of the ACM [10]. And yet, despite the tremendous efforts of the research community over the last two decades to mitigate this problem, the sense of urgency remains unchanged, as emerging technologies have brought new dangerous forms of digital spam under the spotlight. Furthermore, when spam is carried out with the intent to deceive or influence at scale, it can alter the very fabric of society and our behavior. In this article, I will briefly review the history of digital spam: starting from its quintessential incarnation, spam emails, to modern-day forms of spam affecting the Web and social media, the survey will close by depicting future risks associated with spam and abuse of new technologies, including Artificial Intelligence (e.g., Digital Humans). After providing a taxonomy of spam and its most popular applications that emerged throughout the last two decades, I will review technological and regulatory approaches proposed in the literature, and suggest some possible solutions to tackle this ubiquitous digital epidemic moving forward.

1 TYPES OF SPAM

An omni-comprehensive, universally-acknowledged definition of digital spam is hard to formalize. Laws and regulations have attempted to define particular forms of spam, e.g., email (cf. 2003's Controlling the Assault of Non-Solicited Pornography and Marketing Act). However, nowadays, spam occurs in a variety of forms, and across different techno-social systems. Each domain may warrant a slightly different definition that suits what spam is in that precise context: some features of spam in a domain, e.g., volume in mass spam campaigns, may not apply to others, e.g., carefully targeted phishing operations.

In an attempt to propose a general taxonomy, I here define digital spam as the attempt to abuse, or manipulate, a techno-social system by producing and injecting unsolicited and/or undesired content aimed at steering the behavior of humans or the system itself, at the direct or indirect, immediate or long-term advantage of the spammer(s).

This broad definition will allow me to track, in an inclusive manner, the evolution of digital spam across its most popular applications, starting from spam emails to modern-day spam. For each highlighted application domain, I will dive deep to understand the nuances of different digital spam strategies, including their intents and catalysts and, from a technical standpoint, how they are carried out and how they can be detected.

Wikipedia provides an extensive list of domains of application: "While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social spam, spam mobile apps, television advertising and file sharing spam" (cf. https://en.wikipedia.org/wiki/Spamming).

Table 1 summarizes a few examples of types of spam and relative context, including whether there exist machine-learning (ML) solutions to each problem.

Spam Type            Start   Today's Volume               ML   Ref
Email                1978    Billions x day               ✓    [10]
Instant Messaging    1997    Millions x day               ✓    [20]
Search Engine        1998    Unknown                      ✓    [31]
Wiki                 2001    Thousands x day              -    [1]
Opinion & Reviews    2005    Millions across platforms    ✓    [11]
Mobile Messaging     2007    Millions x day               ✓    [3]
Social Bots          2010    Millions across platforms    ✓    [16]
False News           2016    Thousands across Websites    -    [36]
Multimedia           2018    Unknown                      -    [25]

Table 1: Examples of types of spam and relative statistics.

Email is known to be historically the first example of digital spam (cf. Figure 1) and remains uncontested in scale and pervasiveness, with billions of spam emails generated every day [10]. In the late 1990s, spam landed on instant messaging (IM) platforms (SPIM), starting from AIM (AOL Instant Messenger™) and evolving through modern-day IM systems such as WhatsApp™, Facebook Messenger™, WeChat™, etc. A widespread form of spam that emerged in the same period was Web search engine manipulation: content spam and link farms allowed spammers to boost the position of a target Website in the search result rankings of popular search engines, by gaming algorithms like PageRank and the like. With the success of the Social Web [22], in the early 2000s we witnessed the rise of many new forms of spam, including Wiki spam (injecting spam links into Wikipedia pages [1]), opinion and review spam (promoting or smearing products by generating fake online reviews [27]), and mobile messaging spam (SMS and text
[Figure 1: Timeline of the major milestones in the history of spam, from its inception to modern days. Milestones: 1898, The Spanish Prisoner (The New York Times reports of unsolicited messages circulating in association with an old swindle); early 1900s, Post Mail (advertisement based on unsolicited content has been mailed to our doors by post mail services for over a century); 1978, ARPANET (the first reported case of email spam is attributed to Digital Equipment Corporation and circulated to 400 ARPANET users); mid 1990s, The Email Epidemic (a growing fraction of emails is spam; platforms and ISPs start investing in spam filtering techniques); early 2000s, Search Engines (Web content spam and link farms are common forms of spamdexing, the manipulation of Web search result ranking); 2000s, Social Networks (the rise of Facebook, Twitter, and Reddit leads to new opportunities for spammers to reach billions of Social Web users); mid 2000s, Fake Reviews (giants of e-commerce like Amazon and Alibaba fight the manipulation of product popularity by opinion spam); early 2010s, Social Bots (millions of accounts operated by software populate social media to carry out nefarious spam campaigns); mid 2010s, Social Phishing and False News (social engineering and disguise may allow attackers to trick victims into revealing sensitive information; ransomware is used to extort funds from the victims; spam Websites are created to deliberately propagate false news related to politics, public health, and social issues); 2018+, AI Spam (systems based on AI can manipulate reality, producing indistinguishable alternatives; AIs can also be targets of manipulation and spam to elicit behaviors of the AI system or of its users).]
messages sent directly to mobile devices [3]). Ultimately, in the last decade, with the increasing pervasiveness of online social networks and the significant advancements in Artificial Intelligence (AI), new forms of spam involve social bots (accounts operated by software to interact at scale with Social Web users [16]), false news Websites (to deliberately spread disinformation [36]), and multimedia spam based on AI [25].

In the following, I will focus on three of these domains: email spam, Web spam (specifically, opinion spam and fake reviews), and social spam (with a focus on social bots). Furthermore, I will highlight the existence of a new form of spam that I will call AI spam. I will provide examples of spam in this new domain, and lay out the risks associated with it and possible mitigation strategies.

2 FLOODED BY JUNK EMAILS

2.1 The Origins of Email Spam

Cranor and LaMacchia [10], in their 1998 Communications of the ACM article, characterized the problem of junk emails, or email spam, as one of the earliest forms of digital spam.

Email spam has mainly two purposes, namely advertising (e.g., promoting products, services, or content) and fraud (e.g., attempting to perpetrate scams, or phishing). Neither idea was particularly new or unique to the digital realm: advertisement based on unsolicited content delivered by traditional post mail (and, later, phone calls, including more recently the so-called "robo-calls") has been around for nearly a century. As for scams, the first reports of the popular advance-fee scam (in modern days known as the 419 scam, a.k.a. the Nigerian Prince scam), then called the Spanish Prisoner scam, were circulating in the late 1800s.¹

¹ See The New York Times, March 20, 1898: https://www.nytimes.com/1898/03/20/archives/an-old-swindle-revived-the-spanish-prisoner-and-buried-treasure.html

The first reported case of digital spam occurred in 1978 and was attributed to Digital Equipment Corporation, who announced their new computer system to over 400 subscribers of ARPANET, the precursor network of the modern Internet (see Figure 1). The first mass email campaign occurred in 1994, known as the USENET green card lottery spam: the law firm of Canter & Siegel advertised their immigration-related legal services simultaneously to over six thousand USENET newsgroups. This event contributed to popularizing the term spam. Both the ARPANET and USENET cases brought serious consequences to their perpetrators, as they were seen as egregious violations of the common code of conduct in the early days of the Internet (for example, Canter & Siegel ran out of business and Canter was disbarred by the Arizona bar association). However, things were bound to change as the Internet became an increasingly pervasive technology in our society.

2.2 Email Spam: Risks and Challenges

The use of the Internet for distributing unsolicited messages provides unparalleled scalability and unprecedented reach, at a cost that is infinitesimal compared to what it would take to accomplish the same results via traditional means [10]. These three conditions created the ideal confluence of economic incentives that made email spam so pervasive.
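These economics also explain why automated filtering became indispensable. As a flavor of the statistical approach that dominated spam filtering for years (naive Bayesian filtering, discussed in the Sidebar: Detecting Spam Emails), the sketch below scores a message by comparing per-word likelihoods under a spam and a ham model. It is a minimal toy illustration, not any specific production system; the tiny training corpus is made up for exposition.

```python
import math
from collections import Counter

# Toy naive Bayes spam scorer. Assumes uniform class priors and
# word-level independence; uses Laplace (add-one) smoothing.

def train(spam_docs, ham_docs):
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab

def spam_log_odds(message, spam_counts, ham_counts, vocab):
    """Return log P(words|spam) - log P(words|ham); positive means spammy."""
    s_total = sum(spam_counts.values())
    h_total = sum(ham_counts.values())
    v = len(vocab)
    score = 0.0
    for w in message.lower().split():
        p_w_spam = (spam_counts[w] + 1) / (s_total + v)  # Laplace smoothing
        p_w_ham = (ham_counts[w] + 1) / (h_total + v)
        score += math.log(p_w_spam) - math.log(p_w_ham)
    return score

# Made-up miniature corpus, purely for illustration.
spam = ["win cash now", "free money win big", "claim your free prize now"]
ham = ["meeting notes attached", "lunch tomorrow at noon",
       "draft of the paper attached"]
sc, hc, vocab = train(spam, ham)

print(spam_log_odds("win free cash", sc, hc, vocab) > 0)        # True
print(spam_log_odds("paper draft attached", sc, hc, vocab) < 0)  # True
```

Real filters of the era added many refinements (token weighting, header features, online updates), but the core idea is this likelihood-ratio test over message tokens.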
In contrast to old-school post mail spam, digital email spam introduced a number of unique challenges [10]: (i) if left unfiltered, spam emails can easily outnumber legitimate ones, overwhelming the recipients and thus rendering the email experience from unpleasant to unusable; (ii) email spam often contains explicit content that can hurt the sensibility of the recipients, and, depending upon the sender's or recipient's country's laws, perpetrating this form of spam could constitute a criminal offense;² (iii) by embedding HTML or Javascript code into spam emails, spammers can emulate the look and feel of legitimate emails, tricking the recipients and eliciting unsuspecting behaviors, thus enacting scams or enabling phishing attacks [23]; finally, (iv) mass spam operations pose a burden on Internet Service Providers (ISPs), which have to process and route unnecessary, and often large, amounts of digital junk information to millions of recipients, or even more for the larger spam campaigns.

² E.g., see the U.S. Federal Law on Obscenity: https://www.justice.gov/criminal-ceos/citizens-guide-us-federal-law-obscenity

The Internet was originally designed by and for tech-savvy users: spammers quickly developed ways to take advantage of the unsophisticated ones. Phishing is the practice of using deception and social engineering strategies by which attackers manage to trick victims by disguising themselves as a trusted entity [9, 23]. The end goal of phishing attacks is duping the victims into revealing sensitive information for identity theft, or extorting funds via ransomware or credit card fraud. Email has been by far the most common vector of phishing attacks. In 2006, Indiana University carried out a study to quantify the effectiveness of phishing emails [23]. The researchers demonstrated that a malicious attacker impersonating the university would have a 16% success rate in obtaining the users' credentials when the phishing email came from an unknown sender; however, the success rate rose to 72% when the email came from an attacker impersonating a friend of the victim.

2.3 Fighting Email Spam

Over the course of the last two decades, solutions to the problem of email spam revolved around implementing new regulatory policies, increasingly sophisticated technical hurdles, and combinations of the two [10]. Regarding the former, in the context of the U.S. or the European Union (EU), policies that regulate access to personal information (including email addresses), such as the EU's General Data Protection Regulation (GDPR) enacted in 2018, hinder the ability of bulk mailers based in EU countries to effectively carry out mass email spam operations without risks and possibly serious consequences. However, it has become increasingly obvious that solutions based exclusively on regulatory affairs are ineffective: spam operations can move to countries with less restrictive Internet regulations. Regulatory approaches in conjunction with technical solutions, however, have brought significant progress in the fight against email spam.

From a technical standpoint, two decades of research advancements led to sophisticated techniques that strongly mitigate the amount of spam emails ending up in the intended recipients' inboxes. A number of review papers have been published that surveyed data mining and machine learning approaches to detect and filter out email spam [7], some with a specific focus on scams and phishing spam [21].

In the Sidebar: Detecting Spam Emails, I summarize some of the technical milestones accomplished in the quest to identify spam emails. Unfortunately, I suspect that much of the state-of-the-art research on spam detection lies behind closed curtains, mainly for three reasons: (i) large email-related service providers, such as Google (Gmail™), Microsoft (Outlook™, Hotmail™), and Cisco (IronPort™, Email Security Appliance, ESA™), devote(d) massive R&D investments to develop machine learning methods to automatically filter out spam in the platforms they operate (Google, Microsoft, etc.) or protect (Cisco); the companies are thus often incentivized to use patented and closed-source solutions to maintain their competitive advantage; (ii) related to the former point, fighting email spam is a continuous arms race: revealing one's spam filtering technology gives out information that can be exploited by the spammers to create more sophisticated campaigns that can effectively and systematically escape detection, thus calling for more secrecy. Finally, (iii) the accuracy of email spam detection systems deployed by these large service providers has been approaching nearly-perfect detection: a diminishing-returns mechanism comes into play, where additional efforts to further refine detection algorithms may not warrant the costs of developing increasingly sophisticated techniques fueling complex spam detection systems; this makes established approaches even more valuable and trusted, thus motivating the secrecy of their functioning.

Sidebar: Detecting Spam Emails

Email spam detection is an arms race between attackers (spammers) and defenders (service providers). Two decades of research in the data mining and machine learning communities produced troves of techniques to tackle this problem. Some milestones include:

SMTP solutions. SMTP is the protocol at the foundation of the email exchange infrastructure. Blacklists were introduced to keep track of spam propagators [7]. Mail servers can consult blacklisting services to determine whether to route emails to their destination. A softer version of blacklisting is greylisting. Greylists keep track of triplets of IP addresses (sender, receiver, SMTP host) involved in an email exchange. The first time a triplet involving a dubious SMTP host appears, the exchange is denied, but the triplet is stored to authorize future exchanges. This is based on the rationale that spammers rarely retry sending spam through the same relay, and was proven effective in reducing early spam circulation [7]. Another approach is keyword-based filtering: whenever the subject or the body of an email contains flagged terms (belonging to a keyword list), the SMTP service provider would not route it to its intended recipient, and would flag the sending offender; multiple offenses would lead to permanent bans. Other strategies like DomainKeys Identified Mail (DKIM) and digital signatures are authentication methods designed to detect email spoofing and assess email provenance.

Supervised learning. In their seminal work, Drucker et al. [13] proposed one of the first machine learning systems for spam detection, based on Support Vector Machines (then the state of the art in supervised learning). The success of supervised learning over traditional keyword-based filters demonstrated by Drucker et al. [13] motivated the first wave of machine learning research in email spam detection. Shortly after, Androutsopoulos et al. [4] showed the power of naive Bayesian anti-spam filtering: Bayesian systems yielded state-of-the-art spam detection performance for many years. The advent of more sophisticated learning models, like boosting trees, set the accuracy bar higher, but paradigm shifts lagged for nearly a decade.

Hybrid neural systems. More recently, Wu [37] proposed behavior-based spam detection using combinations of simple association rules and neural networks. Given their ability to naturally handle visual information, neural network methods to detect spam were extended to multimedia content. For example, Wu et al. [38] and Fumera et al. [17] proposed methods exploiting visual cues to detect spam content injected in images embedded into emails.

Dedicated hardware. Networking companies are developing anti-spam appliances. Dedicated hardware can detect various types of spam, including phishing, malware, and ransomware, guaranteeing high efficiency and accuracy. For example, Cisco advertises that their Email Security Appliance (ESA™) detects over 99.9% of incoming spam email with a lower than 1 in a million false positive rate.

3 WEB 2.0 OR SPAM 2.0?

The new millennium brought us the Social Web, or Web 2.0, a paradigm shift with an emphasis on user-generated content and on the participatory, interactive nature of the Web experience [22]. From knowledge production (Wikipedia) to personalized news (social media) and social groups (online social networks), from blogs to image and video sharing sites, from collaborative tagging to social e-commerce, this wealth of new opportunities brought us as many new forms of spam, commonly referred to as social spam.

Differently from email spam, where spam can only be conveyed in one form (i.e., emails), social spam can appear in multiple forms and modi operandi. Social spam can take the form of textual content (e.g., a secretly-sponsored post on social media) or multimedia (e.g., a manufactured photo on 4chan); social spam can aim at pointing users to unreliable resources, e.g., URLs to unverified information or false news Websites [36]; social spam can aim at altering the popularity of digital entities, e.g., by manipulating user votes (upvotes on Reddit™ posts, retweets on Twitter™), and even that of physical products, e.g., by posting fake online reviews (e.g., about a product on an e-commerce Website).

3.1 Spammy Opinions

In the early 2000s (cf. Figure 1), the growing popularity of e-commerce Websites like Amazon and Alibaba motivated the emergence of opinion spam (a.k.a. review spam) [24, 27].

According to Liu [27], there are three types of spam reviews: (i) fake reviews, (ii) reviews about brands only, and (iii) non-reviews. The first type of spam, fake reviews, consists of posting untruthful or deceptive reviews on online e-commerce platforms, in an attempt to manipulate the public perception (in a positive or negative manner) of specific products or services presented on the affected platform(s). Fake positive reviews can be used to enhance the popularity and positive perception of the product(s) or service(s) the spammer intends to promote, while fake negative reviews can contribute to smearing the spammer's competitor(s) and their products/services. Opinion spam of the second type, reviews about brands only, pertains to comments on the manufacturer/brand of a product but not on the product itself; albeit genuine, according to Liu [27] they are considered spam because they are not targeted at specific products and are often biased. Finally, spam reviews of the third type, non-reviews, are technically not opinion spam as they do not provide any opinion; they only contain generic, unrelated content (e.g., advertisements, or questions, rather than reviews, about a product). Fake reviews are, by far, the most common type of opinion spam, and the one that has received the most attention in the research community [27]. Furthermore, Jindal and Liu [24] showed that spam of the second and third type is simple to detect and address.

Unsurprisingly, the practice of opinion spam, and in particular fake reviews, is widely considered unfair and deceptive, and as such it has been the subject of extensive legal scrutiny and court battles. If left unchecked, opinion spam can poison a platform and negatively affect both customers and platform providers (including incurring financial losses for both parties, as customers may be tricked into purchasing undesirable items and grow frustrated with the platform), at the sole advantage of the spammer (or the entity they represent); as such, depending on the country's laws, opinion spam may qualify as a form of digital fraud.

Detecting fake reviews is complex for a variety of reasons: for example, spam reviews can be posted by fake or real user accounts. Furthermore, they can be posted by individual users or even groups of users [27, 30]. Spammers can deliberately use fake accounts on e-commerce platforms, created with the sole purpose of posting fake reviews. Fortunately, fake accounts on e-commerce platforms are generally easy to detect, as they engage in intense reviewing activity without any product purchases. An alternative and more complex scenario occurs when fake reviews are posted by real users. This tends to occur under two very different circumstances: (i) compromised accounts (i.e., accounts originally owned by legitimate users that have been hacked and sold to spammers) are frequently re-purposed and utilized in opinion spam campaigns [11]; (ii) fake review markets became very popular, where real users collude in exchange for direct payments to write untruthful reviews (e.g., without actually purchasing or trying a given product or service). To complicate this matter, researchers showed that fake personas, e.g., Facebook profiles, can be created and associated with such spam accounts [18]. During the late 2000s, many online fake-review markets emerged, whose legality was battled in court by e-commerce giants. Action on both legal and technical fronts has helped mitigate the problem of opinion spam.

From a technical standpoint, a variety of techniques have been proposed to detect review spam. Liu [27] identified three main approaches, namely supervised, unsupervised, and group spam detection. In supervised spam detection, the problem of separating fake from genuine (non-fake) reviews is formulated as a classification problem. Jindal and Liu [24] pointed out that the main challenge of this task is to work around the shortage of labeled training data. To address this problem, the authors exploited the fact that spammers, to minimize their work, often produce (near-)duplicate reviews,
that can be used as examples of fake reviews. Feature engineering and analysis was key to building informative features of genuine and fake reviews, enriched by features of the reviewing users and the reviewed products. Models based on Logistic Regression have proven successful in detecting untruthful opinions in large corpora of Amazon reviews [24]. Detection algorithms based on Support Vector Machines or Naive Bayes models generally perform well (above 98% accuracy) and scale to production systems [29]. These pipelines are often enhanced by human-in-the-loop strategies, where annotators recruited through Amazon Mechanical Turk (or similar crowd-sourcing services) manually label subsets of reviews to separate genuine from fake ones, feeding online learning algorithms that constantly adapt to new strategies and spam techniques [11, 27].

Unsupervised spam detection was used both to detect spammers and to detect fake reviews. Liu [27] reported on methods based on detecting anomalous behavioral patterns typical of spammers. Models of spam behaviors include targeting products, targeting groups (of products or brands), and general and early rating deviations [27]. Methods based on association rules can capture atypical behaviors of reviewers, detecting anomalies in reviewers' confidence, divergence from average product scores, entropy (diversity or homogeneity) of attributed scores, temporal dynamics, etc. [39]. As for the unsupervised detection of fake reviews, linguistic analysis proved useful to identify stylistic features of fake reviews, e.g., language markers that are over- or under-represented in fake reviews. Opinion spam to promote products, for example, exhibits on average three times fewer mentions of social words, negative sentiment, and long words (> 6 letters) than genuine reviews, while containing twice as many positive terms and references to self than formal texts [11].

Concluding, group spam detection aims at identifying signatures of collusion among spammers [30]. Collective behaviors such as spammers' coordination can emerge by using combinations of frequent pattern mining and group anomaly ranking. In the first stage, the algorithm proposed by Mukherjee et al. [30] identifies groups of reviewers who have all reviewed the same set of products; such groups are flagged as potentially suspicious. Then, anomaly scores for individual and group behaviors are computed and aggregated, accounting for indicators that measure group burstiness (i.e., writing reviews in a short timespan), group review similarity, etc. Groups are finally ranked in terms of their anomaly scores [30].

3.2 The Rise of Spam Bots

Up to the early 2000s, most spam activity was still coordinated and carried out, at least in significant part, by human operators: email spam campaigns, Web link farms, fake reviews, etc. all rely on human intervention and coordination. In other words, these spam operations scale at a (possibly significant) cost. With the rise in popularity of online social networks and social media platforms (see Figure 1), new forms of spam started to emerge at scale. One such example is social link farms [19]: similarly to Web link farms, whose goal is to manipulate the perception of popularity of a certain Website by artificially creating many pointers (hyperlinks) to it, in social link farming spammers create online personas with many artificial followers. This type of spam operation requires creating thousands (or more) of accounts that will be used to follow a target user in order to boost its apparent influence. Such "disposable accounts" are often referred to as fake followers, as their purpose is solely to participate in such link-farming networks. On some platforms, link farming was so pervasive that spammers reportedly controlled millions of fake accounts [19]. Link farming introduced a first level of automation in social media spam, namely the tools to automatically create large swaths of social media accounts.

In the late 2000s, social spam obtained a new potent tool to exploit: bots (short for software robots, a.k.a. social bots). In my 2016 CACM review titled The Rise of Social Bots [16], I noted that "bots have been around since the early days of computers": examples of bots include chatbots, algorithms designed to hold a conversation with a human; Web bots, to automate the crawling and indexing of the Web; trading bots, to automate stock market transactions; and much more. Although isolated examples exist of such bots being used for nefarious purposes, I am unaware of any reports of systematic abuse carried out by bots in those contexts.

A social bot is a new breed of "computer algorithm that automatically produces content and interacts with humans on the Social Web, trying to emulate and possibly alter their behavior." Since bots can be programmed to carry out arbitrary operations that would otherwise be tedious or time-consuming (thus expensive) for humans, they made it possible to scale spam operations on the Social Web to an unprecedented level. Bots, in other words, are the dream spammers have been dreaming of since the early days of the Internet: they allow for personalized, scalable interactions, increasing the cost effectiveness, reach, and plausibility of social spam campaigns, with the added advantage of increased credibility and the ability to escape detection achieved by their human-like disguise. Furthermore, with the democratization and popularization of machine learning and AI technologies, the entry barrier to creating social bots has significantly lowered [16]. Since social bots have been used in a variety of nefarious scenarios (see Sidebar: Social Spam Applications), from the manipulation of political discussion, to the spread of conspiracy theories and false news, and even by extremist groups for propaganda and recruitment, the stakes are high in the quest to characterize bot behavior and detect bots [35].³

³ It should be noted that bots are not used exclusively for nefarious purposes: for example, some researchers used bots for positive health behavioral interventions [16]. Furthermore, it has been noted that the most problematic aspect of nefarious bots is their attempt to deceive and disguise themselves as human users [16]; however, many bots are labeled as such and may provide useful services, like live-news updates, etc.

Maybe due to their fascinating morphing and disguising nature, spam bots have attracted the attention of the AI and machine learning research communities: the arms race between spammers and detection systems yielded technical progress on both the attacker's and the defender's technological fronts. Recent advancements in Artificial Intelligence (especially Artificial Neural Networks) fuel bots that can generate human-like natural language and interact with human users in near real time [16, 35]. On the other hand, the cyber-security and machine-learning communities came together to develop techniques to detect the signatures of artificial activity of bots and social network sybils [16, 40].

In [16], we fleshed out techniques used to both create spam bots and detect them. Although the degree of sophistication of such bots, and therefore their functionalities, varies vastly across
platforms and application domains, commonalities also emerge. Simple bots can perform unsophisticated operations, such as posting content according to a schedule, or interacting with others according to pre-determined scripts, whereas complex bots can motivate their reasoning and react to further human scrutiny. Beyond anecdotal evidence, there is no systematic way to survey the state of AI-fueled spam bots and, consequently, their capabilities: researchers adjust their expectations based on advancements made public in AI technologies (with the assumption that these will be abused by spammers with the right incentives and technical means), and based on proof-of-concept tools that are often originally created with other, non-nefarious purposes in mind (one such example is the so-called DeepFakes, discussed later).

In the Sidebar: Social Spam Applications, I highlight some of the domains where bots made the headlines: one such example is the wake of the 2016 U.S. presidential election, during which Twitter and Facebook bots were used to sow chaos and further polarize the political discussion [6]. Although it is not always possible for the research community to pinpoint the culprits, the research of my group, among many others, contributed to unveiling anomalous communication dynamics that attracted further scrutiny by law enforcement and were ultimately connected to state-sponsored operations (if you wish, a form of social spam aimed at influencing individual behavior). Spam bots operate in other highly controversial conversation domains: in the context of public health, they promote products or spread scientifically unsupported claims [2, 15]; they have been used to create spam campaigns to manipulate the stock market [15]; finally, bots have also been used to penetrate online social circles to leak personal user information [18].

Sidebar: Social Spam Applications

Political manipulation. In a peer-reviewed study published on November 7, 2016 [6] (the day before the U.S. presidential election), I unveiled a massive-scale spam operation affecting the American political Twitter. With the aid of Botometer, an AI system that leverages over a thousand features to separate bots from humans [35], hundreds of thousands of bots were identified. By studying the activity signatures of these bots, I noted that they were being retweeted at the same rate as human users, which may have contributed to the spread of political misinformation [36]. Since most of these bots aimed at sowing chaos, their presence may have inflamed and further polarized the political conversation, with unknown consequences for the integrity of the vote. Since then, dozens of studies have corroborated these results; many other studies, before and after mine, showed the perils associated with social spam campaigns in political domains. Most recently, the emerging phenomenon of fake news spreading has attracted a lot of attention. Vosoughi et al. [36] investigated the role of social media, as well as bots, in the spread of true and false news: the authors showed that humans are more likely to share false stories inspired by fear, disgust, and surprise. This suggests that conditioning and manipulation operations online can affect human behavior.

Public health. Conspiracy and denialism are endemic to social networks. Spam in public health discussions has become commonplace on social media: in a recent study, for example, my team highlighted how bots are used to promote electronic cigarettes as cessation devices with health benefits, a claim not definitively corroborated by science [2]. The use of bots to carry out anti-vaccination campaigns was the subject of investigation of a DARPA Challenge in 2016 [32].
Stock market. Automatic trading algorithms leverage information from social media to predict stock prices. Using bots, spam campaigns have been carried out to give the false impression that certain stocks were spoken about positively on Twitter, successfully tricking trading algorithms into buying them in a pump-and-dump scheme unveiled by the U.S. Securities and Exchange Commission (SEC) in 2015 [15].

Data leaks. Social platforms enable the often unwilling disclosure of private user information. A recent study showed that over a third of the content shared on Facebook has the default public-visibility privacy settings [28]. The amount of content accessible to undesirable users may be even higher when considering privacy settings that allow one's friends to access private information and preferences: research showed that most users indiscriminately accept friendship connections on Facebook [18]. Spam bots can inject themselves into tightly-connected communities by leveraging the weak-tie structure of online social networks [12], and obtain private information on large swaths of users. Phishing is also responsible for data leaks. Attacks based on short URLs are popular on social media: they can hide the true identity of the spammers and have proven effective at stealing personal data [9, 19].

4 AI SPAM
Artificial Intelligence has been advancing at vertiginous speed, revolutionizing many fields, including spam. Beyond powering conversational agents such as social bots, as discussed above, AI systems can be used, beyond their original scope, to fuel spam operations of different sorts. I will refer to this phenomenon as spamming with AI, hinting at the fact that AI is used as a tool to create new forms of spam. However, given their sophistication, AI systems can themselves be the subject of spam attacks. I will refer to this new concept as spamming into AI, suggesting that AIs can be manipulated, and even compromised, by spammers (or attackers in a broader sense) to exhibit anomalous and undesirable behaviors.

4.1 Spamming with AI
Advancements in computer vision and in augmented and virtual realities are projecting us into an era where the boundary between reality and fiction is increasingly blurry. Proofs-of-concept of AIs capable of analyzing and manipulating video footage, learning patterns of expressions, already exist: Suwajanakorn et al. [33] designed a deep neural network to map any audio into mouth shapes and convincing facial expressions, to impose an arbitrary speech on a video clip of a speaking actor, with results hard to distinguish, to the human eye, from genuine footage. Thies et al. [34] showcased a technique for real-time facial reenactment, to convincingly re-render the synthesized target face on top of the corresponding
2. Don't forget the arms race. The fight against spam is a constant arms race between attackers and defenders and, as in most adversarial settings, the party with the highest stakes will prevail: since with each new technology comes abuse, researchers should anticipate the need for counter-measures to avoid being caught unprepared when spammers abuse newly-designed technologies.

3. Blockchain technologies. The ability to carry out massive spam attacks in most systems exists predominantly due to the lack of authentication measures that reliably guarantee the identity of entities and the legitimacy of transactions on the system. The Blockchain, as a proof-of-work mechanism to authenticate digital personas (including in virtual realities), AIs, etc., may prevent several forms of spam and mitigate the scale and impact of others.⁸

Spam is here to stay: let's fight it together!

ACKNOWLEDGMENTS
The author would like to thank current and former members of the USC Information Sciences Institute's MINDS research group, as well as of the Indiana University's CNetS group, for invaluable research collaborations and discussions on the topics of this work. The author is grateful to his research sponsors, including the Air Force Office of Scientific Research (AFOSR), award FA9550-17-1-0327, and the Defense Advanced Research Projects Agency (DARPA), contract W911NF-17-C-0094.

REFERENCES
[1] B Adler, Luca De Alfaro, and Ian Pye. 2010. Detecting wikipedia vandalism using wikitrust. Notebook papers of CLEF 1 (2010), 22–23.
[2] Jon-Patrick Allem, Emilio Ferrara, Sree Priyanka Uppu, Tess Boley Cruz, and Jennifer B Unger. 2017. E-cigarette surveillance with social media data: social bots, emerging topics, and trends. JMIR Public Health and Surveillance 3, 4 (2017).
[3] Tiago A Almeida, José María G Hidalgo, and Akebo Yamakami. 2011. Contributions to the study of SMS spam filtering: new collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering. ACM, 259–262.
[4] Ion Androutsopoulos, John Koutsias, Konstantinos V Chandrinos, and Constantine D Spyropoulos. 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 160–167.
[5] Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61.
[6] Alessandro Bessi and Emilio Ferrara. 2016. Social bots distort the 2016 US Presidential election online discussion. First Monday 21, 11 (2016).
[7] Godwin Caruana and Maozhen Li. 2012. A survey of emerging approaches to spam filtering. ACM Computing Surveys (CSUR) 44, 2 (2012), 9.
[8] Robert Chesney and Danielle Citron. 2018. Deep Fakes: A Looming Crisis for National Security, Democracy and Privacy. The Lawfare Blog (2018).
[9] Sidharth Chhabra, Anupama Aggarwal, Fabricio Benevenuto, and Ponnurangam Kumaraguru. 2011. Phi.sh/$ocial: the phishing landscape through short URLs. In Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference. ACM, 92–101.
[10] Lorrie Faith Cranor and Brian A LaMacchia. 1998. Spam! Commun. ACM (1998).
[11] Michael Crawford, Taghi M Khoshgoftaar, Joseph D Prusa, Aaron N Richter, and Hamzah Al Najada. 2015. Survey of review spam detection using machine learning techniques. Journal of Big Data 2, 1 (2015), 23.
[12] Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. 2014. On Facebook, most ties are weak. Commun. ACM 57, 11 (2014), 78–84.
[13] Harris Drucker, Donghui Wu, and Vladimir N Vapnik. 1999. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10 (1999).
[14] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. Robust Physical-World Attacks on Deep Learning Visual Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1625–1634.
[15] Emilio Ferrara. 2015. Manipulation and abuse on social media. ACM SIGWEB Newsletter Spring (2015), 4.
[16] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
[17] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. 2006. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research 7, Dec (2006), 2699–2720.
[18] Hongyu Gao, Jun Hu, Christo Wilson, Zhichun Li, Yan Chen, and Ben Y Zhao. 2010. Detecting and characterizing social spam campaigns. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement. ACM, 35–47.
[19] Saptarshi Ghosh, Bimal Viswanath, Farshad Kooti, Naveen Kumar Sharma, Gautam Korlam, Fabricio Benevenuto, Niloy Ganguly, and Krishna Phani Gummadi. 2012. Understanding and combating link farming in the Twitter social network. In Proceedings of the 21st International Conference on World Wide Web. ACM, 61–70.
[20] Joshua Goodman, Gordon V Cormack, and David Heckerman. 2007. Spam and the ongoing battle for the inbox. Commun. ACM 50, 2 (2007), 24–33.
[21] BB Gupta, Aakanksha Tewari, Ankit Kumar Jain, and Dharma P Agrawal. 2017. Fighting against phishing attacks: state of the art and future challenges. Neural Computing and Applications 28, 12 (2017), 3629–3654.
[22] James Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Daniel Weitzner. 2008. Web science: an interdisciplinary approach to understanding the web. Commun. ACM 51, 7 (2008), 60–69.
[23] Tom N Jagatic, Nathaniel A Johnson, Markus Jakobsson, and Filippo Menczer. 2007. Social phishing. Commun. ACM 50, 10 (2007), 94–100.
[24] Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, 219–230.
[25] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018. Deep Video Portraits. arXiv preprint arXiv:1805.11714 (2018).
[26] Ben Laurie and Richard Clayton. 2004. Proof-of-work proves not to work; version 0.2. In Workshop on Economics and Information Security.
[27] Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5, 1 (2012), 1–167.
[28] Yabing Liu, Krishna P Gummadi, Balachander Krishnamurthy, and Alan Mislove. 2011. Analyzing Facebook privacy settings: user expectations vs. reality. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference. ACM, 61–70.
[29] Arjun Mukherjee, Abhinav Kumar, Bing Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Spotting opinion spammers using behavioral footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 632–640.
[30] Arjun Mukherjee, Bing Liu, and Natalie Glance. 2012. Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st International Conference on World Wide Web. ACM, 191–200.
[31] Nikita Spirin and Jiawei Han. 2012. Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter 13, 2 (2012), 50–64.
[32] V.S. Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. The DARPA Twitter Bot Challenge. Computer 49, 6 (2016), 38–46.
[33] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (2017).
[34] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. 2016. Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
[35] Onur Varol, Emilio Ferrara, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online Human-Bot Interactions: Detection, Estimation, and Characterization. In International AAAI Conference on Web and Social Media.
[36] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151.
[37] Chih-Hung Wu. 2009. Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks. Expert Systems with Applications 36, 3 (2009), 4321–4330.
[38] Ching-Tung Wu, Kwang-Ting Cheng, Qiang Zhu, and Yi-Leh Wu. 2005. Using visual features for anti-spam filtering. In IEEE International Conference on Image Processing, Vol. 3. IEEE, III–509.
[39] Sihong Xie, Guan Wang, Shuyang Lin, and Philip S Yu. 2012. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 823–831.
[40] Zhi Yang, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y Zhao, and Yafei Dai. 2014. Uncovering social network sybils in the wild. ACM Transactions on Knowledge Discovery from Data (TKDD) 8, 1 (2014), 2.

⁸ It is worth noting that proof-of-work has been proposed to prevent email spam in the past; however, its feasibility remains debated, especially in its original non-Blockchain-based implementation [26].
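The hashcash-style proof-of-work mentioned in footnote 8 can be sketched in a few lines. This is a minimal illustration, not the scheme from [26]: the header format, the use of SHA-256, and the 12-bit difficulty are my assumptions. The sender must find a nonce such that the hash of the message header starts with a fixed number of zero bits; the receiver checks the stamp with a single hash.

```python
import hashlib
from itertools import count

def mint(header: str, bits: int = 12) -> int:
    """Search for the smallest nonce whose SHA-256 digest of
    'header:nonce' begins with `bits` zero bits (costly for senders)."""
    for nonce in count():
        digest = hashlib.sha256(f"{header}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - bits) == 0:
            return nonce

def verify(header: str, nonce: int, bits: int = 12) -> bool:
    """Checking a stamp costs a single hash (cheap for receivers)."""
    digest = hashlib.sha256(f"{header}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0

# Minting takes ~2^bits hashes on average; a bulk spammer would
# have to pay this price for every single message sent.
header = "from:[email protected];to:[email protected];date:2019-08-14"
stamp = mint(header)
assert verify(header, stamp)
```

The asymmetry is the point: one hash to verify versus thousands (or more, as difficulty grows) to mint, which is negligible for a legitimate sender but prohibitive at spam volumes.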
[Figure: "The History of Digital Spam", a timeline; only partially recoverable from the page layout. Legible entries, in order:]

- 1898: THE SPANISH PRISONER. The New York Times reports of unsolicited messages circulating in association with an old swindle.
- Early 1900s: POST MAIL. Advertisement based on unsolicited content has been mailed to our doors by Post Mail.
- 2005
- [entry truncated] "...to reach billions of Social Web users."
- FAKE REVIEWS. Giants of e-commerce like Amazon and Alibaba fight the manipulation of product popularity by opinion spam.
- 2010
- SOCIAL BOTS. Millions of accounts operated by software populate social media to carry out nefarious spam campaigns.
- 2016
- 2018
- [entry truncated] "...spam to elicit behaviors of the AI system or of its users."