Information: COVID-19 Public Sentiment Insights and Machine Learning For Tweets Classification
Article
COVID-19 Public Sentiment Insights and Machine
Learning for Tweets Classification
Jim Samuel 1, * , G. G. Md. Nawaz Ali 2, * , Md. Mokhlesur Rahman 3,4 , Ek Esawi 5
and Yana Samuel 6
1 Department of Business Analytics, University of Charleston, Charleston, WV 25304, USA
2 Department of Applied Computer Science, University of Charleston, Charleston, WV 25304, USA
3 The William States Lee College of Engineering, University of North Carolina at Charlotte,
Charlotte, NC 28223, USA; [email protected]
4 Department of Urban and Regional Planning (URP), Khulna University of Engineering & Technology
(KUET), Khulna 9203, Bangladesh
5 Department of Data Analytics, University of Charleston, Charleston, WV 25304, USA; [email protected]
6 Department of Education, Northeastern University, Boston, MA 02115, USA; [email protected]
* Correspondence: [email protected] (J.S.); [email protected] (G.G.M.N.A.)
Received: 28 April 2020; Accepted: 9 June 2020; Published: 11 June 2020
Abstract: Along with the Coronavirus pandemic, another crisis has manifested itself in the form
of mass fear and panic phenomena, fueled by incomplete and often inaccurate information.
There is therefore a tremendous need to address and better understand COVID-19’s informational
crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be
implemented. In this research article, we identify public sentiment associated with the pandemic
using Coronavirus-specific Tweets and R statistical software, along with its sentiment analysis
packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19
approached peak levels in the United States, using descriptive textual analytics supported by
necessary textual data visualizations. Furthermore, we provide a methodological overview of
two essential machine learning (ML) classification methods, in the context of textual analytics,
and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a
strong classification accuracy of 91% for short Tweets, with the Naïve Bayes method. We also observe
that the logistic regression classification method provides a reasonable accuracy of 74% with shorter
Tweets, and both methods showed relatively weaker performance for longer Tweets. This research
provides insights into Coronavirus fear sentiment progression, and outlines associated methods,
implications, limitations and opportunities.
Keywords: COVID-19; Coronavirus; machine learning; sentiment analysis; textual analytics; twitter
1. Introduction
In this research article, we cover four critical issues: (1) public sentiment associated with the
progress of Coronavirus and COVID-19, (2) the use of Twitter data, namely Tweets, for sentiment
analysis, (3) descriptive textual analytics and textual data visualization, and (4) comparison of
textual classification mechanisms used in artificial intelligence (AI). The rapid spread of Coronavirus
and COVID-19 infections has created a strong need for discovering efficient analytics methods
for understanding the flow of information and the development of mass sentiment in pandemic
scenarios. While there are numerous initiatives analyzing healthcare, preventative care and recovery,
economic and network data, there has been relatively little emphasis on the analysis of aggregate
personal level and social media communications. McKinsey [1] recently identified critical aspects
for COVID-19 management and economic recovery scenarios. In their industry-oriented report,
they emphasized data management, tracking and informational dashboards as critical components of
managing a wide range of COVID-19 scenarios.
There has been an exponential growth in the use of textual analytics, natural language processing
(NLP) and other artificial intelligence techniques in research and in the development of applications.
Despite rapid advances in NLP, issues surrounding the limitations of these methods in deciphering
intrinsic meaning in text remain. Researchers at CSAIL, MIT (Computer Science and Artificial
Intelligence Laboratory, Massachusetts Institute of Technology), demonstrated how even the most
recent NLP mechanisms can fall short and thus remain “vulnerable to adversarial text” [2]. It is,
therefore, important to understand inherent limitations of text classification techniques and relevant
machine learning algorithms. Furthermore, it is important to explore whether multiple exploratory,
descriptive and classification techniques contain complementary synergies which will allow us to
leverage the “whole is greater than the sum of its parts” principle in our pursuit of artificial intelligence
driven insights generation from human communications. Studies in electronic markets demonstrated
the effectiveness of machine learning in modeling human behavior under complex informational
conditions, highlighting the role of the nature of information in affecting human behavior [3].
The source data for all Tweets analyses, tables and figures in this research, including the fear curve in
Figure 1 below, consist of publicly available Tweets, downloaded specifically for the purposes of this
research and further described in the data acquisition and preparation Section 3.1.1 of this study.
[Figure 1. The COVID-19 “Fear Curve”: bars show the day-to-day increase in fear-sentiment Tweet counts (y-axis: 0–400), with a dotted Lowess trend line.]
The rise in emphasis on AI methods for textual analytics and NLP followed the tremendous
increase in public reliance on social media (e.g., Twitter, Facebook, Instagram, blogging, and LinkedIn)
for information, rather than on the traditional news agencies [4–6]. People express their opinions,
moods, and activities on social media about diverse social phenomena (e.g., health, natural hazards,
cultural dynamics, and social trends) due to personal connectivity, network effects, limited costs
and easy access. Many companies use social media to promote their products and services
to end-users [7]. Correspondingly, users share their experiences and reviews, creating a rich
reservoir of information stored as text. Consequently, social media and open communication platforms
are becoming important sources of information for conducting research, in the context of the rapid
development of information and communication technology [8]. Researchers and practitioners mine
massive textual and unstructured datasets to generate insights about mass behavior, thoughts and
emotions on a wide variety of issues such as product reviews, political opinions and trends,
Information 2020, 11, 314 3 of 22
motivational principles and stock market sentiment [4,9–13]. Textual data visualization is also used to
identify the critical trend of change in fear-sentiment, using the “Fear Curve” in Figure 1, with the
dotted Lowess line demonstrating the trend and the bars indicating the day-to-day increase in the fear
Tweets count. Tweets were first classified using sentiment analysis, and then the progression of
the fear-sentiment was studied, as it was the most dominant emotion across the entire Tweets data.
This exploratory analysis revealed the significant daily increase in fear-sentiment towards the end of
March 2020, as shown in Figure 1.
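The daily counting behind a fear curve like Figure 1 can be sketched as follows. This is an illustrative Python sketch (the paper's analysis used R); it substitutes a simple centered moving average for the Lowess smoother, and the daily counts are made-up example values, not the paper's data.

```python
from collections import Counter
from datetime import date, timedelta

def fear_curve(tweet_days, window=3):
    """Count fear-classified tweets per day and smooth the daily counts
    with a centered moving average (a simple stand-in for Lowess)."""
    counts = Counter(tweet_days)                 # day -> fear tweet count
    days = sorted(counts)
    series = [counts[d] for d in days]
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return days, series, trend

# Hypothetical input: the posting date of each fear-tagged tweet.
start = date(2020, 3, 20)
tweet_days = [start + timedelta(days=d)
              for d, n in enumerate([5, 8, 14, 23, 40]) for _ in range(n)]
days, series, trend = fear_curve(tweet_days)
print(series)    # [5, 8, 14, 23, 40]
print(trend[2])  # (8 + 14 + 23) / 3 = 15.0
```

Plotting `series` as bars and `trend` as a line reproduces the shape of a fear curve; a production version would use a proper Lowess fit.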
In this research article, we present textual analyses of Twitter data to identify public sentiment,
specifically, tracking the progress of fear, which has been associated with the rapid spread of
Coronavirus and COVID-19 infections. This research outlines a methodological approach to analyzing
Twitter data specifically for identification of sentiment, key words associations and trends for crisis
scenarios akin to the current COVID-19 phenomena. We initiate the discussion and search for insights
with descriptive textual analytics and data visualization, such as exploratory Word Clouds and
sentiment maps in Figures 2–4.
Figure 2. An instance of a word cloud in Twitter data.
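A word cloud such as Figure 2 is driven by a term-frequency table. The following Python sketch (illustrative only; the paper used R) computes such frequencies from example tweets, with a minimal hypothetical stop-word list:

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real pipelines use larger lexicons.
STOP = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "this", "i"}

def word_frequencies(tweets, top_n=5):
    """Tokenize tweets, drop stop words, and return the most common
    terms -- the frequency table behind a word cloud."""
    words = []
    for tweet in tweets:
        tweet = re.sub(r"https?://\S+", "", tweet.lower())  # strip URLs
        words.extend(re.findall(r"[a-z']+", tweet))
    return Counter(w for w in words if w not in STOP).most_common(top_n)

tweets = [
    "Corona virus is spreading, stay home!",
    "Stock market panic over corona virus https://example.com",
    "Drinking corona beer, no panic here",
]
print(word_frequencies(tweets, top_n=3))
```

The resulting `(term, count)` pairs are then rendered with font size proportional to count.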
Early stage exploratory analytics of Tweets revealed interesting aspects, such as the relatively
higher number of Coronavirus Tweets coming from iPhone users, as compared to Android users,
along with a proportionally higher use of word-associations with politics (mention of Republican
and Democratic party leaders), URLs and humour, depicted by the word-association of beer with
Coronavirus, as summarized in Table 1 below. We observed that such references to humour and beer
were overtaken by “Fear Sentiment” as COVID-19 progressed and its seriousness became evident
(Figure 1). Tweets insights with textual analytics and NLP thus serve as a good reflector of shifts in
public sentiment.
Table 1. Summary of Tweet features by source.

Source    Total   Hashtags   Mentions   Urls   Pols   Corona   Flu   Beer   AbuseW
iPhone    3281    495        2305       77     218    4238     171   336    111
Android   1180    149        1397       37     125    1050     67    140    41
iPad      75      6          96         4      12     85       4     8      2
Cities    30      0          0          0      0      0        0     0      0
One of the key contributions of this research is our discussion, demonstration and comparison
of Naïve Bayes and Logistic methods-based textual classification mechanisms commonly used in AI
applications for NLP, and specifically contextualized in this research using machine learning for Tweets
classifications. Accuracy is measured by the ratio of correct classifications to the total number of test
items. We observed that Naïve Bayes is better for small to medium size tweets and can be used for
classifying short Coronavirus Tweets sentiments with an accuracy of 91%, as compared to logistic
regression with an accuracy of 74%. For longer Tweets, Naïve Bayes provided an accuracy of 57% and
logistic regression provided an accuracy of 52%, as summarized in Tables 6 and 7.
2. Literature Review
This study was informed by research articles from multiple disciplines; therefore, in this
section, we review the literature on textual analytics, sentiment analysis, Twitter and NLP,
and machine learning methods. Machine learning and strategic structuring of information
characteristics are necessary to address evolving behavioral issues in big data [3]. Textual analytics
deals with the analysis and evocation of characters, syntactic features, semantics, sentiment and
visual representations of text, its characteristics, and associated endogenous and exogenous features.
Endogenous features refer to aspects of the text itself, such as the length of characters in a social media
post, use of keywords, use of special characters and the presence or absence of URL links and hashtags,
as illustrated for this study in Table 2, which summarizes the appearances of “mentions” and
“hashtags” in descending order, indicating the use of screen names and the “#” symbol within the text
of the Tweets, respectively.
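Endogenous feature extraction of the kind described above can be sketched in a few lines. This is an illustrative Python sketch (the paper's pipeline used R), with a made-up example Tweet:

```python
import re

def endogenous_features(tweet):
    """Extract endogenous features -- properties of the text itself:
    character length, hashtags, mentions, and URL presence."""
    return {
        "length": len(tweet),
        "hashtags": re.findall(r"#\w+", tweet),
        "mentions": re.findall(r"@\w+", tweet),
        "urls": re.findall(r"https?://\S+", tweet),
        "has_url": bool(re.search(r"https?://\S+", tweet)),
    }

f = endogenous_features("Stay safe everyone! #COVID19 @WHO https://example.org")
print(f["hashtags"], f["mentions"], f["has_url"])  # ['#COVID19'] ['@WHO'] True
```

Exogenous variables (source device, user location) come from the Tweet's metadata rather than its text, so they would be read from the API payload instead of parsed out of the string.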
Exogenous variables, in contrast, are those aspects which are external but related to the text, such
as the source device used for making a post on social media, location of Twitter user and source types,
as illustrated for this study in Table 3. The table summarizes “source device” and “screen names”,
variables representing the type of device used to post the Tweet and the screen name of the Twitter
user, respectively, both external to the text of the Tweet. Such exploratory summaries describe the
data succinctly, provide a better understanding of the data, and help generate insights which inform
subsequent classification analysis. Past studies explored custom approaches to identifying constructs
such as dominance behavior in electronic chat, indicating the tremendous potential for extending such
analyses by using machine learning techniques to accelerate automated sentiment classification [14–17].
The subsections that follow present key insights gained from the literature review to support and
inform the textual analytics processes used in this study.
analysis demonstrated that elite users (e.g., local authorities, traditional media reporters) play an
important role in information dissemination and dominated the wildfire retweet network.
[Figure 3. (a) Negative sentiment in Tweets by state, USA (proportion scale 1.00–1.24); (b) fear sentiment in Tweets by state, USA (proportion scale 0.25–0.43).]
Twitter data has also been extensively used for crisis situations analysis and tracking, including the
analysis of pandemics [25–28]. Nagar et al. [29] validated the temporal predictive strength of daily
Twitter data for influenza-like illness for emergency department (ILI-ED) visits during the New York
City 2012–2013 influenza season. Widener and Li (2014) [8] performed sentiment analysis to understand
how geolocated tweets on healthy and unhealthy food are distributed across
the US. The spatial distribution of the analyzed tweets showed that people living in urban and suburban
areas tweet more than people living in rural areas. Similarly, per capita food tweets were higher in
large urban areas than in small urban areas. Logistic regression revealed that tweets in low-income
areas were associated with unhealthy food related Tweet content. Twitter data has also been used in
the context of healthcare sentiment analytics. De Choudhury et al. (2013) [10] investigated behavioral
changes and moods of new mothers in the postnatal situation. Using Twitter posts this study evaluated
postnatal changes (e.g., social engagement, emotion, social network, and linguistic style) to show that
Twitter data can be very effective in identifying mothers at risk of postnatal depression. Novel analytical
frameworks have also been used to analyze supply chain management (SCM) related Twitter data,
providing important insights to improve SCM practices and research [30]. The authors conducted
descriptive analytics, content analysis integrating text mining and sentiment analysis, and network
analytics on 22,399 SCM tweets. Carvalho et al. [31] presented an efficient platform named MISNIS
(intelligent Mining of Public Social Networks’ Influence in Society) to collect, store, manage, mine and
visualize Twitter and Twitter user data. This platform allows non-technical users to mine data easily
and has one of the highest success rates in capturing flowing Portuguese language tweets.
2.3.5. Summary
Table 4 presents the main features of the different classifiers with their respective strengths and
weaknesses, providing a good overview of all the classifiers mentioned in the above section.
Based on a review of multiple machine learning methods, we decided to apply Naïve Bayes and
logistic regression classification methods to train and test binary sentiment categories associated
with the Coronavirus Tweets data. These two methods were selected for their parsimony, and their
proven performance with textual classification provides for interesting comparative evaluations.
split into train and test data, to apply machine learning classification methods using two prominent
methods described below, and their results are discussed.
words, and “lemmatization”, which, like stemming, aims to transform words to simpler forms but
uses dictionaries and more complex rules and processes than stemming.
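A minimal preprocessing pass of this kind can be sketched as below. This is an illustrative Python sketch (the paper's pipeline used R packages); the stop-word list is a toy subset and `naive_stem` is a deliberately crude suffix-stripper, included to show why dictionary-based lemmatization is usually preferred over naive stemming:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def naive_stem(word):
    """Very rough suffix stripping. Note how it mangles 'virus' to
    'viru' -- real pipelines use a Porter-style stemmer or a
    dictionary-based lemmatizer instead."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The virus is spreading in the markets"))
# ['viru', 'spread', 'market']
```

The mangled token `'viru'` illustrates the trade-off described above: stemming is fast but lossy, while lemmatization consults dictionaries to map words to valid base forms.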
[Figure 5. N-Grams: bar charts of word frequencies for the most frequent unigrams (corona, virus, people, trump, beer), bigrams (corona virus, corona beer, stock market, drink corona, got corona, virus outbreak, drinking corona) and trigrams (corona virus outbreak).]
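N-gram frequency counting of the kind plotted in Figure 5 can be sketched as follows; this is an illustrative Python sketch (the paper used R), with made-up example tweets:

```python
import re
from collections import Counter

def ngram_counts(tweets, n=2, top_n=3):
    """Count the most frequent word n-grams across tweets, the data
    behind bar charts of unigrams, bigrams and trigrams."""
    grams = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        grams.update(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams.most_common(top_n)

tweets = ["corona virus outbreak", "drink corona beer",
          "corona virus spreads fast"]
print(ngram_counts(tweets, n=2, top_n=2))
```

Setting `n=1` or `n=3` yields the unigram and trigram counts for the corresponding panels.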
P(x|y) = P(y|x) P(x) / P(y)    (1)
The Naïve Bayes classifier identifies the estimated class ĉ among all the classes c ∈ C for a given
document d. Hence the estimated class is,
ĉ = argmaxc∈C P(c|d) = argmaxc∈C P(d|c) P(c) / P(d)    (3)
Simplifying (3) (as P(d) is the same for all classes, we can drop P(d) from the denominator) and
using the likelihood of P(d|c), we get,
Hence, from (4) and (5) we get the final equation of the Naïve Bayes classifier as,
To apply the classifier in textual analytics, we consider the index position of words (wi) in the
documents, namely replacing yi by wi. Now, considering features in log space, (6) becomes,
P̂(c) = Nc / Ndoc    (8)

P̂(wi|c) = count(wi, c) / ∑w∈V count(w, c)    (9)
where count(wi , c) is the number of occurrences of wi in class c, and V is the entire word vocabulary.
Since Naïve Bayes multiplies all the feature likelihoods together (refer to (6)), a zero probability in
the likelihood term for any class will turn the whole probability to zero. To avoid this situation,
we use the Laplace add-one smoothing method, and (9) becomes,

P̂(wi|c) = (count(wi, c) + 1) / ∑w∈V (count(w, c) + 1) = (count(wi, c) + 1) / (∑w∈V count(w, c) + |V|)    (10)
From an applied perspective, the text needs to be cleaned and prepared to contain clear,
distinct and legitimate words (wi ) for effective classification. Custom abbreviations, spelling errors,
emoticons, extensive use of punctuation, and such other stylistic issues in the text can impact the
accuracy of classification in both the Naïve Bayes and logistic classification methods, as text cleaning
processes may not be 100% successful.
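The training and classification equations above can be sketched in code. This is an illustrative Python sketch of a multinomial Naïve Bayes classifier with Laplace smoothing (the paper's models were built in R); the four training documents and labels are made-up toy data:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train multinomial Naive Bayes: class priors P(c) = Nc/Ndoc
    as in Equation (8), plus per-class word counts for Equation (10)."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for c in classes for w in counts[c]}
    return priors, counts, vocab

def classify_nb(doc, priors, counts, vocab):
    """Pick argmax_c of log P(c) + sum_i log P(w_i|c), with add-one
    (Laplace) smoothed likelihoods as in Equation (10)."""
    scores = {}
    for c in priors:
        total = sum(counts[c].values())
        score = math.log(priors[c])
        for w in doc.split():
            if w in vocab:  # words outside the vocabulary are ignored
                score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = ["fear panic virus", "virus deaths fear", "beer fun party", "fun corona beer"]
labels = ["neg", "neg", "pos", "pos"]
model = train_nb(docs, labels)
print(classify_nb("panic virus deaths", *model))  # -> 'neg'
```

Working in log space, as in (7), avoids numeric underflow when many small likelihoods are multiplied.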
Interestingly, we found strong classification accuracy for shorter Tweets, with around nine out of
every ten Tweets classified correctly (91.43% accuracy). However, we observed an inverse relationship
between the length of Tweets and classification accuracy: the accuracy decreased to 57% as the length
of Tweets increased to below 120 characters. We calculated the
Sensitivity of the classification test, which is given by the ratio of the number of correct positive
predictions (30) in the output, to the total number of positives (35), to be 0.86 for the short Tweets
and 0.17 for the longer Tweets. We calculated the Specificity of the classification test, which is given
by the ratio of the number of correct negative predictions (34) in the output, to the total number of
negatives (35), to be 0.97 for both the short and long Tweets classification. Naïve Bayes thus had better
performance with classifying negative Tweets.
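The accuracy, sensitivity and specificity figures above follow directly from the confusion-matrix counts reported for the 70-item short-Tweet test set. A small Python sketch of the arithmetic:

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (recall on positives) and specificity
    from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Short-Tweet Naive Bayes test set from the paper: 70 items,
# 30 of 35 positives and 34 of 35 negatives classified correctly.
m = classification_metrics(tp=30, fn=5, tn=34, fp=1)
print(round(m["accuracy"] * 100, 2))  # 91.43
print(round(m["sensitivity"], 2))     # 0.86
print(round(m["specificity"], 2))     # 0.97
```

This reproduces the reported 91.43% accuracy, 0.86 sensitivity and 0.97 specificity for short Tweets.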
1. A feature representation of the input: For each input observation ( x (i) ), this will be represented
by a vector of features, [ x1 , x2 , · · · , xn ].
2. A classification function: It computes the estimated class ŷ. The sigmoid function is used in
classification.
3. An objective function: The job of the objective function is to minimize the error on training examples.
The cross-entropy loss function is often used for this purpose.
4. An optimizing algorithm: This algorithm will be used for optimizing the objective function.
The stochastic gradient descent algorithm is popularly used for this task.
Representing w.x as the dot product of the vectors w and x, we can simplify (11) as,
z = w.x + b (12)
We use the following sigmoid function to map the real-valued number z into the range [0, 1],

y = σ(z) = 1 / (1 + e−z)    (13)
After applying the sigmoid function in (12) and ensuring that P(y = 1|x) + P(y = 0|x) = 1, we get the
following two probabilities,

P(y = 1|x) = σ(w.x + b) = 1 / (1 + e−(w.x+b))    (14)

P(y = 0|x) = 1 − P(y = 1|x) = e−(w.x+b) / (1 + e−(w.x+b))    (15)
Considering 0.5 as the decision boundary, the estimated class ŷ will be,

ŷ = 1 if P(y = 1|x) > 0.5; ŷ = 0 otherwise    (16)
To turn (18) into a minimizing function (loss function), we take the negation of (18), which yields,
where

L(f(x; θ), y) = L(w, b) = L(ŷ, y) = −[y log σ(w.x + b) + (1 − y) log(1 − σ(w.x + b))]    (23)
and the partial derivative (∂/∂wj) of this function for one observation vector x is,

∂L(w, b)/∂wj = [σ(w.x + b) − y] xj    (24)
where the gradient in (24) represents the difference between ŷ and y multiplied by the
corresponding input x j . Please note that in (22), we need to do the partial derivatives for all the
values of x j where 1 ≤ j ≤ n.
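The components above (sigmoid, decision boundary, and the per-example gradient of Equation (24)) fit together as follows. This is an illustrative Python sketch of logistic regression trained with stochastic gradient descent (the paper used R packages); the two-feature toy dataset, e.g. positive-word and negative-word counts per Tweet, is made up:

```python
import math

def sigmoid(z):  # Equation (13)
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, lr=0.5, epochs=200):
    """Binary logistic regression via stochastic gradient descent,
    applying the per-example gradient of Equation (24):
    dL/dw_j = (sigma(w.x + b) - y) * x_j."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = p - t  # sigma(w.x + b) - y
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b

def predict(x, w, b):  # Equation (16): 0.5 decision boundary
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0

# Toy features: [positive-word count, negative-word count] per Tweet.
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = [1, 1, 0, 0]
w, b = sgd_logistic(X, y)
print([predict(x, w, b) for x in X])  # [1, 1, 0, 0]
```

Each update moves the weights against the gradient of the cross-entropy loss, so on this separable toy data the learned boundary classifies all four examples correctly.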
accuracy under varying lengths of Coronavirus Tweets. As with classification of Tweets using Naïve
Bayes, positive sentiment Tweets were assigned a value of 1, and negative sentiment Tweets were
denoted by 0, allowing for a simple binary classification using logistic regression methodology. Subsets
of data were created, based on the length of Tweets, in a similar process as for Naïve Bayes classification
and the same two groups of data containing Tweets with less than 77 characters (approximately 25%
of the Tweets), and Tweets with less than 125 characters (approximately 50% of the data) respectively,
were used. We used R [56] and associated packages for logistic regression modeling, and to train and
test the data. The results of using logistic regression for Coronavirus Tweet Classification are presented
in Table 7.
We observed on the test data with 70 items that, akin to the Naïve Bayes classification accuracy,
shorter Tweets were classified using logistic regression with a greater degree of accuracy of just above
74%, and the classification accuracy decreased to 52% with longer Tweets. We calculated the Sensitivity
of the classification test, which is given by the ratio of the number of correct positive predictions (22)
in the output, to the total number of positives (35), to be 0.63 for the short Tweets, and 0.46 for the
longer Tweets. We calculated the Specificity of the classification test, which is given by the ratio of the
number of correct negative predictions (30) in the output, to the total number of negatives (35), to be
0.86 for the short Tweets, and 0.60 for the longer Tweets classification. Logistic regression thus had
better performance with a balanced classification of Tweets.
4. Discussion
The classification results obtained in this study are interesting and indicate a need for additional
validation and empirical model development with more Coronavirus data, and additional methods.
Models thus developed with additional data and methods, and using Naïve Bayes and logistic
regression Tweet Classification methods can then be used as independent mechanisms for automated
classification of Coronavirus sentiment. The model and the findings can also be further extended
to similar local and global pandemic insights generation in the future. Textual analytics has gained
significant attention over the past few years with the advent of big data analytics, unstructured
data analysis and increased computational capabilities at decreasing costs, which enables the
analysis of large textual datasets. Our research demonstrates the use of the NRC sentiment lexicon,
using the Syuzhet and sentimentr packages in R ([51,52]), and it will be a useful exercise to evaluate
comparatively with other sentiment lexicons such as Bing and Afinn lexicons [51]. Furthermore,
each type of text corpus has its own features and peculiarities; for example, Twitter data will tend to
be different from LinkedIn data in syntactic features and semantics. Past research has also indicated
the usefulness of applying multiple lexicons, to generate either a manually weighted model or a
statistically derived model based on a combination of multiple sentiment scores applied to the same
text, and hybrid approaches [57], and a need to apply strategic modeling to address big data challenges.
We have demonstrated a structured approach which is necessary for successful generation of insights
from textual data. When analyzing crisis situations, it is important to map sentiment against time,
such as in the fear curve plot (Figure 1), and where relevant geographically, such as in Figure 3a,b.
Associating text and textual features with carefully selected and relevant non-textual features is
another critical aspect of insights generation through textual analytics as has been demonstrated
through Tables 1–7.
4.1. Limitations
The current study focused on a textual corpus consisting of Tweets filtered by “Coronavirus” as
the keyword. Therefore the analysis and the methods are specifically applied to data about a particular
pandemic as a crisis situation, and hence it could be argued that the analytical structure outlined in
this paper can only be weakly generalized. Future research could address this and explore “alternative
dimensionalities and perform sensitivity analysis” to improve the validity of the insights gained [58].
The Novel Coronavirus pandemic is a phenomenon of an unprecedented nature, and associated social
media trends can therefore also be considered to possess distinct characteristics. Hence, it is important
to contextualize the use of ML, because using data from pre-COVID-19 time periods mixed with one
or more phases of the spread of the virus would confound ML modeling and results, unless control
mechanisms are used to control for pandemic effects. The present study addresses this data-validity
challenge in applying machine learning by using Twitter data, filtered and processed to provide a
clean Coronavirus dataset, from a single phase of the spread of the pandemic. ML was applied to
classify sentiment for Tweets only within this period, and is therefore justifiably useful for classifying
COVID-19 Tweets sentiment. The data-validity limitation for unique events such as COVID-19 must
be accounted for in future studies using machine learning on pandemic associated data.
Furthermore, the analysis used one sentiment lexicon to identify positive and negative sentiments,
and one sentiment lexicon to classify the tweets into categories such as fear, sadness, anger and
disgust [7,51,52]. Varying information categories have the potential to influence human beliefs and
decision making [59], and hence it is important to consider multiple social media platforms with
differing information formats (such as short text, blogs, images and comments) to gain a holistic
perspective. The present study intended to generate rapid insights for COVID-19 related public
sentiment using Twitter data, which was successfully accomplished. This study also intended to
explore the viability of machine learning classification methods, and we found sufficient directional
support for the use of Naïve Bayes and Logistic classification for short to medium length Tweets,
but the accuracy decreased with the increase in the length of Tweets. We have not stated a formal
model for Tweets sentiment classification, as that is not a goal of this research. While the absence of
such a formal model may also be perceived as a limitation which we acknowledge, it must be noted
that our research goal of evaluating the viability of using machine learning classification for Tweets
of varying lengths was accomplished. Finally, we also acknowledge that Twitter data alone is not a
reflection of general mass sentiment in a nation or even in a state or local area [8,11,29]. However,
the current research provides a clear direction for more comprehensive analysis of multiple textual
data sources including other social media platforms, news articles and personal communications
data. The mismatch between the Coronavirus negative sentiment map, the fear sentiment map, and the
factually known hot spots in New York, New Jersey and California, as shown in Figure 3, could have
been driven by the timing of tweets posted just before the magnitude of the problem was recognized,
and could also be reflective of cultural attitudes. The sentiment map presents a fair degree of acceptable
association with states such as West Virginia and North Dakota. Overall, though these limitations are
acknowledged from a general perspective, they do not diminish the contributions made by this study,
as the generic weaknesses are not associated with the primary goals of this study.
stream of thought towards using social media data to help understand and manage contagions and
crisis scenarios.
As a global pandemic, COVID-19 is adversely affecting people and countries. Besides necessary
healthcare and medical treatments, it is critical to protect people and societies from psychological
shocks (e.g., distress, anxiety, fear, mental illness). In this context, automated machine learning driven
sentiment analysis could help health professionals, policymakers, and state and federal governments
to understand and identify rapidly changing psychological risks in the population. Consequently,
timely responses and initiatives (e.g., counseling, internet-based psychological support mechanisms)
taken by the agencies to mitigate and prevent adverse emotional and psychological consequences
will significantly improve public health and well-being during crisis phenomena. Sentiment analysis
using social media data will thus provide valuable insights on attitudes, perceptions, and behaviors
for critical decision making for business and political leaders, and societal representatives.
Author Contributions: This work was completed with contributions from all the authors. Conceptualization, J.S.,
G.G.M.N.A., Y.S. and M.M.R.; methodology, J.S. and M.M.R.; software, J.S.; validation, J.S., M.M.R., E.E. and Y.S.;
formal analysis, J.S. and M.M.R.; data curation, J.S.; writing—original draft preparation, J.S., G.G.M.N.A., M.M.R.,
E.E. and Y.S.; writing—review and editing, J.S., G.G.M.N.A., M.M.R., E.E. and Y.S.; visualization, J.S. and M.M.R.;
funding acquisition, G.G.M.N.A. All authors edited, reviewed and improved the manuscript. All authors have read
and agreed to the published version of the manuscript.
Funding: This research did not receive any external funding.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. COVID-19: Briefing Materials. Available online: https://www.mckinsey.com/~/media/mckinsey/
business%20functions/risk/our%20insights/covid%2019%20implications%20for%20business/covid%
2019%20may%2013/covid-19-facts-and-insights-may-6.ashx (accessed on 11 June 2020).
2. Jin, D.; Jin, Z.; Zhou, J.T.; Szolovits, P. Is BERT Really Robust? Natural Language Attack on Text Classification
and Entailment. arXiv 2019, arXiv:1907.11932.
3. Samuel, J. Information Token Driven Machine Learning for Electronic Markets: Performance Effects in
Behavioral Financial Big Data Analytics. JISTEM J. Inf. Syst. Technol. Manag. 2017, 14, 371–383. [CrossRef]
4. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective.
ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [CrossRef]
5. Makris, C.; Pispirigos, G.; Rizos, I.O. A Distributed Bagging Ensemble Methodology for Community
Prediction in Social Networks. Information 2020, 11, 199. [CrossRef]
6. Heist, N.; Hertling, S.; Paulheim, H. Language-agnostic relation extraction from abstracts in Wikis.
Information 2018, 9, 75. [CrossRef]
7. He, W.; Wu, H.; Yan, G.; Akula, V.; Shen, J. A novel social media competitive analytics framework with
sentiment benchmarks. Inf. Manag. 2015, 52, 801–812. [CrossRef]
8. Widener, M.J.; Li, W. Using geolocated Twitter data to monitor the prevalence of healthy and unhealthy
food references across the US. Appl. Geogr. 2014, 54, 189–197. [CrossRef]
9. Kretinin, A.; Samuel, J.; Kashyap, R. When the Going Gets Tough, The Tweets Get Going! An Exploratory
Analysis of Tweets Sentiments in the Stock Market. Am. J. Manag. 2018, 18.
10. De Choudhury, M.; Counts, S.; Horvitz, E. Predicting Postpartum Changes in Emotion and Behavior via
Social Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Paris,
France, 27 April–2 May 2013; pp. 3267–3276.
11. Wang, Z.; Ye, X.; Tsou, M.H. Spatial, temporal, and content analysis of Twitter for wildfire hazards.
Nat. Hazards 2016, 83, 523–540. [CrossRef]
12. Skoric, M.M.; Liu, J.; Jaidka, K. Electoral and Public Opinion Forecasts with Social Media Data:
A Meta-Analysis. Information 2020, 11, 187. [CrossRef]
13. Samuel, J. Eagles & Lions Winning Against Coronavirus! 8 Principles from Winston Churchill for
Overcoming COVID-19 & Fear. Researchgate Preprint. 2020. Available online: https://www.researchgate.
net/publication/340610688 (accessed on 21 April 2020). [CrossRef]
14. Chen, X.; Xie, H.; Cheng, G.; Poon, L.K.; Leng, M.; Wang, F.L. Trends and Features of the Applications
of Natural Language Processing Techniques for Clinical Trials Text Analysis. Appl. Sci. 2020, 10, 2157.
[CrossRef]
15. Reyes-Menendez, A.; Saura, J.R.; Alvarez-Alonso, C. Understanding #WorldEnvironmentDay user opinions
in Twitter: A topic-based sentiment analysis approach. Int. J. Environ. Res. Public Health 2018, 15, 2537.
16. Saura, J.R.; Palos-Sanchez, P.; Grilo, A. Detecting indicators for startup business success: Sentiment analysis
using text data mining. Sustainability 2019, 11, 917. [CrossRef]
17. Samuel, J.; Holowczak, R.; Benbunan-Fich, R.; Levine, I. Automating Discovery of Dominance in
Synchronous Computer-Mediated Communication. In Proceedings of the 2014 47th IEEE Hawaii
International Conference on System Sciences, Waikoloa, HI, USA, 6–9 January 2014; pp. 1804–1812.
18. Rocha, G.; Lopes Cardoso, H. Recognizing textual entailment: Challenges in the Portuguese language.
Information 2018, 9, 76. [CrossRef]
19. Carducci, G.; Rizzo, G.; Monti, D.; Palumbo, E.; Morisio, M. Twitpersonality: Computing personality traits
from tweets using word embeddings and supervised learning. Information 2018, 9, 127. [CrossRef]
20. Ahmad, T.; Ramsay, A.; Ahmed, H. Detecting Emotions in English and Arabic Tweets. Information 2019,
10, 98. [CrossRef]
21. Pépin, L.; Kuntz, P.; Blanchard, J.; Guillet, F.; Suignard, P. Visual analytics for exploring topic long-term
evolution and detecting weak signals in company targeted tweets. Comput. Ind. Eng. 2017, 112, 450–458.
[CrossRef]
22. De Maio, C.; Fenza, G.; Loia, V.; Parente, M. Time aware knowledge extraction for microblog summarization
on twitter. Inf. Fusion 2016, 28, 60–74. [CrossRef]
Information 2020, 11, 314 21 of 22
23. Ahmad, N.; Siddique, J. Personality assessment using Twitter tweets. Procedia Comput. Sci. 2017, 112,
1964–1973. [CrossRef]
24. Jain, V.K.; Kumar, S.; Fernandes, S.L. Extraction of emotions from multilingual text using intelligent text
processing and computational linguistics. J. Comput. Sci. 2017, 21, 316–326. [CrossRef]
25. Ye, X.; Li, S.; Yang, X.; Qin, C. Use of social media for the detection and analysis of infectious diseases in
China. ISPRS Int. J. Geo-Inf. 2016, 5, 156. [CrossRef]
26. Fung, I.C.H.; Yin, J.; Pressley, K.D.; Duke, C.H.; Mo, C.; Liang, H.; Fu, K.W.; Tse, Z.T.H.; Hou, S.I. Pedagogical
Demonstration of Twitter Data Analysis: A Case Study of World AIDS Day, 2014. Data 2019, 4, 84. [CrossRef]
27. Kim, E.H.J.; Jeong, Y.K.; Kim, Y.; Kang, K.Y.; Song, M. Topic-based content and sentiment analysis of Ebola
virus on Twitter and in the news. J. Inf. Sci. 2016, 42, 763–781. [CrossRef]
28. Samuel, J.; Ali, N.; Rahman, M.; Samuel, Y.; Pelaez, A. Feeling Like it is Time to Reopen Now? COVID-19
New Normal Scenarios Based on Reopening Sentiment Analytics. arXiv 2020, arXiv:2005.10961.
29. Nagar, R.; Yuan, Q.; Freifeld, C.C.; Santillana, M.; Nojima, A.; Chunara, R.; Brownstein, J.S. A case study
of the New York City 2012–2013 influenza season with daily geocoded Twitter data from temporal and
spatiotemporal perspectives. J. Med Internet Res. 2014, 16, e236. [CrossRef]
30. Chae, B.K. Insights from hashtag #supplychain and Twitter Analytics: Considering Twitter and Twitter data
for supply chain practice and research. Int. J. Prod. Econ. 2015, 165, 247–259.
31. Carvalho, J.P.; Rosa, H.; Brogueira, G.; Batista, F. MISNIS: An intelligent platform for twitter topic mining.
Expert Syst. Appl. 2017, 89, 374–388. [CrossRef]
32. Vijayan, V.K.; Bindu, K.; Parameswaran, L. A comprehensive study of text classification algorithms.
In Proceedings of the 2017 International Conference on Advances in Computing, Communications and
Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 1109–1113.
33. Zhang, J.; Yang, Y. Robustness of regularized linear classification methods in text categorization.
In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, Toronto, ON, Canada, 28 July 2003; pp. 190–197.
34. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.;
Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12,
2825–2830.
35. Liu, B.; Blasch, E.; Chen, Y.; Shen, D.; Chen, G. Scalable sentiment classification for big data analysis using
Naive Bayes classifier. In Proceedings of the 2013 IEEE International Conference on Big Data, Silicon Valley,
CA, USA, 6–9 October 2013; pp. 99–104.
36. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification
algorithms: A survey. Information 2019, 10, 150. [CrossRef]
37. Troussas, C.; Virvou, M.; Espinosa, K.J.; Llaguno, K.; Caro, J. Sentiment analysis of Facebook statuses
using Naive Bayes classifier for language learning. In Proceedings of the IISA 2013, Piraeus, Greece,
10–12 July 2013; pp. 1–6.
38. Ting, S.; Ip, W.; Tsang, A.H. Is Naive Bayes a good classifier for document classification? Int. J. Softw. Eng.
Appl. 2011, 5, 37–46.
39. Boiy, E.; Hens, P.; Deschacht, K.; Moens, M.F. Automatic Sentiment Analysis in On-line Text. In Proceedings
of the ELPUB 2007 Conference on Electronic Publishing, Vienna, Austria, 13–15 June 2007; pp. 349–360.
40. Pranckevičius, T.; Marcinkevičius, V. Comparison of Naive Bayes, random forest, decision tree, support vector
machines, and logistic regression classifiers for text reviews classification. Balt. J. Mod. Comput. 2017, 5, 221.
[CrossRef]
41. Ramadhan, W.; Novianty, S.A.; Setianingsih, S.C. Sentiment analysis using multinomial logistic regression.
In Proceedings of the 2017 International Conference on Control, Electronics, Renewable Energy and
Communications (ICCREC), Yogyakarta, Indonesia, 26–28 September 2017; pp. 46–49.
42. Rubegni, P.; Cevenini, G.; Burroni, M.; Dell’Eva, G.; Sbano, P.; Cuccia, A.; Andreassi, L. Digital dermoscopy
analysis of atypical pigmented skin lesions: A stepwise logistic discriminant analysis approach. Skin Res.
Technol. 2002, 8, 276–281. [CrossRef]
43. Silva, I.; Eugenio Naranjo, J. A Systematic Methodology to Evaluate Prediction Models for Driving Style
Classification. Sensors 2020, 20, 1692. [CrossRef] [PubMed]
44. Buldin, I.D.; Ivanov, N.S. Text Classification of Illegal Activities on Onion Sites. In Proceedings of the
2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus),
St. Petersburg/Moscow, Russia, 27–30 January 2020; pp. 245–247.
45. Tan, Y. An improved KNN text classification algorithm based on K-medoids and rough set. In Proceedings
of the 2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC),
Hangzhou, China, 25–26 August 2018; Volume 1, pp. 109–113.
46. Conner, C.; Samuel, J.; Kretinin, A.; Samuel, Y.; Nadeau, L. A Picture for The Words! Textual Visualization in
Big Data Analytics. Northeast Bus. Econ. Assoc. Annu. Proc. 2019, 46, 37–43.
47. Samuel, Y.; George, J.; Samuel, J. Beyond STEM, How Can Women Engage Big Data, Analytics, Robotics
& Artificial Intelligence? An Exploratory Analysis of Confidence & Educational Factors in the Emerging
Technology Waves Influencing the Role of, & Impact Upon, Women. arXiv 2020, arXiv:2003.11746.
48. Svetlov, K.; Platonov, K. Sentiment Analysis of Posts and Comments in the Accounts of Russian Politicians on
the Social Network. In Proceedings of the 2019 25th Conference of Open Innovations Association (FRUCT),
Helsinki, Finland, 5–8 November 2019; pp. 299–305.
49. Saif, H.; Fernández, M.; He, Y.; Alani, H. On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of
Twitter; European Language Resources Association (ELRA): Reykjavik, Iceland, 2014.
50. Ravi, K.; Ravi, V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications.
Knowl. Based Syst. 2015, 89, 14–46. [CrossRef]
51. Jockers, M.L. Syuzhet: Extract Sentiment and Plot Arcs from Text, R Package Version 1.0.4; CRAN, 2017. Available
online: https://cran.r-project.org/web/packages/syuzhet/syuzhet.pdf (accessed on 11 June 2020).
52. Rinker, T.W. sentimentr: Calculate Text Polarity Sentiment; Version 2.7.1; Buffalo: New York, NY, USA, 2019.
53. Almatarneh, S.; Gamallo, P. Comparing supervised machine learning strategies and linguistic features to
search for very negative opinions. Information 2019, 10, 16. [CrossRef]
54. Jurafsky, D.; Martin, J. Speech and Language Processing, 3rd ed.; Stanford University: Stanford, CA, USA, 2019.
55. Bayes, T. An Essay Towards Solving a Problem in the Doctrine of Chances, 1763. In MD Computing: Computers
in Medical Practice; NCBI: Bethesda, MD, USA, 1991; Volume 8, p. 157.
56. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing:
Vienna, Austria, 2020.
57. Sharma, S.; Jain, A. Hybrid Ensemble Learning With Feature Selection for Sentiment Classification in Social
Media. Int. J. Inf. Retr. Res. (IJIRR) 2020, 10, 40–58. [CrossRef]
58. Evangelopoulos, N.; Zhang, X.; Prybutok, V.R. Latent semantic analysis: Five methodological
recommendations. Eur. J. Inf. Syst. 2012, 21, 70–86. [CrossRef]
59. Samuel, J.; Holowczak, R.; Pelaez, A. The Effects of Technology Driven Information Categories on
Performance in Electronic Trading Markets. J. Inf. Technol. Manag. 2017, 28, 1–14.
60. Ahmed, W.; Bath, P.; Demartini, G. Using Twitter as a data source: An overview of ethical, legal, and
methodological challenges. Adv. Res. Ethics Integr. 2017, 2, 79–107.
61. Buchanan, E. Considering the ethics of big data research: A case of Twitter and ISIS/ISIL. PLoS ONE 2017,
12, e0187155. [CrossRef] [PubMed]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).