ABSTRACT The coronavirus outbreak has brought unprecedented measures, which forced the authorities
to impose lockdowns in the areas hit hardest by the pandemic. Social media
has been an important support for people passing through this difficult period. On November 9, 2020,
when the first vaccine reported to be more than 90% effective was announced, social media reacted
and people worldwide started to express their feelings about vaccination, which was no longer a
hypothesis but came closer, each day, to becoming a reality. The present paper analyzes the dynamics of
opinions regarding COVID-19 vaccination over the one-month period following the first vaccine
announcement, until the first vaccination took place in the UK, a period in which civil society showed
heightened interest in the vaccination process. Classical machine learning and deep learning algorithms have
been compared in order to select the best performing classifier. A total of 2 349 659 tweets have been
collected, analyzed, and put in connection with the events reported by the media. The analysis shows that
most of the tweets have a neutral stance, while the number of in favor tweets exceeds the number of against
tweets. As for the news, the occurrence of tweets follows the trend of the events. Furthermore,
the proposed approach can be used for a longer monitoring campaign that can help governments create
appropriate means of communication and evaluate them in order to provide clear and adequate information
to the general public, which could increase public trust in a vaccination campaign.
INDEX TERMS opinion mining; social media; COVID-19; SARS-CoV-2; stance classification; vaccine
information voluntarily offered by the user, a tweet may also retain information related to the
location of the user and might contain links, emoticons and hashtags which can help the user in better
expressing his/her sentiments, making it a source of valuable information [5], [6]. Furthermore,
Twitter has been used by government officials and political figures for informing the general public
either regarding their activity or in the case of major events [7].

Over time, the information extracted from Twitter has been used in various studies, featuring, but not
being limited to: analyzing public opinion related to the refugee crisis [8], natural disasters and
social movements [9], evaluating companies’ services [10] and reputation [11], sports fans’
sentiments [12], [13], forecasting the prices of cryptocurrencies [14], predicting vehicle sales [15],
political attitudes in multi-party contexts [16], healthcare [17], infectious disease [3], [18], celiac
disease [19] and cancer patients’ sentiments [20], vaccination [5].

The vaccination topic has been, over time, one of the themes which have raised a series of questions in
social media, most of them related to the safety of the entire process. As a result, a series of
studies have analyzed the impact of different social media campaigns on vaccination hesitancy
[21]–[23] or the general public sentiment in connection with the vaccination process [5], [24].
Additionally, compared to other vaccination situations studied in the scientific literature, the
COVID-19 vaccination comes with new concerns related to the relatively short period of time needed for
the vaccine development. The process of developing a vaccine typically takes a decade [25]. Note,
however, that the fastest vaccine development to date took four years [26], in the case of the mumps
vaccine, and that, almost forty years after the discovery of HIV, no effective vaccine has yet been
developed. However, the vaccine timelines for COVID-19 are reduced due to the emergency [25]. As of
December 18, 2020, the COVID-19 Vaccine Tracker website, maintained by the Milken Institute, showed
236 vaccines in development, 38 in clinical testing and 7 that had reached a regulatory decision.
Nevertheless, on December 8, 2020, the first vaccine was administered in the UK.

In this context, the present paper analyzes the public opinion related to the vaccination process in
the case of COVID-19, by considering the messages posted on Twitter. The period between November 9,
2020 – when Pfizer and BioNTech announced the development of a vaccine that is more than 90%
effective – and December 8, 2020 – when the vaccination process started in the UK – has been
considered. A total of 2 349 659 tweets have been collected and a cleaned dataset containing 752 951
tweets has been extracted. The performance for stance detection of several machine learning algorithms
(both classical machine learning and deep learning algorithms) has been compared on an annotated
dataset. The best performing algorithm has been selected and used for analyzing both the entire and
the cleaned datasets.

The contribution of the paper is threefold: we have collected and annotated a COVID-19 vaccination
dataset, we have determined the best performing classifier for COVID-19 vaccination stance detection
and we have related the number of tweets and the stance (i.e. in favor, against or neutral) to the
events reported by the media in the analyzed period.

The chosen approach can be easily integrated in a system which allows interested organizations to
properly monitor the public opinion regarding the vaccination process in the case of the new
coronavirus.

The remainder of the paper is organized as follows. Section 2 provides a literature review structured
in two main parts: natural language processing – focusing on sentiment analysis and stance detection
from social media messages, and recent studies analyzing public opinion based on COVID-19 data
extracted from Twitter. Section 3 describes the proposed methodology, while Section 4 focuses on the
dataset collection and annotation process. Section 5 describes the steps required for stance
detection and analyzes the performance of the classification algorithms. Section 6 presents the
dynamics of opinions in the analyzed period. The limitations of the present study are mentioned in
Section 7. The paper closes with a conclusion section and references. A series of supplementary
materials accompany the paper, in the form of the collected and annotated datasets, along with the
extracted unigrams, bigrams and trigrams for each day in the selected period.

II. LITERATURE REVIEW
In the following, a short literature review regarding sentiment analysis and stance detection is
conducted in order to underline the current approaches in the research literature. Afterwards, a
series of studies that have analyzed the public opinion, in the context of the COVID-19 pandemic,
using data extracted from Twitter are discussed.

A. SENTIMENT ANALYSIS AND STANCE DETECTION
Opinion mining is a growing area of the Natural Language Processing field commonly used to determine
viewpoints towards targets of interest using computational methods [27]. It is also known as sentiment
analysis and includes many sub-tasks, such as polarity detection – in which the goal is to determine
whether a text has positive, negative or neutral connotation [28], emotion identification – in which
the objective is to uncover specific emotions such as happiness, fear or sadness [29], subjectivity
detection – in which the goal is to determine if the text is objective or subjective [30].

Stance detection [31], [32] is an opinion mining task used in debate analysis, for determining the
opinions towards a specific target. It can be formalized as the task of identifying
the tuple < 𝑡, 𝑠 >, in which 𝑡 represents the target entity, while 𝑠 represents the opinion. The target
entity (𝑡) can be any discussion topic, including products, services, economic measures, or life
choices, such as vaccination. The opinion (𝑠) towards the target is identified as in favor, against or
neutral [27].

While similar in some respects to polarity detection, stance detection is a different natural language
processing task, given the fact that positive tweets can be against the target entity, while on the
contrary, negative tweets can sometimes express a favorable view of the target entity. Moreover, when
compared to polarity detection, stance detection always determines the agreement or disagreement in
relation to a specific target, even in cases in which the target is not explicitly mentioned in the
analyzed text [5].

The types of approaches that can be used for polarity analysis and stance detection include:
lexicon-based methods [33], machine learning methods [34] and hybrid methods – in which lexicons and
machine learning are combined [35], [36].

Lexicon-based methods rely on sentiment lexicons, such as Bing Liu’s opinion lexicon [37], MaxDiff
[38], Sentiment140 [39], VaderSentiment [40], SentiWordNet [41] or SenticNet [42], which contain words
and sequences of words, together with the polarity score, indicating the strength of the positive,
neutral or negative perception. For performing polarity detection, the sentiment lexicons are used
together with semantic methods, which typically consider negations and booster words [40]. A simple
rule-based model incorporating a sentiment lexicon, as well as grammatical and syntactical
conventions, called Vader, is proposed by Hutto and Gilbert [40]. The authors show that the proposed
model outperforms individual human raters. When compared to classical machine learning algorithms
(such as Support Vector Machines, Naïve Bayes and Maximum Entropy), the authors show that Vader offers
a better performance on the datasets collected from Twitter, Amazon reviews and NYT editorials. Given
the fact that the creation of lexicons is time consuming, Cotfas et al. [33] have shown that multiple
existing lexicons can be combined to create more comprehensive lexicons through the advantages brought
by the grey systems theory. Compared to machine learning, lexicon-based approaches have the advantage
of not requiring the collection and annotation of training data, making them preferable when the
volume or the quality of the training data is not sufficient [43], [44].

Machine learning approaches use supervised classification algorithms to extract knowledge regarding
the sentiment polarity or the stance of a text. As a preliminary step, before applying machine
learning, the text needs to be first converted into numerical vectors, using schemes such as
Bag-of-Words and word embeddings. The Bag-of-Words approach is a flexible text representation scheme
that describes the number of occurrences of words in the encoded document. As a disadvantage, this
scheme does not consider the sequence in which the words appear in the document, thus ignoring the
context in which they are used [45]. Word embeddings are a text representation approach in which each
word is mapped to a vector having the values computed in such a way that allows words which frequently
appear in similar contexts to have a similar representation [46]. The main benefit of this
representation is that additional clues become available for the classification algorithms. Another
advantage resides in the fact that the number of required dimensions is greatly reduced when compared
to a sparse vector representation, such as one-hot encoding, in which each term is represented as a
binary vector that contains only zeros, besides a single one-value, corresponding to the term's index
in the vocabulary [45]. Among the most popular word embedding techniques, one can mention: embedding
layer, Word2Vec [47], GloVe [48] and FastText [49].

Machine learning approaches include classical machine learning and deep learning algorithms.
Frequently used classical machine learning algorithms for stance detection are Support Vector Machines
(SVM) [5], [31], [50] and Naïve Bayes (NB) [5]. In the context of the “SemEval-2016 Task 6: Detecting
Stance in Tweets” [51], the SVM classifier with unigram features, used as a baseline for the
algorithms developed by the competing teams, has achieved an F-Score of 63.31. By also incorporating
word n-grams (unigrams, bigrams and trigrams) and character n-grams (with lengths {2, 3, 4, 5}) the
F-Score has increased to 68.98, higher than all the scores recorded by the algorithms proposed during
the competition [51]. D’Andrea et al. [5] have compared several classical machine learning (including
SVM and NB) and deep learning algorithms for detecting the stance towards vaccination in Italian
tweets, achieving the best results when using SVM. The approach proposed by D’Andrea et al. [5] has
constituted the basis for the current study.

Deep learning algorithms have become particularly popular in recent years for both stance detection
[31] and sentiment analysis [52]. The deep learning based techniques have predominantly used
Convolutional Neural Networks (CNN) [5], [53] and Recurrent Neural Networks (RNN) [54], [55], with
their variant Long Short-Term Memory (LSTM) [5], [56]–[58]. Zarrella and Marsh [58] have proposed an
LSTM approach that has achieved an F-Score of 67.82, one of the highest scores among the competing
teams at “SemEval-2016 Task 6: Detecting Stance in Tweets”. However, the algorithm has performed
worse than the baseline SVM n-grams algorithm.

As an alternative to RNN and CNN, Vaswani et al. [59] have proposed transformers, an attention-based
architecture, replacing the recurrent layers with multi-headed self-attention, achieving state of the
art results for machine translation [59], document generation [60] and syntactic parsing [61].
Transformer-based language models, pre-trained on large and diverse corpora of unlabeled data, such as
Generative Pre-trained Transformer (OpenAI GPT) [62] and Bidirectional Encoder Representations from
Transformers (BERT) [63] can be afterwards easily fine-tuned for a wide range of Natural Language
Processing (NLP) tasks [62], [63]. While OpenAI GPT uses a unidirectional left-to-right architecture,
BERT
relies on a bidirectional approach, providing better results on many NLP tasks, including sentiment
analysis [63].

Hybrid methods feature a combination of lexicons and machine learning algorithms. Aloufi and Saddik
[35] have performed polarity detection from football-specific tweets using several machine learning
algorithms and a sentiment lexicon automatically generated starting from a manually labeled dataset.
Even though some improvements have been noticed by the authors in comparison to using general
lexicons, the best results have been achieved by SVM with unigrams.

Comparisons between various stance analysis approaches used in social media analysis are included in
Wang et al. [31] and Mohammad et al. [51].

B. TWITTER SENTIMENT ANALYSIS ON COVID-19 DATA
In the case of epidemics, Merchant and Lurie [64] have observed that besides the role assumed by social
media of becoming the fastest channel of communication between people found in situations of social
distancing due to lockdown, social media can also act as a tool for anticipating the circumstances
related to the spread of epidemics around the world. The authors have observed a high correlation
between the information posted on Twitter regarding the evolution of an epidemic and the official data
released by the Centers for Disease Control and Prevention. As a result, the authors have concluded
that Twitter can provide real-time estimations and predictions in the case of epidemic-related
activities. Based on this research, Kaur et al. [65] have used the data extracted from Twitter to
monitor the dynamics of emotions during the first months after COVID-19 became known to the public. A
total of 16 138 tweets have been extracted and analyzed using IBM Watson Tone Analyzer. As expected,
the number of negative tweets exceeded the number of neutral and positive tweets in all the three
months considered in the paper. Comparing the sentiments extracted for June with the ones extracted
for February, it has been observed that the proportion of negative sentiments has decreased (from
43.92% to 38.05%), while the proportion of positive sentiments has increased (from 21.38% to 27.01%).
The proportion of the neutral sentiments has been almost the same (34.07% in February vs. 34.94% in
June).

The prevalence of negative sentiments over the positive ones in the case of the COVID-19 pandemic has
been also underlined by Singh et al. [66], while Boon-Itt and Skunkan [67] have recorded a high
discrepancy between the negative sentiments (covering 77.88% of tweets) and the positive sentiments
(covering the remaining 22.12%).

Xue et al. [68] have analyzed the public sentiment related to 11 selected topics determined using
Latent Dirichlet Allocation on COVID-19 tweets. The authors have concluded that fear is the most
dominant emotion in all the considered topics and that the findings are in line with other studies on
COVID-19 which state that human psychological conditions are significantly impacted by the coronavirus
outbreak [68]. On the other hand, Bhat et al. [69] found that the most prominent sentiment was positive
in the analysis conducted in their paper. The authors state that the occurrence of the positive
sentiments in 51.97% of tweets can be a sign that the users who have posted the messages are hopeful
and enjoy the socialization experience shared with the family in this period of lockdown and limited
social interaction.

At regional level, Kruspe et al. [70] have analyzed geotagged tweets in Europe regarding COVID-19
through the use of a neural network, featuring a multilingual version of BERT, which has been trained
on an external dataset, not connected to the COVID-19 outbreak. Based on their results, the authors
state that they have observed a general downward trend of the negative sentiments as time passes.

At national level, several studies have been conducted for different countries around the world. For
example, in a study conducted on tweets extracted for Nepal, Pokharel [71] observed that positive
sentiments prevailed in public opinion (58% of the tweets), while negative sentiments were expressed
in only 15% of the tweets. The study used a Naïve Bayes model applied on a limited number of tweets
(615 tweets). Barkur et al. [72] determined that in the case of the tweets from India, the positive
sentiment was dominant when analyzing the national lockdown situation announced by the government.
Similar conclusions have been reached by Khan et al. [73] in a study that used a Naïve Bayes
classifier. The difference between the reactions towards the pandemic in different cultures has been
studied by Imran et al. [74] through sentiment and emotion analysis, implemented with deep learning
classifiers. Besides the correlation between tweets’ polarity from different countries, the authors
also state that NLP can be used to link the emotions expressed on social platforms to the actual
events during the coronavirus pandemic. Samuel et al. [75] have shown insights related to the
evolution of the fear-sentiment over time in the United States.

At regional level, Zhou et al. [76] analyzed the sentiments in local government areas located in
Australia and found that the general sentiment during the COVID-19 pandemic was a positive one, but
that the positive polarity decreased as the pandemic advanced, with significant changes from positive
to negative sentiments depending on the government policies or social events. Wang et al. [77] made a
comparative analysis between the tweets posted in California and New York and concluded that
California had more negative sentiments than New York and that the fluctuation in sentiment scores
can be correlated with the severity of the COVID-19 pandemic and policy changes. Pastor [78] analyzed
the sentiment of the Filipinos located in the Luzon area and concluded that most Filipinos had
negative sentiments, most of them due to the extreme community quarantine.
[Figure 1: diagram of the steps of the analysis. Recoverable box labels: Extend the dataset, Dataset cleaning, Dataset annotation, Extract annotation, Pre-processing, Representation, Stance classification (Neutral, Against), Trend Analysis.]
Some other analyses on Twitter in the context of COVID-19 have focused on, but have not been limited
to: topical sentiment analysis regarding the use of masks [79], monitoring depression trends [80],
sentiment dynamics related to cruise tourism [81], identifying discussion topics and emotions [82],
thematic analysis [83], detecting misleading information [84].

As shown above, the prominent sentiments related to COVID-19 have been found to be either positive or
negative. The expressed sentiments have been shown to depend on the geographic area, government
decisions and number of recorded cases. A more in-depth analysis related to the studies on sentiment
analysis featuring COVID-19 and other infectious diseases can be found in Alamoodi et al. [3].

In this context, the present paper aims to analyze the stance of the Twitter users in connection to
the new upcoming vaccines for COVID-19 in the first month after Pfizer and BioNTech announced their
results on the new vaccine. The methodology and data collection process are presented in the
following sections.

The steps taken in order to analyze the public’s opinion regarding COVID-19 vaccination from social
media messages are shown in Figure 1.

The initial step is to collect a COVID-19 vaccination stance dataset containing English language
tweets. A randomly sampled subset from this dataset has been afterwards manually annotated as neutral,
in favor, or against vaccination, in order to be used in the training phase of the stance
classification algorithms.

Given their unstructured nature and informal writing style, in the following step, the tweets from the
collected dataset have been pre-processed, with the purpose of improving the performance of the stance
classification algorithms.

For text representation and classification, four approaches have been investigated: 1) Bag-of-Words
representation followed by classical machine learning, 2) Word embeddings followed by classical
machine learning, 3) Word embeddings followed by deep learning and 4) Bidirectional Encoder
Representations from Transformers.

In order to determine the best performing classification algorithm, the text has been represented
using both Bag-of-Words and word embeddings schemes. In the present paper, the performance of multiple
classical machine learning and deep learning algorithms has been evaluated based on the following
widely used metrics: Accuracy, Precision, Recall and F-score. Accuracy, which indicates the ratio of
correctly predicted observations to the total observations, is defined as shown in (1), in which TP,
TN, FP and FN refer to true positive, true negative, false positive and false negative. Thus, TP
represents the number of real positive tweets classified as positive, FP is the number of real
negative tweets incorrectly classified as positive, TN represents the number of negative tweets
correctly classified as negative and FN is the number of real positive tweets incorrectly classified
as negative.

    Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (1)

Precision, which represents the ratio of correctly predicted positive observations to the total
predicted positive observations, is computed as shown in (2).

    Precision = \frac{TP}{TP + FP}    (2)

Recall, representing the ratio of correctly predicted positive observations to all the observations in
the actual class, is computed as shown in (3).

    Recall = \frac{TP}{TP + FN}    (3)

Starting from Precision and Recall, the F-score can be computed as their harmonic mean, as shown in
(4).

    F\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}    (4)

Finally, the best performing algorithm has been used to analyze the evolution of the public stance
towards vaccination
in the considered period. The evolution has been correlated with the major events and news that have
followed the announcement of the Pfizer and BioNTech vaccine results.

A machine learning approach has been chosen for detecting the stance of the tweets, which requires a
labeled dataset for training the classification models. Since we have not identified an already
labeled dataset for stance towards COVID-19 vaccination in the scientific literature, a
domain-specific dataset, having Twitter as a data source, has been collected and manually annotated.
It should also be mentioned that, according to [31], there is a general lack of annotated corpora for
stance detection.

A. DATASET COLLECTION
Several public datasets including large-scale collections of tweets related to the coronavirus pandemic
have been proposed in the scientific literature, including the ones presented in [85]–[88]. Some of
the datasets, such as [86], [88], are multi-lingual, while others, such as [85], [87], are language
specific, including only tweets written in English.

In order to collect a dataset centered around COVID-19 vaccination, a hybrid approach has been chosen,
in which the tweets that we have fetched through the Twitter API for the keywords in Table 1 have been
supplemented with the ones in the dataset described in [86], selected using the same keywords.

Table 1. Set of keywords used to fetch tweets

Topic    | Keywords
covid-19 | covid19, covid-19, coronavirus, coronaoutbreak, coronaviruspandemic, wuhanvirus, 2019nCoV
         | vaccine, vaccination, vaccinate, vaccinating, vaccinated

Gathering the tweets from the Twitter API has been performed through the Twitter Filtered Stream API,
with the help of the TweetInvi library.

While the approach proposed in this paper can be extended to other languages, in the present study only
tweets written in English have been considered. Thus, between November 9 and December 8, a total of
2 349 659 tweets concerning the topic of COVID-19 vaccination have been identified.

B. DATASET ANNOTATION
To ensure the quality of the annotated dataset, which will be used for training the machine learning
algorithms, duplicated tweets have been discarded, as well as retweets. The retweets have been easily
identified due to the presence of the “RT” symbol. This choice is in accordance with the approach from
other studies, including, but not limited to [5], [35]. The remaining number of tweets in the cleaned
dataset is 752 951, representing 32.04% of the initial dataset. Table 2 includes, for each day in the
considered period, both the total number of tweets and the remaining number of tweets after the
duplicates and retweets have been eliminated.

From the cleaned dataset we have randomly selected and manually annotated 7530 tweets, representing
approximately 1.00% of all the tweets in the dataset. The number is higher than the one used in other
stance detection approaches, such as D’Andrea et al. [5] and Mohammad et al. [89]. D’Andrea et al. [5]
have trained the algorithms on a manually labeled dataset containing 693 tweets. The dataset proposed
by Mohammad et al. [89] is organized on several topics, with the largest topic numbering 984 tweets.

In the present approach, the stance of the tweets towards vaccination has been assigned by three
independent human raters to one of three classes: in favor, against and neutral. Disagreements between
the annotators have only been recorded between the in favor and neutral or between the neutral and
against stances. No disagreement has been recorded between in favor and against annotations. In the
case of disagreement, the class chosen by most annotators has been associated with the tweet.

The distribution of the tweets in the annotated dataset in the three considered categories is
illustrated in Table 3.

Tweets that have been assigned to the class in favor express a positive opinion regarding the
vaccination. Tweets belonging to the against vaccination class express a negative opinion towards
COVID-19 vaccination. The neutral class
Table 2. Number of vaccine related tweets published in the considered period

Date    | Nov. 9  | Nov. 10 | Nov. 11 | Nov. 12 | Nov. 13 | Nov. 14 | Nov. 15 | Nov. 16 | Nov. 17 | Nov. 18
Tweets  | 57 265  | 109 839 | 85 901  | 75 341  | 72 512  | 41 387  | 41 335  | 159 611 | 79 275  | 110 131
Cleaned | 56 768  | 29 693  | 22 583  | 19 690  | 19 854  | 11 103  | 11 873  | 38 072  | 23 814  | 28 148

Date    | Nov. 19 | Nov. 20 | Nov. 21 | Nov. 22 | Nov. 23 | Nov. 24 | Nov. 25 | Nov. 26 | Nov. 27 | Nov. 28
Tweets  | 58 878  | 59 674  | 37 816  | 44 717  | 99 690  | 62 186  | 44 621  | 35 807  | 55 701  | 52 977
Cleaned | 19 530  | 21 036  | 12 733  | 13 940  | 31 514  | 22 300  | 15 246  | 12 236  | 15 425  | 13 469

Date    | Nov. 29 | Nov. 30 | Dec. 1  | Dec. 2  | Dec. 3  | Dec. 4  | Dec. 5  | Dec. 6  | Dec. 7  | Dec. 8
Tweets  | 35 497  | 82 639  | 51 407  | 154 004 | 135 162 | 97 410  | 69 856  | 64 438  | 122 134 | 216 822
Cleaned | 11 704  | 21 697  | 19 719  | 49 589  | 45 288  | 32 345  | 21 428  | 20 104  | 31 851  | 60 199
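
The cleaning step behind the "Cleaned" row of Table 2 (discarding retweets, identified through the "RT" marker, and duplicated tweets) could be sketched as follows. This is a minimal illustration in Python and not the implementation used in the study: the function name clean_tweets, the rule that two tweets are duplicates when they match after lowercasing and collapsing whitespace, and the sample tweets are all assumptions.

```python
import re


def clean_tweets(tweets):
    """Drop retweets (texts starting with the "RT" marker) and duplicates.

    Assumption: two tweets count as duplicates when they are identical after
    lowercasing and collapsing whitespace; the paper does not state the rule.
    """
    seen = set()
    cleaned = []
    for text in tweets:
        # Retweets conventionally start with "RT", e.g. "RT @user: ...".
        if re.match(r"\s*RT\b", text):
            continue
        # Normalized key used only for duplicate detection.
        key = re.sub(r"\s+", " ", text).strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


# Hypothetical sample: one retweet, one near-duplicate pair, one unique tweet.
sample = [
    "RT @user: Vaccine news today",
    "All I want for Christmas is covid19 vaccine.",
    "All I want for  Christmas is covid19 vaccine.",
    "#Covid19 vaccine is here",
]
cleaned = clean_tweets(sample)
share = 100 * len(cleaned) / len(sample)
print(cleaned)
print(f"{share:.2f}% of the sample kept")
```

Keeping the first occurrence of each duplicated text preserves the original order of the tweets, which matters when the cleaned data is later grouped by day.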
mainly includes news related to the development of vaccines, tweets that do not express a clear
opinion, such as questions regarding the vaccine, informative tweets concerning vaccination, as well
as off-topic tweets, many of them related to the 2020 presidential election in the United States,
which was held nominally, just a few days before the analyzed period, on November 3, 2020. Several
examples of manually labeled tweets belonging to the three categories are included in Table 4.

Table 3. Statistics for the manually annotated dataset

Class    | Number | Percent
against  | 1083   | 14.38%
neutral  | 5188   | 68.90%
in favor | 1259   | 16.72%
Total    | 7530   | 100.00%

The n-grams and balanced annotated dataset are available at the following link:
vaccination-stance-detection

V. COVID-19 VACCINATION STANCE DETECTION
The main components of the stance detection process are the pre-processing, the feature extraction and
the machine learning classification. In the pre-processing step the text is cleaned, while in the
feature extraction the raw textual data is converted to feature vectors. The classical machine
learning and deep learning classifiers that have been compared in this paper are described within
this section.

A. PRE-PROCESSING
Given the fact that social media messages are frequently written using a casual language, a
pre-processing step has been used in order to prepare the tweets in the annotated dataset for training
the machine learning classifiers. This step is considered crucial by D’Andrea et al. [5] for the
success of the entire system, while Bao et al. [90] provide a comprehensive discussion regarding the
importance of pre-processing in social media analysis. The impact of the different pre-processing
steps, such as the removal of links, on the performance of classical machine learning classifiers has
been discussed by Jianqiang and Xiaolin [91].

During this pre-processing step, all the user mentions, easily identified through the presence of the
@ symbol at the beginning of the message, have been normalized, since they do not provide any useful
information for the classification process. All the links and email addresses have been normalized as
well. The emoticons have been replaced with the corresponding words. Minor spelling mistakes have been
automatically corrected to improve performance. Contractions and hashtags have been unpacked, while
elongated words have been corrected and annotated. Finally, all the letters have been converted to a
lowercase representation. The pre-processing has been implemented with the help of the ekphrasis
library [92]. Additional processing has been performed through the Natural Language Toolkit (NLTK)
library [93] and the “re” Python module.

B. FEATURES
In order to use machine learning algorithms for text classification, the text content has to be first
converted into numerical feature vectors. The Bag-of-Words (BoW) scheme converts the text to a
numerical representation, having as a starting point the frequency of the words. Given a vocabulary
Table 4. Examples of manually labeled tweets

Stance   | Tweet
against  | WHY would i take a vaccine for a bug that the CDC says i have a 99% chance against, with just my God given immune system? I have a bigger chance of dying from a car accident or a bee sting.
against  | Do you want to be a guinea pig for a virus that 97% of people survive? Why is there even a vaccine for an illness with such a high survival rate?
against  | No way!!! Over my dead body, I won’t have a vaccine in my body, why would I.. I’m healthy!!!! #NoVaccine #FreedomFirst
against  | People cheering for the nano tech infested vaccines. Humanity has reached a level of stupidity that is truly mind blowing
neutral  | Italy will receive an initial 3.4 mln shots of the Pfizer coronavirus vaccine,probably in January,a government source told Reuters on Tuesday. The country will be allocated some 13.6% of the first 200 million doses made available to Europe, the source
neutral  | So how does the Pfizer Covid-19 vaccine injection work? Do they inject it at -90 degrees F. ?
neutral  | #Covid19 vaccine is here
neutral  | Airlines scramble to prepare for ultra-cold COVID-19 vaccine distribution
in favor | Omg 90% effective coronavirus vaccine ???? Light at the end of the tunnel ! so you’re telling me there a CHANCE that I can actually walk graduation ✨
in favor | All I want for Christmas is covid19 vaccine.
in favor | Kudos to all the scientists and medical experts working tirelessly to produce a safe and productive vaccine to defeat COVID-19
in favor | I want someone to find a damn vaccine or cure for this STUPID VIRUS man. Tired of seeing it affect families and kill ppl. Tired of walking past the ICU of my job seeing folks fighting for their lives. Sick of these lingering effects from me having it. SICK OF COVID-
$V = \{w_1, \dots, w_N\}$, containing $N$ tokens, denoted using $w_i$, a tweet, or any other textual document $d$, belonging to a corpus $D$, can be represented using a feature vector $X = \{x_1, \dots, x_N\}$, in which $x_i$ can either represent a binary variable that indicates whether the word $w_i$ appears in the text or a numeric variable indicating the number of times the word $w_i$ appears in the text.
Given the fact that very frequent words can sometimes carry little “informational content”, the performance of classification algorithms that rely on word frequencies can be improved using a more complex feature representation, called Term Frequency - Inverse Document Frequency (TF-IDF), that reduces the weight associated with words that frequently appear in all the documents in the corpus. TF-IDF is computed as shown in (5):

$$\text{TF-IDF}(w_i) = \text{TF}(w_i) \times \log \frac{|D|}{\text{DF}(w_i)} \qquad (5)$$

where $\text{TF}(w_i)$ represents the number of appearances of the word $w_i$, $|D|$ stands for the number of documents and $\text{DF}(w_i)$ is the number of documents containing the term $w_i$. The TF-IDF statistical measure is used throughout the present study for feature representation.
By only focusing on the number of times a word occurs in a given text, the Bag-of-Words approach does not provide any information regarding the succession of the words. This issue can be addressed if the n-gram language model is used, in which the text is represented through sequences of n consecutive words. Common types of n-grams include grams of size one, called unigrams (1-grams), grams of size two, called bigrams (2-grams), and grams of size three, called trigrams (3-grams) [94].
In the present study, various combinations of unigrams, bigrams and trigrams have been considered as features for the machine learning algorithms, as shown in Table 5.

Table 5. N-gram combinations
N-gram model | Types of n-grams
(1-1) | unigrams
(1-2) | unigrams + bigrams
(1-3) | unigrams + bigrams + trigrams
(2-3) | bigrams + trigrams
(3-3) | trigrams

Besides the Bag-of-Words representation, word embeddings have been used. In word embeddings the words are mapped to vectors, with similar representations for the words which frequently appear in the same context. Compared to one-hot encodings, word embeddings provide a denser representation that requires a smaller number of dimensions for representing the words. The similar representation of words with close meanings provides additional clues for the classification algorithms. The following word embeddings have been considered in the present study: Datastories³, GloVe⁴ and FastText⁵.

C. LEARNING ALGORITHMS
A machine learning approach has been used in order to accurately determine the stance towards vaccination in the collected tweets. Starting from the annotated dataset, the performance of several popular classification algorithms has been investigated: Multinomial Naive Bayes (MNB), Random Forest (RF), Support Vector Machine (SVM), Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN).

1) MULTINOMIAL NAIVE BAYES
Naive Bayes classifiers are a family of probabilistic classification algorithms that apply the Bayes theorem. They are called naive because they perform the classification under the strong assumption that every feature is independent of the other features, given the class. Despite their simplicity, this family of algorithms has been demonstrated to be fast, reliable and accurate in many NLP classification tasks [95]. The Multinomial Naive Bayes [96] classifier implements a variant of the Naive Bayes algorithm which can be used with multinomially distributed data, such as the frequencies of n-grams in text classification problems.

2) RANDOM FOREST
Random Forest (RF) [97] is an ensemble classifier that consists of multiple decision tree classifiers, trained in parallel on bootstrap samples, whose predictions are then aggregated (bagging). According to Misra and Li [98] the RF classifier offers better results when compared to other classification methods in terms of accuracy and does not require feature scaling. Furthermore, the RF classifier has been determined to be more robust to the selection of training samples. Even though the RF might be hard to interpret, its hyperparameters can be tuned more easily than in the case in which a decision tree classifier is used [98].

3) SUPPORT VECTOR MACHINE
Support Vector Machines (SVM) [99] are a family of supervised learning algorithms used for classification, regression and other tasks such as outlier detection. While other classification algorithms may suffer from overfitting, one of the advantages of SVM is that they are less prone to this situation [100]. Another advantage resides in the fact that, besides binary classification, multiclass classification can be performed by combining several binary classification functions. For this, each class is considered in turn, and for each class a classifier is sought that separates it from the other classes (a one-vs-rest scheme) [101].
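As an illustration of the feature extraction described in Section V.B, the following minimal Python sketch builds n-gram Bag-of-Words counts and weights them according to (5). It is not the implementation used in the study: the `tokenize`, `ngrams` and `tfidf_vectors` helpers and the toy corpus are assumptions introduced only for this example, and library implementations typically apply smoothing, so their values may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max, e.g. (1, 2) = unigrams + bigrams."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf_vectors(corpus, n_min=1, n_max=2):
    """Compute TF-IDF(w) = TF(w) * log(|D| / DF(w)) for every document, as in (5)."""
    docs = [Counter(ngrams(tokenize(text), n_min, n_max)) for text in corpus]
    df = Counter()
    for counts in docs:
        df.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for counts in docs
    ]


# Toy corpus (made up for illustration, not taken from the dataset).
corpus = [
    "the vaccine is here",
    "the vaccine is safe",
    "no vaccine for me",
]
vectors = tfidf_vectors(corpus, n_min=1, n_max=2)
```

In this toy corpus the term "vaccine" occurs in every document, so its weight is zero, while a term that occurs in a single document, such as "here", receives the largest weight. The `(n_min, n_max)` pair plays the role of the n-gram models listed in Table 5.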
maximum number of features to $F = 2000$. For the C1-C6 classifiers, the best results have been achieved when using TF-IDF, without excluding the general stop words.
In the case of the Multinomial Naive Bayes classifier (C1 and C2) the best results have been achieved for the C1 classifier, for which the maximum number of features has been reduced to $F = 3000$, while including both unigrams and bigrams as features and keeping maxDF = 1.0.
It has been observed that the Random Forest classifier (C3 and C4) performed best when using only unigrams, without limiting the number of features, while applying a frequency threshold for corpus-specific stop words, maxDF, of 0.5, namely in the case of C3.
In the case of the Support Vector Machines classifier (C5 and C6), the best results have been achieved in the case of C5, configured with the n-gram model (1, 2), without limiting the maximum number of features, with an alpha parameter value of 0.0001, a maxDF threshold equal to 1.0 and “elasticnet” as the regularization term.
As expected, C1, C3 and C5, for which the parameters have been determined through grid search, have performed better than the corresponding classifiers of the same type, C2, C4 and C6.
The overall best performing classifier has been C5, an SVM classifier which had 76.23% accuracy, followed by C6, with 74.20% accuracy. In terms of precision and F-score, C5 outperformed all the other classifiers for each of the three considered classes, in favor, against and neutral. A small difference is recorded in the case of recall, where the value for the neutral class is slightly lower for C5 than for C6 (76.18% versus 77.84%).
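The comparison above relies on accuracy and on per-class precision, recall and F-score. A minimal sketch of how these metrics can be computed for the three stance classes is given below. It assumes the standard definitions of the metrics; the `evaluate` helper and the label lists are hypothetical and do not reproduce the values in Table 6.

```python
def evaluate(y_true, y_pred, labels=("in favor", "against", "neutral")):
    """Return overall accuracy and per-class precision, recall and F-score."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f_score": f_score}
    return accuracy, report


# Made-up labels for illustration only (not results from the paper).
y_true = ["neutral", "neutral", "in favor", "against", "neutral", "in favor"]
y_pred = ["neutral", "in favor", "in favor", "against", "neutral", "neutral"]
accuracy, report = evaluate(y_true, y_pred)
```

Reporting the metrics per class, as done in the text, makes it visible when a classifier with a higher accuracy still has a lower recall on one class, as is the case for C5 and C6 on the neutral class.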
The worst performing classifier has been C4, an RF classifier, with an accuracy of 70.79%. In terms of precision and recall, the classifier C4 performed worse than C5 and C6 on all three classes, in favor, against and neutral.

2) WORD EMBEDDINGS AND CLASSICAL MACHINE LEARNING
Starting from the algorithm that has provided the best results in the context of the Bag-of-Words approach, C5, in the following we have analyzed whether the performance can be further improved by considering pre-trained word embeddings. Similar approaches, using word embeddings with classical machine learning algorithms, have been investigated in [5] and [110].
To this end, a word embedding called glove.6B, which includes six billion tokens and has been created through the GloVe approach from a corpus extracted from Wikipedia and from the news archive Gigaword [111], has been used. This implementation is marked in Table 6 as C7.
As shown in Table 6, the values of all four considered metrics (precision, recall, F-score and accuracy) of the C7 classifier are worse than those achieved in the case of C5.

3) WORD EMBEDDINGS AND DEEP LEARNING
In the case of the deep learning classifiers, in the present paper, the Adam approach has been applied for tuning the learning rate [112]. The resulting classifiers are listed in Table 6 as C8-C13.
As shown in Table 6, among the Bi-LSTM classifiers (C8-C10), the best results have been achieved by the C9 classifier, with an accuracy of 74.70%, higher than in the case of C8 (73.41%) and C10 (68.36%). The C9 classifier has used the word embeddings created through the GloVe approach from a corpus composed of 2 billion tweets.
In the case of the CNN classifiers (C11-C13) the best results have been achieved by the C13 classifier, using the word embedding created through the FastText approach on a corpus extracted from Wikipedia and news stories (accuracy 69.01%).
Classifiers C8 and C11, ranked second in the Bi-LSTM category (73.41%) and third in the CNN category (65.71%) based on accuracy, have used the Datastories word embeddings, created from a corpus of 330 million tweets by applying the GloVe approach.
The best performing deep learning classifier has been C9, implementing Bi-LSTM, which outperforms the C8 and C10-C13 classifiers, both in terms of accuracy and F-score.

4) BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS
In order to establish the best values for the hyperparameters of the BERT language model (C14), the approach recommended by Devlin et al. [63] has been followed during the fine-tuning procedure regarding the batch size (16, 32), the learning rate (5e-5, 3e-5, 2e-5) and the number of epochs (2, 3, 4). The best results have been achieved when using a batch size of 16, a learning rate of 3e-5 and a number of epochs equal to 3. Having an accuracy of 78.94%, the C14 classifier outperforms all the other classifiers. Moreover, it clearly outperforms the second-best performing classifier, C5, in terms of precision, recall and F-score, for all the considered classes.

E. DISCUSSION
The results achieved by the deep learning classifier C9 are worse than the ones obtained in the case of the classical machine learning classifier C5 in terms of accuracy and F-score. This result is consistent with the ones in other studies, such as D’Andrea et al. [5], in which classical machine learning algorithms have outperformed deep learning approaches, such as CNN and LSTM, in the case of vaccine stance classification.
As noted in the review paper of Wang et al. [31], stance detection approaches typically do not perform extremely well. The reasons mentioned by the authors include the sparsity, the colloquial language and the absence of large, labeled datasets that could be used for training. Moreover, Mohammad et al. [51] summarize the results of the “SemEval-2016 Task 6: Detecting Stance in Tweets”, mentioning that the SVM baseline with n-grams has performed relatively well compared to other machine learning approaches.
In the following we have used the best performing classifier, C14, to analyze the tweets collected over the considered period of time. The model has been trained on all the tweets in the annotated dataset.

The evolution of the daily number of tweets is discussed in this section in connection with the major events which have occurred around the world related to COVID-19 vaccination, with an accent on the English-speaking countries.

The major events have been extracted from the news published online on each day of the analyzed period using a search engine, by selecting the “News” section and the “COVID” keyword and by selecting one by one the days in the mentioned period. Each time, the first 10 pages of News titles have been considered and the most relevant news have been extracted in connection to the COVID-19 vaccination theme, relevance being given by the connection to the COVID-19 vaccination and the amount of news on a specific topic.
As a result, it has been observed that on all the analyzed days there have been news regarding the COVID-19 vaccination theme, starting from the announcement of the vaccine effectiveness by different producers, the amount of money funded by various organizations for the COVID-19 vaccine, adverse events encountered in the pre-test phase, ethical issues related to who should have first access to the vaccine, the predicted quantity of vaccines to be distributed in different
countries and areas and ending with the vaccination in Russia and UK.
The following events have been put in connection with the number of tweets recorded daily, as they might have determined the variation in the number of tweets:

E1. Nov. 9: Pfizer and BioNTech announcement regarding their COVID-19 vaccine effectiveness⁹
E2. Nov. 10: Positive news regarding stock trading, oil futures and cruise bookings rise as a result of COVID-19 vaccine¹⁰ ¹¹
E3. Nov. 13: World Health Organization exceeded the target of $2 billion to buy and distribute COVID-19 cures to poorer countries¹²
E4. Nov. 16: Moderna's COVID-19 vaccine shows 94.5% efficacy in clinical trials¹³
E5. Nov. 18: Sinovac's COVID-19 vaccine induces a quick immune response¹⁴
E6. Nov. 20: Pfizer’s announcement regarding COVID-19 vaccine emergency authorization¹⁵
E7. Nov. 23: Oxford AstraZeneca COVID-19 vaccine shows an up to 90% efficacy¹⁶
E8. Nov. 27: UK hospitals start preparing for the arrival of the COVID-19 vaccine in 10 days' time¹⁷
E9. Nov. 30: Moderna seeks approval for the COVID-19 vaccine in Europe and United States¹⁸
E10. Dec. 2: UK authorizes the Pfizer BioNTech COVID-19 vaccine¹⁹
E11. Dec. 3: The first batch of vaccines arrived in UK²⁰
E12. Dec. 8: UK starts COVID-19 vaccination²¹

In order to validate the correspondence between the events and the analyzed tweets we have extracted, for each date in the analyzed period, the unigrams, bigrams and trigrams sorted according to the number of appearances. The analysis has been performed for both the cleaned dataset and the whole dataset, which also includes the retweets. Before the n-gram extraction, the tweets have been minimally pre-processed by removing stop words and duplicated white spaces.
From the events presented above, we have selected two events, one that has generated a large number of tweets, namely E10, and another one that has generated a comparatively smaller number of tweets, namely E6.
Analyzing the n-grams for the 154 004 tweets collected for December 2, the day of E10, it has been observed that among the top-15 unigrams, besides the specific COVID-19 terms (e.g. “vaccine”, “covid”, “19”, “coronavirus”, “covid19”, “vaccines”), on this day “Pfizer” has been mentioned 57 342 times, followed by “UK” (48 789 times), “first” (39 438 times), “BioNTech” (30 993 times) and “approve” (18 949 times). Among the top-10 bigrams and trigrams, the following word combinations occur: “Pfizer BioNTech” (29 714 times), “first country” (18 085 times) and “approve Pfizer” (15 453 times). Considering the extracted unigrams, bigrams and trigrams and the E10 event, the UK authorization of the Pfizer and BioNTech COVID-19 vaccine, it can easily be noted that there exists a correspondence between E10 and the analyzed tweets from December 2.
On November 20, 59 674 tweets have been collected. From the top-15 unigram analysis, the following have been extracted: “Pfizer” (17 639 times), “emergency” (13 466 times), “authorization” (7251 times), “fda” (6622 times) and “BioNTech” (5347 times). As for the top-10 bigrams and trigrams, the word combinations have been: “emergency use” (9613 times), “use authorization” (4769 times), “emergency use authorization” (4768 times) and “Pfizer BioNTech” (4583 times). It can thus be observed that even in the case of a less significant event (“Pfizer’s announcement regarding COVID-19 vaccine emergency authorization”) the correspondence between the tweets and the event exists.

B. STANCE ANALYSIS
In the following, the best performing classifier determined in Section V, BERT (C14), is used to perform stance analysis on the gathered dataset. As will be observed, not all the news published in the analyzed period have generated the same amount of interest from the general public, a series of local peaks being identified on some of the analyzed days.
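The daily evolution discussed in this section can be obtained by aggregating the classifier output per day and per stance. The sketch below is only an illustration under stated assumptions: the `daily_stance_counts` and `local_peaks` helpers, the definition of a local peak as a day whose count is strictly higher than that of both neighbouring days, and the sample predictions are all hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def daily_stance_counts(predictions):
    """Aggregate (date, stance) pairs into {date: Counter({stance: n})}."""
    counts = defaultdict(Counter)
    for date, stance in predictions:
        counts[date][stance] += 1
    return dict(counts)


def local_peaks(series):
    """Return the dates whose value is strictly greater than both neighbours."""
    dates = sorted(series)
    return [
        dates[i]
        for i in range(1, len(dates) - 1)
        if series[dates[i]] > series[dates[i - 1]]
        and series[dates[i]] > series[dates[i + 1]]
    ]


# Made-up classifier output; ISO dates sort chronologically as strings.
predictions = (
    [("2020-12-01", "neutral")] * 3 + [("2020-12-01", "against")] * 1
    + [("2020-12-02", "neutral")] * 6 + [("2020-12-02", "in favor")] * 2
    + [("2020-12-03", "neutral")] * 4 + [("2020-12-03", "against")] * 1
)
counts = daily_stance_counts(predictions)
totals = {date: sum(c.values()) for date, c in counts.items()}
peaks = local_peaks(totals)
```

With this definition the first and the last day of the period can never be reported as peaks, since they have a single neighbour; a different rule would be needed if the boundary days should be included.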
9 than-90percent-effective-in-preventing-infection.html (accessed December 9, 2020)
10 coronavirus-vaccine-news-norwegian-cruise-line-ceo-says.html (accessed December 9, 2020)
11 amid-hopes-for-coronavirus-vaccine-idINL4N2HX0O6 (accessed December 9, 2020)
12 idUKKBN27T138 (accessed December 9, 2020)
13 (accessed December 9, 2020)
14 sinovac/sinovacs-covid-19-vaccine-induces-quick-immune-response-study-idUKKBN27X35I (accessed December 9, 2020)
15 authorization-covid-19-vaccine-christmas/ (accessed December 9, 2020)
16 its-coronavirus-vaccine-has-70-per-cent-efficacy-covid-oxford-university (accessed December 9, 2020)
17 told-prepare-early-december-covid-vaccine-rollout-nhs (accessed December 9, 2020)
18 (accessed December 9, 2020)
19 technology/2020/12/01/britain-becomes-the-first-country-to-license-a-fully-tested-covid-19-vaccine (accessed December 9, 2020)
20 (accessed December 9, 2020)
21 (accessed December 9,
1) CLEANED TWEETS STANCE ANALYSIS
The evolution of the stance expressed and the distribution of the stances over the three considered categories (in favor, against and neutral) are depicted in Figure 2, which considers only the cleaned tweets. By simply considering the evolution of the stances, one can easily observe that there have been variations in the number of tweets published, especially in the days following a major announcement or news.
From Figure 2 it can be observed that the dominating stance is neutral, as was expected from the initial stage, in which most of the annotated tweets belonged to this category. Based on the tweets in the neutral category, it has been noticed that most of them present news related to the occurrence of the COVID-19 vaccine.
As for the number of against and in favor tweets, it can be observed that they have oscillated during the analyzed period. Based on Figure 2 it can be observed that between November 9 and December 1 the number of against tweets has kept a constant trend, with a few “spikes” on November 9, November 16 and November 23. On all of these days, the news released in the media reported on the efficacy of COVID-19 vaccines in clinical trials: for Pfizer and BioNTech on November 9 (event E1), for Moderna on November 16 (E4) and for Oxford AstraZeneca on November 23 (E7). Two major turning points for the increase in the number of against tweets have been December 2 and December 8, these being the day on which UK authorized the Pfizer BioNTech COVID-19 vaccine (E10, 6242 against tweets) and the day on which the COVID-19 vaccination started in the UK (E12, 8429 against tweets).
Considering the evolution of the graph containing the cleaned tweets from all three categories, it can be observed that the number of against tweets follows on a smaller scale the evolution of the number of neutral tweets (the curve of the against tweets being flatter than in the case of the neutral tweets), while the in favor tweets follow more precisely the trend imposed by the number of neutral tweets (Figure 2; Figure A1.1 and Figure A1.2 in Annex 1). As a result, for the in favor tweets one can observe an increase in most of the major events presented above. Moreover, in the events announcing the effectiveness of the COVID-19 vaccines from different companies (E1, E4, E5 and E7) the number of in favor tweets has exceeded the number of against tweets. Even after the vaccine authorization in UK, the situation has not changed and the number of in favor tweets has exceeded daily the number of against tweets (by approximately 2050 tweets per day, on average).

2) ALL TWEETS STANCE ANALYSIS
As the numbers of in favor and against tweets in the cleaned dataset have close values during the analyzed period, a stance analysis considering the dataset of all tweets has been performed.
In the study conducted on vaccination in Italian tweets, D’Andrea et al. [5] considered that one should analyze the whole tweets dataset (including retweets), as some of the users who retweet a certain opinion or piece of information generally believe in it and, instead of writing their own words about a particular situation, might decide to share the information. As a result, we have run the stance analysis over the entire tweet dataset and the results are presented in Figure 3.
As can be observed from Figure 3, on November 9, when Pfizer and BioNTech announced their vaccine effectiveness, the number of retweets differed only slightly from the number of tweets (a difference of 497 tweets). On November 10, the number of cleaned tweets has been half of that on November 9, while the total number of tweets doubled, showing a high increase in the number of retweets. As the events marking November 10 were mostly referring to the economic impact of a possible COVID-19 vaccine, such as the rise of oil futures and the announcement of a growth in the stock market, in general, an increase in the in favor tweets over the against tweets can be observed (which has not been observed in the cleaned tweets dataset).
Even in this case, one can notice that the evolution of the number of in favor tweets more closely resembles the evolution of the neutral tweets. The increase in the number of tweets marked as in favor is more visible on the days of the events E1-E12 mentioned above, more precisely in the cases of E2, E4, E5, E7, E10 and E12 (Figure A2.1 in Annex 2), than on the other days of the analyzed period. For the against tweets, except for the increase observed in the period following E10 (Figure A2.2 in Annex 2), the evolution is almost constant, recording an average number of against tweets of approximately 6826 tweets/day. After E10, the number of daily against tweets recorded an increase of 95.28%.
Comparing the number of tweets recorded in the three categories for the period between December 2 and December 8 (E10-E12) with the period between November 9 and December 1 (E1-E9), it can be observed that the highest relative increase has been recorded in the against tweets category (95.28%), followed by neutral (87.86%) and in favor (54.86%). The smaller difference noticed between the number of tweets in the against category compared to the in favor category can be attributed to the fact that starting from December 2 (E10) the potential of having a vaccine was no longer a “dream” but became “reality”, which might have “activated” the anti-vaccination community.

of relative distribution, one can note that the percentage of the neutral tweets in the total number of tweets has recorded an increase of 17 percentage points (from 56% for E1 to 73% for E12) when all the tweets are considered and an increase of 7 percentage points on cleaned tweets (from 56% for E1 to 63% for E12), as shown in Figure 4.
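The comparisons in this section mix two kinds of quantities: relative increases in absolute counts and changes in shares that are already expressed in percent. The following sketch separates the two. The helper names are assumptions introduced for this example, while the input values are the counts and shares quoted in this section for E1 and E12 on the whole dataset.

```python
def relative_increase(before, after):
    """Relative change, in percent, from `before` to `after`."""
    return (after - before) / before * 100.0


def percentage_point_change(share_before, share_after):
    """Difference between two shares that are already expressed in percent."""
    return share_after - share_before


# Absolute counts quoted in the text for the whole dataset (E1 vs. E12).
in_favor_growth = relative_increase(17_526, 44_447)
against_growth = relative_increase(7_604, 14_359)

# Neutral share quoted in the text: 56% on E1 and 73% on E12 (all tweets).
neutral_shift = percentage_point_change(56, 73)
neutral_relative = relative_increase(56, 73)
```

A share that moves from 56% to 73% rises by 17 percentage points, which corresponds to a relative increase of roughly 30%; keeping the two measures apart avoids reading one as the other.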
percentage dropped from 31% to 20%, while in absolute value, the number of in favor tweets more than doubled, from 17 526 tweets for E1 to 44 447 tweets for E12 (with 27 487 tweets for E10).
A decrease in the percentage of tweets from the entire sample can also be observed in the case of against tweets, even though, in this case, the decrease is of only 6 percentage points (from 13% to 7%). Considering the cleaned tweets, the percentage of the against tweets has increased by 1 percentage point (between E1 and E10-E12). Even in this case, the absolute number of against tweets in the whole dataset increased from 7604 tweets on E1 to 12 541 on E10 and to 14 359 on E12.
Another observation can be made in connection to E1 and E10: in the case of the against tweets, the increase in the number of these tweets happened mostly in the days following the events, not being so visible on the day of the event. This observation is in line with the study conducted by D’Andrea et al. [5], in which the authors have shown that sometimes the effect of an event is immediately visible, while in other cases it might need some hours or a day in order to be visible in the tweets. A possible explanation for this lag in response might be that the users posting against tweets needed some time to look for information related to the event prior to tweeting.

4) CUMULATIVE STANCE ANALYSIS
As expected, the predominant stance of the cumulative set is neutral both in the entire set (Figure 5) and in the cleaned set. The percentages of the against and in favor tweets reach the same value in the case of cleaned tweets, even though in absolute value a difference in favor of the against tweets can be noticed (Figure 6). On the entire dataset, the in favor tweets have a 10% difference compared to the against tweets (Figure 6). As the difference between the in favor and against tweets is not as visible as expected (Figure 6) and a global vaccination campaign is expected, the involved agencies and governments should try to provide more information regarding the vaccination process, its advantages and its presumed disadvantages, offering the general public all the needed instruments and information for increasing their trust in the decisions taken at a macro-scale, with impact on everybody’s life.

Figure 6. Cumulative stance comparison

VII. LIMITATIONS OF THE STUDY
A potential limitation of the current study is represented by the classification algorithms selected in the paper. As the Natural Language Processing field is in continuous development, better algorithms could be developed over time, providing an improved classification performance. Another limitation is related to the selected dataset, which only includes tweets written in English and extracted between November 9, 2020 and December 8, 2020. This limitation opens the path towards other possible extensions, which might include, but are not limited to, analyzing the tweets written in other languages or from specific geographical areas, which could be identified using the associated GPS coordinates. Additionally, a different period of time could be considered due to the vaccination process dynamics worldwide.

VIII. CONCLUSIONS
In the current paper, the one-month period between the first announcement of a coronavirus vaccine and the first actual vaccination process started outside the limited clinical
trials has been analyzed using machine learning-based stance Bourgogne Franche-Comté and Pays de Montbéliard
detection. Multiple classical machine learning and deep Agglomération for the whole support offered during the time
learning algorithms have been compared and the best when this work has been carried out.
performing classifier has been chosen based on four
