MTP Report
Analysis of COVID-19 related social media posts
by
Alpesh Kaushal
17CS30003
CERTIFICATE
This is to certify that the work in the thesis entitled "Analysis of COVID-19 related
social media posts", submitted by Alpesh Kaushal (Roll Number: 17CS30003), a
dual degree student of the Department of Computer Science and Engineering,
Indian Institute of Technology Kharagpur, in partial fulfillment for the award of the
Dual Degree, was carried out by him. We hereby accord our approval of it as a study
carried out and presented in a manner required for its acceptance in partial fulfillment for
the Dual Degree for which it has been submitted. The thesis has fulfilled all the
requirements as per the regulations of the Institute and has reached the standard needed
for submission.
Supervisor
Department of Computer
Science and Engineering
Indian Institute of Technology,
Kharagpur
Place: Kharagpur
Date: 25th April 2022
ACKNOWLEDGEMENTS
I would like to thank Prof. Saptarshi Ghosh who gave me this golden opportunity to
work on this project. I got to learn a lot from this project. I would also like to express my
Alpesh Kaushal
IIT Kharagpur
Date: 25/04/2022
ABSTRACT
Authorities must stay alert to potential rises in COVID-19 cases, which have been on
the rise in several parts of the world. Forecasting COVID-19 outbreaks can help prevent
harm, give authorities time to prepare, and inform policies that address difficulties and
educate people about immunisation. According to the World Health Organization, vaccine
hesitancy was one of the top ten global health threats in 2019. Social media now plays a
significant role in the distribution of vaccine-related information, misinformation, and
disinformation. People discuss a wide variety of vaccination topics on social media, such
as whether a vaccine is necessary and what its side effects are. One platform for
expressing such views is Gab (gab.com), which prides itself on the freedom of speech it
provides. In the first part of this work, we collect vaccine-related posts from Gab and
build a classifier that assigns these posts to relevant categories. In the second part, we
use Twitter data to predict the number of future COVID-19 cases and deaths. The purpose
is to see whether social media signals from Twitter can be used to build automatic
predictors of future COVID-19 case counts. We develop classifiers that accurately
identify symptom-reporting tweets, and then investigate how the related signals
correlate with the number of COVID-19 cases. Through experiments on worldwide tweets
and tweets from India posted in 2020 and 2021, we find social media signals that
correlate well with the number of future COVID-19 cases and deaths.
Table of Contents
Abstract
1. Introduction
1.1. Vaccine hesitancy
1.2. Predicting future cases/deaths for COVID-19
2. Related Work
3. Data
3.1. Gab Data
3.1.1. Manual Analysis of Gabs
3.1.2. Labeled Latent Dirichlet Allocation
3.2. Twitter Data for cases prediction
4. Classification
4.1. Traditional Classification model
4.1.1. Vaccine hesitancy
4.1.2. For symptoms-reporting tweets
4.2. fastText Classification
4.3. BERT Transformer
4.3.1. Vaccine hesitancy
4.3.2. For symptoms-reporting tweets
5. Correlation of different classes
6. Conclusion
6.1. Summary of Work
6.2. Future Scope
6.2.1. Vaccine hesitancy
6.2.2. For symptoms-reporting tweets
7. References
1. Introduction
Nearly 227 million individuals have been affected by the ongoing COVID-19 pandemic,
which has resulted in over 4.6 million deaths worldwide (as of September 2021). To
stop the virus from spreading, strict measures were implemented in many parts of the
world. However, once restrictions were eased, several countries saw an increase in cases.
Although vaccinations began in the first quarter of 2021, the risk of a sharp
increase in COVID-19 cases persists. As a result, authorities must be able to forecast
probable increases in COVID-19 cases and deaths in order to take precautionary
measures and prepare for medical emergencies.
Understanding the reasons for vaccine hesitancy is also critical, so that
authorities can develop policies to educate people, respond to their concerns, and
act on reported side effects. The two problems are linked: an increase in vaccine
hesitancy can lead to an increase in COVID-19 cases, which in turn changes how
resources need to be distributed. Monitoring cases and tracking vaccine hesitancy are
therefore both valuable to authorities.
Previous research has shown that social media can be beneficial in gathering situational
information during emergency situations such as natural disasters and epidemics (Imran
et al. 2013; Househ 2016). In addition, social media has been used to predict future
epidemic outbreaks (Grover and Aujla 2014). People routinely write on Twitter and Gab
about their experiences with COVID-19 and how their lives have changed. They post
stories about someone (or themselves) getting sick, suffering from COVID-19 symptoms,
or why they are opposed to vaccination.
Some governments have sought, with varying degrees of success, to forecast increases in
COVID-19 cases using applications that rely on people reporting their symptoms or
diagnoses (e.g., the Aarogya Setu app in India). Since vast numbers of tweets are posted
every day, automatically gathering such insights on users' symptoms from social media
can be a more cost- and time-effective way to help predict potential rises in COVID-19
cases and in any future disease outbreaks.
In this report, we investigate various social media signals that could be used
to anticipate future COVID-19 cases. We focus on 'symptom-reporting tweets',
which provide information about someone who is experiencing COVID-19 symptoms. We
gathered Indian tweets containing symptom keywords (related to a standard set of
COVID-19 symptoms established by the WHO) over a long period, from February 2020 to
June 2021. We found that a large percentage of these tweets contain no
information about anyone actually suffering from the symptoms. We therefore built a
customised BERT-based classifier that not only identifies tweets that actually report
someone experiencing symptoms (symptom-reporting tweets), but also differentiates
between sub-categories of symptom-reporting tweets, such as primary/self-reporting
tweets, secondary-reporting tweets, and third-party reporting tweets that report
another person experiencing COVID-19 symptoms. We then show that, notably in 2021,
some of these sub-categories are strongly associated with the Indian COVID-19 case
dynamics. We also show how these signals can be used to build prediction models
for the number of future COVID-19 cases.
We developed a 4-class classifier that identifies and distinguishes among the different
types of symptom-reporting tweets; it achieves a macro F-score of
0.79 on this challenging task.
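For readers unfamiliar with the metric, the macro F-score is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that computation; the label names and toy predictions are illustrative assumptions (in particular, "non-reporting" is a hypothetical name for the fourth class) and are not data from this study.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical, for illustration only).
y_true = ["primary", "secondary", "third-party", "non-reporting", "primary", "secondary"]
y_pred = ["primary", "secondary", "non-reporting", "non-reporting", "secondary", "secondary"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class contributes equally, a classifier that ignores a small class is penalised heavily, which is why this metric suits an imbalanced multi-class task.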
We extract various signals from these symptom-reporting tweets and compare them with
COVID-19 cases and deaths in India and around the world. In a few
instances, we see strong relationships. In particular, the number of secondary-reporting
tweets posted within a given week is strongly associated with the number of COVID-19
cases and deaths that occur 1-2 weeks later. Even with simple regression
models for prediction, we obtain good results. We anticipate that the findings of this
study will aid in the development of improved models for predicting cases and deaths
from COVID-19 (and other diseases). This could help governments track future
COVID-19 waves or other disease outbreaks and prevent or mitigate them.
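One way to quantify the link between tweets posted in one week and cases reported 1-2 weeks later is a lagged Pearson correlation. The sketch below is an assumption about how such a check could be done, not the exact procedure used in this work; the function names and the weekly numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def lagged_correlation(tweets, cases, lag):
    """Correlate tweet counts in week t with case counts in week t + lag."""
    if len(tweets) != len(cases):
        raise ValueError("series must have the same length")
    if not 0 <= lag < len(tweets) - 1:
        raise ValueError("lag must leave at least two paired weeks")
    return pearson(tweets[: len(tweets) - lag], cases[lag:])


# Hypothetical weekly counts, constructed so that cases follow tweets by one week.
weekly_tweets = [10, 12, 20, 35, 50, 40, 25, 15]
weekly_cases = [100, 100, 120, 200, 350, 500, 400, 250]

for lag in (0, 1, 2):
    print(lag, round(lagged_correlation(weekly_tweets, weekly_cases, lag), 3))
```

Scanning a small range of lags and keeping the one with the highest correlation gives a rough estimate of how far ahead a signal leads the case counts.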
Vaccine Hesitancy
Hilary Piedrahita-Valdés et al. (2021) used a hybrid approach to perform an
opinion-mining analysis on 1,499,227 vaccine-related tweets published on Twitter from
1st June 2011 to 30th April 2019. Their algorithm classified 69.36% of the tweets as
neutral, 21.78% as positive, and 8.86% as negative. The percentage of neutral tweets
decreased over time, while the proportions of positive and negative tweets increased.
Jens Lemmens et al. (2022) built CoNTACT, a Dutch language model adapted to the domain
of COVID-19 tweets, and applied it to vaccine hesitancy and argumentation detection.
Steven Lloyd Wilson et al. (2020) employed a large-n cross-country regression approach to
assess the global impact of social media on vaccine hesitancy. They found a
link between social media activity by organisations and public concerns about vaccine
safety, as well as a strong link between foreign disinformation campaigns and falling
vaccination rates.
An article by Ariana Remmel (2021) notes that public confidence in the safety of COVID-19
vaccines in the United States declined after government officials paused vaccinations with the
Johnson & Johnson (J&J) shot in April 2021. During the ten-day pause, officials investigated
whether the vaccine was linked to a rare type of blood clot; they ultimately declared the
vaccine safe and gave the go-ahead to resume its use. Social media plays a large role in
spreading news of such investigations and can cause distress in other countries.
According to a report by the Centre for Countering Digital Hate (CCDH), anti-vaxxers' social
media accounts have grown their following by at least 7-8 million individuals since 2019.
Haiyan Yu et al. (2022) also report that negative-sentiment COVID-19 tweets from public
organizations attract more responses from followers and therefore spread to a larger
audience.
These findings indicate that social media posts hold considerable potential for understanding
the reasons behind vaccine hesitancy.
In this report we use a set of predefined categories (reasons) for classification:
neutral, unnecessary, mandatory, pharma, conspiracy, political, country, rushed,
ingredients, side-effect, ineffective, religious, and none. They are defined in Section
3.1.1.
This work uses Gab data, and we develop a classifier that identifies the reason why a post
is against COVID-19 vaccination.
Predicting number of Cases/Deaths
Previous works have used different indicators to estimate trends in the number of cases
and deaths due to COVID-19.
Karisani and Karisani (2020) were among the first to apply machine learning and
natural language processing (NLP) techniques to identify tweets containing information about
someone who has been infected with COVID-19. They used BERT-based models for the
classification. Shen et al. (2020) used data from Weibo (China's Twitter) to identify
posts in which people report symptoms and diagnoses, using
traditional machine learning algorithms (such as SVM and random forest). Klein et al. (2021)
also employed BERT-based models to identify tweets in which people report that
someone tested positive for COVID-19.
According to Singh et al. (2020), social media conversations are strongly correlated with
COVID-19 cases, with the United States, Italy, and China taking the lead. As a result,
social media conversations could serve as a leading indicator of COVID-19 cases.
Li et al. (2020a) found correlations between rising cases in China in early 2020 and search
trends (from Google and Baidu) and social media data (from Weibo). Similarly,
Yousefinaghani et al. (2021) used the SH-ESD algorithm to anticipate COVID-19 waves in
the United States and Canada using a search index (from Google) and tweets relating to
symptoms.
Two previous works have employed signals comparable to the ones we use: Shen et al. (2020),
who work on Weibo posts from early 2020, and Klein et al. (2021), who work on self-reports
from early to mid 2020. However, neither study investigated the various sub-categories of
symptom-reporting tweets or examined tweets from 2021. It is crucial to determine which
signals are reliable predictors of COVID-19 cases and deaths over longer time periods,
which no previous research has examined.
This work differs from prior studies that attempted to correlate social media signals with
COVID-19 cases and deaths. We design a customised BERT-based classifier that detects
people reporting COVID-19 symptoms in tweets, and we analyse trends from India over
a much longer period (spanning the first and second COVID-19 waves in India). In addition to
extra features, we test part-of-speech tagging and an additional linear layer on top of
the BERT model, which improves on the earlier model.
3. Data
Gab is well known for its tolerance of hate speech. Far-right or alt-right users who have been
banned or suspended from other services have flocked to the site. Torba (Gab's CEO) said in a
Gab post in late July 2021 that he was "being bombarded" with text messages from members
of the US military claiming that they would be court-martialed if they refused the COVID-19
vaccine. The post received 10,000 likes and shares on Facebook. Torba also posted
documents on Gab's news site that contain false information regarding the COVID-19
vaccine, claiming in an email to The New York Times that "I'm stating the truth" and that
"Your Facebook-funded 'fact checkers' like Graphika are wrong and are the ones selling
disinformation here." All of this makes Gab a well-suited platform for finding anti-vaccine
posts and understanding people's opinions about vaccination.
Gab only allows searching by hashtag or by user. A public search feature existed but
was removed a few years ago. We wrote code to extract the data using the Gab API. We
used two types of hashtags to collect data: gabs were collected for about 120 antivax
keywords and 150 provax keywords, along with all vaccine names and the names of their
manufacturers.
For training, we use about 4500 annotated tweets that were manually classified into
the categories listed in Section 3.1.1. This data is then used to train
various classifiers and observe the results.
Category      Description
neutral       The tweet does NOT indicate hesitancy towards any vaccine
unnecessary   COVID is not dangerous / Vaccine not required
mandatory     Against mandatory vaccination
pharma        Against Big Pharma
conspiracy    Deeper Conspiracy
political     Political side of vaccines
country       Country of origin
rushed        Rushed Process
ingredients   Vaccine Ingredients / technology
side-effect   Side Effects
ineffective   Vaccine is ineffective
religious     Religious Reasons
none          No specific reason stated in the tweet
Observation
Most of the gabs were classified under the side-effect category.
Gabs containing any of these terms can be categorized as side-effect:
Bell's Palsy, Bellspalsy, Blind, Deaf, Throat Paralysis, Tremors, vaccines cause autism,
blood clots, bad headaches, high fever, sore muscles, spike protein, diarrhea, bloating,
high levels of aluminum, autoimmune disease, chest pains, infertility, Myocarditis,
Pericarditis, HeartInflammation, produces toxins, cardiac arrest, Alzheimers, ALS,
Neurological Degenerative Diseases, brain thrombosis, death
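A keyword rule of this kind can be sketched as a case-insensitive phrase match. The snippet below is only an illustration: it uses a subset of the terms listed above, and the function name and matching rules (whole-phrase match, whitespace normalisation) are assumptions rather than the procedure actually used for the manual analysis.

```python
import re

# Subset of the side-effect terms listed above, lower-cased for matching.
SIDE_EFFECT_TERMS = [
    "bell's palsy",
    "bellspalsy",
    "blood clots",
    "myocarditis",
    "pericarditis",
    "infertility",
    "cardiac arrest",
]


def side_effect_terms_in(text, terms=SIDE_EFFECT_TERMS):
    """Return the terms that occur in the text as whole words or phrases."""
    normalized = re.sub(r"\s+", " ", text.lower())
    found = []
    for term in terms:
        # Lookarounds stop a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, normalized):
            found.append(term)
    return found


print(side_effect_terms_in("My uncle got Myocarditis and blood  clots after the shot"))
print(side_effect_terms_in("The vaccine was approved last week"))
```

A gab would be flagged as side-effect whenever the returned list is non-empty; such rules are cheap but miss paraphrases, which is one motivation for the learned classifiers.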
One use of Labeled LDA (L-LDA) is finding the top terms associated with each topic.
Here, L-LDA is used to find the top 5 terms associated with each category.
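To make the idea concrete, the following is a simplified collapsed Gibbs sampler for Labeled LDA, in which each topic corresponds to one category and a token may only be assigned to a label attached to its own document. It is a sketch under stated assumptions: the report does not say which implementation was used, and the function name, hyperparameters (alpha, beta, iteration count), and toy corpus are hypothetical.

```python
import random
from collections import defaultdict


def labeled_lda_top_terms(docs, labels, top_n=5, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists; labels: list of label lists, one per document."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    word_count = defaultdict(lambda: defaultdict(int))  # label -> word -> count
    label_total = defaultdict(int)                      # label -> token count
    doc_count = [defaultdict(int) for _ in docs]        # doc -> label -> count

    def update(d, label, word, delta):
        word_count[label][word] += delta
        label_total[label] += delta
        doc_count[d][label] += delta

    # Random initialisation, restricted to each document's own labels.
    assignment = []
    for d, doc in enumerate(docs):
        assignment.append([rng.choice(labels[d]) for _ in doc])
        for word, label in zip(doc, assignment[d]):
            update(d, label, word, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, assignment[d][i], word, -1)
                # Standard LDA conditional, evaluated only over this document's labels.
                weights = [
                    (doc_count[d][c] + alpha)
                    * (word_count[c][word] + beta)
                    / (label_total[c] + vocab_size * beta)
                    for c in labels[d]
                ]
                assignment[d][i] = rng.choices(labels[d], weights=weights)[0]
                update(d, assignment[d][i], word, +1)

    top_terms = {}
    for label, counts in word_count.items():
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        top_terms[label] = [word for word, count in ranked if count > 0][:top_n]
    return top_terms


# Toy corpus (hypothetical); the last document carries two labels.
docs = [
    ["clots", "fever", "clots"],
    ["clots", "headache"],
    ["rushed", "trial", "rushed"],
    ["rushed", "approval"],
    ["rushed", "clots", "trial"],
]
labels = [["side-effect"], ["side-effect"], ["rushed"], ["rushed"], ["side-effect", "rushed"]]
print(labeled_lda_top_terms(docs, labels))
```

For single-label documents every token is forced onto that label, so the sampler only has real work to do on multi-label documents, where it decides which label best explains each word.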
Observation
Model            Score
Knn              0.4795
Bagging          0.5520
Boosting         0.5602
Multinomial NB   0.3661

Model            Score
Knn              0.5399
Bagging          0.5665
Boosting         0.5594
Multinomial NB   0.5300
Observed:
accuracy 0.9431
We also examined a small number of predictions by hand to get a qualitative sense of how
good they are.
Test comment :
"johnson a vaccine is a good help but pointless if the there is shit still in the water polio for
example"
Unnecessary 0.4786
ineffective 0.4927
By applying a threshold of 0.5, we were able to reduce the noise in the predictions: only
tags whose score was greater than or equal to the threshold were kept.
Test comment :
"yeah that's not how the vaccine works there s no evidence on what it actually even does
though so who knows"
ineffective 0.8603545427322388
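The thresholding step can be sketched as a simple filter over the per-tag scores. The scores below are the ones reported for the two test comments above; the function name and the dictionary representation are assumptions made for illustration.

```python
def threshold_tags(scores, threshold=0.5):
    """Keep tags scoring at or above the threshold, highest score first."""
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


first_comment = {"unnecessary": 0.4786, "ineffective": 0.4927}
second_comment = {"ineffective": 0.8603545427322388}

print(threshold_tags(first_comment))   # []
print(threshold_tags(second_comment))  # [('ineffective', 0.8603545427322388)]
```

With a threshold of 0.5, the first comment receives no tag because both candidate scores fall just short, while the second keeps its single confident tag.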
1 i, am, me Primary
6 <URL> Third-party
Part-of-speech (POS) tagging is a task in natural language processing that involves labelling
words in context with their grammatical category, such as noun, verb,
preposition, and so on. The universal dependency treebank, a corpus of texts in many
languages annotated with syntactic trees in the dependency frame, morphological features,
and word-level part of speech tags, is the usual benchmark for this task.
We used our final trained model to classify about 800K tweets into the 4 categories
mentioned above.
BERT-Base 0.7578
BERT-Large 0.7613
CT-BERT 0.7744
Additionally, authorities can use regional data (at the city or state level) to better assess
people's needs and track any new issues that arise (for example, the rise in fungal infection
cases among COVID-19 survivors). We have only examined simple regression models so far, but
more complex models based on multiple signals can be developed. Combining these
social media signals with other real-world signals, such as public transportation usage
trends, could yield better prediction models and give authorities better regional-level
predictions.
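As an illustration of what a simple regression model on a single signal looks like, the sketch below fits an ordinary least-squares line that predicts next week's cases from this week's tweet count. The function names, the one-week lag, and the numbers are hypothetical and are not results from this work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def fit_lagged(signal, cases, lag=1):
    """Fit cases in week t + lag against the signal in week t."""
    return fit_line(signal[: len(signal) - lag], cases[lag:])


# Hypothetical weekly series: cases follow the tweet signal by one week.
weekly_tweets = [5, 9, 14, 22, 30, 26, 18]
weekly_cases = [50, 60, 92, 132, 196, 260, 228]

slope, intercept = fit_lagged(weekly_tweets, weekly_cases, lag=1)
next_week_cases = slope * weekly_tweets[-1] + intercept
print(round(slope, 2), round(intercept, 2), round(next_week_cases, 1))
```

A multi-signal model would replace the single slope with one coefficient per signal (for example, tweet counts per sub-category plus a mobility indicator), fitted in the same least-squares sense.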
7. References
Ariana Remmel. Communicating COVID Vaccine Safety Poses a Unique Challenge. (2021).
Available online at:
https://media.nature.com/original/magazine-assets/d41586-021-01257-8/d41586-021-01257-8.pdf
(accessed August 14, 2021).
Bhanot, D.; et al. 2020. Stigma and discrimination during COVID-19 pandemic. Frontiers in
public health 8: 829.
Burki, T. The online anti-vaccine movement in the age of COVID-19. Lancet Digit. Health 2,
e504–e505. https://doi.org/10.1016/S2589-7500(20)30227-2 (2020).
Devlin, J.; et al. 2018. BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.
Dong, E.; Du, H.; and Gardner, L. 2020. An interactive web-based dashboard to track
COVID-19 in real time. The Lancet infectious diseases 20(5).
Dutta, U.; et al. 2021. Analyzing Twitter Users’ Behavior Before and After Contact by the
Russia’s Internet Research Agency. Proc. CSCW .
Goran Muric et al. (2021). COVID-19 Vaccine Hesitancy on Social Media: Building a Public
Twitter Data Set of Antivaccine Content, Vaccine Misinformation, and Conspiracies.
Grover, S.; and Aujla, G. S. 2014. Prediction model for influenza epidemic based on Twitter
data. International Journal of Advanced Research in Computer and Communication
Engineering 3(7): 7541–7545.
Higgins, T. S.; et al. 2020. Correlations of online search engine trends with coronavirus
disease (COVID-19) incidence: infodemiology study. JMIR public health and surveillance
6(2).
Hilary Piedrahita-Valdés et al. (2021). Vaccine Hesitancy on Social Media: Sentiment
Analysis from June 2011 to April 2019.
Jens Lemmens et al. (2022). CoNTACT: A Dutch COVID-19 Adapted BERT for
Vaccine Hesitancy and Argumentation Detection.
Klein, A. Z.; et al. 2021. Toward using twitter for tracking covid-19: A natural language
processing pipeline and exploratory data set. JMIR 23(1): e25314.
Lazarus, J. V. et al. A global survey of potential acceptance of a COVID-19 vaccine. Nat.
Med. https://doi.org/10.1038/s41591-020-1124-9 (2020)
Singh, L.; et al. 2020. A first look at COVID-19 information and misinformation
sharing on Twitter. arXiv preprint arXiv:2003.13907 .
Steven Lloyd Wilson et al. (2020). Social media and vaccine hesitancy.