Adharsh Rajeev Seminar Report
A Seminar Report
Submitted to the APJ Abdul Kalam Technological University
in partial fulfillment of requirements for the award of degree
Bachelor of Technology
in
Electronics and Communication Engineering
by
Adharsh Rajeev
KNP18CS005
CERTIFICATE
Mr Manoj Ray D
Professor and Head
Dept. of CSE
College of Engineering
Karunagappally
DECLARATION
I, Adharsh Rajeev, hereby declare that the seminar report Coronavirus Pandemic
Analysis Through Tripartite Graph Clustering In Online Social Networks, submitted
for partial fulfillment of the requirements for the award of degree of Bachelor
of Technology of the APJ Abdul Kalam Technological University, Kerala is a bonafide
work done by me under the supervision of Mrs. Shani Raj and Mrs. Neethu Thomas.
This submission represents my ideas in my own words and where ideas or words
of others have been included, I have adequately and accurately cited and referenced
the original sources.
I also declare that I have adhered to ethics of academic honesty and integrity
and have not misrepresented or fabricated any data or idea or fact or source in my
submission. I understand that any violation of the above will be a cause for disciplinary
action by the institute and/or the University and can also evoke penal action from the
sources which have thus not been properly cited or from whom proper permission has
not been obtained. This report has not previously formed the basis for the award
of any degree, diploma or similar title of any other University.
19-01-2022
Abstract
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related
issues have been pouring into social platforms, such as Twitter. Many public
officials and governments use Twitter to make policy announcements. People keep
close track of the related information and express their concerns about the policies on
Twitter. It is beneficial yet challenging to derive important information or knowledge
out of such Twitter data. In this paper, we propose a Tripartite Graph Clustering for
Pandemic Data Analysis (TGC-PDA) framework that builds on the proposed models
and analysis: (1) tripartite graph representation, (2) non-negative matrix factorization
with regularization, and (3) sentiment analysis. We collect the tweets containing a set
of keywords related to the coronavirus pandemic as the ground truth data. Our framework
can detect the communities of Twitter users and analyze the topics that are discussed
in the communities. Extensive experiments show that our TGC-PDA framework
can effectively and efficiently identify the topics and correlations within the Twitter
data for monitoring and understanding public opinions, which would provide policy
makers with useful information and statistics for decision making.
Acknowledgement
I take this opportunity to express my deepest sense of gratitude and sincere thanks
to everyone who helped me to complete this work successfully. I express my
sincere thanks to Mr Manoj Ray D, Head of Department, Computer Science and
Engineering, College of Engineering Karunagappally for providing me with all the
necessary facilities and support.
I would like to express my sincere gratitude to Mrs. Shani Raj, Assistant Professor,
Computer Science and Engineering, College of Engineering Karunagappally for
her support and co-operation.
I would like to place on record my sincere gratitude to my seminar guide Mrs.
Neethu Thomas, Associate Professor, Computer Science and Engineering, College of
Engineering Karunagappally for the guidance and mentorship throughout the course.
Finally, I thank my family and friends, who contributed to the successful fulfilment
of this seminar work.
Adharsh Rajeev
Contents
Abstract
Acknowledgement
List of Figures
List of Tables
1 Introduction
2 Related Works
2.1 Sentiment Analysis
2.2 Non-negative Matrix Factorization
6 Experimental Result
6.1 Dataset
6.2 Experimental setup
6.3 Evaluation Metrics
6.4 Results and discussion
7 Conclusion
References
List of Figures
3. Build the user-topic bipartite by removing the tweet nodes of the tripartite graph
List of Tables
1. Notation
Chapter 1
Introduction
The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related
issues have been pouring into social platforms, such as Twitter. Many public officials
and governments use Twitter to make policy announcements. It is beneficial yet
challenging to derive important information or knowledge out of such Twitter data. In
Twitter, the data (such as users and tweets) are linked rather than being a collection
of standalone information units. Thus, it is natural to represent such linked data as
graphs. The representation of the graph can largely affect the performance of Twitter
data analysis. Meanwhile, the scale of such graphs has grown explosively, from
thousands of vertices to billions of vertices, which makes it important to find a proper
graph representation for the data. A suitable graph representation can make the entire
descriptive or predictive graph analysis process efficient and effective. For example,
multipartite graphs can be used to model networks with different objects, such as
documents and terms, movies and preferences, or buyers and sellers.
Sentiment analysis in Twitter can analyze the tweet texts to identify the opinions or
ideas that users express. Much of the literature on Twitter sentiment analysis has
focused on understanding the sentiments of individual tweets and user-level
sentiments[12–14]. Some researchers studied both tweet-level and user-level
sentiments[15, 16]. Sentiment analysis is challenging because the sentiments of users
are correlated with the sentiments expressed in many short tweets, which are
intrinsically noisy and unstable. In addition, it is difficult to understand and
characterize the dynamics of users' sentiments, as a user may hold contradictory
opinions on the same topic at different times. It is not uncommon to see people having
a lukewarm and reluctant attitude towards a product at first glance, but later being
unable to live without it.
In this paper, we propose a Tripartite Graph Clustering for Pandemic Data
Analysis (TGC-PDA) framework that builds on the proposed models and analysis:
(1) tripartite graph representation, (2) non-negative matrix factorization with
regularization, and (3) sentiment analysis.
Chapter 2
Related Works
We first discuss related work on sentiment analysis. We also discuss some related work
on non-negative matrix factorization.
relatively lower than users and re-tweeting users. Kim et al. utilized collaborative
filtering techniques to analyze the sentiments of users based on the sentiments of
similar users. The similarity of two users is evaluated by whether they have expressed
similar sentiments towards the same set of topics. This approach totally ignored the
rich information of tweets and features, as well as social relationships such as the
user-user re-tweeting relation. Instead, in this work, we propose a tri-clustering
framework to obtain the sentiment clustering of both tweets and users simultaneously.
Our approach utilizes the re-tweeting social relation and the dependencies among users,
tweets, and features, and is independent of the quality of labeled data.
Chapter 3
In this section, we discuss how we construct the tripartite graph, the notations, and the
problem formulation for pandemic analysis.
3.2 Problem definition
We denote the raw data from the Twitter platform with a 3-tuple Raw Data =
(U, T, H), where U, T, and H represent the sets of users, tweets, and topics, respectively.
Given the raw data, our target is to generate the community's attitude towards COVID-19
events via the following phases: (1) generate a tripartite graph representation from the
raw data, (2) detect the communities via graph clustering, and (3) infer sentiments for
each community. The attitude of each community will be represented as positive,
neutral, or negative. Table 1 shows the notations used in this work.
Chapter 4
4.1 Tripartite graph representation
A tripartite graph G(V, E) can be constructed from the data to represent the relationships
among U, T, and H, where V and E represent the node set and edge set, respectively. In
graph theory, a tripartite graph is complete if and only if each node in one set of nodes
is fully connected with all nodes in the adjacent set. Based on the data we obtained
from Twitter, the tripartite graph generated in our case is not complete. To represent
the graph and find a suitable clustering solution for a tripartite graph, one way is to
divide the graph into bipartite graphs. We propose to build a user-topic bipartite graph
and a user-tweet bipartite graph, as shown in Fig. 3. We use tweet-level nodes as the
bridges to build the connection between user and topic nodes. In Fig. 3, Node I has three
paths (a path is a finite sequence of edges connecting two end nodes) connecting to
Node b, which go through Node 1 and Node 2.
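The idea of using tweet nodes as bridges can be illustrated with a minimal sketch. It assumes, as one plausible reading rather than the report's stated rule, that the weight of a user-topic edge is the number of user-tweet-topic paths between the two nodes. The function name `project_user_topic` and the toy node labels are hypothetical.

```python
from collections import Counter

def project_user_topic(user_tweet_edges, tweet_topic_edges):
    """Collapse tweet nodes of a tripartite graph into a user-topic bipartite graph.

    Assumption: the weight of (user, topic) is the number of
    user -> tweet -> topic paths that connect them.
    """
    # Index the topics attached to each tweet.
    topics_of_tweet = {}
    for tweet, topic in tweet_topic_edges:
        topics_of_tweet.setdefault(tweet, set()).add(topic)

    # Every user-tweet edge contributes one path per topic of that tweet.
    weights = Counter()
    for user, tweet in user_tweet_edges:
        for topic in topics_of_tweet.get(tweet, ()):
            weights[(user, topic)] += 1
    return weights

# Toy tripartite graph (hypothetical labels, not the graph of Fig. 3).
user_tweet_edges = [("u1", "t1"), ("u1", "t2"), ("u2", "t2"), ("u2", "t3")]
tweet_topic_edges = [("t1", "h1"), ("t1", "h2"), ("t2", "h2"), ("t3", "h3")]

user_topic = project_user_topic(user_tweet_edges, tweet_topic_edges)
for (user, topic), weight in sorted(user_topic.items()):
    print(user, topic, weight)
```

Here `u1` reaches `h2` through both `t1` and `t2`, so that edge carries more weight than the others; the user-tweet bipartite graph is simply `user_tweet_edges` itself.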
it would be preferable to make tweet nodes close to their user nodes. In other words,
node locality needs to be preserved. Thus, standard NMF may not work properly. In
fact, as described later, our experimental results show that the accuracy of NMF is
poor. Hence, we introduce a graph regularization technique into NMF to smooth the
result. To cluster the user-tweet bipartite graph, we assign clusters to tweets based
on the clustering results of users. If one tweet belongs to different clusters, we use a
majority vote strategy to choose a proper placement, which will be the clustering result
for the user-tweet bipartite graph.
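The majority vote step can be sketched as follows. The data layout (a dictionary from user to cluster label and a dictionary from tweet to the users linked to it) and the tie-breaking rule (smallest cluster label wins) are assumptions made for illustration; the report does not specify either.

```python
from collections import Counter

def assign_tweet_clusters(user_cluster, tweet_users):
    """Give each tweet the cluster held by the majority of its linked users."""
    tweet_cluster = {}
    for tweet, users in tweet_users.items():
        votes = Counter(user_cluster[u] for u in users if u in user_cluster)
        if not votes:
            continue  # no clustered user is linked to this tweet
        top = max(votes.values())
        # Assumed tie-break: pick the smallest label among the most voted clusters.
        tweet_cluster[tweet] = min(c for c, n in votes.items() if n == top)
    return tweet_cluster

# Hypothetical clustering result for four users and three tweets.
user_cluster = {"u1": 0, "u2": 0, "u3": 1, "u4": 2}
tweet_users = {"t1": ["u1", "u2", "u3"], "t2": ["u3", "u4"], "t3": ["u4"]}

tweet_cluster = assign_tweet_clusters(user_cluster, tweet_users)
print(tweet_cluster)
```

Tweet `t1` is linked to two users in cluster 0 and one in cluster 1, so it is placed in cluster 0; `t2` is a tie and falls back to the assumed tie-break.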
4.3 Sentiment Analysis
As our goal is to extract open source intelligence from each community, we aggregate
the tweets based on their cluster labels. Then, we run the sentiment analysis with
a mini-batch algorithm when running the full-batch algorithm is intractable. We
use a sentiment analysis library, such as TextBlob, to provide a quantitative result
for the polarity in one cluster. TextBlob is one of the commonly used libraries for
processing textual data[42]. It provides APIs to handle natural language processing
tasks, including text cleaning and sentiment analysis. To get the polarity of a cluster,
we measure the percentage of positive, neutral, and negative tweets in that cluster. This
way, we can figure out the overall attitudes of the users in one cluster towards
COVID-19 related events.
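The per-cluster polarity computation can be sketched as below. The sketch takes polarity scores as input, so it runs without any library; with TextBlob, such a score would come from `TextBlob(text).sentiment.polarity`, a float in [-1, 1]. The cut-off `threshold` that separates neutral from positive and negative tweets is an assumption, as are the function name and the toy scores.

```python
def polarity_ratios(scores, threshold=0.0):
    """Return the percentage of positive, neutral, and negative tweets.

    scores: polarity values in [-1, 1], one per tweet in the cluster.
    threshold: assumed cut-off; |score| <= threshold counts as neutral.
    """
    n = len(scores)
    if n == 0:
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    positive = sum(1 for s in scores if s > threshold)
    negative = sum(1 for s in scores if s < -threshold)
    neutral = n - positive - negative
    return {
        "positive": 100.0 * positive / n,
        "neutral": 100.0 * neutral / n,
        "negative": 100.0 * negative / n,
    }

# Hypothetical polarity scores, grouped by cluster label.
cluster_scores = {0: [0.5, 0.0, -0.2, 0.0], 1: [0.0, 0.0, 0.8]}
ratios = {label: polarity_ratios(s) for label, s in cluster_scores.items()}
for label, ratio in ratios.items():
    print(label, ratio)
```

Working on precomputed scores also fits the mini-batch setting mentioned above: scores can be produced batch by batch and the counts accumulated before the ratios are taken.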
Chapter 5
Algorithm 1 shows the pseudocode for the proposed NMFR Updating (NMFRU)
algorithm, which can cluster the graph in phase two of the TGC-PDA framework. The
basic idea of the proposed NMFRU is to fix some factors and update one parameter
at a time. Here, we start with the parameter that appears least frequently in the loss
function and iteratively update the matrices.
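Since Algorithm 1 and its loss function are not reproduced here, the "fix some factors and update one at a time" idea is illustrated with a stand-in: the standard two-factor graph-regularized NMF with multiplicative updates, which minimizes ||X - U V^T||_F^2 + lam * tr(V^T L V) with L = D - W. This is not the NMFRU algorithm itself; the tri-factorization and the two hyperparameters alpha and beta are not modeled, and the function name, the single weight `lam`, and the toy matrices are assumptions.

```python
import numpy as np

def graph_regularized_nmf(X, W, k, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Stand-in sketch: two-factor NMF with a graph smoothness penalty.

    X: (m, n) non-negative data matrix whose columns are the items to cluster.
    W: (n, n) symmetric non-negative affinity matrix between those items.
    Returns U (m, k), V (n, k), and the loss after each iteration.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    losses = []
    for _ in range(n_iter):
        # Fix V, update U.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # Fix U, update V; the W and D terms pull linked items together.
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        residual = np.linalg.norm(X - U @ V.T) ** 2
        losses.append(residual + lam * np.trace(V.T @ L @ V))
    return U, V, losses

# Toy data: 4 topics (rows) by 6 users (columns), two groups of linked users.
X = np.array([[2, 1, 1, 0, 0, 0],
              [1, 2, 1, 0, 0, 0],
              [0, 0, 0, 1, 2, 1],
              [0, 0, 0, 1, 1, 2]], dtype=float)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

U, V, losses = graph_regularized_nmf(X, W, k=2, lam=0.5)
labels = V.argmax(axis=1)  # cluster of each user = largest entry in its row of V
print(labels, losses[0], losses[-1])
```

Each update touches one factor while the other is held fixed, which keeps both factors non-negative and lets the loss be tracked per iteration for a convergence plot.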
Chapter 6
Experimental Result
6.1 Dataset
We evaluate the performance of TGC-PDA with a real Twitter dataset about "Covid-19"
collected between Feb. 15th, 2020 and Sep. 30th, 2020. To get the tweet data, we wrote
a Python program to crawl the tweets and the users who liked them. Multiple hashtag
keywords, such as COVID19, coronavirus, covid, covid pandemic, and COVID20, are
used to ensure we can get a large dataset. Since the free Twitter API we use has rate
limits and restricts the number of retrieved tweets during each login access, we had
to crawl the data for several months. After removing the duplicate and non-English
posts, we obtain 18 327 tweets, with 752 649 users who interacted with the tweets.
Some users only interacted with one tweet in our dataset; they are identified as "less
interactive" users and excluded. After the data cleanup, we have 301 982 users left.
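The cleanup described above can be sketched on (user, tweet) interaction pairs. The sketch only covers de-duplication and the removal of "less interactive" users; language filtering is left out, and the function name, the `min_tweets` parameter, and the toy records are assumptions.

```python
from collections import Counter

def clean_interactions(interactions, min_tweets=2):
    """Drop duplicate pairs, then drop users seen with fewer than min_tweets tweets.

    interactions: iterable of (user_id, tweet_id) pairs.
    """
    unique = set(interactions)                      # remove duplicate records
    per_user = Counter(user for user, _ in unique)  # distinct tweets per user
    return sorted((u, t) for u, t in unique if per_user[u] >= min_tweets)

# Hypothetical crawl result: u2 interacted with a single tweet only.
raw = [("u1", "t1"), ("u1", "t2"), ("u1", "t1"),
       ("u2", "t1"), ("u3", "t2"), ("u3", "t3")]

kept = clean_interactions(raw)
print(kept)
```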
6.2 Experimental setup
As all the clustering methods (i.e., Kmeans, NMF, SNMF, ONMTF, and our NMFRU)
have one or more parameters to be tuned, to make the comparison fair, we run these
methods under different parameters and choose the best result for each algorithm. In
NMFR, we have two hyperparameters, alpha and beta. To find a proper value for these
parameters, we plot a loss-value curve, with values ranging from 0.1 to 1000. Then, the
alpha and beta values can be found by scanning the plot. Since our data size is relatively
large and cannot be completely labeled manually, we randomly choose 5 [...] and use the
result tested on the sample data as the framework result. Our online framework achieves
a good tradeoff between the above two extremes and is able to study the evolution of
sentiments. Before we present the online framework, we first introduce the following
two observations: (1) The frequency distribution of vocabularies changes over time;
however, the sentiments of vocabularies do not change or change slowly over time.
(2) Considering the entire population, the majority of users rarely change their mind
within a short time.
6.3 Evaluation Metrics
To evaluate the clustering result, we use widely used standard metrics, including
clustering accuracy, cluster purity, and Normalized Mutual Information (NMI).
Clustering accuracy is defined as follows:
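In its usual form, which is assumed here rather than taken from the report, clustering accuracy is the fraction of samples whose cluster matches their ground-truth class under the best one-to-one mapping of clusters to classes. The sketch below computes it by brute force over mappings (practical only for a handful of clusters), together with purity and NMI. Normalizing NMI by the geometric mean of the two entropies is one of several conventions and is also an assumption.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one cluster-to-class mapping (brute force)."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with dummy targets so every cluster can receive a distinct class.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    pair_counts = Counter(zip(y_pred, y_true))
    best = 0
    for perm in permutations(targets, len(clusters)):
        hits = sum(pair_counts[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)

def purity(y_true, y_pred):
    """Each cluster votes for its most frequent class; count the agreeing samples."""
    total = 0
    for c in set(y_pred):
        members = [t for t, p in zip(y_true, y_pred) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(y_true)

def nmi(y_true, y_pred):
    """Mutual information divided by the geometric mean of the two entropies."""
    n = len(y_true)
    ct, cp = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(nij / n * math.log(n * nij / (ct[t] * cp[p]))
             for (t, p), nij in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in ct.values())
    h_pred = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)

# Hypothetical labels: cluster ids are arbitrary, so 1 corresponds to class 0 here.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]
print(clustering_accuracy(y_true, y_pred), purity(y_true, y_pred), nmi(y_true, y_pred))
```

Accuracy and purity coincide on this toy example, but purity can be inflated by using many small clusters, which is why it is usually reported alongside accuracy and NMI.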
6.4 Results and discussion
Table 2 shows the comparison between NMFRU and several baseline models, such
as Kmeans, NMF, SNMF, and ONMTF. When applying these baseline models to our
data, we do not embed the topic nodes into the user nodes. Instead, we use the
user-tweet bipartite graph to calculate the clustering result. In the matrix form of the
bipartite graph, the columns and rows correspond to the two sets of vertices, with each
entry corresponding to an edge between a column and a row. From Table 2, we can
see that NMFRU achieves the best performance in terms of accuracy, purity, and NMI.
This is because our bipartite graph is created based on our tripartite graph model,
and it embeds more information than the plain bipartite graph. We also utilize
the tri-factorization and locality-preserving schemes, which can further improve the
performance.
We study the average convergence time of our framework in Fig. 4. When the
number of iterations is around 23, our framework tends to converge with a total loss of
2, which shows that the calculation of NMFRU is fast. Meanwhile, when comparing
the convergence time of the different baseline methods in Fig. 5, we can see that
NMFRU is slower than Kmeans but faster than the other baseline clustering methods.
This is because we do fewer matrix multiplication operations in NMFRU, hence saving
some running time. Therefore, TGC-PDA, which utilizes NMFRU as the core clustering
algorithm, can be used for a large dataset. As for the polarity of the communities, Table
3 shows the ten largest communities with their polarity ratios. From Table 3, we find
that the neutral ratio is quite high among all topics.
Chapter 7
Conclusion
The outbreak of COVID-19 has thrown the whole world into chaos. People often search for
real-time news and vent their emotions through the Internet. OSNs are widely
used for opinion sharing, news publishing, and information spreading. The large
amount of useful data from OSNs can be leveraged to help public officials and
governments make better decisions. In this paper, we build the TGC-PDA framework to
utilize Twitter data to monitor and automatically collect the voice of the people during
the COVID-19 pandemic. The TGC-PDA framework takes advantage of the
characteristics of the network structure of Twitter users and tweets to effectively
analyze the community structures and sentiments. It enables us to extract the open
source intelligence from each community, which could be utilized to track people's
feedback and opinions towards the coronavirus pandemic events. Our work is currently
a pioneering effort and focuses only on English-language tweets. It would be feasible to
extend our work to handle tweets in other languages. Similar techniques can be applied
to other online and publicly available social media platforms, such as Reddit. Since a
tweet may contain not only text, but also embedded hyperlinks, images, or even videos,
it would be interesting and challenging to explore more information from them.
Moreover, since some events during COVID-19 are time-sensitive, it would also be
interesting to study the tweets from the perspective of time-series analysis.
References
[8] L. J. Chang, W. Li, L. Qin, W. J. Zhang, and S. Y. Yang, pSCAN: Fast and
exact structural graph clustering, IEEE Trans. Knowl. Data Eng., vol. 29, no. 2,
pp. 387–401, 2017.
[9] R. El Bacha and T. T. Zin, Ranking of influential users
based on user-tweet bipartite graph, in Proc. of 2018 IEEE Int. Conf. Service
Operations and Logistics, and Informatics (SOLI), Singapore, 2018, pp. 97–101.
[10] A. Rodríguez, C. Argueta, and Y. L. Chen, Automatic detection of hate
speech on facebook using sentiment and emotion analysis, in Proc. of 2019
Int. Conf. Artificial Intelligence in Information and Communication (ICAIIC),
Okinawa, Japan, 2019, pp. 169–174.
[11] J. Zhou and C. Kwan, Missing link prediction in social networks, in Proc.
15th Int. Symp. Neural Networks, Minsk, Belarus, 2018, pp. 346–354.