Adharsh Rajeev Seminar Report


Coronavirus Pandemic Analysis Through Tripartite

Graph Clustering in Online Social Networks

A Seminar Report
Submitted to the APJ Abdul Kalam Technological University
in partial fulfillment of requirements for the award of degree

Bachelor of Technology
in
Computer Science and Engineering
by
Adharsh Rajeev
KNP18CS005

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


COLLEGE OF ENGINEERING KARUNAGAPPALLY
KERALA
January 2022
DEPT. OF COMPUTER SCIENCE AND ENGINEERING
COLLEGE OF ENGINEERING KARUNAGAPPALLY
2021-22

CERTIFICATE

This is to certify that the report entitled Coronavirus Pandemic Analysis
Through Tripartite Graph Clustering in Online Social Networks, submitted by
Adharsh Rajeev (KNP18CS005) to the APJ Abdul Kalam Technological University
in partial fulfillment of the B.Tech. degree in Computer Science and Engineering, is
a bonafide record of the seminar work carried out by him under our guidance and
supervision. This report has not been submitted in any form to any other University or
Institute for any purpose.

Mrs. Shani Raj
(Seminar Coordinator)
Assistant Professor
Dept. of CSE
College of Engineering
Karunagappally

Mrs. Neethu Thomas
(Seminar Guide)
Associate Professor
Dept. of CSE
College of Engineering
Karunagappally

Mr Manoj Ray D
Professor and Head
Dept.of CSE
College of Engineering
Karunagappally
DECLARATION

I, Adharsh Rajeev, hereby declare that the seminar report Coronavirus Pandemic
Analysis Through Tripartite Graph Clustering in Online Social Networks, submitted
in partial fulfillment of the requirements for the award of the degree of Bachelor
of Technology of the APJ Abdul Kalam Technological University, Kerala, is a bonafide
work done by me under the supervision of Mrs. Shani Raj and Mrs. Neethu Thomas.
This submission represents my ideas in my own words, and where ideas or words
of others have been included, I have adequately and accurately cited and referenced
the original sources.
I also declare that I have adhered to the ethics of academic honesty and integrity
and have not misrepresented or fabricated any data, idea, fact, or source in my
submission. I understand that any violation of the above will be a cause for disciplinary
action by the institute and/or the University and can also evoke penal action from the
sources which have thus not been properly cited or from whom proper permission has
not been obtained. This report has not previously formed the basis for the award
of any degree, diploma, or similar title of any other University.

Karunagappally Adharsh Rajeev

19-01-2022
Abstract

The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related
issues have been pouring into social platforms such as Twitter. Many public
officials and governments use Twitter to make policy announcements. People keep
close track of the related information and express their concerns about the policies on
Twitter. It is beneficial yet challenging to derive important information or knowledge
from such Twitter data. In this paper, we propose a Tripartite Graph Clustering for
Pandemic Data Analysis (TGC-PDA) framework that builds on the proposed models
and analysis: (1) tripartite graph representation, (2) non-negative matrix factorization
with regularization, and (3) sentiment analysis. We collect tweets containing a set
of keywords related to the coronavirus pandemic as the ground truth data. Our framework
can detect the communities of Twitter users and analyze the topics that are discussed
in those communities. Extensive experiments show that our TGC-PDA framework
can effectively and efficiently identify the topics and correlations within the Twitter
data for monitoring and understanding public opinion, which would provide policy
makers with useful information and statistics for decision making.

Acknowledgement

I take this opportunity to express my deepest sense of gratitude and sincere thanks
to everyone who helped me to complete this work successfully. I express my
sincere thanks to Mr Manoj Ray D, Head of Department, Computer Science and
Engineering, College of Engineering Karunagappally, for providing me with all the
necessary facilities and support.
I would like to express my sincere gratitude to Mrs. Shani Raj, Assistant Professor,
Computer Science and Engineering, College of Engineering Karunagappally, for
her support and co-operation.
I would like to place on record my sincere gratitude to my seminar guide Mrs.
Neethu Thomas, Associate Professor, Computer Science and Engineering, College of
Engineering Karunagappally, for the guidance and mentorship throughout the course.
Finally, I thank my family and friends, who contributed to the successful fulfilment
of this seminar work.

Adharsh Rajeev

Contents

Abstract

Acknowledgement

List of Figures

List of Tables

1 Introduction

2 Related Works
2.1 Sentiment Analysis
2.2 Non-negative Matrix Factorization

3 Pandemic Analysis Through Twitter Data
3.1 Tripartite graph in Twitter
3.2 Problem definition

4 Pandemic Data Analysis Framework
4.1 Tripartite graph representation
4.2 Non-negative Matrix Factorization with Regularization (NMFR)
4.3 Sentiment Analysis

5 NMFR Updating Algorithm

6 Experimental Result
6.1 Dataset
6.2 Experimental setup
6.3 Evaluation Metrics
6.4 Results and discussion

7 Conclusion

References

List of Figures

1. An example of tripartite graph in Twitter

2. An overview of the TGC-PDA framework

3. Build the user-topic bipartite by removing the tweet nodes of the tripartite graph

4. Total loss with different numbers of iterations

5. Convergence time of methods.

List of Tables

1. Notation

2. Performance results of classifiers

3. Largest ten communities with its polarity ratio

Chapter 1

Introduction

The COVID-19 pandemic has hit the world hard. Reactions to pandemic-related
issues have been pouring into social platforms such as Twitter. Many public officials
and governments use Twitter to make policy announcements. It is beneficial yet
challenging to derive important information or knowledge from such Twitter data. In
Twitter, the data (such as users and tweets) are linked rather than being a collection of
standalone information units. Thus, it is natural to represent such linked data as graphs.
The choice of graph representation can largely affect the performance of Twitter data
analysis. Meanwhile, the scale of the graphs has grown explosively, from thousands of
vertices to billions of vertices, which makes it important to find a proper graph
representation for the data. A suitable graph representation can make the entire descriptive
or predictive graph analysis process efficient and effective. For example, multipartite graphs
can be used to model networks with different types of objects, such as documents and terms,
movies and preferences, or buyers and sellers.
Sentiment analysis in Twitter analyzes tweet texts to identify the opinions or
ideas that users express. Much of the literature on Twitter sentiment analysis has focused on
understanding the sentiments of individual tweets and user-level sentiments [12–14].
Some researchers have studied both tweet-level and user-level sentiments [15, 16]. Sentiment
analysis is challenging because the sentiments of users are correlated with the
sentiments expressed in many short tweets, which are intrinsically noisy and unstable. In
addition, it is difficult to understand and characterize the dynamics of users' sentiments,
as a user may hold contradictory opinions on the same topic at different times. It is not
uncommon to see people having a lukewarm and reluctant attitude towards a product
at first glance, but later being unable to live without it.
In this paper, we propose a Tripartite Graph Clustering for Pandemic Data
Analysis (TGC-PDA) framework that builds on the proposed models and analysis:
(1) tripartite graph representation, (2) non-negative matrix factorization with
regularization, and (3) sentiment analysis.
Chapter 2

Related Works

We first discuss related work on sentiment analysis. We then discuss related work
on non-negative matrix factorization.

2.1 Sentiment Analysis


We summarize a set of representative (but by no means exhaustive) approaches to
sentiment analysis in Table 6, where we group existing approaches along three
directions. First, we consider whether a method aims to identify positive or negative
sentiment in a piece of text (tweet-level analysis) or to determine the sentiments
of users (user-level analysis). A large amount of research in the area of sentiment
analysis has focused on classifying text polarity. Smith et al. and Deng et al. analyzed
the sentiments of users by aggregating the sentiments of their tweets. Tan et al. directly
analyzed the sentiments of users using a semi-supervised approach. Specifically, a
semi-supervised label propagation algorithm is used to determine the sentiment
of a user from the sentiments of his/her tweets and the sentiments of his/her immediate
neighbor users in a heterogeneous graph built upon social relations. However, when there
are insufficient labeled nodes, or the labeled nodes are concentrated in a small region
of the graph, the performance of this approach is not encouraging. Another
issue, pointed out by Smith et al., is that the emotion correlation between users
and the users they follow or @mention (the relations used to build the heterogeneous
graph) is lower than that between users and the users they re-tweet. Kim et al. used
collaborative filtering techniques to analyze the sentiments of users based on the
sentiments of similar users. The similarity of two users is evaluated by whether they have
expressed similar sentiments towards the same set of topics. This approach ignores the
rich information in tweets and features, as well as social relationships such as the user-user
re-tweeting relation. Instead, in this work, we propose a tri-clustering framework to
obtain the sentiment clustering of both tweets and users simultaneously. Our approach
utilizes the re-tweeting social relation and the dependencies among users, tweets, and
features, and is independent of the quality of labeled data.

2.2 Non-negative Matrix Factorization


Due to its wide application in areas such as text mining, pattern recognition,
machine learning, and bioinformatics, non-negative matrix factorization (NMF) has
attracted much interest from researchers. Generally, non-negative matrix factorization
aims to factor a matrix X into two or three lower-dimension matrices while minimizing
the squared error or divergence between X and the approximation of X built from those
lower-dimension matrices. Several algorithms have been proposed to find a
sub-optimal solution for those lower-dimension matrices. For instance, Lee and Seung
proposed two different multiplicative algorithms to update the matrices. Other more
recent approaches include using projected gradient descent methods [20], the
active-set method, and block principal pivoting [16] to update the matrices. For the case
where one of the factors (lower-dimension matrices) satisfies the separability condition,
Arora et al. proposed a polynomial-time algorithm to find the exact NMF solution.
In addition to developing efficient algorithms to find the lower-dimension matrices,
current research on NMF also focuses on applications of NMF to different domains,
such as link prediction in social networks.
Chapter 3

Pandemic Analysis Through Twitter Data

In this section, we discuss how we construct the tripartite graph, the notations, and the
problem formulation for pandemic analysis.

3.1 Tripartite graph in Twitter


The important information in a tweet includes: (1) user, (2) tweet text, and (3)
hashtag/keyword. The relationships among them are straightforward: a user can like,
comment on, or post a tweet, and a tweet might have some topics/hashtags/keywords.
In other words, users perform actions (e.g., like/comment) on tweets, while
each tweet is associated with certain topics/hashtags/keywords. As users do not
directly perform actions on the topics/keywords/hashtags, we can abstract the
relationships among users, tweet contents, and topics/keywords/hashtags into a tripartite
graph. Fig. 1 shows an example of modeling Twitter data as a tripartite
graph. The tripartite graph is composed of three types of nodes: user nodes,
tweet nodes, and topic nodes. In the tripartite graph, the user nodes only
connect with the tweet nodes, while the tweet nodes only connect with the topic
nodes. In Fig. 1, the solid lines with a red heart icon and a message icon represent the
like and comment relationships between users and tweets, respectively, while the
lines without an icon represent the containing relationship between tweets and topics.
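
The structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_tripartite`, the `(user, tweet, action)` tuple layout, and the sample data are assumptions made for the example and do not come from the report.

```python
from collections import defaultdict

def build_tripartite(interactions, tweet_topics):
    """Build adjacency maps for a user-tweet-topic tripartite graph.

    interactions: iterable of (user, tweet, action), action is "like" or "comment"
    tweet_topics: dict mapping tweet -> iterable of topics/hashtags/keywords
    Users link only to tweets, and tweets link only to topics.
    """
    user_tweet = defaultdict(list)   # user -> [(tweet, action), ...]
    tweet_topic = defaultdict(set)   # tweet -> {topic, ...}
    for user, tweet, action in interactions:
        user_tweet[user].append((tweet, action))
    for tweet, topics in tweet_topics.items():
        tweet_topic[tweet].update(topics)
    return dict(user_tweet), dict(tweet_topic)

# Hypothetical sample data, for illustration only.
interactions = [("u1", "t1", "like"), ("u1", "t1", "comment"), ("u2", "t2", "like")]
tweet_topics = {"t1": ["#covid19"], "t2": ["#covid19", "#lockdown"]}
user_tweet, tweet_topic = build_tripartite(interactions, tweet_topics)
```

Keeping the two edge sets in separate maps mirrors the constraint that there are no direct user-topic edges.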

3.2 Problem definition
We denote the raw data from the Twitter platform with a 3-tuple Raw Data =
(U, T, H), where U, T, and H represent the sets of users, tweets, and topics, respectively.
Given the raw data, our target is to derive each community's attitude towards COVID-19
events via the following phases: (1) generate a tripartite graph representation from the
raw data, (2) detect the communities via graph clustering, and (3) infer sentiments for
each community. The attitude of each community is represented as positive,
neutral, or negative. Table 1 shows the notations used in this work.
Chapter 4

Pandemic Data Analysis Framework

In this section, we propose the TGC-PDA framework to automatically collect, cluster,
and infer the sentiments from the observed tweets. Figure 2 shows an overview of
the TGC-PDA framework. The input of the framework is raw Twitter data. TGC-PDA
consists of three main steps: (1) tripartite graph representation, (2) clustering, and (3)
sentiment analysis. In the tripartite graph representation step, we find a mathematical
model that represents the data with little information loss. Then, the clustering step uses
a clustering algorithm based on matrix factorization to find the communities in the
graph. Finally, the sentiment analysis step extracts attitudes within the communities.
4.1 Tripartite graph representation
A tripartite graph G(V, E) can be constructed from the data to represent the relationships
among U, T, and H, where V and E represent the node set and edge set, respectively. In
graph theory, a tripartite graph is complete if and only if each node in one set of nodes
is connected with all nodes in the adjacent set. Based on the data we obtained
from Twitter, the tripartite graph generated in our case is not complete. To represent
the graph and find a suitable clustering solution for a tripartite graph, one way is to
divide the graph into bipartite graphs. We propose to build a user-topic bipartite graph and
a user-tweet bipartite graph, as shown in Fig. 3. We use tweet-level nodes as the bridges
to build the connection between user and topic nodes. In Fig. 3, Node I has three paths
(a path is a finite sequence of edges connecting two end nodes) connecting to Node b,
which go through Node 1 and Node 2.
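
One way to read the bridging step is to weight each user-topic edge by the number of user-tweet-topic paths between the two nodes. The sketch below follows that reading; the weighting scheme, the function name `project_user_topic`, and the edge lists are assumptions for illustration and are not taken from Fig. 3.

```python
from collections import Counter, defaultdict

def project_user_topic(user_tweet_edges, tweet_topic_edges):
    """Count user -> tweet -> topic paths for every (user, topic) pair.

    A repeated (user, tweet) edge (e.g. one like and one comment on the same
    tweet) is counted once per occurrence, so it contributes several paths.
    """
    topics_of = defaultdict(list)
    for tweet, topic in tweet_topic_edges:
        topics_of[tweet].append(topic)
    weights = Counter()
    for user, tweet in user_tweet_edges:
        for topic in topics_of[tweet]:
            weights[(user, topic)] += 1
    return weights

# Hypothetical edges: user I both likes and comments on tweet 1, and likes tweet 2.
user_tweet_edges = [("I", 1), ("I", 1), ("I", 2), ("II", 2)]
tweet_topic_edges = [(1, "b"), (2, "b"), (2, "c")]
w = project_user_topic(user_tweet_edges, tweet_topic_edges)
# With these edges, user I reaches topic b by three paths, through tweets 1 and 2.
```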

4.2 Non-negative Matrix Factorization with Regularization (NMFR)
In the second step of the framework, we need to find the clustering result of the input
graph data. Since the matrix representation of the graph is a non-negative matrix, it is
straightforward to use NMF for clustering. For the user-topic bipartite
graph generated in Section 4.1, we first find an intermediate clustering result of
the graph by applying the clustering algorithm. Then, we feed the intermediate
clustering result into the clustering process of the user-tweet bipartite graph, and the
clusters can be found accordingly. Because users tend to have consistent preferences,
it would be preferable to keep tweet nodes close to their user nodes. In other words,
node locality needs to be preserved. Thus, standard NMF may not work properly. In
fact, as described later, our experimental results show that the accuracy of NMF is
poor. Hence, we introduce graph regularization into NMF to smooth the
result. To cluster the user-tweet bipartite graph, we assign clusters to tweets based
on the clustering results of users. If one tweet belongs to different clusters, we use a
majority vote strategy to choose a proper placement, which gives the clustering result
for the user-tweet bipartite graph.
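
The majority vote step can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `assign_tweet_clusters`, the input layout, and the tie-breaking behaviour (the first cluster seen wins a tie) are choices made for the example, since the report does not specify them.

```python
from collections import Counter

def assign_tweet_clusters(user_cluster, user_tweet_edges):
    """Give each tweet the cluster held by most of the users linked to it.

    user_cluster: dict mapping user -> cluster label (from the user clustering)
    user_tweet_edges: iterable of (user, tweet) pairs
    Ties are broken in favour of the cluster encountered first (an assumption).
    """
    votes = {}
    for user, tweet in user_tweet_edges:
        if user in user_cluster:
            votes.setdefault(tweet, Counter())[user_cluster[user]] += 1
    return {tweet: c.most_common(1)[0][0] for tweet, c in votes.items()}

# Hypothetical clustering of three users and their interactions.
user_cluster = {"u1": 0, "u2": 0, "u3": 1}
edges = [("u1", "t1"), ("u2", "t1"), ("u3", "t1"), ("u3", "t2")]
tweet_cluster = assign_tweet_clusters(user_cluster, edges)
```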

4.3 Sentiment Analysis
As our goal is to extract open source intelligence from each community, we aggregate
the tweets based on their cluster labels. Then, we run sentiment analysis, using
a mini-batch algorithm when running the full-batch algorithm is intractable. We
use a sentiment analysis library, such as TextBlob, to provide a quantitative result
for the polarity in one cluster. TextBlob is one of the commonly used libraries for
processing textual data [42]. It provides APIs to handle natural language processing
tasks, including text cleaning and sentiment analysis. To get the polarity of a cluster,
we measure the percentage of positive, neutral, and negative tweets in that cluster. In this
way, we can determine the overall attitude of the users in one cluster towards the COVID-19
related events.
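
The per-cluster polarity ratio can be sketched as below. The function takes precomputed polarity scores so that it stays self-contained; with TextBlob, such scores could be obtained from `TextBlob(text).sentiment.polarity`, which returns a float in [-1, 1]. The thresholds used here (above zero is positive, below zero is negative, exactly zero is neutral) and the name `polarity_ratio` are assumptions for illustration, not settings reported in the source.

```python
def polarity_ratio(scores):
    """Return the share of positive, neutral, and negative tweets in a cluster.

    scores: list of polarity scores in [-1, 1], one per tweet.
    Assumed thresholds: > 0 positive, < 0 negative, == 0 neutral.
    """
    n = len(scores)
    if n == 0:
        return {"positive": 0.0, "neutral": 0.0, "negative": 0.0}
    pos = sum(1 for s in scores if s > 0)
    neg = sum(1 for s in scores if s < 0)
    return {
        "positive": pos / n,
        "neutral": (n - pos - neg) / n,
        "negative": neg / n,
    }

# Hypothetical scores for the four tweets of one cluster.
ratio = polarity_ratio([0.5, 0.0, 0.0, -0.25])
```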

Chapter 5

NMFR Updating Algorithm

Algorithm 1 shows the pseudocode for the proposed NMFR Updating (NMFRU)
algorithm, which clusters the graph in phase two of the TGC-PDA framework. The
basic idea of the proposed NMFRU is to fix all other factors and update one parameter
at a time. We start with the parameter that appears least frequently in the loss
function and iteratively update the matrices.
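
Since Algorithm 1 and its loss function are not reproduced in this text, the sketch below does not implement NMFRU itself. It substitutes the standard graph-regularized NMF with multiplicative updates (in the style of Cai et al.) purely to illustrate the "fix the other factors, update one at a time" idea. The objective, the function name `gnmf`, the parameter `lam`, and the toy data are all assumptions.

```python
import numpy as np

def gnmf(X, W, k, lam=1.0, n_iter=100, eps=1e-9, seed=0):
    """Graph-regularized NMF, used here as a stand-in for NMFRU.

    Minimizes ||X - U V^T||_F^2 + lam * tr(V^T L V), with L = D - W.
    X: (m, n) non-negative matrix; W: (n, n) symmetric non-negative affinity
    between the n columns of X; k: number of clusters.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    D = np.diag(W.sum(axis=1))
    L = D - W
    losses = []
    for _ in range(n_iter):
        # Fix V, update U.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # Fix U, update V; the W and D terms keep linked columns close together.
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        losses.append(np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V))
    return U, V, losses

# Toy example: columns 0-1 and columns 2-3 form two linked groups.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
U, V, losses = gnmf(X, W, k=2)
labels = V.argmax(axis=1)  # cluster label per column of X
```

Multiplicative updates keep both factors non-negative without a projection step, which is why they are a common choice for this family of methods.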

Chapter 6

Experimental Result

In this chapter, we analyze our experimental results and the performance of TGC-PDA.
We also compare NMFRU with well-known clustering methods, such as Kmeans and NMF,
and commonly used variants, including Semi-NMF (SNMF) and Orthogonal
NMTF (ONMTF).

6.1 Dataset
We evaluate the performance of TGC-PDA with a real Twitter dataset about “Covid-19”
collected between Feb. 15th, 2020 and Sep. 30th, 2020. To get the tweet data, we wrote
a Python program to crawl the tweets and the users who liked them. Multiple hashtag
keywords, such as COVID19, coronavirus, covid, covid pandemic, and COVID20, are
used to ensure that we obtain a large dataset. Since the free Twitter API we use has rate
limits and restricts the number of tweets retrieved during each login access, we had
to crawl the data over several months. After removing duplicate and non-English
posts, we obtain 18 327 tweets, with 752 649 users who interacted with the tweets.
Some users interacted with only one tweet in our dataset; these are identified as “less
interactive” users and excluded. After the data cleanup, we have 301 982 users left.
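
The cleanup rule for “less interactive” users can be sketched as follows. The crawler and its data format are not shown in the report, so the `(user, tweet)` pair layout, the function name `drop_less_interactive`, and the `min_tweets` parameter are assumptions for illustration.

```python
def drop_less_interactive(interactions, min_tweets=2):
    """Keep only interactions from users linked to at least min_tweets distinct tweets.

    interactions: list of (user, tweet) pairs; repeated actions on the same
    tweet count as a single tweet for that user.
    """
    tweets_per_user = {}
    for user, tweet in interactions:
        tweets_per_user.setdefault(user, set()).add(tweet)
    keep = {u for u, ts in tweets_per_user.items() if len(ts) >= min_tweets}
    return [(u, t) for u, t in interactions if u in keep]

# Hypothetical data: user "b" acted twice, but only on one tweet.
raw = [("a", "t1"), ("a", "t2"), ("b", "t1"), ("b", "t1")]
cleaned = drop_less_interactive(raw)
```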

6.2 Experimental setup
As all the clustering methods (i.e., Kmeans, NMF, SNMF, ONMTF, and our NMFRU)
have one or more parameters to be tuned, we run these methods under different
parameter settings and choose the best result for each algorithm to make the comparison
fair. In NMFR, we have two hyperparameters, alpha and beta. To find proper values for these
parameters, we plot a loss-value curve, with values ranging from 0.1 to 1000. The
alpha and beta values can then be found by scanning the plot. Since our dataset is relatively
large and cannot be completely labeled manually, we randomly choose 5% of the data and use
the result tested on this sample as the framework result. Our online framework achieves
a good tradeoff between the above two extremes and is able to study the evolution of
sentiments. Before we present the online framework, we first introduce the following
two observations: (1) The frequency distribution of vocabularies changes over time;
however, the sentiments of vocabularies do not change or change slowly over time.
(2) Considering the entire population, the majority of users rarely change their mind
within a short time.

6.3 Evaluation Metrics
To evaluate the clustering result, we use the widely used standard metrics, including
the clustering accuracy, cluster purity, and Normalized Mutual Information (NMI). The
Cluster accuracy is defined as follows:

The Cluster Purity is defined as follows:

The NMI is defined as
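
For reference, the three metrics can be computed as in the sketch below, which uses their commonly cited definitions. These are assumptions rather than the report's own formulas: accuracy is taken over the best one-to-one mapping of clusters to classes (found here by brute force, which is only practical for a small number of clusters), and NMI is normalized by the arithmetic mean of the two entropies.

```python
from collections import Counter
from itertools import permutations
from math import log

def purity(truth, pred):
    """Fraction of items that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(truth, pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in clusters.values()) / len(truth)

def accuracy(truth, pred):
    """Best accuracy over one-to-one mappings from clusters to classes."""
    clusters = sorted(set(pred))
    classes = sorted(set(truth))
    # Pad with None so that surplus clusters can stay unmatched.
    classes += [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)

def nmi(truth, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum(c / n * log(c * n / (pt[t] * pp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * log(c / n) for c in pt.values())
    hp = -sum(c / n * log(c / n) for c in pp.values())
    denom = (ht + hp) / 2
    return mi / denom if denom > 0 else 1.0

truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]   # same partition, different label names
partial = [0, 0, 0, 1]   # one item placed in the wrong cluster
```

Because cluster labels are arbitrary, all three metrics treat `perfect` as a flawless result even though no label matches the ground truth literally.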

6.4 Results and discussion
Table 2 shows the comparison between NMFRU and several baseline models, namely
Kmeans, NMF, SNMF, and ONMTF. When applying these baseline models to our
data, we do not embed the topic nodes into the user nodes. Instead, we use the user-tweet
bipartite graph to calculate the clustering result. In the matrix form of the bipartite
graph, the columns and rows correspond to the two sets of vertices, with each
entry corresponding to an edge between a column and a row. From Table 2, we can
see that NMFRU achieves the best performance in terms of accuracy, purity, and NMI.
This is because our bipartite graph is created from our tripartite graph model,
so it embeds more information than the plain bipartite graph. We also use
the tri-factorization and locality-preserving schemes, which further improve the
performance.

We study the average convergence time of our framework in Fig. 4. When the
number of iterations is around 23, our framework tends to converge with a total loss of
2, which shows that the calculation of NMFRU is fast. Meanwhile, when comparing
the convergence times of the different baseline methods in Fig. 5, we can see that
NMFRU is slower than Kmeans but faster than the other baseline clustering methods.
This is because NMFRU performs fewer matrix multiplication operations, which saves
some running time. Therefore, TGC-PDA, which uses NMFRU as the core clustering
algorithm, can be used for a large dataset. As for the polarity of the communities, Table
3 shows the ten largest communities with their polarity ratios. From Table 3, we find that
the neutral ratio is quite high across all topics.

Fig. 4 Total loss with different numbers of iterations


Fig. 5 Convergence time of methods.

Chapter 7

Conclusion

The outbreak of COVID-19 has thrown the whole world into chaos. People often search for
real-time news and vent their emotions through the Internet. Online social networks (OSNs)
are widely used for opinion sharing, news publishing, and information spreading. The large
volume of useful data from OSNs can be leveraged to help public officials and governments
make better decisions. In this paper, we build the TGC-PDA framework to use Twitter
data to monitor and automatically collect the voice of the people during the COVID-19
pandemic. The TGC-PDA framework takes advantage of the characteristics of
the network structure of Twitter users and tweets to effectively analyze community
structures and sentiments. It enables us to extract open source intelligence from
each community, which could be used to track people's feedback and opinions
on coronavirus pandemic events. Our work is currently at a pioneering stage
and focuses only on English-language tweets. It would be feasible to extend our
work to handle tweets in other languages. Similar techniques can be applied to other
online and publicly available social media platforms, such as Reddit. Since a tweet may
contain not only text but also embedded hyperlinks, images, or even videos, it would
be interesting and challenging to extract more information from them. Moreover, since
some events during COVID-19 are time-sensitive, it would also be interesting to study
the tweets from the perspective of time-series analysis.

References

[1] Everyone included: Social impact of COVID-19,
https://www.un.org/development/desa/dspd/everyoneincluded-covid-19.html, 2020.

[2] Wikipedia, COVID-19 pandemic, https://en.wikipedia.org/wiki/COVID-19pandemic, 2021.

[3] Domestic travel during the COVID-19 pandemic,
https://www.cdc.gov/coronavirus/2019-ncov/travelers/travel-duringcovid19.html, 2020.

[4] Travelers prohibited from entry to the United States,
https://www.cdc.gov/coronavirus/2019-ncov/travelers/fromother-countries.html, 2020.

[5] K. Cohen, Tokyo 2020 Olympics officially postponed until 2021,
https://tv5.espn.com/olympics/story/ /id/28946033/tokyo-olympics-officially-postponed-2021,
2020.

[6] Wikipedia, RNA virus, https://en.wikipedia.org/wiki/RNAvirus, 2021.

[7] How does fake news of 5G and COVID-19 spread worldwide?,
https://www.medicalnewstoday.com/articles/5g-doesnt-cause-covid-19-but-the-rumor-it-does-spreadlike-a-virus,
2021.

[8] L. J. Chang, W. Li, L. Qin, W. J. Zhang, and S. Y. Yang, pSCAN: Fast and
exact structural graph clustering, IEEE Trans. Knowl. Data Eng., vol. 29, no. 2,
pp. 387–401, 2017.

[9] R. El Bacha and T. T. Zin, Ranking of influential users
based on user-tweet bipartite graph, in Proc. of 2018 IEEE Int. Conf. Service
Operations and Logistics, and Informatics (SOLI), Singapore, 2018, pp. 97–101.

[10] A. Rodríguez, C. Argueta, and Y. L. Chen, Automatic detection of hate
speech on facebook using sentiment and emotion analysis, in Proc. of 2019
Int. Conf. Artificial Intelligence in Information and Communication (ICAIIC),
Okinawa, Japan, 2019, pp. 169–174.

[11] J. Zhou and C. Kwan, Missing link prediction in social networks, in Proc.
15th Int. Symp. Neural Networks, Minsk, Belarus, 2018, pp. 346–354.

[12] A. Reyes-Menendez, J. R. Saura, and C. Alvarez-Alonso, Understanding
#WorldEnvironmentDay user opinions in twitter: A topic-based sentiment analysis
approach, Int. J. Environ. Res. Public Health, vol. 15, no. 11, p. 2537, 2018.

[13] C. H. Tan, L. L. Lee, J. Tang, L. Jiang, M. Zhou, and P. Li, User-level
sentiment analysis incorporating social networks, in Proc. 17th ACM SIGKDD
Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2011,
pp. 1397–1405.

[14] A. Giachanou and F. Crestani, Like it or not: A survey of twitter sentiment
analysis methods, ACM Comput. Surv., vol. 49, no. 2, p. 28, 2016.

[15] R. R. Iyer, J. Chen, H. N. Sun, and K. Y. Xu, A heterogeneous graphical
model to understand user-level sentiments in social media, arXiv preprint
arXiv:1912.07911, 2019.
