
A project report on

SCALABLE MACHINE LEARNING
ALGORITHMS FOR A TWITTER FOLLOWER
RECOMMENDER SYSTEM

Submitted in partial fulfillment for the award of the degree of

M.Tech Integrated Software Engineering

By

A SATHYA (17MIS0091)

SCHOOL OF INFORMATION TECHNOLOGY AND ENGINEERING

December, 2021
DECLARATION

I hereby declare that the thesis entitled “SCALABLE MACHINE LEARNING ALGORITHMS FOR A TWITTER FOLLOWER RECOMMENDER SYSTEM” submitted by me, for the award of the degree of M.Tech Integrated Software Engineering, VIT, is a record of bonafide work carried out by me under the supervision of Prof. Subhashini R.

I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.

Place: Vellore

Date: 03-12-2021 A Sathya

Signature of the Candidate


CERTIFICATE

This is to certify that the thesis entitled “SCALABLE MACHINE LEARNING ALGORITHMS FOR A TWITTER FOLLOWER RECOMMENDER SYSTEM” submitted by A SATHYA (17MIS0091), School of Information Technology and Engineering, VIT, for the award of the degree of M.Tech Integrated Software Engineering, is a record of bonafide work carried out by him/her under my supervision.

The contents of this report have not been submitted and will not be submitted, either in part or in full, for the award of any other degree or diploma in this institute or any other institute or university. The project report fulfils the requirements and regulations of VIT and, in my opinion, meets the necessary standards for submission.

Signature of the Guide Signature of the HOD

Internal Examiner External Examiner


ABSTRACT

Recently, machine learning algorithms have been employed in social networking recommender systems. In this paper, a Twitter recommender system is simulated by a multi-agent system that can be used to provide users with a list of useful recommendations, specifically a list of users that a user is interested in following. The simulator is used to test the scalability of a machine learning algorithm for data analysis with parallel implementation on multi-node distributed systems. The distributed environment is simulated by multi-agent modeling. The initial parameters that should be set up on the simulator include the number of nodes, the algorithm employed in the simulated recommender system, and the actual followee and follower information. The experimental results were obtained on three distinct datasets for evaluating the accuracy and the execution time of a simulated recommender system when testing the ML algorithm in different scenarios.

ACKNOWLEDGEMENT

It is my pleasure to express my deep sense of gratitude to Prof. Subhashini R, Assistant Professor Sr. Grade 2, SITE, Vellore Institute of Technology, for her constant guidance, continual encouragement, and understanding; more than all, she taught me patience in my endeavor. My association with her is not confined to academics only; it has been a great opportunity to work with an intellectual and expert in the field of Machine Learning.

I would like to express my gratitude to Dr. G. Viswanathan, Mr. G. V. Selvam, Dr. Rambabu Kodali, Dr. S. Narayanan, and Dr. Sumathy S, School of Information Technology and Engineering, for providing an environment to work in and for their inspiration during the tenure of the course.

In a jubilant mood, I express my whole-hearted thanks to all the teaching staff and members working as limbs of our university for their selfless enthusiasm, coupled with the timely encouragement showered on me, which prompted the acquirement of the requisite knowledge to finalize my course study successfully. I would like to thank my parents for their support.

It is indeed a pleasure to thank my friends who persuaded and encouraged me to take up and complete this task. Last but not least, I express my gratitude and appreciation to all those who have helped me directly or indirectly towards the successful completion of this project.

Place: Vellore

Date: 03-12-2021 A Sathya

Name of the student

TABLE OF CONTENTS

ABSTRACT

1 INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
1.3 OBJECTIVE AND SCOPE OF THE PROJECT
1.4 EXISTING SYSTEM
1.4.1 DISADVANTAGES OF EXISTING SYSTEM
1.5 PROPOSED SYSTEM
1.5.1 ADVANTAGES OF PROPOSED SYSTEM
1.6 LITERATURE SURVEY

2 PROJECT DESCRIPTION
2.1 INTRODUCTION
2.2 DIAGRAMS
2.2.1 ER DIAGRAM
2.2.2 ACTIVITY DIAGRAM
2.2.3 FLOW DIAGRAM
2.2.4 USE-CASE DIAGRAM
2.3 MODULES
2.3.1 MODULE DESCRIPTION
2.4 ALGORITHMS

3 SOFTWARE SPECIFICATION
3.1 TECHNOLOGIES
3.1.1 MACHINE LEARNING
3.1.2 ANACONDA
3.1.3 PYTHON
3.2 HARDWARE REQUIREMENTS
3.3 SOFTWARE REQUIREMENTS

4 IMPLEMENTATION
4.1 GENERAL
4.2 CODE IMPLEMENTATION
4.3 SNAPSHOTS

5 CONCLUSION AND REFERENCES
5.1 CONCLUSION
5.2 FUTURE WORK
5.3 REFERENCES
Chapter 1

Introduction

1.1 INTRODUCTION
A Twitter recommender system is simulated by a multi-agent system that can be used to provide users with a list of useful recommendations, specifically a list of users that a user is interested in following. The simulator is used to test the scalability of a machine learning algorithm for data analysis with parallel implementation on multi-node distributed systems. The distributed environment is simulated by multi-agent modeling.

1.2 PROBLEM STATEMENT

The purpose of user recommendation is to identify relevant people to follow


among millions of users that interact in the social network. Previous attempts include both
content-based and graph-based approaches. The former focuses on metrics for measuring
the topic similarity among Twitter users, the latter exploits the graph of relationships
among users to infer correlations.

The main idea behind this work is that users may share similar interests but have different opinions about them. Therefore, we extend content-based recommendation by means of the sentiments and opinions extracted from the users' micro-posts in order to improve the accuracy of the suggestions. This leads us to define a novel weighting function in order to enrich content-based user profiles.

1.3 OBJECTIVE AND SCOPE OF THE PROJECT

• The objective of this project is to show how sentiment analysis can help improve the user experience over a social network or system interface.
• The learning algorithm will learn what our emotions are from statistical data and then perform sentiment analysis.
• Our main objective is also to maintain accuracy in the final result.
• The main goal of such a sentiment analysis is to discover how the audience perceives the television show.
• The Twitter data that is collected will be classified into two categories: positive or negative.
• Particular emphasis is placed on evaluating different machine learning algorithms for the task of Twitter sentiment analysis.
1.4 EXISTING SYSTEM

We have designed two different algorithms for followee recommendation on Twitter. The first algorithm is only based on the topology of the followers/followees network and suggests users that are neighboring the target user up to some distance. The second algorithm is content-based and aims at suggesting users that may not be in the neighborhood of the target user, but whose tweets are similar in content to those of the users the target user already follows.
Topology-based recommender:

The general idea behind this algorithm is to suggest users that are in the
neighborhood of the target user and that can be potential followees.
A user’s neighborhood is determined from the follower/followee relations in the
social network. We apply the following heuristic to obtain the list of candidate users for
recommendation:
1. Starting with the target user uT, obtain the list of users he/she follows; call this list S, i.e. S(uT) = ∪f∈followees(uT) {f}.
2. For each element in S get its followers; call the union of all these lists L, i.e. L(uT) = ∪s∈S followers(s).
3. For each element in L obtain its followees; call the union of all these lists T, i.e. T(uT) = ∪l∈L followees(l).
4. Exclude from T those users that the target user is already following. Call the resulting list of candidates R, so R = T − S.
Each element in R is a possible user to recommend to the target user. Notice that each element can appear more than once in R, depending on the number of times that each user appears in the followees or followers lists obtained at steps 2 and 3 above.
The rationale behind this heuristic procedure is that the target user is an information seeker who has already identified some interesting users acting as information sources, which are his/her followees. Other people that also follow some of the users in this group have interests in common with the target user and might have discovered other relevant information sources on the same topics, which are in turn their followees.
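To make the heuristic concrete, here is a minimal Python sketch of steps 1 to 4, assuming the follower/followee relations are available as in-memory dictionaries (the followees and followers maps and the user names are illustrative assumptions, not part of the original system):

from collections import Counter

# Hypothetical adjacency maps: user -> set of users he/she follows (assumed layout).
followees = {"target": {"a", "b"}, "a": {"c"}, "b": {"c", "d"},
             "x": {"a", "d"}, "y": {"b", "e"}, "c": set(), "d": set(), "e": set()}
# Invert the map to obtain followers.
followers = {u: {v for v, fs in followees.items() if u in fs} for u in followees}

def topology_candidates(target):
    S = followees[target]                                 # step 1: the target's followees
    L = [f for s in S for f in followers.get(s, set())]   # step 2: followers of each followee
    T = [g for l in L for g in followees.get(l, set())]   # step 3: followees of each of those
    # Step 4: drop users already followed (and the target itself); multiplicities
    # in T act as a natural ranking score for the candidates.
    R = Counter(g for g in T if g not in S and g != target)
    return R.most_common()

print(topology_candidates("target"))   # e.g. [('d', 1), ('e', 1)]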

Content-based recommender:

1. Obtain the authors of the most recent publications that appear in Twitter’s public timeline, U = {u1, u2, ..., um}.
2. For each user uC ∈ U, build profile_base(uC); that is, we build the term vector corresponding to each uC.
3. For each user uC ∈ U, compute sim(uC, uT) = max over i: fi ∈ followees(uT) of sim_cos[profile_base(fi), profile_base(uC)], where sim_cos is simply the cosine similarity between the two vectors. If sim(uC, uT) > γ, add uC to the list of recommendations ordered by similarity.
4. Repeat steps 1 to 3 until the desired number of recommendations is obtained.

In order to build the term vectors associated to users, we first detect the language of the tweets and then apply the corresponding stop-word and stemming filters. We use a term frequency weighting scheme in the term vectors.
We use a similarity threshold of γ = 0.1 to consider a user relevant for recommendation. This threshold was set very low so that the desired number of recommendations could be obtained in a reasonable time. However, it can be adjusted according to the recommender application. For example, if recommendations can be calculated off-line, the threshold can be set to a higher value, likely improving the precision of recommendations, at the expense of some additional calculation time.
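The following is a minimal sketch of step 3, assuming each user's aggregated tweets are available as plain strings; it uses scikit-learn's CountVectorizer for the term-frequency weighting described above, together with cosine similarity (the user names and texts are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative aggregated tweets per user (assumed inputs, not real data).
followee_docs = {"f1": "machine learning on twitter data",
                 "f2": "distributed systems and simulation"}
candidate_docs = {"c1": "scalable machine learning algorithms",
                  "c2": "cooking recipes and travel"}

GAMMA = 0.1  # similarity threshold from the text

# Term-frequency profiles for followees and candidates in one shared vocabulary.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(list(followee_docs.values()) + list(candidate_docs.values()))
F, C = X[:len(followee_docs)], X[len(followee_docs):]

# sim(uC, uT) = maximum cosine similarity against any of the target's followees.
sims = cosine_similarity(C, F).max(axis=1)
recs = sorted(((s, u) for u, s in zip(candidate_docs, sims) if s > GAMMA), reverse=True)
print(recs)   # candidates above the threshold, ordered by similarity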
1.4.1 DISADVANTAGES OF EXISTING SYSTEM
• Accuracy is low.
• The segmentation step has shortcomings.
• Feature extraction is not accurate.
• The computation load is very high.

1.5 PROPOSED SYSTEM

The proposed pipeline: Credentials → Tweet details → Extraction of text → Sentiment extraction using TextBlob and Naïve Bayes → Highlight user account.

In the proposed work, we discuss how sentiment is extracted from a tweet/text using a Twitter dataset. Twitter is a place where users post their views and opinions based on a situation. The main objective of our proposed system is to perform analysis on tweets carrying sentiment, which is of great help to business intelligence in predicting the future. This work addresses sentiment analysis on a Twitter dataset: first, classification is performed on tweets using a Naïve Bayes classifier, and each tweet is represented by the sentiment asserted, in terms of positive, negative, or neutral. Performing sentiment analysis is vital for businesses to find out the pros and cons of their products in the market as perceived by the public, which results in improving their business productivity. The aim of this project is to develop a classification technique using machine learning which gives accurate results and automatic sentiment classification of an unknown tweet.

Our main aim is to perform analysis on these tweets and conclude which tweets are positive and which are negative.

ARCHITECTURE FOR PROPOSED SYSTEM:

Collection of data → Data pre-processing → Train-test split → Feature extraction → Model selection (KNN with K = 1 to 40; Random Forest with n_estimators = 50 to 300) → Classification → Accuracy.
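As a sketch of the model-selection stage, the following hypothetical loop scans the K and n_estimators ranges named in the architecture, with synthetic data standing in for the real extracted tweet features (an illustration only, not the project's exact pipeline):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)   # stand-in features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best = (-1.0, "")
for k in range(1, 41):                       # KNN with K = 1 to 40
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    best = max(best, (accuracy_score(y_test, model.predict(X_test)), f"KNN(k={k})"))
for n in range(50, 301, 50):                 # RF with n_estimators = 50 to 300
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    best = max(best, (accuracy_score(y_test, model.predict(X_test)), f"RF(n={n})"))
print("best model:", best)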

1.5.1 ADVANTAGES OF PROPOSED SYSTEM

• Speed and very low complexity, which makes it very well suited to operate in real scenarios.
• The computation load needed for processing is much reduced, combined with very simple classifiers.
• Ability to learn and extract complex features.
• With its simplicity and fast processing time, the proposed algorithm is suitable to be implemented in an embedded system or mobile application that has limited processing resources.

1.6 LITERATURE SURVEY

1. TITLE: SCALABLE MACHINE LEARNING ALGORITHMS FOR A TWITTER FOLLOWEE RECOMMENDER SYSTEM

AUTHORS: Sepideh Banihashemi, Jason Li, Abdolreza Abhari

DESCRIPTION:

Recently, machine learning (ML) algorithms have been employed in social networking recommender systems. In this paper, a Twitter recommender system is simulated by a multi-agent system that can be used to provide users with a list of useful recommendations, specifically a list of users that a user is interested in following. The simulator is used to test the scalability of a machine learning algorithm for data analysis with parallel implementation on multi-node distributed systems. The distributed environment is simulated by multi-agent modeling. The initial parameters that should be set up on the simulator include the number of nodes, the algorithm employed in the simulated recommender system, and the actual followee and follower information. The experimental results were obtained on three distinct datasets for evaluating the accuracy and the execution time of a simulated recommender system when testing the ML algorithm in different scenarios.

2. TITLE: GENERATING STOCHASTIC DATA TO SIMULATE A TWITTER USER

AUTHOR: Jason Li, Abdolreza Abhari

DESCRIPTION:

Twitter is a popular social network that carries information in short messages. A user’s tweets can contain information that is similar to another user’s tweets. In this research, we aim to provide stochastic tweets that can be used for testing recommender systems with large data. For this reason, we used term frequency and inverse document frequency (tf-idf) to analyze users’ aggregated tweets. The empirical results show that the Weibull distribution fits the model of the tf-idf of the words in users' tweets. The Weibull distribution is then used to generate stochastic data for users' tweets. A simulation of a recommender system was also conducted to test the classification of users based on stochastic tweets. The recommender system uses collaborative filtering to find similarity between users. The simulation used k-means clustering to verify the similarity of the stochastic data versus real data.

3. TITLE: THE PARALLELIZATION OF BACK PROPAGATION NEURAL NETWORK

AUTHOR: Yang Liu, Lixiong Xu, Maozhen Li

DESCRIPTION:

The artificial neural network has proved to be an effective algorithm for dealing with recognition, regression, and classification tasks. At present, a number of neural network implementations have been developed, for example the Hamming network, Grossberg network, Hopfield network, and so on. Among these implementations, the back propagation neural network (BPNN) has become the most popular one due to its sensational function approximation and generalization abilities. However, in current big data research, BPNN, being both a data-intensive and computation-intensive algorithm, has had its efficiency significantly impacted.

Therefore, this paper presents a parallel BPNN algorithm based on data separation in three distributed computing environments: Hadoop, HaLoop, and Spark. Moreover, to improve the algorithm's performance in terms of accuracy, ensemble techniques have been employed. The algorithm is first evaluated in a small-scale cluster and then further evaluated in a commercial cloud computing environment. The experimental results indicate that the proposed algorithm can improve the efficiency of BPNN while guaranteeing its accuracy.

4. TITLE: MACHINE LEARNING AND LEXICON BASED METHODS FOR SENTIMENT CLASSIFICATION

AUTHOR: Zhang Hailong, Jiang Bo

DESCRIPTION:

Sentiment classification is an important subject in text mining research, which concerns the application of automatic methods for predicting the orientation of sentiment present in text documents, with many applications in a number of areas including recommender and advertising systems, customer intelligence, and information retrieval. In this paper, we provide a survey and comparative study of existing techniques for opinion mining, including machine learning and lexicon-based approaches, together with evaluation metrics. Cross-domain and cross-lingual approaches are also explored. Experimental results show that supervised machine learning methods, such as SVM and naive Bayes, have higher precision, while lexicon-based methods are also very competitive because they require little effort in human-labeled documents and are not sensitive to the quantity and quality of the training dataset.

5. TITLE: SENTIMENTAL ANALYSIS USING FUZZY AND NAÏVE BAYES

AUTHOR: Ruchi Mehra, Mandeep Kaur Bedi

DESCRIPTION:

Sentiment analysis is the best way to judge people's opinion regarding a particular post. In this paper we present an analysis of the sentiment behavior of Twitter data.
The proposed work utilizes naive Bayes and fuzzy classifiers to classify tweets into the positive, negative, or neutral behavior of a particular person. We present an experimental evaluation of our dataset and classification results, which proved that the combined proposed method is more efficient in terms of accuracy, precision, and recall.
6. TITLE: SMART SENTIMENTAL AGENT ANALYSIS THROUGH LIVE STREAMING DATA
AUTHOR: Gangan deep
DESCRIPTION:
The new branch of science which aims at getting computers to process data and become more learned, without the use of explicit programming, is termed machine learning. The most premier task of natural language processing (NLP) is sentiment analysis or opinion mining. The need for sentiment analysis has gained much popularity over recent years. In this paper, we target the problem of the review system, an utmost important part of any organizational CRM. Data inflow in this project is through the Twitter API, supplying a live stream of tweets. Finally, we set the stage to provide insights into our future work on sentiment analysis and on using this smart agent analysis in existing CRM systems in order to improve their feedback structure. Answers to these questions are provided by statistical analysis on keywords.
7. TITLE: TOWARDS EXTRACTING RELATION FROM TWITTER THROUGH SUPERVISED LEARNING APPROACH
AUTHOR: Melody Moh
DESCRIPTION:

Advancements in social media technology have resulted in the booming of massive


public data. The availability of these huge data sets offers numerous research
opportunities for deriving meaningful cause-effect relationships for many applications.
One important application domain is the cause of side effects of drugs. In this paper, we
applied supervised learning to extract useful cause-and-effect information related to
drugs from Twitter. To filter out unrelated information and to increase the accuracy of
classification, a spam filter and a preprocessing procedure have been developed.

Validation experiments were performed using a manually labeled data set based on
streamed tweets collected continuously on Twitter in real-time for 48 hours, and
exploiting six different supervised machine-learning classifiers. Results have shown that
these classifiers have achieved up to 77% accuracy in identifying drugs' cause-effect
relations on Twitter data. This result has shown a positive feasibility for collecting drug
side effect information from Twitter. The proposed method may be applied to other
areas such as food, beverages, and other daily consumer products for finding their side
effects and people's opinions concerning them.

8. TITLE: TWEET ANALYSIS BASED ON DISTINCT OPINION OF SOCIAL MEDIA USERS

AUTHOR: Geetha, Vishnu kumar kaliappan

DESCRIPTION:

For a huge population, the state of mind gets expressed via emojis and text messages. Microblogging and social networking sites have emerged as popular communication channels among internet users. Supervised text classifiers are used for sentiment analysis in both general and specific emotion detection with more accuracy. The main objective is to include intensity for predicting the different text formats from Twitter, by considering the text context associated with emoticons and punctuation. The novel Future Prediction Architecture Based On Efficient Classification is designed with various classification algorithms such as Fisher's Linear Discriminant Classifier, Support Vector Machine, Naïve Bayes Classifier, and Artificial Neural Network, along with the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering algorithm. The preliminary stage is to analyze each classification algorithm's efficiency during the prediction process. Later, the classified data is clustered to extract the required information from the trained dataset using the BIRCH method, for predicting the future. Finally, the performance of text analysis can be improved by using the most efficient classification algorithm.

9. TITLE: SENTIMENTAL ANALYSIS OF MULTILINGUAL TWITTER DATA USING NATURAL LANGUAGE PROCESSING

AUTHOR: Vikas Goel, Amit kr Gupta

DESCRIPTION:

The feelings of web users have a great influence on the rest of the users, product sellers, and market analysts. It is necessary to structure the unstructured data from various social platforms well for proper and meaningful analysis. For the classification of multilingual data, the analysis of feelings has received significant attention. This is called textual organization, which may be used to classify states of mind or feelings expressed in different ways, such as negative, positive, favorable, unfavorable, thumbs up, thumbs down, etc., in the field of natural language processing. To solve this kind of problem, sentiment analysis and deep learning techniques are two emerging techniques. Because of their machine learning ability, deep learning models are effectively used for this purpose. Recurrent neural networks and the Naive Bayes algorithm are two popular architectures used to analyze feelings in sentences. These architectures may be used in natural language processing. In this research article, we propose solutions to the multilingual sentiment analysis problem by implementing algorithms, and in order to contrast the results, we compare the precision factor to find the best solution for multilingual sentiment analysis.

CHAPTER 2

Project Description

2.1 INTRODUCTION
Sentiment analysis is part of text mining; the dataset to be analyzed can be sourced from comment columns, netizens' tweets on Twitter, and various other uploads in which people express their opinions or sentiments on a matter. People who work in data science may often hear the term sentiment analysis. Sentiment analysis is the process of analyzing various data in the form of views or opinions so as to produce conclusions from the existing opinions. The result of sentiment analysis can be a percentage of positive, negative, or neutral sentiment.
Sentiment analysis is useful for various problems of interest to human-computer interaction practitioners and researchers, as well as those from fields such as sociology, marketing and advertising, psychology, economics, and political science.
One of the social media platforms widely used by society today is Twitter; Twitter has a simple and fast concept because its messages are short. Twitter as a social medium is widely used by researchers in the field of natural language processing (NLP); in addition to its simple text data that can be crawled, Twitter also provides an API facility that makes it easy for researchers to retrieve data. Some previous research has been done with various classification algorithms; here are some examples:
An Ensemble Sentiment Classification System of Twitter Data for Airline Services Analysis uses six methods for classification, namely a lexicon-based classifier, NB, Bayesian Network, SVM, C4.5, and Random Forest, plus one method called the Ensemble Classifier, which combines five of the methods to get higher accuracy. This study uses four classes, namely positive (4288 tweets), negative (35876 tweets), neutral (40987 tweets), and irrelevant (26715 tweets).
The accuracy of each method when not combined, on a two-class dataset, is: Lexicon Based 67.9%, Naïve Bayesian 90%, Bayesian Network 91.4%, SVM 84.6%, and Random Forest 89.8%.
The Lexicon Based method did not participate in the combination because its accuracy was the lowest at 67.9%; the ensemble accuracy with the two-class dataset was 91.7%, while the ensemble's accuracy for the three-class dataset was 84.2%.
In Sentiment Analysis of Review Datasets Using Naïve Bayes and K-NN Classifier, two supervised methods are used with two datasets, namely film and hotel reviews; the more training data is entered, the better the accuracy obtained by the NB algorithm on the film dataset, but for the K-NN method the accuracy varies randomly.
Another study used NB for the classification of documents; its data were taken in three periods, namely before the legislative election, when the legislative election was held, and after the declaration of the legislative election announcement. From the data, the authors grouped public opinion as positive, negative, or neutral. The results are 90% accurate.
Text classification research with the Naïve Bayes algorithm has also been carried out for the grouping of news texts and academic abstracts. Seven experiments were conducted for news documents and academic abstract documents; in the first experiment, with a 9:1 ratio of training to test data, the highest accuracy was obtained compared with the smallest amount of training data. The use of training data amounting to 50% of the total data obtained an accuracy of more than 75%.
Opinion analysis research on smartphone features in Indonesian-language website reviews [7] collected data by means of web scraping, that is, taking review data from the target website. The test results obtained average recall and precision values of 0.63 and 0.72 respectively, with an accuracy of 81.76%.

2.2 DIAGRAMS

2.2.1 ER DIAGRAM

[Figure: ER diagram showing the components Input, Tweets Retrieval, Process Tweet, Get Features, Extract Feature Data, Naïve Bayes Classifier, and Aggregating Scores.]

2.2.2 ACTIVITY DIAGRAM

[Figure: Activity diagram. Enter Twitter search → Extract data from Twitter → Store reviews in MongoDB → Process tweet → Get feature vector → Extract features (with training data) → Naïve Bayes algorithm → Display results in a pie chart and in a bar graph.]

2.2.3 FLOW DIAGRAM

[Figure: Flow diagram. Start → Enter Twitter search → Extract data from Twitter → Process tweet → Get feature vector → Extract features; training data feeds the Naïve Bayes algorithm, which together with TextBlob yields Accuracy → Display result → End.]
2.2.4 USE-CASE DIAGRAM

[Figure: Use-case diagram. Training data feeds the machine learning algorithm; new data is passed to the classifier, which produces a prediction.]

2.3 MODULES

• Data Pre-processing
• Tweepy API
• Text-blob API and Naïve Bayes
2.3.1 MODULE DESCRIPTION

• Data Pre-processing
Raw tweets scraped from Twitter generally result in a noisy dataset. This is due to the casual nature of people's usage of social media. Tweets have certain special elements such as retweets, user mentions, etc., which have to be suitably extracted. Therefore, raw Twitter data has to be normalized to create a dataset which can be easily learned by various classifiers. We have applied an extensive number of pre-processing steps to standardize the dataset and reduce its size, as sketched below.
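A minimal sketch of such normalization on a raw tweet string follows; the regular expressions and step order here are illustrative assumptions, not the exact pipeline used:

import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

STOP = set(stopwords.words("english"))   # requires nltk.download('stopwords')
STEM = SnowballStemmer("english")

def normalize(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+", "", tweet)     # drop URLs
    tweet = re.sub(r"[@#]\w+", "", tweet)     # drop user mentions and hashtags
    tweet = re.sub(r"[^a-z\s]", "", tweet)    # keep letters and whitespace only
    return " ".join(STEM.stem(w) for w in tweet.split() if w not in STOP)

print(normalize("RT @user: Loving the new #Python release! https://t.co/xyz"))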
• Tweepy API
Twitter is a popular social network where users share messages called tweets. Twitter allows us to mine the data of any user using the Twitter API or Tweepy. The data will be tweets extracted from the user. The first thing to do is get the consumer key, consumer secret, access key, and access secret from the Twitter developer portal, which are easily available for each user. These keys will help the API with authentication.
Steps to obtain keys:
– Log in to the Twitter developer section.
– Go to “Create an App”.
– Fill in the details of the application.
– Click on “Create your Twitter Application”.
– Details of your new app will be shown along with the consumer key and consumer secret.
– For the access token, click “Create my access token”. The page will refresh and generate the access token.
Tweepy is one of the libraries that should be installed using pip. Now, in order to authorize our app to access Twitter on our behalf, we need to use the OAuth interface. Tweepy provides the convenient Cursor interface to iterate through different types of objects. Twitter allows a maximum of 3200 tweets for extraction.

Using the user's credentials, Tweepy will retrieve all the tweets of that particular user and append them to the (initially empty) array tmp. Here Tweepy is introduced as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object. Once we have collected some data, the possibilities in terms of analytics applications are endless, as the sketch below illustrates.
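Here is a minimal sketch of this flow using Tweepy's classic OAuth interface; the four credential strings and the screen name are placeholders (assumptions you must replace with your own app's values), and tmp mirrors the array mentioned above:

import tweepy

# Placeholder credentials; substitute your own app's keys from the developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tmp = []  # collected tweet texts
# Cursor pages through the user's timeline (Twitter caps extraction at 3200 tweets).
for status in tweepy.Cursor(api.user_timeline, screen_name="some_user",
                            tweet_mode="extended").items(200):
    tmp.append(status.full_text)

print(len(tmp), "tweets collected")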

One such application of extracting tweets is sentiment or emotion analysis. The emotion of
the user can be obtained from the tweets by tokenizing each word and applying machine
learning algorithms on that data. Such emotion or sentiment detection is used worldwide
and will be broadly used in the future.

• Text-blob API and Naïve Bayes

TextBlob is a Python library that offers a simple API to access its methods and perform basic NLP tasks. In this model, TextBlob sentiment analysis is done to classify tweets into positive, negative, and neutral based on polarity. However, the Naïve Bayes classifier of TextBlob can also be used for classification, as sketched below.
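A minimal sketch of polarity-based labeling with TextBlob (the example sentences are illustrative):

from textblob import TextBlob

def label(text):
    # polarity lies in [-1, 1]; its sign decides the class, 0 is neutral.
    p = TextBlob(text).sentiment.polarity
    return "positive" if p > 0 else "neutral" if p == 0 else "negative"

for t in ["I love this phone", "The service was terrible", "It is a phone"]:
    print(t, "->", label(t))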

Text blob: a text blob is a representation of text that describes the occurrence of words within a document. The occurrence of words is represented as a numerical feature. It is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
The approach is very simple and flexible and can be used for extracting features from documents. But there is some complexity in two cases: one is designing the vocabulary of known words, and the other is scoring the presence of known words. Let us consider that there are two classes, a positive class and a negative class.

Each class contains some words: the positive class contains a bag of positive words (fine, good, fantastic) and the negative class contains a bag of negative words (slow, hate, terrible, heavy). We give a text/sentence as input and start counting the frequency of each word in the document, and this gives the result of whether the text/sentence belongs to the positive class or the negative class.

Here is an explanation of the Naïve Bayes classifier:

In machine learning, a Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem. The feature model used by a naive Bayes classifier makes strong independence assumptions. This means that the existence of a particular feature of a class is independent of, or unrelated to, the existence of every other feature.

Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is
a strong assumption but results in a fast and effective method.

The probability of a class value given a value of an attribute is called the conditional
probability. By multiplying the conditional probabilities together for each attribute for a
given class value, we have a probability of a data instance belonging to that class.

To make a prediction we can calculate probabilities of the instance belonging to each class
and select the class value with the highest probability.

2.4 ALGORITHMS
• Naïve Bayes
• Random forest

Naïve Bayes Algorithm
Bayes’ Theorem provides a way that we can calculate the probability of a piece of data
belonging to a given class, given our prior knowledge. Bayes’ Theorem is stated as:
• P(class|data) = (P(data|class) * P(class)) / P(data)
Where P(class|data) is the probability of class given the provided data.

Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is
a strong assumption but results in a fast and effective method.
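For concreteness, here is a minimal bag-of-words Naïve Bayes sketch with scikit-learn; MultinomialNB is a common choice for word counts (the toy sentences and labels are illustrative assumptions, not the project's dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["good fantastic fine", "great good service",
         "hate terrible slow", "heavy terrible"]      # toy training sentences
labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)                                 # learn per-class word probabilities
print(clf.predict(["fine and good"]))                  # -> ['positive']
print(clf.predict_proba(["terrible service"]))         # class membership probabilities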

Random Forest Algorithm

Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average to improve the predictive accuracy on that dataset. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

Implementation steps of the Random Forest algorithm (a sketch follows this list):

• Data pre-processing step.
• Fitting the random forest algorithm to the training set.
• Predicting the test result.
• Testing the accuracy of the result.
• Visualizing the test set result.
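A minimal sketch of these steps, with synthetic data standing in for the real tweet features (an illustration under that assumption, not the project's full pipeline):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # stand-in features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)                 # fit the forest to the training set
y_pred = rf.predict(X_test)              # predict the test result
print("accuracy:", accuracy_score(y_test, y_pred))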

Chapter 3

SOFTWARE SPECIFICATION

3.1 TECHNOLOGIES

3.1.1 MACHINE LEARNING

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning builds on data sets: through the use of statistical methods, algorithms are trained to make classifications or predictions.

JUPYTER

Jupyter, previously known as IPython Notebook, is a web-based, interactive


development environment. Originally developed for Python, it has since expanded to
support over 40 other programming languages including Julia and R.
Jupyter allows for notebooks to be written that contain text, live code, images,
and equations. These notebooks can be shared, and can even be hosted on GitHub for free.
For each section of this tutorial, you can download a Jupyter notebook that allows you to edit and experiment with the code and examples for each topic. Jupyter is part of the Anaconda distribution; it can be started from the command line using the jupyter command:
$ jupyter notebook
SCIKIT-LEARN
SciKit-Learn provides a standardised interface to many of the most commonly used machine learning algorithms, and is the most popular and frequently used library for machine learning in Python. As well as providing many learning algorithms, SciKit-Learn has a large number of convenience functions for common preprocessing tasks (for example, normalisation or k-fold cross validation).
SciKit-Learn is a very large software library.

CLUSTERING
Clustering algorithms focus on ordering data together into groups. In general, clustering algorithms are unsupervised: they require no y response variable as input. That is to say, they attempt to find groups or clusters within data where you do not know the label for each sample. SciKit-Learn has many clustering algorithms, but in this section we will demonstrate hierarchical clustering on a DNA expression microarray dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using the SciPy library.
The goal is to cluster the data properly into logical groups, in this case into the cancer types represented by each sample’s expression data. We do this using agglomerative hierarchical clustering, using Ward’s linkage method.
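A minimal sketch of Ward-linkage hierarchical clustering and a SciPy dendrogram, on random stand-in data (no real microarray data is used here):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # stand-in for expression profiles

Z = linkage(X, method="ward")       # agglomerative clustering, Ward's linkage
dendrogram(Z)                       # visualise the merge hierarchy
plt.title("Hierarchical clustering dendrogram")
plt.show()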
CLASSIFICATION
Previously we analysed data that was unlabelled: we did not know to what class a sample belonged (known as unsupervised learning). In contrast to this, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of algorithms for classification; in this section we will look at the Support Vector Machine.
Support Vector Machines are a very powerful tool for classification. They work well in high dimensional spaces, even when the number of features is higher than the number of samples. However, their running time is quadratic in the number of samples, so large datasets can become difficult to train. Quadratic means that if you increase a dataset in size by 10 times, it will take 100 times longer to train.
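A minimal Support Vector Machine sketch with scikit-learn on a bundled toy dataset (the kernel and C values are illustrative defaults that would normally be tuned):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)      # support vector classifier
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))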
3.1.2 ANACONDA

It is a free and open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.), that aims to simplify package management and
deployment.

Anaconda distribution comes with more than 1,500 packages as well as
the Conda package and virtual environment manager. It also includes a GUI, Anaconda
Navigator, as a graphical alternative to the Command Line Interface (CLI).

The big difference between Conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and
the reason Conda exists. Pip installs all Python package dependencies required, whether or
not those conflict with other packages you installed previously.

So your working installation of, for example, Google Tensorflow, can suddenly stop
working when you pip install a different package that needs a different version of the
Numpy library. More insidiously, everything might still appear to work but now you get
different results from your data science, or you are unable to reproduce the same results
elsewhere because you didn't pip install in the same order.

Conda analyzes your current environment, everything you have installed, any version
limitations you specify (e.g. you only want tensorflow >= 2.0) and figures out how to install
compatible dependencies. Or it will tell you that what you want can't be done. Pip, by
contrast, will just install the thing you wanted and any dependencies, even if that breaks
other things.

Open source packages can be individually installed from the Anaconda repository,
Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda
install command.

Anaconda Inc compiles and builds all the packages in the Anaconda repository itself, and
provides binaries for Windows 32/64 bit, Linux 64 bit and MacOS 64-bit. You can also
install anything on PyPI into a Conda environment using pip, and Conda knows what it has
installed and what pip has installed. Custom packages can be made using the conda
build command, and can be shared with others by uploading them to Anaconda
Cloud, PyPI or other repositories.

The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python
3.7. However, you can create new environments that include any version of Python
packaged with conda.

Anaconda Navigator is a desktop Graphical User Interface (GUI) included in Anaconda


distribution that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands.

Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository,
install them in an environment, run the packages and update them. It is available
for Windows, macOS and Linux.

The following applications are available by default in Navigator:

• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• RStudio
• Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and
integrating XML Web services, Microsoft Windows-based applications, and Web
solutions. The .NET Framework is a language-neutral platform for writing programs that
can easily and securely interoperate. There’s no language barrier with .NET: there are
numerous languages available to the developer including Managed C++, C#, Visual Basic
and Java Script. The .NET framework provides the foundation for components to interact
seamlessly, whether locally or remotely on different platforms. It standardizes common
data types and communications protocols so that components created in different languages
can easily interoperate.

“.NET” is also the collective name given to various software components built upon the
.NET platform. These will be both products and services.

Microsoft VISUAL STUDIO is an Integrated Development Environment (IDE)
from Microsoft. It is used to develop computer programs, as well as websites, web apps,
web services and mobile apps.

3.1.3 PYTHON
• Python is a powerful multi-purpose programming language created by Guido van
Rossum.
• It has simple easy-to-use syntax, making it the perfect language for someone
trying to learn computer programming for the first time.
Features Of Python :

1. Easy to code:
Python is a high-level programming language. Python is a very easy language to learn compared to other languages like C, C#, JavaScript, and Java. It is very easy to code in Python, and anybody can learn the basics of Python in a few hours or days. It is also a developer-friendly language.

2. Free and Open Source:
The Python language is freely available on the official website, and you can download it from the given download link by clicking on the Download Python keyword. Since it is open source, the source code is also available to the public, so you can download it, use it, and share it.

3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports object-oriented concepts such as classes, objects, encapsulation, etc.

4. GUI Programming Support:

Graphical user interfaces can be made using modules such as PyQt5, PyQt4, wxPython, or Tk in Python. PyQt5 is the most popular option for creating graphical apps with Python.

5. High-Level Language:
Python is a high-level language. When we write programs in Python, we do not need to remember the system architecture, nor do we need to manage the memory.

6. Extensible feature:
Python is an extensible language: we can write some of our Python code in C or C++ and compile that code as a C/C++ extension.

7. Python is a Portable language:

The Python language is also a portable language. For example, if we have Python code for Windows and we want to run this code on another platform such as Linux, Unix, or Mac, then we do not need to change it; we can run this code on any platform.

8. Python is an Integrated language:

Python is also an integrated language, because we can easily integrate Python with other languages like C, C++, etc.

9. Interpreted Language:
Python is an interpreted language, because Python code is executed line by line. Unlike other languages such as C, C++, and Java, there is no need to compile Python code, and this makes it easier to debug our code. The source code of Python is converted into an intermediate form called bytecode.

10. Large Standard Library:

Python has a large standard library which provides a rich set of modules and functions, so you do not have to write your own code for every single thing. There are many libraries in Python for things such as regular expressions, unit testing, web browsers, etc.

11. Dynamically Typed Language:

Python is a dynamically typed language. That means the type (for example int, double, long, etc.) of a variable is decided at run time, not in advance. Because of this feature, we don't need to specify the type of a variable.

APPLICATIONS OF PYTHON :

WEB APPLICATIONS

• You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular platforms for
creating Web Apps are: Django, Flask, Pyramid, Plone, Django CMS.
• Sites like Mozilla, Reddit, Instagram and PBS are written in Python.

SCIENTIFIC AND NUMERIC COMPUTING

• There are numerous libraries available in Python for scientific and numeric computing. Libraries like SciPy and NumPy are used in general-purpose computing, and there are specific libraries like EarthPy for earth science, AstroPy for astronomy, and so on.
• The language is also heavily used in machine learning, data mining, and deep learning.
CREATING SOFTWARE PROTOTYPES

• Python is slow compared to compiled languages like C++ and Java. It might not be a good choice if resources are limited and efficiency is a must.
• However, Python is a great language for creating prototypes. For example, you can use Pygame (a library for creating games) to create your game's prototype first. If you like the prototype, you can use a language like C++ to create the actual game.

GOOD LANGUAGE TO TEACH PROGRAMMING

• Python is used by many companies to teach programming to kids.

• It is a good language with a lot of features and capabilities. Yet, it's one of the easiest languages to learn because of its simple, easy-to-use syntax.

3.2 HARDWARE REQUIREMENTS

Processor : Intel I5
RAM : 4GB
Hard Disk : 40GB

3.3 SOFTWARE REQUIREMENTS

Python IDE : Anaconda Jupyter Notebook


Programming : Python

CHAPTER 4

IMPLEMENTATION

4.1 GENERAL

Python is a general-purpose programming language whose numerical libraries (such as NumPy) were originally designed to simplify the implementation of numerical linear algebra routines. This ecosystem has since grown into something much bigger, and it is used to implement numerical algorithms for a wide range of applications. The basic array syntax is very similar to standard linear algebra notation, but there are a few extensions that will likely cause you some problems at first.

4.2 CODE IMPLEMENTATION

import math
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from textblob import TextBlob, classifiers

from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE  # imported for the classification stage

# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('words')

warnings.filterwarnings('ignore')

# Load the customer-support tweets dataset and inspect it.
data = pd.read_csv('twcs.csv', encoding='ISO-8859-1')
data.head()
data.columns
data.isnull().sum()

# Clean the tweet text: keep letters and whitespace only, lowercase, and stem.
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
data['text'] = data['text'].str.lower()
data['text'] = data['text'].apply(lambda x: ' '.join(stemmer.stem(y) for y in x.split()))

# Keep one author's tweets (reset the index so positional access works below).
data1 = data[data['author_id'] == 'sprintcare'].reset_index(drop=True)

# Label each tweet by its TextBlob polarity: positive, neutral, or negative.
y_pred = []
for i in range(data1.shape[0]):
    analysis = TextBlob(data1['text'].loc[i])
    if analysis.sentiment.polarity > 0:
        y_pred.append('positive')
    elif analysis.sentiment.polarity == 0:
        y_pred.append('neutral')
    else:
        y_pred.append('negative')
data1['polarity'] = y_pred
data1['polarity'].value_counts()
# Per-author counts can be computed the same way when several authors are retained,
# e.g. dict(data1[data1['author_id'] == 'AmazonHelp']['polarity'].value_counts()).

# Build tf-idf features from the cleaned text.
recomender_data = data1[['author_id', 'text']]
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, ngram_range=(1, 1))
X = tfidf.fit_transform(recomender_data['text']).toarray()

# Encode the polarity labels and split into train/test sets.
Y = LabelEncoder().fit_transform(data1['polarity'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Persist the training matrix together with its labels.
out = np.hstack((X_train, Y_train.reshape(-1, 1)))
df = pd.DataFrame(out)
df.to_csv('twitter.csv')

4.3 SNAPSHOTS

RESULT:

[Screenshots of the program output and sentiment classification results appear here in the original report.]
CHAPTER 5

CONCLUSION AND REFERENCES

5.1 CONCLUSION
In this project, we built a sentiment-based approach to a Twitter follower recommender system. Tweets were collected through the Twitter API, pre-processed, and classified as positive, negative, or neutral using TextBlob and a Naïve Bayes classifier, with Random Forest considered as an alternative model. The results indicate that the sentiment extracted from user tweets can enrich content-based user profiles and support useful follower recommendations, and the accuracy and execution time of the simulated recommender system were evaluated in different scenarios.

5.2 FUTURE WORK


Possible future work includes testing the proposed method on a larger Twitter dataset. More robust feature extraction methods will be investigated to improve classification accuracy, and additional classification algorithms will be compared. Furthermore, the proposed method could be integrated into a suitable application, such as a mobile app, to perform recommendation and sentiment monitoring in real time.
5.3 REFERENCES
[1] S. Jaysri, J. Priyadharshini, P. Subathra, and P. N. Kumar, “Analysis and performance of collaborative filtering and classification algorithms,” International Journal of Applied Engineering Research, vol. 10, pp. 24529–24540, 2015.

[2] M. D. Ekstrand, J. T. Riedl, and J. A. Konstan, “Collaborative filtering recommender systems,” Foundations and Trends in Human–Computer Interaction, vol. 4, no. 2, pp. 81–173, 2011.

[3] M. Poussevin, V. Guigue, and P. Gallinari, “Extracting a vocabulary of surprise by collaborative filtering mixture and analysis of feelings,” in Proceedings of CORIA 2015, the 12th French Information Retrieval Conference, Paris, France, March 2015.

[4] M. Z. Kurdi, “Lexical and syntactic features selection for an adaptive reading recommendation system based on text complexity,” in Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 66–69, Charleston, SC, USA, April 2017.

[5] M. A. Ghazanfar and A. Prügel-Bennett, “An improved switching hybrid recommender system using naive Bayes classifier and collaborative filtering,” in Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS), Hong Kong, China, 2010.

[6] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock, “Methods and metrics for cold-start recommendations,” in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’02), New York, NY, USA, 2002.

[7] M. Ghazanfar and A. Prügel-Bennett, “Building switching hybrid recommender system using machine learning classifiers and collaborative filtering,” IAENG International Journal of Computer Science, vol. 37, no. 3, 2010.

[8] Z. Hailong, G. Wenyan, and J. Bo, “Machine learning and lexicon based methods for sentiment classification: a survey,” in Proceedings of the 11th Web Information System and Application Conference (WISA), pp. 262–265, Tianjin, China, September 2014.

[9] R. Mu, “A survey of recommender systems based on deep learning,” IEEE Access, vol. 6, pp. 69009–69022, 2018.
