SCALABLE MACHINE LEARNING
ALGORITHMS FOR A TWITTER FOLLOWER
RECOMMENDER SYSTEM
By
A SATHYA (17MIS0091)
December, 2021
DECLARATION
I further declare that the work reported in this thesis has not been submitted and will not
be submitted, either in part or in full, for the award of any other degree or diploma in this
institute or any other institute or university.
Place: Vellore
Date: 03-12-2021    A Sathya
The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The Project report fulfils the requirements and regulations of VIT
and in my opinion meets the necessary standards for submission.
ACKNOWLEDGEMENT
I express my whole-hearted thanks to all the teaching staff and members of our university for their selfless enthusiasm and timely encouragement, which helped me acquire the knowledge required to complete my course of study successfully. I would also like to thank my parents for their support.
Place: Vellore
TABLE OF CONTENTS
1 CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.4 EXISTING SYSTEM
1.5 PROPOSED SYSTEM
2 CHAPTER 2: PROJECT DESCRIPTION
2.1 INTRODUCTION
2.2 DIAGRAMS
2.2.1 ER DIAGRAM
2.3 MODULES
2.4 ALGORITHMS
3 CHAPTER 3: SOFTWARE SPECIFICATION
3.1 TECHNOLOGIES
3.2 HARDWARE REQUIREMENTS
4 CHAPTER 4: IMPLEMENTATION
4.1 GENERAL
4.3 SNAPSHOTS
5 CHAPTER 5: CONCLUSION & REFERENCES
5.1 CONCLUSION
5.2 REFERENCES
Chapter 1
Introduction
1.1 INTRODUCTION
A Twitter recommender system is simulated by a multi-agent system that can be used to provide users with a list of useful recommendations, specifically a list of users that a given user may be interested in following. The simulator is used to test the scalability of a machine learning algorithm for data analysis with parallel implementation on multi-node distributed systems. The distributed environment is simulated by multi-agent modeling.
The main idea behind this work is that users may share similar interests but have different opinions about them. Therefore, we extend content-based recommendation by means of the sentiments and opinions extracted from user micro-posts in order to improve the accuracy of the suggestions. This leads us to define a novel weighting function to enrich content-based user profiles.
• The objective of this project is to show how sentiment analysis can help improve the user experience over a social network or system interface.
• The learning algorithm will learn what our emotions are from statistical data and then perform sentiment analysis.
• Our main objective is also to maintain accuracy in the final result.
• The main goal of such a sentiment analysis is to discover how the audience perceives the television show.
• The Twitter data that is collected will be classified into two categories: positive or negative.
• Particular emphasis is placed on evaluating different machine learning algorithms for the task of Twitter sentiment analysis.
1.4 EXISTING SYSTEM
The general idea behind this algorithm is to suggest users that are in the
neighborhood of the target user and that can be potential followees.
A user’s neighborhood is determined from the follower/followee relations in the
social network. We apply the following heuristic to obtain the list of candidate users for
recommendation:
1. Starting with the target user u_T, obtain the list of users he/she follows; call this list S, i.e. S(u_T) = ⋃_{f ∈ followees(u_T)} f.
2. For each element in S, get its followers; call the union of all these lists L, i.e. L(u_T) = ⋃_{s ∈ S} followers(s).
3. For each element in L, obtain its followees; call the union of all these lists T, i.e. T(u_T) = ⋃_{l ∈ L} followees(l).
4. Exclude from T those users that the target user is already following; call the resulting list of candidates R, i.e. R = T − S.
Each element in R is a possible user to recommend to the target user. Notice that each element can appear more than once in R, depending on the number of times that each user appears in the followees or followers lists obtained at steps 2 and 3 above.
The rationale behind this heuristic procedure is that the target user is an information seeker who has already identified some interesting users acting as information sources, which are his/her followees.
Other people that also follow some of the users in this group have interests in common with the target user and might have discovered other relevant information sources on the same topics, which are in turn their followees.
Content-based recommender:
1. Obtain the authors of the most recent publications that appear in Twitter's public timeline, U = {u_1, u_2, ..., u_m}.
2. For each user u_C ∈ U, build profile_base(u_C), that is, the term vector corresponding to each u_C.
3. For each user u_C ∈ U, compute sim(u_C, u_T) = max_{f_i ∈ followees(u_T)} sim_cos(profile_base(f_i), profile_base(u_C)), where sim_cos is simply the cosine similarity between the two vectors. If sim(u_C, u_T) > γ, add u_C to the list of recommendations ordered by similarity.
4. Repeat steps 1 to 3 until the desired number of recommendations is obtained.
In order to build the term vectors associated with users, we first detect the language of the tweets and then apply the corresponding stop-word and stemming filters. We use a term-frequency weighting scheme in the term vectors.
We use a similarity threshold of γ = 0.1 to consider a user relevant for recommendation. This threshold was set very low so that the desired number of recommendations could be obtained in a reasonable time. However, it can be adjusted according to the recommender application. For example, if recommendations can be calculated off-line, the threshold can be set to a higher value, likely improving the precision of recommendations at the expense of some additional calculation time.
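To make this concrete, below is a minimal sketch of the similarity test in Python with scikit-learn. The aggregated tweets and the use of CountVectorizer are illustrative assumptions; the original system also applies language detection, stop-word removal, and stemming before building the term vectors.

# Sketch of the content-based similarity test; the documents below are
# hypothetical aggregated tweets, one per user.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

GAMMA = 0.1  # similarity threshold from the text

followee_docs = [
    "machine learning on distributed systems and spark clusters",
    "python tips for data science and pandas",
]
candidate_doc = "scalable machine learning algorithms for recommender systems"

vec = CountVectorizer(stop_words="english")  # term-frequency weighting
profiles = vec.fit_transform(followee_docs + [candidate_doc])
followee_vecs, candidate_vec = profiles[:-1], profiles[-1]

# sim(u_C, u_T) = max over the target's followees of the cosine similarity
sim = cosine_similarity(followee_vecs, candidate_vec).max()
if sim > GAMMA:
    print(f"recommend candidate (sim = {sim:.3f})")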
1.4.1 DISADVANTAGES OF EXISTING SYSTEM
• Accuracy is low.
• The candidate selection has shortcomings.
• Feature extraction is not accurate.
• Computation load is very high.
1.5 PROPOSED SYSTEM
[Figure: proposed system flow: credentials → tweet details → extraction of text]
In the proposed work, we discuss how sentiment is extracted from a tweet using a Twitter dataset. Twitter is a place where users post their views and opinions on a situation. The main objective of our proposed system is to perform analysis on tweets carrying sentiment, which is of great help to business intelligence in predicting the future. This work addresses sentiment analysis on a Twitter dataset: first, classification is performed on tweets using a naïve Bayes classifier, and each tweet is then labeled with a sentiment of positive, negative, or neutral. Performing sentiment analysis is vital for finding out the pros and cons of products in the market as seen by the public, which helps businesses improve their productivity. The aim of this project is to develop a classification technique using machine learning that gives accurate results and automatic sentiment classification of an unknown tweet.
Our main aim is to perform analysis on these tweets and conclude which tweets are positive and which are negative.
[Figure: proposed pipeline: data collection → data pre-processing → train/test split → feature extraction → Random Forest (n_estimators = 50-300) → classification → accuracy]
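As a rough illustration, the sketch below wires these stages together with scikit-learn. The file name tweets_labeled.csv and its text and sentiment columns are hypothetical placeholders; n_estimators is taken from the 50-300 range shown in the figure.

# Sketch of the proposed pipeline on a hypothetical labeled dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("tweets_labeled.csv")   # data collection (placeholder file)
df["text"] = df["text"].str.lower()      # minimal pre-processing

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(stop_words="english")  # feature extraction
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_vec, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))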
1.5.1 ADVANTAGES OF PROPOSED SYSTEM
• Speed and very low complexity, which makes it very well suited to operating in real scenarios.
• The computation load needed for processing is much reduced, combined with very simple classifiers.
• Ability to learn and extract complex features.
• With its simplicity and fast processing time, the proposed algorithm is suitable for implementation in an embedded system or mobile application with limited processing resources.
LITERATURE SURVEY
DESCRIPTION:
Recently, machine learning (ML) algorithms have been employed in social-networking recommender systems. In this paper, a Twitter recommender system is simulated by a multi-agent system that can be used to provide users with a list of useful recommendations, specifically a list of users that a user is interested in following. The simulator is used to test the scalability of a machine learning algorithm for data analysis with parallel implementation on multi-node distributed systems. The distributed environment is simulated by multi-agent modeling. The initial parameters that should be set up on the simulator include the number of nodes, the algorithm employed in the simulated recommender system, and the actual followee and follower information. The experimental results were obtained on three distinct datasets, evaluating the accuracy and the execution time of the simulated recommender system when testing the ML algorithm in different scenarios.
2. TITLE: GENERATING STOCHASTIC DATA TO SIMULATE A TWITTER USER
DESCRIPTION:
Twitter is a popular social network that carries information in short messages. A user's tweets can contain information that is similar to another user's tweets. In this research, we aim to provide stochastic tweets that can be used for testing recommender systems with large data. For this reason, we used term frequency and inverse document frequency (tf-idf) to analyze users' aggregated tweets. The empirical results show that the Weibull distribution fits the model of tf-idf of the words in users' tweets. The Weibull distribution is then used to generate stochastic data for users' tweets. A simulation of a recommender system was also conducted to test classification of users based on stochastic tweets. The recommender system uses collaborative filtering to find similarity between users. The simulation used k-means clustering to verify the similarity of the stochastic data versus real data.
DESCRIPTION:
Therefore, this paper presents a parallel BPNN algorithm based on data separation in three distributed computing environments: Hadoop, HaLoop, and Spark. Moreover, to improve the algorithm's performance in terms of accuracy, ensemble techniques have been employed. The algorithm is first evaluated in a small-scale cluster and then further evaluated in a commercial cloud computing environment. The experimental results indicate that the proposed algorithm can improve the efficiency of BPNN while guaranteeing its accuracy.
DESCRIPTION:
Sentiment analysis is the best way to judge people's opinions regarding a particular post. In this paper, we present an analysis of the sentiment behavior of Twitter data. The proposed work utilizes naive Bayes and fuzzy classifiers to classify tweets into the positive, negative, or neutral behavior of a particular person. We present an experimental evaluation of our dataset and classification results, which proved that the combined proposed method is more efficient in terms of accuracy, precision, and recall.
6. TITLE: SMART SENTIMENTAL AGENT ANALYSIS THROUGH LIVE STREAMING DATA
AUTHOR: Gangan deep
DESCRIPTION:
Machine learning is the branch of science that aims to get computers to process data and learn without the use of explicit programming. One of the premier tasks of natural language processing (NLP) is sentiment analysis, or opinion mining. The need for sentiment analysis has gained much popularity over recent years. In this paper, we target the problem of the review system, an important part of any organizational CRM. Data inflow in this project is through the Twitter API, supplying a live stream of tweets. Finally, we set the stage for future work on sentiment analysis, using this smart agent analysis on existing CRM systems to improve their feedback structure. Answers to these questions are provided by statistical analysis on keywords.
7. TITLE: TOWARDS EXTRACTING RELATION FROM TWITTER THROUGH SUPERVISED LEARNING APPROACH
AUTHOR: Melody Moh
DESCRIPTION:
Validation experiments were performed using a manually labeled data set based on streamed tweets collected continuously on Twitter in real time for 48 hours, exploiting six different supervised machine-learning classifiers. Results have shown that these classifiers achieve up to 77% accuracy in identifying drugs' cause-effect relations in Twitter data. This result demonstrates the feasibility of collecting drug side-effect information from Twitter. The proposed method may be applied to other areas such as food, beverages, and other daily consumer products for finding their side effects and people's opinions concerning them.
DESCRIPTION:
For a huge portion of the population, state of mind gets expressed via emojis and text messages. Microblogging and social networking sites have emerged as popular communication channels among internet users. Supervised text classifiers are used for sentiment analysis in both general and specific emotion detection with good accuracy. The main objective is to include intensity when predicting the different text formats from Twitter, by considering the text context associated with emoticons and punctuation. The novel Future Prediction Architecture Based on Efficient Classification is designed with various classification algorithms, such as Fisher's linear discriminant classifier, support vector machine, naïve Bayes classifier, and artificial neural network, along with the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering algorithm. The preliminary stage is to analyze each classification algorithm's efficiency during the prediction process. Later, the classified data is clustered to extract the required information from the trained data set using the BIRCH method, for predicting the future. Finally, the performance of text analysis can be improved by using an efficient classification algorithm.
9. TITLE: SENTIMENTAL ANALYSIS OF MULTILINGUAL TWITTER DATA USING NATURAL LANGUAGE PROCESSING
DESCRIPTION:
The feelings of web users have a great influence on the rest of the users, product sellers, and market analysts. It is necessary to properly structure the unstructured data from various social platforms for proper and meaningful analysis. For the classification of multilingual data, the analysis of feelings has received significant attention. This is called textual organization, which may be used to classify states of mind or feelings expressed in different ways, such as negative, positive, favorable, unfavorable, thumbs up, thumbs down, etc., in the field of natural language processing. To solve this kind of problem, sentiment analysis and deep learning techniques are two emerging approaches. Because of their machine learning ability, deep learning models are effectively used for this purpose. Recurrent neural networks and the naive Bayes algorithm are two popular architectures for analyzing feelings in sentences, and these architectures may be used in natural language processing. In this research article, we propose solutions to the multilingual sentiment analysis problem by implementing the algorithms and, in order to contrast the results, we compare precision to find the best solution for multilingual sentiment analysis.
CHAPTER 2
Project Description
2.1 INTRODUCTION
Sentiment analysis is part of text mining; the dataset to be analyzed can be sourced from comment columns, netizens' tweets on Twitter, and various other uploads from people expressing their opinions or sentiment on a matter. People who work in data science may often hear the term sentiment analysis. Sentiment analysis is the process of analyzing various data in the form of views or opinions so as to produce conclusions from the various existing opinions. The result of sentiment analysis can be a percentage of positive, negative, or neutral sentiment.
Sentiment analysis is useful for various problems of interest to human-computer interaction practitioners and researchers, as well as those from fields such as sociology, marketing and advertising, psychology, economics, and political science.
One of several social media platforms widely used by society today is Twitter. Twitter has a simple and fast concept because messages are short. Twitter is widely used by researchers in the field of natural language processing (NLP); in addition to its simple text data, which can be crawled, Twitter also provides an API facility that makes it easy for researchers to retrieve data. Some previous research has been done with various classification algorithms; here are some examples:
An Ensemble Sentiment Classification System of Twitter Data for Airline Services Analysis uses six methods for classification, namely a lexicon-based classifier, NB, Bayesian network, SVM, C4.5, and Random Forest, plus one method called the ensemble classifier, which combines five of the methods to get higher accuracy. This study uses four classes, namely positive (4288 tweets), negative (35876 tweets), neutral (40987 tweets), and irrelevant (26715 tweets).
The accuracy of each method alone on a two-class dataset is: lexicon-based 67.9%, naïve Bayes 90%, Bayesian network 91.4%, SVM 84.6%, and Random Forest 89.8%.
The lexicon-based method did not participate in the combination because its accuracy was the lowest at 67.9%. The ensemble accuracy with the two-class dataset was 91.7%, while the ensemble's accuracy for the three-class dataset was 84.2%.
In Sentiment Analysis of Review Datasets Using Naïve Bayes and K-NN Classifier, two supervised methods are used with two datasets, namely film and hotel reviews. The more training data that is entered, the better the accuracy obtained by the NB algorithm on the film dataset, but for the K-NN method accuracy varies randomly.
Another study used NB for the classification of documents; the data were taken in three periods, namely before the legislative election, while the legislative election was held, and after the declaration of the legislative election results. From this data, the authors grouped public opinion as positive, negative, or neutral. The results are 90% accurate.
Text classification research with the naïve Bayes algorithm has been applied to the grouping of news texts and academic abstracts. Seven experiments were conducted for news documents and academic abstract documents; the first experiment, with a 9:1 ratio of training to test data, gave the highest accuracy compared with the smallest amount of training data. Using training data amounting to 50% of the total data gave an accuracy of more than 75%.
Opinion analysis research on smartphone features in Indonesian-language website reviews [7] collected data by means of web scraping, i.e., taking review data from the target website. The test results gave average recall and precision values of 0.63 and 0.72 respectively, with an accuracy of 81.76%.
2.2 DIAGRAMS
2.2.1 ER DIAGRAM
[Figure: ER diagram: enter Twitter search → extract data from Twitter → process tweet → extract features → TextBlob / naïve Bayes algorithm (machine learning algorithm trained on training data) → accuracy → display result]
2.3 MODULES
• Data Pre-processing
• Tweepy API
• Text-blob API and Naïve Bayes
2.3.1 MODULE DESCRIPTION
• Data Pre-processing
Raw tweets scraped from Twitter generally result in a noisy dataset. This is due to the casual nature of people's usage of social media. Tweets contain certain special elements, such as retweets and user mentions, which have to be suitably extracted. Therefore, raw Twitter data has to be normalized to create a dataset that can be easily learned by various classifiers. We have applied an extensive number of pre-processing steps to standardize the dataset and reduce its size; a sketch of typical steps follows.
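The sketch below shows pre-processing steps of this kind in Python with NLTK; the exact regular expressions are illustrative, not the project's full pipeline.

# Illustrative tweet cleaning: retweet markers, mentions, hashtags and
# URLs are stripped, then stop-words are removed and words are stemmed.
import re
# import nltk; nltk.download('stopwords')  # first run only
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean_tweet(tweet):
    tweet = re.sub(r"RT\s+", "", tweet)                 # retweet markers
    tweet = re.sub(r"@\w+|#\w+|http\S+", "", tweet)     # mentions, hashtags, URLs
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet).lower()   # keep letters only
    return " ".join(stemmer.stem(w) for w in tweet.split()
                    if w not in stop_words)

print(clean_tweet("RT @user: Loving the new update!! http://t.co/x #tech"))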
• Tweepy API
Twitter is a popular social network where users share messages called tweets. Twitter allows us to mine the data of any user using the Twitter API or Tweepy. The data will be tweets extracted from the user. The first thing to do is to get the consumer key, consumer secret, access key, and access secret from the Twitter developer portal, which are easily available for each user. These keys let the API authenticate our requests.
Steps to obtain keys:
– Log in to the Twitter developer section.
– Go to "Create an App".
– Fill in the details of the application.
– Click on "Create your Twitter Application".
– Details of your new app will be shown along with the consumer key and consumer secret.
– For the access token, click "Create my access token". The page will refresh and generate the access token.
Tweepy is a library that can be installed using pip. In order to authorize our app to access Twitter on our behalf, we need to use the OAuth interface. Tweepy provides the convenient Cursor interface to iterate through different types of objects. Twitter allows a maximum of 3200 tweets for extraction.
Using the user credentials, Tweepy will retrieve all the tweets of the particular user, which are appended to the (initially empty) array tmp. Tweepy is thus a tool for accessing Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the "tweet" object. Once we have collected some data, the possibilities in terms of analytics applications are endless.
One such application of extracting tweets is sentiment or emotion analysis. The emotion of the user can be obtained from the tweets by tokenizing each word and applying machine learning algorithms to that data. Such emotion or sentiment detection is used worldwide and will be used even more broadly in the future.
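A minimal sketch of this extraction loop, assuming the Tweepy v3-style interface, is shown below; the credentials and screen name are placeholders.

# Authenticate via OAuth and collect a user's tweets into the list tmp.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_KEY", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tmp = []  # tweets are appended to this initially empty list
for status in tweepy.Cursor(api.user_timeline,
                            screen_name="some_user",  # hypothetical account
                            tweet_mode="extended").items(200):
    tmp.append(status.full_text)
print(len(tmp), "tweets collected")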
• Text-blob API and Naïve Bayes
TextBlob is a Python library that offers a simple API to access its methods and perform basic NLP tasks. In this model, TextBlob sentiment analysis is used to classify tweets into positive, negative, and neutral based on polarity. Alternatively, TextBlob's naive Bayes classifier can also be used for classification.
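A one-line polarity check with TextBlob looks like the following sketch; the sample sentence is illustrative.

# Classify a sentence by the sign of its TextBlob polarity score.
from textblob import TextBlob

polarity = TextBlob("The new phone is fantastic").sentiment.polarity
label = "positive" if polarity > 0 else "neutral" if polarity == 0 else "negative"
print(polarity, label)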
Text blob: A text blob is a representation of text that describes the occurrence of words within a document. The occurrence of words is represented as a numerical feature. It is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
The approach is very simple and flexible and can be used for extracting features from documents. But there is some complexity in two cases: one is designing the vocabulary of known words, and the other is scoring the presence of known words. Let us consider two classes, a positive class and a negative class. Each class contains some words: the positive class contains a bag of positive words (fine, good, fantastic) and the negative class contains a bag of negative words (hate, terrible, heavy, slow). We give a text/sentence as input and count the frequency of each word in the document, and this gives the result of whether the text/sentence belongs to the positive class or the negative class; a toy sketch follows.
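A toy sketch of this bag-of-words scoring, using the word lists above, might look as follows.

# Count how many words fall in each bag and pick the larger side.
positive_bag = {"fine", "good", "fantastic"}
negative_bag = {"hate", "terrible", "heavy", "slow"}

def classify(sentence):
    words = sentence.lower().split()
    pos = sum(w in positive_bag for w in words)
    neg = sum(w in negative_bag for w in words)
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

print(classify("the movie was fantastic and good"))  # -> positive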
Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is
a strong assumption but results in a fast and effective method.
The probability of a class value given a value of an attribute is called the conditional
probability. By multiplying the conditional probabilities together for each attribute for a
given class value, we have a probability of a data instance belonging to that class.
To make a prediction we can calculate probabilities of the instance belonging to each class
and select the class value with the highest probability.
2.4 ALGORITHMS
• Naïve Bayes
• Random Forest
Naïve Bayes Algorithm
Bayes’ Theorem provides a way that we can calculate the probability of a piece of data
belonging to a given class, given our prior knowledge. Bayes’ Theorem is stated as:
• P(class|data) = (P(data|class) * P(class)) / P(data)
Where P(class|data) is the probability of class given the provided data.
Naive Bayes simplifies the calculation of probabilities by assuming that the probability of
each attribute belonging to a given class value is independent of all other attributes. This is
a strong assumption but results in a fast and effective method.
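For illustration, the sketch below trains a multinomial naïve Bayes classifier on a tiny bag-of-words corpus with scikit-learn; the corpus is invented for the example and is not the project's dataset.

# Naive Bayes on word counts: per-word conditional probabilities are
# multiplied (under the independence assumption) to score each class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["good fantastic fine", "terrible hate heavy",
         "good fine", "hate terrible"]
labels = ["positive", "negative", "positive", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["fantastic good"])))  # -> ['positive']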
Random Forest Algorithm
Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average to improve the predictive accuracy. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Chapter 3
SOFTWARE SPECIFICATION
3.1 TECHNOLOGIES
JUPYTER
CLUSTERING
Clustering algorithms focus on grouping data together. In general, clustering algorithms are unsupervised: they require no y response variable as input. That is to say, they attempt to find groups or clusters within data where you do not know the label for each sample. SciKit-Learn has many clustering algorithms, but in this section we will demonstrate hierarchical clustering on a DNA expression microarray dataset using an algorithm from the SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using the SciPy library.
The goal is to cluster the data properly into logical groups, in this case into the cancer types represented by each sample's expression data. We do this using agglomerative hierarchical clustering with Ward's linkage method.
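A minimal sketch of this with SciPy is shown below; random data stands in for the microarray samples.

# Agglomerative clustering with Ward's linkage, plotted as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
samples = rng.normal(size=(20, 5))   # 20 illustrative samples, 5 features

Z = linkage(samples, method="ward")  # Ward's linkage
dendrogram(Z)
plt.show()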
CLASSIFICATION
Previously we analysed data that was unlabelled: we did not know to what class a sample belonged (known as unsupervised learning). In contrast, a supervised problem deals with labelled data, where we are aware of the discrete classes to which each sample belongs. When we wish to predict which class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of algorithms for classification; in this section we will look at the Support Vector Machine.
Support Vector Machines are a very powerful tool for classification. They work well in high-dimensional spaces, even when the number of features is higher than the number of samples. However, their running time is quadratic in the number of samples, so large datasets can become difficult to train: if you increase a dataset in size by 10 times, it will take 100 times longer to train.
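A minimal SVM example with SciKit-Learn, on synthetic data for illustration, looks like this.

# Train an SVM classifier in a space with more features than usual.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))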
3.1.2 ANACONDA
It is a free and open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data
processing, predictive analytics, etc.), that aims to simplify package management and
deployment.
Anaconda distribution comes with more than 1,500 packages as well as
the Conda package and virtual environment manager. It also includes a GUI, Anaconda
Navigator, as a graphical alternative to the Command Line Interface (CLI).
The big difference between Conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and
the reason Conda exists. Pip installs all Python package dependencies required, whether or
not those conflict with other packages you installed previously.
So your working installation of, for example, Google TensorFlow can suddenly stop working when you pip install a different package that needs a different version of the NumPy library. More insidiously, everything might still appear to work, but now you get different results from your data science code, or you are unable to reproduce the same results elsewhere because you didn't pip install packages in the same order.
Conda analyzes your current environment, everything you have installed, any version
limitations you specify (e.g. you only want tensorflow >= 2.0) and figures out how to install
compatible dependencies. Or it will tell you that what you want can't be done. Pip, by
contrast, will just install the thing you wanted and any dependencies, even if that breaks
other things.
Open source packages can be individually installed from the Anaconda repository,
Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda
install command.
Anaconda, Inc. compiles and builds all the packages in the Anaconda repository itself, and provides binaries for Windows 32/64-bit, Linux 64-bit, and macOS 64-bit. You can also install anything on PyPI into a Conda environment using pip, and Conda knows what it has installed and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI, or other repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python
3.7. However, you can create new environments that include any version of Python
packaged with conda.
Navigator can search for packages on Anaconda Cloud or in a local Anaconda repository, install them in an environment, run the packages, and update them. It is available for Windows, macOS, and Linux. The following applications are available by default in Navigator:
• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• RStudio
• Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and
integrating XML Web services, Microsoft Windows-based applications, and Web
solutions. The .NET Framework is a language-neutral platform for writing programs that
can easily and securely interoperate. There’s no language barrier with .NET: there are
numerous languages available to the developer including Managed C++, C#, Visual Basic
and JavaScript. The .NET framework provides the foundation for components to interact
seamlessly, whether locally or remotely on different platforms. It standardizes common
data types and communications protocols so that components created in different languages
can easily interoperate.
“.NET” is also the collective name given to various software components built upon the
.NET platform. These will be both products and services.
Microsoft VISUAL STUDIO is an Integrated Development Environment (IDE)
from Microsoft. It is used to develop computer programs, as well as websites, web apps,
web services and mobile apps.
The implementation itself is written in Python. Key features of Python include:
• Easy to code
• Free and Open Source
• Object-Oriented Language
• GUI Programming Support
• High-Level Language
• Extensible feature
• Python is Portable language
• Python is Integrated language
• Interpreted
• Large Standard Library
• Dynamically Typed Language
3.1.3 PYTHON
• Python is a powerful multi-purpose programming language created by Guido van Rossum.
• It has a simple, easy-to-use syntax, making it a perfect language for someone trying to learn computer programming for the first time.
Features of Python:
1. Easy to code:
Python is a high-level programming language and is very easy to learn compared to languages like C, C#, JavaScript, or Java. It is very easy to code in Python, and anybody can learn the basics in a few hours or days. It is also a developer-friendly language.
2. Free and open source:
Python is freely available on the official website; you can download it via the Download Python link. Since it is open source, the source code is also available to the public, so you can download it, use it, and share it.
3. Object-oriented language:
One of the key features of Python is object-oriented programming. Python supports object-oriented concepts such as classes, objects, and encapsulation.
5. High-level language:
Python is a high-level language. When we write programs in Python, we do not need to remember the system architecture, nor do we need to manage the memory.
6. Extensible:
Python is an extensible language: parts of a Python program can be written in C or C++, compiled, and used from Python.
9. Interpreted language:
Python is an interpreted language: code is executed line by line. Unlike languages such as C, C++, or Java, there is no separate compilation step, which makes it easier to debug code. Python source code is converted into an intermediate form called bytecode.
APPLICATIONS OF PYTHON :
WEB APPLICATIONS
• You can create scalable Web Apps using frameworks and CMS (Content
Management System) that are built on Python. Some of the popular platforms for
creating Web Apps are: Django, Flask, Pyramid, Plone, Django CMS.
• Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
• There are numerous libraries available in Python for scientific and numeric computing. There are general-purpose libraries like SciPy and NumPy, and domain-specific libraries like EarthPy for earth science, AstroPy for astronomy, and so on.
• Also, the language is heavily used in machine learning, data mining and deep
learning.
CREATING SOFTWARE PROTOTYPES
• Python is slow compared to compiled languages like C++ and Java. It might not be a good choice if resources are limited and efficiency is a must.
• However, Python is a great language for creating prototypes. For example, you can use Pygame (a library for creating games) to create your game's prototype first. If you like the prototype, you can use a language like C++ to build the actual game.
3.2 HARDWARE REQUIREMENTS
Processor : Intel i5
RAM : 4 GB
Hard Disk : 40 GB
CHAPTER 4
IMPLEMENTATION
4.1 GENERAL
import re
import warnings

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from textblob import TextBlob

# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('words')
warnings.filterwarnings('ignore')

# Load the customer-support tweets dataset and inspect it
data = pd.read_csv('twcs.csv', encoding='ISO-8859-1')
data.head()
data.columns
data.isnull().sum()

# Clean the text: keep letters and whitespace only, then lower-case
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
data['text'] = data['text'].str.lower()

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# Restrict to one support account and label each tweet by TextBlob polarity
data1 = data[data['author_id'] == 'sprintcare'].reset_index(drop=True)
y_pred = []
for i in range(data1.shape[0]):
    analysis = TextBlob(data1['text'].loc[i])
    if analysis.sentiment.polarity > 0:
        y_pred.append('positive')
    elif analysis.sentiment.polarity == 0:
        y_pred.append('neutral')
    else:
        y_pred.append('negative')
data1['polarity'] = y_pred
data1['polarity'].value_counts()

# TF-IDF features over the labelled tweets
recomender_data = data1[['author_id', 'text']]
tfidf = TfidfVectorizer(ngram_range=(1, 1))
X = tfidf.fit_transform(recomender_data['text']).toarray()

# Split the data, then persist features and labels together
X_train, X_test, Y_train, Y_test = train_test_split(
    X, data1['polarity'].values, test_size=0.2, random_state=42)
combined = np.hstack((X_train, Y_train.reshape(-1, 1)))
df = pd.DataFrame(combined)
df.to_csv('twitter.csv')
4.3 SNAPSHOTS
RESULT:
[Screenshots: notebook output showing the polarity value counts and classification results]
CHAPTER 5
5.1 CONCLUSION
In this project, we performed sentiment analysis on Twitter data as the basis of a follower recommender system. Tweets were collected through the Twitter API, pre-processed, and classified into positive, negative, and neutral categories using TextBlob polarity and the naïve Bayes approach, and TF-IDF features were extracted for classification with machine learning algorithms such as Random Forest. This supports the idea, discussed in Chapter 1, that sentiments and opinions extracted from user micro-posts can enrich content-based user profiles and improve the accuracy of the suggestions. In future work, the system can be evaluated on larger datasets and with additional classifiers to further improve accuracy and scalability.
5.2 REFERENCES
[3] M. Poussevin, V. Guigue, and P. Gallinari, "Extracting a vocabulary of surprise by collaborative filtering mixture and analysis of feelings," in Proceedings of CORIA 2015, the 12th French Information Retrieval Conference, Paris, France, March 2015.
[4] M. Z. Kurdi, "Lexical and syntactic features selection for an adaptive reading recommendation system based on text complexity," in Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 66-69, Charleston, SC, USA, April 2017.
[6] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock, "Methods and metrics for cold-start recommendations," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '02), New York, NY, USA, 2002.
[9] R. Mu, "A survey of recommender systems based on deep learning," IEEE Access, vol. 6, pp. 69009-69022, 2018.