Projects 2021 B4
BACHELOR OF TECHNOLOGY IN
COMPUTER SCIENCE ENGINEERING
Submitted by
K. MANOJ (317126510078)
I. MANIKANTA (318126510L13)
CH. DEEKSHITH (318126510L16)
K.V. MUKESH (318126510L20)
Under the guidance of Dr. K.S. Deepthi (Associate Professor)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND
SCIENCES (UGC AUTONOMOUS)
(Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’
Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam Dist. (A.P.)
BONAFIDE CERTIFICATE
DECLARATION
K. MANOJ 317126510078
I. MANIKANTA 318126510L13
CH. DEEKSHITH 318126510L16
K.V. MUKESH 318126510L20
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide Dr. K.S. Deepthi
Associate Professor, Department of Computer Science and Engineering, ANITS, for her
guidance with unsurpassed knowledge and immense encouragement. We are grateful to
Dr. R. Sivaranjani, Head of the Department, Computer Science and Engineering, for
providing us with the required facilities for the completion of the project work.
We are very much thankful to the Principal and Management, ANITS, Sangivalasa, for
their encouragement and cooperation to carry out this work.
We express our thanks to our Project Coordinator, Dr. K.S. Deepthi, for her continuous support
and encouragement. We thank all the teaching faculty of the Department of CSE, whose
suggestions during reviews helped us in the accomplishment of our project. We would like to
thank Mrs B.V. Udaya Lakshmi of the Department of CSE, ANITS, for providing great
assistance in the accomplishment of our project.
We would like to thank our parents, friends, and classmates for their encouragement
throughout our project period. Last but not least, we thank everyone for supporting
us directly or indirectly in completing this project successfully.
PROJECT STUDENTS
K. MANOJ (317126510078),
I. MANIKANTA (318126510L13),
CH. DEEKSHITH (318126510L16),
K.V. MUKESH (318126510L20).
ABSTRACT
Large medical datasets, used to identify diseases, are available in various
data repositories. Parkinson's is considered one of the deadliest
progressive nervous-system diseases affecting movement. It is the second
most common neurological disorder that causes disability; it reduces life
span and still has no cure. Nearly 90% of the people affected by this
disease have speech disorders. Machine learning algorithms help generate
useful knowledge from such data, and they can be used to detect diseases
in their early stages, helping to increase the lifespan of elderly people.
Speech features are the central consideration when dealing with
Parkinson's. In this work, we apply machine learning techniques, namely
KNN, Naïve Bayes, and Logistic Regression, to predict Parkinson's
disease based on input taken from the user, with the speech dataset
serving as the training input to the algorithms. Based on these features,
we identify the algorithm that gives the highest accuracy. The accuracies
obtained for the three algorithms are 80% for KNN, 79% for Logistic
Regression, and 81% for Naïve Bayes; the Naïve Bayes model, having
the highest accuracy, is used in the frontend to predict whether a patient
has Parkinson's or not. Prediction is important for helping patients in the
early stages, and this process can be done with the help of machine
learning.
CONTENTS
ABSTRACT v
LIST OF FIGURES ix
LIST OF TABLES xi
LIST OF ABBREVIATIONS xi
CHAPTER 1 INTRODUCTION 01
1.1 Introduction to Parkinson’s disease 01
1.2 Parkinson’s disease symptoms 04
1.3 Introduction to Machine Learning 04
1.3.1 Supervised learning 05
1.3.2 Unsupervised learning 07
1.3.3 Applications of Machine Learning 07
1.4 Motivation of the work 09
1.5 Problem Statement 09
1.6 Organization of Thesis 10
CHAPTER 3 METHODOLOGY 16
3.1 Proposed System 16
3.1.1 System Architecture 16
3.2 Modules Division 17
3.2.1 Speech Dataset 17
3.2.2 Data Pre-processing 20
3.2.3 Training data 23
3.2.4 Apply Machine Learning Algorithms 23
3.2.4.1 K-Nearest Neighbor 24
3.2.4.2 Naïve Bayes 27
3.2.4.3 Logistic Regression 29
3.2.5 Testing Data 32
3.3 User Interface 32
4.6 Results 60
CHAPTER 5 CONCLUSION AND FUTURE WORK 65
5.1 Conclusion 65
5.2 Future Work 65
REFERENCES 66
LIST OF FIGURES
3.3 Reading the data from the CSV file into notebook 19
4.5 Logistic Regression Test Accuracy 61
LIST OF TABLES
LIST OF ABBREVIATIONS
DNA Deoxyribonucleic Acid
PD Parkinson’s Disease
IT Information Technology
KNN K-Nearest Neighbor
CSV Comma Separated Values
NB Naïve Bayes
HTML Hyper Text Markup Language
CSS Cascading Style Sheets
CHAPTER 1 – INTRODUCTION
Fig-1.1 Structure of Neuron
This work deals with the prediction of Parkinson’s disorder, an incurable disease that is
nowadays increasing tremendously. Parkinson’s disease gets its name from James
Parkinson, who first described it as paralysis agitans; it later became known by his
surname as PD. It generally affects the neurons that are responsible for overall body
movements. The main chemicals involved are dopamine and acetylcholine, which affect
the human brain. Various environmental factors have been implicated in PD; listed below
are the factors that cause Parkinson’s disease in an individual.
Environmental factors: The environment is the surroundings or the
place in which an individual lives. The environment is a major factor that
affects not only the human brain but also all the living organisms in its
vicinity. Much research and evidence have proved that the environment
has a big hand in the development of neurodegenerative disorders, mainly
Alzheimer’s and Parkinson’s. Environmental factors that influence
neurodegenerative disorders at a high pace include:
Exposure to heavy metals (like lead and aluminum) and pesticides.
Air quality: Pollution results in respiratory diseases.
Water quality: Biotic and abiotic contaminants present in water lead to
water pollution.
Unhealthy lifestyle: It leads to obesity and a sedentary lifestyle.
Psychological stress: It increases the level of stress hormones, which
depletes the functioning of neurons.
Brain injuries or biochemical factors: The brain is the control center of our
entire body. Due to certain trauma, people suffer brain injuries, which bring
certain biochemical enzymes into play; these give neurons stability and
support some chromosomes and genes in their maintenance.
Aging factor: Aging is one of the reasons for the development of Parkinson’s
disease. According to the author, in India 11,747,102 people out of
1,065,070,607 are affected by Parkinson’s disease.
1.2 Parkinson’s disease symptoms
1.3 Introduction to Machine Learning
Machine Learning is a sub-area of artificial intelligence, whereby the term
refers to the ability of IT systems to independently find solutions to
problems by recognizing patterns in databases. In other words: Machine
Learning enables IT systems to recognize patterns on the basis of existing
algorithms and data sets and to develop adequate solution concepts.
Therefore, in Machine Learning, artificial knowledge is generated on the
basis of experience. In order to enable the software to independently
generate solutions, prior action by people is necessary. For example, the
required algorithms and data must be fed into the systems in advance, and
the respective analysis rules for the recognition of patterns in the data
stock must be defined. Once these two steps have been completed, the
system can perform the following tasks by Machine Learning:
Techniques of Supervised Machine Learning algorithms include linear and
logistic regression, multi-class classification, Decision Tree, and Support
Vector Machine.
Regression:
Linear regression is a linear model, i.e. a model that assumes a
linear relationship between the input variables (x) and the single
output variable (y). More specifically, y can be calculated from a
linear combination of the input variables (x).
Classification:
Classification is the process of categorizing a given set of data
into classes; it can be performed on both structured and
unstructured data. The method starts with predicting the class of given
data points. The classes are often referred to as target, label, or
categories.
1.3.2 Unsupervised learning:
Later, this set of data is used to render results that are tailored to
your preferences.
Video Surveillance:
Imagine one person monitoring multiple video cameras! Certainly a
difficult and boring job. This is why the idea of training computers
to do this job makes sense.
Video surveillance systems nowadays are powered by AI, which makes
it possible to detect crimes before they happen. They track unusual
behavior of individuals, like standing motionless for a long time,
stumbling, or napping on benches. The system can thus alert the
human attendants, which may ultimately help to avoid mishaps. When
such activities are reported and confirmed to be true, they help to
improve the surveillance services. This happens with machine learning
doing its job at the backend.
Face Recognition
Search Engine Result Refining:
Google and other search engines use machine learning to improve
the search results for you. Every time you execute a query, the
algorithms at the backend keep a watch on how you respond to the
results. If you open the top results and stay on the web page
for long, the search engine assumes that the results it displayed were in
accordance with the query. Similarly, if you reach the second or third
page of the search results but do not open any of the results, the
search engine estimates that the results served did not match the
requirement. This way, the algorithms working at the backend
improve the search results.
The main aim is to achieve a prediction efficiency that would be beneficial for the
patients suffering from Parkinson's, so that the impact of the disease can be reduced.
Generally, in its first stage Parkinson's can be managed well with proper treatment,
so it is important to identify PD at an early stage for the betterment of the patients.
The main purpose of this research work is to find the best prediction model, i.e. the
machine learning technique that best distinguishes a Parkinson's patient from a
healthy person. The techniques used for this problem are KNN, Naïve Bayes, and
Logistic Regression. The experimental study is performed on a voice dataset of
Parkinson's patients downloaded from Kaggle. The prediction is evaluated using
evaluation metrics like the confusion matrix, precision, recall, accuracy, and f1-score.
We used feature selection, where only the important features are taken into
consideration to detect Parkinson's.
CHAPTER 2 – LITERATURE SURVEY
Anila M and Dr G Pradeepini proposed the paper titled “Diagnosis of Parkinson’s disease
using Artificial Neural network” [2]. The main objective of this paper is to detect the
disease using voice analysis of people affected with Parkinson's disease. For this purpose,
various machine learning techniques like ANN, Random Forest, KNN, SVM, and XGBoost
are used; the best model is identified, error rates are calculated, and the performance
metrics are evaluated for all the models used. The main drawback of this paper is that it is
limited to an ANN with only two hidden layers, and neural networks with two hidden
layers are sufficient and efficient only for simple datasets. They also used only one
feature-selection technique to reduce the number of features.
Arvind Kumar Tiwari proposed the paper titled “Machine Learning-based Approaches for
Prediction of Parkinson’s Disease” [3]. In this paper, the minimum redundancy maximum
relevance (mRMR) feature selection algorithm was used to select the most important
features for predicting Parkinson's disease. It was observed that a random forest using the
20 features selected by mRMR provides an overall accuracy of 90.3%, precision of 90.2%,
a Matthews correlation coefficient of 0.73, and an ROC value of 0.96, which is better in
comparison to all other machine learning based approaches such as bagging, boosting,
random forest, rotation forest, random subspace, support vector machine, multilayer
perceptron, and decision tree based methods.
Mohamad Alissa proposed the paper titled “Parkinson’s Disease Diagnosis Using Deep
Learning” [14]. This project mainly aims to automate the PD diagnosis process using deep
learning, namely Recurrent Neural Networks (RNN) and Convolutional Neural Networks
(CNN), to differentiate between healthy people and PD patients. Besides that, since
different datasets may capture different aspects of this disease, the project explores which
PD test is more effective in the discrimination process by analysing different imaging and
movement datasets (notably the cube and spiral pentagon datasets). In general, the main
aim of this paper is to automate the PD diagnosis process in order to discover this disease
as early as possible: if the disease is discovered earlier, the treatments are more likely to
improve the quality of life of the patients and their families.
There are some limitations to this paper, namely:
They used the validation set only to investigate model performance during
training, which reduced the number of samples in the training set.
RNN training is very slow, which is not practical in real work.
Disconnection and resource exhaustion: working with cloud services like Google
Colaboratory causes problems such as sudden disconnections, and because the
service is shared across world zones, it often leads to resource-exhaustion errors.
Afzal Hussain Shahid and Maheshwari Prasad Singh proposed the paper titled “A deep
learning approach for prediction of Parkinson’s disease progression” [19]. This paper
proposed a deep neural network (DNN) model using the reduced input feature space of
Parkinson’s telemonitoring dataset to predict Parkinson’s disease (PD) progression and also
proposed a PCA based DNN model for the prediction of Motor-UPDRS and Total-UPDRS
in Parkinson's Disease progression. The DNN model was evaluated on a real-world PD
dataset taken from UCI. Being a DNN model, the performance of the proposed model may
improve with the addition of more data points in the datasets.
Siva Sankara Reddy Donthi Reddy and Udaya Kumar Ramanadham proposed the paper
“Prediction of Parkinson’s Disease at Early Stage using Big Data Analytics” [21]. This
paper mainly describes various Big Data analytical techniques that may be used in
diagnosing the right disease at the right time. The main intention is to verify the accuracy of
the prediction algorithms. Their future study aims to propose an efficient method to diagnose
this type of neurological disorder from symptoms at an early stage, with better accuracy,
using different Big Data analytical techniques like Hadoop, Hive, R programming,
MapReduce, Pig, ZooKeeper, HBase, Cassandra, Mahout, etc.
Daiga Heisters proposed the paper titled “Parkinson’s: symptoms, treatments and research”
[9]. This paper says that current treatments can help to ease the symptoms, but none
can repair the damage in the brain or slow the progress of the condition. Parkinson’s
UK researchers are now working to develop new treatments that can; by working together
to build on existing discoveries and exploring these innovative areas of research, it is hoped
that a cure for Parkinson’s will be found. Parkinson’s UK offers support for everyone
affected, including people with the condition, their family, friends and carers, and researchers
and professionals working in this area.
Dragana Miljkovic et al. proposed the paper “Machine Learning and Data Mining Methods
for Managing Parkinson’s Disease” [7]. In this paper, the authors concluded that, based on
the medical tests taken by the patients, the Predictor component was able to predict 15
different Parkinson’s symptoms separately. The machine learning and data mining
techniques applied to the different symptoms give an accuracy range between 57.1% and
77.4%, with tremor detection having the highest accuracy.
Sriram, T. V., et al. proposed the paper “Intelligent Parkinson Disease Prediction Using
Machine Learning Algorithms” [22]. In this paper, the authors used voice measures of the
patients to check whether a patient has Parkinson’s or not. They applied the dataset
to various machine learning algorithms and found the maximum accuracy. To analyse the
models, they used the ROC curve and a sieve graph. Random forest gave the highest
accuracy, i.e. 90.26%.
A. Ozcift proposed the paper “SVM feature selection based rotation forest ensemble
classifiers to improve computer-aided diagnosis of Parkinson disease” [1]. In this paper,
the author aims to improve PD diagnosis accuracy with the use of support vector
machine feature selection. To evaluate performance, the author used accuracy,
kappa statistics, and the area under the curve of the classification algorithms. A rotation
forest ensemble of these classifiers is then used to increase the performance of the system.
CHAPTER 3 – METHODOLOGY
3.1 Proposed system
3.1.1 System Architecture
Machine learning has given computer systems the ability to learn automatically
without being explicitly programmed. In this work, three machine learning
algorithms are used (Logistic Regression, KNN, and Naïve Bayes). The architecture diagram
describes a high-level overview of the major system components and their important working
relationships. It represents the flow of execution and involves the following five major
steps:
The architecture diagram defines the flow of the process, from refining the raw
data through to predicting from the Parkinson’s data.
The next step is preprocessing the collected raw data into an understandable format.
Then we train the model by splitting the dataset into train data and test data.
The Parkinson’s data is evaluated by applying the machine learning
algorithms, namely Logistic Regression, KNN, and Naïve Bayes, and the
classification accuracy of each model is found.
After training the data with these algorithms, we test on the same algorithms.
Finally, the results of the three algorithms are compared on the basis of classification
accuracy.
[Architecture diagram: Speech Dataset → Data Pre-processing → Training Data → Apply Machine Learning Algorithms → Test Data → Output]
sources like files and databases. The number and quality of the collected data determine
the efficiency of the output: the more data there is, the more accurate the prediction will
be. We collected our data from the Kaggle website.
In Fig-3.2 above, we can see the speech dataset collected from the Kaggle
website. The acquired dataset has around 756 patients’ records, and each row has 755
different voice features. In this work, however, we chose the 10 main features required
for the prediction.
The features are listed below:
Id
Gender
PPE(Pitch Period Entropy)
DFA(Detrended Fluctuation Analysis)
RPDE(Recurrent Period Density Entropy)
numPulses
numPeriodPulses
meanPeriodPulses
stdDevPeriodPulses
locPctJitter
locAbsJitter
rapJitter
locShimmer, etc.
Fig-3.3 Reading the dataset from the CSV file into notebook
The dataset we chose is in the form of a CSV (Comma Separated Values) file. After
acquiring the data, our next step is to read the data from the CSV file into Google
Colab, also called a Python notebook. The Python notebook is used in our project for data
pre-processing, feature selection, and model comparison. In Fig-3.3, we have shown
how to read data from CSV files using the inbuilt Python functions that are part of the
pandas library.
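As a hedged sketch of this step (the real file comes from Kaggle, so the file name and the column subset below are stand-ins), the read step looks like:

```python
import pandas as pd

# Stand-in CSV so the snippet runs anywhere; the real Kaggle file has
# around 756 rows and 755 voice-feature columns.
with open("speech_sample.csv", "w") as f:
    f.write("id,gender,PPE,DFA,RPDE,class\n"
            "0,1,0.85,0.72,0.49,1\n"
            "1,0,0.76,0.68,0.38,0\n")

df = pd.read_csv("speech_sample.csv")    # the same call reads the real dataset
print(df.shape)                          # (rows, columns) of the loaded frame
```

`df.head()` is then the usual quick sanity check on the first few rows.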
When compared across genders, Parkinson’s disease is found mostly in males
rather than females, which is one reason we chose this dataset, as it consists of more
male subjects. Fig-3.4 shows that the male subjects outnumber the female subjects.
The main aim of this step is to study and understand the nature of the data that was
acquired in the previous step and to assess its quality. Real-world data
generally contains noise and missing values, and may be in an unusable format that cannot
be directly used for machine learning models. Data pre-processing is a required task for
cleaning the data and making it suitable for a machine learning model, which also increases
the accuracy and efficiency of the model. Identifying duplicates in the
dataset and removing them is also done in this step.
Actually, in this dataset we have 755 features, some of which may not be useful
in building our model. So we have to leave out all the unnecessary features that are
not responsible for producing the output; if we take more features into this model, the
accuracy we get is lower. When we check the correlation of the features, some of them
are the same. In Fig 3.5, a screenshot of our notebook shows the correlation of the
columns, where two of the columns have similar values; so one of them is removed.
Fig-3.5 Correlation matrix
As the correlation values of the two attributes are similar, one of them can be
removed; this kind of feature must be dropped. As our data is now stored as a data frame
in the Python notebook, we can easily drop the unnecessary features using the inbuilt
functions. In Fig 3.6, a screenshot of our notebook shows where we have dropped some
features.
After identifying and dropping some features, the initial 755 features that we have
are reduced to 10 features. Those features are as follows:
Id
Gender
PPE(Pitch Period Entropy)
DFA(Detrended Fluctuation Analysis)
RPDE(Recurrent Period Density Entropy)
numPulses
numPeriodPulses
meanPeriodPulses
stdDevPeriodPulses
locPctJitter
After pre-processing the acquired data, the next step is to identify the best features,
i.e. those able to give high efficiency. In Fig 3.7, a screenshot of our notebook
shows how to select the k best features using scikit-learn. The classes in the
sklearn.feature_selection module can be used for feature
selection/dimensionality reduction on sample sets, either to improve estimators’
accuracy scores or to boost their performance on very high-dimensional datasets.
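A minimal SelectKBest sketch on synthetic data; the score function (f_classif) and k=2 are illustrative assumptions, as the report does not state which scoring function was used:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))                 # 40 samples, 8 candidate features
y = (X[:, 0] + X[:, 3] > 0).astype(int)      # labels driven by features 0 and 3

# Keep the k best-scoring features according to the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)                            # reduced feature matrix: (40, 2)
```

`selector.get_support(indices=True)` reports which original columns survived.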
3.2.3 Training data:
Usually, we split the dataset into train and test in the ratio of 7:3 i.e., 70 percent of data is
used for training and 30 percent of data is used for testing the model. We have done it in
the same way and it has been shown in the above Fig 3.8.
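A minimal sketch of the 70/30 split described above; the toy arrays stand in for the speech features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)    # toy feature matrix, 50 samples
y = np.array([0, 1] * 25)            # toy labels

# test_size=0.3 gives the 70/30 train/test split; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))     # 35 training samples, 15 test samples
```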
Now we have both the train and test data. The next step is to identify the possible
training methods and train our models. As this is a classification problem, we have used
three different classification methods: KNN, Naïve Bayes, and Logistic Regression. Each
algorithm has been run over the training dataset, and its performance in terms of accuracy
is evaluated alongside the predictions made on the testing data set.
3.2.4.1 K-Nearest Neighbor:
This algorithm can be seen with the help of a simple example. Suppose
the dataset has two variables, which are plotted as shown in fig 3.9.
The task is to classify a new data point, marked 'X', into the "Blue" class or the "Red"
class. The coordinate values of the data point are x=45 and y=50. If the K value is
3, the KNN algorithm starts by calculating the distance of point X from all the
points. It then finds the three nearest points with the least distance to point X. This
process is shown in fig 3.10, where the three nearest points have been encircled.
The final step of the KNN algorithm is to assign the new point to the class
to which the majority of the three nearest points belong. From the figure above we can
see that two of the three nearest points belong to the class "Red" while one belongs
to the class "Blue". Therefore the new data point will be classified as "Red".
In KNN, finding the value of K is not easy, so we used an optimal way to identify
the K value through the error rate. We find the error at each K value and pick the
value that gives the minimal error. This is shown in Fig 3.11.
Fig-3.11 Finding K value using error rate
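The error-rate search of Fig 3.11 can be sketched as follows; the synthetic data and the range of K values tried (1 to 10) are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))                 # synthetic stand-in features
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

error_rates = []
for k in range(1, 11):                         # try K = 1..10
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    error_rates.append(np.mean(knn.predict(X_te) != y_te))

best_k = int(np.argmin(error_rates)) + 1       # K with the minimal error
print(best_k, min(error_rates))
```

Plotting `error_rates` against K reproduces the elbow-style curve of Fig 3.11.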
In the scikit-learn Python library, the KNeighborsClassifier class imported from
sklearn.neighbors is used for carrying out K-Nearest Neighbor classification. We specify
the value of K found above and assign an object to the classifier, then use our training
dataset to fit the model. Fig 3.12 shows the sample code for training the model using
K-Nearest Neighbor.
3.2.4.2 Naïve Bayes:
P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the
prior probability of the data.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
P(D|h): the probability of data D given that hypothesis h was true. This is known
as the likelihood.
We can frame classification as a conditional classification problem with Bayes
Theorem as follows:
P(yi | x1, x2, …, xn) = P(x1, x2, …, xn | yi) * P(yi) / P(x1, x2, …, xn)
The prior P(yi) is easy to estimate from a dataset, but the conditional probability of
the observation given the class, P(x1, x2, …, xn | yi), is not feasible to estimate unless
the number of examples is extraordinarily large, i.e. large enough to effectively estimate
the probability distribution for all possible combinations of values.
As such, the direct application of Bayes Theorem becomes intractable, especially
as the number of variables or features (n) increases. The "naive" simplification is to
assume the features are conditionally independent given the class, so the likelihood
factorizes as P(x1 | yi) * P(x2 | yi) * … * P(xn | yi).
The Naive Bayes classifier calculates the probability of an event in the following steps:
Step 1: Calculate the prior probability for the given class labels.
Step 2: Find the likelihood probability of each attribute for each class.
Step 3: Put these values in the Bayes formula and calculate the posterior probability.
Step 4: See which class has the higher probability; the input belongs to that class.
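The four steps above can be sketched with scikit-learn's GaussianNB; the toy points below are assumptions for illustration, and the thesis applies the same idea to the speech features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two continuous features per sample -> Gaussian Naive Bayes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)                 # steps 1-2: priors and likelihoods
probs = model.predict_proba([[4.0, 4.0]])      # step 3: posterior probabilities
pred = model.predict([[4.0, 4.0]])             # step 4: pick the higher posterior
print(pred[0], probs.round(3))
```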
Types of Naive Bayes Algorithms:
Gaussian Naïve Bayes: When the feature values are continuous in nature,
the assumption is made that the values linked with each class are
distributed according to a Gaussian, that is, a normal distribution.
Multinomial Naïve Bayes: Multinomial Naive Bayes is mostly favored for
data that is multinomially distributed. It is widely used in text
classification in NLP, where each event constitutes the presence of a
word in a document.
Bernoulli Naïve Bayes: When the data is distributed according to multivariate
Bernoulli distributions, Bernoulli Naive Bayes applies. That means
there can exist several features, but each one is assumed to contain a
binary value. So it requires the features to be binary-valued.
Fig-3.13 Naïve Bayes Classifier Model
The decision on the threshold value is majorly affected by the values of
precision and recall. Ideally, we would like both precision and recall to be 1, but this
is seldom the case. In the case of a precision-recall tradeoff, we use the
following arguments to decide upon the threshold:
negatives, we choose a decision value that has a high value of precision or a
low value of recall.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1). The curve of the
logistic function indicates the likelihood of something, such as whether cells are
cancerous or not, or whether a mouse is obese or not based on its weight. The
logistic (sigmoid) function is
f(x) = 1 / (1 + e^(-x))
where:
x = input to the function
e = base of the natural logarithm.
On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: The target variable can have only 2 possibilities, either "0" or "1",
which may represent "win" or "loss", "pass" or "fail", "dead" or "alive",
etc.
Multinomial: In multinomial Logistic Regression, the target variable can
have 3 or more possibilities which are not ordered, i.e. have no
quantitative measure, like "disease A", "disease B", or "disease C".
Ordinal: In ordinal Logistic Regression, the target variable deals with
ordered categories. For example, a test score can be categorized as "very
poor", "poor", "good", and "very good". Here, each category can be given
a score like 0, 1, 2, and 3.
In the scikit-learn Python library, the LogisticRegression class imported from
sklearn.linear_model is used for carrying out Logistic Regression. We specify the
number of iterations as a function parameter and assign an object to the classifier,
then use our training dataset to fit the model. Fig 3.15 shows the sample code for
training the model using Logistic Regression.
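A minimal sketch of fitting Logistic Regression as described; max_iter stands in for the iteration parameter mentioned above (1000 is an assumed value), and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)    # binomial (0/1) target

clf = LogisticRegression(max_iter=1000).fit(X, y)   # iteration cap as a parameter
print(clf.predict(X[:5]))                            # class labels for 5 samples
print(clf.score(X, y))                               # training accuracy
```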
3.2.5 Testing Data
Once Parkinson’s disease Prediction model has been trained on the pre-processed
dataset, then the model is tested using different data points. In this testing step, the model
is checked for correctness and accuracy by providing a test dataset to it. All the training
methods need to be verified for finding out the best model to be used. In figures 3.12, 3.13,
3.15, after fitting our model with training data, we used this model to predict values for the
test dataset. These predicted values on testing data are used for model comparison and
accurate calculation.
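For the comparison step, the labels and predictions below are hypothetical stand-ins for y_test and the model output; the metric calls are the standard scikit-learn ones mentioned in the problem statement:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_test = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical true test labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

print(accuracy_score(y_test, y_pred))        # fraction of correct predictions
print(confusion_matrix(y_test, y_pred))      # rows: true class, cols: predicted
print(classification_report(y_test, y_pred)) # precision, recall, f1-score
```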
CHAPTER-4
EXPERIMENTAL ANALYSIS AND RESULTS
Functional Requirements.
Non-Functional Requirements.
non-functional standards that are critical to the success of the software.
An example of a non-functional requirement: "how fast does the website load?"
Failing to satisfy non-functional requirements may result in systems that fail to satisfy user
needs.
Non-functional requirements allow you to impose constraints or restrictions on the
design of the system across the various agile backlogs.
Accuracy
Reliability
Flexibility
Python has many inbuilt library functions that can be used easily for working with machine
learning algorithms. All the necessary python libraries must be pre-installed using “pip”
command.
A Flask application is started by calling the run() method. However, while the application
is under development, it would have to be restarted manually for every change in the code.
To avoid this inconvenience, enable debug support: the server will then reload itself if the
code changes, and it also provides a useful debugger to trace any errors within the
application.
Debug mode is enabled by setting the debug property of the application object
to True before running, or by passing the debug parameter to the run() method.
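A minimal Flask sketch; the route and message are placeholders, not the project's actual pages, and app.run(debug=True) is left commented so the module can be imported without starting the server:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    # Placeholder page standing in for the prediction frontend
    return "Parkinson's prediction app"

# app.run(debug=True)  # debug mode: auto-reload on changes + in-browser debugger
```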
4.2.1.3 Python Libraries:
NumPy:
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data.
Pandas:
Sklearn:
Scikit-learn (Sklearn) is the most useful and robust library for machine learning
in Python. It is an open-source Python library that implements a variety of machine
learning, pre-processing, cross-validation, and visualization algorithms using a unified
interface. Sklearn provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality reduction, via
a consistent interface in Python. This library, which is essentially written in Python, is
built upon NumPy, SciPy, and Matplotlib.
Pickle:
The Python pickle module is used for serializing and de-serializing a Python object
structure. Pickling is a way to convert a Python object (list, dict, etc.) into a byte
stream. The idea is that this stream contains all the information necessary to
reconstruct the object in another Python script. Pickling is beneficial for applications where
you need some degree of persistence in your data: your program's state can be saved
to disk so you can continue working on it later.
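A short sketch of that round trip (the dictionary contents are illustrative):

```python
import pickle

state = {"model_name": "naive_bayes", "accuracy": 0.81}

stream = pickle.dumps(state)      # serialize the object to a byte stream
restored = pickle.loads(stream)   # reconstruct an equal object from it
```

pickle.dump/pickle.load do the same thing against an open file instead of an in-memory byte string, which is how the trained model is saved in this project.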
Matplotlib:
Matplotlib is a very powerful plotting library for those working with Python and
NumPy. For drawing statistical inferences it is often necessary to visualize the
data, and Matplotlib is the tool that helps with this. It provides a MATLAB-like
interface; the only difference is that it uses Python and is open source.
Seaborn:
Seaborn is a data visualization library built on top of Matplotlib and closely
integrated with pandas data structures in Python. Visualization is the central part of
Seaborn, which helps in exploration and understanding of data.
It offers the following functionalities:
Dataset oriented API to determine the relationship between variables.
Automatic estimation and plotting of linear regression plots.
It supports high-level abstractions for multi-plot grids.
Visualizing univariate and bivariate distribution.
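For instance, the automatic regression-line estimation can be sketched as follows (synthetic data, off-screen rendering):

```python
import matplotlib
matplotlib.use("Agg")            # render without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
x = rng.normal(size=50)
df_plot = pd.DataFrame({"x": x,
                        "y": 2 * x + rng.normal(scale=0.5, size=50)})

# regplot draws the scatter and fits/plots a linear-regression line automatically
ax = sns.regplot(x="x", y="y", data=df_plot)
```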
1. RAM: 4 GB or above
2. Storage: 30 to 50 GB
3. Processor: Any processor above 500 MHz
of system requirements, to determine whether the company has the technical expertise to
handle completion of the project. When writing a feasibility report, the following should
be taken into consideration:
A brief description of the business, to assess more possible factors which could
affect the study
The part of the business being examined
The human and economic factors
The possible solutions to the problem
At this level, the concern is whether the proposal is both technically and legally
feasible (assuming moderate cost). The technical feasibility assessment is focused
on gaining an understanding of the present technical resources of the organization
and their applicability to the expected needs of the proposed system. It is an
evaluation of the hardware and software and how they meet the needs of the
proposed system.
4.4 Sample Code
(a) model.py:
import numpy as np  # linear algebra
import pandas as pd  # analyze data
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
# additional imports required by the code below
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

df = pd.read_csv("/content/drive/MyDrive/pd_speech_features.csv")
df.head()
# The precision and recall helpers are truncated in the original listing;
# their bodies are reconstructed here following the pattern of the
# accuracy/specificity/NPV functions below.
def precision(class_id, TP, FP, TN, FN):
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TP[i] == 0 or FP[i] == 0):
            TP[i] = 0.00000001
            FP[i] = 0.00000001
        sonuc += (TP[i] / (TP[i] + FP[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def recall(class_id, TP, FP, TN, FN):
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TP[i] == 0 or FN[i] == 0):
            TP[i] = 0.00000001
            FN[i] = 0.00000001
        sonuc += (TP[i] / (TP[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc
def accuracy(class_id, TP, FP, TN, FN):
    sonuc = 0
    for i in range(0, len(class_id)):
        sonuc += ((TP[i] + TN[i]) / (TP[i] + FP[i] + TN[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def specificity(class_id, TP, FP, TN, FN):
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TN[i] == 0 or FP[i] == 0):
            TN[i] = 0.00000001
            FP[i] = 0.00000001
        sonuc += (TN[i] / (FP[i] + TN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc

def NPV(class_id, TP, FP, TN, FN):
    sonuc = 0
    for i in range(0, len(class_id)):
        if (TN[i] == 0 or FN[i] == 0):
            TN[i] = 0.00000001
            FN[i] = 0.00000001
        sonuc += (TN[i] / (TN[i] + FN[i]))
    sonuc = sonuc / len(class_id)
    return sonuc
def perf_measure(y_actual, y_pred):
    # Per-class confusion counts. The loop body is truncated in the original
    # listing and is reconstructed here from its call sites below.
    class_id = sorted(set(y_actual).union(set(y_pred)))
    TP, FP, TN, FN = [], [], [], []
    for c in class_id:
        TP.append(sum(1 for a, p in zip(y_actual, y_pred) if a == c and p == c))
        FP.append(sum(1 for a, p in zip(y_actual, y_pred) if a != c and p == c))
        FN.append(sum(1 for a, p in zip(y_actual, y_pred) if a == c and p != c))
        TN.append(sum(1 for a, p in zip(y_actual, y_pred) if a != c and p != c))
    return class_id, TP, FP, TN, FN
df.info()
df.columns
man=df.gender.sum()
total=df.gender.count()
woman=total-man
print("man: "+str(man)+" woman: "+str(woman))
sns.heatmap(df[df.columns[0:10]].corr(),annot=True)
df.shape
auc_scor=[]
precision_scor=[]
x=pd.DataFrame(xnew2)   # xnew2: selected features (its construction is elided in the listing)
x.head()
y.value_counts()
y=y.values
type(y)
score_list=[]   # renamed from score_liste for consistency with its later use
recall_scor=[]
f1_scor=[]
LR_plus=[]
LR_minus=[]   # renamed from LR_eksi for consistency with its later use
odd_scor=[]
NPV_scor=[]
youden_scor=[]
specificity_scor=[]
error_rate = []
for i in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10, 6))
plt.plot(range(1, 100), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
# +1 because the list index is 0-based while K starts at 1
print("Minimum error:-", min(error_rate), "at K =", error_rate.index(min(error_rate)) + 1)
k=10
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(x_train,y_train)
y_head=knn.predict(x_test)
print("KNN Algorithm test accuracy",knn.score(x_test,y_test))
# unpack in the order perf_measure returns: class_id, TP, FP, TN, FN
classid, tp, fp, tn, fn = perf_measure(y_test, y_head)
auc_scor.append(roc_auc_score(y_test, y_head))
score_list.append(accuracy(classid, tp, fp, tn, fn))
precision_scor.append(precision(classid, tp, fp, tn, fn))
recall_scor.append(recall(classid, tp, fp, tn, fn))
f1_scor.append(f1_score(y_test, y_head, average='macro'))
NPV_scor.append(NPV(classid, tp, fp, tn, fn))
specificity_scor.append(specificity(classid, tp, fp, tn, fn))
LR_plus.append(recall(classid, tp, fp, tn, fn) / (1 - specificity(classid, tp, fp, tn, fn)))
LR_minus.append((1 - recall(classid, tp, fp, tn, fn)) / specificity(classid, tp, fp, tn, fn))
odd_scor.append((recall(classid, tp, fp, tn, fn) / (1 - specificity(classid, tp, fp, tn, fn))) /
                ((1 - recall(classid, tp, fp, tn, fn)) / specificity(classid, tp, fp, tn, fn)))
youden_scor.append(recall(classid, tp, fp, tn, fn) + specificity(classid, tp, fp, tn, fn) - 1)
nb = GaussianNB()   # the training lines are elided in the original listing
nb.fit(x_train, y_train)
y_head = nb.predict(x_test)
print("Naive Bayes Algorithm test accuracy", nb.score(x_test, y_test))
classid, tp, fp, tn, fn = perf_measure(y_test, y_head)
auc_scor.append(roc_auc_score(y_test, y_head))
score_list.append(accuracy(classid, tp, fp, tn, fn))
precision_scor.append(precision(classid, tp, fp, tn, fn))
recall_scor.append(recall(classid, tp, fp, tn, fn))
f1_scor.append(f1_score(y_test, y_head, average='macro'))
NPV_scor.append(NPV(classid, tp, fp, tn, fn))
specificity_scor.append(specificity(classid, tp, fp, tn, fn))
TPR = recall(classid, tp, fp, tn, fn)
TNR = specificity(classid, tp, fp, tn, fn)
FPR = 1 - TNR
if FPR == 0:
    FPR = 0.00001
FNR = 1 - TPR
lrminus = FNR / TNR
lrarti = TPR / FPR   # "arti" = plus
if lrminus == 0:
    lrminus = 0.00000001
LR_plus.append(TPR / FPR)
LR_minus.append(FNR / TNR)
odd_scor.append(lrarti / lrminus)
youden_scor.append(TPR + TNR - 1)
cmnb = confusion_matrix(y_test, y_head)
f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cmnb, annot=True, linewidths=0.5, linecolor="red", fmt=".0f", ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.title("Naive Bayes Algorithm")
plt.show()
lr = LogisticRegression(max_iter=1000)   # the training lines are elided in the original listing
lr.fit(x_train, y_train)
y_head = lr.predict(x_test)
print("Logistic Regression test accuracy", lr.score(x_test, y_test))
classid, tp, fp, tn, fn = perf_measure(y_test, y_head)
auc_scor.append(roc_auc_score(y_test, y_head))
score_list.append(accuracy(classid, tp, fp, tn, fn))
precision_scor.append(precision(classid, tp, fp, tn, fn))
recall_scor.append(recall(classid, tp, fp, tn, fn))
f1_scor.append(f1_score(y_test, y_head, average='macro'))
NPV_scor.append(NPV(classid, tp, fp, tn, fn))
specificity_scor.append(specificity(classid, tp, fp, tn, fn))
TPR = recall(classid, tp, fp, tn, fn)
TNR = specificity(classid, tp, fp, tn, fn)
FPR = 1 - TNR
if FPR == 0:
    FPR = 0.00001
FNR = 1 - TPR
lrminus = FNR / TNR
lrarti = TPR / FPR
if lrminus == 0:
    lrminus = 0.00000001
LR_plus.append(TPR / FPR)
LR_minus.append(FNR / TNR)
odd_scor.append(lrarti / lrminus)
youden_scor.append(TPR+TNR-1)
cmlr = confusion_matrix(y_test,y_head)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmlr,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.title("Logistic Regression")
plt.show()
z=pd.DataFrame(score)
z
# labels corrected to match the plotted columns
sns.pointplot(x=df['algo_list'], y=df['LR-'], data=df, color='orange', alpha=0.8, label="LR-")
sns.pointplot(x=df['algo_list'], y=df['YOUDEN'], data=df, color='brown', alpha=0.8, label="YOUDEN")
sns.pointplot(x=df['algo_list'], y=df['Specificity'], data=df, color='purple', alpha=0.8, label="Specificity")
plt.xlabel('Algorithms',fontsize = 15,color='blue')
plt.ylabel('Metrics',fontsize = 15,color='blue')
plt.xticks(rotation= 45)
plt.title('Parkinsons Disease (PD) Evaluation Metrics',fontsize = 20,color='blue')
plt.grid()
plt.legend()
plt.show()
with open('model.pkl', 'wb') as f:
    pickle.dump(nb, f)
model=pickle.load(open('model.pkl','rb'))
print(model)
(b) app.py:
import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle
app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))
@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the form fields and build the feature vector before predicting
    # (this step is missing from the original listing)
    final_features = [np.array([float(x) for x in request.form.values()])]
    prediction = model.predict(final_features)
    print("final features", final_features)
    print("prediction:", prediction)
    output = round(prediction[0], 2)
    print(output)
    if output == 0:
        return render_template('index.html',
                               prediction_text="THE PATIENT DOES NOT HAVE PARKINSON'S DISEASE")
    else:
        return render_template('index.html',
                               prediction_text="THE PATIENT HAS PARKINSON'S DISEASE")
@app.route('/predict_api', methods=['POST'])
def results():
    data = request.get_json(force=True)
    prediction = model.predict([np.array(list(data.values()))])
    output = prediction[0]
    return jsonify(int(output))  # cast the NumPy integer to a JSON-serializable int
if __name__ == "__main__":
    app.run(debug=False)
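The /predict_api route expects a JSON object of feature values. A hypothetical client request (the feature names and values are illustrative, and it assumes the Flask app is running locally on port 5000) could look like this:

```python
import json
from urllib import request as urlrequest

# Illustrative feature values; the deployed model expects the features
# it was trained on, in the same order.
payload = {"RPDE": 0.57, "numPulses": 240, "numPeriodsPulses": 239}

req = urlrequest.Request(
    "http://127.0.0.1:5000/predict_api",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Sending the request (commented out so the sketch does not need a live server):
# with urlrequest.urlopen(req) as resp:
#     print(json.load(resp))   # 0 (no PD) or 1 (PD)
```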
(c) index.html:
<!DOCTYPE html>
<html >
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
<meta charset="UTF-8">
<title>PREDICTION</title>
<link href='https://fonts.googleapis.com/css?family=Pacifico' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Arimo' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Hind:300' rel='stylesheet'
type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300'
rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<body>
<div class="login">
<h1>Parkinson's Disease Prediction Analysis</h1>
{{ prediction_text }}
<!-- Main Input For Receiving Query to our ML -->
<form action="{{ url_for('predict') }}" method="post">
<input type="text" name="RPDE" placeholder="rpde" required="required" />
<input type="text" name="numPulses" placeholder="numpulses" required="required" />
<input type="text" name="numPeriodsPulses" placeholder="numperiodpulses" required="required" />
<button type="submit" class="btn btn-primary btn-block btn-large">Predict</button>
</form>
</div>
</body>
</html>
(d) style.css:
@import url(https://fonts.googleapis.com/css?family=Open+Sans);
.btn { display: inline-block; *display: inline; *zoom: 1; padding: 4px 10px 4px; margin-
bottom: 0; font-size: 13px; line-height: 18px; color: #333333; text-align: center;text-
shadow: 0 1px 1px rgba(255, 255, 255, 0.75); vertical-align: middle; background-color:
#f5f5f5; background-image: -moz-linear-gradient(top, #ffffff, #e6e6e6); background-
image: -ms-linear-gradient(top, #ffffff, #e6e6e6); background-image: -webkit-
gradient(linear, 0 0, 0 100%, from(#ffffff), to(#e6e6e6)); background-image: -webkit-
linear-gradient(top, #ffffff, #e6e6e6); background-image: -o-linear-gradient(top, #ffffff,
#e6e6e6); background-image: linear-gradient(top, #ffffff, #e6e6e6); background-repeat:
repeat-x; filter: progid:dximagetransform.microsoft.gradient(startColorstr=#ffffff,
endColorstr=#e6e6e6, GradientType=0); border-color: #e6e6e6 #e6e6e6 #e6e6e6; border-
color: rgba(0, 0, 0, 0.1) rgba(0, 0, 0, 0.1) rgba(0, 0, 0, 0.25); border: 1px solid #e6e6e6; -
webkit-border-radius: 4px; -moz-border-radius: 4px; border-radius: 4px; -webkit-box-
shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); -moz-box-
shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); box-shadow:
inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0, 0, 0, 0.05); cursor: pointer;
*margin-left: .3em; }
.btn:hover, .btn:active, .btn.active, .btn.disabled, .btn[disabled] { background-color:
#e6e6e6; }
.btn-large { padding: 9px 14px; font-size: 15px; line-height: normal; -webkit-border-radius:
5px; -moz-border-radius: 5px; border-radius: 5px; }
.btn:hover { color: #333333; text-decoration: none; background-color: #e6e6e6;
background-position: 0 -15px; -webkit-transition: background-position 0.1s linear; -moz-
transition: background-position 0.1s linear; -ms-transition: background-position 0.1s
linear; -o-transition: background-position 0.1s linear; transition: background-position 0.1s
linear; }
.btn-primary, .btn-primary:hover { text-shadow: 0 -1px 0 rgba(0, 0, 0, 0.25); color: #ffffff;
}
.btn-primary.active { color: rgba(255, 255, 255, 0.75); }
.btn-primary { background-color: #4a77d4; background-image: -moz-linear-gradient(top,
#6eb6de, #4a77d4); background-image: -ms-linear-gradient(top, #6eb6de, #4a77d4);
background-image: -webkit-gradient(linear, 0 0, 0 100%, from(#6eb6de), to(#4a77d4));
background-image: -webkit-linear-gradient(top, #6eb6de, #4a77d4); background-image: -
o-linear-gradient(top, #6eb6de, #4a77d4); background-image: linear-gradient(top,
#6eb6de, #4a77d4); background-repeat: repeat-x; filter:
progid:dximagetransform.microsoft.gradient(startColorstr=#6eb6de,
endColorstr=#4a77d4, GradientType=0); border: 1px solid #3762bc; text-shadow: 1px 1px
1px rgba(0,0,0,0.4); box-shadow: inset 0 1px 0 rgba(255, 255, 255, 0.2), 0 1px 2px rgba(0,
0, 0, 0.5); }
.btn-primary:hover, .btn-primary:active, .btn-primary.active, .btn-primary.disabled, .btn-
primary[disabled] { filter: none; background-color: #4a77d4; }
.btn-block { width: 100%; display:block; }
/* the opening of this rule is lost at a page break in the original; the
   universal box-sizing selector is assumed */
* { -webkit-box-sizing: border-box; -moz-box-sizing: border-box; -o-box-sizing: border-box; box-sizing: border-box; }
body {
width: 100%;
height:100%;
font-family: 'Open Sans', sans-serif;
background: #092756;
color: #fff;
overflow: scroll;
font-size: 18px;
text-align:center;
letter-spacing:1.2px;
background: -moz-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%),-moz-linear-gradient(top, rgba(57,173,219,.25) 0%,
rgba(42,60,87,.4) 100%), -moz-linear-gradient(-45deg, #670d10 0%, #092756 100%);
background: -webkit-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -webkit-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -webkit-linear-gradient(-45deg, #670d10 0%,#092756
100%);
background: -o-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -o-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -o-linear-gradient(-45deg, #670d10 0%,#092756 100%);
background: -ms-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), -ms-linear-gradient(top, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), -ms-linear-gradient(-45deg, #670d10 0%,#092756 100%);
background: -webkit-radial-gradient(0% 100%, ellipse cover, rgba(104,128,138,.4)
10%,rgba(138,114,76,0) 40%), linear-gradient(to bottom, rgba(57,173,219,.25)
0%,rgba(42,60,87,.4) 100%), linear-gradient(135deg, #670d10 0%,#092756 100%);
filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#3E1D6D',
endColorstr='#092756',GradientType=1 );
}
.login {
position: absolute;
top: 40%;
left: 50%;
margin: -150px 0 0 -150px;
width:400px;
height:400px;
}
input {
width: 100%;
margin-bottom: 10px;
background: rgba(0,0,0,0.3);
border: none;
outline: none;
padding: 10px;
font-size: 13px;
color: #fff;
text-shadow: 1px 1px 1px rgba(0,0,0,0.3);
border: 1px solid rgba(0,0,0,0.3);
border-radius: 4px;
box-shadow: inset 0 -5px 45px rgba(100,100,100,0.2), 0 1px 1px
rgba(255,255,255,0.2);
-webkit-transition: box-shadow .5s ease;
-moz-transition: box-shadow .5s ease;
-o-transition: box-shadow .5s ease;
-ms-transition: box-shadow .5s ease;
transition: box-shadow .5s ease;
}
                     Predicted Positive    Predicted Negative
Actual Positive              TP                    FN
Actual Negative              FP                    TN

Where TP: True Positive
FP: False Positive
FN: False Negative
TN: True Negative
4.5.2 Accuracy:
Accuracy is the proportion of the total number of predictions that were correct. It
is obtained as the sum of true positive and true negative instances divided by the total
number of samples.
It is expressed as: Accuracy = (TP + TN) / (TP + FP + FN + TN)    (3)
4.5.3 Precision:
Precision is the fraction of predicted positive instances that are truly positive. It is also
known as the ratio of correct positive results to the total number of positive results
predicted by the system.
It is expressed as: Precision (P) = TP / (TP + FP)    (4)
4.5.5 F1-Score:
The F1-score is the ratio of twice the product of precision and recall to their sum; it
is the harmonic mean of Precision and Recall. It measures the test accuracy, and its
range is 0 to 1.
It is expressed as: F1 score = 2 / ((1/Precision) + (1/Recall)) = 2PR / (P + R)    (6)
4.5.6 Specificity:
Specificity is a measure of how well a test can identify true negatives. Specificity is
also referred to as selectivity or true negative rate; it is the proportion of the true
negatives out of all the samples that do not have the condition (true negatives and
false positives).
It is expressed as: Specificity = TN / (TN + FP)    (7)
4.5.7 LR-:
LR- is defined as the likelihood ratio for negative results in the test.
It is expressed as: LR- = (1 - Sensitivity) / Specificity    (8)
4.5.8 LR+:
LR+ is defined as the likelihood ratio for positive results in the test.
It is expressed as: LR+ = Sensitivity / (1 - Specificity)    (9)
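The accuracy, precision, recall, F1, specificity, LR- and LR+ formulas above can be collected into one small helper, sketched here for a single binary confusion matrix (the example counts are illustrative):

```python
def binary_metrics(TP, FP, TN, FN):
    # Direct translation of the metric definitions in section 4.5
    accuracy    = (TP + TN) / (TP + FP + FN + TN)
    precision   = TP / (TP + FP)
    sensitivity = TP / (TP + FN)          # recall
    f1          = 2 * precision * sensitivity / (precision + sensitivity)
    specificity = TN / (TN + FP)
    lr_minus    = (1 - sensitivity) / specificity
    lr_plus     = sensitivity / (1 - specificity)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "f1": f1,
            "specificity": specificity, "LR-": lr_minus, "LR+": lr_plus}

# Illustrative counts: 40 true positives, 10 false positives,
# 35 true negatives, 5 false negatives
m = binary_metrics(TP=40, FP=10, TN=35, FN=5)
```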
[Fig 4.2 Comparative Analysis: bar chart comparing the evaluation metrics of KNN,
Naïve Bayes and Logistic Regression]
We also compared the remaining evaluation metrics for the three
algorithms, as shown in Fig 4.2, which plots the AUC, LR-, LR+, Odds, Youden and
Specificity scores.
4.6 Results
To demonstrate the results of our project, we take the held-out test data and evaluate it
with the three algorithms; the trained models are then ready to predict whether the disease
is present or not. The test accuracies were computed in Google Colab, our Python notebook.
Below we describe how the three algorithms were processed. First, the KNN algorithm was
trained on the training dataset and then tested on the remaining test data. Fig 4.3 shows
a screenshot of our notebook with the KNN process and the accuracy the model returns,
which is 80%.
Second, the Naïve Bayes algorithm was trained on the training dataset and then tested
on the remaining test data. Fig 4.4 shows a screenshot of our notebook with the Naïve
Bayes process and the accuracy the model returns, which is higher at 81%.
Fig 4.4 Naïve Bayes Test Accuracy
Third, the Logistic Regression algorithm was trained on the training dataset and then
tested on the remaining test data. Fig 4.5 shows a screenshot of our notebook with the
Logistic Regression process and the accuracy the model returns, which is 79%.
Of these three techniques, Naïve Bayes gave the highest accuracy, so this model is used
in the front end. The model is saved to a pickle file, which is loaded in the front end
and used to evaluate the user's input values. Finally, it displays a text message stating
whether or not the patient has Parkinson's disease.
Fig 4.8 User Interface to enter the another patient details
In Fig-4.6 and Fig-4.8, the screenshots show the data being entered in the user interface
and stored in a data frame. This new data also goes through all the data pre-processing
steps, converting it into the same format as the training dataset. The trained model
variable is then used to make a prediction on the new data, and we get the predicted
result, i.e., the output of our system; this can be seen in Fig-4.7 and Fig-4.9. Each time
we want to make a prediction, the whole process has to be repeated. Both class
predictions are provided: one stating "THE PATIENT HAS PARKINSON'S DISEASE" and
the other stating "THE PATIENT DOES NOT HAVE PARKINSON'S DISEASE". Many
people who are not familiar with programming or Python notebooks would find it difficult
to do all of this, so to avoid this problem we created a user interface from which anyone
can simply enter the details and get the report.
CHAPTER-5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
Parkinson's disease is the second most common neurodegenerative disease; it has no cure
so far, which makes early prediction important. In this project, we used three prediction
models, the machine learning techniques KNN, Naïve Bayes and Logistic Regression, to
predict Parkinson's disease. The dataset was trained with these models, and we compared
the models built using the different methods to identify the one that fits best.
The aim is to use various evaluation metrics, such as Accuracy, Precision, Recall,
Specificity, F1-score, LR+, LR- and Youden score, to assess how efficiently each model
predicts the disease. We used the speech dataset available on the Kaggle website, which
contains voice features of the patients; it consists of more than 700 features and 750
patient records. The models were built using the five best features identified by feature
selection.
From these results, Naïve Bayes stands out from the other two machine learning
algorithms with an accuracy of 81%. The system we designed can thus make predictions
of Parkinson's disease.
REFERENCES
[1] A. Ozcift, “SVM feature selection based rotation forest ensemble classifiers to improve
computer-aided diagnosis of Parkinson disease” Journal of medical systems, vol-36, no. 4,
pp. 2141-2147, 2012.
[2] Anila M Department of CS1, Dr G Pradeepini Department of CSE, “DIAGNOSIS OF
PARKINSON’S DISEASE USING ARTIFICIAL NEURAL NETWORK”, JCR, 7(19):
7260-7269, 2020.
[3] Arvind Kumar Tiwari, “Machine Learning based Approaches for Prediction of
Parkinson’s Disease” Machine Learning and Applications: An International Journal
(MLAU) vol. 3, June 2016.
[4] Carlo Ricciardi, et al, “Using gait analysis’ parameters to classify Parkinsonism: A data
mining approach” Computer Methods and Programs in Biomedicine vol. 180, Oct. 2019.
[5] Dr. Anupam Bhatia and Raunak Sulekh, “Predictive Model for Parkinson’s Disease
through Naive Bayes Classification” International Journal of Computer Science &
Communication vol-9, Dec. 2017, pp. 194- 202, Sept 2017 - March 2018.
[6] Dr. R. Geetha Ramani, G. Sivagami, Shomona Gracia Jacob, “Feature Relevance Analysis
and Classification of Parkinson’s Disease TeleMonitoring data Through Data Mining”
International Journal of Advanced Research in Computer Science and Software
Engineering, vol-2, Issue 3, March 2012.
[7] Dragana Miljkovic et al, “Machine Learning and Data Mining Methods for Managing
Parkinson’s Disease” LNAI 9605, pp. 209-220, 2016.
[8] Farhad Soleimanian Gharehchopogh, Peyman Mohammadi, “A Case Study of
Parkinson’s Disease Diagnosis Using Artificial Neural Networks” International Journal of
Computer Applications, Vol-73, No.19, July 2013.
[9] Heisters. D, “Parkinson’s: symptoms, treatments and research”. British Journal of
Nursing, 20(9), 548–554. doi:10.12968/bjon.2011.20.9.548, 2011.
2018, Article ID 9315285, 8 pages, 2018.
[12] Md. Redone Hassan, et al, “A Knowledge Base Data Mining based on Parkinson’s
Disease” International Conference on System Modelling & Advancement in Research
Trends, 2019.
[13] Mandal, Indrajit, and N. Sairam. “New machine-learning algorithms for prediction of
Parkinson's disease” International Journal of Systems Science 45.3: 647-666, 2014.
[14] Mohamad Alissa,” Parkinson’s Disease Diagnosis Using Deep Learning”, August
2018.
[15] Peyman Mohammadi, Abdolreza Hatamlou and Mohammed Msdaris “A Comparative
Study on Remote Tracking of Parkinson’s Disease Progression Using Data Mining
Methods” International Journal in Foundations of Computer Science and
Technology (IJFCST), vol-3, No.6, Nov 2013.
[16] R. P. Duncan, A. L. Leddy, J. T. Cavanaugh et al., “Detecting and predicting balance
decline in Parkinson disease: a prospective cohort study” Journal of Parkinson’s Disease,
vol-5, no. 1, pp. 131–139, 2015.
[17] Ramzi M. Sadek et al., “Parkinson’s Disease Prediction using Artificial Neural
Network” International Journal of Academic Health and Medical Research, vol-3, Issue 1,
January 2019.
[18] Satish Srinivasan, Michael Martin & Abhishek Tripathi, “ANN based Data Mining
Analysis of Parkinson’s Disease” International Journal of Computer Applications, vol-168,
June 2017.
[19] Shahid, A.H., Singh, M.P. A deep learning approach for prediction of Parkinson’s
disease progression, https://doi.org/10.1007/s13534-020-00156-7, Biomed. Eng. Lett. 10,
227–239, 2020.
[20] Shubham Bind, et al, “A survey of machine learning based approaches for Parkinson
disease prediction” International Journal of Computer Science and Information
Technologies vol-6, Issue 2, pp. 1648- 1655, 2015.
[21] Siva Sankara Reddy Donthi Reddy and Udaya Kumar Ramanadham “Prediction of
Parkinson’s Disease at Early Stage using Big Data Analytics”ISSN: 2249 – 8958, Volume-
9 Issue-4, April 2020
[22] Sriram, T. V., et al. “Intelligent Parkinson Disease Prediction Using Machine Learning
Algorithms” International Journal of Engineering and Innovative Technology, vol-3, Issue
3, September 2013.
[23] T. Swapna, Y. Sravani Devi, “Performance Analysis of Classification algorithms on
Parkinson’s Dataset with Voice Attributes”. International Journal of Applied Engineering
Research ISSN 0973-4562 Volume 14, Number 2 pp. 452-458, 2019.
[24] T. J. Wroge, Y. Özkanca, C. Demiroglu, D. Si, D. C. Atkins and R. H. Ghomi,
"Parkinson’s Disease Diagnosis Using Machine Learning and Voice," IEEE Signal
Processing in Medicine and Biology Symposium (SPMB), pp.1-7, doi:
10.1109/SPMB.2018.8615607, 2018.
International conference on Signal Processing, Communication, Power and Embedded System (SCOPES)-2016
Babita Majhi
Department of Computer Science and IT
G.G Vishwavidyalaya, Central University
Bilaspur, India 495009
Email: [email protected]
Abstract—Parkinson’s disease (PD) is one of the major public health problems in the world. It is a well-known fact that around one million people suffer from Parkinson’s disease in the United States whereas the number of people suffering from Parkinson’s disease worldwide is around 5 millions. Thus, it is important to predict Parkinson’s disease in early stages so that early plan for the necessary treatment can be made. People are mostly familiar with the motor symptoms of Parkinson’s disease, however an increasing amount of research is being done to predict the Parkinson’s disease from non-motor symptoms that precede the motor ones. If early and reliable prediction is possible then a patient can get a proper treatment at the right time. Non-motor symptoms considered are Rapid Eye Movement (REM) sleep Behaviour Disorder (RBD) and olfactory loss. Developing machine learning models that can help us in predicting the disease can play a vital role in early prediction. In this paper we extend a work which used the non-motor features such as RBD and olfactory loss. Along with this the extended work also uses important biomarkers. In this paper we try to model this classifier using different machine learning models that have not been used before. We developed automated diagnostic models using Multilayer Perceptron, BayesNet, Random Forest and Boosted Logistic Regression. It has been observed that Boosted Logistic Regression provides the best performance with an impressive accuracy of 97.159 % and the area under the ROC curve was 98.9%. Thus, it is concluded that this models can be used for early prediction of Parkinson’s disease.

Keywords—Improved Accuracy, Prediction of Parkinson’s Disease, Non Motor Features, Biomarkers, Machine Learning Techniques, Boosted Logistic Regression, BayesNet, Multilayer Perceptron

I. INTRODUCTION

Parkinson’s disease (PD) is a chronic, degenerative neurological disorder. The main cause of Parkinson’s disease is actually unknown. However, it has been researched that the combination of environmental and genetic factors play an important role in causing PD [1]. For general understanding the Parkinson’s disease is treated as disorder of the central nervous system which is the result of loss of cells from various parts of the brain. These cells also include substantia nigra cells that produce dopamine. Dopamine plays a vital role in the coordination of movement. It acts as a chemical messenger for transmitting signals within the brain. Due to the loss of these cells, patients suffer from movement disorder.

The symptoms of PD can be classified into two types i.e. non-motor and motor symptoms. Many people are aware of the motor symptoms as they can be visually perceived by human beings. These symptoms are also called as cardinal symptoms, these include resting tremor, slowness of movement (bradykinesia), postural instability (balance problems) and rigidity [2]. It is now established that there exists a time-span in which the non-motor symptoms can be observed. This symptoms are called as dopamine-non-responsive symptoms. These symptoms include cognitive impairment, sleep difficulties, loss of sense of smell, constipation, speech and swallowing problems, unexplained pains, drooling, constipation and low blood pressure when standing. It must be noted that none of these non-motor symptoms are decisive, however when these features are used along with other biomarkers from Cerebrospinal Fluid measurement (CSF) and dopamine transporter imaging, they may help us to predict the PD.

In this paper we extend works by Prashant et al [3]. This work takes into consideration the non-motor symptoms and the biomarkers such as cerebrospinal fluid measurements and dopamine transporter imaging. In this paper we follow a similar approach, however we try to use different machine learning algorithms that can help in improving the performance of model and also play a vital role in making in early prediction of PD which in turn will help us to initiate neuroprotective therapies at the right time.
The rest of the paper is organized as follows. Section 2 contains the related work. Section 3 contains the flowchart of the analysis carried out and describes about the PPMI database, explanation of different features extracted, statistical analysis of this features, classification and prediction/prognostic model design. Section 4 provides the results and discussion from the experiments carried out. And finally conclusion of the work is provided in Section 5.

II. RELATED RESEARCH WORK

Different researchers have used different features and data to predict Parkinson’s disease. Indira et al. [4] have used biomedical voice of human as the main feature. The authors have developed a model to automatically predict whether a person is suffering from PD by analysing the voice of the patients. They have used fuzzy c-means (FCM) clustering and pattern recognition methods on the dataset and have attained an accuracy of 68.04%, 75.34% sensitivity and 45.83% specificity. Amit et al. [5] have presented a unique approach of classifying PD patients on the basis of their postural instability and have used L2 norm metric in conjunction with support vector machine. In [6], the authors have applied University of Pennsylvania 40-item smell identification test (UPSIT-40) and 16-item identification test from Sniffins Sticks. This study was conducted on Brazilian population. The authors have applied logistic regression considering each of the above features separately. They observed that the Sniffin Sticks gave a specificity of 89.0 % and a sensitivity of 81.1 %. Similarly they found out that the UPSIT-40 specificity was 83.5% and sensitivity 82.1%. Prashant et al. [7] have used olfactory loss feature loss from 40-item UPSIT and sleep behaviour disorder from Rapid eye movement sleep Behaviour Disorder Screening Questionnaire (RBDSQ). Support Vector machine and classification tree methods have been employed to train their methods. They have reported an accuracy of 85.48% accuracy and 90.55% sensitivity. This work has been extended by the same authors in [3]. In this paper they added new features in the form of CSF measurements and SPECT imaging markers. They reported an accuracy of 96.40% and 97.03% sensitivity. This paper has motived us to further the study. In the present paper an attempt has been made to improve the accuracy by using advanced machine learning models. Some recent

A. Database

In this study the data from Parkinson’s Progression Markers Initiative (PPMI) database [8] was obtained. PPMI is an observational, multicentre study that collects clinical and imaging data and biologic samples from various cohorts that can be used by researchers to establish markers of disease progression in PD. PPMI has established a comprehensive, standardized, longitudinal PD data and biological sample repository that can play a vital role in the development of tools which assist in prediction of PD. To obtain the recent information, the official website of PPMI ( www.ppmi-info.org ) can be visited. This dataset is similar to the one used in [3]. We downloaded the database on 8th August 2016. On this date the data of 184 normal patients and 402 early PD subjects were collected. It is noted that PPMI has observations from each of the patients at different time intervals. Thus the data of each patient at different periods like screening or baseline, first visit, second visit and so on are available. In the present investigation the data at baseline observation are considered.

In [3], the authors have used features from University of Pennsylvania Smell Identification Test, RBD screening questionnaire, CSF Markers of Aβ1-42, α-syn, P-tau181, T-tau, T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau, and SPECT measurements of striatal binding ratio (SBR) data. In this study these features have been used because we felt that they are a good combination of non-motor features and biomarkers. The details of these features are given in section III B.

B. Feature Description

1) University of Pennsylvania Smell Identification Test (UPSIT): Olfactory dysfunction is an important marker of Parkinson’s disease [9]. It acts as sensitive and early marker for Parkinson’s disease. It is a fact that most of the people who suffer from PD have olfactory loss however it doesn‘t mean that all the people with olfactory loss are suffering from PD [10]. Olfactory dysfunction are in various forms for instance it may be impairment in odour detection or odour differentiation. A study by Posen et al [11] showed that about 10% of the subjects who were suffering from odour dysfunction were at the risk of PD. For quantifying this odour loss the data of University of Pennsylvania Smell Identification Test is used. This test is
machine learning algorithms have been chosen for prediction commercially available and is also one of the most reliable
and have made a comparative performance analysis of these tests [12]. The procedure of the test is as follows. A subject is
models based on accuracy, area under the ROC curve and other provided with 4 different 10 page booklets. Each of this pages
measures. has a different odour. A subject has to scratch the page and
smell it. For each of this pages, there exists a question with
III. M ATERIALS AND M ETHODS four options. Depending on the odour the subject selects one
of the options. This procedure is repeated for all the pages in
A flowchart of the proposed analysis is shown in Fig 1. all the booklets. Once the test is completed the UPSIT score
The data was first collected and the required non-motor and is calculated. The maximum score can be 40 when the subject
biomarker features are then extracted. Then different machine identifies each of the odours correctly. One main advantage of
learning algorithms are employed for the classification task. this is that the test takes only a few minutes. For the present
Finally, a comparative analysis is made based on the accuracy analysis the UPSIT score at baseline check-up from PPMI [8]
provided by different machine learning models. has been taken.
2) REM Sleep Behaviour Disorder Screening Questionnaire (RBDSQ): RBD is another non-motor symptom that plays an important role in the early prediction of Parkinson's disease. People suffering from RBD have disturbances in sleep, including vivid, aggressive or action-packed dreams. Similar to olfactory loss, studies have shown that disorder in sleep behaviour increases the risk of being affected with Parkinson's disease. For quantifying this non-motor symptom, the REM Sleep Behaviour Disorder Screening Questionnaire is used. The RBDSQ is a 10-item patient self-rating instrument [13]. The test contains ten short questions answered yes or no, where a yes is scored as 1 and a no as 0. The ten questions are grouped such that each group of questions provides observations about a particular behaviour. Some examples of the questions from [13] are "I sometimes have vivid dreams", "The dream contents mostly match my nocturnal behaviour" and "My sleep is frequently disturbed". As some subjects may have a bed partner, the partner can also take part in this test.

Each of the answers is recorded as either one or zero. In the present study, the feature for sleep disorder is obtained by summing up all the answers; this sum can be a maximum of 12 if we take the first nine questions. It is observed that a higher score here means a higher risk of PD, in contrast to the UPSIT score. The RBDSQ score is taken from PPMI [8].

3) Cerebrospinal Fluid Biomarkers: Biomarkers play a pivotal role in this analysis; without their aid, the prediction of PD is less accurate, and they are significant factors in increasing the accuracy of the model. Biomarkers need to be sensitive, reproducible and closely associated with the disease. Cerebrospinal fluid is a clear, colourless body fluid found in the brain. It has more physical contact with the brain than any other fluid [14]. Due to this close proximity, any protein or peptide related to brain-specific functionality or disease diffuses into the CSF. Hence, the CSF can act as an important source of biomarkers for brain-related diseases, in the present case Parkinson's disease.

The CSF samples are collected from PPMI. In PPMI, CSF samples are obtained for each of the subjects and certain measurements are made. These measurements include Aβ1-42 (amyloid beta 1-42), T-tau (total tau) and P-tau181 (tau phosphorylated at threonine 181) [15]. According to the PPMI Research Laboratory, these three are the important biomarkers that can be extracted from the CSF. Along with these, the concentration of α-syn was also collected from the PPMI database. Kang et al. have mentioned that ratios like T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau also play a significant role in the early detection of Parkinson's disease [16]. In the present investigation, the measurements of Aβ1-42, T-tau and P-tau181, as well as the ratios T-tau/Aβ1-42, P-tau181/Aβ1-42 and P-tau181/T-tau, are taken.

4) Neuroimaging markers: Single-photon emission computed tomography (SPECT) is a neuroimaging technique that uses gamma rays [17]. SPECT is a common routine for helping a doctor decide whether a subject is suffering from a neurodegenerative disease. According to [18], SPECT imaging can detect dopaminergic transporter loss during the early stages of PD. When a subject has an abnormal scan, the person has a higher probability of being affected with Parkinson's disease or another neurodegenerative disease; however, a normal scan suggests that the subject is suffering from some other type of disease [18].

DaTscan SPECT images obtained from PPMI imaging centres are used in this study; at PPMI, the striatal binding ratios were calculated from them. The DaTscan SPECT images are collected according to the PPMI imaging protocol. The raw images are first reconstructed so as to ensure consistency among the different imaging centres. Attenuation correction is then performed on these images, after which a Gaussian filter is applied, followed by normalization. Finally, the required region is extracted from the images and the striatal binding ratios for the left and right caudate and the left and right putamen are calculated [19]. In this paper, these four striatal binding values are used as neuroimaging biomarkers.

C. Prediction models for distinguishing early PD and healthy normal subjects

In this study, four different machine learning classifiers are chosen for the classification task; a brief description of each is provided in this section. WEKA [20] is used for classification with the Multilayer Perceptron, Bayesian Network, Random Forest, and Boosted Logistic Regression. The main motive is to find an algorithm that can improve the already reported accuracy, as well as to see how the various models perform. Firstly, the dataset is normalized using the Normalize filter in WEKA [20]. The dataset is then divided such that 70% is used for training and the remaining 30% for testing. While partitioning the dataset, the same class proportion is maintained in both the training and test data; for example, if the proportion of healthy people in the complete data is 40%, then the proportion of healthy people to PD subjects is maintained at 40% in both training and testing. This type of partitioning is known as stratified partitioning. The accuracy, recall, precision and F-measure for each of these algorithms are computed, and the ROC of each classifier is plotted. Finally, the performance measures of the different classifiers used in this paper and in [3] are compared.

1) Multilayer Perceptron: The multilayer perceptron is a feed-forward artificial neural network. Its basic principle is that it takes the input, maps it to a nonlinear space, and then tries to predict the corresponding outputs. An MLP architecture is viewed as multiple layers of nodes, with each layer fully connected to the next. Each node in the MLP is interpreted as a neuron with a non-linear activation function [21] [22]. The back-propagation algorithm, a supervised learning technique, is used for training the model. The number of hidden layers in the MLP has a significant impact on the
TABLE I: Performance Measures for various classifiers used in the study (Training / Testing)

Performance Measure    Multilayer Perceptron   BayesNet            Random Forest   Boosted Logistic Regression
Accuracy (%)           96.09 / 95.4545         96.5854 / 96.027    100 / 96.59     95.8537 / 97.1591
Recall                 0.961 / 0.955           0.966 / 0.960       1 / 0.966       0.959 / 0.972
Precision              0.962 / 0.955           0.967 / 0.965       1 / 0.970       0.959 / 0.974
F-Measure              0.961 / 0.955           0.966 / 0.961       1 / 0.967       0.959 / 0.972
AUC                    0.989 / 0.986           0.994 / 0.994       1 / 0.997       0.995 / 0.989
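The stratified 70/30 partitioning described in Section III-C can be sketched in pure Python. The class counts below (184 healthy, 402 early PD) follow the dataset description; the function itself is an illustrative sketch, not the WEKA implementation used in the paper:

```python
import random

def stratified_split(labels, test_frac=0.30, seed=42):
    """Shuffle indices per class, then give each class the same test fraction,
    so train and test keep the original class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# 184 healthy normal (0) and 402 early PD (1) subjects, as in the PPMI baseline data
labels = [0] * 184 + [1] * 402
train_idx, test_idx = stratified_split(labels)
```

Each class contributes 30% of its members to the test set, so the healthy-to-PD ratio is identical in both partitions.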
performance of the classifier.

2) Bayesian Network: The Bayesian network is one of the probabilistic graphical models used in machine learning. A Bayes net corresponds to a graphical model structure known as a directed acyclic graph (DAG). These graphical models are understood in the following manner [23]: the nodes in the graph represent random variables, and an edge between node x and node y denotes a probabilistic dependency between the random variables corresponding to the respective nodes. Hence, nodes that are not connected in the Bayesian network represent random variables that are independent of each other. Different computational and statistical methods are used to estimate the conditional dependencies. Bayes network learning uses various search algorithms and quality measures; in the present model, the K2 learning algorithm is used for searching.

3) Random Forest: Random forests are an ensemble learning method used for classification, regression and other tasks. A random forest contains many decision trees. For a given input, each of the decision trees classifies it as yes/no (in the case of binary classification) [24] [25]; once each tree has voted, the value holding the majority among them is taken as the output. The advantages are that this algorithm runs effectively on large inputs and also helps in estimating which of the features are important.

4) Boosted Logistic Regression: Logistic regression was developed by statistician David Cox in 1958 [26] [27]. A logistic model is used to predict a binary class using one or more features. The logit, the natural logarithm of an odds ratio, is the central mathematical concept behind logistic regression. Logistic regression is well suited when one wants to establish a relationship between a categorical outcome variable and one or more categorical or continuous predictor variables [28].

Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning. It belongs to the family of machine learning algorithms that convert weak learners to strong ones. AdaBoost is used here for boosting different classifiers.

IV. RESULTS AND DISCUSSION

Table 1 shows the performance of the various classifiers used in the study, and Fig 2 shows the corresponding ROC plots. In the Multilayer Perceptron, the back-propagation algorithm is used to train the model; the learning rate is set at 0.4 and the number of hidden nodes is chosen as 8 ((number of attributes + number of classes)/2 = 8). In BayesNet, various search algorithms and quality measures are used: a SimpleEstimator is chosen for estimating the conditional probability tables of the Bayes network once the structure has been learned, and the K2 algorithm, a hill-climbing search restricted by an order on the variables, is used for structure search. In boosted logistic regression, the AdaBoost M1 method is used to boost the logistic regression.

It is observed that all the classifiers performed reasonably well, with boosted logistic regression giving the best performance at 97.16% accuracy and 98.9% area under the ROC curve (AUC). Table 2 shows how these models performed in relation to the previous work [3]. It is found that the accuracy and area under the ROC curve are nearly the same among the different classifiers used. The present work and [3] have the advantage that the dataset used is very large compared to others. However, it is noted that the PPMI study includes subjects who are in the early stages of PD and healthy normal subjects; it does not include subjects who have premotor symptoms but are not yet diagnosed with PD due to the lack of motor symptoms.

V. CONCLUSION

The diagnosis of Parkinson's Disease is not direct, meaning that no single test, such as a blood test or an ECG, can determine whether a person is suffering from PD. Doctors go through the medical history of a patient, followed by a thorough neurological examination, looking for at least two cardinal symptoms before concluding that the subject is suffering from PD. The misdiagnosis rate of PD is significant because there is no definitive test. In such a case it is helpful to aid the doctor with a machine learning model. The prediction models are developed using the machine learning techniques of boosted logistic regression, classification trees, Bayes net and multilayer perceptron based on these significant features, and the observed performance is better, with boosted logistic regression producing superior results. These results encourage us to try other ensemble learning techniques. The present work employs machine learning algorithms which were not used in [3], and it provides a comparative analysis of various machine learning algorithms. In conclusion, these models can provide nuclear medicine experts with assistance that can aid them in better and more accurate decision making and clinical diagnosis. The proposed method is fully automated and provides improved performance, and hence can be recommended for real-life applications.
TABLE II: Comparative Analysis of Machine Learning models in the current work and previous work (Training / Testing)

Machine Learning Algorithm     Accuracy (%)       AUC (%)
Multilayer Perceptron          96.09 / 95.45      98.9 / 98.6
BayesNet                       96.5854 / 96.02    99.4 / 99.4
Random Forest                  100 / 96.59        100 / 99.7
Boosted Logistic Regression    95.8537 / 97.16    99.5 / 98.9
Boosted Trees                  100 / 95.08        100 / 98.23
Naive Bayes                    94.67 / 93.12      98.66 / 96.77
Support Vector Machine         97.14 / 96.40      99.27 / 98.88
Logistic Regression            96.50 / 95.63      99.20 / 98.66
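The measures reported in Tables I and II follow the standard confusion-matrix definitions. A minimal sketch (the counts below are hypothetical illustrations for a 176-sample test set, not the paper's actual confusion matrix):

```python
def performance_measures(tp, fp, fn, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)          # also called sensitivity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_measure

# Hypothetical counts for a 176-sample test set (30% of 586 subjects)
acc, rec, prec, f1 = performance_measures(tp=118, fp=3, fn=2, tn=53)
```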
Fig 2: (a) ROC for classification using Multilayer Perceptron (test data); (b) ROC for classification using BayesNet (test data).
Dr. Selvani Deepthi Kavila, Ph.D.2, Immidi Manikanta3, Cheemalapathi Deekshith4, Kalaga Venkata Mukesh5
1 Student, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
2 Associate Professor, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
3 Student, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
4 Student, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
5 Student, Department of CSE, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam
Abstract - Parkinson’s is considered one of the deadliest progressive nervous system diseases, and it affects movement. It is the second most common neurological disorder that causes disability and reduces lifespan, and it still has no cure. Nearly 90% of the people affected by this disease have speech disorders. In various data repositories, large datasets are available which are used to solve real-world applications. Machine learning techniques also help the medical field to detect diseases such as Parkinson’s, which has affected many people. In this paper, the authors propose a Parkinson’s prediction model for better classification of Parkinson’s, using features such as PPE (Pitch Period Entropy), DFA (Detrended Fluctuation Analysis), RPDE (Recurrence Period Density Entropy), etc. The authors use various machine learning techniques, namely KNN, Naïve Bayes, and Logistic Regression, to predict Parkinson’s based on input taken from the user, with the dataset serving as input to the algorithms. The dataset used in this paper is downloaded from the Kaggle website and contains the speech features of Parkinson’s patients. Based on these features, the authors determine which algorithm gives the highest accuracy. The accuracies obtained for the three algorithms are 80% for KNN, 79% for Logistic Regression, and 81% for Naïve Bayes; Naïve Bayes, having the highest accuracy, is used in the frontend to predict whether the patient has Parkinson’s or not.
Keywords - Health, Parkinson’s, PPE, DFA, RPDE, Logistic Regression, KNN, Naive Bayes, Prediction.
I. INTRODUCTION
A recent report of the World Health Organization shows a remarkable hike in the number and health burden of Parkinson’s disease patients, and it is estimated that China will have nearly half of the world’s Parkinson’s disease population by 2030. Classification techniques are broadly used in the medical field to classify data into different classes according to certain features. Parkinson’s disease is a neurological disorder that leads to shaking, shivering, stiffness, and difficulty with walking and balance. It is mainly caused by the breakdown of cells in the nervous system. Parkinson’s can cause both motor and non-motor symptoms. The motor symptoms include slowness of movement, rigidity, balance problems, and tremors. As the disease progresses, the affected people may have difficulty walking and talking. The non-motor symptoms include anxiety, breathing problems, depression, loss of smell, and changes in speech. If the mentioned symptoms are present in a person, the details are stored in the records. In this paper, the authors consider the speech features of the patient, and this data is used to predict whether the patient has Parkinson’s disease or not.
The biggest risk factor for this disease is age: it mostly affects people who are 60 and older, and the risk rises as the years go by. Both genders can have Parkinson’s disease, but it affects men more than women, at a ratio of about 2:1. Whites get it more often than other groups. People working on farms or in factories, and those more in contact with chemicals, have a higher chance of getting this disease. Parkinson’s disease may affect speech in different ways: people with this disease may find their voice becoming softer or breathy, and their speech may be slurred. Their tone can become monotone, and they may have difficulty finding the right words. Such a disease spreads over the body gradually, without early warning. To address this kind of problem, machine learning algorithms play an important role. These algorithms find solutions to problems by recognizing patterns in databases. Machine learning can analyse complex data and store medical records for further analysis. An automated machine can predict disease well once it is trained. For linear data, classification algorithms are used. Classification is the process of categorizing a given dataset into classes; it predicts class labels for new data based on the training dataset. There are many classification algorithms, including Logistic Regression, Decision Tree, Random Forest, Naïve Bayes, and KNN. Out of these algorithms, in this paper the authors use KNN, Logistic Regression, and Naïve Bayes to predict Parkinson’s disease using the speech dataset, report the accuracy of every algorithm, and predict Parkinson’s depending on the speech features.
In this paper the authors follow a similar approach, applying different machine learning algorithms that help in increasing the accuracy of the model, and extend the work with the Naïve Bayes algorithm, which gives the highest accuracy. It is used to detect whether Parkinson’s disease is present or not: the speech features of the person are entered, checked against the training set, and the output is predicted.
Voice features of patients are assumed to be about 90% helpful in identifying the presence of Parkinson’s disease. Different researchers have applied different features of the dataset to predict the disease, and many authors have used patients’ voices to analyse Parkinson’s. In general, there are two speech problems: hypophonia and dysarthria. Hypophonia is a very weak and soft tone of voice. Dysarthria is slow, at times hard-to-understand speech caused by damage to the central nervous system.
Dr. Anupam Bhatia and Raunak Sulekh proposed the paper “Predictive Model for Parkinson’s Disease through Naive Bayes Classification” [3]. This paper reports that a Naïve Bayes classifier was used to analyse performance on the dataset. The dataset used in that paper consists of recorded speech signals; the features are voice measures, and the aim is to predict PD, with 0 for healthy and 1 for a person having the disease. The authors used the RapidMiner tool to analyse the data. The Naïve Bayes classifier produced 98.5% accuracy. This model helps doctors and patients detect the disease and take preventive measures at the right time.
Carlo Ricciardi, et al. proposed the paper “Using gait analysis’ parameters to classify Parkinsonism: A data mining approach” [2]. In this paper, the authors compare performance using two algorithms: Random Forest and Gradient Boosted Trees. PD patients at different stages were taken into consideration and identified as typical or atypical based on gait analysis using a data mining approach. The comparative analysis showed that Random Forest obtained the highest accuracy, 86.4%. This model also helped clinicians to distinguish PD patients at an early stage.
Arvind Kumar Tiwari proposed the paper “Machine Learning-based Approaches for Prediction of Parkinson’s Disease” [1]. The dataset used in this paper consists of voice recordings of the patients. The author chose the most important features among all features to predict Parkinson’s disease using minimum redundancy maximum relevance feature selection, applied different machine learning algorithms, and compared them. Random Forest provided the highest accuracy, 90.3%.
Mehrbakhsh Nilashi, et al. proposed the paper “A hybrid intelligent system for the prediction of Parkinson’s Disease progression using Machine Learning techniques” [7]. The Unified Parkinson’s Disease Rating Scale (UPDRS) is most commonly used to assess Parkinsonism, and the authors describe how the relationship between speech signals and UPDRS is important for detecting Parkinson’s. An Incremental Support Vector Machine is used to predict Total-UPDRS and Motor-UPDRS. The authors conclude that a combination of ISVR, SOM, and NIPALS gives effective results in predicting the UPDRS.
M. Abdar and M. Zomorodi-Moghadam proposed the paper “Impact of Patients’ Gender on Parkinson’s disease using Classification Algorithms” [5]. The authors chose the UCI PD dataset for finding the accuracy of Parkinson’s prediction using SVM and Bayesian Network algorithms, selecting the ten most important features in the dataset to predict PD. With sex as the output variable and the other factors as input, the authors provide an approach for finding relationships between genders. The result obtained is that the SVM algorithm gives better performance than the Bayesian Network, with 90.98% accuracy.
Dragana Miljkovic, et al. proposed the paper “Machine Learning and Data Mining Methods for Managing Parkinson’s Disease” [4]. The authors concluded that, based on the medical tests taken by the patients, the predictor component was able to predict 15 different Parkinson’s symptoms separately. The machine learning and data mining techniques applied to the different symptoms give accuracies ranging between 57.1% and 77.4%, with tremor detection having the highest accuracy.
Md. Redone Hassan, et al. proposed the paper “A Knowledge Base Data Mining based on Parkinson’s disease” [6]. In this paper, different classification algorithms such as SVM, KNN, and Decision Tree are used to predict Parkinson’s disease. These algorithms are applied to the training dataset and provide different accuracies. The paper summarizes that the Decision Tree algorithm provides 78.2% precision, the best among the compared algorithms.
Satish Srinivasan, Michael Martin and Abhishek Tripathi proposed the paper “ANN based Data Mining Analysis of Parkinson’s Disease” [9]. This study found that different accuracies are obtained depending on the pre-processing steps applied to the dataset. The authors classified the Parkinson’s disease dataset using an ANN-based MLP classifier, which obtained the highest prediction accuracy when the dataset was pre-processed using discretization and resampling; with the train and test data split in the ratio 80:20, it achieved 100% classification accuracy and F1-score.
Ramzi M. Sadek, et al. proposed the paper “Parkinson’s Disease Prediction using Artificial Neural Network” [8]. The authors chose 195 samples in the dataset, divided into 170 training and 25 validation samples. The dataset was then imported into the Just Neural Network (JNN) environment, where the authors trained and validated an Artificial Neural Network model with the attributes contributing most to the model. The ANN model provided 100% accuracy.
This architecture diagram describes a high-level overview of the major system components and their important working relationships. It represents the flow of execution.
Speech Dataset → Pre-processing Data → Training Data / Test Data → Output

3.1 Speech dataset:
The dataset the authors collected is put in one place and prepared for use in the training phase. The dataset contains many voice features, namely PPE, RPDE, DFA, numPulses, numPeriodPulses, meanPeriodPulses, stdDevPeriodPulses, locPctJitter, etc., which are useful for predicting Parkinson’s disease.
3.2 Pre-processing data:
The main aim of this step is to understand the features of the data the authors have to work with, and to put the characteristics needed to predict the output into an understandable form.
The data collected may generally contain noise, missing values, and duplicate values, and so cannot be used directly by machine learning algorithms. Data pre-processing is the method of cleaning the data, removing duplicates, and filling the null values with the average of that attribute, thus making the final dataset suitable for a machine learning model and improving the accuracy and efficiency of the machine learning algorithms. The authors split the data in the ratio 70:30. If the model were tested on a completely different dataset, it would have difficulty carrying over the correlations it learned, and performance would decrease; therefore, the training and test sets are both drawn from the same dataset. Training of a machine learning model is required so that it can understand the various features and patterns.
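The cleaning steps above (dropping duplicates, filling nulls with the attribute average) can be sketched with pandas; the tiny frame and its values are illustrative stand-ins for the Kaggle speech dataset:

```python
import pandas as pd

# Hypothetical mini-frame showing the defects described above:
# one duplicate row and one missing value in a voice feature column.
df = pd.DataFrame({
    "PPE":  [0.20, 0.20, 0.35, None],
    "DFA":  [0.71, 0.71, 0.64, 0.69],
    "RPDE": [0.48, 0.48, 0.52, 0.50],
})
df = df.drop_duplicates()                   # remove duplicate records
df = df.fillna(df.mean(numeric_only=True))  # fill nulls with the column average
```

After cleaning, the frame has three unique rows and the missing PPE value is replaced by the mean of the remaining PPE entries.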
3.4.1 KNN:
The k-nearest neighbours algorithm is a supervised machine learning algorithm used here to analyse the accuracy of the system. It solves both classification and regression problems. Given the training data, the model stores all of it and uses it to classify new data points: a new input is put into the category most similar to the stored cases, so when test data is given to the model it can easily place each data point into a suitable category. As it is a non-parametric algorithm, it makes no assumptions about the underlying data distribution. KNN is also called a lazy learner algorithm, as it stores the data at the training phase and does not learn from it immediately; it performs the work at classification time. KNN is used in both statistical estimation and pattern recognition. A data point is classified by a majority vote of its neighbours, being assigned to the most common class among its k nearest neighbours in the training data. The value of k is a positive integer, typically small; if k is 1, the algorithm simply chooses the single nearest neighbour. In the KNN algorithm, performance depends on the value of k, which is user-defined. In this paper, the authors find the value automatically by using an optimization step that returns the value of k with the lowest error rate.
Algorithm steps:
3.4.1.3) To find the class variable, iterate over all training data points from start to end.
3.4.1.3.1) Calculate the Euclidean distance between the test data point and each data point in the training data. Euclidean distance is used here as the distance metric since it is the most popular choice; other metrics that can be used include the Manhattan and Minkowski distances.
3.4.1.3.4) Find the most frequent class among these rows.
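The steps above can be sketched as a minimal KNN classifier; the toy two-feature values below are illustrative, not taken from the actual dataset:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points,
    using Euclidean distance as the metric."""
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-feature voice data: label 1 = PD, 0 = healthy (illustrative values only)
train_X = [(0.20, 0.70), (0.25, 0.72), (0.60, 0.30), (0.65, 0.28)]
train_y = [1, 1, 0, 0]
pred = knn_predict(train_X, train_y, (0.22, 0.69), k=3)
```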
3.4.2 Naive Bayes:
The Naive Bayes classifier assumes that every attribute in the dataset is independent of all other attributes. For example, whether a patient has Parkinson’s or not depends on the speech features of the patient. It is based on Bayes’ theorem:

P(h|D) = P(D|h) * P(h) / P(D)    (1)

P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the evidence.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of data D given that hypothesis h is true. This is known as the likelihood.
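A worked instance of equation (1), using hypothetical probabilities (prior 0.2, likelihood 0.9, false-positive rate 0.25) rather than values estimated from the dataset:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem, equation (1): P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior / evidence

# Evidence by total probability: P(D) = P(D|h)P(h) + P(D|not h)P(not h)
evidence = 0.9 * 0.2 + 0.25 * 0.8
p_h_given_d = posterior(0.9, 0.2, evidence)
```

With these numbers the posterior is 0.18 / 0.38, i.e. observing the data raises the probability of the hypothesis from the 0.2 prior to roughly 0.47.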
3.4.3 Logistic Regression:
Logistic regression is also a supervised learning algorithm, used for classification problems in which the target variable is categorical; typically it may be either 0 or 1. The logistic regression algorithm works on a function called the sigmoid, an S-shaped curve, so the class variable results as 0 or 1, yes or no, true or false, correct or wrong, etc. The sigmoid (or logistic) function returns a value between 0 and 1: a value below 0.5 is treated as class 0, and a value above 0.5 as class 1. Thus, to build a model using logistic regression, the sigmoid function is required.
3.4.3.1) Binomial: The resultant variable can have only 2 classes either “0” or “1” which represent “true” or “false”, “yes”
or “no”, “correct” or “wrong”, etc.
3.4.3.2) Multinomial: Here, the resultant variable can have three or more possible categories with no natural order, that is, no quantitative measure, such as “class A”, “class B”, or “class C”.
3.4.3.3) Ordinal: In this case, the resultant variable deals with organized categories. For example, rating of teaching for the
faculty by students can be given as: “very bad”, “bad”, “average”, “good”, “very good” and “excellent”. Here, each category
can be given a score like 0, 1, 2, 3, 4 and 5.
f(x) = 1 / (1 + e^(-x)) (2)
The value of the logistic function must lie in the range 0 to 1 and cannot go beyond this limit, so the only possible curve formed is S-shaped; it is called the sigmoid function or the logistic function. In logistic regression, the threshold value, which lies between 0 and 1, plays an essential role: a resultant value greater than the threshold is mapped to 1, and a value lower than the threshold is mapped to 0.
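The sigmoid of equation (2) and the threshold rule can be written in a few lines of Python (a minimal sketch; the function names are illustrative):

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Threshold rule: class 1 above the threshold, class 0 below it."""
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0))      # → 0.5 (the midpoint of the S-curve)
print(classify(2.0))   # → 1
print(classify(-2.0))  # → 0
```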
3.5 Test data:
Once the Parkinson's disease prediction model has been trained on the speech dataset, the author tests the model with the remaining data. Since the data are split in a 70:30 ratio, 70 percent of the data is used for training and the other 30 percent is tested in this step, and the author checks the correctness and accuracy of the model on that test set. The test data are applied to all three algorithms, and the author then checks which model gives the higher accuracy. This is the final step of prediction. In this paper, various machine learning methods (KNN, Logistic Regression, and Naïve Bayes) are used to predict Parkinson's disease. The author chose only ten important speech features from the dataset: gender, PPE, DFA, RPDE, numPulses, numPeriodPulses, meanPeriodPulses, stdDevPeriodPulses, locPctJitter, and locAbsJitter. It is important to measure the performance of these algorithms; to evaluate the experiment, the author used evaluation metrics such as accuracy, confusion matrix, precision, recall, and F1-score.
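The 70:30 split and three-model comparison described above can be sketched with scikit-learn. A synthetic dataset stands in for the speech dataset here; with the real data, X would hold the ten listed features (gender, PPE, DFA, RPDE, and so on) and y the Parkinson's label. The model settings are assumptions, not the author's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the ten-feature speech dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 70 percent for training, 30 percent held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2f}")
```

Each model is fitted on the same training split and scored on the same held-out split, so the printed accuracies are directly comparable.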
Classification Accuracy- It is the ratio of the number of correct predictions to the total number of inputs in the dataset. It is expressed as:
Accuracy = (TP + TN) / (TP + FP + FN + TN) (3)
Confusion Matrix- It gives a matrix as output and summarises the total performance of the system.

                    Predicted Positive    Predicted Negative
Actual Positive            TP                    FN
Actual Negative            FP                    TN
Precision- It is the ratio of correct positive results to the total number of positive results predicted by the system. It is expressed as:
Precision(P) = TP / (TP + FP) (4)
Recall- It is the ratio of correct positive results to the number of all relevant samples. It is expressed as:
Recall(R) = TP / (TP + FN) (5)
F1 score- It is the harmonic mean of Precision and Recall and measures the test accuracy; its range is 0 to 1. It is expressed as:
F1 score = 2 / ((1/P) + (1/R)) = 2PR / (P + R) (6)
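Equations (3) to (6) can be checked directly on the confusion-matrix counts reported in the results (TP=161, FN=18, FP=27, TN=21):

```python
# Confusion-matrix counts taken from the results section
tp, fn, fp, tn = 161, 18, 27, 21

accuracy = (tp + tn) / (tp + fp + fn + tn)          # equation (3)
precision = tp / (tp + fp)                          # equation (4)
recall = tp / (tp + fn)                             # equation (5)
f1 = 2 * precision * recall / (precision + recall)  # equation (6)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# → accuracy=0.80 precision=0.86 recall=0.90 f1=0.88
```

The computed accuracy of 0.80 matches the 80% reported for KNN in the comparison table.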
After applying the various machine learning algorithms to the speech dataset, the results obtained for each individual algorithm are:

                    Predicted Parkinson’s    Predicted Non-Parkinson’s
Parkinson’s               161                        18
Non-Parkinson’s            27                        21

KNN:                 TP=161, FN=18, FP=27, TN=21
Naïve Bayes:         FP=23, TN=25
Logistic Regression: FP=41, TN=7
Algorithms              Accuracy
KNN                     80%
Naïve Bayes             81%
Logistic Regression     79%
Comparative Analysis
[Figure: bar chart comparing the evaluation metrics of KNN, Naïve Bayes, and Logistic Regression; the plotted values range from 0.79 to 0.97.]
After applying the various machine learning algorithms to the speech dataset, the models are compared using Accuracy, Confusion Matrix, Precision, Recall, and F1-score. Naïve Bayes gives the highest accuracy of 81%.
V. CONCLUSION
Parkinson’s is the second most common neurodegenerative disease and has no cure. It results in difficulty with body movements, anxiety, breathing problems, loss of smell, depression, and speech impairment. In this paper, three different machine learning algorithms, KNN, Naïve Bayes, and Logistic Regression, were applied to the dataset and their performance measured. The author chose the voice features of patients: the dataset contains more than 700 features, from which the ten important features useful for evaluating the system were finally taken. The author compared the accuracies of all three machine learning methods, and based on this comparison one prediction model is generated. Hence, the aim is to use various evaluation metrics (confusion matrix, accuracy, precision, recall, and F1-score) to predict the disease efficiently. Comparing all three, Naïve Bayes gives the highest accuracy of 81%.
REFERENCES
[1] Arvind Kumar Tiwari, “Machine Learning based Approaches for Prediction of Parkinson’s Disease”, Machine Learning
and Applications: An International Journal (MLAU) vol. 3, June 2016.
[2] Carlo Ricciardi, et al, “Using gait analysis’ parameters to classify Parkinsonism: A data mining approach” Computer
Methods and Programs in Biomedicine vol. 180, Oct. 2019.
[3] Dr. Anupam Bhatia and Raunak Sulekh, “Predictive Model for Parkinson’s Disease through Naive Bayes Classification”
International Journal of Computer Science & Communication vol. 9, Dec. 2017, pp. 194- 202, Sept 2017 - March 2018.
[4] Dragana Miljkovic et al, “Machine Learning and Data Mining Methods for Managing Parkinson’s Disease” LNAI 9605,
pp 209-220, 2016.
[5] M. Abdar and M. Zomorodi-Moghadam, “Impact of Patients’ Gender on Parkinson’s disease using Classification
Algorithms” Journal of AI and Data Mining, vol. 6, 2018.
[6] M. A. E. Van Stiphout, J. Marinus, J. J. Van Hilten, F. Lobbezoo, and C. De Baat, “Oral health of Parkinson’s disease
patients: a case-control study,” Parkinson’s Disease, vol. 2018, Article ID 9315285, 8 pages, 2018.
[7] Md. Redone Hassan et al, “A Knowledge Base Data Mining based on Parkinson’s Disease” International Conference
on System Modelling & Advancement in Research Trends, 2019.
[8] Mehrbakhsh Nilashi et al, “A hybrid intelligent system for the prediction of Parkinson’s Disease progression using
Machine Learning techniques” Biocybernetics and Biomedical Engineering 2017.
[9] R. P. Duncan, A. L. Leddy, J. T. Cavanaugh et al., “Detecting and predicting balance decline in Parkinson disease: a
prospective cohort study,” Journal of Parkinson’s Disease, vol. 5, no. 1, pp. 131–139, 2015.
[10] Ramzi M. Sadek et al., “Parkinson’s Disease Prediction using Artificial Neural Network” International Journal of
Academic Health and Medical Research, vol. 3, Issue 1, January 2019.
[11] Satish Srinivasan, Michael Martin & Abhishek Tripathi, “ANN based Data Mining Analysis of Parkinson’s Disease”
International Journal of Computer Applications, vol. 168, June 2017.