Prediction of Diseases Using Random Forest
Abstract: People today face various diseases because of environmental conditions and living habits, and
predicting these diseases at an early stage has become difficult. Accurately interpreting symptoms is hard
for the doctor, so correct prediction of the disease is the most difficult task. Data mining plays an
important role in solving this problem. With the growth of data in the medical and healthcare fields,
accurate analysis of medical data benefits early patient care. Using disease data, data mining finds hidden
patterns in the large volume of medical data. We propose a general disease prediction system based on the
symptoms of the patient. For disease prediction, we use the Random Forest machine learning algorithm, which
requires a dataset of diseases and their symptoms. The living habits of the person and checkup information
are also considered for accurate prediction. General disease prediction using the Random Forest algorithm
is accurate and efficient. Experimental results also showed that the time and memory requirements are
higher. After predicting the disease, the system reports the associated risk, that is, whether the risk of
the general disease is lower or higher.
1. INTRODUCTION
Artificial Intelligence has made computers more intelligent and can enable them to
think. AI research treats machine learning as a subfield in numerous research works.
Many analysts hold that intelligence cannot be created without learning. There are numerous
kinds of machine learning techniques, such as Unsupervised, Semi-Supervised, Supervised,
Reinforcement, Evolutionary Learning and Deep Learning [1]. These techniques are used
to classify huge volumes of data very quickly. We therefore use the Random Forest machine
learning algorithm for fast classification of big data and accurate prediction of disease.
Medical data is increasing day by day, so using it to predict the correct disease is a
crucial task, but processing big data is difficult in general. Data mining therefore plays a
very important role, and machine learning makes classification of large datasets much easier [6].
It is critical to arrive at an accurate diagnosis of patients through clinical examination and
evaluation. Computer-based decision support systems may play a vital role in effective
diagnosis. The health care field generates enormous amounts of information about clinical
evaluation, patient reports, cures, follow-up appointments, medication and so forth, which
is complex to organize properly [4]. The quality of data organization has suffered because
of improper management of the information [9]. The growing volume of data calls for a
proper way to extract and process information effectively and efficiently. One of the many
machine learning applications is building a classifier that can separate the data based on
their characteristics, partitioning the data set into two or more classes. Such classifiers
are used for medical data analysis and disease prediction.
Today machine learning is present everywhere, so one may use it many times a day without
knowing it. Random Forest uses both the structured and the unstructured data of a hospital
for classification. Some other machine learning algorithms work only on structured data and
require more computation time; lazy learners, for example, store the entire data as a
training dataset and use complex methods for calculation.
2. LITERATURE SURVEY
M. Chen et al. used machine learning algorithms for effective prediction of chronic disease
outbreak in disease-frequent communities [1]. They evaluated the modified prediction models
on real-life hospital data collected from central China in 2013-2015. To overcome the
difficulty of incomplete data, they used a latent factor model to reconstruct the missing
data. They also experimented on a regional chronic disease, cerebral infarction. The
researchers proposed a new convolutional neural network (CNN)-based multimodal disease risk
prediction algorithm using structured and unstructured data from the hospital [1]. To the
best of their knowledge, none of the existing work had focused on both data types in the area
of medical big data analytics. Compared with several typical prediction algorithms, the
prediction accuracy of their proposed algorithm reached 94.8%, with a convergence speed
faster than that of the CNN-based unimodal disease risk prediction algorithm.
B. Qian et al. proposed a relative similarity-based method for interactive patient risk
prediction [2]. Their relative queries take the form of “Is patient A or patient B
more similar to patient C?”, which medical experts can answer with more
confidence. These questions poll relative information as opposed to absolute information,
and in some cases can even be answered by non-experts. They explored their method on
both benchmark and real clinic datasets and made several interesting discoveries,
including that querying relative similarities is effective in patient risk prediction and
can sometimes yield better prediction accuracy than asking absolute questions [2].
M. Chen et al. proposed a Wearable 2.0 healthcare system to improve QoE and QoS of
the next generation healthcare system [3]. In that proposed system, washable smart
clothing, which consists of sensors, electrodes, and wires, is the critical component to
collect users’ physiological data and receive the analysis results of users’ health and
emotional status provided by cloud-based machine intelligence [3].
Y. Zhang et al. proposed a cyber-physical system for patient-centric healthcare
applications and services, called Health-CPS, built on cloud and big data analytics
technologies [4]. This system consists of a data collection layer with a unified standard, a
data management layer for distributed storage and parallel computing, and a data-oriented
service layer [4]. The results of this study showed that the technologies of cloud and big
data can be used to enhance the performance of the healthcare system so that humans can
then enjoy various smart healthcare applications and services.
L. Qiu, K. Gai, and M. Qiu focused on the problem of data sharing obstacles in cloud
computing and proposed an approach that uses dynamic programming to produce optimal
solutions to data sharing mechanisms [5]. Their approach, called the Optimal
Telehealth Data Sharing Model (OTDSM), considers transmission probabilities,
maximizing network capacities, and timing constraints. Their experimental results
demonstrated the flexibility and adoptability of the proposed method [5].
Kunjir et al. proposed an efficient multiclass Naïve Bayes algorithm for predicting
a particular disease, trained on a set of data before deployment [8]. Wrong
clinical decisions by medical practitioners can cause harm or even the loss of a
patient's life, which no hospital can afford [8]. To achieve precise
and cost-effective treatment, technology-based data mining systems can be built to
support sound decisions [8]. The main aim of their research was to build a basic decision
support system that can discover and extract previously unseen patterns, relations and
concepts related to multiple diseases from historical database records of those
diseases. Their proposed system can answer difficult queries for detecting a
particular disease and can also assist medical practitioners in making smart clinical
decisions, which traditional decision support systems were not able to do [8]. Decisions
taken by medical practitioners with the help of technology can result in effective and
low-cost treatments. There is a shortage of technology, analysis systems and methods
to discover connections, concepts and patterns in medical data. Data mining is the
study of extracting previously undiscovered patterns from a selected set of
data. They compared two data mining methods, Naïve Bayes and J48, by
testing their accuracy and performance on the training medical datasets. The medical
datasets are visualized using different visualization techniques such as 2D/3D graphs, pie
charts and other methods.
3. PROPOSED SYSTEM
In this paper, we propose a general disease prediction system based on the symptoms of the
patient. For disease prediction, we use the Random Forest machine learning algorithm, which
requires a dataset of diseases and their symptoms. The living habits of the person and
checkup information are also considered for accurate prediction. Initially, we take a disease
dataset from the UCI machine learning website, in the form of a list of diseases with their
symptoms. Preprocessing is then performed on that dataset to clean it, that is, to remove
commas, punctuation and white spaces. The cleaned data is used as the training dataset. After
that, features are extracted and selected. Finally, we classify the data using a
classification technique, the Random Forest algorithm.
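The cleaning and feature extraction steps described above can be sketched as follows. This is only a minimal illustration, not the authors' code: it assumes each raw record is a disease name paired with a comma-separated symptom string, and the record layout, the sample strings and the helper names clean_symptoms and encode are all hypothetical.

```python
import string

def clean_symptoms(raw):
    """Split a raw symptom string on commas, then drop punctuation and extra white space."""
    strip_punct = str.maketrans("", "", string.punctuation)
    cleaned = []
    for part in raw.split(","):
        token = " ".join(part.translate(strip_punct).lower().split())
        if token:  # skip empty fields left by doubled commas
            cleaned.append(token)
    return cleaned

def encode(symptoms, vocabulary):
    """Turn a symptom list into a binary feature vector over a fixed vocabulary."""
    present = set(symptoms)
    return [1 if name in present else 0 for name in vocabulary]

# Hypothetical raw records: (disease, symptom string).
records = [
    ("fever", " Head ache,, sore eyes!, weakness. "),
    ("cold", "cough ,weakness;"),
]

# The vocabulary is every distinct cleaned symptom, in sorted order.
vocabulary = sorted({s for _, raw in records for s in clean_symptoms(raw)})
features = [encode(clean_symptoms(raw), vocabulary) for _, raw in records]
labels = [disease for disease, _ in records]
```

Each row of features then has one 0/1 entry per known symptom, which is a form a classifier can be trained on.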
3.1 Data set: The general disease prediction data set is available from the UCI Machine
Learning Repository and on Kaggle. Owing to the progress of big data in healthcare
communities, accurate study of medical data benefits early disease recognition, patient
care and community services.
In this paper, we used Random Forest on structured and unstructured data from the hospital;
it is used to partition the data.
For example, patient A has symptoms such as headache, sore eyes and weakness, so
he is classified as suffering from fever. Patient B has symptoms such as cough and
weakness, so he is classified as suffering from cold.
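The following is a simplified, from-scratch sketch of how a random forest could classify the two example patients. It is not the authors' implementation. It assumes binary symptom features in the order head ache, sore eyes, weakness, cough; apart from the patterns of patients A and B, the training rows are invented for illustration, and all function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def build_tree(rows, labels, rng, n_features):
    """Grow one tree on binary features; a leaf is a label, a node is a tuple."""
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    # Each split considers only a random subset of the features.
    for f in rng.sample(range(len(rows[0])), n_features):
        absent = [i for i, row in enumerate(rows) if row[f] == 0]
        present = [i for i, row in enumerate(rows) if row[f] == 1]
        if not absent or not present:
            continue  # this feature does not split the sample
        score = sum(len(side) * gini([labels[i] for i in side])
                    for side in (absent, present)) / len(rows)
        if best is None or score < best[0]:
            best = (score, f, absent, present)
    if best is None:  # no usable split: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    _, f, absent, present = best
    children = [build_tree([rows[i] for i in side], [labels[i] for i in side],
                           rng, n_features) for side in (absent, present)]
    return (f, children[0], children[1])

def predict_tree(node, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(node, tuple):
        f, if_absent, if_present = node
        node = if_present if row[f] == 1 else if_absent
    return node

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_features))
    return forest

def predict_forest(forest, row):
    """Return the class that receives the most votes across the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Feature order (assumed): head ache, sore eyes, weakness, cough.
rows = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0],
        [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1]]
labels = ["fever", "fever", "fever", "cold", "cold", "cold"]

forest = train_forest(rows, labels)
patient_a = predict_forest(forest, [1, 1, 1, 0])  # head ache, sore eyes, weakness
patient_b = predict_forest(forest, [0, 0, 1, 1])  # cough, weakness
```

In practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used; the sketch only makes the bootstrap sampling, random feature selection and majority vote visible.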
4. EXPERIMENTAL RESULTS
4.1 Language and Tool:
We used Python for our project because it is easy to understand and requires less code, and we
executed our code in PyCharm. PyCharm makes it easier for developers to implement both local and
global changes quickly and efficiently. Developers can also take advantage of the
refactoring options provided by the IDE while writing plain Python code and working
with Python frameworks.
4.1.1 Django
This section describes the step-by-step process of creating a Django application. To create a
Django project, we used the following command, where projectname is the name of the Django
application.
$ django-admin startproject projectname
Django Project Example
Here, we create a project named djangpapp within the current directory.
$ django-admin startproject djangpapp
Locate into the Project
Next, move into the project by changing the directory with the following command.
$ cd djangpapp
To see all the files and subfolders of the Django project, we can use the tree command to view
the tree structure of the application. This is a utility command; if it is not present, it can be
installed with the apt-get install tree command. A Django project contains the following
packages and files. The outer directory is simply a container for the application, and we can
rename it.
PyCharm is an integrated development environment (IDE) used for programming, primarily
in Python. It is developed by JetBrains. It provides features such as code
analysis, a graphical debugger, an integrated unit tester and integration with version control
systems, and it supports web development with Django and data science with Anaconda.
PyCharm is available for Windows, macOS and Linux. The Community Edition
is released under the Apache License, and a Professional Edition with extra
features is released under a proprietary license.
The functions and operations that PyCharm provides include:
o Coding assistance: analysis, code completion, syntax and error highlighting,
linter integration, and quick fixes
o Python refactoring: rename, extract method, introduce variable, introduce
constant, pull up, push down and many more.
o Supports web frameworks: Django, web2py and Flask.
o Integrated Python debugger
o Integrated unit testing, with line-by-line code coverage.
o Google App Engine Python development.
Login page: This is the login page, where the Admin, Doctor and Receptionist can log in to their
accounts.
The Admin can log in to his page and view the doctors and receptionists. The Admin can also add and
delete doctors and receptionists, as shown in Fig1.
The Doctor can log in to his account and view the patients, as shown in Fig2.
Table 2 shows that, among the different classifiers, Random Forest achieved the
highest accuracy for predicting diseases.
and instance. The distance between the query record and each training instance is estimated,
and the nearest neighbours, those at the minimum distance, are determined in the
subsequent step. The training data is labelled for all categories. The majority class among
the nearest neighbours gives the predicted value for the query record.
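The nearest-neighbour steps above (compute distances, keep the closest training records, take the majority class) can be sketched as follows. This is a minimal illustration assuming Euclidean distance and k = 3; the function name knn_predict and the toy symptom vectors are hypothetical, not taken from the paper's experiments.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the class of a query record by majority vote of its k nearest neighbours."""
    # Step 1: distance from the query record to every training instance.
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_rows, train_labels)]
    # Step 2: keep the k instances at minimum distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the majority class among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Feature order (assumed): head ache, sore eyes, weakness, cough.
train_rows = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0],
              [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1]]
train_labels = ["fever", "fever", "fever", "cold", "cold", "cold"]

prediction_1 = knn_predict(train_rows, train_labels, [1, 0, 1, 0])
prediction_2 = knn_predict(train_rows, train_labels, [0, 0, 1, 1])
```

Because every prediction scans the whole stored training set, the cost grows with the amount of data kept, which is why such methods are called lazy learners.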
4.2.2.4 Naïve Bayes (NB):
NB is popular and is well suited when the input data is large and a short computation time is needed.
The probability-based calculation is done by applying Bayes' formula.
\[ p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)} \]
where $p(h)$ is the prior probability that hypothesis $h$ is true, $p(D)$ is the prior
probability of the training data $D$, $p(h \mid D)$ is the probability of $h$ given $D$, and
$p(D \mid h)$ is the probability of $D$ given $h$.
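A small worked example may make Bayes' formula concrete. The numbers below are hypothetical and chosen only for illustration: h is the hypothesis "the patient has a cold" and D is the observation "the patient has a cough".

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes' formula: probability of h given D."""
    return likelihood_d_given_h * prior_h / evidence_d

# Hypothetical probabilities, not measured values.
p_h = 0.2              # prior probability of a cold
p_d_given_h = 0.9      # probability of a cough given a cold
p_d_given_not_h = 0.1  # probability of a cough without a cold

# Total probability of observing a cough, over both cases.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Probability of a cold given that a cough is observed.
p_h_given_d = posterior(p_h, p_d_given_h, p_d)
```

With these assumed values the evidence term is 0.26 and the posterior is 0.18 / 0.26, about 0.69, so observing the cough raises the probability of a cold from 0.2 to roughly 0.69.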
REFERENCES