Prediction of Diseases Using Random Forest

Zeichen Journal ISSN No: 0932-4747
PREDICTION OF DISEASES USING RANDOM FOREST CLASSIFICATION ALGORITHM
R. Delshi Howsalya Devi1, P. Sreevalli2, Keerthana3, Prathyusha4, M. Asia5
1 Associate Professor, Department of CSE, Bharat Institute of Engineering & Technology, Hyderabad.
2,3,4,5 B.Tech Students, Department of CSE, Bharat Institute of Engineering & Technology, Hyderabad.
Abstract: People nowadays face various diseases due to environmental conditions and living habits, and it
has become difficult to predict these diseases at an early stage. Accurate prediction from symptoms is
difficult for the doctor, so correct prediction of the disease becomes the most difficult task. Data mining
plays an important role in solving this problem. Owing to the growth of data in the medical and healthcare
field, accurate analysis of medical data benefits early patient care. With the help of disease data, data
mining finds hidden pattern information in the huge amount of medical data. We propose general disease
prediction based on the symptoms of the patient. For accurate prediction of disease, we use the Random
Forest machine learning algorithm, which requires a disease symptoms dataset. In this general disease
prediction, the living habits of the person and checkup information are considered for accurate prediction.
General disease prediction using the Random Forest algorithm is accurate and efficient. Experimental
results showed that the time and memory requirements are also higher. After general disease prediction,
the system is able to give the risk associated with the general disease, that is, whether the risk is lower
or higher.
Keywords: Data Mining, Classification Algorithms, Random Forest.
1. INTRODUCTION
Artificial Intelligence has made computers more intelligent and can enable them to
think. AI research treats machine learning as a subfield in numerous research works.
Many analysts feel that intelligence cannot be created without learning. There are numerous
kinds of machine learning techniques, such as Unsupervised, Semi-Supervised, Supervised,
Reinforcement, Evolutionary Learning and Deep Learning [1]. These learning techniques are
used to classify huge data very quickly. We therefore use the Random Forest machine learning
algorithm for fast classification of big data and accurate prediction of disease. Because
medical data is increasing day by day, using it to predict the correct disease is a crucial
task, but processing big data is challenging in general; data mining therefore plays a very
important role, and classification of large datasets becomes easy using machine learning [6].
Accurate diagnosis of patients through clinical examination and evaluation is critical.
Computer-based decision support systems may play a vital role in effective diagnosis.
The health care field generates enormous amounts of information about clinical
evaluations, patient reports, cures, follow-ups, medicines and so forth, which is
difficult to organize properly [4]. The quality of data organization has been
affected by improper management of the information [9]. The increase in the amount
of data calls for a proper way to extract and process information effectively and
efficiently. One of the many machine learning applications is to construct a
classifier that can separate data based on their characteristics, partitioning the
data set into two or more classes. Such classifiers are used for medical data
analysis and disease prediction.
Today machine learning is present everywhere, so one may use it many times a day
without knowing it. Random Forest is applied to both the structured and the
unstructured data of a hospital for classification. Some other machine learning
algorithms work only on structured data and require high computation time; they
are also lazy learners, because they store the entire data as a training dataset
and use complex methods for calculation.
Volume 6, Issue 5, 2020 Page No:19
2. LITERATURE SURVEY
M. Chen et al. used machine learning algorithms for effective prediction of chronic disease
outbreak in disease-frequent communities [1]. They evaluated the modified prediction models
on real-life hospital data collected from central China in 2013-2015. To overcome the
difficulty of incomplete data, they used a latent factor model to reconstruct the missing
data. They also experimented on a regional chronic disease, cerebral infarction. The
researchers proposed a new convolutional neural network (CNN)-based multimodal disease risk
prediction algorithm using structured and unstructured data from hospitals [1]. To the best
of their knowledge, none of the existing work focused on both data types in the area of
medical big data analytics. Compared with several typical prediction algorithms, the
prediction accuracy of their proposed algorithm reaches 94.8%, with a convergence speed
faster than that of the CNN-based unimodal disease risk prediction algorithm.
B. Qian et al. proposed a relative similarity-based method for interactive patient risk
prediction [2]. Their proposed relative queries take the form of “Is patient A or patient B
more similar to patient C?”, which can be answered by medical experts with more
confidence. These questions poll relative information as opposed to absolute information,
and can even be answered by non-experts in some cases. They explored their method on
both benchmark and real clinic datasets, and made several interesting discoveries,
including that querying relative similarities is effective in patient risk prediction and
can sometimes even yield better prediction accuracy than asking absolute questions [2].
M. Chen et al. proposed a Wearable 2.0 healthcare system to improve the QoE and QoS of
the next generation healthcare system [3]. In the proposed system, washable smart
clothing, which consists of sensors, electrodes, and wires, is the critical component that
collects users’ physiological data and receives the analysis results of users’ health and
emotional status provided by cloud-based machine intelligence [3].
Y. Zhang et al. proposed a cyber-physical system for patient-centric healthcare
applications and services, called Health-CPS, built on cloud and big data analytics
technologies [4]. This system consists of a data collection layer with a unified standard, a
data management layer for distributed storage and parallel computing, and a data-oriented
service layer [4]. The results of this study showed that the technologies of cloud and big
data can be used to enhance the performance of the healthcare system so that humans can
then enjoy various smart healthcare applications and services.
L. Qiu, K. Gai, and M. Qiu focused on the problem of data sharing obstacles in cloud
computing and proposed an approach that uses dynamic programming to produce optimal
solutions to data sharing mechanisms [5]. Their proposed approach is called the Optimal
Telehealth Data Sharing Model (OTDSM), which considers transmission probabilities,
maximizing network capacities, and timing constraints. Their experimental results
proved the flexibility and adoptability of the proposed method [5].
Kunjir et al. proposed an efficient multiclass Naïve Bayes algorithm for prediction
of a particular disease, trained on a set of data before implementation. Wrong
clinical decisions taken by medical practitioners can cause harm or result in serious
loss of life of a patient, which is hard for any hospital to afford [8]. To acquire precise
and cost-effective treatment, technology-based data mining systems can be constructed to
make worthy decisions [8]. The main aim of their research was to build a basic decision
support system that can determine and extract previously unseen patterns, relations and
concepts related to multiple diseases from historical database records of the specified
diseases. Their proposed system can solve difficult queries for detecting a
particular disease and can also assist medical practitioners in making smart clinical
decisions, which traditional decision support systems were not able to do [8]. The decisions
taken by medical practitioners with the help of technology can result in effective and
low-cost treatments. There is an insufficiency of technology, analysis systems and methods
to discover connections, concepts and patterns in medical data. Data mining is an
engineering study of extracting previously undiscovered patterns from a selected set of
data. They compared two data mining methods, Naive Bayes and J48, testing their
accuracy and performance on the training medical datasets. The medical
datasets are visualized using different visualization techniques such as 2D/3D graphs, pie
charts and other methods.
3. PROPOSED SYSTEM
In this paper we propose a general disease prediction based on the symptoms of the patient.
For accurate prediction of disease, we use the Random Forest machine learning algorithm,
which requires a disease symptoms dataset. In this general disease prediction, the living
habits of the person and checkup information are considered for accurate prediction.
Initially we take a disease dataset from the UCI machine learning website, in the form of
a disease list with its symptoms. Preprocessing is then performed on that dataset to clean
it, that is, to remove commas, punctuation and white spaces, and the result is used as the
training dataset. After that, features are extracted and selected. Then we classify the
data using a classification technique, namely the Random Forest algorithm.
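The cleaning step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name clean_symptoms and the sample input string are hypothetical, and the sketch assumes that each record lists its symptoms as comma-separated text.

```python
import string

def clean_symptoms(raw):
    """Split a comma-separated symptom string and normalise each entry."""
    cleaned = []
    for part in raw.split(","):
        # Drop punctuation characters, then collapse runs of white space.
        no_punct = part.translate(str.maketrans("", "", string.punctuation))
        token = " ".join(no_punct.split()).lower()
        if token:  # skip entries that are empty after cleaning
            cleaned.append(token)
    return cleaned

# Hypothetical raw record, for illustration only.
print(clean_symptoms("  High fever,, cough! ,  sore   eyes. , "))
```

Empty entries produced by repeated commas are dropped, so the output contains only usable symptom names.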
3.1 Data set: The general disease prediction data set is available at the UCI Machine
Learning Repository and on Kaggle. Due to big data progress in healthcare communities,
accurate study of medical data benefits early disease recognition, patient care and
community services.
In this paper, we used Random Forest on structured and unstructured data from hospitals;
the algorithm is used to partition the data.
3.1.1 Dataset Description:

There are different types of attributes which are useful in the exact prediction of diseases.
Attributes in dataset:
High fever, cough, fast heart rate, vomiting, weight loss, dehydration, knee pain, neck
pain, skin rashes, cold, elbow disjoint, weakness, sore eyes, head ache.
For example, the symptoms of a person suffering from fever are shown in Table 1.
Table 1. Patient Fever Symptoms.
Patient id   Cough   Head ache   Sore eyes   Neck pain   Weakness   Disease
A            0       1           1           0           1          Fever
B            1       0           0           0           1          Cold
In this case, patient A has symptoms such as head ache, sore eyes and weakness, so
he is suffering from fever. Patient B has symptoms such as cough and weakness, so
he is suffering from cold.
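The 0/1 encoding used in Table 1 can be sketched in Python as below. The names SYMPTOMS and encode are hypothetical helpers introduced only for this illustration; the column order is taken from Table 1.

```python
# Symptom columns in the order used by Table 1.
SYMPTOMS = ["cough", "head ache", "sore eyes", "neck pain", "weakness"]

def encode(present):
    """Return a 0/1 vector marking which symptoms are present."""
    present = set(present)
    return [1 if s in present else 0 for s in SYMPTOMS]

patient_a = encode(["head ache", "sore eyes", "weakness"])
patient_b = encode(["cough", "weakness"])
print(patient_a)  # [0, 1, 1, 0, 1]
print(patient_b)  # [1, 0, 0, 0, 1]
```

Each patient then becomes one fixed-length feature vector, which is the form a classifier expects.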

3.2 Random Forest:

Step-1: Start with the selection of random samples from the given data set.
Step-2: Next, this algorithm will construct a decision tree for every sample. Then it will
get the prediction result from every decision tree.
Step-3: In this step, voting will be performed for every predicted result.
Step-4: At last, select the most voted prediction result as the final prediction result.
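The four steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: it covers only bootstrap sampling and majority voting, it omits the per-split random feature selection used by full Random Forest implementations, and all function names and the toy data (the two rows of Table 1, repeated) are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow a tree on 0/1 features; a leaf is a label, a node is (feature, left, right)."""
    if len(set(labels)) == 1 or not features:
        return majority(labels)
    best = None
    for f in features:
        left = [l for r, l in zip(rows, labels) if r[f] == 0]
        right = [l for r, l in zip(rows, labels) if r[f] == 1]
        if not left or not right:
            continue  # this feature does not split the sample
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority(labels)
    f = best[1]
    rest = [g for g in features if g != f]
    branches = []
    for value in (0, 1):
        keep = [i for i, r in enumerate(rows) if r[f] == value]
        branches.append(build_tree([rows[i] for i in keep],
                                   [labels[i] for i in keep], rest))
    return (f, branches[0], branches[1])

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(tree, tuple):
        f, left, right = tree
        tree = right if row[f] == 1 else left
    return tree

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, features = len(rows), list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample (draw n rows with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: construct one decision tree for this sample.
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], features))
    return forest

def predict_forest(forest, row):
    # Steps 3 and 4: collect one vote per tree and return the most voted label.
    return majority([predict_tree(t, row) for t in forest])

# Toy data: the two rows of Table 1, repeated so that each bootstrap sample
# is very likely to contain both classes (an assumption for illustration).
rows = [[0, 1, 1, 0, 1], [1, 0, 0, 0, 1]] * 10
labels = ["Fever", "Cold"] * 10
forest = fit_forest(rows, labels)
print(predict_forest(forest, [0, 1, 1, 0, 1]))
```

Because every tree sees a different bootstrap sample, individual trees may disagree; the vote in the last step is what makes the combined prediction more stable than any single tree.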
4. EXPERIMENTAL RESULTS
4.1 Language and Tool:
We used Python for our project because it is easy to understand and requires less code, and
we executed our code in PyCharm. PyCharm makes it easier for developers to implement both
local and global changes quickly and efficiently. Developers can also take advantage of the
refactoring options provided by the IDE while writing plain Python code and working
with Python frameworks.
4.1.1 Django
The step-by-step process to create a Django application is as follows. To create a
Django project, we used the following command, where projectname is the name of the
Django application.
$ django-admin startproject projectname
Django Project Example
Here, we create a project named djangpapp within the current directory.
$ django-admin startproject djangpapp
Locate into the Project
Now, move into the project by changing the directory, using the following command.
$ cd djangpapp
To see all the files and subfolders of the Django project, we can use the tree command to
view the tree structure of the application. This is a utility command; if it is not present,
it can be installed via the apt-get install tree command. A Django project contains the
subsequent packages and files. The outer directory is simply a container for the
application, and it can be renamed.
PyCharm is an integrated development environment used for programming, primarily
for Python. It is developed by JetBrains. It provides code analysis, a graphical
debugger, an integrated unit tester and integration with version control systems, and
it supports web development using Django and data science with Anaconda.
PyCharm is available for Windows, macOS and Linux. The Community Edition
is released under the Apache License, and a Professional Edition with extra
features is released under a proprietary license.
The functions and operations which PyCharm provides include:
o Coding assistance: analysis, code completion, syntax and error handling,
integration, and quick fixes.
o Python refactoring: rename, extract method, introduce variable, introduce
constant, pull up, push down and many more.
o Support for web frameworks: Django, web2py and Flask.
o Integrated Python debugger.
o Integrated unit testing, with line-by-line code coverage.
o Google App Engine Python development.
o Version control integration: unified user interface for Subversion, Perforce
and CVS with change lists and merge.
o Support for scientific tools like matplotlib, numpy and scipy.
Login page: This is the login page, where the Admin, Doctor and Receptionist can log in to their
accounts.
The Admin can log in to his page and view the doctors and receptionists. The Admin can also add and
delete doctors and receptionists. This is shown in Fig. 1.
Figure 1. Admin Form
The Doctor can log in to his account and view the patients. This is shown in Fig. 2.

Figure 2. Login for Doctor

The Receptionist adds the patients along with their symptoms and can also view the patients.
This is shown in Fig. 3.
Figure 3. Receptionist Form

Data mining offers many classification algorithms. Any of them could be used in our project,
but they would give only approximate results rather than the best accuracy. This is the
reason why we use the Random Forest algorithm, which gave the most accurate results.
Table 2. The detailed comparison of different classifier in terms of Accuracy,
Sensitivity and Specificity.
Table 2 shows that, among the different classifiers, Random Forest achieved the
highest accuracy for predicting diseases.
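For reference, the three measures named in the caption of Table 2 can be computed from the counts of true and false positives and negatives, as in the Python sketch below. The function names and the label lists are hypothetical and serve only as an illustration; they are not the paper's results.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def scores(y_true, y_pred, positive):
    # Assumes at least one positive and one negative case in y_true,
    # otherwise a denominator below would be zero.
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical labels, for illustration only (not the paper's results).
y_true = ["Fever", "Fever", "Cold", "Cold", "Fever", "Cold", "Cold", "Fever"]
y_pred = ["Fever", "Cold", "Cold", "Cold", "Fever", "Fever", "Cold", "Fever"]
print(scores(y_true, y_pred, positive="Fever"))
```

With these example labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all three measures equal 0.75.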
4.1.2 Comparison with existing work:

4.1.2.1 Random Forest (RF):
RF is one of the prediction algorithms in the machine learning area. It is an
ensemble approach and can easily handle large datasets.
4.1.2.2 K-Nearest Neighbours (KNN):
KNN is grouped under the category of lazy prediction techniques. It is a simple technique
that classifies a new record based on a similarity measure. The training data are stored
by the algorithm, and the number of nearest neighbours, k, is defined. The distance between
each training sample and the query instance is computed, the training samples are arranged
by distance, and the k nearest neighbours are determined based on the minimum distance.
The categories of these neighbours are then collected, and the majority class of the nearest
neighbours is taken as the forecast value of the query record.
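The KNN procedure described above can be sketched in Python as follows. This is an illustrative sketch under assumptions: Hamming distance is chosen here because the symptom vectors are 0/1, the function names are hypothetical, and the last two training rows are invented examples added to the two rows of Table 1.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two 0/1 symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_rows, train_labels, query, k=3):
    # Arrange training samples by distance to the query and keep the k closest.
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: hamming(pair[0], query))
    nearest = [label for _, label in ranked[:k]]
    # The majority class among the k neighbours is the prediction.
    return Counter(nearest).most_common(1)[0][0]

train_rows = [
    [0, 1, 1, 0, 1],  # patient A from Table 1
    [1, 0, 0, 0, 1],  # patient B from Table 1
    [0, 1, 1, 0, 0],  # hypothetical extra patient
    [1, 0, 0, 0, 0],  # hypothetical extra patient
]
train_labels = ["Fever", "Cold", "Fever", "Cold"]
print(knn_predict(train_rows, train_labels, [0, 1, 0, 0, 1], k=3))
```

No model is built in advance: all work happens at query time, which is why the technique is called lazy.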
4.1.2.3 Naive Bayes (NB):
NB is widely used and is suitable when the input data is large and a short computational
time is needed. The probability-based calculation is done by applying Bayes' formula:
P(h|D) = P(D|h) P(h) / P(D)
where P(h) is the prior probability that hypothesis h is true, P(D) is the prior
probability of the training data D, P(h|D) is the probability of h given D, and
P(D|h) is the probability of D given h.
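A small worked example of Bayes' formula is sketched below in Python. All probabilities are hypothetical values chosen for illustration, and P(D) is obtained with the law of total probability over h and its complement.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes' formula: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood_d_given_h * prior_h / evidence_d

# Hypothetical numbers, for illustration only.
p_h = 0.25               # P(h): prior probability of the disease
p_d_given_h = 0.5        # P(D|h): probability of the symptom given the disease
p_d_given_not_h = 0.125  # P(D|not h): probability of the symptom otherwise

# P(D) by the law of total probability.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

print(posterior(p_h, p_d_given_h, p_d))  # 0.125 / 0.21875 = 4/7
```

Observing the symptom raises the probability of the disease from the prior 0.25 to about 0.57 in this example.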
Figure. Comparison of NB, KNN and Random Forest
5. CONCLUSION AND FUTURE WORK

We proposed a general disease prediction system based on a machine learning algorithm. We
utilized the Random Forest algorithm to classify patient data because medical data is
growing very rapidly today, and existing data needs to be processed to predict the exact
disease based on symptoms. By giving patient records as input, we obtained accurate general
disease risk prediction as output, which helps us to understand the level of disease
risk. This system may therefore lead to low time consumption and minimal cost
for disease prediction and risk prediction.
Future scope:
In due course, the latest technology advancements will be taken into consideration. As part
of the technical build-up, many components of the networking system will be generic in
nature so that future research can either use or interact with them. The future holds a lot
to offer for the development and refinement of this research.
REFERENCES
1. M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang, “Disease prediction by
machine learning over big data from healthcare communities,” IEEE Access,
vol. 5, no. 1, pp. 8869–8879, 2017.
2. B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity based
method for interactive patient risk prediction,” Springer Data Mining Knowl.
Discovery, vol. 29, no. 4, pp. 1070–1093, 2015.
3. M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, and C. Youn, “Wearable 2.0: Enable
human-cloud integration in next generation healthcare system,” IEEE Commun. Mag.,
vol. 55, no. 1, pp. 54–61, Jan. 2017.
4. Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “HealthCPS:
Healthcare cyberphysical system assisted by cloud and big data,” IEEE Syst. J.,
vol. 11, no. 1, pp. 88–95, Mar. 2017.
5. L. Qiu, K. Gai, and M. Qiu, “Optimal big data sharing approach for telehealth in
cloud computing,” in Proc. IEEE Int. Conf. Smart Cloud (Smart Cloud), Nov.
2016, pp. 184–189.
6. Disease and symptoms Dataset – www.github.com.
7. Heart disease Dataset – WWW.UCI Repository. com
8. Ajinkya Kunjir, Harshal Sawant, Nuzhat F. Shaikh, “Data Mining and
Visualization for prediction of Multiple Diseases in Healthcare,” in IEEE big
data analytics and computational intelligence, Oct 2017, pp. 2325.
9. Shanthi Mendis, Pekka Puska, Bo Norrving, World Health Organization (2011),
Global Atlas on Cardiovascular Disease Prevention and Control, pp. 3–18.
ISBN 978-92-4-156437-3.
10. Amin, S.U.; Agarwal, K.; Beg, R., “Genetic neural network based data mining in
prediction of heart disease using risk factors,” IEEE Conference on Information
& Communication Technologies (ICT), pp. 1227-31, 11-12 April 2013.
11. Palaniappan S, Awang R, “Intelligent heart disease prediction System using data
mining techniques,” IEEE/ACS International Conference on Computer Systems and
Applications, AICCSA 2008, pp. 108-115, March 31 2008-April 4 2008.
12. B. Nithya, Dr. V. Ilango, “Predictive Analytics in Health Care Using Machine
Learning Tools and Techniques,” International Conference on Intelligent
Computing and Control Systems, 2017.
13. S. Leoni Sharmila, C. Dharuman and P. Venkatesan, “Disease Classification Using
Machine Learning Algorithms - A Comparative Study,” International Journal of
Pure and Applied Mathematics, Volume 114, No. 6, 2017, 1-10.
14. Allen Daniel Sunny, Sajal Kulshreshtha, Satyam Singh, Srinabh, Mr. Mohan
Ba, Dr. Sarojadevi H, “Disease Diagnosis System By Exploring Machine Learning
Algorithms,” International Journal of Innovations in Engineering and
Technology (IJIET), Volume 10, Issue 2, May 2018.
15. Shraddha Subhash Shirsath, “Disease Prediction Using Machine Learning Over
Big Data,” International Journal of Innovative Research in Science, Vol. 7, Issue 6,
June 2018.
