Classification of Diabetes Disease Using Decision Tree Algorithm (C4.5)
Abstract. Diabetes is one of the most common health problems in the world. It is also
known as "the silent killer". According to WHO (2016), the number of adults with diabetes
rose from 108 million in 1980 to about 422 million in 2014. If not handled properly,
diabetes can become chronic, damage other organs, and cause death. The disease presents
several symptoms, so the different symptom variables must be evaluated to determine
which are more dominant. This research aims to establish the most influential of the many
variables associated with diabetes mellitus. We propose using the decision tree (C4.5) data
mining algorithm to predict diabetes and help doctors diagnose the disease sooner. Data
mining offers various approaches to disease prediction, one of which is C4.5. This research
produces a decision tree, and the results show that polydipsia plays a major role in diabetes,
with a model accuracy of 90.38%. Polydipsia is thus one of the most dominant signs of
diabetes.
1. Introduction
According to data obtained from the WHO official website, around 422 million people
worldwide have diabetes. Most people with diabetes live in low- and middle-income countries.
However, over the last three decades, the increase in the number of people with diabetes has been
found to be evenly distributed throughout the world, in both low-income and high-income countries.
Each year, 1.6 million deaths are associated with diabetes. Diabetes is a condition in
which the body has problems producing the hormone insulin, which can cause damage to other organs
in the body [1–3].
Classification algorithms have previously been applied to disease data. According to [4],
the decision tree was the most powerful classification technique in a study conducted to
classify diabetic patients in a population in Canada. According to [5], the best classification
technique, with a precision of 0.770 and a recall of 0.775, was the Hoeffding Tree algorithm.
Because early detection of diabetes is believed to be very important, a study on diabetes
detection using deep learning algorithms reports that the use of a support vector
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd 1
Annual Conference on Science and Technology (ANCOSET 2020) IOP Publishing
Journal of Physics: Conference Series 1869 (2021) 012082 doi:10.1088/1742-6596/1869/1/012082
machine (SVM) can improve performance by 0.03% and 0.06% by using the Convolutional Neural
Network (CNN) [6].
Decision trees are a well-established method for classification tasks because of their fast construction
time and good interpretability [7]. This study applies the C4.5 algorithm to obtain a decision tree [8]
that shows the parameters with the greatest influence on the occurrence of diabetes. Knowing the
early symptoms allows diabetes to be treated early, in the hope of reducing the death rate
caused by diabetes.
We propose early detection and prevention of diabetes to reduce the complexity and high cost
of diabetes-related care, and we apply the decision tree (C4.5) classifier for knowledge
analysis. Well-timed diagnosis and prevention lead to a decrease in mortality and in the risk of
diabetes complications, resulting in improved quality of life.
2. Methods
The dataset used in this study is secondary data taken from the early-stage diabetes dataset,
which can be accessed at
https://www.kaggle.com/singhakash/early-stage-diabetes-risk-prediction-datasets.
According to its description, the dataset was collected using direct questionnaires from the
patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh, and approved by a doctor. The
attributes of this dataset are: Age (range 20-65); Sex (Male or Female); Polyuria, Polydipsia,
sudden weight loss, weakness, Polyphagia, Genital thrush, visual blurring, Itching, Irritability,
delayed healing, partial paresis, muscle stiffness, Alopecia, and Obesity (each Yes or No); and
Class (Positive or Negative). For more details, see Table 1.
The source data consists of 520 records with 16 variables (age, gender,
polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, visual blurring,
itching, irritability, delayed healing, partial paresis, muscle stiffness, alopecia, obesity) [9]. Decision
tree (C4.5) is an algorithm that generates a decision tree and uses it to classify objects. C4.5 is an
improved form of the Iterative Dichotomiser 3 (ID3) algorithm [10]. This study applies the C4.5
algorithm to obtain a decision tree that shows the parameters that have the greatest
role in diabetes [11].
The work steps are shown in Figure 1. In the first stage, we collect a public dataset of
diabetes symptoms. In the second stage, we perform feature selection, keeping only the clinical
symptoms that are considered diabetes indicators; these are then used as attributes in the model.
In the third stage, the data that passed feature selection is split into training data and
test data: the training data is used to build the model, and the test data is used to measure the
model's accuracy. In the next step, the training data is processed with the C4.5 algorithm to obtain
the model shown in Figure 2; the C4.5 model results are described in detail in the C4.5 algorithm
stage.
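The feature-selection and splitting stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give the split ratio, so the 20% test fraction, the random seed, the helper names `select_features` and `train_test_split`, and the toy records are all hypothetical.

```python
import random

def select_features(rows, keep):
    """Keep only the chosen attributes (plus whatever else is listed in `keep`)."""
    return [{name: row[name] for name in keep} for row in rows]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test).

    test_fraction=0.2 is an assumption; the source does not state the ratio.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for the 520 questionnaire records (values are made up).
records = [
    {"id": i, "polydipsia": "Yes" if i % 3 == 0 else "No",
     "note": "free text", "class": "Positive" if i % 2 == 0 else "Negative"}
    for i in range(520)
]

selected = select_features(records, keep=("id", "polydipsia", "class"))
train, test = train_test_split(selected)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters when the reported accuracy depends on which records end up in the test set.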
Figure 1. Classification technique using the C4.5 algorithm to obtain a decision tree.
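To determine the root of the decision tree, the entropy value and the gain value of each parameter
must be calculated. The entropy is given by (1) and the gain by (2) [14]:
\[ \mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{1} \]
\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{2} \]
where S is the set of records, n is the number of classes, p_i is the proportion of records in S
that belong to class i, A is a parameter (attribute), and S_v is the subset of S in which A takes
the value v.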
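The entropy and gain computations in (1) and (2) can be illustrated with a minimal Python sketch. The function names, the `"class"` column name, and the four-row toy table are assumptions made for illustration; they are not the study's data and do not reproduce its numbers.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels, as in formula (1)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="class"):
    """Entropy of the whole set minus the weighted entropy of each subset, as in (2)."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder

def best_attribute(rows, attributes, target="class"):
    """The attribute with the highest gain is the candidate for the root node."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy records, NOT taken from the dataset used in the paper.
toy = [
    {"polydipsia": "Yes", "itching": "Yes", "class": "Positive"},
    {"polydipsia": "Yes", "itching": "No",  "class": "Positive"},
    {"polydipsia": "No",  "itching": "Yes", "class": "Negative"},
    {"polydipsia": "No",  "itching": "No",  "class": "Negative"},
]

for name in ("polydipsia", "itching"):
    print(name, round(information_gain(toy, name), 3))
print("root:", best_attribute(toy, ["polydipsia", "itching"]))
```

In this toy table, polydipsia separates the two classes perfectly and itching carries no information, so polydipsia is chosen as the root.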
The following is the calculation for the entropy value based on formula (1):
After the entropy value is obtained, it is used as the basis for calculating the gain value of
each parameter. The calculation for the polyuria parameter is as follows:
• Polyuria Positive :
• Polyuria Negative :
• Polyuria gain :
The same technique is applied to the other parameters to obtain the entropy and gain values
used to determine the root of the decision tree, as shown in Table 2:
Based on the results table, the highest gain value belongs to the polydipsia parameter,
at 0.440. Therefore, polydipsia is selected as the root of the decision tree.
The tool used to build and display the decision tree in this study is RapidMiner. For the
previously processed data, the resulting decision tree is as follows:
From the decision tree model (Figure 2), it can be seen that polydipsia has the most influence
on whether a patient has diabetes. In more detail, the tree can be read as follows:
• If polydipsia is no and polyuria is no, then the patient is declared negative.
• If polydipsia is no, polyuria is yes, and delayed healing is no, then the patient is declared
positive.
• If polydipsia is no, polyuria is yes, delayed healing is yes, and itching is no, then the patient is
declared positive.
• If polydipsia is no, polyuria is yes, delayed healing is yes, itching is yes, and alopecia is no,
then the patient is declared positive.
• If polydipsia is no, polyuria is yes, delayed healing is yes, itching is yes, alopecia is yes,
partial paresis is no, and obesity is no, then the patient is declared positive.
• If polydipsia is no, polyuria is yes, delayed healing is yes, itching is yes, alopecia is yes,
partial paresis is no, and obesity is yes, then the patient is declared negative.
• If polydipsia is no, polyuria is yes, delayed healing is yes, itching is yes, alopecia is yes,
and partial paresis is yes, then the patient is declared negative.
• If polydipsia is yes and polyuria is yes, then the patient is declared positive.
• If polydipsia is yes, polyuria is no, and polyphagia is yes, then the patient is declared positive.
• If polydipsia is yes, polyuria is no, polyphagia is no, and itching is no, then the patient is
declared positive.
• If polydipsia is yes, polyuria is no, polyphagia is no, itching is yes, and delayed healing is yes,
then the patient is declared positive.
• If polydipsia is yes, polyuria is no, polyphagia is no, itching is yes, delayed healing is no,
and muscle stiffness is no, then the patient is declared positive.
• If polydipsia is yes, polyuria is no, polyphagia is no, itching is yes, delayed healing is no,
and muscle stiffness is yes, then the patient is declared negative.
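The rules listed above can be written as a single nested function, which makes it easy to check that every path through the tree ends in exactly one class. The sketch below is a direct transcription of the listed rules, not the RapidMiner model itself; the key names, the `patient` helper, and the reading of "partial" as partial paresis are assumptions.

```python
SYMPTOMS = (
    "polydipsia", "polyuria", "polyphagia", "itching", "delayed_healing",
    "alopecia", "partial_paresis", "obesity", "muscle_stiffness",
)

def patient(*present):
    """Hypothetical helper: build a record where only the named symptoms are 'yes'."""
    unknown = set(present) - set(SYMPTOMS)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    return {name: name in present for name in SYMPTOMS}

def classify(p):
    """Apply the listed decision rules; p maps symptom name -> True (yes) / False (no)."""
    if not p["polydipsia"]:
        if not p["polyuria"]:
            return "Negative"
        if not p["delayed_healing"]:
            return "Positive"
        if not p["itching"]:
            return "Positive"
        if not p["alopecia"]:
            return "Positive"
        if p["partial_paresis"]:
            return "Negative"
        return "Negative" if p["obesity"] else "Positive"
    # From here on, polydipsia is yes.
    if p["polyuria"]:
        return "Positive"
    if p["polyphagia"]:
        return "Positive"
    if not p["itching"]:
        return "Positive"
    if p["delayed_healing"]:
        return "Positive"
    return "Negative" if p["muscle_stiffness"] else "Positive"

print(classify(patient()))                          # no symptoms at all
print(classify(patient("polydipsia", "polyuria")))  # both root-level symptoms present
```

Written this way, the tree has thirteen leaves, one per rule, and only four of them are negative.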
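The performance of the model, as reported by the tool, is shown in Table 3. Accuracy is computed as
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{3} \]
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and
false negatives. The accuracy value for this model is 90.38% [15].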
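As a small illustration of how an accuracy figure is obtained from a confusion matrix, consider the sketch below. The four counts are hypothetical: the confusion matrix is not reproduced here, and the numbers were chosen only so that 94 of 104 predictions are correct, which happens to give the same percentage as the reported result.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Hypothetical counts, NOT the study's confusion matrix.
acc = accuracy(tp=60, tn=34, fp=4, fn=6)
print(f"{acc:.2%}")  # prints 90.38%
```

Accuracy alone hides how the errors are divided between false positives and false negatives, so the same percentage can arise from many different confusion matrices.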
4. Conclusion
The disease presents several symptoms, so the different symptom variables must be evaluated to
determine which are more dominant. This research aimed to establish the most influential of the
many variables associated with diabetes mellitus. We used the decision tree (C4.5) data mining
algorithm to predict diabetes and help doctors diagnose the disease sooner. Data mining offers
various approaches to disease prediction; one of them is C4.5. The experimental results show that
the parameter with the greatest influence on diabetes is polydipsia. The model achieves an
accuracy of 90.38%, so it can be considered to perform fairly well. Therefore, people who have
symptoms of polydipsia are advised to be checked for diabetes early. In future work, we plan to
refine the model by incorporating more data from various sources and considering other variables
such as eating patterns and lifestyle.
Acknowledgment
Thanks to Hamzanwadi University for providing grants for this research.
References
[1] Tenenbaum-Gavish K, Sharabi-Nov A, Binyamin D, Møller H J, Danon D, Rothman L, Hadar
E, Idelson A, Vogel I, Koren O, Nicolaides K H, Gronbaek H and Meiri H 2020 First
trimester biomarkers for prediction of gestational diabetes mellitus Placenta 101 80–9
[2] Tigga N P and Garg S 2020 Prediction of Type 2 Diabetes using Machine Learning
Classification Methods Procedia Comput. Sci. 167 706–16
[3] Edla D R and Cheruku R 2017 Diabetes-Finder: A Bat Optimized Classification System for
Type-2 Diabetes Procedia Comput. Sci. 115 235–42
[4] Perveen S, Shahbaz M, Guergachi A and Keshavjee K 2016 Performance Analysis of Data
Mining Classification Techniques to Predict Diabetes Procedia Comput. Sci. 82 115–21
[5] Mercaldo F, Nardone V and Santone A 2017 Diabetes Mellitus Affected Patients Classification
and Diagnosis through Machine Learning Techniques Procedia Comput. Sci. 112 2519–28
[6] Swapna G, Vinayakumar R and Soman K P 2018 Diabetes detection using deep learning
algorithms ICT Express 4 243–6
[7] Hopner F 2020 Multidimensional Decision Tree Splits to Improve Interpretability Procedia
Comput. Sci. 176 156–65
[8] Budiman E, Haviluddin, Dengan N, Kridalaksana A H, Wati M and Purnawansyah 2018
Performance of Decision Tree C4.5 Algorithm in Student Academic Evaluation Lect. Notes
Electr. Eng. 488 380–9
[9] Islam F, Ferdousi R, Rahman S and Bushra H Y 2019 Computer Vision and Machine
Intelligence in Medical Image Analysis
[10] Hssina B, Merbouha A, Ezzikouri H and Erritali M 2014 A comparative study of
decision tree ID3 and C4.5 Int. J. Adv. Comput. Sci. Appl. 4 13–9
[11] Sharma S, Agrawal J and Sharma S 2013 Classification Through Machine Learning Technique:
C4.5 Algorithm based on Various Entropies Int. J. Comput. Appl. 82 28–32
[12] Dash M and Liu H 1997 Feature selection for classification Intell. Data Anal. 1 131–56
[13] Salappa A, Doumpos M and Zopounidis C 2007 Feature selection algorithms in classification
problems: an experimental evaluation Optim. Methods Softw.
[14] Kirshners A, Parshutin S and Gorskis H 2016 Entropy-Based Classifier Enhancement to
Handle Imbalanced Class Problem Procedia Comput. Sci. 104 586–91
[15] Nellore S B 2015 Various performance measures in Binary classification-An Overview of ROC
study IJISET-International J. Innov. Sci. Eng. Technol. 2 596–605