Abstract
The objective of this research is to investigate the randomization of data in
computer-based feature selection for diagnosing coronary artery disease.
Randomization of the Cleveland dataset was studied because the performance
value differs for each experiment. Assuming that the performance values follow
a Gaussian probability distribution is one way to handle the differing
performance values produced by randomizing the dataset; the final performance
is then taken as the mean of all performance values. In this research,
computer-based feature selection (CFS), medical-expert-based feature selection
(MFS) and the combination of the two (MFS+CFS) are also applied to improve the
performance of the classification algorithms. In addition, this research found
a characteristic of the Cleveland dataset that differs from previous work; this
difference can clearly affect the feature selection result and the final
performance. In summary, randomizing the dataset and computing the final
performance as a mean can generally represent the performance of a
classification algorithm.
Keywords: CFS, Classification algorithm, Coronary artery disease, Cleveland
dataset, Gaussian probability distribution, MFS, Randomization.
1 Introduction
Coronary Artery Disease (CAD), sometimes called Coronary Heart Disease (CHD),
is the most common heart disease. CAD occurs when the blood flow to the heart
muscle through the coronary arteries is blocked by atherosclerosis (fatty
deposits) [1]. It has a very high mortality rate: in 2008, an estimated 7.3
million deaths worldwide were caused by CAD [2]. The initial diagnosis usually
relies on medical history and a physical examination; further testing can then
be done.
For further testing, coronary angiography provides the “gold standard” diagnosis
of disease in the coronary arteries [3]. Coronary angiography test is preferred by
cardiologists to diagnose the presence of CAD with high accuracy even though
invasive, risky and expensive [4].
Given these shortcomings, it is necessary to develop a method capable of
diagnosing CAD before the coronary angiography test, in order to spare
patients an invasive, risky and expensive diagnostic procedure. This motivates
the development of a computer-based method able to diagnose the presence of
CAD: such a method can provide a diagnostic procedure that is non-invasive,
safe and less expensive.
Various computer-based methods have been developed to identify heart-related
diseases. Neural network [5], fuzzy [6] and data mining [7] methods have been
proposed to diagnose CAD. Neural-network-based methods have advantages in
nonlinear prediction, parallel processing and fault tolerance, but they
require large amounts of training data and suffer from over-fitting, slow
convergence and local optima [8]. Fuzzy logic offers reasoning at a higher
level by using linguistic information obtained from domain experts, but fuzzy
systems lack the ability to learn and cannot adjust to a new environment [9].
Data mining, the process of extracting hidden knowledge from data, offers
further advantages: it can reveal patterns and relationships among large
amounts of data, whether within a single dataset or across several [10].
In medical diagnosis, data reduction is an important issue. Medical data often
contain a large number of irrelevant or redundant features and a relatively
small number of cases, which can affect the quality of disease diagnosis [11].
A feature selection process can therefore be used to select the relevant
features in medical data. Feature selection has been proposed in many studies
[11][12][13][14][15] to improve accuracy in the diagnosis of CAD.
Nahar et al. [14] performed a computer-based feature selection process, termed
computer feature selection (CFS). CFS selects features automatically, without
medical knowledge, so there is a possibility of discarding medically
significant factors. To avoid losing these factors, the feature selection
process also needs to be carried out by medical experts (termed MFS). The
medically significant factors are age, chest pain type, resting blood
pressure, cholesterol, fasting blood sugar, resting heart rate, maximum heart
rate and exercise-induced angina. For CFS, Nahar et al. used CfsSubsetEval as
the attribute selection method (with the BestFirst search strategy) provided
by Weka.
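As an illustration, the following is a minimal sketch of how such a selection
can be run through the Weka Java API. The file name cleveland.arff is a
hypothetical placeholder, and default parameters are assumed; the paper does
not list its exact settings.

    // Sketch: CFS feature selection with Weka's CfsSubsetEval and BestFirst
    // search. File name and default parameters are assumptions.
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CfsSelectionSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cleveland.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);       // class attribute last

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new CfsSubsetEval()); // correlation-based evaluator
            selector.setSearch(new BestFirst());        // BestFirst search strategy
            selector.SelectAttributes(data);

            // Print the names of the selected attributes (the class index is
            // returned as the last entry of selectedAttributes()).
            for (int idx : selector.selectedAttributes()) {
                System.out.println(data.attribute(idx).name());
            }
        }
    }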
There is a difference in the characteristics of the Cleveland dataset between
the current research and Nahar et al. [14]: the total number of positive-class
instances differs. This difference can clearly affect the final performance.
There is also an important issue that Nahar et al. did not consider: the
effect of the randomization process on the data, which can affect the
performance of computer-based diagnosis. In this research, the randomization
of medical data (the Cleveland dataset) is studied.
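A minimal sketch of this randomization study, assuming Weka's evaluation API:
the dataset is reshuffled with a different random seed on each run, and the
final accuracy is reported as the mean of the per-run accuracies, with the
standard deviation describing the assumed Gaussian spread. The file name and
the number of runs are placeholders, not values taken from the paper.

    // Sketch: repeated 10-fold cross-validation under different random seeds;
    // the final performance is the mean of the per-run accuracies.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RandomizationSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cleveland.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            int runs = 30; // number of randomized runs is an assumption
            double sum = 0, sumSq = 0;
            for (int seed = 1; seed <= runs; seed++) {
                Evaluation eval = new Evaluation(data);
                // A new Random(seed) reshuffles the data before folding.
                eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(seed));
                double acc = eval.pctCorrect();
                sum += acc;
                sumSq += acc * acc;
            }
            double mean = sum / runs;
            double std = Math.sqrt(sumSq / runs - mean * mean);
            System.out.printf("mean accuracy = %.3f%%, std = %.3f%n", mean, std);
        }
    }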
Table 2: Characteristics of Cleveland dataset

Dataset  Positive  Number of       Number of       Status indicated
Name     Class     Positive Class  Negative Class  by positive class
H-0      Health    165             138             Health, Sick
Sick-1   S1        54              249             S1, Negative
Sick-2   S2        36              267             S2, Negative
Sick-3   S3        35              268             S3, Negative
Sick-4   S4        13              290             S4, Negative

Table 3: Characteristics of Cleveland dataset (Nahar et al.)

Dataset  Positive  Number of       Number of       Status indicated
Name     Class     Positive Class  Negative Class  by positive class
H-0      Health    165             138             Health, Sick
Sick-1   S1        56              247             S1, Negative
Sick-2   S2        37              266             S2, Negative
Sick-3   S3        36              267             S3, Negative
Sick-4   S4        14              289             S4, Negative
For CFS, Nahar et al. used CfsSubsetEval as the attribute selection method
(with the BestFirst search strategy) provided by Weka. CFS selects features
automatically, so medically significant factors may be discarded [14].
In this research, six well-known classifiers (Naïve Bayes, SMO, IBK,
AdaBoostM1, J48 and PART) were used. This is why the Cleveland dataset has to
be converted to binary-class datasets: these algorithms are binary
classifiers.
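One way to perform this conversion is sketched below with Weka's MakeIndicator
filter; whether the paper used this filter or a manual relabeling is not
stated. The index strings assume the disease-status attribute is last, with
class value "1" in its second nominal position.

    // Sketch: deriving the binary dataset Sick-1 (class value 1 vs. everything
    // else) from the five-valued Cleveland class using MakeIndicator.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.MakeIndicator;

    public class BinarizeSketch {
        public static void main(String[] args) throws Exception {
            Instances raw = DataSource.read("cleveland.arff"); // hypothetical file

            MakeIndicator indicator = new MakeIndicator();
            indicator.setAttributeIndex("last"); // the disease-status attribute
            indicator.setValueIndices("2");      // 1-based position of value "1"
            indicator.setNumeric(false);         // produce a nominal binary class
            indicator.setInputFormat(raw);

            // Positive = class value 1 (S1), negative = everything else.
            Instances sick1 = Filter.useFilter(raw, indicator);
            sick1.setClassIndex(sick1.numAttributes() - 1);
        }
    }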
The class label Cj with the largest conditional probability value determines
the category of the data record [7].
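For reference, this is the standard Naïve Bayes decision rule, with x_1, ...,
x_n denoting the feature values of the record (a notational assumption, since
the preceding derivation is not reproduced here):

    C_pred = \arg\max_{C_j} \; P(C_j) \prod_{i=1}^{n} P(x_i \mid C_j)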
2.4.2 SMO
The SMO algorithm has two components: an analytical method for solving for two
Lagrange multipliers, and heuristic methods for choosing which multipliers to
optimize. The algorithm was introduced by John Platt in 1998 at Microsoft
Research [17].
2.4.3 IBK
The algorithm finds the group of k objects in the training set that are
closest to the test object and assigns the label based on the predominant
class among them. It addresses the main issue that, in many datasets, one
object may not exactly match another, as well as the fact that conflicting
information about the class of an object can be obtained from its nearest
objects [18].
2.4.4 AdaBoostM1
“Boosting” is a general method for improving the performance of any learning
algorithm. Boosting can be used to significantly reduce the error of any
“weak” learning algorithm that consistently generates classifiers which need
only be a little better than random guessing [19].
2.4.5 J48
J48 is a classification algorithm that implements the C4.5 algorithm [10].
C4.5 is intended for supervised learning: it learns a mapping from attribute
values to a class, which can then be applied to classify new (unseen)
instances [18].
2.4.6 PART
The PART algorithm builds a tree using C4.5's heuristics, with the same
user-specified parameters as J48. The rules of the classification algorithm
are derived from partial decision trees; a partial decision tree is a decision
tree that contains branches to undefined sub-trees [10].
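All six classifiers above are available in Weka; a sketch of how they can be
instantiated through its Java API follows. Default parameters are an
assumption, as the paper's exact settings are not listed here.

    // Sketch: the six classifiers used in this research, via the Weka Java API.
    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.rules.PART;
    import weka.classifiers.trees.J48;

    public class ClassifierSketch {
        public static Classifier[] classifiers() {
            return new Classifier[] {
                new NaiveBayes(),  // probabilistic, conditional independence
                new SMO(),         // SVM trained by sequential minimal optimization
                new IBk(),         // k-nearest neighbours (instance-based)
                new AdaBoostM1(),  // boosting of weak learners
                new J48(),         // C4.5 decision tree
                new PART()         // rules from partial C4.5 trees
            };
        }
    }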
The feature selection results obtained from CFS can be seen in Table 4. For
each dataset, the features selected by CFS are different.
Table 4: Feature selection result of CFS

H-0     chest pain, resting ECG, maximum heart rate, exercise induced
        angina, oldpeak, number of vessels coloured, thal
Sick-1  sex, chest pain, fasting blood sugar, resting ECG, exercise
        induced angina, thal
Sick-2  chest pain, fasting blood sugar, maximum heart rate, exercise
        induced angina, oldpeak, number of vessels coloured, thal
Sick-3  maximum heart rate, exercise induced angina, oldpeak, number of
        vessels coloured, thal
Sick-4  resting ECG, oldpeak, number of vessels coloured
It can be seen from Table 4 that CFS does not always select the features
considered medically significant factors by MFS. Therefore, to ensure that
medically significant factors are not left out, it is necessary to combine MFS
and CFS.
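A sketch of one natural reading of MFS+CFS, namely the set union of the two
feature lists, follows; the paper's exact combination rule is assumed here.
The CFS list shown is the H-0 row of Table 4.

    // Sketch: MFS+CFS as the union of the medically selected features and the
    // features chosen by CFS (union is an assumed combination rule).
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class CombineSelections {
        public static Set<String> union(List<String> mfs, List<String> cfs) {
            Set<String> combined = new LinkedHashSet<>(mfs); // MFS order first
            combined.addAll(cfs); // add CFS features that MFS did not pick
            return combined;
        }

        public static void main(String[] args) {
            List<String> mfs = List.of("age", "chest pain", "resting blood pressure",
                    "cholesterol", "fasting blood sugar", "resting heart rate",
                    "maximum heart rate", "exercise induced angina");
            List<String> cfsH0 = List.of("chest pain", "resting ECG",
                    "maximum heart rate", "exercise induced angina", "oldpeak",
                    "number of vessels coloured", "thal");
            System.out.println(union(mfs, cfsH0)); // MFS+CFS features for H-0
        }
    }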
four cases (Naïve Bayes, SMO, AdaBoostM1 and J48). For Sick-1, the accuracy of
CFS is better than CVP 10-fold in five cases (Naïve Bayes, IBK, AdaBoostM1,
J48 and PART) and the accuracy of MFS is better than CVP 10-fold in four cases
(Naïve Bayes, IBK, AdaBoostM1 and J48). For Sick-2, the accuracy of CFS and
MFS is better than CVP 10-fold in three cases (Naïve Bayes, AdaBoostM1 and
PART). For Sick-3, the accuracy of CFS is better than CVP 10-fold in three
cases (Naïve Bayes, AdaBoostM1 and PART) and the accuracy of MFS is better
than CVP 10-fold in three cases (Naïve Bayes, AdaBoostM1 and J48). For Sick-4,
the accuracy of CFS and MFS is better than CVP 10-fold in three cases (Naïve
Bayes, J48 and PART).
Table 5: Performance for 10-fold and CVP 10-fold, full-feature dataset.
For each metric, the first value is 10-fold and the second is CVP 10-fold.

Dataset Algorithm  Accuracy (%)  TP rate  F-measure  Training time (s)
H-0 Naïve Bayes 84.042 84.249 0.868 0.873 0.855 0.858 0.001 0.000
SMO 83.065 83.183 0.839 0.866 0.842 0.848 0.083 1.039
IBK 83.131 82.368 0.872 0.860 0.849 0.841 0.000 0.169
AdaBoostM1 83.298 80.541 0.861 0.837 0.848 0.824 0.021 0.022
J48 78.892 75.817 0.844 0.806 0.812 0.783 0.002 0.187
PART 80.560 79.437 0.860 0.842 0.827 0.816 0.002 0.189
Sick-1 Naïve Bayes 77.549 78.316 0.110 0.115 0.139 0.151 0.001 0.001
SMO 82.194 82.175 0.000 0.000 0.000 0.000 0.026 1.102
IBK 80.909 81.186 0.016 0.010 0.021 0.015 0.000 0.232
AdaBoostM1 82.194 77.631 0.000 0.160 0.000 0.199 0.144 0.080
J48 81.411 81.409 0.014 0.014 0.020 0.020 0.002 0.158
PART 81.459 81.382 0.009 0.010 0.011 0.012 0.002 0.240
Sick-2 Naïve Bayes 78.615 80.090 0.426 0.327 0.313 0.275 0.001 0.001
SMO 88.036 88.148 0.000 0.000 0.001 0.000 0.027 1.275
IBK 87.755 87.730 0.002 0.009 0.003 0.016 0.000 0.217
AdaBoostM1 84.827 83.006 0.062 0.086 0.072 0.088 0.145 0.077
J48 87.515 87.876 0.011 0.011 0.013 0.017 0.001 0.128
PART 86.347 86.926 0.026 0.021 0.030 0.024 0.002 0.168
Sick-3 Naïve Bayes 82.305 82.578 0.513 0.478 0.394 0.386 0.000 0.000
SMO 88.132 88.422 0.025 0.000 0.035 0.000 0.025 1.153
IBK 87.563 88.023 0.000 0.005 0.000 0.008 0.000 0.213
AdaBoostM1 85.486 83.423 0.211 0.153 0.226 0.164 0.161 0.070
J48 87.860 87.878 0.010 0.013 0.012 0.017 0.001 0.119
PART 86.471 87.306 0.038 0.034 0.040 0.040 0.002 0.137
Sick-4 Naïve Bayes 93.329 93.797 0.074 0.075 0.076 0.099 0.001 0.001
SMO 95.731 95.703 0.000 0.000 0.000 0.000 0.010 0.860
IBK 95.725 95.451 0.000 0.000 0.000 0.000 0.000 0.171
AdaBoostM1 95.682 95.703 0.000 0.000 0.000 0.000 0.572 0.103
J48 95.731 95.654 0.000 0.000 0.000 0.000 0.002 0.110
PART 95.154 95.364 0.009 0.007 0.007 0.006 0.001 0.140
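The metrics in Table 5 correspond to quantities available from Weka's
Evaluation class; the sketch below shows how they could be obtained for one
classifier. The positive class is assumed to be at index 0 of the class
attribute, and total cross-validation wall-clock time is used as a rough
stand-in for the training-time column.

    // Sketch: extracting accuracy, TP rate and F-measure from a Weka Evaluation.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MetricsSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cleveland.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            long start = System.currentTimeMillis();
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            long elapsed = System.currentTimeMillis() - start;

            System.out.printf("Accuracy  = %.3f%%%n", eval.pctCorrect());
            System.out.printf("TP rate   = %.3f%n", eval.truePositiveRate(0));
            System.out.printf("F-measure = %.3f%n", eval.fMeasure(0));
            System.out.printf("Time      = %.3f s (whole CV, an approximation)%n",
                    elapsed / 1000.0);
        }
    }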
Table 7 shows the comparison of the final performance when MFS and CFS are
combined. The bold values indicate the best algorithm for each dataset; the
highlighted values indicate where the accuracy of MFS+CFS is better than that
of MFS. For H-0, the accuracy of MFS+CFS is better than MFS for all
algorithms. For Sick-1, the accuracy of MFS+CFS is better than MFS for two
cases (J48 and PART). For Sick-2, the accuracy of MFS+CFS is better than MFS
for two cases (IBK and PART). For Sick-3, the accuracy of MFS+CFS is better
than MFS for one case (PART). For Sick-4, MFS+CFS is not better than MFS for
any algorithm.
SMO, IBK, J48 and PART) and Sick-4 (Naïve Bayes, AdaBoostM1 and PART).
From the analysis of the final performance results, it can be seen that the
feature selection process (CFS and MFS) improves the accuracy in some cases
compared with applying only CVP 10-fold (without feature selection). To
further improve computer-based feature selection, the combination of MFS and
CFS can be proposed. From Table 7, the combined MFS and CFS method improves
the accuracy in some cases for datasets H-0, Sick-1, Sick-2 and Sick-3
compared with applying the MFS process alone. For CFS, this research used only
one attribute selection method (CfsSubsetEval), so it does not generally
represent the CFS process. In future work, modifying the CFS method with other
attribute selection methods is recommended to improve the performance of
diagnosing coronary artery disease. The modified CFS can also be combined with
MFS to reassure medical experts about the diagnosis result.
5 Acknowledgements
The research work was supported by the Intelligent System Research Group at
the Department of Electrical Engineering and Information Technology,
Universitas Gadjah Mada.
References
[1] Randall, O. S., Segerson, N. M. & Romaine, D. S., The Encyclopedia of the
Heart and Heart Disease, 2nd ed. Facts on File, 2010.
[2] WHO, Global Atlas on Cardiovascular Disease Prevention and Control, 1st
ed. World Health Organization, 2012.
[3] Phibbs, B., The Human Heart: A Basic Guide to Heart Disease, 2nd ed.
Lippincott Williams & Wilkins, 2007.
[4] Setiawan, N. A., Diagnosis of Coronary Artery Disease Using Artificial
Intelligence Based Decision Support System, Universiti Teknologi Petronas,
2009.
[5] Khemphila, A. & Boonjing, V., Heart Disease Classification Using Neural
Network and Feature Selection, 2011 21st International Conference on
Systems Engineering (ICSEng), pp. 406–409, 2011.
[6] Pal, D., Mandana, K. M., Pal, S., Sarkar, D. & Chakraborty, C., Fuzzy
expert system approach for coronary artery disease screening using clinical
parameters, Knowl.-Based Syst., vol. 36, pp. 162–174, Dec. 2012.
[7] Alizadehsani, R., Habibi, J., Hosseini, M. J., Mashayekhi, H., Boghrati,
R., Ghandeharioun, A., Bahadorian, B. & Sani, Z. A., A data mining approach
for diagnosis of coronary artery disease, Comput. Methods Programs Biomed.,
2013.
[8] Capparuccia, R., De Leone, R. & Marchitto, E., Integrating support vector
machines and neural networks, Neural Netw., vol. 20, no. 5, pp. 590–597,
Jul. 2007.
[9] Negnevitsky, M., Artificial Intelligence: A Guide to Intelligent Systems, 2nd
ed. Addison-Wesley, 2004.
[10] Witten, I. H. & Frank, E., Data Mining: Practical Machine Learning Tools
and Techniques, 2nd ed. Morgan Kaufmann, 2005.
[11] Chu, N., Ma, L., Li, J., Liu, P. & Zhou, Y., Rough set based feature
selection for improved differentiation of traditional Chinese medical data,
2010 Seventh International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD), vol. 6, pp. 2667–2672, 2010.
[12] Babaoglu, İ., Findik, O. & Ülker, E., A comparison of feature selection
models utilizing binary particle swarm optimization and genetic algorithm in
determining coronary artery disease using support vector machine, Expert
Syst. Appl., vol. 37, no. 4, pp. 3177–3183, Apr. 2010.
[13] Shilaskar, S. & Ghatol, A., Feature selection for medical diagnosis:
Evaluation for cardiovascular diseases, Expert Syst. Appl., vol. 40, no. 10,
pp. 4146–4153, Aug. 2013.
[14] Nahar, J., Imam, T., Tickle, K. S. & Chen, Y.-P. P., Computational
intelligence for heart disease diagnosis: A medical knowledge driven
approach, Expert Syst. Appl., vol. 40, no. 1, pp. 96–104, Jan. 2013.
[15] Guan, D., Yuan, W., Jin, Z. & Lee, S., Undiagnosed samples aided rough
set feature selection for medical data, 2012 2nd IEEE International
Conference on Parallel Distributed and Grid Computing (PDGC), pp. 639–644,
2012.
[16] UCI, Heart disease dataset. Online:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleve.mod.
[17] Platt, J. C., Sequential Minimal Optimization: A Fast Algorithm for Training
Support Vector Machines, Advances In Kernel Methods - Support Vector
Learning, 1998.
[18] Wu, X. & Kumar, V., The top ten algorithms in data mining. Boca Raton:
CRC Press, 2009.
[19] Freund, Y. & Schapire, R. E., Experiments with a New Boosting Algorithm,
Proceedings of the Thirteenth International Conference on Machine Learning
(ICML), 1996.
[20] Walpole, R. E., Myers, R. H. & Ye, K. E., Probability & statistics for
engineers & scientists. Boston: Prentice Hall, 2012.