Abstract:
Cardiovascular disease is one of the leading health concerns worldwide, and its prevalence continues
to rise. Predicting it in time and taking the necessary steps for intervention is crucial. Predicting
cardiac disease precisely is a challenging task for both humans and software applications. The complexity
of the cardiovascular system compels the use of Artificial Intelligence (AI) to find the solution.
Machine learning techniques (a subset of artificial intelligence) have contributed substantially to
medical science by helping to answer clinical questions. Computer scientists have used different
machine-learning methods for the identification of cardiac disease. This study aims to improve the
accuracy of cardiac disease prediction in order to reduce risk. It proposes a hybrid ensemble
framework to analyze the cardiac data based on essential features for optimum prediction results.
This ensemble framework uses multiple machine-learning classification methods to approach the
optimal solution. This study uses the open-access Cleveland dataset to evaluate the
performance of well-known classification techniques: Decision Tree, Naive Bayes, SVM, KNN,
logistic regression, RF, Gradient Boosting, and the XGB classifier. Based on this analysis, it
proposes a Hybrid Ensemble Framework (HEF) to improve on their results. The proposed method achieves
its best results with the Adaptive Boosting (AdaBoost) ensemble technique: AdaBoost, with tuned
hyperparameters, is applied on top of the applied ML methods and yields higher accuracy. The accuracy of this
proposed method is evaluated using an open-access Cleveland dataset, which has various cardiac
modalities, clinical records, and physiological measurements. Our proposed Hybrid Ensemble
Framework achieved an accuracy of 91.80%, a precision of 0.94, a recall of 0.93, an F1-score of 0.92,
and a macro average of 0.92. The other machine-learning algorithms scored lower than our model.
Results from previous studies are also compared to show the improvement achieved by the
proposed technique. Moreover, this technique opens new doors for real-world clinical solutions, and
it advances the cardiac disease risk stratification field by introducing an innovative and applicable
approach that merges ML and ensemble methods. The HEF enhances prediction accuracy and
provides valuable insights into the key factors influencing cardiac disease risk, ultimately
facilitating more informed clinical decision-making. Our findings underscore the potential of this
hybrid ensemble framework as a valuable tool for improving the detection and management of
cardiac diseases, ultimately reducing the burden of CVD (cardiovascular disease) on healthcare
systems and society.
Keywords: Heart disease prediction; Machine learning; Ensemble classifier; Hybrid Technique;
Decision Tree; Naive Bayes; SVM; KNN; logistic regression; RF; Gradient Boosting; XGB
I-Introduction:
Cardiac disease is the leading cause of death not only in developing countries but also in developed
nations, posing a challenge to healthcare strategies worldwide. In recent years, heart disease and
cancer have been the primary causes of death, making up around 43.5% of all deaths during this
period. A recent report from the American Heart Association reveals that cardiovascular disease
claims more lives annually than all types of cancer and chronic lower respiratory diseases combined
[1]. The prediction and outcomes of cardiovascular disease have notably improved over the past
few years, thanks to innovations in technology and techniques. Notably, machine learning,
particularly with neural networks, is rapidly evolving and reshaping the processes of
cardiology [2,3]. This advancement opens new doors to enhance various aspects of cardiovascular
procedures. The rapid adoption of machine learning is driven by the exponential growth of data,
particularly in the cardiac field. The most common symptoms of cardiac disease are usually felt
in the chest: pain, tightness, pressure, and discomfort. In some patients, a heart attack is also
accompanied by coldness in the arms or legs, numbness, shortness of breath, and
weakness when blood circulation to those parts of the body is blocked or reduced [4].
When discussing risk factors, we mean things that increase the chance of getting sick. If we can fix
or remove these factors, we can lower the risk of getting sick. This study focuses on the main risk
factors for heart disease, such as age, sex, lifestyle factors (smoking, physical activity, alcohol, and
stress), and metabolic syndrome factors (insulin resistance, dyslipidemia, abdominal obesity, high
blood pressure, and dietary factors). Exposure to these factors increases the chance of developing
heart disease, while removing or improving them decreases that risk.
The dataset consists of multiple features, and working with all of them may not be helpful and could
lead to inaccurate results. Therefore, this study aims to increase the accuracy of cardiac disease
diagnosis using a combination of classification and feature selection. We choose the most essential
features in the dataset and, for data preparation, divide the dataset into training and testing sets.
We then classify the records using eight machine learning techniques: K-nearest neighbor (KNN),
Random Forest, Naïve Bayes, Decision Tree, Logistic Regression, Support Vector Machine, Gradient
Boosting, and Extreme Gradient Boosting (XGB). By combining these methods, we have improved the
accuracy of heart disease diagnosis. To the best of our knowledge, these eight machine-learning
techniques have not previously been combined in this way. The proposed method has two main benefits:
it reduces the number of features used and increases the accuracy of diagnosis.
Cardiac disease is responsible for a growing number of deaths worldwide every year. Computer
scientists are using machine learning and data mining methods to address healthcare problems and
to assist medical professionals in diagnosing and identifying diseases. Machine learning techniques
can identify patterns, relationships, and knowledge that ordinary statistical methods may miss [5].
Machine learning is now widely used for analyzing organic compounds, healthcare,
and weather forecasting. It is becoming more important in the healthcare sector and is crucial for
predictive analysis [6-7]. Machine learning and data mining techniques are used to predict and analyze
stroke, diabetes, cancer, and heart disease, and they have proven effective in diagnosing and predicting
cardiovascular issues [8]. Some well-known machine learning techniques have shown strong
performance, such as KNN (K-Nearest-Neighbor), commonly used in data mining for pattern
recognition and classification. The research performed by Paris et al. has shown enhanced results by
Vol. 30 No. 18 (2023): JPTCP (1336-1353) Page | 1337
A Hybrid Ensemble Framework For Cardiac Disease Risk Stratification With Machine Learning
the voting technique used in the ensemble compared with the individual classifiers [9]. The study also
includes KNN for diagnosing heart disease on a benchmark dataset, which facilitates comparisons with
other data mining techniques employed on the same dataset. Additionally, the effectiveness of KNN
can be improved by incorporating voting. We aim to provide a framework for predicting heart
disease that uses diverse machine-learning techniques and caters to medical experts and
researchers, with an emphasis on cardiac imaging and modalities. This research also evaluates the
accuracy loss for each fold to measure the framework's efficiency on benchmark datasets.
Machine-learning algorithms are used to classify different types of data and, in particular, to
estimate the probability of disease [10]. They provide a way to predict the class of new samples
from a training dataset [11]. This type of classification is known as supervised classification,
in which the model is trained on labeled data [12-13]. The proposed model uses a classification
technique to diagnose heart disease using an objective clinical dataset of cardiac patients.
In this study, we have used eight different linear and non-linear machine learning techniques to
predict heart disease and to identify the best method. The machine learning classifiers that we have used
are support vector machine (SVM), logistic regression (LR), k-nearest neighbors (KNN), decision
tree (DT), random forest (RF), gradient boosting classifier (GB), naïve Bayes (NB), and extreme
gradient boosting (XGB). The main objectives of this study include:
To build a system for the prediction of cardiac disease
To identify the prime attributes of the dataset for the enhancement of results
To propose a state-of-the-art framework using Ensemble machine-learning techniques
The rest of this paper is organized as follows. Section 2 reviews the work done by previous
researchers in this area. Section 3 describes the research methodology in detail, including
how the data are normalized, preprocessed, and analyzed. Section 4 presents the results,
with comparison tables for each machine learning technique used, and section 5 gives the
conclusion along with directions for future work.
II-Literature Review
Computer scientists have been applying statistical analysis to the data collected by healthcare
professionals. They use different data mining techniques to investigate the issues raised by
clinicians. Medical experts provide the data to scientists for classification and prediction of diseases.
Data scientists have produced numerous studies on the prediction of heart disease and have identified
the risk factors related to it. The most common factors that cause heart attacks are
diabetes, smoking, blood pressure, cholesterol, hypertension, age, and family history of heart
disease [14].
Several machine-learning techniques are used to predict cardiovascular disease. Data analysts have
applied different machine learning methods to identify heart disease, often on different datasets.
The prediction results of the various researchers cannot be
compared directly because they have used different datasets and techniques. However, over time, standard
datasets have come into focus and have been investigated with a range of machine-learning techniques.
Computer scientists have used different modalities and features in the available datasets to improve
results, and have modified the preprocessing techniques and the ways of refining the datasets [15].
The authors of [16] chose thirteen features and three strategies for predicting heart disease: they
used an artificial neural network (ANN), Multivariate Adaptive Regression Splines (MARS), and Logistic
Regression (LR) to build a hybrid framework. In [17], the
result was improved by using Pearson correlation coefficients for the missing values in the Cleveland
dataset. A genetic algorithm has also been used for heart disease prediction along with other machine
learning algorithms, obtaining an accuracy of 89% on a dataset of fifty patients [18]. Data
analysis techniques work in different ways. Most of them focus on neighboring data points and
classify based on similarity [19-20]. The study in [21] explored the linear regression
technique and obtained 87.1% accuracy. R. Perumal et al. [22] used linear regression and SVM
models on the Cleveland dataset and achieved 87% accuracy. Some authors have used
feature selection models to enhance the accuracy of results [23]. The Orange and Weka data analysis
tools have also been used to obtain prediction results [24]. Mohan et al. [25] combined linear regression
and random forest to improve the results. In [26], the authors used decision tree, Gaussian
NB, and linear regression models to reach an accuracy of 82.75%. In another study, particle swarm
optimization was adopted to select features [27].
III- Methodology
A – Dataset and Preprocessing
This study uses the standard Cleveland heart disease dataset from the Cleveland
Clinic Foundation [28]. This benchmark dataset has 76 attributes that store patient information of
different types. However, most analysts have used only 13 of the 76 features, because
these are the key attributes through which the presence of heart
disease can be predicted. It has 303 patient records, each labeled by the Target column,
which contains 0 for absence of heart disease and 1 for presence. The AGE field holds the
patient's age; gender is represented by the SEX attribute, with 1 for males and 0 for females.
The chest pain attribute CP has four values, fasting blood sugar FBS has two values, the resting
electrocardiogram RESTECG has 3 classes, and exercise-induced angina EXANG has two values. The
slope of the ST segment (SLOPE) has 3 values. The other four fields are numerical: resting blood pressure
TRESTBPS, cholesterol CHOL, OLDPEAK, and AGE, as shown in figure 1.
After uploading the CSV file of the Cleveland dataset to Google Colab, we cleaned it by
removing missing values and duplicate records. Preprocessing of the dataset is essential for better
performance of machine-learning techniques, so we separated the categorical and numerical values.
Categorical data is converted to numerical values as shown in figure 2.
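The cleaning and encoding steps described above can be sketched as follows. This is a minimal illustration, not the authors' notebook: the tiny data frame is made up, the column names follow the common lowercase naming of the Cleveland CSV, and one-hot encoding is an assumption, since the text does not say how the categorical values were converted.

```python
import pandas as pd

# Hypothetical five-row sample standing in for the Cleveland CSV.
df = pd.DataFrame({
    "age": [63, 63, 41, 57, 56],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 3, 1, 0, 1],                       # categorical: chest pain type
    "chol": [233.0, 233.0, 204.0, None, 294.0],  # one missing value
    "target": [1, 1, 1, 0, 0],
})

# Remove rows with missing values, then remove exact duplicate records.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

# Separate categorical from numerical columns (assumed split for this sample).
categorical_cols = ["cp"]
numerical_cols = ["age", "chol"]

# Assumption: categorical columns are one-hot encoded into 0/1 indicator columns.
encoded = pd.get_dummies(clean, columns=categorical_cols, dtype=int)
print(encoded)
```

In this sample the row with the missing cholesterol value and the repeated first record are dropped, leaving three records, and the cp column is replaced by one indicator column per observed chest pain type.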
The classification methods are assessed through several performance indices:
precision, sensitivity, accuracy, and F1 score. Accuracy reflects the
overall performance of the machine-learning model. Recall, also called sensitivity, measures the
model's ability to find all objects of the target class: it is the ratio of the number
of correct positive predictions to the total number of actual positive instances. Precision measures
the proportion of the model's positive predictions that are correct. The F1 score is the
harmonic mean of sensitivity and precision. Mathematically, these metrics
are represented by the following expressions:
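\[
\text{Classification Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\]
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{2}
\]
\[
\text{Sensitivity} = \frac{TP}{TP + FN} \tag{3}
\]
\[
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} = \frac{2TP}{2TP + FP + FN} \tag{4}
\]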
where FN, TN, FP, and TP represent the numbers of false negatives, true negatives, false positives,
and true positives, respectively.
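As a worked illustration of expressions (1) to (4), the short sketch below computes the four metrics from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, sensitivity, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for 100 test records (not from the paper).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 85/100, the precision is 40/45, the sensitivity is 40/50, and the F1 score is 80/95, which equals the harmonic mean of precision and sensitivity.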
After the conversion of categorical data, feature scaling is performed on the numerical data using
standard scaling so that the features share a comparable range. The next step is to split the
dataset into training and testing sets: 80% for training and 20% for testing.
The eight chosen classifiers were then applied to the dataset, and the optimum accuracy of each was
found by tuning certain hyperparameters, as shown in Figure 3.
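The scaling, splitting, and classifier comparison described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the data are synthetic (generated with make_classification to match the 303 records and 13 features), all hyperparameters are library defaults rather than the tuned values of the study, and XGB is left out because it lives in the separate xgboost package. The sketch also fits the scaler on the training split only, a common precaution against information leaking from the test set, whereas the text describes scaling before splitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset: 303 records, 13 features, binary target.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=0)

# 80% training, 20% testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Standard scaling: zero mean and unit variance, learned from the training part.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Seven of the eight classifiers named in the text, with default settings.
models = {
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "NB": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test_s))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

With 303 records, an 80/20 split leaves 242 records for training and 61 for testing, so each test accuracy is a multiple of 1/61.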
7. Naïve Bayes
The NB classifier is another machine learning algorithm used to predict a target class. It is based on
Bayes' theorem and uses probabilities for its calculation. It calculates the probabilities for each attribute in a
dataset, treating the attributes as conditionally independent given the class (the same assumption it
makes for bag-of-words (BOW) text features), to predict the class. The accuracy
of this model is 85.24%. The working performance of the model is represented by multiple metrics
like ROC (Receiver Operating Characteristic), PRC (Precision-Recall Curve), and CM (Confusion
Matrix), as shown in Figure 9, and its classification report is reflected in Table 8.
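To make the per-attribute probability calculation concrete, the following is a from-scratch sketch of a Gaussian Naïve Bayes classifier. It is illustrative only: the text does not state which Naïve Bayes variant was used, the function names are hypothetical, and the four training rows (age, resting blood pressure) are made-up values.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior under independence."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training rows: [age, resting blood pressure]; 1 = disease present.
X_toy = [[45, 120], [50, 130], [62, 160], [67, 170]]
y_toy = [0, 0, 1, 1]
nb_model = fit_gaussian_nb(X_toy, y_toy)

print(predict_gaussian_nb(nb_model, [48, 125]))  # close to the class-0 rows
print(predict_gaussian_nb(nb_model, [65, 165]))  # close to the class-1 rows
```

In practice a library implementation such as scikit-learn's GaussianNB would normally be preferred; the sketch only shows where the per-attribute probabilities enter the decision.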
IV. Results
This study aims to predict cardiovascular disease using machine learning techniques. We have
proposed a Hybrid Ensemble Framework (HEF) that has shown the best results
among the machine-learning techniques evaluated, as shown in Table 10. We applied eight machine-learning
classifiers to the open-access Cleveland dataset, and our HEF method outperformed
all of them. The HEF uses an Adaptive Boosting (AdaBoost) ensemble classifier in conjunction with
Logistic Regression to improve the accuracy. Our framework achieved an accuracy of 91.80%,
surpassing all other ML models. We used the scikit-learn library to import the classifiers for
heart disease prediction. Logistic Regression reached the highest accuracy, 88.52%, among the
eight models. Naïve Bayes followed with 85.24%, while the Support Vector
Machine, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting, and Extreme
Gradient Boosting achieved accuracies between 78.68% and 83.60%. We used Logistic Regression as the
base classifier in the AdaBoost ensemble classifier; the maximum accuracy
occurs at n_estimators = 158 and learning_rate = 1.0. The efficiency of our HEF framework
is reflected in Figure 11, which shows the coefficients of the Logistic Regression classifiers in
AdaBoost. The notable margin in accuracy between HEF and the other ML techniques makes it a stronger
candidate for real-world applications, especially for predicting cardiac disease.
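A minimal sketch of the ensemble configuration described above, AdaBoost with Logistic Regression as the base classifier, n_estimators = 158, and learning_rate = 1.0, is given below. It is not the authors' code: the data are synthetic, max_iter and the random seeds are assumptions, and the scores it prints do not reproduce the reported 91.80%. The base classifier is passed positionally because the keyword for it differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset: 303 records, 13 features, binary target.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features using statistics learned from the training part only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline: a single Logistic Regression model.
baseline = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test_s))

# Ensemble: AdaBoost over Logistic Regression with the hyperparameters
# quoted in the text (n_estimators=158, learning_rate=1.0).
hef = AdaBoostClassifier(
    LogisticRegression(max_iter=1000),
    n_estimators=158,
    learning_rate=1.0,
    random_state=42,
)
hef.fit(X_train_s, y_train)
y_pred = hef.predict(X_test_s)
hef_acc = accuracy_score(y_test, y_pred)

print(f"Logistic Regression alone: {baseline_acc:.4f}")
print(f"AdaBoost + Logistic Regression: {hef_acc:.4f}")
```

Boosting may stop before all 158 rounds if a round's weighted error becomes too high, so the fitted ensemble can contain fewer estimators than requested.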
The accuracy of each model is shown in Table 10 to reflect the noticeable difference
between the proposed and existing models.
The working performance of the model is represented by multiple metrics like ROC (Receiver
Operating Characteristic), PRC (Precision-Recall Curve), and CM (Confusion Matrix), as shown in
Figure 11, and its classification report is reflected in Table 11.
The accuracies of the models are compared graphically in the bar chart in Figure 12.
V. Conclusion
This study has shown that enhanced accuracy can be achieved by using the HEF model, and that heart
disease can be predicted with fewer errors. The comparative analysis has shown
the importance of machine-learning techniques in healthcare decision-making. Our
Hybrid Ensemble Framework, which combines AdaBoost with Logistic Regression, has demonstrated its
superiority over the other machine-learning models with an accuracy of 91.80%, and it performed
well in risk stratification. The findings of our research show the
effectiveness of ensemble methodologies and the promise of AdaBoost in enhancing
the results of base classifiers. In the future, the accuracy of results may be improved by using deep
learning techniques.
References:
1. S.S. Virani, et al., Heart disease and stroke statistics: 2021 update, Circulation 143 (8) (2021)
e254–e743.
2. Chu, M., Wu, P., Li, G., Yang, W., Gutiérrez-Chico, J. L., & Tu, S. (2023). Advances in
Diagnosis, Therapy, and Prognosis of Coronary Artery Disease Powered by Machine Learning
Algorithms. JACC: Asia, 3(1), 1-14.
3. Al’Aref, S. J., Singh, G., Choi, J. W., Xu, Z., Maliakal, G., van Rosendael, A. R., ... & Min, J. K.
(2020). A boosted ensemble algorithm for determination of plaque stability in high-risk patients
on coronary CTA. Cardiovascular Imaging, 13(10), 2162-2173.
4. Sowmiya C, Sumitra P. Analytical study of heart disease diagnosis using classification
techniques. In: IEEE international conference on intelligent techniques in control, optimization
and signal processing (INCOS), March 2017; 2017. p. 23–5.
https://doi.org/10.1109/ITCOSP.2017.8303115.
5. J. Han and M. Kamber, "Data Mining Concepts and Techniques," Morgan Kaufmann Publishers,
2006.
6. I.-N. Lee, S.-C. Liao, and M. Embrechts, "Data mining techniques applied to medical
information," Med. inform, 2000.
7. M. K. Obenshain, "Application of Data Mining Techniques to Healthcare Data," Infection
Control and Hospital Epidemiology, 2004.
8. S. C. Liao and I. N. Lee, "Appropriate medical data categorization for data mining classification
techniques," MED. INFORM, vol. 27, no. 1, pp. 59–67, 2002.
9. I. H. M. Paris, L. S. Affendey, and N. Mustapha, "Improving Academic Performance Prediction
using Voting Technique in Data Mining," World Academy of Science, Engineering and
Technology, 2010.
10. M. Diwakar, A. Tripathi, K. Joshi, M. Memoria, P. Singh, and N. Kumar, “Latest trends on heart
disease prediction using machine learning and image fusion,” Mater. Today Proc., vol. 37, no.
Part 2, pp. 3213–3218, 2020, doi: 10.1016/j.matpr.2020.09.078.
11. C. J. Harrison and C. J. Sidey-Gibbons, “Machine learning in medicine: a practical introduction
to natural language processing,” BMC Med. Res. Methodol., vol. 21, no. 1, pp. 1–18, 2021, doi:
10.1186/s12874-021-01347-1
12. M. Pérez-Ortiz, S. Jiménez-Fernández, P. A. Gutiérrez, E. Alexandre, C. Hervás-Martínez, and
S. Salcedo-Sanz, “A review of classification problems and algorithms in renewable energy
applications,” Energies, vol. 9, no. 8, pp. 1–27, 2016, doi: 10.3390/en9080607.
13. S. Pandey, M. Supriya, and A. Shrivastava, “Data Classification Using Machine Learning
Approach,” no. June, 2018, doi: 10.1007/978-3-319-68385-0.
14. L. Shahwan-Akl, "Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in
Melbourne," International Journal of Research in Nursing, 2010.
15. I. H. M. Paris, L. S. Affendey, and N. Mustapha, "Improving Academic Performance Prediction
using Voting Technique in Data Mining," World Academy of Science, Engineering and
Technology, 2010.
16. Shao YE, Hou CD, Chiu CC. Hybrid intelligent modeling schemes for heart disease
classification. Appl Soft Comput 2014; 14:47–52.
17. Yekkala I, Dixit S, Jabbar MA. Prediction of heart disease using ensemble learning and
Particle Swarm Optimization. In: 2017 international conference on smart technologies for smart
nation (Smart Tech Con), August 2017. IEEE; 2017. p. 691–8.
18. Amin SU, Agarwal K, Beg R. Genetic neural network-based data mining in the prediction
of heart disease using risk factors. In: 2013 IEEE conference on information & communication
technologies, April 2013. IEEE; 2013. p. 1227–31.
19. Tan PN, Chawla S, Ho CK, Bailey J, editors. Advances in knowledge discovery and data
mining, Part II: 16th Pacific-Asia conference, PAKDD 2012, Kuala Lumpur, Malaysia, May 29-
June 1, 2012, Proceedings, Part II, vol. 7302. Springer; 2012.
39. Shouman, Mai & Turner, Timothy & Stocker, Rob. (2012). Applying k-Nearest Neighbour in
Diagnosing Heart Disease Patients. International Journal of Information and Education
Technology. 2. 220-223. 10.7763/IJIET.2012.V2.114.
40. Sharma, A.; Mishra, P.K. Performance Analysis of Machine Learning Based Optimized Feature
Selection Approaches for Breast Cancer Diagnosis. Int. J. Inf. Technol. 2022, 14, 1949–1960.
41. M. Diwakar, A. Tripathi, K. Joshi, M. Memoria, P. Singh, and N. Kumar, “Latest trends on heart
disease prediction using machine learning and image fusion,” Mater. Today Proc., vol. 37, no.
Part 2, pp. 3213–3218, 2020, doi: 10.1016/j.matpr.2020.09.078.