Heart disease prediction using machine learning techniques: a survey
V.V. Ramalingam*, Ayantan Dandapath, M Karthik Raja
1,2Department of Computer Science & Engineering, SRM Institute of Science and Technology
*Corresponding Author E-mail: [email protected]


Heart related diseases, or Cardiovascular Diseases (CVDs), have been the main reason for a huge number of deaths in the world over the
last few decades and have emerged as the most life-threatening diseases, not only in India but in the whole world. So, there is a need
for a reliable, accurate and feasible system to diagnose such diseases in time for proper treatment. Machine Learning algorithms and
techniques have been applied to various medical datasets to automate the analysis of large and complex data. Many researchers, in
recent times, have been using several machine learning techniques to help the health care industry and professionals in the diagnosis
of heart related diseases. This paper presents a survey of various models based on such algorithms and techniques and analyzes their
performance. Models based on supervised learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbour (KNN),
Naïve Bayes, Decision Trees (DT), Random Forest (RF) and ensemble models are found to be very popular among researchers.

Keywords: Cardiovascular Diseases; Support Vector Machines; K-Nearest Neighbour; Naïve Bayes; Decision Tree; Random Forest; Ensemble Models.

1. Introduction

The heart is a vital organ of the human body. It pumps blood to every part of our anatomy. If it fails to function correctly, the brain and various other organs will stop working, and within a few minutes the person will die. Changes in lifestyle, work-related stress and bad food habits contribute to the increase in the rate of several heart related diseases.
Heart diseases have emerged as one of the most prominent causes of death all around the world. According to the World Health Organisation, heart related diseases are responsible for taking 17.7 million lives every year, 31% of all global deaths. In India too, heart related diseases have become the leading cause of mortality [1]. Heart diseases killed 1.7 million Indians in 2016, according to the 2016 Global Burden of Disease Report, released on September 15, 2017. Heart related diseases increase the spending on health care and also reduce the productivity of an individual. Estimates made by the World Health Organisation (WHO) suggest that India lost up to $237 billion from 2005 to 2015 due to heart related or cardiovascular diseases [2]. Thus, feasible and accurate prediction of heart related diseases is very important.
Medical organisations all around the world collect data on various health related issues. These data can be exploited using various machine learning techniques to gain useful insights. But the data collected is massive and, many a time, can be very noisy. These datasets, which are too overwhelming for human minds to comprehend, can be easily explored using various machine learning techniques. Thus, these algorithms have become very useful, in recent times, to predict the presence or absence of heart related diseases accurately.

2. Dimensionality Reduction

Dimensionality Reduction involves selecting a mathematical representation that captures most, but not all, of the variance within the given data, thereby including only the most significant information. The data considered for a task or a problem may consist of a lot of attributes or dimensions, but not all of these attributes may equally influence the output. A large number of attributes, or features, may affect the computational complexity and may even lead to overfitting, which leads to poor results. Thus, Dimensionality Reduction is a very important step considered while building any model. Dimensionality Reduction is generally achieved by two methods: Feature Extraction and Feature Selection.

A. Feature Extraction
In this, a new set of features is derived from the original feature set. Feature extraction involves a transformation of the features. This transformation is often not reversible, as some, or maybe much, useful information is lost in the process. In [3] and [4] Principal Component Analysis (PCA) is used for feature extraction. Principal Component Analysis is a popularly used linear transformation algorithm. In the feature space, it finds the directions that maximize variance and that are mutually orthogonal. It is a global algorithm that gives the best linear reconstruction (in the least-squares sense).

B. Feature Selection
In this, a subset of the original feature set is selected. In [5], key features are selected by CFS (Correlation based Feature Selection) Subset Evaluation combined with the Best First Search method to reduce dimensionality. In [6] the chi-square statistical test is used to select the most significant features.
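
As an illustration of chi-square based feature selection, the sketch below scores categorical features against a class label with the Pearson chi-square statistic and ranks them. It is a minimal example and not the procedure of [6]: the function name chi_square_score, the feature names and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a categorical feature and class labels."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    score = 0.0
    for f_value, f_count in feature_counts.items():
        for c_value, c_count in label_counts.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_count * c_count / n
            score += (observed[(f_value, c_value)] - expected) ** 2 / expected
    return score

# Hypothetical toy data: two binary features and a binary disease label.
labels        = [1, 1, 1, 1, 0, 0, 0, 0]
informative   = [1, 1, 1, 0, 0, 0, 0, 1]  # mostly agrees with the label
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # evenly spread across both classes

scores = {
    "informative": chi_square_score(informative, labels),
    "uninformative": chi_square_score(uninformative, labels),
}
# A higher score suggests a stronger association with the label.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Features with the highest scores would be kept and the rest discarded; how many to keep is a modelling choice that this sketch leaves open.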

3. Algorithms and Techniques Used

A. Naïve Bayes
Naïve Bayes is a simple but effective classification technique based on Bayes' Theorem. It assumes independence among predictors, i.e., the attributes or features should not be correlated with one another or related to each other in any way. Even if there is a dependency, all these features or attributes are still treated as contributing independently to the probability, and that is why it is called Naïve.
In [7], Naïve Bayes achieved an accuracy of 84.1584% with the 10 most significant features, which are selected using SVM-RFE (Recursive Feature Elimination) and gain ratio algorithms, whereas in [8] Naïve Bayes achieved an accuracy of 83.49% when all 13 attributes of the Cleveland dataset [25] are used.

B. Support Vector Machine
Support Vector Machine is an extremely popular supervised machine learning technique (i.e., one trained with a pre-defined target variable) which can be used as a classifier as well as a predictor. For classification, it finds a hyper-plane in the feature space that differentiates between the classes. An SVM model represents the training data points as points in the feature space, mapped in such a way that points belonging to separate classes are segregated by a margin as wide as possible. The test data points are then mapped into that same space and are classified based on which side of the margin they fall.

Fig. 1: Support Vector Machine

Shan Xu et al. have used SVM to achieve an accuracy of 98.9% on the People's Hospital dataset [5]. In [9], SVM performs the best with 85.7655% of correctly classified instances, and in [10] SVM is used with a boosting technique to give an accuracy of 84.81%. Houda Mezrigui et al. have used SVM to attain an f-measure value of 93.5617 [11]. In [12] SVM classifies the pixel variation with an accuracy of 92.1%, helping to identify the affected region accurately.

C. K-Nearest Neighbour
In 1951, Hodges et al. introduced a nonparametric technique for pattern classification which is popularly known as the K-Nearest Neighbour rule [13]. The K-Nearest Neighbour technique is one of the most elementary but very effective classification techniques. It makes no assumptions about the data and is generally used for classification tasks when there is very little or no prior knowledge about the data distribution. This algorithm involves finding the k nearest data points in the training set to the data point for which a target value is unavailable and assigning the average value of the found data points to it (for classification, the most common class among them).
In [10] KNN gives an accuracy of 83.16% when the value of k is equal to 9, using 10-fold cross-validation. In [14] KNN with Ant Colony Optimization performs better than other techniques with an accuracy of 70.26% and an error rate of 0.526. Ridhi Saini et al. have obtained an efficiency of 87.5% [15].

D. Decision Tree
Decision tree is a type of supervised learning algorithm. This technique is mostly used in classification problems. It works with both continuous and categorical attributes. This algorithm divides the population into two or more similar sets based on the most significant predictors. The Decision Tree algorithm first calculates the entropy of each and every attribute. Then the dataset is split with the help of the variables or predictors with maximum information gain or minimum entropy. These two steps are performed recursively with the remaining attributes.

Fig. 2: Decision Tree

In [10] decision tree has the worst performance with an accuracy of 77.55%, but when decision tree is used with a boosting technique it performs better with an accuracy of 82.17%. In [9] decision tree performs very poorly, with a correctly classified instance percentage of 42.8954%, whereas [16] uses the same dataset but the J48 algorithm for implementing Decision Trees, and the accuracy thus obtained is 67.7%, which is low but still an improvement on the former. Renu Chauhan et al. have obtained an accuracy of 71.43% [17]. M.A. Jabbar et al. have used alternating decision trees with principal component analysis to obtain an accuracy of 92.2% [18]. Kamran Farooq et al. have achieved the best results using a decision tree-based classifier combined with forward selection, which achieves a weighted accuracy of 78.4604% [19].
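
The entropy and information gain computation that a decision tree uses to choose its split can be sketched as follows. This is a minimal illustration for categorical attributes, not the implementation used in any of the surveyed papers: the function names, the attribute names chest_pain and smoker, and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(features, labels):
    """Name of the attribute with maximum information gain."""
    return max(features, key=lambda name: information_gain(features[name], labels))

# Hypothetical toy data: two binary attributes and a binary disease label.
labels = [1, 1, 0, 0]
features = {
    "chest_pain": [1, 1, 0, 0],  # separates the classes perfectly
    "smoker":     [1, 0, 1, 0],  # carries no information about the label
}
gain_chest_pain = information_gain(features["chest_pain"], labels)
gain_smoker = information_gain(features["smoker"], labels)
chosen = best_split(features, labels)
print(chosen, gain_chest_pain, gain_smoker)
```

A full tree would apply best_split recursively to each resulting subset with the remaining attributes, stopping when a subset is pure or no attributes are left.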
E. Random Forest
Random Forest is also a popular supervised machine learning algorithm. This technique can be used for both regression and classification tasks but generally performs better in classification tasks. As the name suggests, the Random Forest technique considers multiple decision trees before giving an output. So, it is basically an ensemble of decision trees. This technique is based on the belief that a greater number of trees would converge to the right decision. For classification, it uses a voting system to decide the class, whereas in regression it takes the mean of the outputs of all the decision trees. It works well with large datasets with high dimensionality.

[...] diseases, but there is still a lot of scope for research on how to handle high dimensional data and overfitting. A lot of research can also be done on the correct ensemble of algorithms to use for a particular type of data.

5. Acknowledgment

We sincerely thank the staff of SRM Institute of Science and Technology, who have provided their immense support and guidance throughout the project.
