Heart Disease Prediction Using Machine Learning


10 IV April 2022

https://doi.org/10.22214/ijraset.2022.40918
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue IV Apr 2022- Available at www.ijraset.com

Heart Disease Prediction Using Machine Learning


Karanam Sai Jagadeesh1, Raghavendra R2
1 PG Student, Department of Master of Computer Applications, School of CS & IT, JAIN (Deemed-to-be-University)
2 Assistant Professor, School of CS & IT, JAIN (Deemed-to-be-University)

Abstract: Heart disease prediction is one of the most critical tasks in modern medicine, and in recent times a large number of
people have died from heart disease. Machine learning plays an important role in training and testing models on the huge amount
of data available in the medical field. Building and evaluating a prediction process is crucial to help prevent heart disease and to
alert patients before they suffer from it. This research predicts the chance of heart disease and states whether or not the patient
has heart disease by implementing different machine learning techniques such as Decision Tree and Logistic Regression. Results
are obtained, and comparative experiments show that the proposed approach can be used to give a prediction to the patient.
Keywords: Machine Learning, Heart Disease, Logistic Regression, Heart Risk, Classification Algorithm.

I. INTRODUCTION
This work focuses mainly on the methods employed in heart disease prediction. The heart plays a central role in the human body:
it pumps blood to the whole body, and if it cannot circulate blood properly the consequences for the body are severe. Many factors
can contribute to heart disease and raise the chance of a heart stroke. Today, heart disease is one of the primary causes of death,
driven by a luxurious and unhealthy lifestyle that includes heavy alcohol consumption, fast and fatty food, smoking and stress.(1)
According to the World Health Organization, lakhs of people suffer from heart disease every year and many of them lose their lives.
Good and healthy habits adopted early can protect against heart disease. There is a clear need for a prediction system that can also
help poor patients and save lives. This proposed work attempts to evaluate heart disease at an early stage to avoid such losses. In
the medical field, machine learning algorithms and techniques can be used to predict various heart diseases. The main goal of this
model is to provide a tool that lets doctors detect heart disease at an early stage, so that patients can be diagnosed and protected
sooner. (1)

II. LITERATURE REVIEW


Bo Jin and Chao Che (2018) introduced “Predicting the Risk of Heart Disease With EHR”, a model designed by applying artificial
neural networks. The paper used electronic health record data from real-world datasets on patients' heart disease to perform the
analysis and predict heart disease. The authors used one-hot encoding to model diagnosis events and predicted heart failure events
using the basic principles of a long short-term memory network model. Their analysis of the results highlights the importance of
respecting the sequential nature of the records.(2)
Fahd Saleh designed and introduced an ML model comparing five different algorithms. The RapidMiner tool was used, which
resulted in higher accuracy than the Matlab and Weka data mining tools. The study compared the results of the Decision Tree,
Logistic Regression, Random Forest, Naive Bayes and SVM classification algorithms, and the Decision Tree algorithm achieved
the highest accuracy.(3)
Anjan Nikhil Repaka et al. proposed a system that uses NB (Naïve Bayesian) techniques for classification of the dataset and the
AES (Advanced Encryption Standard) algorithm for secure data transfer in disease prediction.(4)
K. Prasanna Lakshmi and Dr. C.R.K. Reddy (2015) created and implemented a model called fast rule-based heart disease prediction
using associative classification. The authors used the chi-square test together with associative techniques to predict the disease.(5)
M. Satish et al. used Naïve Bayes and Decision Tree models to predict heart disease and named the resulting model the pure
classifier association rule. A heart disease data warehousing dataset was used for this model.(6)
Aakash Chauhan (2018) introduced “Heart Disease Prediction using Evolutionary Rule Learning”. This study reduces manual work
and helps extract information directly from electronic records. To extract the rules, frequent pattern growth association mining is
applied to the patients' dataset. This helps reduce the cost of services, and the study shows that the majority of the rules help in
predicting heart disease accurately.(7)

©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 940

Ashir Javeed (2017) introduced “A Effective Learning System based on Random Search Algorithm To Detect Heart Disease”. This
work uses a random search algorithm for feature selection and a random forest model for diagnosing cardiovascular disease and
heart stroke from a patient dataset. The model is optimized mainly using a grid search algorithm. (8)
Two kinds of experiments are used in this cardiovascular disease prediction. In the first, a conventional random forest model is
developed and used for prediction; in the second, the proposed random forest model based on the Random Search Algorithm is
developed. This methodology is efficient and less complex than the conventional random forest model, and it achieves 3.3% higher
accuracy than the conventional random forest. The proposed learning system can help doctors improve the quality of heart failure
detection.(9)
In this project, a literature survey of machine learning techniques for heart disease prediction was carried out on the papers listed
above. These machine learning algorithms can give promising results and good accuracy for the prediction model.
The main aim of this paper is to predict heart disease/heart stroke in a patient using machine learning algorithms such as logistic
regression, with the prediction expressed as 0 or 1. The output is derived from the 14 attributes of the dataset, which are used to
train and test the model so that the disease is predicted accurately and efficiently.

III. PROPOSED MODEL


The proposed work predicts heart disease by exploring several classification algorithms and analysing their performance. The
objective of this study is to predict effectively whether the patient suffers from heart disease. The health professional enters the
input values from the patient's health report. The data is fed into a model which predicts the probability of having heart disease.
Fig. 1 shows the entire process involved.

Data Collection and Preprocessing: The dataset used is the Heart Disease dataset, which combines 4 different databases; only the
UCI Cleveland database was used. This database contains 76 attributes in total, but all published experiments refer to using a
subset of only 14 of them. We therefore used the UCI Cleveland dataset available on the Kaggle website for our analysis. The
complete description of the 14 attributes used in the proposed work is mentioned below.(10)
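For orientation, the following is a minimal sketch of how a 14-column table of this kind could be separated into 13 input features
and one target. The column names follow the naming commonly seen in Kaggle copies of the Cleveland data and are an assumption
here; the two rows are made-up placeholder values, not real patient records.

```python
import csv
import io

# Assumed column names (common Kaggle naming for the Cleveland subset);
# the last column is the 0/1 target.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Made-up placeholder rows, NOT real patient data.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
60,1,0,130,250,0,1,150,0,1.5,1,0,2,1
45,0,2,120,200,0,0,170,0,0.0,2,0,2,0
"""

def load_rows(text):
    """Return (features, targets): 13 floats per row plus an int label."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([float(row[name]) for name in COLUMNS[:-1]])
        targets.append(int(row["target"]))
    return features, targets

X, y = load_rows(SAMPLE_CSV)
print(len(X[0]), y)
```

With a real file, the same function could be fed the file's text instead of SAMPLE_CSV.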


A. Flask Web App


Flask is a micro web framework written in Python. It is classified as a microframework because it does not require particular tools
or libraries. It has no database abstraction layer, form validation, or any other components where pre-existing third-party libraries
provide common functions. In this work we built a Flask web application and used a pickle file to store the trained model.
“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.(11)
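The pickling round trip described above can be sketched with the standard library alone. The dictionary below is a hypothetical
stand-in for a trained model; its keys and values are assumptions made purely for illustration.

```python
import pickle

# Hypothetical stand-in for a trained model; any picklable Python object works.
model = {"algorithm": "logistic_regression",
         "weights": [0.4, -1.2, 0.7],
         "bias": 0.1}

# Pickling: object hierarchy -> byte stream.
payload = pickle.dumps(model)

# Unpickling: byte stream -> an equal, but separate, object hierarchy.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == model)
```

In a web application the byte stream would typically be written to a file once after training and loaded at start-up. Only files from
a trusted source should be unpickled, because unpickling can execute arbitrary code.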

B. Preprocess Dataset
Data preprocessing is a technique used to convert raw data into a clean data set. Whenever data is gathered from different sources,
it is collected in a raw format that is not suitable for analysis. Preprocessing is also an important step in data mining, as we cannot
work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.

C. Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-
output pairs. It infers a function from labeled training data consisting of a set of training examples.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a
supervised learning algorithm is to find a mapping function to map the input variable(x) with the output variable(y).
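As a toy illustration of learning a mapping from labelled input-output pairs, the sketch below uses a one-nearest-neighbour rule.
This technique is swapped in only because it is short; it is not one of the algorithms used in this work, and the training pairs are
made-up values.

```python
def fit_nearest_neighbour(examples):
    """'Training' for 1-nearest-neighbour just stores the labelled pairs."""
    return list(examples)

def predict(model, x):
    """Map input x to the label y of the closest stored training input."""
    nearest_x, nearest_y = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest_y

# Labelled training data: (input x, correct output y); values are illustrative.
training_pairs = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
f = fit_nearest_neighbour(training_pairs)

print(predict(f, 1.5), predict(f, 8.4))
```

Here f plays the role of the learned mapping function from x to y.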

D. Training and Testing


The attributes mentioned in Table 1 are provided as input to different ML classification algorithms such as Random Forest,
Decision Tree and Logistic Regression. The input dataset is split into a training dataset (80%) and a test dataset (the remaining
20%). The training dataset is used to train a model, and the testing dataset is used to check the performance of the trained model.
For each algorithm, the performance is computed and analysed using metrics such as accuracy, precision, recall and F-measure
scores, as described further.
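The split and the four metrics can be sketched in plain Python as follows. The helper names, the fixed seed and the example label
lists are assumptions for illustration; this is not the evaluation code used in this work.

```python
import random

def train_test_split_80_20(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) with an 80/20 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f_measure": f_measure}

# Ten placeholder record ids stand in for dataset rows.
train, test = train_test_split_80_20(range(10))

# Made-up true and predicted labels for five test records.
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(len(train), len(test), metrics)
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and the
F-measure is their harmonic mean.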

E. Supervised Learning Algorithms


1) Random Forest: Random Forest algorithms are used for classification as well as regression. The algorithm builds a collection of
decision trees from the data and makes predictions based on them. Random Forest can be used on large datasets and can still
produce good results when a large proportion of record values is missing. The generated trees can be saved so that they can be
used on other data. Random Forest works in two stages: first the forest is created, then a prediction is made using the random
forest classifier created in the first stage. In this project, Random Forest gives 75-80% accuracy on the dataset.(12)
2) Decision Tree: A Decision Tree has the form of a flowchart in which the inner nodes represent the dataset attributes and the
outer branches represent the outcomes. Decision Trees are chosen because they are fast, reliable and easy to interpret, and they
require very little data preparation. In a Decision Tree, the prediction of the class label starts from the root of the tree. The value
of the root attribute is compared to the record's attribute. Based on the result of the comparison, the branch corresponding to
that value is followed and a jump is made to the next node.
3) Logistic Regression: Logistic Regression is a classification algorithm mostly used for binary classification problems. Instead of
fitting a straight line or hyperplane, the logistic regression algorithm uses the logistic function to squeeze the output of a linear
equation between 0 and 1. The dataset has 13 independent variables and a binary target, which makes logistic regression a
suitable choice for classification.


4) Scikit-Learn: Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a
model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict
anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a
(supervised) machine learning experiment to hold out part of the available data as a test set (X_test, y_test). Note that the word
“experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts
out experimentally. In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split
helper function.
5) Cross Validation K-fold: Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split
into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in
place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.(13)
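How a data sample is divided into k groups can be sketched as below: each group serves once as the validation fold while the
remaining k-1 groups form the training fold. The function name is hypothetical, and the folds are taken in order without shuffling
for simplicity; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for each of k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
```

The model's score is then usually reported as the average of the k validation scores.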

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data,
that is, to use a limited sample to estimate how the model is expected to perform in general when making predictions on data not
used during training. Cross-validation is a statistical method of evaluating and comparing learning algorithms by dividing data
into two segments: one used to learn or train a model and the other used to validate the model.
When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM,
there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This
way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance.
To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training
set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be
done on the test set.(14)
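The role of the validation set can be illustrated with a small sketch in which a hyperparameter is chosen on validation data and the
test set is used only once at the end. A k-nearest-neighbour classifier on synthetic one-dimensional data is used here purely for
brevity; both the classifier and the data are assumptions and are not part of this work's experiments.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train, rows, k):
    """Fraction of (x, y) rows that k-NN, fitted on train, labels correctly."""
    hits = sum(1 for x, y in rows if knn_predict(train, x, k) == y)
    return hits / len(rows)

# Synthetic 1-D data (not patient data): label is 1 when x >= 50,
# with two training labels flipped on purpose to imitate noise.
data = [(x, int(x >= 50)) for x in range(100)]
data[17] = (17, 1)
data[77] = (77, 0)

# Interleaved 60/20/20 split into training, validation and test sets.
train = [row for i, row in enumerate(data) if i % 5 in (0, 1, 2)]
validation = [row for i, row in enumerate(data) if i % 5 == 3]
test = [row for i, row in enumerate(data) if i % 5 == 4]

# Choose the hyperparameter k on the validation set only.
candidates = [1, 3, 5]
best = max(candidates, key=lambda k: accuracy(train, validation, k))

# The test set is used once, after the choice of k is final.
test_accuracy = accuracy(train, test, best)
print(best, test_accuracy)
```

Because the test rows play no part in choosing k, the final score is not inflated by the tuning step.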
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for
learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final
evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is
split into k smaller sets; a model is trained on k-1 of them and validated on the remaining one, and this is repeated for each of the
k sets.


IV. CONCLUSION
This component will help in predicting the severity of heart stroke/cardiovascular disease. Once the model has been trained
successfully, the user enters the input data and the learned weights are applied to the given inputs. The heart disease prediction
system takes 13 attribute values as input. The target value is zero or one, and the prediction is reported in a ‘yes’ or ‘no’ format,
depending on whether the risk factors meet the criteria learned by the trained model.
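The prediction step can be sketched as follows, assuming a logistic regression model: the weighted sum of the 13 inputs is passed
through the logistic function and thresholded into ‘yes’ or ‘no’. The weights, bias, threshold and input values are hypothetical
placeholders, not the parameters learned in this work.

```python
import math

def sigmoid(z):
    """Logistic function: squeezes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features, threshold=0.5):
    """Return (probability, 'yes' or 'no') for one patient record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, ("yes" if probability >= threshold else "no")

# Hypothetical weights for 13 already-scaled attributes;
# a real model would learn these from the training data.
weights = [0.5] * 13
bias = -1.0

# Made-up, already-scaled input values for one patient.
patient = [1.0] * 13

probability, answer = predict(weights, bias, patient)
print(round(probability, 3), answer)
```

The 0.5 threshold is a common default; a different threshold trades precision against recall.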

REFERENCES
[1] Goel R Heart Disease Prediction Using Various Algorithms of Machine Learning, http://dx.doi.org/10.2139/ssrn.3884968
[2] Jin B, Che C, Liu Z, et al (2018), Predicting the Risk of Heart Failure With EHR Sequential Data Modeling, http://dx.doi.org/10.1109/access.2017.2789324
[3] Jensen K, Martinsen ACT, Tingberg A, et al (2014), Comparing five different iterative reconstruction algorithms for computed tomography in an ROC study,
http://dx.doi.org/10.1007/s00330-014-3333-4
[4] Repaka AN, Ravikanti SD, and Franklin RG (2019), Design And Implementing Heart Disease Prediction Using Naives Bayesian,
http://dx.doi.org/10.1109/icoei.2019.8862604
[5] Lakshmi KP, Prasanna Lakshmi K, and Reddy CRK (2015), Fast rule-based heart disease prediction using associative classification mining,
http://dx.doi.org/10.1109/ic4.2015.7375725
[6] Al-Bayaty BFZ, Zopon Al-Bayaty BF, Bharati Vidyapeeth University, et al (2016), Comparative Analysis between Naïve Bayes Algorithm and Decision Tree
to Solve WSD Using Empirical Approach, http://dx.doi.org/10.7763/lnse.2016.v4.228
[7] Chauhan A, Jain A, Sharma P, et al (2018), Heart Disease Prediction using Evolutionary Rule Learning, http://dx.doi.org/10.1109/ciact.2018.8480271
[8] Peyls N Learning curve for insertion of a peripherally introduced central catheter using echo guidance on a phantom model,
http://dx.doi.org/10.26226/morressier.59dd3a6ad462b8029238a5db
[9] Strom S (2019), Photophysiological responses of two dinoflagellate species used in natural high light exposure experiments (Protist Signaling project),
http://dx.doi.org/10.1575/1912/bco-dmo.723266.1
[10] Williams G (2011) Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer Science & Business Media
[11] Irsyad R Penggunaan Python Web Framework Flask Untuk Pemula.
[12] Institute of Medicine, Board on Global Health, and Committee on Preventing the Global Epidemic of Cardiovascular Disease: Meeting the Challenges in
Developing Countries (2010) Promoting Cardiovascular Health in the Developing World: A Critical Challenge to Achieve Global Health, National Academies
Press
[13] Vabalas A, Gowen E, Poliakoff E, et al (2019), Machine learning algorithm validation with a limited sample size.
[14] Duarte E and Wainer J (2017), Empirical comparison of cross-validation and internal metrics for tuning SVM hyperparameters.
