COMP 5310: Principles of Data Science: Heart Disease UCI



Presented by
Maha Sulaeman El-Shahawy (Unikey: mels6088)
Umaira Uzma Sajjad (Unikey: usaj8459)

The University of Sydney Page 1


Contents

Setup
– Hypothesis
– Exploratory Data Analysis

Approach
– Preprocessing
– Hyperparameter Tuning
– Modeling

Model Evaluation
– Classification Report
– ROC curve
– Generalization/Training curves

Statistical Testing
– Between Models
– Hypothesis Testing



Setup
Null Hypothesis (H0):
Heart disease cannot be predicted using the features
(demographics, health factors, medical results)
Exploratory Data Analysis

Table 1: Heart Disease Data Set; 303 instances and 14 columns (13 features, 1 target)
Table 2: Pearson correlation and p-value
Figure 1: Heart disease starts at a lower age (late 30s) for men compared to women (early 50s)



Approach
Preprocessing
– 6 missing values: SimpleImputer (mean of each feature)
– Normalisation: StandardScaler (centred on the mean, scaled to unit variance); integer encoding
– PCA: > 90% variance at 10 components
Tuning
– Grid search (PCA components >= 10, model parameters)
– Test and training data split (20 : 80)
– 10-fold cross-validation
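
The preprocessing and tuning steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the presenters' code: the data is a synthetic stand-in generated with make_classification, the SVC classifier and its C grid are hypothetical choices, and the stratified split and fixed random seeds are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the slide's data (303 x 13).
# Assumption: this is NOT the real Heart Disease UCI data.
X, y = make_classification(n_samples=303, n_features=13, n_informative=8,
                           random_state=0)

# Blank out 6 cells to mimic the 6 missing values mentioned above.
rng = np.random.default_rng(0)
rows = rng.choice(X.shape[0], size=6, replace=False)
cols = rng.integers(0, X.shape[1], size=6)
X[rows, cols] = np.nan

# 80:20 train/test split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Impute with the feature mean, standardise, reduce with PCA, then classify.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Search PCA components >= 10 together with a model parameter.
param_grid = {
    "pca__n_components": [10, 11, 12, 13],
    "clf__C": [0.1, 1.0, 10.0],  # hypothetical grid
}

# 10-fold cross-validation on the training portion only.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

Keeping the imputer, scaler and PCA inside the pipeline means each cross-validation fold fits them on its own training part, so no information from the held-out fold leaks into preprocessing.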



Approach
Modeling
– Pipelines
– Supervised classification algorithms

Table 2: Results of considered models. Times are taken on an Intel® Core™ i7-4700HQ @ 2.4 GHz processor with 16 GB RAM



Model Evaluations

Classification Report
Figure 2: SVM classification report on test data
Figure 3: Naïve Bayes classification report on test data

Confusion Matrix
Figure 4: SVM confusion matrix on test data
Figure 5: NB confusion matrix on test data
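
As a reminder of what a classification report and a confusion matrix summarise, the sketch below computes both for a binary problem in plain Python. The labels are made-up illustrative values, not the presenters' test data, and the helper names confusion_counts and report are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def report(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from the four confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = report(*counts)
print(counts)
print(metrics)
```

For these labels there are 4 true positives, 1 false positive, 1 false negative and 4 true negatives, so precision, recall, F1 and accuracy all come out at 0.8.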



Model Evaluations

Training Size vs Error Rates
Figure 6: Learning curve of SVM train sizes with accuracy using bootstrap train/development data
Figure 7: Learning curve of NB train sizes with accuracy using bootstrap train-development data

Complexity vs Error Rates
Figure 8: Validation curve of SVM complexity with accuracy using bootstrap train-development data
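
A learning curve of this kind could be produced along the following lines. This is only a sketch under stated assumptions: the data is synthetic, and scikit-learn's ShuffleSplit (repeated random splits without replacement) stands in for the bootstrap resampling named in the captions, whose exact procedure the slides do not describe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 303 x 13 data set (assumption, not the real data).
X, y = make_classification(n_samples=303, n_features=13, n_informative=8,
                           random_state=0)

# 20 random train/development splits; a stand-in for bootstrap resampling.
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

# Accuracy at five training-set sizes, from 20% to 100% of each training split.
sizes, train_scores, dev_scores = learning_curve(
    GaussianNB(), X, y, cv=cv,
    train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy")

# Report error rate (1 - accuracy), averaged over the 20 splits.
for n, tr, dv in zip(sizes, train_scores.mean(axis=1), dev_scores.mean(axis=1)):
    print(f"train size {n:4d}: train error {1 - tr:.3f}, dev error {1 - dv:.3f}")
```

A large, persistent gap between training and development error would suggest overfitting, while two high but close error rates would suggest underfitting.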



Statistical Testing (α = 0.01)

Between Models
Hypothesis: the two algorithms should have the same error rate

Test             p-value
McNemar's test   1.00
Paired t-test    0.10

Hypothesis Testing
Mann-Whitney U test
– p-value = 0.035

Result: p-value > α. Weak evidence against the null hypothesis
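
For the between-models comparison, McNemar's test looks only at the test cases where the two classifiers disagree. The sketch below implements the exact (binomial) two-sided version in plain Python; the function names and the example labels are hypothetical, and the slides do not say whether the exact or the chi-squared variant was used.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count cases where exactly one of the two models is correct."""
    b = c = 0
    for truth, a, m in zip(y_true, pred_a, pred_b):
        if a == truth and m != truth:
            b += 1  # model A right, model B wrong
        elif a != truth and m == truth:
            c += 1  # model A wrong, model B right
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical labels and predictions for two models.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
pred_a = [1, 0, 1, 0, 0, 1, 1, 0]
pred_b = [1, 0, 0, 1, 0, 1, 0, 0]

b, c = discordant_counts(y_true, pred_a, pred_b)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)
```

When the two models make a similar number of one-sided errors the p-value approaches 1, which means the data give no reason to reject the hypothesis of equal error rates.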


Thank you
