Luis Bronchal
Exploratory Data Analysis [EDA]
Data loading and cleaning
Variable analysis
Correlation between variables
Univariable analysis
Machine learning model
Baseline model
Improving baseline model
Feature importance analysis
Explanatory models
Predictive model
Model comparison
Next things to try
This is an analysis of the Pima Indians Diabetes Database, obtained from Kaggle (https://www.kaggle.com/uciml/pima-
diabetes-database). It is a small dataset with missing values. We have used imputation techniques and tried some explanatory models
(classification tree and logistic regression) and predictive models (random forest and xgboost).
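library(doMC)  # attaches parallel as a dependency, which provides detectCores()
registerDoMC(cores = detectCores() - 1)  # leave one core free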
It looks like there aren’t explicit missing values, but on closer inspection some biological measurements have zero values in the
dataset, and that’s impossible:
Let’s see how many rows are affected by this problem:
[1] 234
[1] 0.3046875
biological_data[biological_data<=0] <- NA
dat[, names(biological_data)] <- biological_data
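The row count and the replacement above can be sketched on a toy data frame. The values below are made up for illustration and are not rows from the Pima data; only the column names come from the report. The reported figures (234 rows, 0.3046875) are consistent with a dataset of 768 rows.

```r
# Toy stand-in for the biological columns (values are invented for illustration)
biological_data <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 66),
  BMI           = c(33.6, 26.6, 23.3, 0)
)

# A row is affected if at least one measurement is zero or negative
affected <- rowSums(biological_data <= 0) > 0
n_affected <- sum(affected)       # number of affected rows
share_affected <- mean(affected)  # proportion of affected rows

# Same replacement as in the report: impossible values become NA
biological_data[biological_data <= 0] <- NA
```

Counting before replacing matters here: once the zeros are NA, the comparison `<= 0` yields NA as well and `rowSums` would need `na.rm = TRUE`.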
We have very little data, so we can’t get rid of these rows. We are going to try to impute the missing data.
Let’s see the class proportions of the outcome.
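To make the idea of nearest-neighbour imputation concrete, here is a minimal base R sketch. It is not the implementation used in this report: `knn_impute` is a hypothetical helper, it fills each missing cell with the median of the k nearest complete rows, and the toy values are invented.

```r
# Hypothetical helper: fill NAs in a numeric data frame using the k nearest complete rows
knn_impute <- function(df, k = 3) {
  scaled <- scale(df)                     # standardize columns; NAs are ignored in mean/sd
  donors <- which(complete.cases(df))     # rows without any NA serve as neighbours
  out <- df
  for (i in which(!complete.cases(df))) {
    miss <- is.na(unlist(df[i, ]))        # which columns are missing in row i
    obs <- !miss
    # Euclidean distance to every donor, using only the columns observed in row i
    diffs <- sweep(scaled[donors, obs, drop = FALSE], 2, scaled[i, obs])
    d <- sqrt(rowSums(diffs^2))
    nn <- donors[order(d)[seq_len(min(k, length(donors)))]]
    # Fill each missing cell with the median of the neighbours' values
    for (col in names(df)[miss]) {
      out[i, col] <- median(df[nn, col])
    }
  }
  out
}

# Toy data (invented): the last row is missing BMI
toy <- data.frame(
  Glucose = c(90, 95, 150, 155, 92),
  BMI     = c(22, 23, 35, 36, NA)
)
imputed <- knn_impute(toy, k = 2)
```

In the toy data the two rows closest in Glucose to the incomplete row have BMI 22 and 23, so the missing value is filled with their median.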
X0 X1
0.6510417 0.3489583
Let’s see the correlation between numerical variables. There are variables which are highly correlated. That’s the case
Univariable analysis
# Plot every predictor (all columns except the last one, Outcome) against the outcome
for (x in seq_len(ncol(dat) - 1)) {
  univar_graph(names(dat)[x], dat[, x], dat, dat[, 'Outcome'])
}
1/27/2019 Pima Indians Diabetes Database Analysis | Kaggle
There are variables with high right skew (Insulin, DiabetesPedigreeFunction, Age) and others with high left skew like Blo
Baseline model
Let’s create a baseline model. We’ll see later if it is necessary to improve it.
dindex <- createDataPartition(dat$Outcome, p=0.7, list=FALSE)
train_data <- dat_original[dindex,]
test_data <- dat_original[-dindex,]
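For a factor outcome, `createDataPartition` samples within each class so that the split keeps the class proportions. A rough base R sketch of that idea follows; the outcome vector is synthetic (class sizes 500 and 268, which is what the reported proportions imply for 768 rows), the seed is arbitrary, and this is not caret's actual code.

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# Synthetic outcome with the class sizes implied by the reported proportions
outcome <- factor(rep(c("X0", "X1"), times = c(500, 268)))

# Sample 70% of the row indices inside each class separately
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = ceiling(0.7 * length(idx)))
}))
test_idx <- setdiff(seq_along(outcome), train_idx)

train_counts <- table(outcome[train_idx])
```

With these class sizes the training counts come out as 350 and 188, the same as the training table shown in the report.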
We are going to impute the missing data in the training and test sets separately.
X0 X1
350 188
Prediction X0 X1
X0 129 29
X1 21 51
Accuracy : 0.7826
95% CI : (0.7236, 0.8341)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 1.156e-05
Kappa : 0.5094
Mcnemar's Test P-Value : 0.3222
Sensitivity : 0.6375
Specificity : 0.8600
Pos Pred Value : 0.7083
Neg Pred Value : 0.8165
Prevalence : 0.3478
Detection Rate : 0.2217
Detection Prevalence : 0.3130
Balanced Accuracy : 0.7488
'Positive' Class : X1
X0 vs. X1 0.8595833
The accuracy is not bad, but it is not the best metric in this case because the classes are imbalanced.
The AUC is 0.8595833
The F1 score is 0.6710526
The recall (sensitivity) is quite poor: 0.6375
Next things to consider in order to build a better model than this baseline:
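The reported metrics can be re-derived by hand from the baseline confusion matrix. The sketch below assumes the usual caret layout (rows are predictions, columns are the reference) with X1 as the positive class, as stated in the output.

```r
# Counts read from the baseline confusion matrix (positive class: X1)
tp <- 51   # predicted X1, reference X1
fp <- 21   # predicted X1, reference X0
fn <- 29   # predicted X0, reference X1
tn <- 129  # predicted X0, reference X0

precision <- tp / (tp + fp)                  # Pos Pred Value
recall    <- tp / (tp + fn)                  # Sensitivity
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fp + fn + tn)
```

The F1 score combines precision and recall for the positive class only, which is why it is more informative than accuracy when only about a third of the cases are diabetic.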
We have to think about the features to include in the model, because some are highly correlated (we can try PCA,…
We have to deal with the class imbalance problem (oversampling, synthetic cases,…)
We can try different machine learning models
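As an illustration of the oversampling option in the list above, here is a minimal base R sketch of random oversampling. The data frame is synthetic: only the class counts (350 and 188) come from the training table in this report, the Glucose values are random, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility of the sketch

# Synthetic training set with the class counts from the report
train <- data.frame(
  Outcome = factor(rep(c("X0", "X1"), times = c(350, 188))),
  Glucose = rnorm(538, mean = 120, sd = 30)  # invented values
)

# Resample minority rows with replacement until both classes have the same size
counts <- table(train$Outcome)
minority <- names(which.min(counts))
minority_rows <- which(train$Outcome == minority)
extra <- sample(minority_rows, size = max(counts) - min(counts), replace = TRUE)
balanced <- train[c(seq_len(nrow(train)), extra), ]

balanced_counts <- table(balanced$Outcome)
```

Oversampling should only be applied to the training data; the test set has to keep the original class proportions so that the evaluation stays realistic.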
boruta_results <- Boruta(Outcome~., train_data)
All the variables are important except BloodPressure. Glucose is the most important one.
If we look at the correlation matrix between variables, we can see some correlation, but the values are below 0.75, so that’s coherent with
Boruta, and it looks like we can’t get rid of any feature:
findCorrelation(correlat, cutoff=0.75)
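The cutoff check can be sketched in base R by listing the variable pairs whose absolute correlation exceeds 0.75. This only reports the pairs; `findCorrelation` additionally decides which column of each pair to drop. The data below are synthetic (`x`, `y`, `z` are hypothetical variables) and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed for reproducibility of the sketch

# Synthetic data: y is almost a copy of x, z is independent
x <- rnorm(200)
toy <- data.frame(x = x, y = x + rnorm(200, sd = 0.1), z = rnorm(200))

correlat <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(correlat) > 0.75 & upper.tri(correlat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(correlat)[high[, "row"]],
  var2 = colnames(correlat)[high[, "col"]]
)
```

On the toy data only the pair (x, y) is flagged. An empty result would mean that no pair is correlated strongly enough to justify dropping a feature at this cutoff.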
We are going to use a different approach. We are going to recursively explore which is the best feature set for a linear
From this approach it looks like all features are needed.
We are going to try explanatory models: logistic regression and classification trees.
Explanatory models
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7007 -0.7340 -0.4207 0.7024 2.4104
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.831720 0.114078 -7.291 3.08e-13 ***
Pregnancies 0.359195 0.127200 2.824 0.004745 **
Glucose 1.125086 0.149146 7.544 4.57e-14 ***
BloodPressure 0.006958 0.124510 0.056 0.955433
SkinThickness 0.043982 0.145204 0.303 0.761969
Insulin -0.102950 0.134696 -0.764 0.444682
BMI 0.510131 0.154228 3.308 0.000941 ***
DiabetesPedigreeFunction 0.314418 0.114818 2.738 0.006174 **
Age 0.153331 0.130900 1.171 0.241452
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The most relevant feature related to diabetes is Glucose, followed by BMI, Pregnancies and DiabetesPedigreeFuncti
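The coefficients of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the four significant estimates from the summary above. Interpreting them "per standard deviation" is an assumption: the scale of the coefficients suggests the predictors were standardized, but the preprocessing is not shown here.

```r
# Significant estimates copied from the model summary (log-odds scale)
coefs <- c(
  Pregnancies              = 0.359195,
  Glucose                  = 1.125086,
  BMI                      = 0.510131,
  DiabetesPedigreeFunction = 0.314418
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
ranked <- round(sort(odds_ratios, decreasing = TRUE), 2)
```

An odds ratio above 1 means the odds of the X1 outcome increase with the predictor, and the ranking matches the order given in the text: Glucose first, then BMI, Pregnancies and DiabetesPedigreeFunction.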
Prediction X0 X1
X0 132 35
X1 18 45
Accuracy : 0.7696
95% CI : (0.7097, 0.8224)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 7.748e-05
Kappa : 0.4656
Mcnemar's Test P-Value : 0.02797
Sensitivity : 0.5625
Specificity : 0.8800
Pos Pred Value : 0.7143
Neg Pred Value : 0.7904
Prevalence : 0.3478
Detection Rate : 0.1957
Detection Prevalence : 0.2739
Balanced Accuracy : 0.7212
'Positive' Class : X1
We achieve somewhat better results than with the baseline model, but not very good ones:
Predictive model
note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
Prediction X0 X1
X0 122 28
X1 28 52
Accuracy : 0.7565
95% CI : (0.6958, 0.8105)
No Information Rate : 0.6522
Kappa : 0.4633
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.6500
Specificity : 0.8133
Pos Pred Value : 0.6500
Neg Pred Value : 0.8133
Prevalence : 0.3478
Detection Rate : 0.2261
Detection Prevalence : 0.3478
Balanced Accuracy : 0.7317
'Positive' Class : X1
Model comparison
We are going to compare these models over the training and resampling data:
This is the correlation between models. This info can be used if we decide to combine some models to build a stacked
We are going to see the model results on the test data:
                Sensitivity        F1       AUC
results_glm          0.6375 0.6710526 0.8595833
results_glmnet       0.5625 0.6293706 0.8586667
results_rpart        0.6375 0.6071429 0.7812500
results_rf           0.6375 0.6500000 0.8317500
results_xgbTree      0.4500 0.5413534 0.8252917
results_knn          0.5625 0.5625000 0.7662917
Simple logistic regression looks like the best model here: it has the best F1 score and AUC, and it ties with rpart and rf for the best sensitivity.
We have developed some explanatory models (classification tree and logistic regression). They show us which are the most important
factors for a person to have diabetes. Predictive models should improve prediction performance, but they don’t show
outstanding results.
This kernel has been released under the Apache 2.0 open source license.
Data Sources
Pima Indians Diabetes Database
This dataset is originally from the National Institute of Diabetes and
Luis, nice work. When looking at your analysis of missing values (for instance blood pressure cannot be zero,
same for BMI) you documented the remediation by invoking some function (kmi…). Could you please elaborate a
bit further?
You have to deal with the missing data. There are different approaches to do this. I have tried
KNN imputation (with the function knnImputation) to do a first analysis, but it is possible to use other
techniques (it is the first thing in the section 'Next things to try' of my report).
Here you can see an interesting intro about this subject:
the unit of insulin 2 hour serum test is muU/ml. can anybody tell me the normal range in order to differe
towards the diabetic prone person. i am able to find insulin test data in mg/dl or mol/l. so i am unable to
kindly help me either in conversion of muU/ml to mg/dl or mol/l, or kindly tell me the range of muU/ml
Luis Bronchal Kernel Author • Posted on Latest Version • 2 years ago • Options
I am not an expert on this business domain, so I can't help you with that.
Beautiful work! :)
