
Machine Learning for Diabetes Mellitus

Prediction in the Intensive Care Unit

Adam Ragab
STUDENT NUMBER: 2029445

THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
BACHELOR OF SCIENCE IN COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:
Supervisor: Dr. Sharon Ong
Second Reader: Dr. Travis Wiltshire

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
June 2021
Preface

I would first like to thank my thesis supervisor, Dr. Sharon Ong, for her continuous
support and guidance throughout the semester. With a lot on my plate between
coursework and an internship, she made it possible to manage the workload and get
through it. Secondly, I would like to acknowledge and thank my family for their endless
love and support throughout my studies; without them I wouldn't be the person I am
today. Lastly, I want to note why I chose this research topic. I believe that artificial
intelligence has the potential to revolutionize healthcare and the way we live; it is
something I am passionate about and an area of research I will continue working on
after graduation. Moreover, I believe it is all the more important, in a world where the
gaps of inequality are ever widening, that artificial intelligence does not perpetuate
this process but rather makes the world a fairer place. This is one of the reasons why I
included bias evaluation in my thesis, as it is more important now than ever to ensure
that we develop fair models.
Machine Learning for Diabetes Mellitus
Prediction in the Intensive Care Unit

Adam Ragab

Diabetes is a chronic disease stemming from abnormal blood glucose levels and disruptions in the
production of the hormone insulin. An estimated 451 million adults live with the disease
worldwide, and this number is projected to reach 645 million by 2045. In recent years, work has
focused on predicting diabetes using machine learning for non-emergency care admissions, which
has proven possible with relative success. However, the prediction of diabetes mellitus in
intensive care unit admissions using machine learning is yet to be extensively studied. As a
diabetes diagnosis is a relevant piece of medical information required for appropriate care to
be administered, this thesis explored the extent to which machine learning can be utilized to
predict diabetes mellitus in intensive care unit admissions and investigated the racial bias of
such models. Here, Decision Tree, Random Forest, Gradient Boosted Trees, and Neural Network
models were trained on a subset of the Women in Data Science (WiDS) Datathon 2020 dataset, which
contains the de-identified clinical information of 130,157 patients admitted to intensive care
units (ICUs) across the globe within a one-year timeframe. The models were evaluated for their
performance and racial bias. The results show that Random Forest and Gradient Boosted Trees
are the best performing models; additionally, none of the models were found to be racially biased.

1. Introduction

Diabetes is a chronic disease stemming from abnormal blood glucose levels and dis-
ruptions in the production of the hormone insulin. It is extremely prevalent in society,
with an estimated 451 million adults living with the disease worldwide in 2017, and
the number expected to increase by 2045 to 645 million (Cho et al., 2018). The impact
of diabetes on society is immense, not only from a quality of life and health oriented
perspective (with an estimated 5 million deaths per year associated with the disease
worldwide) but also from an economic point of view. It is estimated that 850 billion USD
are spent each year on costs related to the disease. Early diagnosis of diabetes is vital
so that appropriate treatment can be administered and serious complications prevented.
It is often the case that patients admitted to the intensive care unit are unable to
adequately inform clinicians of their relevant medical history. Given that 49.7% of all
diabetics are undiagnosed, and that a diabetes diagnosis is relevant for adequate care
in critical care settings such as the ICU, improvements in diagnostics, for instance
automated prediction algorithms, could reduce costs and save lives. In parallel to the
adoption of AI in healthcare, there have been strides in fairness research for machine
learning algorithms and artificial intelligence (Bellamy et al., 2019). Studies in recent
years have shown that, much like artificially intelligent systems deployed in other
domains, such systems in the medical domain are prone to errors and bias (Gurupur &
Wan, 2020). Even when a system is designed appropriately and is not inherently flawed
by design, bias in the data used to train it can lead to systems that systematically
disadvantage a population subgroup. Medical datasets have been shown to underrepresent
certain ethnic groups,
resulting in biased data. In recent years, studies have shown the prevalence of such
biases in a multitude of domains, such as bias in facial recognition applications
(Leslie, 2020) and in scoring systems used in finance, hiring, and insurance (Leavy, 2018).
Work by McCradden et al. (2020) indicates that such bias can have significant health
implications for underrepresented groups. Therefore, any model developed for the
prediction of diabetes must also be evaluated for bias.

1.1 Research Questions

This motivates the first research question of this research project:
To what extent can machine learning be used to predict diabetes mellitus in ICU admissions?
It is accompanied by a research sub-question, motivated by the fact that identifying the
best performing machine learning model is critical to understanding which model is most
suitable for the prediction task:
Which machine learning classifier has the best performance on the prediction of diabetes mellitus?
Following this, the second research question of this research project is as follows:
What is the effect of racially biased training data on machine learning model performance on
the prediction of diabetes mellitus in ICU admissions?

2. Related Work

2.1 Intensive Care Unit and Machine Learning

Advances in data collection, storage, and management, combined with a desire to
provide more efficient and effective care and with rapid advances in machine learning
and artificial-intelligence-based systems, have enabled the adoption of clinical decision
support systems and other such technologies (Panch et al., 2019). Such systems offer
the ability to greatly enhance medical care and even promise to usher in a new age of
healthcare that is both personalized and preventive (Kelly et al., 2019). The ability of
AI-driven systems to support clinicians, particularly in high-stress, critical care settings
such as the ICU, has led to research applying machine learning techniques to a variety
of tasks within the ICU (Gutierrez, 2020). Houthooft et al. (2015) applied machine
learning to forecast length of stay and ICU readmissions using support vector machines
and neural networks. Awad et al. (2017) utilized decision trees, random forests, and
naive Bayes to predict ICU mortality. Yoon et al. (2017) predicted patient instability in
the ICU using logistic regression and random forests, and Sottile et al. (2018) utilized
random forests, naive Bayes, and AdaBoost for the prediction of patient-ventilator
asynchrony (Gutierrez, 2020). The application of machine learning to diabetics within
the ICU has also been studied, for example predicting the mortality of diabetic patients
(Anand et al., 2018) and the length of diabetic patient stays using random forest
classifiers (Hargreaves et al., 2020).

2.2 Machine Learning for Diabetes Prediction

There have been many applications of machine learning for diabetes prediction. Zou
et al. (2018) utilized decision trees, random forests, and neural networks to predict
diabetes and investigated the use of PCA and mRMR for dimensionality reduction.
Mahboob Alam et al. (2019) predicted diabetes in the Pima Indian Diabetes dataset
using neural networks, random forests, and k-means clustering. P et al. (2020) used deep
neural networks to achieve 98.16% accuracy in the prediction of diabetes in the Pima
Indian Diabetes dataset. Zhou et al. (2020) also utilized a deep neural network with
dropout regularization to achieve an accuracy of 94.02% on the diabetes type dataset
and 99.41% on the Pima Indian Diabetes dataset.

2.3 Model Fairness and Bias

Recent work by Bellamy et al. (2019) has led to the development of a toolkit and accom-
panying resources for detecting and mitigating algorithmic bias. Work by McCradden
(2020), Kelly et al. (2019), and Gurupur et al. (2020) has identified addressing biases in
artificial intelligence and machine learning models as a key challenge for delivering
clinical impact and ensuring patient safety. Noseworthy et al. (2020) recently
investigated the impact of racial and ethnic bias on convolutional neural network
performance in ECG analysis. The findings showed that race did not significantly impact
model performance, and the authors suggested reporting performance per demographic
group for all future machine learning models utilized in healthcare.

3. Methodology

This section describes the methodological framework, procedures, and implementations
of the conducted research.

3.1 Dataset

The data utilized in this study originated from a dataset curated for the Women in
Data Science (WiDS) Datathon 2020. This dataset was collected by the Massachusetts
Institute of Technology's (MIT) Global Open Source Severity of Illness Score (GOSSIS)
initiative. The dataset, which is freely available on Kaggle, contains the de-identified
clinical information of 130,157 patients admitted to intensive care units (ICUs) across
the globe within a one-year timeframe. Containing 181 physiological and clinical
variables recorded within 24 hours of ICU admission, the dataset provides an extensive
profile of admitted patients. See the appendix, Table 1.0, for a data dictionary
containing a detailed description of the variables within the dataset.

3.2 Data Exploration

Exploratory data analysis was first performed on the dataset; this encompassed examining
the distributions of diabetic and non-diabetic patients and of ethnicities. From this
initial exploration, 102,006 non-diabetic and 28,151 diabetic patients were identified.
The ethnic composition of patients was 78% Caucasian, 12% African American, 5% other,
4% Hispanic, and 2% Asian. This ethnic composition is visualized in Figure 1.
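For illustration, this initial breakdown can be reproduced with a few lines of pandas. This is a minimal sketch, assuming the WiDS 2020 training file (here assumed to be named training_v2.csv, as in the Kaggle release) with its 'diabetes_mellitus' label and 'ethnicity' columns:

    # A minimal sketch of the initial exploration; file and column names
    # are assumptions based on the Kaggle release of the dataset.
    import pandas as pd

    df = pd.read_csv("training_v2.csv")
    print(df["diabetes_mellitus"].value_counts())              # diabetic vs. non-diabetic counts
    print(df["ethnicity"].value_counts(normalize=True) * 100)  # ethnic composition in percent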
Given the dataset's large size, both in terms of the number of patients and the plethora
of variables, it was deemed unfeasible to extensively explore the rich clinical and
physiological variable set in its entirety; thus heuristics were used for the initial data
exploration, after which a more detailed examination of a subset of the dataset could be
conducted.
Thus, the dataset was explored for key clinical and physiological features previously
identified in the literature as critical for diabetes prediction (Zou et al., 2018;
Balkau et al., 2008; P et al., 2020; Mahboob Alam et al., 2019). See Table 1.

Figure 1: Ethnic distributions of patients in dataset.

Table 1: Features correlated with diabetes from the literature. The 5 features (diastolic
blood pressure, plasma glucose concentration, 2-h serum insulin levels, body mass index,
and age) were sufficient for the classification of patients as either diabetic or
non-diabetic in previous work on diabetes prediction (Zou et al., 2018; P et al., 2020;
Zhou, Myrzashova & Zheng, 2020; Mahboob Alam et al., 2019).

Diabetes Predictors
Diastolic Blood Pressure
Plasma Glucose Concentration after a 2-h Oral Glucose Tolerance Test
2-h Serum Insulin
Body Mass Index
Age

Here, extensive use was made of the data dictionary provided as supplementary material
with the dataset. This data dictionary contains a list of all clinical and physiological
variables within the dataset, along with brief descriptions of each variable, such as its
data type and measurement. Of the 5 clinical features, all but one were identified within
the dataset. The missing feature, 2-h serum insulin, has been identified in previous
studies as a significant predictor of diabetes mellitus (Zou et al., 2018). For the
remaining features, group statistics (mean, standard deviation, percentiles, maximum and
minimum values) and data distributions were calculated and plotted. The group statistics
are contained in Figure 2 and the distributions in Figure 3.
Lastly, the number of missing values in the remaining features was examined. The result
of this analysis is visualized in Figure 4, where a wide variation in missing feature
values can be observed. This is further expanded upon in the section on feature selection.


Figure 2: Table of Group statistics of Clinical Diabetes Predictors

Figure 3: Histogram plots of diabetes predictors. The results show that the features are
for the most part normally distributed, with varying degrees of skew and kurtosis.

3.3 Feature Selection

Feature selection lends itself to the reduction of irrelevant and redundant variables,
reducing computation time and improving prediction performance (Chandrashekar & Sahin,
2014). For feature selection, three feature sets (clinical-based, threshold-based, and
feature-importance-score-based) were generated. The threshold-based feature set was
further refined using a recursive feature selection algorithm that utilizes SHapley
Additive exPlanations (SHAP) values to score feature importance. The results of this
recursive feature selection were used to generate the final feature sets, consisting of
the subset of the most important features. This section details the generation of each
feature set and describes in detail the feature selection algorithm and the resulting
feature subset (the feature-importance-score-based feature set).

Figure 4: Quantity of missing values per feature; white lines represent missing values.

3.3.1 Clinical Feature Set. The initial feature set, referred to as the Clinical feature
set, consists of the 6 features identified by the data exploration process outlined in
Section 3.2. The features (age, BMI, diastolic blood pressure, and glucose) were
identified in previous work on diabetes prediction using machine learning (Zou et al.,
2018; P et al., 2020; Zhou, Myrzashova & Zheng, 2020; Mahboob Alam et al., 2019) as
sufficient predictors of diabetes, and are good clinical and biological predictors of the
disease (Balkau et al., 2008). This is a rather small feature set given the 181 variables
within the dataset, and it thus serves as the de facto baseline feature set.

3.3.2 Threshold-based Feature Set. The Threshold-based feature set was generated by
removing features that exceed a threshold of missing values. Thresholding was utilized
because earlier data exploration showed that if all rows with missing values are dropped,
the resulting dataset contains only 66 patients; that is, out of 130,157 patients, only
66 had all features recorded upon admission. As this subset is extremely small, dropping
features with an excess of missing values based on thresholding was opted for. A variety
of thresholding schemes were explored, with the best trade-off between feature retention
and missing-value removal found at 85%, meaning that at a minimum 85% of the data in a
retained feature column is complete and not missing (see Figure 5 for the explored
thresholding schemes and the resulting numbers of features). With an 85% threshold, a
significant portion of the features and entries was preserved (approximately 95 features
and 60,000 entries, respectively). There exists a delicate trade-off between these
thresholds: a higher threshold is preferable as a significant portion of the dataset is
retained, while at lower thresholds more features are preserved but the missing values
need to be filled in by imputation. If the results from the higher-thresholded features
are not sufficient, lower-thresholded features can be utilized instead and the missing
data imputed. Features with many missing values are usually not clinically significant
for a given diagnosis, by virtue of not being measured. Although not the aim of this
study, restricting the feature set to the most "common" features avoids the need to
impute data.

Figure 5: Features with missing values exceeding threshold values
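As an illustration of the thresholding scheme, the sketch below assumes the data is held in a pandas DataFrame df; it keeps only columns that are at least 85% complete and then drops the remaining incomplete rows:

    # A minimal sketch of the 85% completeness threshold described above,
    # assuming the data is held in a pandas DataFrame `df`.
    threshold = 0.85
    keep = df.columns[df.notna().mean() >= threshold]  # columns at least 85% complete
    df_thresholded = df[keep].dropna()                 # drop rows with remaining gaps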

3.3.3 Feature-Importance-Score-based Feature Set. The Feature-Importance-Score-based
feature set (or feature-selection feature set) consists of the most important subset of
features generated by the recursive feature selection algorithm, which uses SHapley
Additive exPlanations (SHAP) as a feature importance metric. Here the 12 most important
features were chosen. There are a total of 3 feature-importance feature sets, one for
each model. See Section 3.3.4 for an explanation of SHAP, Section 3.3.5 for an overview
of recursive feature selection, and Section 3.3.6 for the feature selection
implementation and results.

3.3.4 SHapley Additive exPlanations (SHAP). SHapley Additive exPlanations (SHAP) are a
unified measure of feature importance (Lundberg and Lee, 2017). Unlike traditional
feature importance measures, SHAP values are consistent and easily interpretable. In
short, Shapley values are the average marginal contribution of a feature across all
possible coalitions (Molnar, 2020). The Shapley value is a solution for computing
feature contributions to single predictions for any machine learning model, and SHAP
values can thus be utilized to interpret individual predictions. Here we utilized the
SHAP implementation of the Probatus package, an open-source Python package developed by
data scientists at ING Bank, which in turn utilizes the SHAP (Lundberg and Lee, 2017)
open-source Python library.
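For illustration, a minimal sketch of computing SHAP values for a fitted tree-based model with the shap library is shown below; X_train and y_train are assumptions standing in for the training features and labels:

    # A minimal sketch of per-feature SHAP contributions for a tree-based
    # model; X_train and y_train are assumed to exist.
    import shap
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier().fit(X_train, y_train)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)  # contributions per feature and prediction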


3.3.5 Recursive Feature Selection. Recursive feature selection is a feature selection
method that iteratively determines optimal features by fitting the data to a model,
calculating feature importance scores, and then pruning the feature set in each
iteration (Chandrashekar & Sahin, 2014). Here the ShapRFECV algorithm from the Probatus
open-source package was utilized. ShapRFECV implements recursive feature elimination for
tree-based classification models, using an API similar to scikit-learn (Buitinck et al.,
2013), while utilizing SHAP feature importance scores for feature selection. It has
additional options for cross-validation and hyperparameter optimization.

3.3.6 Feature Selection Implementation and Results. Each generated feature set is fit to
three different tree-based models, which were hyperparameter-optimized beforehand (see
Section 5.1 for the hyperparameters). The models are those utilized later in the
prediction task, namely Decision Tree, Random Forest, and Gradient Boosted Trees (for
implementation details, see Section 3.4). ShapRFECV is run with 5-fold cross-validation.
Thus, for each feature set and for each iteration of the recursive feature selection
algorithm, the data is fitted to the model and SHAP feature importance scores are
generated along with an ROC-AUC performance metric; afterwards, the lowest-scoring
feature is removed. For each feature set and model, the top 12 ranking features were
calculated, and from these the final feature set was selected based on a comparison of
the feature rankings.
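A hedged sketch of this step, following the Probatus documentation (exact argument names and return values may differ between package versions), is shown below:

    # A hedged sketch of the ShapRFECV step; argument names follow the
    # Probatus documentation but may differ between versions.
    from probatus.feature_elimination import ShapRFECV
    from sklearn.ensemble import RandomForestClassifier

    shap_rfe = ShapRFECV(RandomForestClassifier(), step=0.2, cv=5, scoring="roc_auc")
    report = shap_rfe.fit_compute(X_train, y_train)  # per-iteration ROC-AUC report
    top_12 = shap_rfe.get_reduced_features_set(num_features=12)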

3.4 Models

In this section, classification models are described, evaluation metrics defined and
implementation details outlined.

3.4.1 Classification algorithms. In total, 4 classification models were chosen. Models
were chosen based on several criteria, such as interpretability, non-linearity, and
previous usage in the diabetes prediction literature. Using these criteria, it was
determined that several tree-based models along with a neural-network-based classifier
were most appropriate. Of the tree-based models, Decision Tree, Random Forest, and
Gradient Boosted Trees were chosen, with the Decision Tree serving as the baseline
model. For the neural-network-based model, the fully connected neural network
architecture employed by P et al. (2020) was utilized. The following sections explain
model selection, architecture, and implementation in further detail.

3.4.2 Decision Tree - Baseline. Decision Trees are interpretable supervised machine
learning classifiers that classify input observations by sorting them through a series
of if-then question nodes, wherein a classification is achieved on the basis of the
class of the last node (also known as the terminal node) (Kingsford & Salzberg, 2008).
The if-then question nodes act as decision criteria: each node refers to a feature and
an associated if-then rule that discriminates one class from another. These nodes form a
hierarchical tree structure, hence the name Decision Trees; Figure 6 depicts this
architecture. Decision Trees have been used extensively for the task of predicting
diabetes in the literature (Zou et al., 2018; Ahmed, 2016; Arivu Selvan and Nadesh,
2017; Patil et al., 2010). Decision Trees are also often combined in ensemble methods
such as Random Forest and Gradient Boosted Trees, which have seen extensive use with
improved performance. Given that they are at the core of such models, they were chosen
as the baseline classifier. Decision Tree models come in different flavours, namely
flavours of the underlying algorithm that they run on. Some popular ones are ID3, C4.5,
and CART; for diabetes prediction, C4.5 or J48 Decision Trees are often utilized. The
Decision Tree model used in this research implemented an optimised version of the CART
algorithm from scikit-learn, which is similar to the C4.5 algorithm.

Figure 6: Diagram of Decision Tree architecture

3.4.3 Random Forest. Random Forests are ensemble methods composed of multiple decision
trees. Ensemble methods are a machine learning paradigm in which multiple models are
aggregated to improve overall performance and stability; such methods aggregate the
independent results of the constituent models to form a final decision. Random Forests
utilize a voting process to aggregate scores and compute a final decision, and thus lend
themselves to reducing variance in the data they are trained on (Fawagreh, Gaber &
Elyan, 2014). There exist three main approaches to building ensembles: boosting,
bagging, and stacking; Random Forests fall into the bagging category. Figure 7 depicts
the Random Forest model architecture. Like the Decision Trees that constitute them,
Random Forest models have been used in previous studies to predict diabetes with
relative success (Zou et al., 2018; Singh and Lakshmiganthan, 2017). The Random Forest
model was implemented via the scikit-learn library (Pedregosa et al., 2011).

3.4.4 Gradient Boosted Trees. Gradient Boosted Trees are ensemble methods just like
their Random Forest counterparts. Gradient Boosted Trees fit models in sequence, wherein
each model in the sequence is trained to correct the errors of the prior model (Natekin
& Knoll, 2013). Figure 8 depicts the Gradient Boosted Trees architecture. Gradient
Boosted Trees have not been extensively utilized for the prediction of diabetes but have
seen application to other classification tasks for the intensive care unit (Gutierrez,
2020). There exist several implementations of Gradient Boosted Trees; here the model was
implemented via the XGBoost (eXtreme Gradient Boosting) library (Chen & Guestrin, 2016).
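A hedged sketch of the XGBoost model, using the Table 3 hyperparameters for the Literature feature set, is shown below:

    # A minimal sketch of the Gradient Boosted Trees model via XGBoost,
    # with the Table 3 hyperparameters for the Literature feature set.
    from xgboost import XGBClassifier

    booster = XGBClassifier(n_estimators=1600, colsample_bytree=1.0, gamma=0.5,
                            max_depth=5, min_child_weight=1, subsample=1.0)
    booster.fit(X_train, y_train)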


Figure 7: Diagram of Random Forest architecture

Figure 8: Diagram of Gradient Boosted Trees architecture

3.4.5 Neural Network - Fully Connected Neural Network. Neural networks are a class of
machine learning models loosely based on how the mammalian nervous system functions. The
most basic neural network architecture is the perceptron. The perceptron consists of an
input layer with an arbitrary number of inputs, each denoted by a node. Each input has
an associated weight; the weighted sum of the inputs is calculated and evaluated by an
activation function (e.g. the sigmoid function), whose output is the output of the
perceptron. The perceptron is limited to learning linear representations; by aggregating
perceptrons, non-linear representations can be learned (Nielsen, 2015). An aggregation
of individual perceptrons is called a multilayer perceptron, and consists of an input
layer with an arbitrary number of input nodes, an arbitrary number of hidden layers each
consisting of an arbitrary number of neurons, and an output layer. Each neuron in each
layer has an associated weight and activation function, and is connected to all the
nodes in the consecutive layer. There exist many other neural network architectures,
such as Convolutional Neural Networks (CNNs), whose main application has been in image
processing, and Recurrent Neural Networks (RNNs); however, multilayer perceptrons have
seen the greatest success in the task of predicting diabetes (P et al., 2020; Zhou et
al., 2020). Thus the neural network architecture of choice is the multilayer perceptron.
Here a fully connected neural network (FCNN) architecture is utilized, based on the
model of Zhou et al. (2020), which used a 10-20-10 hidden layer architecture. The
activation function for the hidden layers is ReLU, the output activation function is a
standard sigmoid, and binary cross-entropy is the loss function of choice. The model was
implemented using the PyTorch library.

Figure 9: Diagram of Fully Connected Neural Network architecture
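For illustration, a minimal PyTorch sketch of the described 10-20-10 architecture is shown below; the input size of 12 is an assumption matching the 12-feature sets used later, and the choice of optimizer is likewise an assumption:

    # A minimal sketch of the 10-20-10 FCNN (ReLU hidden activations,
    # sigmoid output, binary cross-entropy loss).
    import torch
    import torch.nn as nn

    class FCNN(nn.Module):
        def __init__(self, n_features=12):  # input size is an assumption
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 10), nn.ReLU(),
                nn.Linear(10, 20), nn.ReLU(),
                nn.Linear(20, 10), nn.ReLU(),
                nn.Linear(10, 1), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)

    model = FCNN()
    loss_fn = nn.BCELoss()                            # binary cross-entropy
    optimizer = torch.optim.Adam(model.parameters())  # optimizer choice is an assumption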

3.5 Evaluation Metrics

The evaluation metrics are divided into two types: metrics for model performance and
metrics for fairness.

3.5.1 Performance Evaluation Metrics. Here we evaluate performance according to the
metrics identified in the literature as standard for the task of predicting diabetes.
These metrics are accuracy, specificity, recall, precision, and F-score. Additionally,
as is convention, the area under the receiver operating characteristic (ROC) curve is
also reported. In the following equations, TP, FP, TN, and FN are defined as follows.
True Positive (TP): person is diabetic and predicted to be diabetic. False Positive
(FP): person is non-diabetic and predicted to be diabetic. True Negative (TN): person is
non-diabetic and predicted to be non-diabetic. False Negative (FN): person is diabetic
and predicted to be non-diabetic. Precision measures the ratio of true positively
predicted diabetics to all true and false positive predicted diabetics; in other words,
it measures the proportion of true positives out of all positively predicted instances.
Thus, the higher the precision, the fewer false positives the model predicts.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]


Recall, on the other hand, measures the ratio of true positives to the total number of
actual positive cases; it thus reflects how many diabetics are incorrectly classified as
non-diabetic. As a false negative diagnosis can have serious negative implications for
the patient, recall is important to maximize.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

The F1-Score is the harmonic mean of precision and recall. It measures how well the
model balances precision and recall. As minimizing the number of false positives is also
important, the F1-Score gives a good indication of general model performance,
particularly when the dataset is not balanced, as is the case here.

\[ F_1\text{-Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

Lastly, accuracy measures the number of correctly classified instances divided by the
total number of instances. Accuracy is generally a good measure of model performance on
balanced datasets and for non-medical applications. As the dataset here is not balanced
(outside of the training sets), and given the medical application of the models,
accuracy is used only as a secondary metric to evaluate model performance.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]
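For reference, all of the above metrics are available in scikit-learn; the sketch below assumes labels y_test, hard predictions y_pred, and predicted probabilities y_score:

    # A brief sketch of computing the metrics in Eqs. (1)-(4) with scikit-learn.
    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    print(precision_score(y_test, y_pred))  # Eq. (1)
    print(recall_score(y_test, y_pred))     # Eq. (2)
    print(f1_score(y_test, y_pred))         # Eq. (3)
    print(accuracy_score(y_test, y_pred))   # Eq. (4)
    print(roc_auc_score(y_test, y_score))   # area under the ROC curve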

3.5.2 Fairness/Bias Evaluation Metrics. For fairness metrics, the equal opportunity
difference and average odds difference from the AI Fairness 360 toolkit (Bellamy et al.,
2019) are utilized for bias evaluation. The equal opportunity difference measures the
difference in true positive rates between a privileged and an unprivileged class. Here
the privileged class is the majority ethnic group (Caucasians) and the unprivileged
class comprises the remaining ethnic groups. Thus, first the true positive rate must be
calculated:

\[ TPR = \frac{TP}{TP + FN} \tag{5} \]

where the true positive rate is the number of correctly classified positive instances
divided by the total number of actual positive (true positive plus false negative)
instances. Now the equal opportunity difference (EOD) can be defined as follows:

\[ EOD = TPR_{D=\text{unprivileged}} - TPR_{D=\text{privileged}} \tag{6} \]

where D denotes the privilege label of the data.


The average odds difference measures the average difference in the false positive rate
and the true positive rate between unprivileged and privileged groups. Thus the false
positive rate must first be calculated:


\[ FPR = \frac{FP}{FP + TN} \tag{7} \]

where the false positive rate is the number of false positive classifications divided by
the sum of the false positive and true negative instances. Now the average odds
difference (AOD) metric can be defined as follows:

\[ AOD = \frac{1}{2}\left[\left(TPR_{D=\text{unprivileged}} - TPR_{D=\text{privileged}}\right) + \left(FPR_{D=\text{unprivileged}} - FPR_{D=\text{privileged}}\right)\right] \tag{8} \]
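For illustration, both bias metrics can be computed directly from predictions; the following is a minimal numpy sketch, assuming a boolean mask marking members of the privileged group:

    # A minimal sketch of the bias metrics in Eqs. (5)-(8), computed from
    # labels, predictions, and a boolean privileged-group mask.
    import numpy as np

    def tpr_fpr(y_true, y_pred):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        return tp / (tp + fn), fp / (fp + tn)  # Eqs. (5) and (7)

    def bias_metrics(y_true, y_pred, privileged):
        tpr_u, fpr_u = tpr_fpr(y_true[~privileged], y_pred[~privileged])
        tpr_p, fpr_p = tpr_fpr(y_true[privileged], y_pred[privileged])
        eod = tpr_u - tpr_p                              # Eq. (6)
        aod = 0.5 * ((tpr_u - tpr_p) + (fpr_u - fpr_p))  # Eq. (8)
        return eod, aod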

5. Results

Model performance, as well as the results of the hyperparameter optimization and
fairness evaluation, are described in detail in this section. All models were evaluated
after first being hyperparameter-optimized and subsequently trained on the Clinical
(literature) feature set, the Threshold-based feature set, and the feature set obtained
from recursive feature selection using SHAP values (the recursive-feature-selected
feature set); Tables 5 through 7 report performance per feature set. Data was split into
80/20 train/test splits for the tree-based models and 70/20/10 train/test/validation
splits for the neural network model. Each training set was balanced and consisted of
16,500 instances of diabetic and non-diabetic patients, whereas the test set was
unbalanced at a 70:30 ratio of non-diabetics to diabetics, consistent with the imbalance
in the original dataset; this resulted in 2,415 non-diabetics and 1,035 diabetics.
Results for each model are reported separately and then compared to each other.
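One plausible way to realize this split and balancing in code, not necessarily the exact procedure used here, is sketched below, assuming the data in a pandas DataFrame df with the 'diabetes_mellitus' label:

    # A hedged sketch of an 80/20 split with a balanced training set via
    # undersampling; one plausible realization, not the thesis's exact procedure.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2,
                                   stratify=df["diabetes_mellitus"])
    pos = train[train["diabetes_mellitus"] == 1]
    neg = train[train["diabetes_mellitus"] == 0].sample(len(pos))  # undersample majority
    train_balanced = pd.concat([pos, neg]).sample(frac=1)          # shuffle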
A summary of the best model performance independent of the feature sets can be
seen in Table 2.

Table 2: Best performance per model independent of feature set.

Classifier Precision Recall F1-Score ROC-AUC


Decision Tree 0.72 0.68 0.69 0.67
Random Forest 0.79 0.76 0.76 0.76
Gradient Boost 0.79 0.76 0.76 0.76
Fully Connected Neural Network 0.49 0.70 0.58 0.5

5.1 Hyperparameter Optimization Results

All models were optimized via grid search using 5-fold cross-validation. The models were
fit with each possible combination of parameters in the parameter grid and evaluated to
identify the parameter set with the best performance. For evaluation the F1-Score was
used, as is standard throughout this thesis. The results of this hyperparameter
optimization are shown in Table 3.
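For illustration, a minimal sketch of this grid search for the Decision Tree is shown below; the grid is an illustrative subset, not the full grid used:

    # A minimal sketch of grid search with 5-fold CV and F1 scoring; the
    # grid shown is an illustrative subset.
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_grid = {
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 6, 12],
        "min_samples_leaf": [1, 2],
        "min_samples_split": [2, 4, 6],
    }
    search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                          scoring="f1", cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)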


Table 3: Hyperparameter Optimization Results for each Classifier

Literature feature set:
Decision Tree: 'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6
Random Forest: 'n_estimators': 600, 'bootstrap': True, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2
Gradient Boost: 'n_estimators': 1600, 'colsample_bytree': 1.0, 'gamma': 0.5, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 1.0

Threshold feature set:
Decision Tree: 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 4
Random Forest: 'n_estimators': 600, 'bootstrap': False, 'max_depth': 40, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 2
Gradient Boost: 'n_estimators': 2000, 'colsample_bytree': 1.0, 'gamma': 0.5, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 1.0

The best hyperparameters for each classifier differed notably between the two feature
sets. For the Decision Tree, 'entropy' was the best split criterion for both feature
sets. Likewise, for the Random Forest classifier, the number of estimators was
consistent across both feature sets, at 600 estimators each. Here, bootstrapping was set
to 'True' for the model trained on the Literature feature set and to 'False' for the
Threshold-based feature set. The Gradient Boosted Trees models differed significantly in
the number of estimators that suited each model best, with 1600 and 2000 estimators
being optimal for the models trained on the Literature and Threshold feature sets
respectively. The differences here can be explained by the large variation between the
feature sets in terms of their number of features. These results thus highlight the
importance of optimizing hyperparameters for each model with respect to the underlying
feature set.

5.2 Feature Selection Results

The feature selection results from the ShapRFECV feature selection algorithm are dis-
played in Table 4.
For all classifiers, 'age', 'bmi', 'd1_creatinine_min', 'd1_glucose_max',
'd1_glucose_min', 'glucose_apache', 'd1_hemaglobin_max', and 'weight' were among the top
12 most important features. Feature importance was similar among all classifiers, with
only 1 to 3 features differing between them.

Table 4: Top 12 features selected per classifier by the ShapRFECV feature selection

Classifier: Features
Decision Tree: 'age', 'bmi', 'd1_heartrate_max', 'd1_calcium_min', 'd1_creatinine_min', 'icu_id', 'd1_glucose_max', 'd1_glucose_min', 'd1_hemaglobin_max', 'weight', 'glucose_apache', 'd1_wbc_max'
Random Forest: 'age', 'bmi', 'd1_creatinine_max', 'd1_creatinine_min', 'd1_glucose_max', 'd1_glucose_min', 'd1_hemaglobin_max', 'weight', 'd1_wbc_max', 'glucose_apache', 'ventilated_apache'
Gradient Boosted Trees: 'age', 'bmi', 'd1_glucose_min', 'd1_glucose_max', 'icu_id', 'weight', 'glucose_apache', 'd1_creatinine_max', 'd1_hemaglobin_max', 'd1_creatinine_min', 'd1_platelets_min'

Additionally, these results indicate that the feature importance scoring ranked highly
those features identified in the literature as clinical and biological predictors of
diabetes, validating the final list of features obtained.

5.3 Classification Performance


5.3.1 Clinical Feature Set Results. Table 5 shows the classifier performance of all
models on the Clinical feature set. On this de facto baseline feature set, the baseline
classifier, the Decision Tree, had an F1-Score of 0.68, with a precision (0.70) slightly
higher than its recall (0.67). The ensemble methods, namely the Random Forest and
Gradient Boosted Trees, performed better across all metrics, with F1-Scores of 0.73 and
0.71 respectively; here the Random Forest performed best out of all models. Conversely,
the Fully Connected Neural Network achieved good recall compared to the other models but
performed poorly on all other metrics.

Table 5: Classifier Performance on Clinical Feature Set

Classifier Precision Recall F1-Score ROC-AUC


Decision Tree 0.70 0.67 0.68 0.66
Random Forest 0.76 0.71 0.73 0.71
Gradient Boost 0.75 0.70 0.71 0.69
Fully Connected Neural Network 0.49 0.70 0.58 0.5

The Confusion Matrix of the Random Forest model indicates that the model is able
to correctly predict diabetic patients at a greater rate than non-diabetics. See Figure 10.

5.3.2 Threshold Feature Set Results. The performance of all models on the Threshold
feature set is contained in Table 6. Here the Random Forest and Gradient Boosted Trees
models performed best, with an F1-Score of 0.76, and improved across all metrics
compared to the Clinical feature set. This is to be expected given the significant
difference in the number of features between the sets (6 compared to 95). All models
besides the Fully Connected Neural Network (whose performance was unchanged) saw some
performance gain, with the Decision Tree attaining better precision (0.72) and ROC-AUC
(0.67).

Figure 10: Confusion Matrix of Random Forest model predictions for Clinical Feature Set.

Table 6: Classifier Performance on Threshold Feature Set

Classifier Precision Recall F1-Score ROC-AUC


Decision Tree 0.72 0.67 0.68 0.67
Random Forest 0.79 0.76 0.76 0.76
Gradient Boost 0.79 0.76 0.76 0.76
Fully Connected Neural Network 0.49 0.70 0.58 0.5

Figure 11 shows the Confusion Matrix of the Gradient Boosted Trees model. From it we can
determine that the model correctly predicts diabetic patients at a greater rate than
non-diabetics, and that it has overall lower false positive and false negative rates
than the Random Forest model trained on the Clinical feature set.

Figure 11: Confusion Matrix of Gradient Boosted Tree model predictions for Threshold
Feature Set.

5.3.3 Recursive-Feature-Selected Feature Set Results. Model performance for the
recursive-feature-selected feature set is displayed in Table 7. Here the Random Forest
and Gradient Boosted Trees performed best once again, with an F1-Score of 0.76.
Performance was similar to that on the Threshold feature set, with only a decrease in
recall of 0.01. Decision Tree performance improved over the Threshold feature set in
terms of recall and F1-Score, with increases of 0.01 in both metrics.

Table 7: Classifier Performance on Recursive-Feature-Selected Feature Set

Classifier Precision Recall F1-Score ROC-AUC

Decision Tree 0.72 0.68 0.69 0.67
Random Forest 0.79 0.75 0.76 0.75
Gradient Boost 0.79 0.75 0.76 0.75

The Confusion Matrix of the Gradient Boosted Tree model predictions for the
recursive-feature-selected feature set, seen in Figure 12, indicates that the false
positive and false negative rates are lower than those of the model trained on the
Threshold feature set, although still higher than those of the Random Forest model
trained on the Clinical feature set.

Figure 12: Confusion Matrix of Gradient Boosted Tree model predictions for
Recursive-Feature-Selected Feature Set.

5.4 Bias Evaluation

The best performing models were evaluated for bias. As model performance across all
models was almost identical between the Threshold feature set and the
recursive-feature-selected feature set, both were evaluated. The Neural Network was
omitted from this evaluation. Lower scores (closer to 0) for both Equal Opportunity
Difference and Average Odds Difference represent less bias, with a score of 0 indicating
no bias at all. Positively skewed results indicate bias in favour of the unprivileged
groups (minority ethnicities), and negatively skewed results indicate bias in favour of
the privileged group (Caucasians).

5.4.1 Model Bias for Threshold Feature Set. On both metrics, Equal Opportunity
Difference and Average Odds Difference, model bias can be considered low for all models,
and positively skewed in favour of the unprivileged groups. The Random Forest and
Gradient Boosted models have similar Equal Opportunity Difference and Average Odds
Difference scores, as seen in Table 8, with their Equal Opportunity Difference scores
(0.010 and 0.011 respectively) being notably smaller than that of the Decision Tree.
These results indicate that the difference in True Positive Rates between groups is
marginally biased in favour of the unprivileged groups.

Table 8: Equality of opportunity and Average Odds Differences per model for the
Threshold Feature set

Classifier Equal Opportunity Difference Average Odds Difference


Decision Tree 0.039 0.031
Random Forest 0.010 0.034
Gradient Boost 0.011 0.036

5.4.2 Model Bias for Recursive-Feature-Selected Feature Set. Much like the bias of the
models trained on the Threshold feature set, bias for the models trained on the
recursive-feature-selected feature set is low, as seen in Table 9. Of note is the model
bias for the Gradient Boosted Trees model: here both the Equal Opportunity Difference
and the Average Odds Difference are significantly lower, at -0.0010 and 0.0046
respectively, an order of magnitude smaller than the bias scores of the other
classifiers. Furthermore, the negative equal opportunity difference score indicates a
bias in favour of the privileged group, although this score is so low that it can
essentially be ignored.

Table 9: Equality of opportunity and Average Odds Differences per model for the
Recursive-Feature-Selected Feature Set

Classifier Equal Opportunity Difference Average Odds Difference


Decision Tree 0.022 0.025
Random Forest 0.052 0.028
Gradient Boost -0.0010 0.0046

6. Discussion

The goal of this thesis was to determine whether machine learning could be used to
predict diabetes in intensive care unit admissions, identify the best machine learning
classifier, and lastly quantify the impact of bias on model performance. Here, multiple
models were trained and evaluated in order to answer the main research question:
"To what extent can machine learning be used to predict diabetes mellitus in ICU
admissions?". From the results obtained (see Section 5) it can be concluded that the
extent is significant, depending on the machine learning method applied. All models,
including the baseline, were able to predict diabetes in the test set above chance
level. However, there were major differences in performance between the machine learning
models, with the lowest performing model, the Fully Connected Neural Network, reporting
an F1-Score of 0.58, whereas the highest performing model, the Random Forest, predicted
diabetics with an F1-Score of 0.76. The other models, namely the Decision Tree and
Gradient Boosted Trees, also performed significantly better than the neural network. The
performance of these models, although comparable to that seen in previous work on
diabetes prediction (Zhou, Myrzashova & Zheng, 2020; Mahboob Alam et al., 2019), is not
adequate for clinical use, but with improvements it could be. It should be noted that
these results are limited, as only some machine learning classifiers were utilized; the
results pertaining to the first research question are constrained to this subset, and a
full conclusion cannot be drawn in the context of all machine learning classifiers. It
is still possible, however, to interpret the results when considering only the subset of
classifiers explored, which lends itself to the second research question: "Which machine
learning classifier has the best performance on the prediction of diabetes mellitus?".
The results show that the best performance varied depending on the feature set that the
models were trained and evaluated on. Across all feature sets, the
ensemble-method-based models, Random Forest and Gradient Boosted Trees, consistently
performed best. On both the Threshold feature set and the recursive-feature-selected
feature set the two models had equal performance: on the Threshold feature set this
amounted to an F1-Score and recall of 0.76, precision of 0.79, and ROC-AUC of 0.76; on
the recursive-feature-selected feature set performance was similar, with an F1-Score of
0.76, recall and ROC-AUC of 0.75, and precision of 0.79. As the best performance is
equal, it can be concluded that both Random Forest and Gradient Boosted Trees have the
best performance. It should be noted, however, that the Random Forest classifier
outperformed the Gradient Boosted Trees on the Clinical feature set across all
performance metrics by 0.01 to 0.02.
Thus Random Forests have had the most success, followed closely by Gradient Boosted
Trees; the minute differences in performance can perhaps be attributed to
hyperparameters and/or variations in the feature sets that lend themselves better to one
model. Interestingly, the Fully Connected Neural Network underperformed significantly
when compared to previous work (Zhou, Myrzashova & Zheng, 2020; P et al., 2020), where
accuracy scores upwards of 0.90 were reported (note that only the F1-Score is reported
here, but these were identical to the accuracy scores). It may be that predicting
diabetes in intensive care unit patients is innately different, such that neural
networks struggle to learn the underlying representations that tree-based models seem to
excel at. Although it can be assumed from this study that tree-based ensemble methods
are best suited for this prediction task, this is not a conclusive answer, as neural
networks should in principle be just as effective as tree-based ensemble models: the
literature reports little difference between random forests, gradient boosted trees, and
neural networks. Lastly, the impact of bias was investigated under the research
question: "What is the effect of racially biased training data on machine learning model
performance on the prediction of diabetes mellitus in ICU admissions?". Here the best
performing models, namely the Decision Tree, Random Forest, and Gradient Boosted Trees,
were evaluated based on the difference in true positive rates (Equal Opportunity
Difference) and the average difference in true positive and false positive rates between
privileged and unprivileged groups (Average Odds Difference). For both bias metrics, the
results of the bias evaluation indicate no bias in any of the models. The highest Equal
Opportunity Difference score was 0.052 and the highest Average Odds Difference 0.036;
these are very low scores, and thus it can be concluded that racially biased training
data has no notable impact on the performance of machine learning models on the
prediction of diabetes in ICU admissions. These findings are in line with other work
investigating ethnic bias in machine learning models, such as that of Noseworthy et al.
(2020), which also concluded that race did not impact model performance between
ethnicities. An additional sub-research question was: "Which bias mitigation strategies
minimize model bias?". As no substantial bias was detected, it was deemed unnecessary to
attempt to further minimize bias with mitigation strategies such as those presented by
Bellamy et al. (2019).

6.1 Limitations

The research presented in this thesis is limited by several factors pertaining to both the
general model performance evaluation and fairness evaluation.

6.1.1 Limitations for General Model Performance. The prediction of diabetic patients
using the trained classifiers inherently relied on the quality of the training data and
the ground truth provided along with it. Here it is critical that the provided ground
truth is valid and accurate. The data utilized for this thesis originates from the
Massachusetts Institute of Technology's (MIT) Global Open Source Severity of Illness
Score (GOSSIS) initiative, which can be considered a reputable and trusted source.
However, due to the way diabetes manifests itself in the general population, there may
exist significant problems in the provided ground truth. The issue stems from research
estimating that 49.7% of diabetics remain undiagnosed (Cho et al., 2018). It is not
known to what extent this statistic applies to our dataset, but it cannot be ruled out
that at least a portion of the non-diabetic-labelled patients in the dataset are in
actuality diabetic. Thus model performance may be better or worse than reported, by
virtue of some of the predicted false positives and false negatives being, in fact,
correctly classified. However, given that the ground truth is provided as-is, we are
unable to determine whether this is the case. An alternative approach would be to
partner with a hospital to collect and/or gain access to an ICU patient admissions
dataset for which follow-up tests can be conducted on patients who are predicted to be
diabetes-positive by the machine learning models. Although this solution is not easily
feasible in most scenarios, it may work for certain researchers, such as those embedded
at institutions that operate university hospitals.

6.1.2 Limitations for Bias Evaluation. Model bias was evaluated with the Equal
Opportunity Difference metric, which measures the difference in true positive rates
between privileged and unprivileged groups, and with the Average Odds Difference metric,
which measures the average difference in true positive and false positive rates between
the two groups. These methods thus measure bias in prediction performance between the
two groups of privilege, in other words between the ethnic majority and the minorities.
However, as the unprivileged group consisted of the aggregated minority ethnicities, it
is not possible to determine whether there exists bias against any single one of the
minority ethnicities within this group. The bias evaluation results are therefore
limited to the aggregated group of minorities, and no conclusions can definitively be
drawn on model bias for each minority ethnicity independently. An alternative bias
evaluation method that would overcome this limitation would involve rerunning the
evaluation with the same bias metrics for each minority ethnicity separately, without
grouping the ethnicities into one single unprivileged group, so that the model's bias
for each minority ethnicity can be examined.

6.2 Future Work

There exist many avenues for future research on the task of predicting diabetes in
intensive care unit patients. For starters, a more diverse range of models can be
explored for their performance on the classification task. Perhaps most notably, hybrid
models that combine tree-based models with neural networks may be a promising avenue of
research: the interpretability of tree-based models and the performance of neural
networks may be fused in such a way that the errors of one model type are compensated
for by the other. Alternatively, further work can be conducted to improve performance
through feature engineering. Using the same feature-rich dataset as utilized here, the
feature set size can be reduced by generating more meaningful features from the existing
feature set. In parallel, work that aims at predicting blood-insulin levels based on the
features available in this dataset may also lead to promising results. Blood-insulin
levels were often the most important feature in previous work that predicted diabetes in
non-critical-care settings; therefore, supplementing the dataset with blood-insulin
levels as an additional feature may significantly improve model performance.

7. Conclusion

To conclude, the primary goal of this thesis was to investigate the extent to which
machine learning can be utilized to predict diabetes mellitus in intensive care unit
admissions. To this end, it was demonstrated that a subset of machine learning models,
namely Decision Trees, Random Forest, and Gradient Boosted Trees, could do so at a
reasonable level, similar to the model performance seen in previous machine learning
studies focused on predicting diabetes. However, this model performance is not
clinically viable, and further work is needed to improve it. The secondary goal of the
thesis was to evaluate the machine learning models for racial bias, more specifically
the effect which racially biased training data may have on model performance. To this
end, the results of the bias evaluation show that no model perpetuated racial biases
between the privileged (majority) and unprivileged (minority) ethnic groups; the effect
of racially biased training data can therefore be concluded to be negligible. The work
presented here is not without its limitations, such as those relating to the methodology
and the model and bias evaluations conducted; the evaluation methods and the uncertainty
of the ground-truth labels provided with the dataset are such limitations. Besides
addressing these limitations, further work should explore a wider range of machine
learning models (most notably, hybrid models may provide increased performance and
interpretability) and explore feature engineering approaches to further improve model
performance.


8. Self Reflection

The thesis was a valuable experience that allowed me to learn a lot, not only about
myself but also about the scientific research process and the amount of work that
goes into it. From previous work I understood that scientific research takes time and
careful planning, and that setting and meeting deadlines is important, particularly in a
semester where a lot of things were going on. When it came to working under supervision,
there were situations in which I went against the better judgment of my supervisor to
attempt to accomplish some goal I had imagined. One such instance was at the onset of
the project, when I spent a considerable amount of time finishing an online course on
how to work with medical data for research, so that I could get access to a dataset that
I deemed appropriate for my thesis. I spent at least 10 hours finishing the course, only
to not end up using the dataset, something my supervisor had warned me about. The lesson
learned is to really take the advice of supervisors and mentors into account and not be
naive. Next time around I would listen closely to advice, and only attempt such efforts
if ahead of schedule.
Now, when it came to writing code, I felt fairly comfortable, particularly after
completing the Software Engineering for CSAI course in the previous semester with good
results, from which I learned a lot of valuable lessons. However, old habits which I
developed in my earlier years of study proved hard to shake. The problem I ran into was
that I would start off code development adhering to all standards, but once I ran into
issues or experimented with varying versions of code, I would end up with various
Jupyter notebooks containing bits and pieces of working code that I would combine into
one notebook. Here commenting was an afterthought, as the mere effort of solving the
issue put me off the thought of cleaning up the code, which resulted in more work later
on. In future situations, I need to take a more level-headed approach and build
experimenting with code variations into my workflow, such that I comment on a section of
code as soon as I know it works, and make a habit of doing so. This is something I am
working on right now so that such issues do not arise in future projects. Of course, a
section of code may still need to be changed later on, but at least this way I save a
considerable amount of work.
Lastly, when it came to writing the actual thesis, time management was a big issue. I
think that with enough time I can write fairly well, but I put off writing for too long,
mainly because I was focused on other coursework, which resulted in a somewhat rushed
writing process for at least parts of my thesis. It is somewhat frustrating, especially
knowing that with more time I could have ironed out all the little details, but I
suppose that is something I just need to work on, by getting into the habit of writing
as I go and not starting the writing process towards the end. Overall I am not entirely
satisfied with my work, but I understand that given the circumstances it is nothing to
beat myself up about. The overall takeaway is that research, like many other things in
life, requires time, and you need to put yourself in a position where you have that time
to spend.


References
Gurupur, V., & Wan, T. (2020). Inherent Bias in Artificial Intelligence-Based Decision Support Systems
for Healthcare. Medicina, 56(3), 141. https://doi.org/10.3390/medicina56030141

Kelly, C., Karthikesalingam, A., Suleyman, M., Corrado, G., & King, D. (2019). Key challenges for
delivering clinical impact with artificial intelligence. BMC Medicine, 17(1).
https://doi.org/10.1186/s12916-019-1426-2

McCradden, M., Joshi, S., Anderson, J., Mazwi, M., Goldenberg, A., & Zlotnik Shaul, R. (2020). Patient
safety and quality improvement: Ethical principles for a regulatory approach to bias in healthcare machine
learning. Journal Of The American Medical Informatics Association, 27(12), 2024-2027.
https://doi.org/10.1093/jamia/ocaa085

Leslie, D. (2020). Understanding Bias in Facial Recognition Technologies. SSRN Electronic Journal.
https://doi.org/10.2139/ssrn.3705658

Leavy, S. (2018). Gender bias in artificial intelligence. Proceedings Of The 1St International Workshop
On Gender Equality In Software Engineering. https://doi.org/10.1145/3195570.3195580

Parikh, R., Teeple, S., & Navathe, A. (2019). Addressing Bias in Artificial Intelligence in Health Care.
JAMA, 322(24), 2377. https://doi.org/10.1001/jama.2019.18058

Hyland, S., Faltys, M., Hüser, M., Lyu, X., Gumbsch, T., & Esteban, C. et al. (2020). Early prediction of
circulatory failure in the intensive care unit using machine learning. Nature Medicine, 26(3), 364-373.
https://doi.org/10.1038/s41591-020-0789-4

Bellamy, R., Mojsilovic, A., Nagar, S., Ramamurthy, K., Richards, J., & Saha, D. et al. (2019). AI
Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal Of
Research And Development, 63(4/5), 4:1-4:15. https://doi.org/10.1147/jrd.2019.2942287

Bakker, J. (2016). Focus on acute circulatory failure. Intensive Care Medicine, 42(12), 1862-1864.
https://doi.org/10.1007/s00134-016-4596-9

Panch, T., Mattie, H. & Celi, L.A. The “inconvenient truth” about AI in healthcare. npj Digit. Med. 2, 77
(2019). https://doi.org/10.1038/s41746-019-0155-4

Noseworthy, P., Attia, Z., Brewer, L., Hayes, S., Yao, X., & Kapa, S. et al. (2020). Assessing and
Mitigating Bias in Medical Artificial Intelligence. Circulation: Arrhythmia And Electrophysiology, 13(3).
https://doi.org/10.1161/circep.119.007988

Anand, R. S., Stey, P., Jain, S., Biron, D. R., Bhatt, H., Monteiro, K., Feller, E., Ranney, M. L., Sarkar, I.
N., & Chen, E. S. (2018). Predicting Mortality in Diabetic ICU Patients Using Machine Learning and
Severity Indices. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on
Translational Science, 2017, 310–319.

Hargreaves, C., & Chow, C. (2020). Machine Learning Application to Predict the Length of Stay
of type 2 Diabetes Patients in the Intensive Care Unit. Test Engineering and Management, 6143-6163.

Balkau, B., Lange, C., Fezeu, L., Tichet, J., de Lauzon-Guillain, B., & Czernichow, S. et al. (2008).
Predicting Diabetes: Clinical, Biological, and Genetic Approaches: Data from the Epidemiological Study
on the Insulin Resistance Syndrome (DESIR). Diabetes Care, 31(10), 2056-2061.
https://doi.org/10.2337/dc08-0368

Gutierrez, G. (2020). Artificial Intelligence in the Intensive Care Unit. Critical Care, 24(1).
https://doi.org/10.1186/s13054-020-2785-y

Cho, N., Shaw, J., Karuranga, S., Huang, Y., da Rocha Fernandes, J., Ohlrogge, A., & Malanda, B.
(2018). IDF Diabetes Atlas: Global estimates of diabetes prevalence for 2017 and projections for 2045.
Diabetes Research And Clinical Practice, 138, 271-281. https://doi.org/10.1016/j.diabres.2018.02.023

Mahboob Alam, T., Iqbal, M., Ali, Y., Wahab, A., Ijaz, S., & Imtiaz Baig, T. et al. (2019). A model for
early prediction of diabetes. Informatics In Medicine Unlocked, 16, 100204.
https://doi.org/10.1016/j.imu.2019.100204

P, B., R, S., R K, N., & K, A. (2020). Type 2: Diabetes mellitus prediction using Deep Neural Networks
classifier. International Journal Of Cognitive Computing In Engineering, 1, 55-61.
https://doi.org/10.1016/j.ijcce.2020.10.002

Zhou, H., Myrzashova, R., & Zheng, R. (2020). Diabetes prediction model based on an enhanced deep
neural network. EURASIP Journal On Wireless Communications And Networking, 2020(1).
https://doi.org/10.1186/s13638-020-01765-7

Zou, Q., Qu, K., Luo, Y., Yin, D., Ju, Y., & Tang, H. (2018). Predicting Diabetes Mellitus With Machine
Learning Techniques. Frontiers In Genetics, 9. https://doi.org/10.3389/fgene.2018.00515

Houthooft, R., Ruyssinck, J., van der Herten, J., et al. (2015). Predictive modelling of survival
and length of stay in critically ill patients using sequential organ failure scores. Artificial
Intelligence in Medicine, 63, 191-207. https://doi.org/10.1016/j.artmed.2014.12.009

Awad, A., Bader-El-Den, M., McNicholas, J., & Briggs, J. (2017). Early hospital mortality
prediction of intensive care unit patients using an ensemble learning approach. International
Journal of Medical Informatics, 108, 185-195. https://doi.org/10.1016/j.ijmedinf.2017.10.002

Yoon, J. H., Mu, L., Chen, L., et al. (2019). Predicting tachycardia as a surrogate for
instability in the intensive care unit. Journal of Clinical Monitoring and Computing, 33,
973-998. https://doi.org/10.1007/s10877-019-00277-0

Sottile, P. D., Albers, D., Higgins, C., Mckeehan, J., & Moss, M. M. (2018). The association
between ventilator dyssynchrony, delivered tidal volume, and sedation using a novel automated
ventilator dyssynchrony detection algorithm. Critical Care Medicine, 46, e151-e157.
https://doi.org/10.1097/CCM.0000000000002849
