Implementation of Machine Learning Algorithms to Create Diabetic Patient Re-admission Profiles


Alloghani et al.

BMC Medical Informatics and Decision Making 2019, 19(Suppl 9):253


https://doi.org/10.1186/s12911-019-0990-x

RESEARCH Open Access

Implementation of machine learning algorithms to create diabetic patient re-admission profiles
Mohamed Alloghani1,2* , Ahmed Aljaaf1,3 , Abir Hussain1 , Thar Baker1 , Jamila Mustafina4 , Dhiya Al-Jumeily1
and Mohammed Khalaf5
From 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical
Informatics (ICBI) 2018 conference
Wuhan and Shanghai, China. 15–18 August 2018, 3–4 November 2018

Abstract
Background: Machine learning is a branch of Artificial Intelligence concerned with the design and development of algorithms that give computers the ability to learn. Machine learning is steadily growing and becoming a critical approach in many domains such as health, education, and business.
Methods: In this paper, we applied machine learning to a diabetes dataset with the aim of recognizing patterns and combinations of factors that characterize or explain re-admission among diabetes patients. The classifiers used include Linear Discriminant Analysis, Random Forest, k-Nearest Neighbor, Naïve Bayes, J48 and Support Vector Machine.
Results: Of the 100,000 cases, 78,363 were diabetic and over 47% were readmitted. Based on the classes the models produced, the diabetic patients most likely to be readmitted are women, Caucasians, outpatients, those who undergo less rigorous lab and treatment procedures, and those who receive less medication and are thus discharged without proper improvement or administration of insulin despite having tested positive for HbA1c.
Conclusion: Diabetic patients who do not undergo rigorous lab assessments, diagnoses, and medication are more likely to be readmitted when discharged without improvement and without insulin administration, especially if they are women, Caucasians, or both.
Keywords: Machine learning, Linear discriminant, Algorithms, Support vector machine, Diabetes re-admission, HbA1c

Introduction
The approaches used in managing maladies have a major influence on the medical outcome of the patient, including the probability of re-admission. A growing number of publications suggest an urgent need to explore and identify the contributing factors that play critical roles in human diseases. This can help to uncover the mechanisms underlying disease progression. Ideally, this can be achieved through experimental results that depict valuable methods with better performance when compared with other studies. In the same context, many strategies were developed to achieve such objectives by employing novel statistical models on large-scale datasets [1–6]. Such an observation has prompted the requirement of effective patient management protocols, especially for those admitted into an intensive care unit. However, the same protocols are not fully applicable to Non-Intensive Care Unit (Non-ICU) inpatients, and this has inculcated poor inpatient management practices regarding the number of treatments, the number of lab tests conducted, discharge,

*Correspondence: [email protected]
1 Artificial Intelligence Department, Dubai, UAE
2 Liverpool John Moores University, Liverpool, UK
Full list of author information is available at the end of the article
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Content courtesy of Springer Nature, terms of use apply. Rights reserved.



insignificant changes or improvements at the time of discharge, and high rates of re-admission. Nonetheless, such a claim has not been proven, nor has the influence of these factors on re-admission among diabetes patients. As such, this study hypothesized that time spent in hospital, number of lab procedures, number of medications, and number of diagnoses have an association with re-admission rates and are proxies of in-hospital management practices that affect patient health outcomes. However, detection of the Hemoglobin A1c (HbA1c) marker, administration of insulin treatment, diabetes treatment instances, and noted changes are factors that can moderate the admission and are treated as partial management factors in the study. Some re-admissions are avoidable, although this requires evidence-based treatments. The authors in [7], in a retrospective cohort study, evaluated the basic diagnoses and 30-day re-admission patterns among Academic Tertiary Medical Center patients and established that within-30-day re-admissions are avoidable. In specific, the study established that 8.0% of the 22.3% of within-30-day re-admissions are potentially avoidable. As a subtext to the conclusion, the authors asserted that these re-admission cases were related, as direct or indirect consequences, to pre-conditions related to the primary diagnosis. For instance, research demonstrated that patients admitted for heart failure and other related diseases are more likely to be readmitted for acute heart failure. However, the re-occurrence of the heart condition depends on the treatment administered, the observed health outcome at discharge, and other pre-existing health conditions.

Research contribution
Under the circumstances, it is essential for healthcare stakeholders to pursue re-admission reduction strategies, especially with a specific focus on potentially avoidable re-admissions. The authors in [8] highlighted the role of financial penalties imposed on health institutions with higher re-admission rates in reducing re-admission incidences. Furthermore, the article assessed and concluded that extensive assessment of patient needs, reconciling medication, educating the patients, planning timely outpatient appointments, and ensuring follow-up through calls and messages are among the best emerging practices for reducing re-admission rates. However, implementing these strategies requires significant funding, although the long-term impacts outweigh any financial demands. Hence, it suffices to deduce that re-admissions in a health facility are a priority area for improving health facilities and reducing healthcare cost. Regardless of the far-reaching interest in hospital re-admissions, little research has explored re-admission among diabetes patients. A reduction of diabetic patient re-admission can reduce health cost while improving health outcomes at the same time. More importantly, some studies have identified socioeconomic status, ethnicity, disease burden, public coverage, and history of hospitalization as key re-admission risk factors. Besides these factors and principal admission conditions, re-admission can be a factor of health management practices. This study provides information on the managerial causes of re-admission using six machine learning models. Additionally, most studies employ regression data mining techniques, and as such this study provides a framework for implementing other machine learning techniques in exploring the causative agents of re-admission rates among diabetes patients. The primary importance of the algorithm is to help hospitals identify multiple strategies that work effectively for re-admission of a given health condition. In specific, implementation of multiple strategies will focus on improved communication, the safety of medication, advancements in care planning, and enhanced training on the management of medical conditions that often lead to re-admissions. Each of these sub-domains involves decision making, and given the size and nature of healthcare information, data mining and deep learning techniques may prove critical in reducing re-admission rates.

Methodology
Figure 1 illustrates the high-level machine learning process diagram used in the paper. The study explored the probable predictors of diabetes hospital re-admission among the hospitals using machine learning techniques along with other exploratory methods. The dataset consists of 55 attributes, of which only 18 were used as per the scope of the study. The performance of the models is evaluated using the conventional confusion matrix and ROC efficiency analysis. The final re-admission model is based on the best performing model as per the true positive rates, sensitivity and specificity.

Linear discriminant analysis
The LDA algorithm is a variant of Fisher's linear discriminant, and it classifies data in vector format based on a linear combination of attributes with respect to a target factor or class variable. The algorithm has a close technical resemblance to Analysis of Variance (ANOVA) and regression, as it explains the influence of predictors using linear combinations [5]. There are two approaches to LDA. The techniques assume that the data conforms to a Gaussian distribution: each attribute has a bell-shaped curve when visualized, each variable has the same variance, and the data points of each attribute vary around the average by the same amount. That is, the algorithm requires the data and its attributes to be normally distributed and of constant variance or standard deviation. As a result, the algorithm estimates the mean and the variance of the data for each




Fig. 1 The Machine Learning Process Diagram

of the classes that it creates, using conventional statistical techniques:

    μ = (1/n_k) Σ (x)    (1)

where μ is the mean of each input attribute (x) for each class (k) and n_k is the number of observations in that class. The variance associated with the classes is also computed using the following conventional method:

    σ² = (1/(n − k)) Σ (x − μ)²    (2)

In Eq. 2, sigma squared is the variance across all instances serving as input to the model, k is the number of classes, and n is the number of observations or instances in the dataset; μ is the mean computed using Eq. 1.
Besides the assumptions, the algorithm makes predictions using a probabilistic approach that can be summarized in two steps. Firstly, LDA classifies predictors and assigns them to a class based on the value of the posterior probability, denoted as

    π(y = i | x)    (3)

The objective is to minimize the total probability of misclassifying the features, and this approach relies on Bayes' rule and the Gaussian distribution assumption for class means, where

    π(x | y = i)    (4)

Secondly, LDA finds a linear combination of the predictors that returns the optimum predictor value, and this study uses the latter approach. The LDA algorithm can be implemented in five basic steps. First, in performing LDA classification, the d-dimensional mean vectors are computed for the classes identified in the dataset using the mean approach (Eq. 1); the variance and normality assumptions must be checked before proceeding. Second, both the within-class and between-class scatters are computed and returned as matrices. The within-class scatter, or distances, is computed based on Eq. 5:

    S_within = Σ_{i=1}^{c} S_i    (5)

and

    S_i = Σ_{x ∈ D_i} (x − μ_i)(x − μ_i)^T    (6)
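As an illustrative sketch only (the paper provides no implementation), the class means of Eq. 1 and the scatter computations of Eqs. 5 and 6 can be written in plain Python. The two small two-dimensional point sets below are invented for demonstration and are not the study's data.

```python
# Minimal sketch of the LDA quantities in Eqs. 1, 5 and 6 for two
# classes of 2-dimensional points (illustrative data, not the study's).

def mean_vector(points):
    """Eq. 1: per-attribute mean of one class."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def class_scatter(points, mu):
    """Eq. 6: S_i = sum over x in D_i of (x - mu_i)(x - mu_i)^T."""
    d = len(mu)
    S = [[0.0] * d for _ in range(d)]
    for x in points:
        diff = [x[j] - mu[j] for j in range(d)]
        for r in range(d):
            for c in range(d):
                S[r][c] += diff[r] * diff[c]
    return S

D1 = [[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]]
D2 = [[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]]

mu1, mu2 = mean_vector(D1), mean_vector(D2)
S1, S2 = class_scatter(D1, mu1), class_scatter(D2, mu2)
# Eq. 5: the within-class scatter is the sum of the per-class scatters.
S_within = [[S1[r][c] + S2[r][c] for c in range(2)] for r in range(2)]

print(mu1)       # class-1 mean vector
print(S_within)  # 2x2 within-class scatter matrix
```

The discriminant direction in the later steps is then obtained from the eigen-decomposition involving the inverse of this within-class scatter matrix.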




In Eq. 6, S_i is the scatter for every class i identified in the dataset and μ_i is the class mean computed using Eq. 1.
The between-class scatter is calculated using Eq. 7:

    S_between = Σ_{i=1}^{c} N_i (μ_i − μ)(μ_i − μ)^T    (7)

In Eq. 7, μ is the overall mean, while μ_i and N_i refer to the sample mean and size of each identified class, respectively. The third step involves solving for the eigenvectors associated with the product of the within-class and between-class matrices. The fourth step involves sorting the linear discriminants to identify the new feature subspace; the selection and sorting use decreasing magnitudes of the eigenvalues. The last step involves the transformation of the samples or observations onto the new linear discriminant subspaces. The pseudo-code for LDA is presented in Algorithm 1.

Algorithm 1 Linear Discriminant Analysis
1: D = {(x_i^T, y_i)}, i = 1, ..., n
2: D_i = {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2    // declaration of class-specific subsets
3: μ_i = mean(D_i), i = 1, 2    // calculation of class means
4: B = (μ_1 − μ_2)(μ_1 − μ_2)^T    // computation of between-class scatter
5: Z_i = D_i − 1_{n_i} μ_i^T, i = 1, 2    // center the class matrices
6: S_i = Z_i^T Z_i, i = 1, 2    // the scatter of the respective classes
7: S = S_1 + S_2    // computation of within-class scatter
8: λ_1, w = eigen(S⁻¹ B)    // computation of the dominant eigenvector

For the classes i, the algorithm divides the data into D_1 and D_2, then calculates the within- and between-class distances, and the best linear discriminant is a vector obtained from the product of the inverse of the within-class scatter matrix and the between-class scatter matrix.

Random forest
Random forest is a variant of the decision tree growing technique, and it differs from the other classifiers because it supports randomly growing branches within the selected subspace. The random forest model predicts the outcome based on a set of random base regression trees. The algorithm selects a node at each random base regression tree and splits it to grow the other branches. It is important to note that Random Forest is an ensemble algorithm because it combines different trees; ideally, ensemble algorithms combine one or more classifiers of different types. Random forest can be thought of as a bootstrapping approach for improving the results obtained from the decision tree. The algorithm works in the following order. First, it selects a bootstrap sample S(i) from the sample space, where the argument denotes the ith bootstrap. The algorithm then learns a conventional decision tree, although through the implementation of a modified decision tree algorithm. The modification is specific and is systematically implemented as the tree grows: at each node of the decision tree, instead of iterating over all possible feature splits, RF randomly selects a subset of features f ⊆ F and then splits on the features in the subset f. The splitting is based on the best feature in the subset, and during implementation the algorithm chooses a subset that is much smaller than the set of all features. The small size of the subset reduces the burden of deciding on the number of features to split, since large subsets tend to increase the computational complexity. Hence, narrowing the attributes to be learned improves the learning speed of the algorithm.

Algorithm 2 Random Forest (Decision Tree Ensemble)
Prerequisite: Specify the training set S := (x_1, y_1), ..., (x_n, y_n), the set of all features F, and the number of trees to be included in the forest B
1: function RANDOMFOREST(S, F)
2:   H ← ∅
3:   for i ∈ 1, ..., B do
4:     S(i) ← bootstrap sample drawn from S
5:     h_i ← RANDOMIZEDTREELEARN(S(i), F)
6:     H ← H ∪ {h_i}
7:   end for
8:   return H
9: end function
10: function RANDOMIZEDTREELEARN(S, F)
11:   At each node:
12:     f ← draw a small subset of F
13:     split on the best feature in f
14:   return the learned tree (model)
15: end function

The algorithm uses bagging to implement the ensemble of decision trees, and it is prudent to note that bagging reduces the variance of the decision tree algorithm.

Support vector machine
Support Vector Machine is a group of supervised learning techniques that classify data based on regression




analysis. One of the variables in the training sample should be categorical so that the learning process assigns a new categorical value as part of the predictive outcome. As such, SVM is a non-likelihood binary classifier leveraging linear properties. Besides classification and regression, SVM detects outliers and is versatile when applied to high-dimensional data [1]. Ideally, a training vector variable that has at least two categories is defined as follows:

    x_i ∈ R^p, i = 1, ..., n    (8)

where x_i represents the training observations and R^p indicates the real-valued p-dimensional feature and predictor vector space. A pseudo-code for a simple SVM algorithm is illustrated in Algorithm 3.

Algorithm 3 Support Vector Machine
AttributeSupportVector (ASV) = {closest attribute pair from opposite classes}
1: while margin-constraint-violating points exist do
2:   Find the violator
3:   ASV ← ASV ∪ {violator}
4:   if any α_p < 0 because of the addition of c to S then
5:     ASV ← ASV \ {p}
6:     Repeat until all the violating points are pruned
7:   end if
8: end while

The algorithm searches for candidate support vectors, denoted as S, and it assumes that the ASV occupies a space where the parameters of the linear features of the hyperplane are stored.

k-nearest neighbor
kNN classifies data using the same distance measurement techniques as LDA and other regression-based algorithms. In classification applications the algorithm produces class members, while in regression applications it returns the value of a feature or a predictor [9]. The technique can identify the most significant predictor and as such was given preference in the analysis. Nonetheless, the algorithm requires high memory and is sensitive to non-contributing features, despite being considered insensitive to outliers and versatile among many other qualifying features. The algorithm creates classes or clusters based on the mean distance between data points. The mean distance is calculated using the following equation:

    ŷ(x) = (1/k) Σ_{(x_i, y_i) ∈ N(x, L, k)} y_i    (9)

In Eq. 9, N(x, L, k) denotes the k nearest neighbors of the input attribute (x) in the learning set L. The classification and prediction application of the algorithm depends on the dominant class among the k neighbors, and the predictive equation is the following:

    ŷ(x) = argmax_{c ∈ Y} Σ_{(x_i, y_i) ∈ N(x, L, k)} I(y_i = c)    (10)

It is imperative to note that the output class consists of members from the target attribute, and the distance used in assigning attributes to classes is the Euclidean distance. The implementation of the algorithm consists of six steps. The first step involves the computation of the Euclidean distances. In the second step, the computed n distances are arranged in non-decreasing order, and in the third step a positive integer k is drawn from the sorted Euclidean distances. In the fourth step, the k points corresponding to the k distances are established and assigned based on proximity to the center of the class. Finally, for k > 0, an attribute x is assigned to class i if k_i > k_j for all i ≠ j. Algorithm 4 shows the kNN steps:

Algorithm 4 k-Nearest Neighbor
Preconditions: Specify the training data (X), class labels (Y), and an unknown sample (x)
1: Classify(X, Y, x)
2: for i = 1 to m do
3:   Compute the distance d(X_i, x)
4: end for
5: Compute the set I containing the indices of the k smallest distances d(X_i, x)
6: Return the majority label for {Y_i ; i ∈ I}

Naïve Bayes
Even though Naïve Bayes is one of the supervised learning techniques, it is probabilistic in nature, so that classification is based on Bayes' rules of probability, especially those of association. Conditional probability is the construct of the Naïve Bayes classifier [9–16]. The algorithm assigns instance probabilities to the predictors parsed in a vector format representing each probable outcome. The Naïve Bayes classifier is the posterior probability returned by dividing the product of the prior and the likelihood by the evidence. The construction of the model from the output of the analysis is quite complex, although the probabilistic computation from the generated classes is straightforward [17–22]. The Bayes Theorem upon which the Naïve Bayes classifier is based can be written as follows:

    P(μ|ν) = P(ν|μ) P(μ) / P(ν)    (11)

where μ and ν are events or instances in an experiment and P(μ) and P(ν) are the probabilities of their occurrence. The conditional probability of an event μ occurring after




ν is the basis of the Naïve Bayes classifier. The classifier uses the maximum likelihood hypothesis to assign data points to classes. The algorithm assumes that each feature is independent and makes an equal contribution to the outcome, or that all features belonging to the same class have the same influence on that class. In Eq. 11, the algorithm computes the probability of event μ provided that ν has already occurred; as such, ν is the evidence, and P(μ) is regarded as the prior probability, that is, the probability obtained before seeing the evidence, while the conditional probability P(μ|ν) is the posterior probability, since it is computed with the evidence.

J48
J48 is one of the decision tree growing algorithms. Specifically, J48 is the reincarnation of the C4.5 algorithm, which is an extension of the ID3 algorithm [23]. As such, J48 is a hierarchical tree learning technique, and it has several mandatory parameters, including the confidence value and the minimum number of learning instances, which are translated to branches and nodes in the final decision tree [23–29].

Data assembly and pre-processing
The study used diabetes data that was collected across 130 hospitals in the US between 1999 and 2008 [30]. The dataset includes data systematically composed from contributing electronic health records providers that contained encounter data such as inpatient, outpatient and emergency visits, demographics, provider specialty, diagnoses, in-hospital procedures, in-hospital mortality, and laboratory and pharmacy data. The complete list of the features and their descriptions is provided in Table S1 (Additional file 1). The data has 55 attributes, about 100,000 observations, and missing values. However, the study used a sample based on the treatment of diabetes. In specific, of the 100,000 cases, 78,363 meet the inclusion criteria, since they received medication for diabetes. Consequently, the study explored re-admission incidences among patients who had received treatment. The amount of missing information and the type of the data (categorical or numeric) guided the data cleaning process; re-admission, insulin prescription, HbA1c test results, and observed changes were retained as the major outcomes associated with time spent in the hospital, the number of diagnoses, lab procedures, procedures, and medications [31, 32]. Of the 55 variables, only 18 were selected as per the scope of the analysis, and about 8 of those selected served as proxy controls. The data was split into 70% training and 30% validation subsets.

K-fold validation
To improve the overall accuracy and validate a model, we relied on the 10-fold cross-validation method for estimating accuracy. The training dataset is split into k subsets, and one subset is held out while the model is trained on the remaining subsets. Figure 2 illustrates the validation method. The K-fold cross-validation method takes the defined training feature set and randomly splits it into k equal subsets. The model is trained k times, and during each iteration one subset is excluded for use as validation. This technique reduces over-fitting, which occurs when a model fits the training data too closely and consequently fails to predict future information reliably [2, 12, 33].

Fig. 2 Cross-Validation Scheme for both training validation subsets
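The splitting step described above can be sketched in a few lines of plain Python; this is an illustrative sketch rather than the authors' code, and `k_fold_indices` is a hypothetical helper name. Indices are shuffled once, cut into k near-equal folds, and each fold serves as the held-out validation subset for exactly one training round.

```python
# Sketch of the 10-fold cross-validation split: each sample appears in
# exactly one validation fold across the k rounds (pure-Python sketch).
import random

def k_fold_indices(n, k, seed=0):
    """Return a list of k (train_idx, valid_idx) pairs covering n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k near-equal subsets
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, valid))
    return splits

splits = k_fold_indices(100, 10)
# Every sample is validated exactly once across the 10 folds.
validated = sorted(i for _, valid in splits for i in valid)
print(len(splits), len(validated))  # prints "10 100"
```

Averaging a model's score over the k validation folds then gives the accuracy estimate the section refers to.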




Discussion
Exploratory analysis
Overall, 47.7% of the diabetic patients were readmitted: 11.6% stayed in the hospital for less than 30 days while 36.1% stayed for more than 30 days. A majority (52.3%) of those who stayed for more than 30 days did not receive any medical procedures during the first visit. In general, diabetic patients who received a smaller number of lab procedures, treatment procedures, medications, and diagnoses are more likely to be readmitted than their counterparts. Furthermore, the more frequently a patient is admitted as an in-patient, the lower the probability of re-admission. Our study indicated that women (53.3%) and Caucasian (74.6%) diabetic patients are more vulnerable to re-admission than males and the other races. Besides fewer lab procedures, medications, and diagnoses, insulin administration and HbA1c results exacerbate the re-admission rates among diabetic patients.

Scatterplots
The scatterplots of re-admission incidences with an overlay of HbA1c measurements and change recorded at the time of discharge are shown in Figs. 3 and 4. Figure 3 illustrates the scatterplot of the number of diagnoses and lab procedures that patients received against re-admission rates. The figure has 8 panels displaying scatters of diagnoses and lab procedures for different instances of HbA1c results and change. The plot shows that patients who had negative HbA1c test results received several diagnoses, and very few were readmitted. Those who received less than 10 diagnoses and less than 70 procedures were more likely to be readmitted. None of the patients received more than diagnosis, and a majority were admitted for more than 30 days.
Figure 4 depicts a scatter plot of the number of diagnoses and lab procedures. The re-admission rates are quite different between the group of patients who noted change at discharge and those who did not. Those who failed to note significant improvement at discharge received more than 50 medications and less than 10 diagnoses. However, re-admission is higher among those who noted improvement at discharge.

Density distributions
The distribution of re-admission and the subsequent patterns associated with reported change and HbA1c results are shown in Figs. 5 and 6. Figures 5 and 6 illustrate the density distributions of the number of medications, lab procedures, and diagnoses grouped by re-admission, HbA1c results, insulin administration, and change at discharge. Notably, the distribution densities of the number of lab procedures, medications, and diagnoses are the same across grouping categories. Figure 6 shows significant differences in the number of medications and lab procedures. For instance, the average number of medications differs between the 'No', 'Up', 'Steady', and 'Down' insulin categories. A similar difference in the mean number of medications is observed in the change distribution curve, with those recording change at discharge receiving more medications than their counterparts.

Fig. 3 Scatterplot of Medications and Diagnoses
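The group-wise comparisons behind the density plots (for example, the mean number of medications per insulin category) reduce to simple conditional means. The sketch below uses invented toy records, not the study's data, and `group_means` is a hypothetical helper name.

```python
# Conditional (group-wise) means, as used to compare insulin categories.
from collections import defaultdict

records = [
    {"insulin": "No",     "num_medications": 8},
    {"insulin": "No",     "num_medications": 10},
    {"insulin": "Up",     "num_medications": 22},
    {"insulin": "Up",     "num_medications": 18},
    {"insulin": "Steady", "num_medications": 15},
    {"insulin": "Down",   "num_medications": 17},
]

def group_means(rows, by, value):
    """Mean of `value` within each level of the grouping attribute `by`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[by]] += r[value]
        counts[r[by]] += 1
    return {g: sums[g] / counts[g] for g in sums}

means = group_means(records, "insulin", "num_medications")
print(means)  # {'No': 9.0, 'Up': 20.0, 'Steady': 15.0, 'Down': 17.0}
```

A density plot per category then visualizes the full distribution around each of these means.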




Fig. 4 Scatterplot of Medications and Diagnoses

Fig. 5 Density Plots of Predictors by re-admission and HbA1c




Fig. 6 Density Plots of Predictors by Insulin and change

Smooth linear fits
Figures 7 and 8 illustrate the smooth line fits associated with the scatterplots. The smoothed fits include a 95% confidence interval and demonstrate the likely performance of linear regression models in forecasting re-admission. Figures 7 and 8 depict smooth linear fits of the scatterplots and density plots in Figs. 3, 4, 5, and 6. The figures illustrate that the number of lab procedures has a linear relationship with the number of diagnoses, although the data is likely to be heteroskedastic. The number of diagnoses and medications also have the same relationship and plot patterns. For medications versus procedures, the relationship is linear, and change in diabetes status increases with medications and lab procedures. As for re-admission, incidents of more-than-30-day re-admission reduced with an increasing number of diagnoses, lab procedures, and medications. Similarly, the probability of detecting HbA1c increases with an increasing number of diagnoses and lab procedures.

Model evaluation
The performance of the models in predicting re-admission incidence was based on the confusion matrix, and in specific the percentage of correctly predicted re-admission categories. Table 1 depicts that Naïve Bayes correctly classified the less-than-30-day re-admission rates and the no-re-admission incidences. SVM accurately classified 48.3% of the re-admission incidences exceeding 30 days. The objective is to obtain the best performing model.

Individual model performance
The LDA model yields two linear discriminants, LD1 and LD2, with proportions of trace of 0.9646 and 0.0354 respectively. Hence, the first LD explains 96.46% of the between-group variance while the second accounts for 3.54% of the between-group variance.

    LD1 = 0.03 × Lab Procedures − 0.102 × Procedures + 0.08 × Medications + 0.18 × Emergency + 0.67 × Inpatient + 0.17 × Diagnoses    (12)

Figure 9 illustrates the plot of LD1 versus LD2, and Eq. 12 depicts the profile of diabetic patients. The predictors were significantly correlated at the 5% level, and they influenced re-admission based on the frequency of each. The kNN model used all 16 predictors to learn the data and selected three as significant predictors. In specific, the kNN model proposes that high re-admission for diabetes treatment is caused by a smaller number of



Fig. 7 Smooth Linear Fits with Insulin and Change as Facets

lab procedures, diagnoses, and medications. However, the rates are higher among patients who tested positive for HbA1c and did not fail to receive insulin treatment (Fig. 3).
SVM classified the readmitted diabetic patients into three classes using a polynomial of degree 3, suggesting that diabetes re-admission cases do not have a linear relationship with the predictors. As an inference, the polynomial relationship illustrated by the kernel and degree of the SVM indicates higher re-admission rates among patients discharged without any significant changes (Fig. 4). The Naïve Bayes classifier yields two classes using the Laplace approach. The classification from the model depicts a reduced likelihood of re-admission in cases where the patients undergo a series of laboratory tests, rigorous diagnosis, proper medication, and discharge after confirmation of improvement. The density distributions in Figs. 5 and 6 complement the findings of the model. In specific, the distributions of the number of medications and lab procedures show a noticeable difference when considering insulin administration as part of treatment. Regarding aggregation of the distributions of the number of medications and lab procedures by status at discharge (change), the distribution curves suggest that patients are more likely to feel better at the time of discharge provided that the lab services and medications are of superior quality. It is important to reiterate that the Naïve Bayes model has true positive and false negative rates showing that it had 13.78% accuracy and 13.78% sensitivity. Finally, random forest classified diabetic patients using linear approaches with re-admission as the control. Figures 7 and 8 demonstrate that the smoothed linear fits of the paired predictors show that re-admissions taking more than 30 days are reduced by an increasing number of medical diagnoses. Further, the HbA1c results increase with an increasing number of diagnoses. However, it is important to note that the association between the number of lab procedures and medications tends to be non-linear, while that between the number of diagnoses and medications is linear regardless of the grouping variable. The J48-based tree shown in Fig. 9 does not consider the linear relationships and omits diabetic patients who were never re-admitted. The resultant tree included the number of inpatient treatment days, number of emergencies, number of medications, lab procedures, and diagnoses in the model. The model suggests that diabetic patients admitted as in-patients tend not to be re-admitted. Similarly, the tree demonstrates that




Fig. 8 Smooth Linear Fits with re-admission and HbA1c as Facets

several diagnoses improve health outcomes and reduce re-admission.

Best fit model
The best fitting model is based on the performance measures summarized in Table 2. The key decision relies on the efficiency of the model in predicting the re-admission rates, and the area under the curve (AUC) and the precision/recall curve are the best measures for such a task.
Table 2 illustrates that Naïve Bayes is the most sensitive and efficient model for learning, classifying, and predicting re-admission rates using mHealth data. It has an efficiency of 64% and a sensitivity of 52.4%. The ROC curves associated with the predictions of re-admission that exceeded 30 days are displayed in the figures below. The larger the area covered, the more efficient the model is, and by this principle Fig. 10 shows that Naïve Bayes is the most efficient.

Naïve Bayes analysis
The model focused on the top five factors (exposures) that contributed to re-admission for less and for more than 30 days. The associations between the exposures and the outcome (re-admission instances) are given as log odds ratios in the nomograms illustrated in Fig. 11. The three model classes are Class 0 (no re-admission), Class 1 (re-admission for less than 30 days), and Class 2 (re-admission for more than 30 days).
Figure 11 depicts the exposure factors with absolute importance on Class 0, including the number of emergencies, the number of patients, discharge disposition ID, admission source ID, and number of diagnoses. The log odds ratios illustrate the association between these exposure factors. The conditional probability for re-admission after discharge based on these exposure factors is 0.5.

Table 1 True Positive Rate Comparison
Model <30 days >30 days No re-admission
Random Forest 21.0% 42.8% 60.5%
kNN 17.8% 40.3% 59.6%
Naïve Bayes 23.6% 46.6% 61.2%
SVM 12.2% 48.3% 55.9%
J48 17.3% 40.4% 60.3%
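The per-class true positive rates in Table 1 are read off each model's confusion matrix: the diagonal count for a class divided by that class's row total. A minimal sketch, using a hypothetical 3×3 matrix whose counts are chosen so the resulting rates reproduce the Naïve Bayes row of Table 1 (the underlying counts are not reported in the paper):

```python
# Per-class true positive rate (recall) from a multi-class confusion
# matrix: TPR for class i = matrix[i][i] / sum of row i.
# The counts below are hypothetical; only the rates match Table 1.

def per_class_tpr(matrix):
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Rows are actual classes, columns predicted classes, in the order
# "<30 days", ">30 days", "no re-admission".
confusion = [
    [236, 400, 364],   # actual "<30 days"
    [150, 466, 384],   # actual ">30 days"
    [120, 268, 612],   # actual "no re-admission"
]
print([round(r, 3) for r in per_class_tpr(confusion)])
```

With these counts the function returns 0.236, 0.466, and 0.612, i.e. the 23.6%, 46.6%, and 61.2% reported for Naïve Bayes.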




Fig. 9 Plot of two linear discriminants obtained from LDA learner
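The J48 (C4.5) tree discussed above selects numeric split points such as 8.5 or 9.5 diagnoses by maximizing an entropy-based gain criterion. A toy illustration with invented values and labels (C4.5 proper uses gain ratio; plain information gain is shown here for brevity):

```python
# Entropy and information gain, the criterion family behind C4.5/J48
# splits. The values and labels are illustrative, not study data.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(values, labels, threshold):
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:       # degenerate split: no gain
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical "number of diagnoses" values with re-admission labels
# (1 = readmitted). A split at 8.5 separates the classes perfectly here.
diagnoses = [3, 4, 5, 8, 9, 10, 11, 12]
readmit = [0, 0, 0, 0, 1, 1, 1, 1]
print(info_gain(diagnoses, readmit, 8.5))
```

A perfect split yields a gain of 1 bit on this balanced toy set; the tree learner evaluates many candidate thresholds and keeps the one with the highest gain.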

Figure 12 depicts the exposure factors with absolute importance on Class 1, including the number of emergencies, the number of patients, discharge disposition ID, time in hospital, and number of diagnoses. The log odds ratios display the association between these exposure factors and the lack of re-admission after discharge. The conditional probability for re-admission after discharge based on these exposure factors is 14%. Specifically, there is a 48% chance of re-admission for patients with between 8.5 and 9.5 diagnoses, and a 52% chance for those with between 5.5 and 8.5 diagnoses. Similarly, those spending between 2.5 and 3.5 days in the hospital are more likely to be readmitted for less than 30 days (59%) than their counterparts, who have a 41% chance of re-admission. Finally, those with little emergency admission history stand a higher chance of re-admission (80%) than those with substantial emergency admission history.
Figure 13 depicts the exposure factors with absolute importance on Class 2, including the number of emergencies, the number of patients, discharge disposition ID, admission source ID, and number of diagnoses. The log odds ratios illustrate the association between these exposure factors and the lack of re-admission after discharge. The conditional probability for re-admission after discharge based on these exposure factors is 0.42. The number of emergency admissions increases re-admission chances by 80% for those with the least history. Further, those with a higher inpatient admission history have a 65% chance of re-admission for more than 30 days. Most importantly, patients who undergo more than 9.5 diagnostic tests have a 70% chance of re-admission for more than 30 days after discharge.

Table 2 Comparison of model efficiency and sensitivity
Model AUC CA F1 Precision Recall
kNN 0.575 0.499 0.489 0.482 0.499
J48 0.578 0.490 0.487 0.485 0.490
SVM 0.547 0.475 0.421 0.483 0.475
Random Forest 0.602 0.529 0.509 0.499 0.529
Naïve Bayes 0.640 0.566 0.524 0.519 0.566

Conclusion
The size of health data and the amount of information they contain exemplify the importance of machine learning in the health sector. Developing profiles for the patients can help in understanding the factors that reduce the burden of the disease while at the same time improving outcomes. Diabetes is a major problem given that over 78% of the patients admitted across the 130 hospitals were treated for the condition. Of the total number of diabetic patients who participated in the study, over 47% were readmitted, with over 36% staying in the hospital for over 30 days. This study has also established that women and Caucasians are more vulnerable to hospital re-admissions [5, 33–39]. Each of the machine learning models has established different combinations of features influencing the admission rates. For instance, LDA proposes a linear combination, while the SVM suggests a third-degree polynomial association between re-admission and its predictors. Further, J48 models the relationship as non-linear with




Fig. 10 ROC curves illustrating the Areas Under Curve for the models
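The AUC values that Fig. 10 visualizes can also be computed without plotting, since AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A self-contained sketch with invented labels and scores (not model outputs from the study):

```python
# AUC as the normalized Mann-Whitney U statistic: the fraction of
# (positive, negative) pairs where the positive case scores higher,
# with ties counting half. Labels and scores below are illustrative.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = readmitted after more than 30 days, 0 = not; scores are
# hypothetical predicted probabilities.
y = [1, 1, 1, 0, 0, 0, 0, 1]
p = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1, 0.4, 0.7]
print(auc(y, p))
```

This pairwise definition is equivalent to the area under the ROC curve, which is why a larger covered area indicates a more efficient classifier.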

Fig. 11 Nomogram visualization of Naïve Bayes classifier on target class 0
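Nomograms such as the one in Fig. 11 sum per-factor log odds contributions and map the total back to a conditional probability through the inverse logit. A hedged sketch (the contribution values below are invented; the paper reports only the final probabilities, e.g. 0.5 for Class 0):

```python
# Convert summed log odds ratio contributions from a nomogram into a
# conditional class probability. The contributions are illustrative,
# not the values of the published nomogram.
from math import exp

def nomogram_probability(prior_log_odds, contributions):
    total = prior_log_odds + sum(contributions)
    return 1.0 / (1.0 + exp(-total))  # inverse logit

# Hypothetical contributions for: emergencies, inpatient visits,
# discharge disposition, admission source, number of diagnoses.
contrib = [0.8, -0.3, 0.2, -0.4, -0.3]
print(nomogram_probability(0.0, contrib))  # total log odds near 0 gives p near 0.5
```

Reading a nomogram is exactly this computation done graphically: each factor's position contributes points (log odds), and the summed points are converted to a probability on the bottom scale.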




Fig. 12 Nomogram visualization of Naïve Bayes classifier on target class 1
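The Laplace approach mentioned earlier for the Naïve Bayes classifier is add-one smoothing of the conditional frequency estimates, which prevents a feature value unseen in a class from zeroing out that class's posterior. A minimal sketch with invented counts:

```python
# Laplace (add-one) smoothed conditional probability used by Naive
# Bayes: P(value | class) = (count + 1) / (total + k), where k is the
# number of distinct feature values. Counts here are illustrative.

def laplace_estimate(count, total, n_values):
    return (count + 1) / (total + n_values)

# Among 40 hypothetical readmitted patients, suppose insulin was
# recorded as "steady" for 25, "up" for 15, and "down" for 0
# (3 possible values).
p_steady = laplace_estimate(25, 40, 3)
p_down = laplace_estimate(0, 40, 3)   # unseen value still gets mass
print(round(p_steady, 4), round(p_down, 4))
```

Note that the smoothed estimates over all three values still sum to one, so the result remains a valid conditional distribution.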

emphasis on the importance of emergency admission and in-patient treatment on re-admission rates. The kNN models lead to the conclusion that a fewer number of lab procedures, diagnoses, and medications leads to higher re-admission rates. Diabetic patients who do not undergo vigorous lab assessments, diagnoses, and medications are more likely to be readmitted when discharged without improvements and without receiving insulin administration, especially if they are women, Caucasians, or both.

Fig. 13 Nomogram visualization of Naïve Bayes classifier on target class 2




Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12911-019-0990-x.

Additional file 1: List of features and descriptions in the experiment datasets.

Abbreviations
AI: Artificial intelligence; ANOVA: Analysis of variance; C4.5: Data mining algorithm; DA: Discriminant analysis; HbA1c: Glycated hemoglobin test; ID3: Iterative dichotomiser 3; J48: Decision tree J48; KNN: K-nearest neighbors; LA: Linear discriminant; ML: Machine learning; NB: Naive Bayes; Non-ICU: Non-intensive care unit; RF: Random forest; ROC: Receiver operating characteristic; SVM: Support vector machine

Acknowledgments
The data source used in this paper was retrieved from the UCI Machine Learning Repository, as submitted by the Center for Clinical and Translational Research. The organization and characteristics of the data made it straightforward to complete the classification and clustering tasks. We are also grateful to the artificial intelligence department for providing the necessary support to carry out this research.

About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 9, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: medical informatics and decision making. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-9.

Authors' contributions
All authors read and approved the final manuscript.

Author details
1 The Artificial Intelligence Department, Dubai, UAE. 2 Liverpool John Moores University, Liverpool, UK. 3 The University of Anbar, Al-Tameem Street, 55431 Al-Anbar, Al-Ramadi, Iraq. 4 Kazan Federal University, Kremlyovskaya St, 420008 Kazan, Republic of Tatarstan, Russia. 5 Department of Computer Science, Al-Maarif University College, Anbar, 31001 The city of Ramadi, Iraq.

Published: 12 December 2019

References
1. Guo W-L, Huang D-S. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency. Mol BioSyst. 2017;13(9):1827–37. https://doi.org/10.1039/C7MB00155J.
2. Strack B, DeShazo JP, Clore JN. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Res Int. 2014. https://doi.org/10.1155/2014/781670.
3. Bengio Y, Grandvalet Y. No unbiased estimator of the variance of k-fold cross-validation. J Mach Learn Res. 2004;5:1089–105.
4. Bo LJ, Song. Naive Bayesian classifier based on genetic simulated annealing algorithm. Procedia Eng. 2011;23:504–9. https://doi.org/10.1016/j.proeng.2011.11.2538.
5. Chan M. Global report on diabetes. WHO; 2016. ISBN 978-9241565257. https://apps.who.int/iris/bitstream/handle/10665/204871/9789241565257_eng%.pdf;jsessionid=BE557465C4C16EF288D80B9E41AE01C8?sequence=1.
6. Chen Peng LZ, Huang D-S. Discovery of relationships between long non-coding RNAs and genes in human diseases based on tensor completion. IEEE Access. 2018;6:59152–62. https://doi.org/10.1109/ACCESS.2018.2873013.
7. Bansal D, Khanna K, Chhikara R, Gupta P. Comparative analysis of various machine learning algorithms for detecting dementia. Procedia Comput Sci. 2018;132:1497–502. https://doi.org/10.1016/j.procs.2018.05.102.
8. Sisodia D, Sisodia DS. Prediction of diabetes using classification algorithms. Procedia Comput Sci. 2018;132:1578–85. https://doi.org/10.1016/j.procs.2018.05.122.
9. Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J. 2017;15:104–16. https://doi.org/10.1016/j.csbj.2016.12.005.
10. Chuai G, Jifang Y, Chen M, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):18.
11. Yi H-C, Huang D-S, Li X, Jiang T-H, Li L-P. A deep learning framework for robust and accurate prediction of ncRNA-protein interactions using evolutionary information. Mol Ther Nucleic Acids. 2018;11:337–44. https://doi.org/10.1016/j.omtn.2018.03.001.
12. Ling H, Kang W, Liang C, Chen H. Combination of support vector machine and k-fold cross validation to predict compressive strength of concrete in marine environment. Constr Build Mater. 2019;206:355–63. https://doi.org/10.1016/j.conbuildmat.2019.02.071.
13. Kaur H, Kumari V. Predictive modelling and analytics for diabetes using a machine learning approach. Appl Comput Inform. 2018. https://doi.org/10.1016/j.aci.2018.12.004.
14. Zhang H, Yu P, et al. Development of novel prediction model for drug-induced mitochondrial toxicity by using naïve Bayes classifier method. Food Chem Toxicol. 2017;10:122–9. https://doi.org/10.1016/j.fct.2017.10.021.
15. Donzé J, Bates DW, Schnipper JL. Causes and patterns of readmissions in patients with common comorbidities: retrospective cohort study. BMJ. 2013;347:f7171. https://doi.org/10.1136/bmj.f7171.
16. Smith DM, Giobbie-Hurder A, Weinberger M, Oddone EZ, Henderson WG, Asch DA, et al. Predicting non-elective hospital readmissions: a multi-site study. Department of Veterans Affairs Cooperative Study Group on Primary Care and Readmissions. J Clin Epidemiol. 2000;53(11):1113–8.
17. Han J, Choi Y, Lee C, et al. Expression and regulation of inhibitor of DNA binding proteins ID1, ID2, ID3, and ID4 at the maternal-conceptus interface in pigs. Theriogenology. 2018;108:46–55. https://doi.org/10.1016/j.theriogenology.2017.11.029.
18. Jiang L, Wang D, Cai Z, Yan X. Survey of improving naive Bayes for classification. In: Alhajj R, Gao H, et al., editors. Lecture Notes in Computer Science. Springer; 2007. https://doi.org/10.1007/978-3-540-73871-8_14.
19. Jiang L, Zhang L, Yu L, Wang D. Class-specific attribute weighted naive Bayes. Pattern Recogn. 2019;88:321–30. https://doi.org/10.1016/j.patcog.2018.11.032.
20. Han Lu LW, Zhi S. An assertive reasoning method for emergency response management based on knowledge elements C4.5 decision tree. Expert Syst Appl. 2019;122:65–74. https://doi.org/10.1016/j.eswa.2018.12.042.
21. Skriver MVJKK, Sandbæk A, Støvring H. Relationship of HbA1c variability, absolute changes in HbA1c, and all-cause mortality in type 2 diabetes: a Danish population-based prospective observational study. BMJ Open Diabetes Res Care. 2015;3(1):e000060. https://doi.org/10.1136/bmjdrc-2014-000060.
22. ADA. Economic costs of diabetes in the U.S. in 2012. Diabetes Care. 2013.
23. Sun NJDL, Sun B, Wu MY-C. Lossless pruned naive Bayes for big data classifications. Big Data Res. 2018;14:27–36. https://doi.org/10.1016/j.bdr.2018.05.007.
24. Shiri Harzevili N, Alizadeh SH. Mixture of latent multinomial naive Bayes classifier. Appl Soft Comput. 2018;69:516–27. https://doi.org/10.1016/j.asoc.2018.04.020.
25. Nai-arun N, Moungmai R. Comparison of classifiers for the risk of diabetes prediction. Procedia Comput Sci. 2015;69:132–42. https://doi.org/10.1016/j.procs.2015.10.014.
26. Arar ÖF, Ayan K. A feature dependent naive Bayes approach and its application to the software defect prediction problem. Appl Soft Comput. 2017;59:197–209. https://doi.org/10.1016/j.asoc.2017.05.043.
27. Wyckoff OPCCB, Ciarkowski SL, Gianchandani. The relationship between diabetes mellitus and 30-day readmission rates. Clin Diabetes Endocrinol. 2017;3:8. https://doi.org/10.1186/s40842-016-0040-x.
28. Panigrahi R, Borah S. Rank allocation to J48 group of decision tree classifiers using binary and multiclass intrusion detection datasets. Procedia Comput Sci. 2018;132:323–32. https://doi.org/10.1016/j.procs.2018.05.186.
29. Dungan KM. The effect of diabetes on hospital readmissions. J Diabetes Sci Technol. 2012;6(5):1045–52.
30. Perveen S, Shahbaz M, et al. Performance analysis of data mining classification techniques to predict diabetes. Procedia Comput Sci. 2016;82:115–21. https://doi.org/10.1016/j.procs.2016.04.016.
31. Ye SYJSHLZ, Ruan P, Dong. The impact of the HbA1c level of type 2 diabetics on the structure of haemoglobin. Sci Rep. 2016;6:33352. https://doi.org/10.1038/srep33352.




32. Kripalani S, Theobald CN, Anctil B, Vasilevskis EE. Reducing hospital readmission rates: current strategies and future directions. Annu Rev Med. 2014;65:471–85. https://doi.org/10.1146/annurev-med-022613-090415.
33. Wong T-T. Parametric methods for comparing the performance of two classification algorithms evaluated by k-fold cross validation on multiple data sets. Pattern Recogn. 2017;65:97–107. https://doi.org/10.1016/j.patcog.2016.12.018.
34. Wang Xiaohu WL, Nianfeng L. An application of decision tree based on ID3. Phys Procedia. 2012;25:1017–21. https://doi.org/10.1016/j.phpro.2012.03.193.
35. Panch T, Szolovits P, Atun R. Artificial intelligence, machine learning and health systems. J Glob Health. 2018;8(2):020303. https://doi.org/10.7189/jogh.08.020303.
36. Bao W, Jiang Z, Huang D-S. Novel human microbe-disease association prediction using network consistency projection. BMC Bioinformatics. 2017;18(S116):173–259. https://doi.org/10.1186/s12859-017-1968-2.
37. Wu J. A generalized tree augmented naive Bayes link prediction model. J Comput Sci. 2018;27:206–17. https://doi.org/10.1016/j.jocs.2018.04.006.
38. Mu Y, Pan C, et al. Efficacy and safety of linagliptin/metformin single-pill combination as initial therapy in drug-naïve Asian patients with type 2 diabetes. Diabetes Res Clin Pract. 2017;124:48–56. https://doi.org/10.1016/j.diabres.2016.11.026.
39. Shen Z, Bao W, Huang D-S. Recurrent neural network for predicting transcription factor binding sites. Sci Rep. 2018;8:15270. https://doi.org/10.1038/s41598-018-33321-1.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
