Implementation of Machine Learning Algorithms To C
Abstract
Background: Machine learning is a branch of Artificial Intelligence concerned with the design and development of algorithms that give computers the ability to learn. Machine learning is growing steadily and becoming a critical approach in many domains such as health, education, and business.
Methods: In this paper, we applied machine learning to a diabetes dataset with the aim of recognizing patterns and combinations of factors that characterize or explain re-admission among diabetes patients. The classifiers used include Linear Discriminant Analysis, Random Forest, k-Nearest Neighbor, Naïve Bayes, J48, and Support Vector Machine.
Results: Of the 100,000 cases, 78,363 were diabetic and over 47% were readmitted. Based on the classes that the models produced, diabetic patients who are more likely to be readmitted are either women, or Caucasians, or outpatients, or those who undergo less rigorous lab and treatment procedures, or those who receive less medication and are thus discharged without proper improvement or administration of insulin despite having tested positive for HbA1c.
Conclusion: Diabetic patients who do not undergo rigorous lab assessments, diagnoses, and medications are more likely to be readmitted when discharged without improvements and without receiving insulin administration, especially if they are women, Caucasians, or both.
Keywords: Machine learning, Linear discriminant, Algorithms, Support vector machine, Diabetes re-admission, HbA1c
insignificant changes or improvements at the time of discharge, and high rates of re-admissions. Nonetheless, such a claim has not been proven, nor has the influence of these factors on re-admission among diabetes patients. As such, this study hypothesized that time spent in hospital, number of lab procedures, number of medications, and number of diagnoses have an association with re-admission rates and are proxies of in-hospital management practices that affect patient health outcomes. However, detection of the Hemoglobin A1c (HbA1c) marker, administration of insulin treatment, diabetes treatment instances, and noted changes are factors that can moderate the admission and are treated as partial management factors in the study. Some re-admissions are avoidable, although avoiding them requires evidence-based treatments. A retrospective cohort study [7] evaluated the basic diagnoses and 30-day re-admission patterns among Academic Tertiary Medical Center patients and established that some within-30-day re-admissions are avoidable. Specifically, the study established that 8.0% of the 22.3% of within-30-day re-admissions are potentially avoidable. As a subtext to this conclusion, the authors asserted that these re-admission cases were related, directly or indirectly, to the pre-conditions surrounding the primary diagnosis. For instance, research demonstrated that patients admitted for heart failure and other related diseases are more likely to be readmitted for acute heart failure. However, the re-occurrence of the heart condition depends on the treatment administered, the observed health outcome at discharge, and other pre-existing health conditions.

Research contribution
Under the circumstances, it is essential for healthcare stakeholders to pursue re-admission reduction strategies, especially with a specific focus on potentially avoidable re-admissions. The authors in [8] highlighted the role of financial penalties imposed on health institutions with higher re-admission rates in reducing re-admission incidences. Furthermore, the article concluded that extensive assessment of patient needs, reconciling medication, educating patients, planning timely outpatient appointments, and ensuring follow-up through calls and messages are among the best emerging practices for reducing re-admission rates. However, implementing these strategies requires significant funding, although the long-term impacts outweigh any financial demands. Hence, it suffices to deduce that re-admissions are a priority area for improving health facilities and reducing healthcare cost. Regardless of the far-reaching interest in hospital re-admissions, little research has explored re-admission among diabetes patients. A reduction of diabetic patient re-admission can reduce health cost while improving health outcomes at the same time. More importantly, some studies have identified socioeconomic status, ethnicity, disease burden, public coverage, and history of hospitalization as key re-admission risk factors. Besides these factors and principal admission conditions, re-admission can be a factor of health management practices. This study provides information on the managerial causes of re-admission using six machine learning models. Additionally, most studies employ regression data mining techniques, and as such this study provides a framework for implementing other machine learning techniques in exploring the causative agents of re-admission rates among diabetes patients. The primary importance of the algorithms is to help hospitals identify multiple strategies that work effectively against re-admission for a given health condition. Specifically, implementation of multiple strategies will focus on improved communication, medication safety, advancements in care planning, and enhanced training on the management of medical conditions that often lead to re-admissions. Each of these sub-domains involves decision making and, given the size and nature of healthcare information, data mining and deep learning techniques may prove critical in reducing re-admission rates.

Methodology
Figure 1 illustrates the high-level machine learning process diagram used in the paper. The study explored the probable predictors of diabetes hospital re-admission among the hospitals using machine learning techniques along with other exploratory methods. The dataset consists of 55 attributes, of which only 18 were used as per the scope of the study. The performance of the models is evaluated using the conventional confusion matrix and ROC efficiency analysis. The final re-admission model is based on the best performing model as per the true positive rates, sensitivity, and specificity.
Linear discriminant analysis
The LDA algorithm is a variant of Fisher's linear discriminant, and it classifies data in vector format based on a linear combination of attributes relative to a target factor or class variable. The algorithm has a close technical resemblance to Analysis of Variance (ANOVA) and regression, as it explains the influence of predictors using linear combinations [5]. There are two approaches to LDA. The techniques assume that the data conform to a Gaussian distribution, so that each attribute has a bell-shaped curve when visualized, and they also assume that each variable has the same variance, with the data points of each attribute varying around the average by the same amount. That is, the algorithm requires the data and its attributes to be normally distributed and of constant variance or standard deviation. As a result, the algorithm estimates the mean and the variance of the data for each of the classes that it creates using the conventional statistical techniques:

\mu = \frac{1}{n_k} \sum x   (1)

where \mu is the mean of each input attribute (x) for each class (k) and n is the total number of observations in the dataset. The variance associated with the classes is also computed using the following conventional method:

\sigma^2 = \frac{1}{n - k} \sum (x - \mu)^2   (2)

In Eq. 2, sigma squared is the variance across all instances serving as input to the model, k is the number of classes, and n is the number of observations or instances in the dataset; \mu is the mean computed using Eq. 1.
Besides the assumptions, the algorithm makes predictions using a probabilistic approach that can be summarized in two steps. Firstly, LDA classifies predictors and assigns them to a class based on the value of the posterior probability denoted as

\pi(y = i \mid x)   (3)

The objective is to minimize the total probability of misclassifying the features, and this approach relies on Bayes' rule and the Gaussian distribution assumption for class means, where the class-conditional density is denoted as

\pi(x \mid y = i)   (4)

Secondly, LDA finds a linear combination of the predictors that returns the optimum predictor value, and this study uses the latter approach. The LDA algorithm can be implemented in five basic steps. First, in performing LDA classification, the d-dimensional mean vectors are computed for the classes identified in the dataset using the mean approach (Eq. 1). The variance and normality assumptions must be checked before proceeding. Second, both the within- and between-class scatters are computed and returned as matrices. The within-class scatter, or distances, is computed based on Eq. 5:

S_{within} = \sum_{i=1}^{c} S_i   (5)

and

S_i = \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T   (6)

where S_i is the scatter for every class i identified in the dataset and \mu_i is the mean of the class computed using Eq. 1.
The between-class scatter is calculated using Eq. 7:

S_{between} = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T   (7)

In Eq. 7, \mu is the general mean value while \mu_i and N_i refer to the sample means and sizes of the identified classes respectively. The third step involves solving for the eigenvectors associated with the product of the within-class and between-class matrices. The fourth step involves sorting the linear discriminants to identify the new feature subspace; the selection and sorting use decreasing magnitudes of eigenvalues. The last step involves the transformation of the samples or observations onto the new linear discriminant sub-spaces. The pseudo-code for LDA is presented in Algorithm 1.

Algorithm 1 Linear Discriminant Analysis
1: D = {(x_i^T, y_i)}_{i=1}^{n}
2: D_i = {x_j | y_j = c_i, j = 1, ..., n}, i = 1, 2   // declaration of class-specific subsets
3: \mu_i = mean(D_i), i = 1, 2   // calculation of class means
4: B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T   // computation of between-class scatter
5: Z_i = D_i - 1_{n_i} \mu_i^T, i = 1, 2   // center the class matrices
6: S_i = Z_i^T Z_i, i = 1, 2   // the scatter of the respective classes
7: S = S_1 + S_2   // computation of within-class scatter distances
8: \lambda_1, w = eigen(S^{-1} B)   // computation of the dominant eigenvector

For the classes i, the algorithm divides the data into D_1 and D_2, then calculates the within- and between-class distances, and the best linear discriminant is a vector obtained from the product of the inverse of the within-class scatter matrix and the between-class scatter matrix.
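For concreteness, the LDA step can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: the file name diabetic_data.csv and the five predictor columns are hypothetical stand-ins for the study's 18 selected attributes.

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Assumed file and column names, modeled on the public 130-hospitals dataset
df = pd.read_csv("diabetic_data.csv")
X = df[["time_in_hospital", "num_lab_procedures", "num_procedures",
        "num_medications", "number_diagnoses"]]
y = df["readmitted"]  # "<30", ">30", or "NO"

# 70/30 split, matching the study's training/validation proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Discriminant coefficients (cf. Eq. 12) and the proportion of
# between-group variance explained by each discriminant
print(lda.scalings_)
print(lda.explained_variance_ratio_)
print(lda.score(X_test, y_test))
```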
Random forest
Random forest is a variant of the decision tree growing technique, and it differs from the other classifiers because it supports randomly grown branches within the selected subspace. The random forest model predicts the outcome based on a set of random base regression trees. The algorithm selects a node at each random base regression tree and splits it to grow the other branches. It is important to note that Random Forest is an ensemble algorithm because it combines different trees; ideally, ensemble algorithms combine one or more classifiers of different types. Random forest can be thought of as a bootstrapping approach for improving the results obtained from a decision tree. The algorithm works in the following order. First, it selects a bootstrap sample S(i) from the sample space, where the argument denoting the bootstrap sample refers to the ith bootstrap. The algorithm then learns a conventional decision tree, although through the implementation of a modified decision tree algorithm. The modification is specific and is systematically implemented as the tree grows. That is, at each node of the decision tree, instead of iterating over all possible feature splits, RF randomly selects a subset of features such that f ⊆ F and then splits on the features in the subset f. The splitting is based on the best feature in the subset, and during implementation the algorithm chooses a subset that is much smaller than the set of all features. The small size of the subset reduces the burden of deciding on the number of features to split, since large subsets tend to increase the computational complexity. Hence, the narrowing of the attributes to be learned improves the learning speed of the algorithm.

Algorithm 2 Random Forest (Decision Tree Ensemble)
Prerequisite: Specify the training set S := (x_1, y_1), ..., (x_n, y_n), the set of all features F, and the number of trees to be included in the forest B
1: function RandomForest(S, F)
2:   H ← ∅
3:   for i ∈ 1, ..., B do
4:     S(i) ← bootstrap sample drawn from S
5:     h_i ← RandomizedTreeLearn(S(i), F)
6:     H ← H ∪ {h_i}
7:   end for
8:   return H
9: end function
10: function RandomizedTreeLearn(S, F)
11:   At each node:
12:     f ← draw a small subset of F
13:     split on the best feature in f
14:   return the learned tree (model)
15: end function

The algorithm uses bagging to implement the ensemble of decision trees, and it is prudent to note that bagging reduces the variance of the decision tree algorithm.
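A minimal sketch of this bagged-tree ensemble, reusing the X_train/y_train split assumed in the earlier LDA sketch (hyperparameters are illustrative, not the authors' settings):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # B, the number of trees in the forest
    max_features="sqrt",   # size of the random feature subset f drawn at each node
    bootstrap=True,        # each tree learns from a bootstrap sample S(i)
    random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```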
Support vector machine
Support Vector Machine is a group of supervised learning techniques that classify data based on regression analysis. One of the variables in the training sample should be categorical so that the learning process assigns a new categorical value as part of the predictive outcome. As such, SVM is a non-probabilistic binary classifier leveraging linear properties. Besides classification and regression, SVM detects outliers and is versatile when applied to high-dimensional data [1]. Ideally, a training vector variable that has at least two categories is defined as follows:

x_i \in \mathbb{R}^p, i = 1, ..., n   (8)

where x_i represents the training observations and \mathbb{R}^p indicates the real-valued p-dimensional feature space and predictor vector space. A pseudo-code for a simple SVM algorithm is illustrated in Algorithm 3.

Algorithm 3 Support Vector Machine
AttributeSupportVector (ASV) = {closest attribute pair from opposite classes}
1: while margin-constraint-violating points exist do
2:   find the violator
3:   ASV ← ASV ∪ {violator}
4:   if any \alpha_p < 0 because of the addition of c to S then
5:     ASV ← ASV \ {p}
6:     repeat until all the violating points are pruned
7:   end if
8: end while

The algorithm searches for candidate support vectors, denoted as S, and it assumes that the support vectors occupy a space where the parameters of the linear features of the hyperplane are stored.
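Since the results section later reports a degree-3 polynomial kernel for this classifier, a hedged sketch, again reusing the earlier train/validation split, might look as follows; the remaining hyperparameters are scikit-learn defaults rather than the authors' settings.

```python
from sklearn.svm import SVC

# Degree-3 polynomial kernel, matching the relationship reported later
svm = SVC(kernel="poly", degree=3)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```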
k-nearest neighbor
kNN classifies data using the same distance measurement techniques as LDA and other regression-based algorithms. In classification applications the algorithm produces class members, while in regression applications it returns the value of a feature or a predictor [9]. The technique can identify the most significant predictors and as such was given preference in the analysis. Nonetheless, the algorithm requires high memory and is sensitive to non-contributing features, despite being considered insensitive to outliers and versatile among many other qualifying features. The algorithm creates classes or clusters based on the mean distance between data points. The mean distance is calculated using the following equation:

\hat{f}(x) = \frac{1}{k} \sum_{(x_i, y_i) \in kNN(x, L, K)} y_i   (9)

In Eq. 9, kNN(x, L, K) denotes the K nearest neighbors of the input attribute (x) in the learning set space (L). The classification and prediction application of the algorithm depends on the dominant k class, and the predictive equation is the following:

\hat{f}(x) = \arg\max_{c \in y} \sum_{(x_i, y_i) \in N(x, L, K)} I(y_i = c)   (10)

It is imperative to note that the output class consists of members from the target attribute, and the distance used in assigning the attributes to classes is the Euclidean distance. The implementation of the algorithm consists of six steps. The first step involves the computation of the Euclidean distances. In the second step, the computed n distances are arranged in non-decreasing order, and in the third step a positive integer k is drawn from the sorted Euclidean distances. In the fourth step, the k points corresponding to the k distances are established and assigned based on proximity to the center of the class. Finally, for k > 0 and for the number of points in class i, an attribute x is assigned to class i if k_i > k_j for all i ≠ j. Algorithm 4 shows the kNN steps:

Algorithm 4 k-Nearest Neighbor
Preconditions: Specify training data (X), class labels (Y), and an unknown sample (x)
1: Classify(X, Y, x)
2: for i = 1 to m do
3:   compute distance d(X_i, x)
4: end for
5: compute the set I containing the indices of the k minimum distances d(X_i, x)
6: return the majority label among {Y_i : i ∈ I}
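A minimal Euclidean-distance kNN sketch under the same assumptions as the earlier blocks; the choice k = 5 is illustrative, as the paper does not state the k it used.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```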
Naïve Bayes
Even though Naïve Bayes is one of the supervised learning techniques, it is probabilistic in nature, so that the classification is based on Bayes' rules of probability, especially those of association. Conditional probability is the construct of the Naïve Bayes classifier [9-16]. The algorithm assigns instance probabilities to the predictors parsed in a vector format representing each probable outcome. The Naïve Bayes classifier returns the posterior probability obtained by dividing the product of the prior and the likelihood by the evidence. The construction of the model from the output of the analysis is quite complex, although the probabilistic computation from the generated classes is straightforward [17-22]. The Bayes Theorem upon which the Naïve Bayes classifier is based can be written as follows:

P(\mu \mid \nu) = \frac{P(\nu \mid \mu) P(\mu)}{P(\nu)}   (11)

where \mu and \nu are events or instances in an experiment and P(\mu) and P(\nu) are the probabilities of their occurrence. The conditional probability of an event \mu occurring after \nu is the basis of the Naïve Bayes classifier. The classifier uses the maximum likelihood hypothesis to assign data points to classes. The algorithm assumes that each feature is independent and makes an equal contribution to the outcome, or that all features belonging to the same class have the same influence on that class. In Eq. 11, the algorithm computes the probability of event \mu provided that \nu has already occurred; as such, \nu is the evidence, and the probability P(\mu) is regarded as the prior probability, that is, the probability obtained before seeing the evidence, while the conditional probability P(\mu \mid \nu) is the posterior probability, since it is computed with the evidence.
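A minimal sketch of this step; GaussianNB is an assumption here for the numeric predictors, since the paper only mentions a Laplace approach later (Laplace smoothing corresponds to the alpha parameter of the categorical and multinomial variants, e.g. CategoricalNB(alpha=1.0)).

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
# Posterior probabilities P(class | features), per Eq. 11
print(nb.predict_proba(X_test[:5]))
print(nb.score(X_test, y_test))
```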
J48
J48 is one of the decision tree growing algorithms. However, J48 is the reincarnation of the C4.5 algorithm, which is an extension of the ID3 algorithm [23]. As such, J48 is a hierarchical tree learning technique, and it has several mandatory parameters, including the confidence value and the minimum learning instance, which are translated into branches and nodes in the final decision tree [23-29].
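J48 itself ships with Weka; as a rough scikit-learn stand-in (CART rather than C4.5, so the split criterion and pruning differ), a pruned decision tree can be sketched as below. The parameter values are illustrative only.

```python
from sklearn.tree import DecisionTreeClassifier

# min_samples_leaf loosely plays the role of J48's minimum-instances
# parameter; ccp_alpha substitutes for confidence-based pruning
j48_like = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2,
                                  ccp_alpha=0.001, random_state=42)
j48_like.fit(X_train, y_train)
print(j48_like.score(X_test, y_test))
```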
Data assembly and pre-processing
The study used diabetes data that was collected across 130 hospitals in the US between 1999 and 2008 [30]. The dataset includes data systematically composed from contributing electronic health record providers and contains encounter data such as inpatient, outpatient and emergency records, demographics, provider specialty, diagnosis, in-hospital procedures, in-hospital mortality, and laboratory and pharmacy data. The complete list of the features and descriptions is provided in Table S1 (Additional file 1). The data has 55 attributes, about 100,000 observations, and missing values. However, the study used a sample based on the treatment of diabetes. Specifically, of the 100,000 cases, 78,363 meet the inclusion criteria since they received medication for diabetes. Consequently, the study explored re-admission incidences among patients who had received treatment. The amount of missing information and the type of the data (categorical or numeric) guided the data cleaning process; re-admission, insulin prescription, HbA1c test results, and observed changes were retained as the major outcomes associated with time spent in the hospital, the number of diagnoses, lab procedures, procedures, and medications [31, 32]. Of the 55 variables, only 18 were selected as per the scope of the analysis, and about 8 of those selected served as proxy controls. The data was split into 70% training and 30% validation subsets.

K-fold validation
To improve the overall accuracy and validate a model, we relied on the 10-fold cross-validation method for estimating accuracy. The training dataset is split into k subsets, and one subset is held out while the model is trained on the remaining subsets. Figure 2 illustrates the validation method. The K-fold cross-validation method utilizes the defined training feature set and randomly splits it into k equal subsets. The model is trained k times; during each iteration, one subset is excluded for use as validation. This technique reduces over-fitting, which occurs when a model fits the training data too closely and as a result fails to predict future information reliably [2, 12, 33].
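A minimal sketch of this validation step; the stratification by class is an assumption, as the paper only specifies the number of folds.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(nb, X_train, y_train, cv=cv)  # nb from the sketch above
print(scores.mean(), scores.std())
```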
Discussion
Exploratory analysis
Of the 47.7% of diabetic patients who were readmitted, 11.6% stayed in the hospital for less than 30 days while 36.1% stayed for more than 30 days. A majority (52.3%) of those who stayed for more than 30 days did not receive any medical procedures during the first visit. In general, diabetic patients who received a smaller number of lab procedures, treatment procedures, medications, and diagnoses are more likely to be readmitted than their counterparts. Furthermore, the more frequently a patient is admitted as an in-patient, the lower the probability of re-admission. Our study indicated that women (53.3%) and Caucasian (74.6%) diabetic patients are more vulnerable to re-admission than males and the other races. Besides the numbers of lab procedures, medications, and diagnoses, insulin administration and HbA1c results exacerbate the re-admission rates among diabetic patients.

Scatterplots
The scatterplots of re-admission incidences with an overlay of HbA1c measurements and change recorded at the time of discharge are shown in Figs. 3 and 4.
Figure 3 illustrates the scatterplot of the number of diagnoses and lab procedures that patients received against re-admission rates. The figure has 8 panels displaying scatters of diagnoses and lab procedures for different instances of HbA1c results and change. The plot shows that patients who had negative HbA1c test results received several diagnoses, and very few were readmitted. Those who received less than 10 diagnoses and less than 70 procedures were more likely to be readmitted. None of the patients received more diagnoses than that, and a majority were admitted for more than 30 days.
Figure 4 depicts a scatterplot of the number of diagnoses and lab procedures. The re-admission rates are quite different between the group of patients who noted change at discharge and those who did not. Those who failed to note significant improvement at discharge received more than 50 medications and less than 10 diagnoses. However, re-admission is higher among those who noted improvement at discharge.

Density distributions
The distribution of re-admission and the subsequent patterns associated with reported change and results of HbA1c are shown in Figs. 5 and 6.
Figures 5 and 6 illustrate the density distributions of the number of medications, lab procedures, and diagnoses grouped by re-admission, HbA1c results, insulin administration, and change at discharge. Notably, the density distributions of the number of lab procedures, medications, and diagnoses are the same across grouping categories. Figure 6 shows significant differences in the number of medications and lab procedures. For instance, the average number of medications differs between the 'No', 'Up', 'Steady', and 'Down' insulin categories. A similar difference in the mean number of medications is observed in the change distribution curve, with those recording change at discharge receiving more medications than their counterparts.
Smooth linear fits
Figures 7 and 8 illustrate the smooth line fits associated with the scatterplots. The smoothed fits include a 95% confidence interval and demonstrate the likely performance of linear regression models in forecasting re-admission.
Figures 7 and 8 depict the smooth linear fits of the scatterplots and density plots in Figs. 3, 4, 5, and 6. The figures illustrate that the number of lab procedures has a linear relationship with the number of diagnoses, although the data is likely to be heteroskedastic. The number of diagnoses and medications also shows the same relationship and plot patterns. For medications versus procedures, the relationship is linear, and change in diabetes status increases with medications and lab procedures. As for re-admission, incidents of more-than-30-day re-admission reduced with an increasing number of diagnoses, lab procedures, and medications. Similarly, the probability of detecting HbA1c increases with an increasing number of diagnoses and lab procedures.

Model evaluation
The performance of the models in predicting re-admission incidence was based on the confusion matrix, specifically the percentage of correctly predicted re-admission categories. Table 1 depicts that Naïve Bayes correctly classified the most re-admission incidences of less than 30 days and the most no-re-admission incidences. SVM accurately classified 48.3% of the re-admission incidences exceeding 30 days. The objective is to obtain the best performing model.

Table 1 True positive rate comparison
Model           <30 days   >30 days   No re-admission
Random Forest   21.0%      42.8%      60.5%
kNN             17.8%      40.3%      59.6%
Naïve Bayes     23.6%      46.6%      61.2%
SVM             12.2%      48.3%      55.9%
J48             17.3%      40.4%      60.3%
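A minimal sketch of the per-class true-positive-rate comparison behind Table 1; the fitted estimators (rf, knn, nb, svm, j48_like) are the assumed objects from the earlier sketches, not the authors' models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

models = [("Random Forest", rf), ("kNN", knn), ("Naïve Bayes", nb),
          ("SVM", svm), ("J48", j48_like)]
for name, model in models:
    cm = confusion_matrix(y_test, model.predict(X_test))
    tpr = cm.diagonal() / cm.sum(axis=1)  # per-class recall (true positive rate)
    print(name, np.round(tpr, 3))
```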
Individual model performance
The LDA model yields two linear discriminants, LD1 and LD2, with proportions of trace of 0.9646 and 0.0354 respectively. Hence, the first LD explains 96.46% of the between-group variance while the second accounts for 3.54% of the between-group variance.

LD1 = 0.003 × LabProcedures - 0.102 × Procedures + 0.08 × Medications + 0.18 × Emergency + 0.67 × Inpatient + 0.17 × Diagnoses   (12)

Figure 9 illustrates the plot of LD1 versus LD2. Equation 12 depicts the profile of diabetic patients. The predictors were significantly correlated at the 5% level, and they influenced re-admission based on the frequency of each.
The kNN model used all 16 predictors to learn the data and selected three as significant predictors. Specifically, the kNN model proposes that high re-admission for diabetes treatment is caused by a smaller number of lab procedures, diagnoses, and medications. However, the rates are higher among patients who tested positive for HbA1c and did not fail to receive insulin treatment (Fig. 3).
SVM classified the readmitted diabetic patients into three classes using a polynomial of degree 3, suggesting that diabetes re-admission cases do not have a linear relationship with the predictors. As an inference, the polynomial relationship illustrated by the kernel and degree of the SVM indicates higher re-admission rates among patients discharged without any significant changes (Fig. 4).
The Naïve Bayes classifier yields two classes using the Laplace approach. The classification from the model depicts a reduced likelihood of re-admission in cases where the patients undergo a series of laboratory tests, rigorous diagnosis, proper medication, and discharge after confirmation of improvement. The density distributions in Figs. 5 and 6 complement the findings of the model. Specifically, the distributions of the number of medications and lab procedures show a noticeable difference when considering insulin administration as part of treatment. Regarding the aggregation of the distributions of the number of medications and lab procedures by status at discharge (change), the distribution curves suggest that patients are more likely to feel better at the time of discharge provided that the lab services and medications are of superior quality. It is important to reiterate that the Naïve Bayes model's true positive and false negative rates show that it had 13.78% accuracy and 13.78% sensitivity.
Finally, random forest classified diabetic patients using linear approaches with re-admission as the control. Figures 7 and 8 demonstrate through the smoothed linear fits of the paired predictors that re-admissions taking more than 30 days are reduced by an increasing number of medical diagnoses. Further, the HbA1c results increase with an increasing number of diagnoses. However, it is important to note that the association between the number of lab procedures and medications tends to be non-linear, while that between the number of diagnoses and medications is linear regardless of the grouping variable. The J48-based tree shown in Fig. 9 does not consider the linear relationships and omits diabetic patients who were never re-admitted. The resultant tree included the number of inpatient treatment days, number of emergencies, number of medications, lab procedures, and diagnoses in the model. The model suggests that diabetic patients admitted as in-patients tend not to be re-admitted.
Similarly, the tree demonstrates that several diagnoses improve health outcomes and reduce re-admission.

Best fit model
The best fitting model is selected based on the performance measures summarized in Table 2. The key decision relies on the efficiency of the model in predicting the re-admission rates, and the area under the curve (AUC) and the precision/recall curve are the best measures for such a task.

Table 2 Comparison of model efficiency and sensitivity
Model           AUC     CA      F1      Precision   Recall
kNN             0.575   0.499   0.489   0.482       0.499
J48             0.578   0.490   0.487   0.485       0.490
SVM             0.547   0.475   0.421   0.483       0.475
Random Forest   0.602   0.529   0.509   0.499       0.529
Naïve Bayes     0.640   0.566   0.524   0.519       0.566

Table 2 illustrates that Naïve Bayes is the most sensitive and efficient model for learning, classifying, and predicting re-admission rates using mHealth data; it has an efficiency of 64% and a sensitivity of 52.4%. The ROC curves associated with the predictions of re-admission that exceeded 30 days are displayed in Fig. 10 (ROC curves illustrating the areas under the curve for the models). The larger the area covered, the more efficient the model is, and by this principle Fig. 10 depicts that Naïve Bayes is the most efficient.
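A minimal sketch of the multi-class AUC comparison behind Table 2 and Fig. 10; the one-vs-rest macro averaging over the three re-admission classes is an assumption, as the paper does not state its aggregation scheme, and only models exposing predict_proba are compared here.

```python
from sklearn.metrics import roc_auc_score

for name, model in [("Random Forest", rf), ("Naïve Bayes", nb)]:
    proba = model.predict_proba(X_test)  # class probabilities per sample
    print(name, roc_auc_score(y_test, proba, multi_class="ovr"))
```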
Naïve Bayes analysis
The model focused on the top 5 factors (exposures) that contributed to re-admission for less than and for more than 30 days. The associations between the exposures and the outcome (re-admission instances) are given as log odds ratios in the nomograms illustrated in Fig. 11. The three model classes are Class 0 (no re-admission), Class 1 (re-admission for less than 30 days), and Class 2 (re-admission for more than 30 days).
Figure 11 depicts the exposure factors with absolute importance on Class 0, including the number of emergencies, number of inpatient admissions, discharge disposition ID, admission source ID, and number of diagnoses. The log odds ratios illustrate the association between these exposure factors. The conditional probability for re-admission after discharge based on these exposure factors is 0.5.
Figure 12 depicts the exposure factors with absolute importance on Class 1, including the number of emergencies, the number of inpatient admissions, discharge disposition ID, time in hospital, and number of diagnoses. The log odds ratios display the association between these exposure factors and lack of re-admission after discharge. The conditional probability for re-admission after discharge based on these exposure factors is 14%. Specifically, there is a 48% chance of re-admission for patients with a number of diagnoses between 8.5 and 9.5, and a 52% chance for those with diagnoses between 5.5 and 8.5. Similarly, those spending between 2.5 and 3.5 days in the hospital are more likely to be readmitted (59%) for less than 30 days than their counterparts, who have a 41% chance of re-admission. Finally, those with less emergency admission history stand a higher chance of re-admission (80%) than those with substantial emergency admission history.
Figure 13 depicts the exposure factors with absolute importance on Class 2, including the number of emergencies, the number of inpatient admissions, discharge disposition ID, admission source ID, and number of diagnoses. The log odds ratios illustrate the association between these exposure factors and lack of re-admission after discharge. The conditional probability for re-admission after discharge based on these exposure factors is 0.42. The number of emergency admissions increases re-admission chances by 80% for those with the least history. Further, those with a higher inpatient admission history have a 65% chance of re-admission for more than 30 days. Most importantly, patients who undergo more than 9.5 diagnostic tests have a 70% chance of re-admission for more than 30 days after discharge.
Conclusion
The size of the health data and the amount of information contained exemplify the importance of machine learning in the health sector. Developing profiles for the patients can help in understanding the factors that reduce the burden of the disease while at the same time improving outcomes. Diabetes is a major problem given that over 78% of the patients admitted across the 130 hospitals were treated for the condition. Of the total number of diabetic patients who participated in the study, over 47% were readmitted, with over 36% staying in the hospital for over 30 days. This study has also established that women and Caucasians are more vulnerable to hospital re-admissions [5, 33-39]. Each of the machine learning models has established different combinations of features influencing the admission rates. For instance, LDA proposes a linear combination, while the SVM suggests a third-degree polynomial association between re-admission and its predictors. Further, J48 models the relationship as non-linear, with emphasis on the importance of emergency admission and in-patient treatment on re-admission rates. The kNN models lead to the conclusion that a smaller number of lab procedures, diagnoses, and medications leads to increased re-admission rates. Diabetic patients who do not undergo rigorous lab assessments, diagnoses, and medications are more likely to be readmitted when discharged without improvements and without receiving insulin administration, especially if they are women, Caucasians, or both.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.