ABSTRACT Chronic kidney disease (CKD) describes a long-term decline in kidney function and has many
causes. It affects hundreds of millions of people worldwide every year. It can have a strong negative impact on
patients, especially when combined with cardiovascular disease (CVD): patients with both conditions have
lower survival chances. In this context, computational intelligence applied to electronic health records can
provide insights to physicians that can help them make better decisions about prognoses or therapies. In this
study we applied machine learning to medical records of patients with CKD and CVD. First, we predicted
whether patients develop severe CKD, both excluding and including information about the year it occurred or the date of the last visit. Our methods achieved a top mean Matthews correlation coefficient (MCC) of +0.499 in the former case and a mean MCC of +0.469 in the latter case. Then, we performed a feature ranking analysis
to understand which clinical factors are most important: age, eGFR, and creatinine when the temporal
component is absent; hypertension, smoking, and diabetes when the year is present. We then compared our
results with the current scientific literature, and discussed the different results obtained when the time feature
is excluded or included. Our results show that our computational intelligence approach can provide insights about diagnosis and the relative importance of different clinical variables that would otherwise be impossible to observe.
INDEX TERMS Machine learning, computational intelligence, feature ranking, electronic health records,
chronic kidney disease, CKD, cardiovascular diseases, CVD.
TABLE 1. Meaning, measurement unit, and possible values of each feature of the dataset. ACEI: Angiotensin-converting enzyme inhibitors. ARB:
Angiotensin II receptor blockers. mmHg: millimetre of mercury. kg: kilogram. mmol: millimoles.
II. DATASET
In this study, we examine a dataset of electronic medical records of 491 patients collected at the Tawam Hospital in Al-Ain city (Abu Dhabi, United Arab Emirates), between 1st January and 31st December 2008 [28]. The patients included 241 women and 250 men, with an average age of 53.2 years (Table 2 and Table 3).

TABLE 2. Binary features quantitative characteristics. All the binary features have meaning true for the value 1 and false for the value 0, except sex (0 = female and 1 = male). The dataset contains medical records of 491 patients.
Each patient has a chart of 13 clinical variables, expressing
her/his values of laboratory tests and exams or data about
her/his medical history (Table 1). Each patient included
in this study had cardiovascular disease or was at risk of
cardiovascular disease, according to the standards of Tawam
Hospital [28].
Several features regard the personal history of the patient: diabetes history, dyslipidemia history, hypertension history, obesity history, smoking history, and vascular disease history (Table 2); they state whether the patient's medical history includes those specific diseases or conditions. Dyslipidemia
indicates excessive presence of lipids in the blood. Two
variables refer to the blood pressure (diastolic blood pressure
and systolic blood pressure), and other variables refer
to blood levels obtained through laboratory tests (cholesterol, creatinine). A few features state whether the patients have taken disease-specific medications (dyslipidemia medications, diabetes medications, and hypertension medications)
or inhibitors (angiotensin-converting-enzyme inhibitors,
or angiotensin II receptor blockers) which are known to
be effective against cardiovascular diseases [29] and hyper-
tension [30]. The remaining factors describe the physical
conditions of each patient: age, body–mass index, and biological sex (Table 2).

Among the clinical features available for this dataset, the EventCKD35 binary variable states if the patient had chronic kidney disease at a high stage (3rd, 4th, or 5th stage). According to the Kidney Disease Improving Global Outcomes (KDIGO) organization [31], CKD can be grouped into 5 stages:
• Stage 1: normal kidney function, no CKD;
• Stage 2: mildly decreased kidney function, mild CKD;
• Stage 3: moderate decrease of kidney function, moderate CKD;
• Stage 4: severe decrease of kidney function, severe CKD;
• Stage 5: extreme CKD and kidney failure.
When the EventCKD35 variable has value 0, the patient's kidney condition is at stage 1 or 2. Instead, when EventCKD35 equals 1, the patient's kidney is at stage 3, 4, or 5 (Table 1).

TABLE 3. Numeric feature quantitative characteristics. σ: standard deviation.

Even if the value of eGFR has a role in the definition of the CKD stages in the KDIGO guidelines [31], we found weak correlation between the eGFRBaseline variable and the target variable EventCKD35 in this dataset. The two variables have a Pearson correlation coefficient equal to −0.36 and a Kendall distance of −0.3, both in the [−1, +1] interval, where −1 indicates perfectly opposite correlation, 0 indicates no correlation, and +1 indicates perfect correlation.

The time year derived factor indicates in which year the patient had a serious chronic kidney disease, or the year when he/she had his/her last outpatient visit, whichever occurred first in the follow-up period (Supplementary information).

All the dataset features refer to the first visits of the patients in January 2008, except the EventCKD35 and time year variables, which refer to the end of the follow-up period, in June 2017.

More information about this dataset can be found in the original article [28].

III. METHODS
The problem described earlier (section I) can be addressed as a conventional binary classification task, where the goal is to predict EventCKD35, using the data described earlier (section II). This target feature indicates if the patient has chronic kidney disease in stages 3 to 5, which represent an advanced phase of the disease.

In binary classification, the problem is to identify the unknown relation R between the input space X (in our case: the features described in Section II) and an output space Y ⊆ {0, 1} (in our case: the EventCKD35 target) [32]. Once a relation is established, one can find a way to discover what the most influential factors in the input space are for predicting the associated element in the output space, namely to determine the feature importance [33].

Note that X can be composed of categorical features (the values of the features belong to a finite unsorted set) and numerical-valued features (the values of the features belong to a possibly infinite sorted set). In the case of categorical features, one-hot encoding [34] can map them into a series of numerical features. The resulting feature space is X ⊆ R^d.

A set of data Dn = {(x1, y1), . . . , (xn, yn)}, with xi ∈ X and yi ∈ Y, is available in a binary classification framework. Moreover, some values of xi might be missing [35]. In this case, if the missing value is categorical, we introduce an additional category for missing values for the specific feature. Instead, if the missing value is associated with a numerical feature, we replace the missing value with the mean value of the specific feature, and we introduce an additional logical feature to indicate if the value of the feature is missing for a particular sample [35].
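The following base R sketch illustrates this missing-value strategy on a hypothetical data frame d; the column names num_col and cat_col are ours, for illustration only, and do not come from the original code:

  # Sketch of the missing-value handling described above: mean
  # imputation plus a logical missingness flag for a numerical
  # feature, and an explicit "missing" category for a categorical one.
  impute_with_flag <- function(d, num_col, cat_col) {
    d[[paste0(num_col, "_missing")]] <- is.na(d[[num_col]])
    d[[num_col]][is.na(d[[num_col]])] <- mean(d[[num_col]], na.rm = TRUE)
    d[[cat_col]] <- as.character(d[[cat_col]])
    d[[cat_col]][is.na(d[[cat_col]])] <- "missing"
    d[[cat_col]] <- factor(d[[cat_col]])
    d
  }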
Our goal is to identify a model M : X → Y which best approximates R, through an algorithm A_H characterized by its set of hyper-parameters H. The accuracy of the model M in representing the unknown relation R is measured using different indices of performance (Supplementary information).

Since the hyper-parameters H influence the ability of A_H to estimate R, we need to adopt a proper Model Selection (MS) procedure [36]. In this work, we exploited the Complete Cross Validation (CCV) procedure [36]. CCV relies on a simple idea: we resample the original dataset Dn many (nr = 500) times without replacement, building a training set L^r_l of size l, while the remaining samples are kept in the validation set V^r_v, with r ∈ {1, · · · , nr}. To perform the MS phase, namely to select the best combination of the hyper-parameters H in the set of possible ones H = {H1, H2, · · · } for the algorithm A_H, we select the hyper-parameters which optimize the average performance of the model trained on the training set and evaluated on the validation set. Since the data in L^r_l are independent from the ones in V^r_v, the idea is that the optimal configuration H* should be the set of hyper-parameters which achieves a small error on a data set that is independent from the training set.

Finally, we need to estimate the error (EE) of the optimal model with a separate set of data Tm = {(x^t_1, y^t_1), · · · , (x^t_m, y^t_m)}, since the error that our model commits over Dn would be optimistically biased, given that Dn has been used to find M.
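As an illustration, a compact base R sketch of this resampling scheme follows; grid is a list of candidate hyper-parameter configurations, while fit_fun and error_fun are placeholders for the learning algorithm A_H and the chosen error index (all names are ours, not taken from the original implementation):

  # Model selection step: nr random training/validation splits without
  # replacement; the configuration with the lowest average validation
  # error is returned as H*.
  select_hyperparameters <- function(X, y, grid, fit_fun, error_fun,
                                     nr = 500, train_fraction = 0.8) {
    n <- nrow(X)
    l <- round(train_fraction * n)
    avg_error <- sapply(grid, function(h) {
      mean(sapply(seq_len(nr), function(r) {
        idx <- sample(n, l)                              # training set L_l^r
        model <- fit_fun(X[idx, , drop = FALSE], y[idx], h)
        error_fun(model, X[-idx, , drop = FALSE], y[-idx])  # error on V_v^r
      }))
    })
    grid[[which.min(avg_error)]]
  }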
Additionally, another aspect to consider in this analysis is that data available in health informatics are often unbalanced [37]–[39], and most learning algorithms do not work well with imbalanced datasets, tending to perform poorly on the minority class. For these reasons, several techniques have been developed to address this issue [40]. Currently, the most practical and effective approach involves resampling the data in order to synthesize a balanced dataset [40]. For this purpose, we can under-sample or over-sample the dataset. Under-sampling balances the dataset by reducing the size of the abundant class: by keeping all the samples in the rare class and randomly selecting an equal number of samples in the abundant class, a new balanced dataset can be retrieved for further modeling. Note that this method wastes a lot of information (many samples might be discarded). For this reason, scientists take advantage of the over-sampling strategy more often. Over-sampling tries to balance the dataset by increasing the number of rare samples. Rather than removing abundant samples, new rare samples are generated (for example, by repetition, by bootstrapping, or by synthetic minority oversampling). The latter method is the one that we employed in this study: synthetic minority oversampling [41], [42].
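The article does not state which R implementation of SMOTE [41] was used; a minimal sketch with the SMOTE function of the smotefamily package, on hypothetical objects features (a numeric data frame of predictors) and labels (the binary target), could look as follows:

  library(smotefamily)  # one of several R implementations of SMOTE

  # Generate synthetic minority samples (K nearest neighbours are used
  # to interpolate new rare-class points); out$data holds the original
  # plus the synthetic rows, with the labels in its last column "class".
  out <- SMOTE(features, labels, K = 5)
  balanced <- out$data
  balanced$class <- factor(balanced$class)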
Another important property of M is its interpretability, namely the possibility to understand how it behaves. There are two options to investigate this property. The first one is to learn an M whose functional form is, by construction, interpretable [43] (for example, Decision Trees and rule-based models); this solution, however, usually results in poor generalization performance. The second one, used when the functional form of M is not interpretable by construction [43] (for example, kernel methods or neural networks), is to derive its interpretability a posteriori. A classical method for reaching this goal is to perform a feature ranking procedure [33], [44], which gives a hint to the users of M about the most important features influencing its results.

A. BINARY CLASSIFICATION ALGORITHMS
In this paper, for the algorithm A_H, we exploit different state-of-the-art models. In particular, we exploit Random Forests [45], Support Vector Machines (linear and kernelized with the Gaussian kernel) [46], [47], Neural Networks [48], Decision Trees [49], XGBoost [50], and One Rule [51].

We tried a number of different hyper-parameter configurations for the machine learning methods employed in this study.
For Random Forests, we set the number of trees to 1000 and we searched the number of variables randomly sampled as candidates at each split in {1, 2, 4, 8, 16}, the minimum size of samples in the terminal nodes of the trees in {1, 2, 4, 8}, and the percentage of samples (sampled with bootstrap) used during the creation of each tree in {60, 80, 100, 120} [52]–[55]. For the linear and kernelized Support Vector Machines [46], we searched the regularization hyper-parameters in {10^−6.0, 10^−5.8, · · · , 10^4} and, for the kernelized Support Vector Machines, we used the Gaussian kernel [47] and we searched the kernel hyper-parameters in {10^−6.0, 10^−5.8, · · · , 10^4}. For the Neural Network, we used a single hidden layer network (hyperbolic tangent as activation function in the hidden layer) with dropout (mlpKerasDropout in the caret [56] R package), we trained it with adaptive subgradient methods (batch size equal to 32), and we tuned the following hyper-parameters: the number of neurons in the hidden layer in {10, 20, 40, 80, 160, 320, 640, 1280}, the dropout rate of the hidden layer in {0.001, 0.002, 0.004, 0.008}, the learning rate in {0.001, 0.002, 0.005, 0.01, 0.02, 0.05}, the fraction of gradient to keep at each step in {0.01, 0.05, 0.1, 0.5}, and the learning rate decay in {0.01, 0.05, 0.1, 0.5}. For the Decision Tree, we searched the max depth of the trees in {4, 8, 16, 24, 32} (rpart2 in the caret [56] R package). For XGBoost, we set tree gradient boosting and we searched the booster parameters in {0.001, 0.002, 0.004, 0.008, 0.01, 0.02, 0.04, 0.08}, the number of trees in {100, 500, 1000}, the minimum loss reduction to make a split in {0, 0.001, 0.005, 0.01}, the fractions of samples in {1, 0.9, 0.7} and of features in {1, 0.5, 0.2, 0.1} used to train the trees, the maximum number of leaves in {1, 2, 4, 8, 16}, and the regularization hyper-parameters in {10^−6.0, 10^−5.8, · · · , 10^4} [50]. For One Rule, we did not have to tune hyper-parameters (OneR in the caret [56] R package).
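As a concrete example of one of these grid searches, a caret [56] call for the Random Forests tuning could look like the following sketch; trainData and trainLabels are placeholder objects, and the plain cross-validation resampling shown here is a simplification of the CCV procedure described above:

  library(caret)

  # Grid search over mtry, the number of variables randomly sampled
  # as candidates at each split; ntree is fixed to 1000 as in the text.
  rf_fit <- train(
    x = trainData, y = trainLabels,
    method = "rf",
    ntree = 1000,
    tuneGrid = expand.grid(mtry = c(1, 2, 4, 8, 16)),
    trControl = trainControl(method = "cv", number = 10)
  )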
Note that these methods have been shown to be among the simplest yet best performing methods available in the scientific literature [57], [58]. The difference between the methods is just the functional form of the model which tries to best approximate a learning principle.

For example, Random Forests and XGBoost implement the wisdom-of-the-crowd principle, Support Vector Machines are robust maximum margin classifiers, and Decision Trees and One Rule represent very easy to interpret models. In this paper we tested multiple algorithms, since the no-free-lunch theorem [59] assures us that, for a specific application, it is not possible to know a priori which algorithm will perform best on a specific task. We therefore tested the ones which, in the past, have shown to perform well on many tasks, and identified the best one for our application.
TABLE 4. CKD development binary classification results. Linear SVM: Support Vector Machine with linear kernel. Gaussian SVM: Support Vector Machine
with Gaussian kernel. MCC: Matthews correlation coefficient (worst value = −1 and best value = +1). TP rate: true positive rate, sensitivity, recall. TN rate:
true negative rate, specificity. PR: precision-recall curve. PPV: positive predictive value, precision. NPV: negative predictive value. ROC: receiver operating
characteristic curve. AUC: area under the curve. F1 score, accuracy, TP rate, TN rate, PPV, NPV, PR AUC, ROC AUC: worst value = 0 and best value = +1.
Confusion matrix threshold for TP rate, TN rate, PPV, and NPV: 0.5. We highlighted in blue and with an asterisk * the top results for each score. We report
the formulas of these rates in the Supplementary Information.
B. FEATURE RANKING
Feature ranking methods based on Random Forests are among the most effective techniques [60], [61], particularly in the context of bioinformatics [62], [63] and health informatics [64]. Since Random Forests obtained the top prediction scores for binary classification, we focus on this method for feature ranking.

Several measures are available for feature importance in Random Forests. A powerful approach is the one based on the Permutation Importance or Mean Decrease in Accuracy (MDA), where the importance is assessed for each feature by removing the association between that feature and the target. This effect is achieved by randomly permuting [65] the values of the feature and measuring the resulting increase in error. The influence of the correlated features is also removed.

In detail, for every tree, the method computes two quantities: the first one is the error on the out-of-bag samples as they are used during prediction, while the second one is the error on the out-of-bag samples after a random permutation of the values of a variable. These two values are then subtracted, and the average of the result over all the trees in the ensemble is the raw importance score for the variable under exam.
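A minimal sketch of this measure with the randomForest R package follows; train_df is a placeholder data frame whose EventCKD35 column is the (factor) target:

  library(randomForest)

  # importance = TRUE makes the forest compute permutation importance;
  # type = 1 then extracts the mean decrease in accuracy (MDA).
  rf <- randomForest(EventCKD35 ~ ., data = train_df,
                     ntree = 1000, importance = TRUE)
  mda <- importance(rf, type = 1)
  mda_ranking <- mda[order(mda[, 1], decreasing = TRUE), , drop = FALSE]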
Despite the effectiveness of MDA, when the number of samples is small these measures might be unstable [66]–[68]. For this reason, in this work, instead of running the Feature Ranking (FR) procedure just once, analogously to what we have done for MS and EE, we subsample the original dataset and we repeat the procedure many times. The final rank of a feature is the aggregation of the different rankings, computed with Borda's method [69].
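In a base R sketch, this aggregation amounts to averaging, for each feature, its position across the repeated runs; rank_matrix is a hypothetical runs-by-features matrix of positions (1 = most important):

  # Borda-style aggregation: the final standing sorts the features by
  # their mean position over all the subsampling runs.
  borda_aggregate <- function(rank_matrix) {
    sort(colMeans(rank_matrix))  # lower mean position = higher final rank
  }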
C. BIOSTATISTICS UNIVARIATE TESTS
Before employing machine learning algorithms, we applied traditional univariate biostatistics techniques to evaluate the relationship between the EventCKD35 target and each feature.

We made use of the Mann–Whitney U test (also known as Wilcoxon rank–sum test) [70] for the numerical features and of the chi–square test [71] for the binary features. The p-values of both these tests range between 0 and 1: a low p-value means that the analyzed variable strongly relates to the target feature, while a high p-value means no evident relation. These tests are therefore also useful to assess the importance of each feature with respect to the target: the lower the p-value of a feature, the stronger its association with the target. Following the recent advice of Benjamin et al. [72], we use 0.005 (that is, 5 × 10^−3) as the threshold of significance for the p-values. If the p-value of a test applied to a variable and the target is lower than 0.005, we consider the association between the variable and the target significant.
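Both tests are available in base R; as an illustrative sketch on a placeholder data frame d containing the dataset features (here AgeBaseline and HistoryDiabetes, two of the variables of Table 1) and the EventCKD35 target:

  # Mann-Whitney U test for a numerical feature versus the binary target
  p_age <- wilcox.test(AgeBaseline ~ EventCKD35, data = d)$p.value
  # Chi-square test for a binary feature versus the binary target
  p_diabetes <- chisq.test(table(d$HistoryDiabetes, d$EventCKD35))$p.value
  # Significance at the 0.005 threshold suggested by Benjamin et al. [72]
  c(AgeBaseline = p_age, HistoryDiabetes = p_diabetes) < 0.005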
D. PREDICTION AND FEATURE RANKING INCLUDING THE TEMPORAL FEATURE
In the second analysis we performed for chronic kidney disease prediction, we decided to include the temporal component expressing in which year the disease occurred for the CKD patients, or in which year they had their last outpatient visit (Supplementary information).

We applied a Stratified Logistic Regression [73], [74] to this complete dataset, including all the original clinical features and the derived year feature, both for supervised binary classification and feature ranking. We measured the prediction with the typical confusion matrix rates (MCC, F1 score, and others), and the importance of each variable as the corresponding logistic regression model coefficient. This method has no significant hyper-parameters, so we did not perform any optimization (glm method of the stats R package).
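A minimal sketch of this step with the glm method of the stats R package follows; d_time is a placeholder data frame containing the clinical features, the derived year feature, and the EventCKD35 target, and the formula is simplified with respect to the stratification details of [73], [74]:

  # Logistic regression on the complete dataset; the fitted
  # coefficients are then read as feature importances.
  fit <- glm(EventCKD35 ~ ., data = d_time, family = binomial)
  importance_table <- summary(fit)$coefficients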
IV. RESULTS
In this section, we report the results for the prediction of the chronic kidney disease (subsection IV-A) and its feature ranking (subsection IV-B).

A. CHRONIC KIDNEY DISEASE PREDICTION RESULTS
1) CKD PREDICTION
We report the results obtained for the static prediction of CKD, measured with traditional confusion matrix indicators, in Table 4. We rank our results by the Matthews correlation coefficient (MCC) because it is the only confusion matrix rate that generates a high score only if the classifier was able to correctly predict most of the data instances and to correctly make most of the predictions, both on the positive class and the negative class [75]–[78].
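For reference, the MCC can be computed from the four confusion matrix entries as in this base R sketch:

  # Matthews correlation coefficient: worst value -1, best value +1.
  # tp, tn, fp, fn are the entries of the 0.5-threshold confusion matrix.
  mcc <- function(tp, tn, fp, fn) {
    den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if (den == 0) return(0)
    (tp * tn - fp * fn) / den
  }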
Random Forests outperformed all the other methods for MCC, F1 score, accuracy, sensitivity, negative predictive value, precision-recall AUC, and receiver operating characteristic AUC (Table 4), while the Support Vector Machine with Gaussian kernel achieved the top specificity and precision.

Because of the imbalance of the dataset (section II), all the classifiers attained better results among the negative data instances (specificity and NPV) than among the positive elements (sensitivity and precision). This happens because each classifier can observe and learn to recognize more individuals without CKD during training, and is therefore more capable of recognizing them than recognizing patients with CKD during testing.

XGBoost and One Rule obtained Matthews correlation coefficients close to 0, meaning that their performance was similar to random guessing. Random Forests, linear SVM, and Decision Tree were the only methods able to correctly classify most of the true positives (TP rate = 0.792, 0.6, and 0.588, respectively). No technique was capable of correctly making most of the positive predictions: all PPVs are below 0.5 (Table 4).

Regarding the negatives, SVM with Gaussian kernel obtained an almost perfect specificity (0.940), while Random Forests achieved an almost perfect NPV of 0.968 (Table 4).

These results show that the machine learning classifiers Random Forests and SVM with Gaussian kernel can efficiently predict patients with CKD and patients without CKD from their electronic health records, with high prediction scores, in a few minutes.

Since Random Forests resulted being the best performing classifier, we also included the calibration curve plot [79] of its predictions (Figure 2), for the sake of completeness. The curve follows the trend of the perfect x = y line, translated on the x axis, between approximately 5% and approximately 65%, indicating well calibrated predictions in this interval.

[...] the disease in the previous analysis. We then decided to perform a stratified prediction including a time feature indicating the year when the patient developed the chronic kidney disease, or the last visit for non-CKD patients (Supplementary information). After having included the year information in the dataset, we applied a Stratified Logistic Regression [74], [80], as described earlier (section III).

The presence of the temporal feature actually improved the prediction, allowing the regression to obtain an MCC of +0.469, better than all the MCCs achieved by the classifiers applied to the static dataset version except Random Forests (Table 5). Also in this case, specificity and NPV result much higher than sensitivity and precision, because of the imbalance of the dataset.

This result comes as no surprise: it makes complete sense that the inclusion of a temporal feature describing the trend of a disease could improve the prediction quality.

To better understand the prediction obtained by the Stratified Logistic Regression, we plotted a calibration curve [79] of its predictions (Figure 3). As one can notice, the Stratified Logistic Regression returns well calibrated predictions, as its trend follows the x = y line, which represents perfect calibration, from approximately 5% to approximately 75% of the probabilities. This calibration curve confirms that the Stratified Logistic Regression made a good prediction.
TABLE 5. CKD prediction results including the temporal feature. The dataset analyzed for these tests contains the time year feature indicating in which
year after the baseline visits the patient developed the CKD. All the abbreviations have the same meaning described in the caption of Table 4.
TABLE 6. Feature ranking through biostatistics univariate tests. We employed the Mann–Whitney U test [70] for the numerical features and the chi–square test [71] for the binary features. We reported in blue and with an asterisk * the only feature having a p-value lower than the 0.005 threshold, that is 5 × 10^−3.

TABLE 7. Feature ranking generated by Random Forests. MDA average position: average position obtained by each feature through the accuracy decrease feature ranking of Random Forests.
plausible that some predictors encode a 'baseline' level of risk of developing CKD, which is negated if the model knows in which year the CKD developed.

The variables whose ranking positions decrease most significantly between the two models are age, eGFR, and creatinine, which are all clinical indicators of an individual's baseline risk of CKD. Inspection of the variables which maintain or increase their position when the year feature is added identifies hypertension, smoking, and diabetes as key predictive factors in the model (subsection IV-B). These are all known to play a central role in the pathogenesis of micro- and macrovascular disease, including of the kidney. While the former variables may encode baseline risk, the latter are stronger indicators of the rate of progression.

It is also worth noting that, without the temporal information, the model is tasked with predicting whether the individual will develop CKD within the next 10 years. Here, the baseline is highly relevant, as it indicates how much further the renal function needs to deteriorate. However, when the configuration is altered to include the year in which the CKD developed, the relative importance of the risk factors may be expected to increase, and indeed we observed this in our models.

D. COMPARISON WITH RESULTS OF THE ORIGINAL STUDY
The original study of Al-Shamsi et al. [28] included a feature ranking phase generated through a multivariable Cox's proportional hazards analysis, which included the temporal component [84]. Their ranking listed older age (AgeBaseline), personal history of coronary heart disease (HistoryCHD), personal history of diabetes mellitus (HistoryDLD), and personal history of smoking (HistorySmoking) as the most important factors for the risk of a serious CKD event.

In contrast to their findings, AgeBaseline was ranked in the last position in our Stratified Logistic Regression standing, while HistoryCHD and HistoryDLD were at unimportant positions: 10th and 16th ranks out of 19 variables, respectively. Smoking history, instead, occupied a high rank both in our standing and in the original study standing: our approach, in fact, listed it as 5th out of 19.

E. COMPARISON WITH RESULTS OF OTHER STUDIES
Several published studies include a feature ranking phase to detect the most relevant variables to predict chronic kidney disease from electronic medical records. Most of them, however, use feature ranking to reduce the number of variables for the binary classification, without reporting a final standing of clinical factors ranked by importance [10], [12], [21].

Only the article of Salekin and Stankovic [6] reports the most relevant variables found in their study: specific gravity, albumin, diabetes, hypertension, hemoglobin, serum creatinine, red blood cells count, packed cell volume, appetite, and sodium resulted being at the top positions. Even if the clinical features present in our dataset mainly differ from theirs, we can notice the difference in the ranking positions between the two studies.

Hypertension resulted being the 4th most important factor in Salekin's study [6], confirming the importance of the HistoryHTN variable, which is ranked at the 3rd position in our Stratified Logistic Regression ranking (Table 8). Also diabetes history has a high ranking in both standings: 3rd position in the ranking of Salekin's study [6], and 6th position of importance in our Stratified Logistic Regression ranking, as HistoryDiabetes (Table 8).

VI. CONCLUSION
Chronic kidney disease affects more than 700 million people in the world annually, and kills approximately 1.2 million of them. Computational intelligence can be an effective means to quickly analyze electronic health records of patients affected by this disease, providing information about how likely they are to develop severe stages of this disease, or stating which clinical variables are the most important for diagnosis.

In this article, we analyzed a medical record dataset of 491 patients from the UAE with CKD and at risk of cardiovascular disease, and developed machine learning methods able to predict, with high accuracy, the likelihood that they will develop CKD at stages 3-5. Afterwards, we employed machine learning to detect the most important variables contained in the dataset, first excluding the temporal component indicating the year when the CKD happened or the patient's last visit, and then including it. Our results confirmed the effectiveness of our approach.

Regarding limitations, we have to report that we performed our analysis only on a single dataset. We looked for alternative public datasets to use as validation cohorts, but unfortunately we could not find any with the same clinical features.

In the future, we plan to further investigate the probability of diagnosis prediction in this dataset through classifier calibration and calibration plots [85], and to perform the feature ranking with a different method such as SHapley Additive exPlanations (SHAP) [86]. Moreover, we also plan to study chronic kidney disease by applying our methods to CKD datasets of other types, such as microarray gene expression [87], [88] and ultrasonography images [89].

LIST OF ABBREVIATIONS
AUC: area under the curve. BP: blood pressure. CHD: coronary heart disease. CKD: chronic kidney disease. CVD: cardiovascular disease. DLD: dyslipidemia. EE: error estimation. FR: feature ranking. KDIGO: Kidney Disease Improving Global Outcomes. HTN: hypertension. MCC: Matthews correlation coefficient. MDA: Mean Decrease in Accuracy. MS: model selection. NPV: negative predictive value. p-value: probability value. PPV: positive predictive value. PR: precision–recall. ROC: receiver operating characteristic. SHAP: SHapley Additive exPlanations. SVM: Support Vector Machine. TN rate: true negative rate. TP rate: true positive rate. UAE: United Arab Emirates.
COMPETING INTERESTS
The authors declare they have no competing interest.

ACKNOWLEDGMENT
The authors thank Saif Al-Shamsi (United Arab Emirates University) for having provided additional information about the dataset.

DATA AND SOFTWARE AVAILABILITY
The dataset used in this study is publicly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license at: https://figshare.com/articles/dataset/Chronic_kidney_disease_in_patients_at_high_risk_of_cardiovascular_disease_in_the_United_Arab_Emirates_A_population-based_study/6711155?file=12242270

Our software code is publicly available under the GNU General Public License v3.0 at: https://github.com/davidechicco/chronic_kidney_disease_and_cardiovascular_disease

REFERENCES
[1] V. A. Luyckx, M. Tonelli, and J. W. Stanifer, "The global burden of kidney disease and the sustainable development goals," Bull. World Health Org., vol. 96, no. 6, p. 414, 2018.
[2] S. Said and G. T. Hernandez, "The link between chronic kidney disease and cardiovascular disease," J. Nephropathol., vol. 3, no. 3, p. 99, 2014.
[3] K. Damman, M. A. E. Valente, A. A. Voors, C. M. O'Connor, D. J. van Veldhuisen, and H. L. Hillege, "Renal impairment, worsening renal function, and outcome in patients with heart failure: An updated meta-analysis," Eur. Heart J., vol. 35, no. 7, pp. 455–469, Feb. 2014.
[4] A. Charleonnan, T. Fufaung, T. Niyomwong, W. Chokchueypattanakit, S. Suwannawach, and N. Ninchawee, "Predictive analytics for chronic kidney disease using machine learning techniques," in Proc. Manage. Innov. Technol. Int. Conf. (MITicon), Bang-Saen, Thailand, Oct. 2016, pp. 80–83.
[5] N. Tazin, S. A. Sabab, and M. T. Chowdhury, "Diagnosis of chronic kidney disease using effective classification and feature selection technique," in Proc. Int. Conf. Med. Eng., Health Informat. Technol. (MediTec), Dhaka, Bangladesh, Dec. 2016, pp. 1–6.
[6] A. Salekin and J. Stankovic, "Detection of chronic kidney disease and selecting important predictive attributes," in Proc. IEEE Int. Conf. Healthcare Informat. (ICHI), Chicago, IL, USA, Oct. 2016, pp. 262–270.
[7] H. Polat, H. D. Mehr, and A. Cetin, "Diagnosis of chronic kidney disease based on support vector machine by feature selection methods," J. Med. Syst., vol. 41, no. 4, p. 55, 2017.
[8] M. S. Wibawa, I. M. D. Maysanjaya, and I. M. A. W. Putra, "Boosted classifier and features selection for enhancing chronic kidney disease diagnose," in Proc. 5th Int. Conf. Cyber IT Service Manage. (CITSM), Denpasar, Indonesia, Aug. 2017, pp. 1–6.
[9] A. Subasi, E. Alickovic, and J. Kevric, "Diagnosis of chronic kidney disease by using random forest," in Proc. Int. Conf. Med. Biol. Eng. (CMBEBIH). Singapore: Springer, 2017, pp. 589–594.
[10] S. Zeynu and S. Patil, "Prediction of chronic kidney disease using data mining feature selection and ensemble method," Int. J. Data Mining Genomics Proteomics, vol. 9, no. 1, pp. 1–9, 2018.
[11] A. Ogunleye and Q.-G. Wang, "Enhanced XGBoost-based automatic diagnosis system for chronic kidney disease," in Proc. IEEE 14th Int. Conf. Control Autom. (ICCA), Anchorage, AK, USA, Jun. 2018, pp. 805–810.
[12] S. Zeynu and S. Patil, "Survey on prediction of chronic kidney disease using data mining classification techniques and feature selection," Int. J. Pure Appl. Math., vol. 118, no. 8, pp. 149–156, 2018.
[13] A. A. Imran, M. N. Amin, and F. T. Johora, "Classification of chronic kidney disease using logistic regression, feedforward neural network and wide & deep learning," in Proc. Int. Conf. Innov. Eng. Technol. (ICIET), Osaka, Japan, Dec. 2018, pp. 1–6.
[14] A. Shrivas, S. K. Sahu, and H. Hota, "Classification of chronic kidney disease with proposed union based feature selection technique," in Proc. 3rd Int. Conf. Internet Things Connected Technol., Jaipur, India, 2018, pp. 26–27.
[15] S. Belina V. J. Sara and K. Kalaiselvi, "Ensemble swarm behaviour based feature selection and support vector machine classifier for chronic kidney disease prediction," Int. J. Eng. Technol., vol. 7, no. 2, p. 190, May 2018.
[16] N. R. Shawan, S. S. A. Mehrab, F. Ahmed, and A. S. Hasmi, "Chronic kidney disease detection using ensemble classifiers and feature set reduction," Ph.D. dissertation, Dept. Comput. Sci. Eng., BRAC Univ., Dhaka, Bangladesh, 2019.
[17] S. B. Satukumati and R. K. S. Satla, "Feature extraction techniques for chronic kidney disease identification," Kidney, vol. 24, no. 1, p. 29, 2019.
[18] T. Abrar, S. Tasnim, and M. Hossain, "Early detection of chronic kidney disease using machine learning," Ph.D. dissertation, Dept. Comput. Sci. Eng., BRAC Univ., Dhaka, Bangladesh, 2019.
[19] M. Elhoseny, K. Shankar, and J. Uthayakumar, "Intelligent diagnostic prediction and classification system for chronic kidney disease," Sci. Rep., vol. 9, no. 1, pp. 1–14, Dec. 2019.
[20] S. Ravizza, T. Huschto, A. Adamov, L. Böhm, A. Büsser, F. F. Flöther, R. Hinzmann, H. König, S. M. McAhren, D. H. Robertson, T. Schleyer, B. Schneidinger, and W. Petrich, "Predicting the early risk of chronic kidney disease in patients with diabetes using real-world data," Nature Med., vol. 25, no. 1, pp. 57–59, Jan. 2019.
[21] S. I. Ali, B. Ali, J. Hussain, M. Hussain, F. A. Satti, G. H. Park, and S. Lee, "Cost-sensitive ensemble feature ranking and automatic threshold selection for chronic kidney disease diagnosis," Appl. Sci., vol. 10, no. 16, p. 5663, Aug. 2020.
[22] P. Chittora, S. Chaurasia, P. Chakrabarti, G. Kumawat, T. Chakrabarti, Z. Leonowicz, M. Jasiński, Ł. Jasiński, R. Gono, E. Jasińska, and V. Bolshev, "Prediction of chronic kidney disease—A machine learning perspective," IEEE Access, vol. 9, pp. 17312–17334, 2021.
[23] P. Ventrella, G. Delgrossi, G. Ferrario, M. Righetti, and M. Masseroli, "Supervised machine learning for the assessment of chronic kidney disease advancement," Comput. Methods Programs Biomed., vol. 209, Sep. 2021, Art. no. 106329.
[24] M. Rashed-Al-Mahfuz, A. Haque, A. Azad, S. A. Alyami, J. M. W. Quinn, and M. A. Moni, "Clinically applicable machine learning approaches to identify attributes of chronic kidney disease (CKD) for use in low-cost diagnostic screening," IEEE J. Transl. Eng. Health Med., vol. 9, pp. 1–11, 2021.
[25] S. Krishnamurthy, K. Ks, E. Dovgan, M. Luštrek, B. G. Piletič, K. Srinivasan, Y.-C.-J. Li, A. Gradišek, and S. Syed-Abdul, "Machine learning prediction models for chronic kidney disease using national health insurance claim data in Taiwan," Healthcare, vol. 9, no. 5, p. 546, May 2021.
[26] M. Gupta and P. Gupta, "Predicting chronic kidney disease using machine learning," in Emerging Technologies for Healthcare: Internet of Things and Deep Learning Models. Hoboken, NJ, USA: Wiley, 2021, pp. 251–277.
[27] University of California Irvine Machine Learning Repository. (Oct. 4, 2021). Chronic Kidney Disease Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease
[28] S. Al-Shamsi, D. Regmi, and R. D. Govender, "Chronic kidney disease in patients at high risk of cardiovascular disease in the United Arab Emirates: A population-based study," PLoS ONE, vol. 13, no. 6, Jun. 2018, Art. no. e0199920.
[29] G. S. Francis, "ACE inhibition in cardiovascular disease," New England J. Med., vol. 342, no. 3, pp. 201–202, Jan. 2000.
[30] J. Agata, D. Nagahara, S. Kinoshita, Y. Takagawa, N. Moniwa, D. Yoshida, N. Ura, and K. Shimamoto, "Angiotensin II receptor blocker prevents increased arterial stiffness in patients with essential hypertension," Circulat. J., vol. 68, no. 12, pp. 1194–1198, 2004.
[31] Kidney Disease: Improving Global Outcomes (KDIGO) Transplant Work Group, "KDIGO clinical practice guideline for the care of kidney transplant recipients," Amer. J. Transplantation, vol. 9, p. S1, Nov. 2009.
[32] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[33] A. Altmann, L. Toloşi, O. Sander, and T. Lengauer, "Permutation importance: A corrected feature importance measure," Bioinformatics, vol. 26, no. 10, pp. 1340–1347, 2010.
[34] M. A. Hardy, Regression With Dummy Variables. Newbury Park, CA, USA: Sage, 1993.
[35] A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons, "Review: A gentle introduction to imputation of missing values," J. Clin. Epidemiol., vol. 59, no. 10, pp. 1087–1091, Oct. 2006.
[36] L. Oneto, Model Selection and Error Estimation in a Nutshell. Berlin, Germany: Springer, 2020.
[37] K. F. Kerr, "Comments on the analysis of unbalanced microarray data," Bioinformatics, vol. 25, no. 16, pp. 2035–2041, Aug. 2009.
[38] R. Laza, R. Pavón, M. Reboiro-Jato, and F. Fdez-Riverola, "Evaluating the effect of unbalanced data in biomedical document classification," J. Integrative Bioinf., vol. 8, no. 3, pp. 105–117, Dec. 2011.
[39] K. Han, K. Z. Kim, J. M. Oh, I. W. Kim, K. Kim, and T. Park, "Unbalanced sample size effect on the genome-wide population differentiation studies," Int. J. Data Mining Bioinf., vol. 6, no. 5, pp. 490–504, 2012.
[40] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, "Learning from class-imbalanced data: Review of methods and applications," Expert Syst. Appl., vol. 73, pp. 220–239, May 2017.
[41] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[42] T. Zhu, Y. Lin, and Y. Liu, "Synthetic minority oversampling technique for multiclass imbalance problems," Pattern Recognit., vol. 72, pp. 327–340, Dec. 2017.
[43] C. Molnar. (2018). Interpretable Machine Learning. [Online]. Available: https://christophm.github.io/book/
[44] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[45] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[46] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[47] S. S. Keerthi and C.-J. Lin, "Asymptotic behaviors of support vector machines with Gaussian kernel," Neural Comput., vol. 15, no. 7, pp. 1667–1689, Mar. 2003.
[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[49] M. J. Zaki and W. Meira, Jr., Data Mining and Machine Learning: Fundamental Concepts and Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2019.
[50] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), San Francisco, CA, USA, 2016, pp. 785–794.
[51] R. C. Holte, "Very simple classification rules perform well on most commonly used datasets," Mach. Learn., vol. 11, no. 1, pp. 63–90, Apr. 1993.
[52] I. Orlandi, L. Oneto, and D. Anguita, "Random forests model selection," in Proc. Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Bruges, Belgium, 2016, pp. 441–446.
[53] F. Hutter, H. Hoos, and K. Leyton-Brown, "An efficient approach for assessing hyperparameter importance," in Proc. 31st Int. Conf. Mach. Learn. (ICML), Beijing, China, 2014, pp. 754–762.
[54] S. Bernard, L. Heutte, and S. Adam, "Influence of hyperparameters on random forest accuracy," in Proc. Int. Workshop Multiple Classifier Syst., Reykjavik, Iceland, 2009, pp. 171–180.
[55] P. Probst, M. Wright, and A.-L. Boulesteix, "Hyperparameters and tuning strategies for random forest," Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 9, no. 3, p. e1301, 2019.
[56] M. Kuhn, "Building predictive models in R using the caret package," J. Statist. Softw., vol. 28, no. 5, pp. 1–26, 2008.
[57] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[58] M. Wainberg, B. Alipanahi, and B. J. Frey, "Are random forests truly the best classifiers?" J. Mach. Learn. Res., vol. 17, no. 1, pp. 3837–3841, 2016.
[59] D. H. Wolpert, "The lack of a priori distinctions between learning algorithms," Neural Comput., vol. 8, no. 7, pp. 1341–1390, Oct. 1996.
[60] Y. Saeys, T. Abeel, and Y. V. D. Peer, "Robust feature selection using ensemble feature selection techniques," in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases (ECML PKDD), Antwerp, Belgium, 2008, pp. 313–325.
[61] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, "Variable selection using random forests," Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, Oct. 2010.
[62] Y. Qi, "Random forest for bioinformatics," in Ensemble Machine Learning. Boston, MA, USA: Springer, 2012.
[63] R. Díaz-Uriarte and S. A. De Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinf., vol. 7, no. 1, p. 3, Dec. 2006.
[64] D. Chicco and C. Rovelli, "Computational prediction of diagnosis and feature selection on mesothelioma patient health records," PLoS ONE, vol. 14, no. 1, Jan. 2019, Art. no. e0208737.
[65] P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York, NY, USA: Springer, 2013.
[66] M. L. Calle and V. Urrea, "Letter to the editor: Stability of random forest importance measures," Briefings Bioinf., vol. 12, no. 1, pp. 86–89, Jan. 2011.
[67] M. B. Kursa, "Robustness of random forest-based gene selection methods," BMC Bioinf., vol. 15, no. 1, pp. 1–8, Dec. 2014.
[68] H. Wang, F. Yang, and Z. Luo, "An experimental study of the intrinsic stability of random forest variable importance measures," BMC Bioinf., vol. 17, no. 1, p. 60, 2016.
[69] D. Sculley, "Rank aggregation for similar items," in Proc. SIAM Int. Conf. Data Mining, Minneapolis, MN, USA, Apr. 2007, pp. 587–592.
[70] T. W. MacFarland and J. M. Yates, "Mann–Whitney U test," in Introduction to Nonparametric Statistics for the Biological Sciences Using R. Berlin, Germany: Springer, 2016, pp. 103–132.
[71] P. E. Greenwood and M. S. Nikulin, A Guide to Chi–Squared Testing, vol. 280. Hoboken, NJ, USA: Wiley, 1996.
[72] D. J. Benjamin et al., "Redefine statistical significance," Nature Hum. Behav., vol. 2, no. 1, pp. 6–10, 2018.
[73] C. R. Mehta and N. R. Patel, "Exact logistic regression: Theory and examples," Statist. Med., vol. 14, no. 19, pp. 2143–2160, Oct. 1995.
[74] D. Chicco and G. Jurman, "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone," BMC Med. Informat. Decis. Making, vol. 20, no. 1, p. 16, Dec. 2020.
[75] D. Chicco, "Ten quick tips for machine learning in computational biology," BioData Mining, vol. 10, no. 35, pp. 1–17, 2017.
[76] D. Chicco, M. J. Warrens, and G. Jurman, "The Matthews correlation coefficient (MCC) is more informative than Cohen's Kappa and Brier score in binary classification assessment," IEEE Access, vol. 9, pp. 78368–78381, 2021.
[77] D. Chicco, N. Tötsch, and G. Jurman, "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation," BioData Mining, vol. 14, Feb. 2021, Art. no. 13.
[78] D. Chicco, V. Starovoitov, and G. Jurman, "The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment," IEEE Access, vol. 9, pp. 47112–47124, 2021.
[79] P. C. Austin and E. W. Steyerberg, "Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers," Statist. Med., vol. 33, no. 3, pp. 517–535, Feb. 2014.
[80] N. E. Breslow, L. P. Zhao, T. R. Fears, and C. C. Brown, "Logistic regression for stratified case–control studies," Biometrics, vol. 44, no. 3, pp. 891–899, 1988.
[81] J. H. Zar, "Spearman rank correlation," in Encyclopedia of Biostatistics, vol. 7. Hoboken, NJ, USA: Wiley, 2005.
[82] F. J. Brandenburg, A. Gleißner, and A. Hofmeier, "Comparing and aggregating partial orders with Kendall tau distances," in Proc. 6th Int. Workshop Algorithms Comput. (WALCOM). Dhaka, Bangladesh: Springer, 2012, pp. 88–99.
[83] D. Chicco, E. Ciceri, and M. Masseroli, "Extended Spearman and Kendall coefficients for gene annotation list correlation," in Proc. 11th Int. Meeting Comput. Intell. Methods Bioinf. Biostatistics (CIBB), in Lecture Notes in Computer Science, vol. 8623. Cambridge, U.K.: Springer, 2015, pp. 19–32.
[84] D. Clayton and J. Cuzick, "Multivariate generalizations of the proportional hazards model," J. Roy. Stat. Soc., A, General, vol. 148, no. 2, pp. 82–108, 1985.
[85] P. A. Flach, "Classifier calibration," in Encyclopedia of Machine Learning and Data Mining. Berlin, Germany: Springer, 2016.
[86] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), 2017, pp. 4768–4777.
[87] L.-T. Zhou, S. Qiu, L.-L. Lv, Z.-L. Li, H. Liu, R.-N. Tang, K.-L. Ma, and B.-C. Liu, "Integrative bioinformatics analysis provides insight into the molecular mechanisms of chronic kidney disease," Kidney Blood Pressure Res., vol. 43, no. 2, pp. 568–581, 2018.
[88] Z. Zuo, J.-X. Shen, Y. Pan, J. Pu, Y.-G. Li, X.-H. Shao, and W.-P. Wang, "Weighted gene correlation network analysis (WGCNA) detected loss of MAGI2 promotes chronic kidney disease (CKD) by podocyte damage," Cellular Physiol. Biochem., vol. 51, no. 1, pp. 244–261, 2018.
[89] C.-Y. Ho, T.-W. Pai, Y.-C. Peng, C.-H. Lee, Y.-C. Chen, Y.-T. Chen, and K.-S. Chen, "Ultrasonography image analysis for detection and classification of chronic kidney disease," in Proc. 6th Int. Conf. Complex, Intell., Softw. Intensive Syst. (CISIS), Palermo, Italy, Jul. 2012, pp. 624–629.

CHRISTOPHER A. LOVEJOY received the bachelor's degree in medicine from the University of Cambridge, U.K., and the master's degree in data science and machine learning from University College London, U.K. He is currently a Medical Doctor with interests in applied machine learning and bioinformatics.