Information 13 00330 v2 PDF
Article
Supervised Learning Models for the Preliminary Detection of
COVID-19 in Patients Using Demographic and
Epidemiological Parameters
Aditya Pradhan 1 , Srikanth Prabhu 1, *, Krishnaraj Chadaga 1 , Saptarshi Sengupta 2 and Gopal Nath 3
Abstract: The World Health Organization labelled the new COVID-19 breakout a public health
crisis of worldwide concern on 30 January 2020, and it was named the new global pandemic in
March 2020. It has had catastrophic consequences on the world economy and well-being of people
and has put a tremendous strain on already-scarce healthcare systems globally, particularly in
underdeveloped countries. Over 11 billion vaccine doses have already been administered worldwide,
and the benefits of these vaccinations will take some time to appear. Today, the only practical
approach to diagnosing COVID-19 is through the RT-PCR and RAT tests, which have sometimes been
known to give unreliable results. Timely diagnosis and implementation of precautionary measures
will likely improve the survival outcome and decrease the fatality rates. In this study, we propose
an innovative way to predict COVID-19 with the help of alternative non-clinical methods such as
supervised machine learning models to identify the patients at risk based on their characteristic
parameters and underlying comorbidities. Medical records of patients from Mexico admitted between
23 January 2020 and 26 March 2022, were chosen for this purpose. Among several supervised machine
learning approaches tested, the XGBoost model achieved the best results with an accuracy of 92%. It is
an easy, non-invasive, inexpensive, instant and accurate way of forecasting those at risk of contracting
the virus. However, it is pretty early to deduce that this method can be used as an alternative in the
clinical diagnosis of coronavirus cases.
Keywords: COVID-19 diagnosis; machine learning; data-driven approaches; SMOTE; SHAP; LIME;
infection prediction
Citation: Pradhan, A.; Prabhu, S.; Chadaga, K.; Sengupta, S.; Nath, G. Supervised Learning Models
for the Preliminary Detection of COVID-19 in Patients Using Demographic and Epidemiological
Parameters. Information 2022, 13, 330. https://doi.org/10.3390/info13070330
Academic Editor: Willy Susilo
Received: 2 June 2022
Accepted: 30 June 2022
Published: 10 July 2022
of taste or smell [4]. Epidemiological studies have shown that elderly individuals are more
prone to severe illnesses, while children often have milder symptoms [5,6]. People with
underlying severe medical conditions, such as diabetes, hypertension, cancer, lung/liver
or kidney disease, have shown a bad prognosis and are at a higher risk of hospitalisation.
In worst-case scenarios, the infection can be fatal [7,8]. The first case originated in Wuhan,
Hubei Province, China, in December 2019 and has since spread to the entire world [9]. As
of June 2022, over 532 million cases have been reported, and around 6.3 million deaths
have been recorded [10].
COVID-19 is highly contagious and can transmit through direct contact (human-to-
human transmission and droplet) and indirect contact (airborne contagion and contam-
inated objects) [11]. Its symptoms typically manifest between 1 and 14 days, while the
mean incubation period is 5.2 days [12]. Countries worldwide have enforced norms such
as social distancing, face masks, quarantine and vaccinations to curb the spread of this
dangerous virus. Since it spreads rapidly and has no effective cure, the most efficient
method of tackling its spread is early detection and isolation of patients. Currently, to diag-
nose COVID-19, there are two major types of tests: the first being the molecular/nucleic
acid tests which include the reverse transcription-polymerase chain reaction (RT-PCR)
test, digital PCR, isothermal nucleic acid amplification test and clustered regularly inter-
spaced short palindromic repeats (CRISPR) system that detect the RNA component of the
virus [13]. The RT-PCR test is considered the gold standard technique worldwide to detect
COVID-19 since it delivers results more rapidly and accurately than others [14–17]. How-
ever, RT-PCR has difficulty discriminating true positives from true negatives in COVID-19
affected patients [18]. Another flaw is its false-negative rate, which is highly variable.
The false-negative rate is highest during the first five days after exposure (up to 67%)
and lowest on the eighth day after exposure (21%) [19]. Furthermore, due to the acute
shortage of RT-PCR test kits in underdeveloped countries, testing and detection are delayed.
The second type of test is the rapid antigen test (RAT). This test identifies antigens and
small proteins on the virus’s surface and gives the result within 15–30 min. Its primary
disadvantages have been its low specificity (77.8%) and sensitivity (18.8%) [20]. Thus, there
is an urgent requirement for a method that overcomes the pitfalls of the previous tests.
One way to tackle this problem is by using artificial intelligence (AI) and machine learn-
ing (ML) to enhance clinical prediction since they recognise complex patterns in massive
datasets [21]. With the advancement of machine learning, research can offer a strategic
framework for developing automated, complicated and objective algorithmic tools for the
analysis of multimodal and multidimensional biological/mathematical data. ML models
can aid in the prediction of patients who are at a high risk of contracting COVID-19. This
can prevent the spread and reduce fatalities. ML-powered prediction models combine
numerous features to estimate the risk of infection and alleviate the burden on healthcare
systems worldwide.
AI can be defined as a wide field of computer science concerned with developing
models that can mimic human cognitive abilities. ML is a subclass of AI where the computer
learns on its own by analysing historical data or experience and makes accurate predictions
without being explicitly programmed. The historical data may be divided into two subsets
for training and testing, among other configurations. For example, a classifier may be
trained on the training dataset, where it learns about the various interesting patterns
which discriminate the several existing classes. The trained model, i.e., the classifier, then
predicts the classes of the testing dataset. There are four categories of machine learning.
(a) Supervised machine learning methods are algorithms that learn from historical or
prior datasets using labels to predict appropriate classes for unseen data (classification)
or forecast future occurrences (regression) [22]. This learning style requires the presence
of supervision in the form of labels in the training phase. The learning system’s expected
output is compared to the actual results. If discrepancies are discovered, they can be
corrected by adjusting the model appropriately, usually through the employment of an
optimisation algorithm that lowers the error indicative of the goodness of fit. (b) In
unsupervised learning, the input data are unclassified or unlabelled [23]. The algorithm
is not given the correct result; instead, it explores the data in order to
draw inferences, characterise unlabelled datasets and find meaningful patterns
in it [24]. (c) Semi-supervised learning methods are those that fall between supervised
and unsupervised learning models. They use both labelled and unlabelled data during
the training process. This technique is used to increase the precision of learning [24]. (d)
Reinforcement learning approaches use actions to engage with the learning environment to
identify erroneous results [25]. The model is trained based upon the previous outcomes,
and rewards and punishments exist for the predictions. Based on this principle, the model
learns to maximise the rewards and minimise the penalties, thereby learning from the
environment [26–28]. In addition to these approaches, it is important to highlight deep
learning (DL), which is a subset of ML. The various deep learning architectures draw
inspiration from and are built upon computational analogues of neurons in the human
brain and aim to mimic how human beings learn. These techniques are
representation-learning approaches with many layers of representation created by composing simple yet
non-linear components that change the representation at one level (beginning with the raw
input) into a higher, increasingly abstract level [29]. Applications of DL can be found in
the fields of natural language processing [30], image recognition [31], recommendation
systems [32], speech recognition [33] and medical diagnosis [34], among others. Deep
learning is extremely useful in learning complex patterns in data by means of developing
tailored models that use different combinations of transformations. DL model performance
scales with the amount of data, and its abstraction does not require the entire architecture
to be hardcoded.
In this research, machine learning and deep learning algorithms are utilised to perform
a preliminary diagnosis of COVID-19 using demographic and epidemiological parameters.
These techniques can be extremely useful in geographical settings where medical resources
are scarce or during pandemic peaks when demand is at its maximum, thereby putting
strain on the resources. The article serves to emphasise the following contributions:
• Extensive review of background research: We perform a detailed review of recent
work in the literature, which looks at various diagnostic procedures for COVID-19
using AI and ML. Emphasis is placed on articles which consider demographic and
epidemiological parameters as part of their data.
• Pre-processing: The data are pre-processed to understand the most important parame-
ters. Correlation techniques have been used to underline the most important columns
in the dataset.
• Balancing: We use the Borderline-SMOTE technique to balance the data.
• Feature importance: We highlight relevant feature importance derivation techniques.
• Application of ML models: Machine learning and deep learning techniques have been
used to derive insights from the data. As demonstrated below, the models tend to
perform quite well for the considered data.
• Analysis of parameters: Information about the various parameters is obtained, and
their effect on COVID-19 patients is studied. The results obtained are compared with
state-of-the-art studies in the literature using similar data.
• Future directions: We provide an overview of some challenges faced and potential
future directions to extend the work.
In this study, a labelled epidemiological dataset from various hospitals in Mexico is
considered. The entire dataset in Spanish is pre-processed and balanced. Several classifiers
are developed and are extensively evaluated using performance metrics such as accuracy,
precision, recall, specificity and AUC. We also look at some popular techniques used in
medical AI research, such as boosting and deep learning networks. The proposed models
may augment efforts of detection and intervention and are ideally expected to reduce
the heavy burden already faced by healthcare systems all around the world. The paper
is organised as follows: Section 2 consists of similar studies that diagnose and forecast
COVID-19 using machine learning. Section 3 elaborates on the dataset description, data
pre-processing, correlation analysis and some theoretical concepts related to ML. The
performance metrics, model evaluation and description of results are explained in detail
in Section 4. Section 5 highlights the key issues and future directions. Section 6 concludes
the paper.
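As a hedged illustration of the performance metrics named above, the following minimal sketch computes accuracy, precision, recall and specificity from the four confusion-matrix counts of a binary classifier. The function name `classification_metrics` and the example counts are hypothetical and are not results from this study; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives, respectively.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,  # also called sensitivity
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical counts for illustration only (not taken from the study).
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Reporting recall and specificity alongside accuracy matters for screening tasks, since a model can reach high accuracy on imbalanced data while still missing many positive cases.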
2. Related Work
With rapid advancements made in increasing the computational power of machines
and the development of new sophisticated algorithms revolutionising the big data niche,
exponential progress has been seen in AI in the past two decades. In healthcare settings,
accurate diagnosis and initiating treatment at the appropriate time are crucial. With broad
impact encompassing the medical landscape, ML has transformed how we diagnose dis-
eases, make predictions, analyse images, provide personalised treatment and aid patients.
ML approaches have already been utilised to treat COVID-19, diabetes, pneumonia, cancer,
dementia, liver failure and Parkinson’s disease, amongst other ailments. They provide
accurate detection and estimation results [35–40], and this has helped decrease human
intervention in clinical practice.
From the start of the COVID-19 pandemic, we have seen a variety of areas where ML
has been used extensively. Use cases have included predicting the outbreak of COVID-19 in
different countries, estimating the occurrence of the next wave and its severity, predicting
mortality rates, contact tracing, detecting people not wearing facemasks or practising social
distancing, developing vaccines and better understanding the correlation between patients’
underlying conditions and the mortality rate [41]. Early diagnosis
of COVID-19 patients is critical to prevent the illness from progressing in an individual
and from spreading to others. Research has shown that radiological imaging of the chest,
such as computed tomography (CT) and X-ray, can be helpful in the early detection and
treatment of COVID-19 [42]. A survey of recent literature reveals that COVID-19 mortality
can be easily predicted using CT scans [43]. Narin et al. [44] were able to build a deep
convolutional neural network (CNN) model which was able to detect COVID-19 with an
accuracy greater than 96% using chest X-ray scans. Ozturk et al. created a DL model named
DarkCovidNet, which could detect COVID-19 with an accuracy of up to 98.08% from chest CT scan
images [45]. According to these studies, these models could predict COVID-19 effectively
and were as reliable as RT-PCR tests. Apart from that, they are much quicker and instantly
produce results. However, these methods are invasive and need to have a radiology expert
who can interpret the results, thus making the tests expensive. Furthermore, doctors do
not recommend CT scans for all patients due to the radiation emitted by the machine,
which can cause cancer [46]. X-rays are also prone to false-negative results [34], among
other pitfalls.
Blood markers, epidemiological parameters and other demographic factors can be
used for preliminary diagnosis of COVID-19. Unlike CT scans and X-rays, these facilities
are available in all hospitals. The demographic parameters can be easily collected from
patients. These tests can be used in parallel with RT-PCR tests. Muhammed et al. [47]
used supervised ML models to predict COVID-19 using a Mexican epidemiological dataset.
Eleven features were extracted for training the ML models. The dataset was obtained from
the General Directorate of Epidemiology, which had published it on its website [48]. Five
ML algorithms: decision trees, logistic regression, naïve Bayes, support vector machine
and artificial neural networks (ANN) were deployed. The accuracies obtained by them
were 94.99%, 94.4%, 94.36%, 92.4% and 89.2%, respectively. The article concluded that
these models could be effectively deployed in hospitals. Quiroz-Juarez et al. [49] used
ML to identify high risk coronavirus patients. The dataset obtained for this research
was published by the Mexican Federal Government [48]. Four ML algorithms: neural
networks, logistic regression, support vector machines and K-nearest neighbours (KNN)
were used. The accuracies obtained were 93.5%, 92.1%, 92.5% and 89.3%, respectively.
The article concluded that neural networks could easily outperform conventional machine
learning algorithms. Prieto [50] used the Mexican dataset to forecast COVID-19 using ML
and Bayesian approaches. Parameter estimation techniques were used in the beginning.
Clinical analysis was performed later. The synthetic minority oversampling technique
(SMOTE) was used in this research to balance the dataset. The author claimed that the
techniques mentioned above are accurate and many false-positive and false-negative results
have been eliminated. Iwendi et al. [51] used ML algorithms to diagnose COVID-19 in
patients from Brazil and Mexico. Demographics, social and economic conditions, symptom
reports and clinical factors were all considered. The models they developed obtained an
accuracy of 93% for the Mexican dataset and 69% for the Brazilian dataset.
AI was used in early COVID-19 detection in [52]. Decision tree, Support Vector
Machine and voting classifiers were used on the benchmarked dataset from Mexico. The
best model obtained sensitivity, specificity and AUC of 75%, 61% and 72%, respectively. The
results obtained were satisfactory according to the study. The effect of medical conditions
on COVID-19 susceptibility was studied in [53]. Many COVID-19 datasets were considered
for this research. The study claims that diabetes is a strong factor which links to COVID-19
mortality and that comorbidities such as hypertension and obesity are also important.
Maouche et al. [54] used four ML algorithms: Multi-Layer Perceptron (MLP), decision tree,
random forest and Gradient Boosting to diagnose COVID-19 using the Mexican dataset.
The accuracies obtained by the models were 97.92%, 97.14%, 99.06% and 99.28%. Feature
importance methods were used and the most important parameters were age, hypertension,
pneumonia, diabetes and obesity.
Delgado-Gallegos et al. [55] used a decision tree model to understand occupational
stress in healthcare professionals from Mexico. An accuracy of 94.1% was obtained by
the model. Many frontline COVID-19 workers suffered from compulsive and xenophobia-related
stress, according to the study. A random forest algorithm was used to predict the diagnosis
of COVID-19 in [56]. A precision of 95% was obtained by the model. The article concluded
that non-clinical diagnosis using information technology is going to play a crucial role
in medical settings in the coming years. Mukherjee et al. [57] used KNN to diagnose
COVID-19 using a cloud-based Internet of Things (IoT) system. Seven COVID-19 datasets
were used for this research. An ant colony optimization (ACO) algorithm was used for
feature selection. Maximum accuracy of 97% was obtained by the models. The rest of the
related articles are described in Table 1.
Table 1. Related works which diagnose and predict COVID-19 mortality using machine learning approaches.
Table 2. List of attributes present in the Mexican COVID-19 dataset after converting to English
(variable type is denoted in brackets).
Categories: Characteristics
(2) Demographic Data:
14. Address (String)
15. State (String)
16. Simple Address (String)
17. Origin (String)
18. Sector (String)
19. Gender (Integer)
20. Birth State (String)
21. Birth City (String)
22. Age (Integer)
23. Nationality (String)
24. Indigenous (Integer)
25. Migrant (Integer)
26. Original Country (String)
After initial data exploration, Pearson’s correlation coefficient analysis was utilised
to understand how each variable influenced the result and other variables. The Pearson
correlation represented by “r” is used to understand the relationships among various
parameters. If the correlation coefficient value is “1/−1” with the output, it demonstrates
that there is a perfect relationship, while 0 indicates it has no effect. If the correlation coefficient
value is positive, it shows that the variable affects the result positively. If it is
negative, it offers an inverse impact on the output. The correlation coefficient analysis
technique is based on the premise that the significance of a feature set within a dataset may
be evaluated by examining the strength of the association between variables’ characteristics.
If the values range between 0.7 and 1.0, it is a strong correlation. If the values range between
0.3 and 0.7, it is considered a moderate correlation. Any value below 0.3 indicates a weak
correlation [67]. Figure 2 shows the variables with a high and low correlation with the
target variable (RT-PCR result). Some variables have a slight positive correlation
and some variables have a slight negative correlation with the result. The “pneumonia”
attribute shows the highest correlation among all variables, followed by age. This suggests
that older adults are at an increased risk of contracting the virus. Some other interesting
details emerged from the coefficient analysis as well, such as men being at a higher
risk of contracting the disease than women. The comorbidities also played an important
role, and the features that had the most influence were hypertension, diabetes and obesity.
Autoimmune diseases did not affect the result. A threshold modulus value of 0.01 was
set to further eliminate the variables which had negligible influence on the output. Based
on this value, the features COPD, asthma, autoimmune disease, cardiovascular disease
and renal chronic disease were eliminated. This helped to narrow the dataset to the ten
best features.
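The correlation-based filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `pearson_r` and `select_features`, the toy columns and the toy target are all hypothetical, and only the 0.01 absolute-value threshold comes from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.01):
    """Keep the features whose |r| with the target is at least `threshold`."""
    scores = {name: pearson_r(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Toy data for illustration only (not drawn from the Mexican dataset).
toy_columns = {
    "pneumonia": [1, 0, 1, 1, 0, 0, 1, 0],
    "age": [70, 25, 64, 58, 31, 40, 77, 22],
    "uninformative": [1, 1, 0, 1, 0, 1, 0, 0],
}
toy_target = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical RT-PCR result

kept, scores = select_features(toy_columns, toy_target)
```

In this toy example the `uninformative` column has zero correlation with the target and is dropped, while the other two columns are retained. Comparing the absolute value of r against the threshold keeps features with either a positive or a negative association.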
Figure 2. Pearson’s correlation matrix which indicates the strength of the relationship among
variables.
3.3. Some Machine Learning Algorithms and Related Terminologies
The first step in the ML process is to gather reliable data from a range of sources.
This stage of data collection is critical to the modelling process. Choices such as selecting
improper features or concentrating only on a subset of the data set’s items might make the
model less efficient. It is critical to take the required precautions while obtaining data since
errors committed at this point will only compound as the work advances to the subsequent
phases. The second step involves data preparation and processing. The primary
objective of this step is to identify and mitigate any possible biases in the data sources and
their characteristics. All the data are combined and randomized in
this stage. This ensures that data are dispersed uniformly and the ordering has no effect
on the learning process. The data must be analysed carefully to understand them
and their properties. Unnecessary features, such as names and IDs, which
have no significance to the model’s output, were removed. Further processing was done
to find any discrepancies present, such as missing, duplicate or
wrong data, which can skew the results. This can be performed by visualizing the data in
order to comprehend their structure and the relationships between the variables and classes.
Exploratory analysis can help us detect imbalances and relationships within the data, and
outliers and null values can be systematically eliminated. Further, feature scaling may be
performed to bring all features onto a comparable range of values. Data transformation by feature scaling
also has other benefits such as an increased training speed, better prediction outputs and
effective memory utilization. There are two major types of feature scaling: normalization
and standardization. Normalization is a mapping method that creates new ranges from
existing ones [68]. Of the several methods of normalization and standardization, such as (a)
scaling to a range, (b) clipping, (c) log scaling and (d) Z-score, we look at min-max scaling
which is a popular one where the values are converted in the range of 0 and 1 or −1 and 1.
The simple formula for the min-max scaler, which scales data to a range, is:
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (1)$$
Here $X$, $X_{\max}$, $X_{\min}$ and $X'$ represent the original value, maximum value, minimum
value and normalized value of the feature, respectively. Standardization, also known as
variable scaling, is another scaling technique [69]. It results in zero mean and unit standard
deviation for each attribute in the dataset. It is also referred to as z-score normalization and
can be defined as follows:
$$X' = \frac{X - \mu}{\sigma} \quad (2)$$
Here $X$, $\mu$, $\sigma$ and $X'$ represent the original value, mean value, standard deviation and
the standardized value for an attribute, respectively. Any of the above scaling methods
can be used. To further enhance the accuracy, string and object data type
attributes are converted to integer types. Another critical part of data processing
is segmenting the dataset into train-test splits. The bigger portion is used
to train the model, while the smaller portion is used to evaluate it. Furthermore,
the dataset should be divided in such a way that neither portion is biased.
Keeping the two portions separate is critical, since reusing the same data for training and
evaluation distorts
the estimate of the model’s performance. A processed input for the ML model may significantly increase
its performance. It may also aid in decreasing the model’s errors, resulting in increased
prediction accuracy. As a result, it is essential to consider and examine the datasets to
fine-tune them for better classification results. The next step is to choose a model which best
aligns with the dataset. Different algorithms were created with distinct objectives in mind.
It is imperative to select a model that is appropriate for the given problem from a variety of
models designed for a spectrum of tasks, including voice recognition, image classification
and general prediction. In this study, supervised classification algorithms were utilised to
build models for predicting COVID-19 infection. Algorithms such as logistic regression,
random forest, artificial neural networks (ANNs), decision trees and ensemble models such
as extreme gradient boosting (XGBoost) were used for training purposes. The next step in
the ML process cycle is the training stage. The pre-processed data are fed into the model
which then learns the underlying patterns in it. Most of the dataset is utilized for training.
This step takes a considerable amount of time, as training models on large datasets with
complex patterns requires many iterative improvements on the part of the optimization
algorithm. Once the model is trained, the final step is evaluating it to see how well it
performs. The evaluation shows how well the model predicts by testing it on data it has not
previously been exposed to, i.e., the test set. By testing it on the unseen data, we can obtain
a better understanding of whether the model is able to adapt to new information and generalise to
give correct outputs.
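The two scaling formulas and the train-test split described above can be sketched in a few lines. This is a minimal standard-library illustration, not the pipeline used in the study: the helper names, the 80/20 split ratio, the seed and the example ages are assumptions made for demonstration.

```python
import random
import statistics


def min_max_scale(values):
    """Min-max normalization: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score standardization: zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical ages, for illustration only.
ages = [22, 35, 47, 58, 70]
scaled = min_max_scale(ages)
z_scores = standardize(ages)

# Ten hypothetical record identifiers split 80/20.
train_rows, test_rows = train_test_split(list(range(10)), test_fraction=0.2)
```

Shuffling before splitting removes any effect of record ordering. In practice the scaling statistics (minimum, maximum, mean, standard deviation) are usually computed on the training portion only and then reused for the test portion, so that no information from the test set leaks into training.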
Choosing the right model for the task also depends on successful
hyperparameter tuning. Hyperparameter tuning seeks to build on the favourable
outcomes obtained during the previous training cycles. The model is analysed and
improved by fine-tuning its parameters. The values for which the performance
peaks are retained and utilized to build the final model.
The term hyperparameter tuning refers to the process of determining these values.
There are several methods to determine these optimal values: one of these is to
return to the training stage and train the model using several iterations of the training data.
This might result in increased accuracy since the extended length of training exposes the
model to more variations of system parameters applied to the training set and increases its
quality by exploring a broader region of the search space. Another approach is to refine the
model’s initial values. Arbitrary starting values often provide suboptimal outcomes. How-
ever, if we can improve the starting values or possibly start the model with a distribution
rather than a number, the predictions may improve. There are also hyperparameters that
one may tweak to observe changes in model performance. Examples of hyperparameters
of a simple model that can be altered are the learning rate, the loss function applied and the
number of training steps. In this work, we use a grid search optimization technique to obtain
optimized values for the parameters. Grid search is a tuning technique which performs
a comprehensive search by evaluating every combination of values within a
hyperparameter space which has been manually specified.
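The grid search idea can be illustrated with a short sketch. It is a generic exhaustive search written for this explanation, not the tuning code used in the study: `grid_search`, `toy_score` and the grid values are hypothetical, and `toy_score` merely stands in for a validation score that would normally come from training and evaluating a model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid` and return the best one.

    param_grid maps a hyperparameter name to the list of values to try;
    score_fn takes a dict of hyperparameters and returns a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a validation score.

    By construction it peaks at learning_rate=0.1 and max_depth=4.
    """
    return -abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 4)


# Hypothetical search space: 3 x 3 = 9 combinations are evaluated.
toy_grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6]}
best_params, best_score = grid_search(toy_grid, toy_score)
```

The cost of grid search grows multiplicatively with the number of values per hyperparameter, which is why the search space is normally restricted to a few candidate values for each one.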
Once model tuning is complete, the trained model is available for the final step in the
pipeline: making predictions. At this point, the model is deployed for use
on unseen data. The model operates without human intervention and makes its
predictions based on the test input and the mapping it has learned from the training data. The
machine learning algorithms used for this research are elaborated below. Figure 3 describes
the process-flow of this research.
• Logistic regression: For binary and multiclass classification problems, logistic regres-
sion is an extensively used statistical classification approach. The logistic function is
used to forecast the likelihood of a class label [70]. The model gives exceptional results
when the labels are binary. Contrary to its name, this is a classification model, not a
regression model. It is quite simple to implement and achieves excellent performance
when using linearly separable classes. It uses the sigmoid function to classify the
instances. The mathematical equation for logistic regression can be given as:
$$P(Y = C \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \quad (3)$$
where P is the probability that Y belongs to class C and β0 and β1 are model parameters.
• Random forest: The random forest (RF) method is a widely used machine learning
technique that aggregates the output of numerous decision trees (DT) to produce a
single result [71]. It is based on the notion of ensemble learning, which is a method for
integrating several weak classifiers in order to solve a complex problem. It can be used
for both regression and classification problems. RF is a technique that extends the
bagging approach by combining bagging with feature randomization to generate an
uncorrelated forest of decision trees. It draws bootstrap samples from the training data,
and the model builds a tree on each bootstrap sample.
The final forecast is based on the majority vote across the trees for each class.
In general, the larger the number of trees in the forest, the better the reliability, and the
chance of overfitting also decreases. Further, it provides great flexibility since it can
perform both classification and regression tasks with high accuracy. It can also be
used to understand the importance of each feature. However, its main disadvantage
is that the models are complex and require considerable time and memory to train.
The equations to calculate the Gini impurity and entropy are described in
Equations (4) and (5). Both Gini impurity and entropy are measures of impurity of
a node.
$$\text{Gini Impurity} = \sum_{k=1}^{c} f_k (1 - f_k) \tag{4}$$
$$\text{Entropy} = \sum_{k=1}^{c} -f_k \log(f_k) \tag{5}$$
where f_k is the frequency of label k and c represents the number of labels.
• XGBoost: The extreme gradient boosting (XGBoost) [72] algorithm is another prediction
modelling algorithm based on ensemble learning, which can be applied to
classification, regression and ranking problems. Generally, gradient boosting algorithms
may suffer from overfitting as a result of imbalanced data [72]. However, the
regularisation term in the XGBoost technique mitigates the danger of model
overfitting. It is an iterative tree-based ensemble classifier which seeks to improve
the model’s accuracy by boosting, so that each new tree is fitted to decrease the
classification error left by the previous iterations. The algorithm has a number of
hyperparameters, and a well-chosen combination improves the model’s performance.
XGBoost can also make use of several CPU cores, allowing for parallel computation
during training. The objective function of XGBoost is given by the sum
of loss and regularization function as described in Equation (6).
$$\text{obj}(\theta) = \sum_{k=1}^{n} l(y_k, \hat{y}_k) + \sum_{j=1}^{J} \Omega(f_j) \tag{6}$$
$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \tag{7}$$
The final classifier output H(x) for an input x is given by the sign of the weighted
sum of the outcomes of the T weak classifiers, denoted by h_t(x). Each weight α_t
is calculated from the error term of the corresponding weak classifier.
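The impurity measures in Equations (4) and (5) can be illustrated with a minimal Python sketch. The function names, the example labels and the choice of a base-2 logarithm are assumptions made for illustration; the text does not specify the logarithm base.

```python
import math
from collections import Counter


def label_frequencies(labels):
    """Return the relative frequency f_k of each distinct label in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [n / total for n in counts.values()]


def gini_impurity(labels):
    """Equation (4): sum over labels of f_k * (1 - f_k)."""
    return sum(f * (1.0 - f) for f in label_frequencies(labels))


def entropy(labels, base=2.0):
    """Equation (5): sum over labels of -f_k * log(f_k); base 2 is assumed."""
    return sum(-f * math.log(f, base) for f in label_frequencies(labels))


# Hypothetical node holding five positive and five negative cases.
mixed_node = ["positive"] * 5 + ["negative"] * 5
# Hypothetical node holding only negative cases.
pure_node = ["negative"] * 10

print(gini_impurity(mixed_node), entropy(mixed_node))
print(gini_impurity(pure_node), entropy(pure_node))
```

Both measures are largest for the evenly mixed node and vanish for the pure node, which is why a tree prefers splits that lower them.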
Figure 3. Process workflow of COVID-19 diagnosis using machine learning.
• SMOTE: Data imbalance is a common problem in medical machine learning and often
biases a model towards the majority class. Most traditional classifier learning
techniques assume a generally balanced class distribution and equal misclassification
costs, so an imbalanced class distribution carries a considerable performance
penalty. An effective method to overcome dataset imbalance in ML is the synthetic
minority oversampling technique (SMOTE) [76]. SMOTE employs an oversampling
technique to adjust the initial training set. Rather than just replicating minority class
cases, SMOTE’s central concept is to create new artificial instances which are similar
to the minority class. These instances are constructed by interpolating between
occurrences of the minority class within a specific neighbourhood. In this research, a
technique called Borderline-SMOTE was used. It is based on the principle that
cases close to the class borderline are the most likely to be misclassified, so only
these cases are oversampled [77]. The models are more reliable when the data are
balanced. Figure 5 shows the dataset before and after the use of the Borderline-SMOTE
algorithm. Further, the data were split randomly into an 80:20 ratio, with the larger
proportion of the partition reserved for training the model. The smaller set was used
for testing the models’ performance. It was ensured that both subsets maintained a
similar composition and lacked bias.
Figure 5. (a) Imbalanced classes, (b) balanced classes after using Borderline-SMOTE.
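The interpolation step at the heart of SMOTE can be sketched in a few lines of Python. This is a simplified illustration of plain SMOTE, not the Borderline-SMOTE variant used in this research, and the function names, the toy points and the parameter values are assumptions.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority_points, n_new=5, k=2, seed=1)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so the new cases stay inside the region already occupied by the minority class.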
1. Introduction
Coronaviruses are a family of enveloped, highly diverse, single-stranded viruses and
are closely correlated to RNA viruses that infect birds and mammals [1]. They have a
diameter of 60–140 nm and a genome size from 26–32 kb. When viewed under an electron
microscope, they appear to look like a crown due to the glycoprotein spike-like projections
on their surface, which resemble a solar corona [2]. Even though the majority of human
coronaviruses (HCoV-NL63, HCoV-OC43, HCoV-229E, and HCoV-HKU1) cause minor
illnesses, the epidemics of two betacoronaviruses (β-CoV), Middle East respiratory
syndrome coronavirus (MERS-CoV) and severe acute respiratory syndrome coronavirus
(SARS-CoV), in the last two decades have resulted in high mortality rates of 37% and 10%,
respectively [3]. The novel coronavirus disease of 2019, also known as COVID-19, is
caused by a strain of coronavirus named severe acute respiratory syndrome coronavirus
2 (SARS-CoV-2). Its symptoms include cough, fever, fatigue, shortness of breath, body
aches and loss of taste or smell [4]. Epidemiological studies have shown that elderly
individuals are more prone to severe illnesses, while children often have milder symptoms
• SHapley Additive exPlanations (SHAP): SHAP is based on the principle of game theory and
it is used to increase the interpretability and transparency of the ML models [78]. Most
ML and deep learning models are compatible with SHAP. The ‘Tree-Explainer’ procedure
is mainly used in tree-based classifiers such as decision tree, random forest and
other boosting algorithms. SHAP employs a variety of visual descriptions to convey
the importance of attributes and how they influence the model’s decision making. The
baseline estimates of various parameters are compared to forecast the prediction.
• Local Interpretable Model-Agnostic Explanations (LIME): LIME is independent of any
model and can be used with all the existing classifiers [79]. By perturbing the input
data points and seeing how the predictions vary, the technique seeks to understand
the model’s prediction. To acquire a deeper understanding of the black-box model,
specific approaches look at the fundamental components and how they interact in
LIME. It also modifies the attribute values in a particular order before assessing the
impact on the whole outcome.
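The game-theoretic idea behind SHAP can be illustrated by computing exact Shapley values for a very small model. The sketch below is not the SHAP library or its Tree-Explainer; the risk function, its feature names and its numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import permutations


def shapley_values(value, features):
    """Average each feature's marginal contribution over all orderings."""
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = frozenset()
        for f in order:
            with_f = present | {f}
            contributions[f] += value(with_f) - value(present)
            present = with_f
    return {f: total / len(orderings) for f, total in contributions.items()}


def toy_risk(present):
    """Hypothetical risk score given the set of features that are 'present'."""
    score = 0.10  # baseline risk with no feature present
    if "pneumonia" in present:
        score += 0.30
    if "age" in present:
        score += 0.10
    if "pneumonia" in present and "age" in present:
        score += 0.20  # interaction term shared between both features
    return score


values = shapley_values(toy_risk, ["pneumonia", "age"])
print(values)
```

The attributions add up to the difference between the full prediction and the baseline, which is the additive property that SHAP plots rely on. Enumerating all orderings is only feasible for a handful of features; practical tools use approximations or model-specific shortcuts.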
$$\text{Precision} = \frac{tp}{tp + fp} \tag{9}$$
• Recall: It is a performance metric that can be defined as the ratio of the patients
correctly diagnosed as COVID-19 positive to the total patients infected by the virus.
This metric emphasizes the false-negative cases. The recall is exceptionally high when
the number of false-negative cases is low. It is calculated by the formula given below:
$$\text{Recall/Sensitivity} = \frac{tp}{tp + fn} \tag{10}$$
• F1-score: It is the harmonic mean of the precision and recall values obtained
previously for the COVID-19 cases, and so gives equal importance to both. It gives a
better idea about the positive cases of the virus obtained. It is given by the following
formula:
$$\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{11}$$
• AUC (area under curve): The ROC (receiver operating characteristic) curve plots the
true positive rate against the false-positive rate at various classification thresholds.
It indicates how well the models are differentiating the binary classes. The area under
this curve is the AUC. AUC values close to 1 indicate that the classifier is performing
well.
• Confusion matrix: For binary classification, the confusion matrix is a 2 × 2 matrix.
All the classified instances will be in the confusion matrix. The diagonal elements
indicate the correctly classified instances (both true-positive and true-negative). The
non-diagonal elements indicate the wrongly classified instances (both false-positive
and false-negative). All the mentioned performance metrics can be easily calculated
using the confusion matrix.
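As a minimal sketch of how the metrics follow from the confusion matrix, the Python function below applies Equations (9) to (11) to the four cell counts. The accuracy line uses the standard definition, the counts are hypothetical, and the function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the performance metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp)                            # Equation (9)
    recall = tp / (tp + fn)                               # Equation (10)
    f1 = 2 * precision * recall / (precision + recall)    # Equation (11)
    accuracy = (tp + tn) / (tp + fp + fn + tn)            # standard definition
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 90 true positives, 10 false positives,
# 10 false negatives and 90 true negatives.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(metrics)
```

With these counts every metric equals 0.9, because the false positives and false negatives are balanced; unequal error counts pull precision and recall apart.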
Figure 6. Confusion matrix of XGBoost algorithm.
AdaBoost improves on the weaknesses of a single model by combining many weak
learners. It assigns weights to both the data samples and the learners, which makes the
algorithm focus on observations that are difficult to categorize. Further, it makes use of
decision stumps to sequentially train weak learners. During training, accuracy, precision,
recall and F1-score values of 92.1%, 88.9%, 91.2% and 90%, respectively, were achieved
using the AdaBoost model. During the testing phase, the scores were 90.4%, 90.1%, 89.5%
and 89.8%, respectively. The SAMME.R algorithm (a variant of the AdaBoost model) was
used as it adjusts the additive model based on the probability predictions and is reported
to be more accurate and quicker than the conventional classifier [80]. In addition, the
weak learner was varied among base models such as logistic regression, decision tree and
random forest. Decision tree was found to be the most effective.
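The weighted vote of decision stumps can be sketched as follows. This illustrates the classic discrete AdaBoost rule, in which the sign of a weighted sum of weak classifier outputs gives the final label; it is not the SAMME.R variant used in this research. The weight formula is the standard discrete AdaBoost one, and the stumps, feature names and error rates are hypothetical.

```python
import math


def classifier_weight(error):
    """Standard discrete AdaBoost weight for a weak learner with the given
    weighted error rate (0 < error < 1)."""
    return 0.5 * math.log((1.0 - error) / error)


def final_classifier(x, stumps, alphas):
    """Sign of the weighted sum of stump outputs, each in {-1, +1}."""
    total = sum(alpha * stump(x) for stump, alpha in zip(stumps, alphas))
    return 1 if total >= 0 else -1  # ties are resolved as +1 here


# Hypothetical decision stumps, each testing a single patient attribute.
stumps = [
    lambda x: 1 if x["pneumonia"] else -1,
    lambda x: 1 if x["age"] >= 60 else -1,
    lambda x: 1 if x["diabetes"] else -1,
]
# Hypothetical weighted error rates of the three stumps.
errors = [0.20, 0.35, 0.40]
alphas = [classifier_weight(e) for e in errors]

patient = {"pneumonia": True, "age": 40, "diabetes": False}
print(final_classifier(patient, stumps, alphas))
```

A stump with a lower error receives a larger weight, so the most reliable stump can outvote two weaker ones, as happens for the example patient.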
An artificial neural network (ANN) is made up of numerous perceptrons. It learns
by computationally mimicking, in high-level terms, the operating principles of biological
neurons present in the human brain. ANNs are constructed using several interconnected
layers with weighted connections, and backpropagation is used to adjust the weights and
biases after incorporating feedback. After completion of training, the model yielded
accuracy, precision, recall and F1-score of 86.6%, 84.9%, 83.2% and 84.1%, respectively.
For testing, it obtained accuracy, precision, recall and F1-score of 86.2%, 88.2%, 83.1%
and 85.7%, respectively. A decaying learning rate was chosen to maintain convergence.
Further, three hidden layers were used, with leaky rectified linear unit (Leaky ReLU) and
sigmoid as activation functions. The adaptive moment estimation (Adam) optimizer with
a batch size of 32 was utilized. Adam is considered to be a cross between stochastic
gradient descent with momentum and root mean square propagation (RMSprop) [81].
Adam was chosen because it gave the lowest training cost and outperformed the other
optimisers. The number of neurons in the layers and the dropouts were decided using
the grid search technique.
Random forest is a collection of several decision trees. The results of the trees are
combined to classify the instances based on majority voting. The first step is to create a
randomised sample from the original data for each tree. For every node, a random selection
of characteristics is chosen to achieve the best split possible. During training, the accuracy,
precision, recall and F1-score obtained were 91%, 91.6%, 89.9% and 90.7%, respectively.
During testing, the accuracy, precision, recall and F1-score obtained were 89%, 88.3%,
88.1% and 88.2%, respectively. To optimize the model’s output, a variety of hyperparameter
tuning methods were utilized. Tree count, node depth, the number of leaf nodes and the
branch level were some of the parameters considered.
KNN assigns new data points to categories based on a similarity measure, which is
often a distance measure such as Euclidean distance or Manhattan distance. It classifies
a new instance by majority vote among its K nearest neighbours. After training, the
accuracy, precision, recall and F1-score obtained were 91.9%, 92.3%, 90.6% and
91.3%, respectively. During the testing phase, the accuracy, precision, recall and F1-score
obtained were 91.6%, 91.7%, 90.5% and 91%, respectively. The most important parameter
for the KNN algorithm is the value of ‘K’ (the number of neighbours to consider). In this
research, the elbow method was used to find the optimal value of ‘K’ [82]. Further, the ball
tree algorithm was used to search for neighbours since the dataset was large and had
complex patterns [82]. Other parameters, such as leaf size, bias weights and metrics, were
also optimized using the grid search technique.
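The voting rule described above can be sketched with a brute-force neighbour search in Python. This is an illustration only: it does not use the ball tree structure mentioned in the text, and the points, labels and value of K are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training
    points, using Euclidean distance and a brute-force search."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data forming two separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["negative"] * 3 + ["positive"] * 3

print(knn_predict(points, labels, query=(0.2, 0.2), k=3))
print(knn_predict(points, labels, query=(5.5, 5.5), k=3))
```

A brute-force search compares the query with every training point, which is why tree-based indexes such as the ball tree become attractive for large datasets.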
Binary logistic regression uses the sigmoid function to classify instances. After training
the model, the accuracy, precision, recall and F1-score obtained were 84.2%, 73.3%, 63.8%
and 68.2%, respectively. For testing, the model obtained accuracy, precision, recall
and F1-score of 78.4%, 70%, 60.1% and 64.7%, respectively. Compared to the other
models, the performance of logistic regression was poor, most likely because it fits only
a linear decision boundary. The model was fitted using the gradient descent algorithm,
and the regularization parameter ‘C’ was tuned over values from 0.01 to 100.
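A minimal sketch of logistic regression fitted by gradient descent is given below for a single feature. It assumes that ‘C’ acts as an inverse regularization strength on an L2 penalty, as is common in machine learning libraries; the data, learning rate and number of epochs are hypothetical and are not the settings used in this research.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, C=1.0, lr=0.05, epochs=5000):
    """Fit intercept b0 and slope b1 by batch gradient descent.
    An L2 penalty of strength 1/C is applied to the slope only."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            g0 += err
            g1 += err * x
        g1 += b1 / C  # gradient of the penalty term b1**2 / (2 * C)
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Hypothetical one-feature data: negative class on the left, positive on the right.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

for C in (0.01, 1.0, 100.0):
    b0, b1 = fit_logistic(xs, ys, C=C)
    print(C, round(b1, 3), round(sigmoid(b0 + b1 * 2.0), 3))
```

Under this assumption, smaller values of C shrink the slope towards zero and give less confident probabilities, which is the trade-off explored when C is tuned.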
Table 3 summarises the results obtained by the classification algorithms. The AUCs
are described in Figure 7. Experimental results demonstrated that the XGBoost model
performed the best among all the classifiers. ANN, RF, AdaBoost and KNN yielded an
accuracy of 86.2%, 89%, 90.4% and 91.6%, respectively. The training and testing accuracies
of all the models are described in Figure 8. Further, all the metrics of all the classifiers are
pictorially depicted in Figure 9.
Table 3. Summary of the results obtained by various machine learning models used in this research
(in percentage).
Model                 Training                                  Testing
                      Accuracy  Precision  Recall  F1-Score     Accuracy  Precision  Recall  F1-Score
XGBoost               94.5      94.7       93.8    94.2         92        92         91      91.4
AdaBoost              92.1      88.9       91.2    90           90.4      90.1       89.5    89.8
ANN                   86.6      84.9       83.2    84.1         86.2      88.2       83.1    85.7
Random forest         91        91.6       89.9    90.7         89        88.3       88.1    88.2
KNN                   91.9      92.3       90.6    91.3         91.6      91.7       90.5    91
Logistic regression   84.2      73.3       63.8    68.2         78.4      70         60.1    64.7
XGBoost, due to enhancements in its algorithm, was able to model the data
better and give superior results. It works by enhancing the core gradient boosting machines
framework using system optimisations such as pruning. Its approximate greedy algorithm
performs well on the COVID-19 data because it parallelises tree construction, approximates
the splits in the trees and employs a sparsity-aware split finding method which
handles frequent zero entries, missing values and one-hot encoded data. This is very
useful for a large dataset such as this one. XGBoost further takes advantage of the
LASSO (L1) and Ridge (L2) regularization terms to inflict a greater penalty on more
complicated models and so prevent overfitting, along with its convex loss function. It also
implements the weighted quantile sketch technique to locate the ideal split locations for
weighted datasets and has an inbuilt cross-validation routine that can be run at each
boosting iteration. These distinctive characteristics help XGBoost outperform the other
models when they are run independently on the COVID-19 dataset.
Figure 7. AUCs of various classifiers that diagnose COVID-19.
Figure 8. Training and testing accuracies of various classifiers.
Figure 9. Performance metrics of all the classifiers during the testing phase.
RT-PCR and RAT COVID-19 testing can be supplemented using these models, which
is beneficial in areas where there is an acute shortage of the above test kits. The classifiers
can also be used in parallel to prevent false-negative results. They can also be highly useful
during instances such as a pandemic peak. Further, these supervised ML techniques
may be utilised retrospectively. This research demonstrates the potential of ML-based
estimation techniques as tools augmenting interventions against the COVID-19 pandemic.
With customized process pipelines in place, the methods described may also extend to
enable early intervention against other diseases and new pandemics which might occur
in future.
Figure 10. (a) Bar chart using SHAP, (b) beeswarm plot indicating Shapley values.
LIME: The LIME feature importance models are described in Figure 11. Figure 11a
describes a COVID-19 positive patient and Figure 11b describes a COVID-19 negative patient.
LIME explains a prediction by creating new samples near the instance to be
analysed and using the trained model to predict their labels. The new samples are
weighted according to their proximity to the instance, and a linear regression model is
fitted to them. This linear model approximates the behaviour of the trained model on
a micro level. In Figure 11, the colour “blue” indicates a COVID-19 negative diagnosis and
the colour “orange” indicates a COVID-19 positive diagnosis. In Figure 11a, the prediction
probability is higher for the COVID-19 positive class. The score is calculated based on
various parameters such as pneumonia, age, pregnancy, diabetes and hypertension. The
weights of the parameters are also considered along with majority voting in coming to
a final decision. In Figure 11b, the LIME model indicates that the patient is COVID-19
negative. All the parameters except “pregnancy” point to a negative diagnosis. Using LIME,
feature importance can be calculated for each individual patient. According to the explainable
AI techniques, the best features obtained were pneumonia, pregnancy, sex, another_case,
hypertension, age, diabetes, tobacco, obesity and other diseases.
Figure 11. (a) COVID-19 positive diagnosis using LIME. (b) COVID-19 negative diagnosis using LIME.
As a retrospective evaluation technique, ML models can be deployed to predict
COVID-19 diagnosis. This study describes how ML models may be built, validated and used to
swiftly identify patients. The study also highlights the use of feature importance methods
in identifying the most important markers. This aids in reducing the substantial
workload placed on front-line health professionals. This also helps underdeveloped countries
which lack technical and clinical resources under the burden of case volume during an
infection peak.
4.4. Further Discussion
In this research, a set of epidemiological and demographic parameters strongly
associated with COVID-19 was identified. The data also contained details of patients who had
similar symptoms but were diagnosed as COVID-19 negative. Before the actual test results
are obtained, these traits may help the doctors in identifying potential patients.
Many viral diseases cause pneumonia. This condition is extremely dangerous and
can lead to fatality. In severe cases, COVID-19 is known to induce pneumonia along with
conditions such as acute respiratory distress syndrome (ARDS) and multi-organ failure.
However, in this dataset, most of the COVID-19 patients did not suffer from pneumonia.
COVID-19 is known to spread among all humans including pregnant women. However,
most of the pregnant women in this dataset were diagnosed as COVID-19 negative. This
dangerous disease is known to spread rapidly. Nationwide lockdowns were imposed
to prevent the spread of this disease. It was likely that a patient could contract
COVID-19 when they were in contact with another infected patient. Patients with comorbidities,
such as hypertension and diabetes, are more likely to succumb to COVID-19. This
research reinforces that diabetes, tobacco use and obesity increase the chance of infection.
According to the study, the presence of other diseases, apart from the ones mentioned
above, is not extremely dangerous from an infection standpoint. Furthermore, most
patients suffering from hypertension were COVID-19 negative. These are some of the main
inferences made from the study.
The pandemic’s heavy toll on human health and well-being has spurred various
research labs to develop intelligent systems with the purpose of automating the detection
of COVID-19 and the assessment of its severity. However, only a few ML models based on
demographic and epidemiological parameters have been deployed. Muhammed et al. [48]
used ML models to diagnose COVID-19 for the Mexican dataset. Five ML models were
utilized and a maximum accuracy of 95% was obtained by the decision tree model.
However, no feature importance techniques were utilized to understand the model’s
predictions. Juárez et al. [49] used the Mexican dataset for COVID-19 diagnosis. Among the
four ML models, the neural network obtained the maximum accuracy of 93.5%. Iwendi et al. [51]
used AI to diagnose COVID-19 for the Brazilian and Mexican patients. However, the accuracy
obtained for the Mexican dataset was only 69%. Martinez-Velaquez et al. [52] used ML for
early detection of COVID-19 where 22 features were considered and a maximum sensitivity
of 75% was obtained. Rezapour and Colin [53] used ML to understand the relationship
between COVID-19 susceptibility and comorbidities. The abovementioned works are
summarized and compared in Table 4.
Reference  Dataset Origin  ML Models Used     No of Parameters Considered  Accuracy         Feature Importance
[48]       Mexico          Five               10                           94.99%           No
[49]       Mexico          Various ML models  21                           93.50%           No
[51]       Mexico          Various ML models  -                            69%              No
[52]       Mexico          Various ML models  22                           Sensitivity-75%  Gini Index
[53]       Mexico          Various ML models  14                           Qualitative      No
Proposed   Mexico          Six                10                           94.50%           SHAP and LIME
In this research, ML was used to analyse the epidemiological and demographic
parameters in predicting the occurrence of infection with the coronavirus causing COVID-19.
These parameters are often easily available in shorter time intervals and at a lower price
than radiographic and molecular tests. A dataset from Mexico was utilized and six machine
learning models commonly used in medical AI were deployed. Information about patients,
data security, integration of data and automation are the advantages of using electronic
medical records (EMR). We emphasize the use of data-driven models which aim to help
clinicians make better decisions by providing them with valuable information generated by
the trained models. Further, feature importance techniques, such as SHAP and LIME, have
been utilized to make the models’ predictions more interpretable and transparent.
This helps medical professionals during the final diagnosis of the patient.
5.1. Challenges
With AI making progress in leaps and bounds in the development of new algorithms,
the scope of its possible applications has increased. ML has many potential applications
across different medical problems. However, such procedures are rarely used effectively
in clinical practice. The following are some challenges that should be
addressed before widespread adoption is likely.
• Data from a single country: For this research, data were collected from Mexico. How-
ever, data from all geographic areas must be considered for better validation. This is
not a trivial task as there are clear differences in reporting standards and authenticity
across different countries.
6. Conclusions
COVID-19 must be diagnosed as early as possible for the patients to obtain appropriate
treatment and prevent it from spreading to others. Recent studies have shown that
laboratory markers are an effective diagnostic aid since they are relatively cheap and
easily available in most hospitals for implementation schemes using data-driven techniques.
In this work, an extensive review of related literature was conducted in the beginning. The
Author Contributions: Conceptualization, K.C.; methodology, K.C., S.P. and A.P.; validation, S.S.;
formal analysis, S.P.; writing—original draft preparation, A.P.; writing—review and editing, S.S.,
K.C. and S.P.; visualization, S.S.; supervision, G.N. and K.C. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Woo, P.C.; Huang, Y.; Lau, S.K.; Yuen, K.Y. Coronavirus genomics and bioinformatics analysis. Viruses 2010, 2, 1804–1820.
[CrossRef] [PubMed]
2. Hayden, F.; Richman, D.; Whitley, R. Clinical Virology, 4th ed.; ASM Press: Washington, DC, USA, 2017.
3. Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X.; et al. Clinical features of patients infected
with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506. [CrossRef]
4. Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory
syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2. Nat. Microbiol. 2020, 5, 536–544. [CrossRef]
[PubMed]
5. Yuki, K.; Fujiogi, M.; Koutsogiannaki, S. COVID-19 pathophysiology: A review. Clin. Immunol. 2020, 215, 108427. [CrossRef]
6. Liu, K.; Chen, Y.; Lin, R.; Han, K. Review-Clinical features of COVID-19 in elderly patients: A comparison with young and
middle-aged patients. J. Infect. 2020, 80, e14–e18. [CrossRef]
7. Singh, A.K.; Gupta, R.; Ghosh, A.; Misra, A. Diabetes in COVID-19: Prevalence, pathophysiology, prognosis and practical
considerations. Diabetes Metab. Syndr. 2020, 14, 303–310. [CrossRef]
8. Zhang, J.; Wang, X.; Jia, X.; Li, J.; Hu, K.; Chen, G.; Wei, J.; Gong, Z.; Zhou, C.; Yu, H.; et al. Risk factors for disease severity,
unimprovement, and mortality in COVID-19 patients in Wuhan, China. Clin. Microbiol. Infect. 2020, 26, 767–772. [CrossRef]
9. Lu, H.; Stratton, C.W.; Tang, Y.W. Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle. J.
Med. Virol. 2020, 92, 401–402. [CrossRef]
10. Johns Hopkins Coronavirus Resource Center. Available online: https://coronavirus.jhu.edu/ (accessed on 1 June 2022).
11. Lei, S.; Jiang, F.; Su, W.; Chen, C.; Chen, J.; Mei, W.; Zhan, L.; Jia, Y.; Zhang, L.; Liu, D.; et al. Clinical characteristics and
outcomes of patients undergoing surgeries during the incubation period of COVID-19 infection. EClinicalMedicine 2020, 21,
100331. [CrossRef]
12. Li, Q.; Guan, X.; Wu, P.; Wang, X.; Zhou, L.; Tong, Y.; Ren, R.; Leung, K.; Lau, E.; Wong, J.; et al. Early Transmission Dynamics in
Wuhan, China, of Novel Coronavirus–Infected Pneumonia. N. Engl. J. Med. 2020, 382, 1199–1207. [CrossRef]
13. Habibzadeh, P.; Mofatteh, M.; Silawi, M.; Ghavami, S.; Faghihi, M. Molecular diagnostic assays for COVID-19: An overview. Crit.
Rev. Clin. Lab. Sci. 2021, 58, 385–398. [CrossRef] [PubMed]
14. Mahendiratta, S.; Batra, G.; Sarma, P.; Kumar, H.; Bansal, S.; Kumar, S.; Prakash, A.; Sehgal, R.; Medhi, B. Molecular diagnosis of
COVID-19 in different biologic matrix, their diagnostic validity and clinical relevance: A systematic review. Life Sci. 2020, 258,
118207. [CrossRef] [PubMed]
15. Goudouris, E.S. Laboratory diagnosis of COVID-19. J. Pediatr. 2021, 97, 7–12. [CrossRef] [PubMed]
16. Zhu, H.; Zhang, H.; Xu, Y.; Laššáková, S.; Korabečná, M.; Neužil, P. PCR past, present and future. BioTechniques 2020, 69, 317–325.
[CrossRef]
17. Falzone, L.; Gattuso, G.; Tsatsakis, A.; Spandidos, D.A.; Libra, M. Current and innovative methods for the diagnosis of COVID-19
infection (Review). Int. J. Mol. Med. 2021, 47, 100. [CrossRef]
18. Yang, Y.; Yang, M.; Yuan, J.; Wang, F.; Wang, Z.; Li, J.; Zhang, M.; Xing, L.; Wei, J.; Peng, L.; et al. Laboratory Diagnosis and
Monitoring the Viral Shedding of SARS-CoV-2 Infection. Innovation 2020, 1, 100061. [CrossRef]
19. Kucirka, L.M.; Lauer, S.A.; Laeyendecker, O.; Boon, D.; Lessler, J. Variation in False-Negative Rate of Reverse Transcriptase
Polymerase Chain Reaction–Based SARS-CoV-2 Tests by Time Since Exposure. Ann. Intern. Med. 2020, 173, 262–267. [CrossRef]
20. Burog, A.; Yacapin, C.; Maglente, R.; Macalalad-Josue, A.; Uy, E.; Dans, A.; Dans, L. Should IgM/IgG rapid test kit be used in the
diagnosis of COVID-19? Acta Med. Philipp. 2020, 54, 1–12. [CrossRef]
21. Yu, K.H.; Beam, A.L.; Kohane, I.S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2018, 2, 719–731. [CrossRef]
22. Rustam, F.; Reshi, A.A.; Mehmood, A.; Ullah, S.; On, B.; Aslam, W.; Choi, G.S. COVID-19 Future Forecasting Using Supervised
Machine Learning Models. IEEE Access 2020, 8, 101489–101499. [CrossRef]
23. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Emerg. Artif. Intell. Appl. Comput. Eng.
2007, 160, 3–24.
24. Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
25. Liu, D.; Clemente, L.; Poirier, C.; Ding, X.; Chinazzi, M.; Davis, J.T.; Vespignani, A.; Santillana, M. A machine learning methodology
for real-time forecasting of the 2019–2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic
models. arXiv 2020, arXiv:2004.04019.
26. Saravanan, R.; Sujatha, P. A state of art techniques on machine learning algorithms: A perspective of supervised learning
approaches in data classification. In Proceedings of the IEEE 2018 Second International Conference on Intelligent Computing and
Control Systems (ICICCS), Madurai, India, 14–15 June 2018; pp. 945–949.
27. Kaelbling, L.; Littman, M.; Moore, A. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285. [CrossRef]
28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
29. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
30. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent Trends in Deep Learning Based Natural Language Processing. IEEE Comput.
Intell. Mag. 2018, 13, 55–75. [CrossRef]
31. Pak, M.S.; Kim, S.H. A review of deep learning in image recognition. In Proceedings of the International Conference on Computer
Applications and Information Processing Technology, Kuta Bali, Indonesia, 8–10 August 2017; pp. 1–3.
32. Shokeen, J.; Rana, C. An Application-oriented Review of Deep Learning in Recommender Systems. Int. J. Intell. Syst. Appl. 2019,
11, 46–54. [CrossRef]
33. Lee, W.; Seong, J.J.; Ozlu, B.; Shim, B.S.; Marakhimov, A.; Lee, S. Biosignal Sensors and Deep Learning-Based Speech Recognition:
A Review. Sensors 2021, 21, 1399. [CrossRef]
34. Chadaga, K.; Prabhu, S.; Vivekananda, B.K.; Niranjana, S.; Umakanth, S. Battling COVID-19 using machine learning: A review.
Cogent Eng. 2021, 8, 1958666. [CrossRef]
35. Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018,
9, 515. [CrossRef]
36. Toğaçar, M.; Ergen, B.; Cömert, Z.; Özyurt, F. A Deep Feature Learning Model for Pneumonia Detection Applying a Combination
of mRMR Feature Selection and Machine Learning Models. IRBM 2020, 41, 212–222. [CrossRef]
37. Kourou, K.; Exarchos, T.; Exarchos, K.P.; Karamouzis, M.V.; Fotiadis, D.I. Machine learning applications in cancer prognosis and
prediction. Comput. Struct. Biotechnol. J. 2015, 13, 8–17. [CrossRef] [PubMed]
38. Pellegrini, E.; Ballerini, L.; Hernandez, M.D.C.V.; Chappell, F.M.; González-Castro, V.; Anblagan, D.; Danso, S.; Muñoz-Maniega,
S.; Job, D.; Pernet, C.; et al. Machine learning of neuroimaging for assisted diagnosis of cognitive impairment and dementia: A
systematic review. Alzheimer Dement. Diagn. Assess. Dis. Monit. 2018, 10, 519–535. [CrossRef]
39. Bind, S.; Tiwari, A.K.; Sahani, A.K. A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction. Int. J.
Comput. Sci. Inf. Technol. 2015, 6, 1648–1655.
40. Musunuri, B.; Shetty, S.; Shetty, D.K.; Vanahalli, M.K.; Pradhan, A.; Naik, N.; Paul, R. Acute-on-Chronic Liver Failure Mortality
Prediction using an Artificial Neural Network. Eng. Sci. 2021, 15, 187–196. [CrossRef]
41. Lalmuanawma, S.; Hussain, J.; Chhakchhuak, L. Applications of machine learning and artificial intelligence for COVID-19
(SARS-CoV-2) pandemic: A review. Chaos Solitons Fractals 2020, 139, 110059. [CrossRef]
42. Zu, Z.Y.; Jiang, M.D.; Xu, P.P.; Chen, W.; Ni, Q.Q.; Lu, G.M.; Zhang, L.J. Coronavirus Disease 2019 (COVID-19): A Perspective
from China. Radiology 2020, 296, E15–E25. [CrossRef]
43. Lee, E.Y.P.; Ng, M.-Y.; Khong, P.-L. COVID-19 pneumonia: What has CT taught us? Lancet Infect. Dis. 2020, 20, 384–385. [CrossRef]
44. Narin, A.; Kaya, C.; Pamuk, Z. Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep
Convolutional Neural Networks. Pattern Anal. Appl. 2021, 24, 1207–1220. [CrossRef]
45. Ozturk, T.; Talo, M.; Yildirim, E.A.; Baloglu, U.B.; Yildirim, O.; Acharya, U. Automated detection of COVID-19 cases using deep
neural networks with X-ray images. Comput. Biol. Med. 2020, 121, 103792. [CrossRef]
46. Smith-Bindman, R.; Yu, S.; Wang, Y.; Kohli, M.D.; Chu, P.; Chung, R.; Luong, J.; Bos, D.; Stewart, C.; Bista, B.; et al. An Image
Quality–informed Framework for CT Characterization. Radiology 2022, 302, 380–389. [CrossRef]
47. Muhammad, L.J.; Algehyne, E.A.; Usman, S.S.; Ahmad, A.; Chakraborty, C.; Mohammed, I.A. Supervised Machine Learning
Models for Prediction of COVID-19 Infection using Epidemiology Dataset. SN Comput. Sci. 2020, 2, 11. [CrossRef] [PubMed]
48. Franklin, M.R. Mexico COVID-19 Clinical Data. Available online:
https://www.kaggle.com/marianarfranklin/mexico-covid19-clinical-data/metadata (accessed on 26 June 2020).
49. Quiroz-Juárez, M.A.; Torres-Gómez, A.; Hoyo-Ulloa, I.; León-Montiel, R.D.J.; U’Ren, A.B. Identification of high-risk COVID-19
patients using machine learning. PLoS ONE 2021, 16, e0257234. [CrossRef] [PubMed]
50. Prieto, K. Current forecast of COVID-19 in Mexico: A Bayesian and machine learning approaches. PLoS ONE 2022, 17, e0259958.
[CrossRef] [PubMed]
51. Iwendi, C.; Huescas, C.G.Y.; Chakraborty, C.; Mohan, S. COVID-19 health analysis and prediction using machine learning
algorithms for Mexico and Brazil patients. J. Exp. Theor. Artif. Intell. 2022, 1, 1–21. [CrossRef]
52. Martinez-Velazquez, R.; Tobon, V.D.P.; Sanchez, A.; El Saddik, A.; Petriu, E. A Machine Learning Approach as an Aid for Early
COVID-19 Detection. Sensors 2021, 21, 4202. [CrossRef]
53. Rezapour, M.; Varady, C.A. A machine learning analysis of the relationship between some underlying medical conditions and
COVID-19 susceptibility. arXiv 2021, arXiv:2112.12901.
54. Maouche, I.; Terrissa, S.L.; Benmohammed, K.; Zerhouni, N.; Boudaira, S. Early Prediction of ICU Admission Within COVID-19
Patients Using Machine Learning Techniques. In Innovations in Smart Cities Applications; Springer: Cham, Switzerland, 2021;
Volume 5, pp. 507–517.
55. Delgado-Gallegos, J.L.; Avilés-Rodriguez, G.; Padilla-Rivas, G.R.; Cosio-León, M.D.l.Á.; Franco-Villareal, H.; Zuñiga-Violante, E.;
Romo-Cardenas, G.S.; Islas, J.F. Clinical applications of machine learning on COVID-19: The use of a decision tree algorithm for
the assessment of perceived stress in Mexican healthcare professionals. medRxiv 2020. [CrossRef]
56. Yadav, A. Predicting Covid-19 using Random Forest Machine Learning Algorithm. In Proceedings of the 2021 12th International
Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6 July 2021; pp. 1–6.
57. Mukherjee, R.; Kundu, A.; Mukherjee, I.; Gupta, D.; Tiwari, P.; Khanna, A.; Shorfuzzaman, M. IoT-cloud based healthcare model
for COVID-19 detection: An enhanced k-Nearest Neighbour classifier based approach. Computing 2021, 1–21. [CrossRef]
58. Chaudhary, L.; Singh, B. Community detection using unsupervised machine learning techniques on COVID-19 dataset. Soc. Netw.
Anal. Min. 2021, 11, 28. [CrossRef]
59. Cornelius, E.; Akman, O.; Hrozencik, D. COVID-19 Mortality Prediction Using Machine Learning-Integrated Random Forest
Algorithm under Varying Patient Frailty. Mathematics 2021, 9, 2043. [CrossRef]
60. Wollenstein-Betech, S.; Cassandras, C.G.; Paschalidis, I.C. Personalized predictive models for symptomatic COVID-19 patients
using basic preconditions: Hospitalizations, mortality, and the need for an ICU or ventilator. Int. J. Med. Inform. 2020, 123, 11–22.
[CrossRef] [PubMed]
61. Durden, B.; Shulman, M.; Reynolds, A.; Phillips, T.; Moore, D.; Andrews, I.; Pouriyeh, S. Using Machine Learning Techniques to
Predict RT-PCR Results for COVID-19 Patients. In Proceedings of the 2021 IEEE Symposium on Computers and Communications
(ISCC), Athens, Greece, 5–8 September 2021; pp. 1–4.
62. Guzmán-Torres, J.A.; Alonso-Guzmán, E.M.; Domínguez-Mota, F.J.; Tinoco-Guerrero, G. Estimation of the Main Conditions in
(SARS-CoV-2) COVID-19 Patients That Increase the Risk of Death Using Machine Learning, the Case of Mexico; Elsevier: Amsterdam,
The Netherlands, 2021; Volume 27.
63. Chadaga, K.; Prabhu, S.; Umakanth, S.; Bhat, V.K.; Sampathila, N.; Chadaga, R.P.; Prakasha, K.K. COVID-19 Mortality Prediction
among Patients Using Epidemiological Parameters: An Ensemble Machine Learning Approach. Eng. Sci. 2021, 16, 221–233.
[CrossRef]
64. Chadaga, K.; Chakraborty, C.; Prabhu, S.; Umakanth, S.; Bhat, V.; Sampathila, N. Clinical and laboratory approach to diagnose
COVID-19 using machine learning. Interdiscip. Sci. Comput. Life Sci. 2022, 14, 452–470. [CrossRef] [PubMed]
65. Almansoor, M.; Hewahi, N.M. Exploring the Relation between Blood Tests and COVID-19 Using Machine Learning. In
Proceedings of the 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable
Economy (ICDABI), Sakheer, Bahrain, 26–27 October 2020; pp. 1–6.
66. Open Data General Directorate of Epidemiology. Available online:
https://www.gob.mx/salud/documentos/datos-abiertos-152127 (accessed on 26 March 2022).
67. Ahlgren, P.; Jarneving, B.; Rousseau, R. Requirements for a cocitation similarity measure, with special reference to Pearson’s
correlation coefficient. J. Am. Soc. Inf. Sci. Technol. 2003, 54, 550–560. [CrossRef]
68. Devillanova, G.; Solimini, S. Min-max solutions to some scalar field equations. Adv. Nonlinear Stud. 2012, 12, 173–186. [CrossRef]
69. Thara, T.D.K.; Prema, P.S.; Xiong, F. Auto-detection of epileptic seizure events using deep neural network with different feature
scaling techniques. Pattern Recognit. Lett. 2019, 128, 544–550.
70. Nick, T.G.; Campbell, K.M. Logistic regression. Methods Mol. Biol. 2007, 404, 273–301.
71. Belgiu, M.; Drăguţ, L. Random Forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm.
Remote Sens. 2016, 114, 24–31. [CrossRef]
72. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing
Machinery: New York, NY, USA, 2016; pp. 785–794.
73. Schapire, R.E. Explaining AdaBoost. In Empirical Inference; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52.
74. Zhang, M.; Zhou, Z. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [CrossRef]
75. Krogh, A. What are Artificial Neural Networks? Nat. Biotechnol. 2008, 26, 195–197. [CrossRef]
76. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
77. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. Adv. Intell.
Comput. 2005, 3644, 878–887.
78. Parsa, A.B.; Movahedi, A.; Taghipour, H.; Derrible, S.; Mohammadian, A. Toward Safer Highways, Application of XGBoost and
SHAP for Real-Time Accident Detection and Feature Analysis. Accid. Anal. Prev. 2019, 136, 105405. [CrossRef] [PubMed]
79. Visani, G.; Bagli, E.; Chesani, F.; Poluzzi, A.; Capuzzo, D. Statistical stability indices for LIME: Obtaining reliable explanations for
machine learning models. J. Oper. Res. Soc. 2020, 73, 91–101. [CrossRef]
80. Hatwell, J.; Gaber, M.M.; Azad, R.M.A. Ada-WHIPS: Explaining AdaBoost classification with applications in the health sciences.
BMC Med. Inform. Decis. Mak. 2020, 20, 250. [CrossRef]
81. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
82. Dhanabal, S.; Chandramathi, S. A review of various K-nearest neighbor query processing techniques. Int. J. Comput. Appl. Technol.
2011, 31, 14–22.