Article
Multi-Classification of Motor Imagery EEG Signals Using
Bayesian Optimization-Based Average Ensemble Approach
Souha Kamhi 1,2, *, Shuai Zhang 1,2, *, Mohamed Ait Amou 3 , Mohamed Mouhafid 3 , Imran Javaid 1,2 ,
Isah Salim Ahmad 1,2 , Isselmou Abd El Kader 1,2 and Ummay Kulsum 1,2

1 State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology,
Tianjin 300130, China; [email protected] (I.J.); [email protected] (I.S.A.);
[email protected] (I.A.E.K.); [email protected] (U.K.)
2 Hebei Key Laboratory of Bioelectromagnetism and Neural Engineering, School of Health Sciences &
Biomedical Engineering, Hebei University of Technology, Tianjin 300130, China
3 School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China;
[email protected] (M.A.A.); [email protected] (M.M.)
* Correspondence: [email protected] (S.Z.); [email protected] (S.K.)

Abstract: Motor Imagery (MI) classification using electroencephalography (EEG) has been extensively applied in healthcare scenarios for rehabilitation aims. EEG signal decoding is a difficult process due to its complexity and poor signal-to-noise ratio. Convolutional neural networks (CNN) have demonstrated their ability to extract time–space characteristics from EEG signals for better classification results. However, to discover dynamic correlations in these signals, CNN models must be improved. Hyperparameter choice strongly affects the robustness of CNNs. It is still challenging since the manual tuning performed by domain experts lacks the high performance needed for real-life applications. To overcome these limitations, we presented a fusion of three optimum CNN models using the Average Ensemble strategy, a method that is utilized for the first time for MI movement classification. Moreover, we adopted the Bayesian Optimization (BO) algorithm to reach the optimal hyperparameters' values. The experimental results demonstrate that without data augmentation, our approach reached 92% accuracy, whereas Linear Discriminate Analysis, Support Vector Machine, Random Forest, Multi-Layer Perceptron, and Gaussian Naive Bayes achieved 68%, 70%, 58%, 64%, and 40% accuracy, respectively. Further, we surpassed state-of-the-art strategies on the BCI competition IV-2a multiclass MI database by a wide margin, proving the benefit of combining the output of CNN models with automated hyperparameter tuning.

Keywords: hyperparameters; ensemble learning (EL); brain–computer interface (BCI); MI-EEG; multiclassification

1. Introduction

Brain Computer Interface (BCI) is a revolutionary technology that permits direct communication with the brain through computers, without the involvement of peripheral nerves or muscles. This happens by sensing the human brain signals and translating them into command-based applications that are mainly designed to help disabled individuals suffering from motor diseases and congenital conditions, such as Kennedy's disease, Progressive Bulbar Palsy (PBP), Muscular Dystrophy (MD), or locked-in syndrome, to communicate with the external world.

The electroencephalogram is one of the most effective non-invasive methods for studying the electrophysiological dynamics of the brain. It is widely used in BCI-based applications [1] because it is portable, easy to use, inexpensive, and has a high temporal resolution. Despite its very low signal-to-noise ratio and low spatial resolution, it is still considered the best choice for collecting human brain motor cortex signals among all other existing options, such as Magnetoencephalography (MEG), Positron Emission Tomography (PET), or functional Magnetic Resonance Imaging (fMRI) [2], which are more complicated to employ.
MI-based EEG is a commonly used approach within BCIs. It is a cognitive process that does not require any prior stimuli. It relies completely on the subject's imagination of an action without physically performing it. This helps generate a signal similar to that of body movement [3]. In general, brain signals are time varying and subject specific [4], making
MI-EEG feature extraction and classification difficult, especially for conventional Machine
Learning (ML) algorithms [5] that involve a handcrafted signal processing pipeline and a
long training time to learn significant features. Moreover, traditional ML is domain specific.
This reduces its common potential application in the general context of rehabilitation-
related technologies. As a result, they largely fail to satisfy the growing level of performance
needed from real-life-based BCI applications.
Deep Learning (DL) algorithms have achieved good results on time series data and overcome several of the limitations of traditional ML algorithms. One of the best Deep Neural Network (DNN) algorithms is the CNN [6]. It requires neither data pre-processing nor feature extraction prior to classification. The CNN classifier manages raw EEG data analysis successfully in several
application domains and has achieved a satisfying result [7,8]. However, the complex
nature of the signal and the implementation of more advanced CNN architectures demand
an optimal hyperparameter deployment to decrease the training time and increase the
learning rate. Hyperparameter selection is an essential step in building any CNN model.
The manual tuning of those parameters may lead to uncertain results as it depends on the
intuition of the user. The choice of the right combinations is time consuming and demands
expert knowledge. This limitation can be solved by considering optimization algorithms
that help reach the global optima of a given CNN by choosing the appropriate set of values
of the hyperparameters in a given space in less time efficiently and automatically.
This paper's contributions are as follows: (1) As far as we know, this is the first time the concept of an Average Ensemble of optimized CNNs has been applied to the EEG MI movement classification problem, achieving a high level of performance. (2) Our proposed approach
outperformed the recent methods in the BCI competition IV-2a 4-class MI database [9] by a
significant margin. (3) The proposed approach’s performance is extensively evaluated by
comparing it to five traditional ML algorithms (Linear Discriminate Analysis [10], Support
Vector Machine [11], Random Forest [12], Multi-Layer Perceptron [13], and Gaussian
Naive Bayes [14]). (4) The optimum hyperparameters are determined automatically by
employing BO.
The classification performance of MI-EEG signals is an important part of building a
strong and reliable BCI application. Many researchers, therefore, investigated different
approaches seeking good results. Javeria et al. [15] used Support Vector Machine, Naïve-
Bayesian Parzen-Window (NBPW), and K-Nearest Neighbour (KNN) algorithms to train
and classify upper limb MI tasks and rest. Features were extracted by carrying out Common Spatial Pattern and Linear Discriminate Analysis. Since not all of the extracted features are necessary for classification, the optimal features were selected using the Sequential Backward Floating Selection (SBFS) method. Two datasets were employed to assess the algorithm's performance. Their proposed method achieves an accuracy of 86.50% on the Wet Gel Electrodes dataset and 60.61% on the Emotiv EPOC headset.
Bhattacharyya et al. [16] classified upper limb movement using raw EEG signals acquired
from the BCI competition-II 2003 dataset [17]. Unreduced features along with reduced
features were tested on the following classifiers: Linear Discriminate Analysis, Quadratic
Discriminant Analysis (QDA), KNN, Support Vector Machine, Radial Basis Function (RBF),
and Naive Bayesian (NB) classifier. Consequently, the best results were achieved by
Support Vector Machine, reaching an accuracy of 82.14% in both scenarios, whilst KNN’s
performance improved with reduced features. Wahid et al. [18] used EEG signals from
40 healthy people and four ML classifiers, including Support Vector Machine, Linear
Discriminate Analysis, Random Forest, and KNN, to identify foot and hand movement
tasks. Common Spatial Pattern features were retrieved from several window widths,
ranging from 0.3 s to 2 s. The findings demonstrate that with 2 s window sizes and nine
votes, the KNN can reach 78.9%. The ranking study utilizing both accuracy and area under
the curve data demonstrate that the Random Forest method can consistently perform well
with varying window sizes and amounts of votes. Yang et al. [19] explored the effectiveness
of CNN for multi-class MI-EEG data classification. They constructed augmented Common
Spatial Pattern features employing pair-wise projection matrices, which deal with various
frequency ranges. There are eight levels of frequency bands, with 4 Hz and 40 Hz being
the starting and ending frequencies, respectively. The width of each band is selected as
(3 4 7 8 11 12 13 15) Hz and the widths of each window shift are (2 2 4 5 6 6 5 5). In total,
there are 60 frequency bands. Furthermore, they presented a frequency complementary
feature map selection approach that limits the reliance across frequency bands, attaining
an averaged cross-validation accuracy of 69.27%. Their findings show that CNN can
acquire deep and discriminant structural features for EEG classification without depending
on handcrafted features (handcrafted features are those that are manually engineered).
Cheng et al. [20] employed a cascade of one-dimensional convolutional neural networks
and fully connected neural networks (1D-CNN-FC) with a cooperative training method to
classify background EEG and MI-EEG. Six convolutional layers and two fully connected
layers comprise the 1D-CNN-FC cascade. They compared their proposed cascade to
the traditional strategy for motor imagery-BCI, which included Support Vector Machine
for classification and Common Spatial Pattern for feature extraction. Their experiments
compare three classification tasks: Firstly, to classify pure imagery from background EEG,
secondly, transitional imagery from background EEG, and lastly, to separate both pure
and transitional imagery from background EEG using a joint training scheme, considering
pure and transitional imagery signals as the same class. The cooperative training method
outperformed traditional training in their experiments. Abbas et al. [21] presented a
framework in which the covariance maximization method is adopted for discriminating
inter-class data and this technique uses Fast Fourier Transform Energy Maps (FFTEM) to
convert 1D data into 2D data and select features. FFTEM is used as a feature selection
method, which exploits both the spectral and energy information hidden in EEG data. It
is applied to the discriminative features obtained from Common Spatial Pattern filters to
compute energy maps corresponding to frequency bands and EEG channels. These energy
maps are used to train the CNN classifier. It was necessary to convert 1D data to 2D as they
were using 2D CNN. Their proposed classification system achieved a mean kappa value of
0.61. Sakhavi et al. [22] applied a CNN architecture for classification as well as a novel
temporal representation of the data. Their new representation is developed by adjusting
the filter-bank technique, and the CNN is constructed and optimized for the representation
accordingly. On the BCI competition IV-2a dataset, their approach surpassed the recent
methods in the literature by a 7% improvement in average subject accuracy. Zhao et al. [23]
suggested a three-dimensional visualization approach as well as a multi-branch three-
dimensional CNN to solve MI classification challenges. Their three-dimensional CNN
representation is created by converting EEG data into a series of 2-dimensional arrays
that maintain the sampling electrodes’ spatial distribution. Their innovative experiments
demonstrated that the introduced method achieves state-of-the-art classification kappa
score levels and considerably surpasses other classifiers by a 50% reduction in various
subjects’ standard deviation. Furthermore, they proved that the multi-branch architecture
may reduce overfitting when the size of the dataset is insufficient. Liu et al. [24] presented
a new semi-supervised cartesian k-means for MI-EEG classification. They employed CNN
architectures that had been pre-trained on MI examples to produce deep features, which
they then used to classify MI data. Deep features are the high-level features that are
extracted by the feature extraction block of the CNN. Their experimental results show
that their suggested strategy outperforms other comparable algorithms. Deng et al. [25]
presented the Temporary Sparse Group Lasso (TSGL) EEGNet neural network algorithm,
which has superior efficiency on the MI-EEG and is easily interpretable. Their suggested
algorithm was improved on the basis of a well-recognized DL model, known as EEGNet,
and it was optimized using the traditional ML approach. Their experimental findings
indicated that their approach obtained an average accuracy of 81.34% and a kappa of
75.11% on the BCI Competition IV 2a dataset [9]. Zümray Dokur et al. [26] evaluated the
influence of the augmentation procedure’s effect on the MI-EEG’s classification performance
rather than employing a previous transformation (Common Spatial Pattern). Furthermore,
they redesigned the DNN topology to improve classification results, reduce the number of
nodes in the architecture, and employ fewer hyperparameters. Moreover, they examined
these enhancements on datasets from BCI competitions in 2005 and 2008, with two and
four classes, respectively, and found that the augmentation had a significant influence on
the average performance.
The above-cited techniques remain of limited practicality due to their drawbacks and modes of operation. The BO algorithm overcomes those limitations due to its processing manner [27]. The BO concept is based on the assumption function, which holds predicted
ner [27]. The BO concept is based on the assumption function, which holds predicted
values, whereas the acquisition function helps predict the next point until reaching the
global optimum. The posterior probability related to the Gaussian distribution of the
unknown optimized objective function is obtained from prior information and sample
data. It is effective and precise on complex training and high-dimensional data. Further,
it reduces time cost and delivers accurate results. Wu et al. [28] studied the effect of
hyperparameter tuning on an ML algorithm employing BO with Gaussian Process (GP).
They used the same dataset on other optimization algorithms, such as Random Forest
and multi-grained cascade forest. As a result, the BO algorithm outperformed the other
methods. Hengjin et al. [29] aimed to optimize an EEG-based CNN classifier for online
E-health programs, which needs high precision and inexpensive time cost to manage the
brain dynamics simulations. They employed the BO algorithm on time series data in a grouped fashion: they sliced the hyperparameters into groups, and each group was optimized under the same conditions in the same space but separately. Then, the selection of the most optimized values was conducted through a matrix based on a Markov Chain. The employment of BO contributed significantly to building a robust CNN model
and successfully handling EEG data. Brenda et al. [30] used two different techniques for
MI-EEG classification on the BCI competition IV dataset 2a [9] and eight other pieces of
EEG data [30]. On both methods, feature extraction was executed by a Filter Bank Common
Spatial Pattern (FBCSP). After frequency band selection, the matrix containing salient fea-
tures is fed into previously optimized conventional CNN to perform binary classifications
of all classes. The same processed data are fed into four more evolutive CNNs classifiers.
As a result, the accuracy improved.
As can be noticed, individual models were used in the majority of EEG signal classifi-
cation investigations. In our study, we focused on merging the optimum results of three
robust CNNs trained separately. Besides that, none of the aforementioned studies used BO
in conjunction with Ensemble Learning (EL) approaches. Consequently, they failed to reach the near-perfect performance that is required in the clinical context.
This paper builds upon EL, DL, and BO to accomplish three major goals:
• Design an efficient classification system, addressed for BCI-based rehabilitation uses, by merging the top three CNNs generated with the help of BO.
• Obtain competitive findings using BCI IV-2a competition data by reaching high levels
of Precision, Recall, F1-score, Accuracy, and Kappa.
• Highlight the significant contribution of BO and checkpoint callback techniques to boosting the performance of CNNs on MI data.
The paper is structured as follows: Section 2 discusses the proposed BO-based Average
Ensemble approach for the classification of four EEG tasks, including the left hand, right
hand, tongue, and feet. Section 3 illustrates the experimental results and a comparison of
state-of-the-art strategies. Lastly, Section 4 concludes the paper.
2. Materials and Methods

This section describes the proposed approach for EEG data multi-classification. First, we present our target dataset. Next, we introduce the proposed one-dimensional CNN topology. Then, we discuss how BO can be used to select the optimal hyperparameters. After, we describe the Average Ensemble method, which was used to combine the three top-generated CNNs from the BO stage. Last, we compare our Ensemble model to five traditional ML algorithms. Figure 1 depicts the overall flow of the proposed framework.

Figure 1. The proposed framework.
2.1. Data Acquisition

In this article, we used the open-for-use BCI Competition IV 2a dataset 2008 from Graz university [9]. It contains 9 subjects' signals recorded for left hand, right hand, tongue, and feet. Each subject's data were collected in two days. Each day is one session. Each session contains a total of 6 runs, each including 12 trials for every imagined movement. Subjects take a short pause between the runs. To start the recording, the participants sat in a comfortable chair in front of a digital screen with arrows indicating the task of imagination. Each arrow stays for about 1.25 s. Before the beginning of each trial, a cross appears for 2 s on the screen along with a beep to signal the beginning of the trial. The duration of each MI trial after the cue is 3 s. The cue in the task was in the form of an arrow. The BCI Competition 2a dataset was recorded using 22 electrodes with a sampling rate of 250 Hz and a sensitivity of 100 µV. Then, a bandpass filter between 0.5 Hz and 100 Hz was applied. To eliminate the noise, a 50 Hz notch filter was used. A representation of the BCI Competition IV-2a's synchronicity is shown in Figure 2.

Figure 2. Experiment tasks representation on a timeline.
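For readers who want to reproduce the data access step, the following is a minimal loading sketch, not the authors' exact pipeline: one possible way to read a 2a session with MNE-Python and cut the four MI epochs. The file name, the cue-code-to-class mapping, and the EOG channel naming are assumptions about the public GDF release of the dataset.

```python
import mne

# Subject 1, training session (assumed file name from the public GDF release).
raw = mne.io.read_raw_gdf("A01T.gdf", preload=True)
raw.drop_channels([ch for ch in raw.ch_names if "EOG" in ch])  # EOG is not used here

# Assumed mapping of the cue annotations to the four MI classes
# (769/770/771/772 = left hand / right hand / feet / tongue).
events, event_id = mne.events_from_annotations(
    raw, event_id={"769": 1, "770": 2, "771": 3, "772": 4})

# Keep the 3 s of motor imagery following each cue; no extra filtering,
# since the release is already band-pass and notch filtered as described above.
epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=0.0, tmax=3.0, baseline=None, preload=True)
X = epochs.get_data()          # shape: (n_trials, n_channels, n_samples)
y = epochs.events[:, -1] - 1   # integer labels 0-3
```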
Data pre-processing is a commonly applied step in a majority of signal processing-related studies. It helps clean the data from environmental and physiological artifacts and discard bad trials [31]. Furthermore, the application of filters helps reshape the data according to the study context and needs. However, in our case study, we are using the BCI Competition IV dataset 2a, which is already pre-processed. We are, therefore, not willing to apply any further modification, since our methodology is based on an end-to-end automated pipeline for MI classification with CNN. CNNs are well-known for their capacity to handle raw EEG data and extract features. Further, the classification is executed automatically. The eye movement data collected from Electrooculography (EOG) channels are not considered in this experiment.
2.2. The Proposed Base CNN Architecture

In this paper, we first created a base CNN topology before performing hyperparameter optimization. Our proposed 1D-CNN has 2 main blocks (2 convolutional, 2 batch normalization, and 2 max-pooling layers), and a classification block (4 dense and 3 dropout layers). The architecture of the proposed base CNN approach is depicted in Figure 3.

Figure 3. The topology of our CNN approach. BN: Batch Normalization; Conv1D: Convolutional layer one-dimensional.
In each convolution layer, we used 64 convolution filters, 3 filter sizes, and 1 time step. As a fuzzy filter, convolution layers improve the features of original signals and eliminate noise. A convolution operation takes place between the upper layer's feature vectors and the current layer's convolution filter. Finally, the activation function returns the outcomes of the convolution calculations. Equation (1) shows the outcome of the convolution layer:

X_j^l = f( ∑_{i∈M_j} X_i^{l−1} × W_{ij}^l + b_j^l )    (1)

where M_j denotes the current neuron's receptive field, X_j^l represents the characteristic vector for the lth layer's jth convolution filter, f means a nonlinear function, W_{ij}^l represents the lth layer's jth convolution kernel, and b_j^l is the corresponding bias coefficient.

We utilized Batch Normalization [32] after each convolutional layer to enhance performance results and accelerate training convergence. It is accomplished by normalizing each feature at the batch level during training and then re-scaling later with the entire training data in mind. Each mini-batch's dimension is transformed as shown in Equation (2), where Var[x^(k)] and E[x^(k)] represent the variance and the mean of the kth scalar feature, respectively. The numerical stability is improved by a minor constant ε.

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε)    (2)

The learnable parameters β and γ are introduced through Batch Normalization. The
transformation of each layer is described in Equation (3).
 
BatchNormalization(x^(k)) = γ^(k) × x̂^(k) + β^(k)    (3)
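As a small worked example of Equations (2) and (3), the sketch below normalizes one scalar feature over a mini-batch; the γ and β values are arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])     # one scalar feature across a mini-batch
eps, gamma, beta = 1e-5, 0.8, 0.1      # small constant and learnable parameters

x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)   # Equation (2): zero mean, unit variance
y = gamma * x_hat + beta                          # Equation (3): re-scale and shift
print(y)   # approximately [-0.97, -0.26, 0.46, 1.17]
```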

Pooling is a sublayer of a CNN that introduces non-linearity by amalgamating nearby


values’ feature into one, using a specific operator. The most common options are average-
pooling and max-pooling. We chose max-pooling [33] for this study. After pooling, the maximum value of the pool area is used as the output. This minimizes the output's spatial
size, reducing the model’s computational complexity. We employed 1D max pooling with
a stride size of 1 and a pooling size of 2.
In order to generate the final output, dense layers were applied to produce an output
equal to the number of classes (this study takes four classes into account). We selected
y = f c( x, w, b) to represent the function performed by the dense layer, where y means
the output, x denotes the input to the dense layer, w represents the weight matrix, and b
describes the bias. The input x includes Z features. The component xi represents feature i.
It is worth noting that w contains dimension Z × A, where A reflects the y’s dimension. It
works with a vector x to generate the component j in y, as shown in Equation (4).

y_j = ∑_i w_{ij} x_i + b_j    (4)

To prevent over-fitting and improve generalization, we used the Dropout [34] tech-
nique between dense layers. Dropout offers regularization by removing a portion of the
previous layer’s outputs.
It should be noted that this step did not include:
• The dropout rate.
• The activation functions.
• The number of nodes in the first three dense layers.
These are among the hyperparameters that will be used as inputs for BO in the
next subsection.
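For concreteness, the following is a minimal sketch of the base topology described above, written with tf.keras. The input shape (22 channels over a 3 s trial at 250 Hz), the Flatten layer joining the two blocks, and the softmax/cross-entropy output stage are assumptions made for illustration; the activation, dropout rate, dense sizes, optimizer, and batch size are left as arguments because they are the quantities tuned by BO in the next subsection.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_base_cnn(input_shape=(750, 22), n_classes=4,
                   activation="relu", dropout=0.5,
                   dense_nodes=(32, 32, 32), optimizer="adam"):
    model = models.Sequential([
        # Block 1: convolution + batch normalization + max pooling
        layers.Conv1D(64, kernel_size=3, strides=1, activation=activation,
                      input_shape=input_shape),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2, strides=1),
        # Block 2
        layers.Conv1D(64, kernel_size=3, strides=1, activation=activation),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2, strides=1),
        # Classification block: 4 dense layers with 3 dropout layers in between
        layers.Flatten(),
        layers.Dense(dense_nodes[0], activation=activation),
        layers.Dropout(dropout),
        layers.Dense(dense_nodes[1], activation=activation),
        layers.Dropout(dropout),
        layers.Dense(dense_nodes[2], activation=activation),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```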

2.3. Hyperparameter Optimization


Tuning hyperparameters is an optimization problem with an unknown objective
function. This is formalized in Equation (5), where x denotes a hyperparameter setting, x_opt means the best choice, and f represents performance.

x_opt = argmax_{x∈X} f(x)    (5)

There is a range of methods that can be used to accomplish this. Grid Search [35], for
example, is a method for searching exhaustively over a portion of a hyperparameter space
for a certain algorithm. The number of parameter combinations will grow exponentially as
additional hyperparameters are introduced. As a result, the procedure will take a long time
to complete. Random Search [35] is another method for locating valuable hyperparameters.
Random Search, unlike Grid Search, tries out different combinations of parameters at
random. This causes a high variance in the outcome. These two approaches are unable to
learn anything from the assessed hyperparameter sets during the optimization procedure.
In this experiment, an efficient strategy known as BO [36] is adopted to select the
optimum hyperparameters. BO incorporates previous knowledge about the objective func-
tion and updates posterior knowledge, increasing accuracy and reducing model loss. The
optimization is carried out in accordance with Bayesian Theorem [37], which is enshrined
in Equation (6), where M denotes the model and O denotes the observation.

P(M|O) = P(O|M) P(M) / P(O)    (6)


where P(M) describes the prior probability of M, P(M|O) is M's posterior probability given O, and P(O|M) is O's likelihood given M. By using BO, one can find the objective function's minimum on a bounded set. The CNN model with additional evaluations for the objective function is adequate for making the right choices. By reducing the error in validation, BO leads to optimal CNNs in the current study.

The first critical step in the BO process is to use a posterior distribution with a GP to update the prior Function F results. GP provides a set of observed previous possible values transformed directly into posterior probabilities on the functions. This scenario allows us to apply GP in our multi-classification problem. The next step is to use an acquisition function to pick the optimum point for F. Following that, BO identifies the suggested sampling points as determined by the acquisition function. The Expected Improvement (EI) acquisition function is used at the points where the objective function is maximum and the location where the uncertainty of the prediction is high. The mathematical representation of the EI function is shown in Equation (7), where MSP denotes the sample points' mean and OFH means the Optimization Function at the highest point. The optimization function has a search space limitation represented as L. The Cumulative Distribution, denoted by CD, takes into account all of the points up to the point sample under the Gaussian distribution.

Expected Improvement = ((MSP − OFH) − L)(CD)    (7)

In order to obtain results in the validation set, it is essential to use an objective function. The critical step after that is to update the Gaussian distribution network. Several iterations of this process are completed in order to tune the validation set for creating optimized hyperparameters that work best for our classification problem. In this paper, 60 iterations were performed. The flow of the BO algorithm related to our proposed CNN network is summarized in Figure 4.

Figure 4. The flow of BO.

The hyperparameters chosen for optimization in this experiment are:
• The activation function for the convolutional layers and the first three dense layers.
• The dropout rate.
• The number of nodes in the first three dense layers.
• The optimizer.
• The batch size.

For each of these hyperparameters, the valid search ranges are predefined. "ReLU, SELU, Tanh, ELU, and Sigmoid" are the activation functions specified for the convolutional layers and the first three dense layers. ReLU is a popular activation function in CNN. It applies a threshold function to every input element, setting anything less than 0 to 0. SELU is an activation function that promotes self-normalization; SELU network neuronal activations naturally settle to a 0 mean and unit variance. Tanh is used to estimate or discriminate between two classes; however, it only transfers negative inputs into negative quantities ranging from −1 to 1. ELU is a neural network activation function. ELU, unlike ReLU, has negative values, allowing it to bring mean unit activations closer to 0, similar to batch normalization, but with less computing effort. Sigmoid is a type of non-linear activation function that is commonly employed in feedforward neural networks. It maps its input to a probability-like value between 0 and 1. Thus, when we need to make a choice or estimate an outcome, we utilize this activation function, since its range is the smallest and the prediction is therefore more precise.

The dropout rate is set between 0 and 0.5. The number of nodes in the first three dense layers has a lower bound of 32 and an upper bound of 1024. The optimizers specified are Nadam (Adam with Nesterov momentum), RMSProp (root mean square propagation), Adam (adaptive moment estimation), SGD (stochastic gradient descent), and AdaMax (an Adam variant). The low and high of the batch size are 1 and 128. Table 1 lists the hyperparameters considered in the BO experiment, as well as the search space for each hyperparameter.

Table 1. Hyperparameter setting.

Hyperparameters | Range to Probe
Activation function | ReLU–SELU–Tanh–ELU–Sigmoid
Dropout rate | 0 to 0.5
Number of nodes in the first dense layer | 32 to 1024
Number of nodes in the second dense layer | 32 to 1024
Number of nodes in the third dense layer | 32 to 1024
Optimizer | Nadam–RMSProp–Adam–SGD–AdaMax
Batch size | 1 to 128
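As an illustration of how such a search can be run in practice, the sketch below uses the scikit-optimize library (one possible implementation; the paper does not name a specific package) with the search space of Table 1, a GP surrogate, the EI acquisition function, and 60 calls. build_base_cnn and the training/validation arrays are assumed to come from the earlier sketches, with one-hot encoded labels.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args

# Search space mirroring Table 1
space = [
    Categorical(["relu", "selu", "tanh", "elu", "sigmoid"], name="activation"),
    Real(0.0, 0.5, name="dropout"),
    Integer(32, 1024, name="nodes1"),
    Integer(32, 1024, name="nodes2"),
    Integer(32, 1024, name="nodes3"),
    Categorical(["nadam", "rmsprop", "adam", "sgd", "adamax"], name="optimizer"),
    Integer(1, 128, name="batch_size"),
]

@use_named_args(space)
def objective(activation, dropout, nodes1, nodes2, nodes3, optimizer, batch_size):
    model = build_base_cnn(activation=activation, dropout=dropout,
                           dense_nodes=(nodes1, nodes2, nodes3),
                           optimizer=optimizer)
    # Short training budget per trial (an assumption); labels are one-hot encoded.
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     epochs=20, batch_size=int(batch_size), verbose=0)
    # gp_minimize minimizes, so return the negative validation accuracy
    return -max(hist.history["val_accuracy"])

result = gp_minimize(objective, space, n_calls=60, acq_func="EI", random_state=0)
print("best hyperparameters:", result.x, "best validation accuracy:", -result.fun)
```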

2.4. Ensemble Learning (EL)

In ML, EL is defined as the training of several models, known as weak learners, and the merging of their predictions to generate improved results. It allows us to merge all of these models to obtain a single excellent model that is based on insights from most or all of the training set [38]. The fact that there is no limit to the number of weak learners that can be used is an intriguing feature of EL. It could range from as few as two to as many as ten weak learners. EL can be used in a variety of ways to enhance overall performance, including averaging, weighted averaging, max voting, and many others.

The motivation behind employing the EL approach is to improve the overall efficiency of the top three optimum models generated by BO. Further, it eliminates the possibility of weak learners focusing on local minima. In this paper, we utilized the Average Ensemble method. With this method, each prediction on the test set is averaged, and finally the prediction is derived from the average prediction model. Each of the three weak learners makes an equal contribution to the overall prediction, as shown in Equation (8).

y(x) = (1/3) ∑_{m=1}^{3} y_m(x)    (8)

The specific steps of this method can be described as follows: (1) generate three weak learners (top three optimum models) from the BO experiment, (2) train them separately from scratch on the same dataset, and (3) merge these weak learners by averaging their predictions (y1, y2, y3) for each sample of our dataset to calculate y. An illustration of the structure of the proposed Average Ensemble method is displayed in Figure 5.

Figure 5. Representation of the proposed Average Ensemble method.

2.5. Traditional ML Algorithms


In this experiment, we picked five traditional ML algorithms (Linear Discriminate
Analysis, Support Vector Machine, Random Forest, Multi-Layer Perceptron, and Gaussian
Naive Bayes) to compare their performance with our proposed Average Ensemble approach.
We trained these algorithms on the same dataset in order to offer a fair performance
evaluation. Note that the hyperparameters of the five traditional ML algorithms are not
optimized in our experiment. We used Scikit-learn [39] Python Package to build these
algorithms and we trained them with the default hyperparameters provided in this Package.
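A minimal sketch of this baseline comparison is given below; flattening each trial into a single feature vector and the variable names are assumptions made for illustration, and every classifier keeps its scikit-learn defaults as stated above.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

baselines = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "MLP": MLPClassifier(),
    "GNB": GaussianNB(),
}

# Flatten each (channels x samples) trial into one feature vector (assumed layout).
X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

for name, clf in baselines.items():
    clf.fit(X_train_flat, y_train_int)               # integer class labels 0-3
    acc = accuracy_score(y_test_int, clf.predict(X_test_flat))
    print(f"{name}: accuracy = {acc:.2f}")
```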

2.5.1. Linear Discriminate Analysis (LDA)


LDA [10] is a dimensionality reduction algorithm that identifies a subspace with the
greatest possible difference in variance between classes while keeping the variance within
classes constant. Consequently, the LDA-derived subspace effectively distinguishes the classes.
Let us consider a set of C-class and D-dimensional samples x1 , x2 , . . . , x N , N1 of which


corresponds to the z1 class, N2 belong to the z2 class, and Nc to the zc class. For good discrimi-
nation between classes, we need to determine the separation measure. Equations (9) and (10)
give the within-class and the between-class scatter measures, respectively.

S_i = ∑_{x∈z_i} (x − µ_i)(x − µ_i)^T    (9)

S_B = ∑_{i=1}^{c} N_i (µ_i − µ)(µ_i − µ)^T    (10)

where:

S_Z = ∑_{i=1}^{c} S_i ,   µ_i = (1/N_i) ∑_{x∈z_i} x ,   µ = (1/N) ∑_{∀x} x = (1/N) ∑_{i=1}^{c} N_i µ_i

The total scatter can similarly be represented by the matrix S_T = S_B + S_Z. The scatter matrices S̃_B and S̃_Z for the projected samples can be described as shown in Equations (11) and (12). The scatter matrices are a measure of the scatter in multivariate feature space.

S̃_B = ∑_{i=1}^{c} N_i (µ̃_i − µ̃)(µ̃_i − µ̃)^T    (11)

S̃_Z = ∑_{i=1}^{c} ∑_{y∈z_i} (y − µ̃_i)(y − µ̃_i)^T    (12)

2.5.2. Support Vector Machine (SVM)


SVM [11] is another classifier employed in this study, which tries to determine a
hyperplane (also known as the decision border) that separates the classes and, further,
for which the distances between the hyperplane and the nearest instances are maximized.
Margins are the space between the hyperplane and the nearest instances. Consider mapping
x (an input) into a high-dimensional feature space s = ∅( x ) and making an optimum
hyperplane denoted by w·s − b = 0 in order to categorize instances into different classes,
where b represents the separation hyperplane’s bias and w denotes the normal vector. This
is accomplished by addressing the problem shown in Equation (13).

min (1/2)‖w‖² + C ∑_{i=1}^{n} β_i ,    (13)

s.t. yi (w·si − b) ≥ 1 − β i , β i ≥ 0
where the ith input sample is represented by x_i, the class label value belonging to x_i is denoted by y_i, the slack element that permits an instance to be in the margin (0 ≤ β_i ≤ 1) or to be incorrectly classified (β_i > 1) is described by β_i, the number of input samples is indicated by n, and the penalty factor is symbolized by C.

2.5.3. Random Forest (RF)

RF [12] employs an EL approach for classification that involves various decision trees during the training stage and produces the individual trees' average prediction. It constructs forests with an arbitrary number of trees. The flow of this algorithm is described in the following steps: (1) create N trees of bootstrap instances based on the dataset, (2) at each bootstrap instance, grow one unpruned tree, (3) arbitrarily take k-tries of the predictors at each node, (4) select the optimal split from those variables, and (5) predict new data by merging the N trees' predictions. The general operation of the RF classifier is depicted in Figure 6.

Figure 6. The general operation of the RF classifier.

X indicates the input data to the algorithm. tree1, tree2, . . . , treeN represent the randomly constructed decision trees and their associated outputs, denoted by k1, k2, . . . , kN. Majority Voting is conducted and class k is chosen from among k1, k2, . . . , kN. The algorithm's output is class k, which received the majority of votes.

2.5.4. Multi-Layer Perceptron (MLP)

MLP [13] is a feed-forward neural network that links the input data to relevant outputs. It is made up of numerous layers of neurons, with each layer being fully connected to the following layer. Figure 7 is an example of an MLP neural network topology. Information from the dataset (x1, x2, . . . , xN) is passed to the input layer, where it is processed by the network. During the hidden layer processing, neurons receive information from the input layer and process it in a hidden manner. The output layer takes the processed data and delivers it out of the network as an output signal.

Figure 7. MLP classifier.

2.5.5. Gaussian Naive Bayes (GNB)


The Naive Bayes theorem is the core of the GNB ML classifier [14]. Its goal is to
allocate the feature vector to the class Ci by computing the feature vector’s posteriori
probability. Equation (14) shows the mathematical representation of Naive Bayes theorem,
where i ∈ { L, R, F, T }, L denotes left hand, R is right hand, F means both feet, T indicates
tongue class, and the selected optimal features’ set is represented by O = {o1 , o2 , . . . , om }.

p(C_i | O) = p(O | C_i) × p(C_i) / p(O)    (14)

Naive Bayes can be expanded as GNB by adopting a Gaussian distribution. Suppose σ²_jL and µ_iL represent the variance and mean of the feature vector o_i corresponding to the left hand class C_L. The class-conditional probability is then determined using the Gaussian normal distribution, as shown in Equation (15).

p(O = o_i | C_L) = (1 / √(2π σ²_jL)) exp( −(o_i − µ_iL)² / (2σ²_jL) )    (15)

The class is provided by the prediction result, which is given in Equation (16).

C_pred = argmax_i p(C_i | o_1, o_2, . . . , o_m)    (16)

2.6. Performance and Parameter Evaluation


In this paper, we used Accuracy, Precision, Recall, F1-score, and Cohen’s Kappa to
evaluate the performance of our proposed method.
Accuracy, which is the total of correctly classified samples (True Positive + True
Negative) divided by the sum of properly and incorrectly classified samples (True Positive
+ True Negative + False Positive + False Negative), is one of the most often used metrics in
multi-class classification.
Precision reflects how far we can rely on our model when it identifies an individual as
Positive and it is calculated by dividing the ratio of True Positive samples by the sum of
positively predicted samples (True Positive + False Positive).
Recall indicates our model’s capability to detect all Positive samples in the dataset and
it is computed by dividing the ratio of True Positive samples by the sum of True Positives
and False Negatives.
F1-score combines Recall and Precision into a single metric that incorporates both
features. It merges both metrics using the harmonic mean notion, as shown in Equation (17).

F1-score = 2 / (precision⁻¹ + recall⁻¹) = 2 × (precision × recall) / (precision + recall)    (17)

Cohen’s Kappa is a metric based on the agreement principle that attempts to account
for assigning the same item in each class at random. It can function on any number
of classes into which the dataset is classified. It can be expressed mathematically as
shown in Equation (18), where pe denotes the expected agreement and po stands for the
observed agreement.

Cohen's Kappa = (p_o − p_e) / (1 − p_e) = 1 − (1 − p_o) / (1 − p_e)    (18)
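The sketch below ties the Average Ensemble of Section 2.4 to these metrics: the softmax outputs of the three optimized CNNs are averaged (Equation (8)) and the resulting predictions are scored with scikit-learn. The model and array names are assumptions carried over from the earlier sketches, and macro averaging over the four classes is one reasonable choice for the per-class metrics.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# Equation (8): each weak learner contributes equally to the ensemble output.
probs = (model_1.predict(X_test) +
         model_2.predict(X_test) +
         model_3.predict(X_test)) / 3.0
y_pred = probs.argmax(axis=1)

print("Accuracy :", accuracy_score(y_test_int, y_pred))
print("Precision:", precision_score(y_test_int, y_pred, average="macro"))
print("Recall   :", recall_score(y_test_int, y_pred, average="macro"))
print("F1-score :", f1_score(y_test_int, y_pred, average="macro"))
print("Kappa    :", cohen_kappa_score(y_test_int, y_pred))
```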

3. Results
All models in this investigation were trained and executed using the Google Colabora-
tory platform [40] with a virtual GPU.
The convergence plot of the minimum f(x) as a function of the total number of iterations is shown in Figure 8. As previously indicated, we went through 60 iterations. We noticed that it took only two hyperparameter trials to find significant improvements. After 21 rounds, the function value reached its minimum, and the accuracy score did not improve any further.

Figure 8. The progress of BO.

Figure 9 is a representation of the order in which the points were sampled during optimization, as well as histograms illustrating the distribution of these points for each search space. Each point's color encodes the sequence in which samples were assessed. The best parameters are indicated by a red star. A variety of areas were investigated. Regarding the histogram of the activation function, ReLU was chosen roughly 24 times as the best activation function for the convolutional layers and the first three dense layers, while SELU came in second with 14 sample counts. In terms of the dropout histogram, 0.5 was selected over 15 times. The same sample count was recorded for 0, implying that we can disable the dropout layer with a certain set of hyperparameters. Concerning the histograms of the number of nodes in the first three dense layers, 1024 was the most often chosen value, appearing 7 times in the first dense layer, 10 times in the second dense layer, and 12 times in the third dense layer. When it comes to the optimizer histogram, Nadam was the most frequently selected optimizer with more than 16 sample counts, while Adam and SGD ranked second (12 sample counts each). With respect to the histogram of batch size, the highest value of the search space (128) was picked most often.
Table 2 provides more insights with a list of the top five best trials. It is noticed
that the top two best trials (Trials 20 and 51) achieve a maximum level of accuracy of
80%, demonstrating that CNN can learn discriminant features for EEG data multi-class
classification. Another note is that the ReLU activation function was chosen in all these
optimal trials, making it the ideal choice for CNN models. Moreover, in the majority of
these optimum trials, dense nodes of over 300 were used, indicating that a higher dense
node value leads to better performance. Further, it is evident that the combination of ReLU,
batch size of roughly 60, and AdaMax optimizer is the optimum option.
In order to prepare the top three optimum models (Trials 20, 51, and 31) for the
Average Ensemble method, we trained them from scratch. Our approach this time was
to execute the Keras checkpoint callback at the end of every epoch, thereby saving the
weights of the model whenever the validation accuracy improved. This allowed us to
tune another critical hyperparameter: the number of epochs. Further explanation can be
found in Figure 10, where the checkpoint callback saves the weights of the first optimum
CNN at epoch 32. The accuracy of the first, second, and third optimum models increased
by 8%, 8%, and 10%, respectively, after applying this method.
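The checkpointing step can be sketched with the standard Keras ModelCheckpoint callback as follows; the file path and the commented training call are placeholders rather than the exact settings used here.

```python
# Minimal sketch of the checkpoint strategy described above: at the end of every
# epoch, the weights are saved only when the validation accuracy improves.
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="optimum_cnn.weights.h5",   # hypothetical path
    monitor="val_accuracy",              # watch the validation accuracy
    save_best_only=True,                 # keep only the best epoch seen so far
    save_weights_only=True,
    verbose=1,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=60, callbacks=[checkpoint])
# model.load_weights("optimum_cnn.weights.h5")   # restores the best epoch (e.g., epoch 32)
```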

Figure 9. Representation of the evaluation plot allows us to understand the progression of the search
as well as the histograms of explored values. The best point samples are marked with a red star.
Dense_nodes1, Dense_nodes2, and Dense_nodes3 refer to the number of nodes in the first, second,
and third dense layers, respectively.
Table 2. The top five most optimum trials.

Trial Number 20 51 31 2 21
Activation Function ReLU ReLU ReLU ReLU ReLU
Dropout 0 0.3 0.5 0.16 0
Dense_nodes1 629 529 520 549 724
Dense_nodes2 947 965 451 321 32
Dense_nodes3 862 698 965 383 665
Batch Size 60 61 128 44 69
Optimizer AdaMax AdaMax Nadam Nadam SGD
Accuracy 80% 80% 74% 72% 72%
Time (seconds) 36.06 37.7 32.9 48.78 33.45

Figure 10. Plot of accuracy versus various epochs of the first optimum CNN. The perfect number of
epochs (32) is highlighted by the blue circle, which indicates where the checkpoint callback saved
the model's weights the last time it was executed.

The hyperparameters for the first iteration of BO were chosen at random (ReLU
activation functions, a dropout rate of 50%, 32 nodes for the first three dense layers, a batch
size of 64, and Adam as the optimizer) to highlight the benefit of hyperparameter optimization
and the usage of the checkpoint callback. In Table 3, we compare the performance of the
base CNN architecture before and after applying BO and the checkpoint callback. It is evident
that the performance of our base CNN model drastically improved after optimizing the
activation function, the dropout, the number of nodes, the optimizer, the batch size, and
the number of epochs. In terms of the average performance of the three top CNNs, we
notice an average increase of 19.3%, 30%, 18.6%, 24.6%, and 27.5% in Accuracy, Precision,
Recall, F1-score, and Kappa, respectively.

Table 3. Comparison between the performance of the base CNN architecture before and after applying
BO and checkpoint callback.

Base CNN Architecture Accuracy Precision Recall F1-Score Kappa
Before BO + checkpoint callback 68 58 68 62 54.7
After BO + checkpoint callback (1st optimum CNN) 88 90 88 88 84.01
After BO + checkpoint callback (2nd optimum CNN) 88 88 88 88 83.97
After BO + checkpoint callback (3rd optimum CNN) 86 86 84 84 78.63
After BO + checkpoint callback (Average) 87.3 88 86.6 86.6 82.2

We combined our three generated CNNs using the Average Ensemble strategy. The
performance evaluation metrics for the proposed models are shown in Table 4. Precision is
described as the positive predictive rate. Based on Precision results, our Ensemble method
surpassed the weak learners' average Precision by 5.7%, which indicates the feasibility
and superiority of the Ensemble model designed in this paper. Moreover, it provided 100%
Precision on the left hand and both feet, stressing the importance of aggregating the
performance of multiple weak learners. Our Ensemble method's Recall exceeded the
average Recall of the weak learners by 5.34%, and the F1-score showed the same 5.34%
increase. The "feet" label was identified with 100% confidence based
on the three aforementioned metrics. In terms of Accuracy score, our Ensemble technique
achieved 92%, representing a 5.34% improvement over the weak learners. This is the same
increase rate recorded for Recall and F1-score. Cohen’s kappa is an important statistic that
measures inter-annotator agreement. The employed Ensemble method scored the highest
kappa score, reflecting a 7.13% increase over the weak learners. This significant increase
in Precision (5.7%), Recall (5.34%), Accuracy (5.34%), F1-score (5.34%), and Kappa (7.13%)
demonstrates the feasibility and superiority of the Ensemble model proposed in this study.
Overall, the Ensemble method produces extremely promising results across all five metrics.
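The Average Ensemble step itself reduces to averaging the per-class softmax outputs of the three optimum CNNs; a minimal sketch (with placeholder model and data names, not the exact implementation used here) is given below.

```python
# Illustrative sketch of the Average Ensemble: the class probabilities predicted by
# the three optimum CNNs are averaged and the most probable class is returned.
import numpy as np

def average_ensemble_predict(models, x):
    probs = [m.predict(x, verbose=0) for m in models]   # each array: (n_trials, 4 classes)
    mean_probs = np.mean(probs, axis=0)                 # element-wise average over models
    return np.argmax(mean_probs, axis=1)                # predicted MI class per trial

# y_pred = average_ensemble_predict([cnn1, cnn2, cnn3], x_test)   # cnn1..cnn3: trained models
```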

Table 4. The performance evaluation metrics for the proposed models.

Models Labels Precision Recall F1-Score Accuracy Kappa


Left hand 92 79 85
Right hand 100 83 91
1st optimum CNN Both feet 92 92 92 88 84.01
Tongue 75 100 86
Average 90 88 88
Left hand 93 93 93
Right hand 75 75 75
2nd optimum CNN Both feet 92 100 96 88 83.97
Tongue 91 83 87
Average 88 88 88
Left hand 79 79 79
Right hand 75 100 86
3rd optimum CNN Both feet 91 83 87 84 78.63
Tongue 100 75 86
Average 86 84 84
Left hand 100 93 96
Right hand 79 92 85
Average Ensemble Both feet 100 100 100 92 89.33
Tongue 91 83 87
Average 93 92 92

Table 5 compares our optimized Average Ensemble model to five traditional ML
algorithms (LDA, SVM, RF, MLP, and GNB) on the BCI Competition IV-2a dataset. All
the traditional classifiers were evaluated using the same metrics. Our proposed method
outperformed the traditional algorithms by a wide margin across all five metrics. Among
the traditional algorithms, the SVM classifier performs best, with a Precision of 70.24%, a
Recall of 70.83%, an F1-score of 70.35%, an Accuracy of 70%, and a Kappa score of 60.02%,
whereas GNB is the worst classifier in this experiment.
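For completeness, the sketch below shows how such a baseline comparison can be run, e.g., with scikit-learn [39]; the flattened feature arrays are placeholders, and the classifier settings shown are illustrative defaults rather than the exact configurations evaluated here.

```python
# Illustrative sketch: evaluating the five traditional baselines of Table 5 with
# scikit-learn. X_train_flat / X_test_flat are placeholder 2-D feature matrices.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, cohen_kappa_score

baselines = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(),
    "RF":  RandomForestClassifier(),
    "MLP": MLPClassifier(max_iter=500),
    "GNB": GaussianNB(),
}

for name, clf in baselines.items():
    clf.fit(X_train_flat, y_train)                 # placeholder training features/labels
    y_pred = clf.predict(X_test_flat)              # placeholder test features
    print(name, accuracy_score(y_test, y_pred), cohen_kappa_score(y_test, y_pred))
```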

Table 5. Comparison with traditional ML algorithms.

Classifiers Precision Recall F1-Score Accuracy Kappa


LDA 70.21 68.15 68.2 68 57.22
SVM 70.24 70.83 70.35 70 60.02
RF 59.29 58.15 58.05 58 44.02
MLP 64.81 64.23 63.97 64 52.02
GNB 40.18 40.18 39.45 40 20.13
Average Ensemble of BO-CNNs 93 92 92 92 89.33

CNNs can learn complicated features from a heterogeneous dataset, with the ability to
expose the dataset’s hidden characteristics while performing data fitting. Traditional ML
algorithms generally build their models from a fixed set of functions, limiting the model’s
representation capability.
We evaluated our EL-based model against state-of-the-art methods [19–26] using the
same dataset (BCI Competition IV-2a dataset) for fair comparison (Table 6). The accuracy of
our method reached 92%, outperforming all state-of-the-art methods. Average Ensemble,
along with the BO-based hyperparameter optimization, boosted CNN performance.

Table 6. Comparison with state-of-the-art methods [19–26].

Methods Accuracy
Yang’s method [19] 69.27
Cheng’s method [20] 70.5
Abbas’s method [21] 70.7
Sakhavi’s method [22] 74.46
Zhao’s method [23] 75
Liu’s method [24] 76.86
Deng’s method [25] 78.9
Dokur's method [26] 79.3
Proposed method 92

To the best of our knowledge, this is the first study to employ BO for optimum
hyperparameter selection in conjunction with the Average Ensemble approach for this
specific EEG MI classification problem. This method was found to be the most effective in
this experiment, with a Precision of 93%, a Recall of 92%, an F1-score of 92%, an accuracy
of 92%, and a kappa score of 89.33%. In the field of medical diagnostics, these values
are considered "very good," and they can be further enhanced by exploring other EL
strategies. Evaluating the CNN objective function is both time consuming and expensive.
In our case, the expensive evaluations were confined to a limited region of the search space
because the EI acquisition function quickly identified where to sample next. The objective
function is modeled by a GP surrogate that is updated after every evaluation; because the
GP provides a posterior distribution over the objective function, the progress of the search
can be tracked analytically. The EL technique utilized in this study automates the
randomization procedure, enabling researchers and investigators to analyze numerous
databases and collect relevant findings. Instead of being limited to a single classifier, several
classifiers are repeatedly constructed while the inputs are randomly altered; merging these
single classifiers into one yields a more efficient prediction strategy.
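For reference, under a GP posterior with predictive mean $\mu(x)$ and standard deviation $\sigma(x)$, the EI acquisition function mentioned above takes the standard closed form below (a textbook formulation given here for clarity; the notation is generic rather than that used earlier in this paper):

$$\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f_{\min} - f(x),\, 0\big)\big] = \big(f_{\min} - \mu(x)\big)\,\Phi(z) + \sigma(x)\,\varphi(z), \qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)},$$

where $f_{\min}$ is the best (lowest) objective value observed so far and $\Phi$ and $\varphi$ denote the standard normal CDF and PDF. Candidates with large EI are evaluated next, which is why the expensive CNN trainings concentrate in promising regions of the search space.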
It is true that all three optimum CNNs employ the same topology (two convolutional
layers, two batch normalization layers, two max-pooling layers, three dropout layers, and four dense
layers). Each model, however, has various hyperparameter values (dropout rate, activation
function, number of dense nodes, batch size, and optimizer) and each model has its own
weaknesses. The idea behind employing an ensemble is that by combining various models
that represent multiple hypotheses about the data, we can come up with a better hypothesis
that is not in the hypothesis space of the models from which the ensemble is constructed.
The three models are complementary in that if one fails, the other succeeds. The accuracy of
this Ensemble method was higher than when a single model was used. This demonstrates
the efficiency of ensemble learning.
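A hedged Keras sketch of this shared topology is given below; the filter counts, kernel sizes, and input shape are assumptions made only for illustration and were not fixed by the description above.

```python
# Illustrative sketch of the shared topology described above: two convolutional layers,
# two batch-normalization layers, two max-pooling layers, three dropout layers, and four
# dense layers. Filter counts, kernel sizes, and the EEG input shape are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(activation="relu", dropout=0.5, n1=629, n2=947, n3=862,
              optimizer="adamax", n_classes=4):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(22, 1000, 1)),                  # assumed (channels x samples x 1)
        layers.Conv2D(32, (3, 3), padding="same", activation=activation),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation=activation),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(n1, activation=activation),
        layers.Dropout(dropout),
        layers.Dense(n2, activation=activation),
        layers.Dropout(dropout),
        layers.Dense(n3, activation=activation),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),        # four MI classes
    ])
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```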

4. Conclusions
MI-BCI applications for rehabilitation require a superior performance level and ro-
bustness. The findings of our paper confirm the high potential of our Bayesian-optimized
ensemble CNN model for real-life applications. However, better results could
be achieved by examining other EL strategies along with better hyperparameter optimizers.

Author Contributions: Conceptualization, S.K.; methodology, S.K.; software, S.K. and I.J.; validation,
S.K., I.A.E.K., I.S.A. and U.K.; formal analysis, S.K. and S.Z.; data curation, S.K.; writing—original
draft preparation, S.K., M.A.A. and M.M.; writing—review and editing, S.K.; supervision, S.Z. All
authors have read and agreed to the published version of the manuscript.
Funding: This article was supported by the Natural Science Foundation of China under Grants
51377045 and 31400844.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Piccione, F.; Giorgi, F.; Tonin, P.; Priftis, K.; Giove, S.; Silvoni, S.; Palmas, G.; Beverina, F. P300-based brain computer interface:
Reliability and performance in healthy and paralysed participants. Clin. Neurophysiol. 2006, 117, 531–537. [CrossRef]
2. Schalk, G.; Mellinger, J. Brain sensors and signals. In A Practical Guide to Brain–Computer Interfacing with BCI2000; Springer:
Berlin/Heidelberg, Germany, 2010. [CrossRef]
3. Mulder, T. Motor imagery and action observation: Cognitive tools for rehabilitation. J. Neural Transm. 2007, 114, 1265–1278.
[CrossRef]
4. Tang, X.; Li, W.; Li, X.; Ma, Z.; Dang, X. Motor imagery EEG recognition based on conditional optimization empirical mode
decomposition and multi-scale convolutional neural network. Expert Syst. Appl. 2020, 149, 113285. [CrossRef]
5. Shoka, A.; Dessouky, M.; El-Sherbeny, A.; El-Sayed, A. Literature Review on EEG Preprocessing, Feature Extraction, and
Classifications Techniques. Menoufia J. Electron. Eng. Res. 2019, 28, 292–299. [CrossRef]
6. Mirowski, P.; Madhavan, D.; LeCun, Y.; Kuzniecky, R. Classification of patterns of EEG synchronization for seizure prediction.
Clin. Neurophysiol. 2009, 120, 149–171. [CrossRef] [PubMed]
7. Ke, H.; Chen, D.; Li, X.; Tang, Y.; Shah, T.; Ranjan, R. Towards brain big data classification: Epileptic EEG identification with a
lightweight VGGNet on global MIC. IEEE Access 2018, 6, 722–733. [CrossRef]
8. Sushkova, O.; Obukhov, Y.; Kershner, I.; Karabanov, A.; Gabova, A. Classification of early-stage Parkinson’s disease in EEG and
tremor time–frequency features space. Parkinsonism Relat. Disord. 2016, 22, e164. [CrossRef]
9. Brunner, C.; Leeb, R.; Muller-Putz, G.; Schlogl, A.; Pfurtscheller, G. BCI Competition 2008—Graz Data Set A; Graz University of
Technology: Graz, Austria, 2008; Volume 16, pp. 1–6.
10. Ye, J.; Li, Q. A Two-Stage Linear Discriminant Analysis via QR Decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 2005,
27, 929–941.
11. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
12. Fraiwan, L.; Lweesy, K.; Khasawneh, N.; Wenz, H.; Dickhaus, H. Automated sleep stage identification system based on
time–frequency analysis of a single EEG channel and random forest classifier. Comput. Methods Programs Biomed. 2012, 108, 10–19.
[CrossRef]
13. Nicolaou, N.; Georgiou, J. Detection of Epileptic Electroencephalogram Based on Permutation Entropy and Support Vector
Machines. Expert Syst. Appl. 2012, 39, 202–209. [CrossRef]
14. Kassam, K.S.; Markey, A.R.; Cherkassky, V.L.; Loewenstein, G.; Just, M.A. Identifying emotions on the basis of neural activation.
PLoS ONE 2013, 8, e66032. [CrossRef] [PubMed]
15. Khan, J.; Bhatti, M.; Khan, U.; Iqbal, R. Multiclass EEG motor-imagery classification with sub-band common spatial patterns.
EURASIP J. Wirel. Commun. Netw. 2019, 2019, 174. [CrossRef]
16. Bhattacharyya, S.; Khasnobish, A.; Konar, A.; Tibarewala, D.N.; Nagar, A.K. Performance analysis of left/right hand movement
classification from EEG signal by intelligent algorithms. In Proceedings of the 2011 IEEE Symposium on Computational
Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), Paris, France, 11–15 April 2011; pp. 1–8. [CrossRef]
17. BCI Competition II. Berlin Brain-Computer Interface (BBCI). Available online: https://www.bbci.de/competition/ii/ (accessed
on 13 January 2022).
18. Wahid, M.; Tafreshi, R. Improved Motor Imagery Classification Using Regularized Common Spatial Pattern with Majority Voting
Strategy. IFAC PapersOnLine 2021, 54, 226–231. [CrossRef]
19. Yang, H.; Sakhavi, S.; Ang, K.K.; Guan, C. On the use of convolutional neural networks and augmented CSP features for
multi-class motor imagery of EEG signals classification. In Proceedings of the 37th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, Milan, Italy, 25–29 August 2015; pp. 2620–2623.
20. Cheng, P.; Autthasan, P.; Pijarana, B.; Chuangsuwanich, E.; Wilaiprasitporn, T. Towards asynchronous motor imagery-based
brain-computer interfaces: A joint training scheme using deep learning. In Proceedings of the TENCON-IEEE Region 10
Conference, Jeju, Korea, 28–31 October 2018; pp. 1994–1998.
21. Abbas, W.; Khan, N.A. DeepMI: Deep learning for multiclass motor imagery classification. In Proceedings of the 40th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society, Honolulu, HI, USA, 18–21 July 2018;
pp. 219–222.
22. Sakhavi, S.; Guan, C.; Yan, S. Learning temporal information for brain-computer interface using convolutional neural networks.
IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5619–5629. [CrossRef] [PubMed]
23. Zhao, X.; Zhang, H.; Zhu, G.; You, F.; Kuang, S.; Sun, L. A multi-branch 3D convolutional neural network for EEG-based motor
imagery classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 2164–2177. [CrossRef] [PubMed]
24. Liu, M.; Zhou, M.; Zhang, T.; Xiong, N. Semi-supervised learning quantization algorithm with deep features for motor imagery
EEG recognition in smart healthcare application. Appl. Soft Comput. 2020, 89, 106071. [CrossRef]
25. Deng, X.; Zhang, B.; Yu, N.; Liu, K.; Sun, K. Advanced TSGL-EEGNet for Motor Imagery EEG-Based Brain-Computer Interfaces.
IEEE Access 2021, 9, 25118–25130. [CrossRef]
26. Dokur, Z.; Ölmez, T. Classification of Motor Imagery EEG Signals by Using a Divergence Based Convolutional Neural Network.
arXiv 2021, arXiv:2103.10977.
27. Yu, T.; Zhu, H. Hyper-Parameter Optimization: A Review of Algorithms and Applications. arXiv 2020, arXiv:2003.05689.
28. Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based
on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
29. Ke, H.; Chen, D.; Shi, B.; Zhang, J.; Liu, X.; Zhang, X.; Li, X. Improving brain E-health services via high-performance EEG
classification with grouping Bayesian optimization. IEEE Trans. Serv. Comput. 2019, 13, 696–708. [CrossRef]
30. Olivas-Padilla, B.E.; Chacon-Murguia, M.I. Classification of multiple motor imagery using deep convolutional neural networks
and spatial filters. Appl. Soft Comput. 2019, 75, 461–472. [CrossRef]
31. Cheng, D.; Liu, Y.; Zhang, L. Exploring motor imagery EEG patterns for stroke patients with deep neural networks. In Proceedings
of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20
April 2018; pp. 2561–2565.
32. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015,
arXiv:1502.03167.
33. Lee, C.-Y.; Gallagher, P.W.; Tu, Z. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree.
arXiv 2015, arXiv:1509.08985.
34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
35. Li, L.; Jamieson, K.; De Salvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparame-
ter Optimization. J. Mach. Learn. Res. 2016, 18, 6765–6816.
36. Betrò, B. Bayesian methods in global optimization. J. Glob. Optim. 1991, 1, 1–14. [CrossRef]
37. Kramer, O.; Ciaurri, D.E.; Koziel, S. Derivative-free optimization. In Computational Optimization, Methods and Algorithms; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 61–83. [CrossRef]
38. Lutins, E. Ensemble Methods in Machine Learning: What Are They and Why Use Them?|by Evan Lutins|Towards Data Science.
2017. Available online: https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-whyuse-
them-68ec3f9fef5f (accessed on 11 December 2020).
39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.;
Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
40. Bisong, E. Building Machine Learning and Deep Learning Models on Google Cloud Platform; Springer: Berlin/Heidelberg,
Germany, 2019.
