Article
Multi-Classification of Motor Imagery EEG Signals Using
Bayesian Optimization-Based Average Ensemble Approach
Souha Kamhi 1,2, *, Shuai Zhang 1,2, *, Mohamed Ait Amou 3 , Mohamed Mouhafid 3 , Imran Javaid 1,2 ,
Isah Salim Ahmad 1,2 , Isselmou Abd El Kader 1,2 and Ummay Kulsum 1,2
1 State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology,
Tianjin 300130, China; [email protected] (I.J.); [email protected] (I.S.A.);
[email protected] (I.A.E.K.); [email protected] (U.K.)
2 Hebei Key Laboratory of Bioelectromagnetism and Neural Engineering, School of Health Sciences &
Biomedical Engineering, Hebei University of Technology, Tianjin 300130, China
3 School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China;
[email protected] (M.A.A.); [email protected] (M.M.)
* Correspondence: [email protected] (S.Z.); [email protected] (S.K.)
Abstract: Motor Imagery (MI) classification using electroencephalography (EEG) has been extensively applied in healthcare scenarios for rehabilitation aims. EEG signal decoding is a difficult process due to its complexity and poor signal-to-noise ratio. Convolutional neural networks (CNN) have demonstrated their ability to extract time–space characteristics from EEG signals for better classification results. However, to discover dynamic correlations in these signals, CNN models must be improved. Hyperparameter choice strongly affects the robustness of CNNs. It is still challenging since the manual tuning performed by domain experts lacks the high performance needed for real-life applications. To overcome these limitations, we presented a fusion of three optimum CNN models using the Average Ensemble strategy, a method that is utilized for the first time for MI movement classification. Moreover, we adopted the Bayesian Optimization (BO) algorithm to reach the optimal hyperparameters' values. The experimental results demonstrate that without data augmentation, our approach reached 92% accuracy, whereas Linear Discriminate Analysis, Support Vector Machine, Random Forest, Multi-Layer Perceptron, and Gaussian Naive Bayes achieved 68%, 70%, 58%, 64%, and 40% accuracy, respectively. Further, we surpassed state-of-the-art strategies on the BCI competition IV-2a multiclass MI database by a wide margin, proving the benefit of combining the output of CNN models with automated hyperparameter tuning.

Keywords: hyperparameters; ensemble learning (EL); brain–computer interface (BCI); MI-EEG; multiclassification

Academic Editors: Szczepan Paszkiel, Ningrong Lei and Maria António Castro

Citation: Kamhi, S.; Zhang, S.; Ait Amou, M.; Mouhafid, M.; Javaid, I.; Ahmad, I.S.; Abd El Kader, I.; Kulsum, U. Multi-Classification of Motor Imagery EEG Signals Using Bayesian Optimization-Based Average Ensemble Approach. Appl. Sci. 2022, 12, 5807. https://doi.org/10.3390/app12125807

Received: 24 May 2022; Accepted: 6 June 2022; Published: 7 June 2022
Emission Tomography (PET), or functional Magnetic Resonance Imaging (fMRI) [2], which
are more complicated to employ.
MI-based EEG is a commonly used approach within BCIs. It is a cognitive process that does not require any prior stimuli: it relies completely on the subject's imagination of an action without physically performing it, which helps generate a signal similar to that of actual body movement [3]. In general, brain signals are time-varying and subject-specific [4], making MI-EEG feature extraction and classification difficult, especially for conventional Machine Learning (ML) algorithms [5] that involve a handcrafted signal processing pipeline and a long training time to learn significant features. Moreover, traditional ML is domain-specific, which limits its applicability in the general context of rehabilitation-related technologies. As a result, such methods largely fail to satisfy the growing level of performance needed for real-life BCI applications.
Deep Learning (DL) algorithms have achieved good results in dealing with time series and in overcoming the limitations of traditional ML algorithms. One of the best-known Deep Neural Network (DNN) algorithms is the CNN [6]. It requires neither data pre-processing nor feature extraction prior to classification. The CNN classifier manages raw EEG data analysis successfully in several application domains and has achieved satisfying results [7,8]. However, the complex nature of the signal and the implementation of more advanced CNN architectures demand an optimal hyperparameter configuration to decrease the training time and improve learning. Hyperparameter selection is an essential step in building any CNN model. Manual tuning of those parameters may lead to uncertain results, as it depends on the intuition of the user, and choosing the right combinations is time-consuming and demands expert knowledge. This limitation can be addressed with optimization algorithms that help reach the global optimum of a given CNN by choosing an appropriate set of hyperparameter values from a given search space automatically and in less time.
This paper's contributions are as follows: (1) As far as we know, this is the first time the concept of an Average Ensemble of optimized CNNs has been applied to the EEG MI movement classification problem, achieving a high level of performance. (2) Our proposed approach
outperformed the recent methods in the BCI competition IV-2a 4-class MI database [9] by a
significant margin. (3) The proposed approach’s performance is extensively evaluated by
comparing it to five traditional ML algorithms (Linear Discriminate Analysis [10], Support
Vector Machine [11], Random Forest [12], Multi-Layer Perceptron [13], and Gaussian
Naive Bayes [14]). (4) The optimum hyperparameters are determined automatically by
employing BO.
The classification performance of MI-EEG signals is an important part of building a
strong and reliable BCI application. Many researchers, therefore, investigated different
approaches seeking good results. Javeria et al. [15] used Support Vector Machine, Naïve-
Bayesian Parzen-Window (NBPW), and K-Nearest Neighbour (KNN) algorithms to train
and classify upper limb MI tasks and rest. Features were extracted by carrying out Common Spatial Pattern and Linear Discriminate Analysis. Since not all of the extracted features are necessary for classification, the optimal features were selected using the Sequential Backward Floating Selection (SBFS) method. Two datasets were employed to assess the algorithm's performance. Their proposed method achieves an accuracy of 86.50% on the Wet Gel Electrodes dataset and 60.61% on the Emotiv EPOC headset.
Bhattacharyya et al. [16] classified upper limb movement using raw EEG signals acquired
from the BCI competition-II 2003 dataset [17]. Unreduced features along with reduced
features were tested on the following classifiers: Linear Discriminate Analysis, Quadratic
Discriminant Analysis (QDA), KNN, Support Vector Machine, Radial Basis Function (RBF),
and Naive Bayesian (NB) classifier. Consequently, the best results were achieved by
Support Vector Machine, reaching an accuracy of 82.14% in both scenarios, whilst KNN’s
performance improved with reduced features. Wahid et al. [18] used EEG signals from
40 healthy people and four ML classifiers, including Support Vector Machine, Linear
Discriminate Analysis, Random Forest, and KNN, to identify foot and hand movement
tasks. Common Spatial Pattern features were retrieved from several window widths,
ranging from 0.3 s to 2 s. The findings demonstrate that with a 2 s window size and nine votes, the KNN can reach 78.9% accuracy. The ranking study, utilizing both accuracy and area-under-the-curve data, demonstrates that the Random Forest method can consistently perform well with varying window sizes and numbers of votes. Yang et al. [19] explored the effectiveness of CNN for multi-class MI-EEG data classification. They constructed augmented Common Spatial Pattern features employing pair-wise projection matrices, which deal with various frequency ranges. There are eight levels of frequency bands, with 4 Hz and 40 Hz being the starting and ending frequencies, respectively. The band widths are selected as (3, 4, 7, 8, 11, 12, 13, 15) Hz and the corresponding window shifts are (2, 2, 4, 5, 6, 6, 5, 5). In total,
there are 60 frequency bands. Furthermore, they presented a frequency complementary
feature map selection approach that limits the reliance across frequency bands, attaining
an averaged cross-validation accuracy of 69.27%. Their findings show that CNN can
acquire deep and discriminant structural features for EEG classification without depending
on handcrafted features (handcrafted features are those that are manually engineered).
Cheng et al. [20] employed a cascade of one-dimensional convolutional neural networks
and fully connected neural networks (1D-CNN-FC) with a cooperative training method to
classify background EEG and MI-EEG. Six convolutional layers and two fully connected
layers comprise the 1D-CNN-FC cascade. They compared their proposed cascade to
the traditional strategy for motor imagery-BCI, which included Support Vector Machine
for classification and Common Spatial Pattern for feature extraction. Their experiments
compare three classification tasks: Firstly, to classify pure imagery from background EEG,
secondly, transitional imagery from background EEG, and lastly, to separate both pure
and transitional imagery from background EEG using a joint training scheme, considering
pure and transitional imagery signals as the same class. The cooperative training method
outperformed traditional training in their experiments. Abbas et al. [21] presented a
framework in which the covariance maximization method is adopted for discriminating
inter-class data and this technique uses Fast Fourier Transform Energy Maps (FFTEM) to
convert 1D data into 2D data and select features. FFTEM is used as a feature selection
method, which exploits both the spectral and energy information hidden in EEG data. It
is applied to the discriminative features obtained from Common Spatial Pattern filters to
compute energy maps corresponding to frequency bands and EEG channels. These energy
maps are used to train the CNN classifier. It was necessary to convert 1D data to 2D as they
were using a 2D CNN. Their proposed classification system achieved a mean kappa value of
0.61. Sakhavi et al. [22] applied a CNN architecture for classification as well as a novel
temporal representation of the data. Their new representation is developed by adjusting
the filter-bank technique, and the CNN is constructed and optimized for the representation
accordingly. On the BCI competition IV-2a dataset, their approach surpassed the recent
methods in the literature by a 7% improvement in average subject accuracy. Zhao et al. [23]
suggested a three-dimensional visualization approach as well as a multi-branch three-
dimensional CNN to solve MI classification challenges. Their three-dimensional CNN
representation is created by converting EEG data into a series of 2-dimensional arrays
that maintain the sampling electrodes’ spatial distribution. Their innovative experiments
demonstrated that the introduced method achieves state-of-the-art classification kappa
score levels and considerably surpasses other classifiers by a 50% reduction in various
subjects’ standard deviation. Furthermore, they proved that the multi-branch architecture
may reduce overfitting when the size of the dataset is insufficient. Liu et al. [24] presented
a new semi-supervised cartesian k-means for MI-EEG classification. They employed CNN
architectures that had been pre-trained on MI examples to produce deep features, which
they then used to classify MI data. Deep features are the high-level features that are
extracted by the feature extraction block of the CNN. Their experimental results show
that their suggested strategy outperforms other comparable algorithms. Deng et al. [25]
presented the Temporary Sparse Group Lasso (TSGL) EEGNet neural network algorithm,
which has superior efficiency on the MI-EEG and is easily interpretable. Their suggested
algorithm was improved on the basis of a well-recognized DL model, known as EEGNet,
and it was optimized using the traditional ML approach. Their experimental findings
indicated that their approach obtained an average accuracy of 81.34% and a kappa of
75.11% on the BCI Competition IV 2a dataset [9]. Dokur et al. [26] evaluated the influence of the data augmentation procedure on MI-EEG classification performance rather than employing a prior transformation (Common Spatial Pattern). Furthermore,
they redesigned the DNN topology to improve classification results, reduce the number of
nodes in the architecture, and employ fewer hyperparameters. Moreover, they examined
these enhancements on datasets from BCI competitions in 2005 and 2008, with two and
four classes, respectively, and found that the augmentation had a significant influence on
the average performance.
The above-cited techniques remain of limited practicality due to their drawbacks and modes of operation. The BO algorithm overcomes those limitations thanks to the way it proceeds [27]. BO relies on a surrogate (assumption) function that holds the predicted values of the objective, whereas an acquisition function proposes the next point to evaluate until the global optimum is reached. The posterior probability, related to the Gaussian distribution of the unknown optimized objective function, is obtained from prior information and sample data. BO is effective and precise on complex training tasks and high-dimensional data. Further, it reduces the time cost and delivers accurate results. Wu et al. [28] studied the effect of
it reduces time cost and delivers accurate results. Wu et al. [28] studied the effect of
hyperparameter tuning on an ML algorithm employing BO with Gaussian Process (GP).
They used the same dataset on other optimization algorithms, such as Random Forest
and multi-grained cascade forest. As a result, the BO algorithm outperformed the other
methods. Hengjin et al. [29] aimed to optimize an EEG-based CNN classifier for online
E-health programs, which needs high precision and inexpensive time cost to manage the
brain dynamics simulations. They employed the BO algorithm on time series data in a grouped fashion: they sliced the hyperparameters into groups, and each group was optimized under the same conditions in the same space but separately. Then, the most optimized values were selected through a matrix based on a Markov Chain. The employment of BO contributed significantly to building a robust CNN model and to successfully handling EEG data. Brenda et al. [30] used two different techniques for
MI-EEG classification on the BCI competition IV dataset 2a [9] and eight other pieces of
EEG data [30]. In both methods, feature extraction was executed with a Filter Bank Common Spatial Pattern (FBCSP). After frequency band selection, the matrix containing the salient features is fed into a previously optimized conventional CNN to perform binary classification of all classes. The same processed data are fed into four more evolutive CNN classifiers. As a result, the accuracy improved.
As can be noticed, individual models were used in the majority of EEG signal classification investigations. In our study, we focused on merging the optimum results of three robust CNNs trained separately. Besides that, none of the aforementioned studies used BO in conjunction with Ensemble Learning (EL) approaches. Consequently, they fail to reach the near-perfect performance required in the clinical context.
This paper builds upon EL, DL, and BO to accomplish three major goals:
• Design an efficient classification system, addressed to BCI-based rehabilitation uses, by merging the top three CNNs generated with the help of BO.
• Obtain competitive findings using BCI IV-2a competition data by reaching high levels
of Precision, Recall, F1-score, Accuracy, and Kappa.
• Highlight the significant contribution of the BO and checkpoint callback techniques to boosting the performance of CNNs on MI data.
The paper is structured as follows: Section 2 discusses the proposed BO-based Average
Ensemble approach for the classification of four EEG tasks, including the left hand, right
hand, tongue, and feet. Section 3 illustrates the experimental results and a comparison of
state-of-the-art strategies. Lastly, Section 4 concludes the paper.
Figure 2. Experiment tasks representation on a timeline.
Figure 3. The topology of our CNN approach. BN: Batch Normalization; Conv1D: Convolutional layer one-dimensional.
In each convolution layer, we used 64 convolution filters, a filter size of 3, and a stride of 1 time step. Acting as a fuzzy filter, the convolution layers improve the features of the original signals and eliminate noise. A convolution operation takes place between the upper layer's feature vectors and the current layer's convolution filter. Finally, the activation function returns the outcomes of the convolution calculations. Equation (1) shows the outcome of the convolution layer:

X_j^l = f\left( \sum_{i \in M_j} X_i^{l-1} \ast W_{ij}^l + b_j^l \right)   (1)

where M_j denotes the current neuron's receptive field, X_j^l represents the characteristic vector for the lth layer's jth convolution filter, f is a nonlinear activation function, W_{ij}^l denotes the weights of the lth layer's jth convolution kernel, and b_j^l is the corresponding bias coefficient.

We utilized Batch Normalization [32] after each convolutional layer to enhance performance results and accelerate training convergence. It is accomplished by normalizing each feature at the batch level during training and then re-scaling later with the entire training data in mind. Each mini-batch's dimension is transformed as shown in Equation (2), where \mathrm{Var}[x^{(k)}] and E[x^{(k)}] represent the variance and the mean of the kth scalar feature, respectively. The numerical stability is improved by a minor constant \epsilon.

\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}}   (2)
The learnable parameters β and γ are introduced through Batch Normalization. The transformation of each layer is described in Equation (3):

\mathrm{BatchNormalization}(x^{(k)}) = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}   (3)

The output of each fully connected (dense) layer is obtained as a weighted sum of its inputs plus a bias, as given in Equation (4):

y_j = \sum_i w_{ij} x_i + b_j   (4)
To prevent over-fitting and improve generalization, we used the Dropout [34] tech-
nique between dense layers. Dropout offers regularization by removing a portion of the
previous layer’s outputs.
It should be noted that this step did not include:
• The dropout rate.
• The activation functions.
• The number of nodes in the first three dense layers.
These are among the hyperparameters that will be used as inputs for BO in the
next subsection.
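To make the fixed part of this topology concrete, the following is a minimal Keras sketch in the spirit of Figure 3. The layer arrangement (two Conv1D–BatchNormalization–MaxPooling blocks followed by four dense layers with dropout between them) follows the description above, but the helper name `build_cnn`, the input shape, and the pooling size are illustrative assumptions rather than the authors' exact implementation; the tunable keyword arguments correspond to the hyperparameters listed above.

```python
# Illustrative sketch only: the layer arrangement follows the paper's description,
# while input_shape, the pool size, and the helper name are assumptions.
from tensorflow.keras import layers, models

def build_cnn(input_shape, n_classes=4, activation="relu",
              dropout_rate=0.5, dense_nodes=(32, 32, 32), optimizer="adam"):
    """Two Conv1D + BatchNorm + MaxPooling blocks, then four dense layers with
    dropout between them; the keyword arguments mirror the BO search space."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))               # e.g., (time_steps, channels)
    for _ in range(2):                                        # two convolutional blocks
        model.add(layers.Conv1D(64, kernel_size=3, strides=1, activation=activation))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Flatten())
    for nodes in dense_nodes:                                 # first three dense layers
        model.add(layers.Dense(nodes, activation=activation))
        model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(n_classes, activation="softmax"))  # fourth (output) dense layer
    model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```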
There is a range of methods that can be used to accomplish this. Grid Search [35], for
example, is a method for searching exhaustively over a portion of a hyperparameter space
for a certain algorithm. The number of parameter combinations will grow exponentially as
additional hyperparameters are introduced. As a result, the procedure will take a long time
to complete. Random Search [35] is another method for locating valuable hyperparameters.
Random Search, unlike Grid Search, tries out different combinations of parameters at
random. This can cause high variance in the results. Moreover, these two approaches are unable to learn anything from the previously assessed hyperparameter sets during the optimization procedure.
In this experiment, an efficient strategy known as BO [36] is adopted to select the
optimum hyperparameters. BO incorporates previous knowledge about the objective func-
tion and updates posterior knowledge, increasing accuracy and reducing model loss. The
optimization is carried out in accordance with Bayesian Theorem [37], which is enshrined
in Equation (6), where M denotes the model and O denotes the observation.
Figure 4. The flow of BO.
Hyperparameters                                   Range to Probe
Activation function                               ReLU–SELU–Tanh–ELU–Sigmoid
Dropout rate                                      0 to 0.5
Number of nodes in the first dense layer          32 to 1024
Number of nodes in the second dense layer         32 to 1024
Number of nodes in the third dense layer          32 to 1024
Optimizer                                         Nadam–RMSProp–Adam–SGD–AdaMax
Batch size                                        1 to 128
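As an illustration of how such a search could be driven, the sketch below uses scikit-optimize's gp_minimize with the Expected Improvement acquisition function over the search space listed above. The paper does not state which BO implementation was employed, so the library choice, the `build_cnn` helper from the earlier sketch, the data variables (`X_train`, `y_train`, `X_val`, `y_val`), the epoch budget, and the number of calls are all assumptions made for illustration.

```python
# Hypothetical BO driver: the search space mirrors the table above; the use of
# scikit-optimize and the surrounding variables are illustrative assumptions.
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args

space = [
    Categorical(["relu", "selu", "tanh", "elu", "sigmoid"], name="activation"),
    Real(0.0, 0.5, name="dropout_rate"),
    Integer(32, 1024, name="nodes1"),
    Integer(32, 1024, name="nodes2"),
    Integer(32, 1024, name="nodes3"),
    Categorical(["nadam", "rmsprop", "adam", "sgd", "adamax"], name="optimizer"),
    Integer(1, 128, name="batch_size"),
]

@use_named_args(space)
def objective(**params):
    # build_cnn is the builder sketched earlier; X_train/y_train/X_val/y_val are assumed.
    model = build_cnn(input_shape=X_train.shape[1:], activation=params["activation"],
                      dropout_rate=params["dropout_rate"],
                      dense_nodes=(params["nodes1"], params["nodes2"], params["nodes3"]),
                      optimizer=params["optimizer"])
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     epochs=30, batch_size=int(params["batch_size"]), verbose=0)
    return -max(hist.history["val_accuracy"])   # BO minimizes, so negate the accuracy

result = gp_minimize(objective, space, acq_func="EI", n_calls=60, random_state=0)
print("Best validation accuracy:", -result.fun)
print("Best hyperparameters:", result.x)
```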
2.4. Ensemble Learning (EL)
In ML, EL is defined as the training of several models, known as weak learners, and the merging of their predictions to generate improved results. It allows us to merge all of these models to obtain a single excellent model that is based on insights from most or all of the training set [38]. The fact that there is no limit to the number of weak learners that can be used is an intriguing feature of EL. It could range from as few as two to as many as ten weak learners. EL can be used in a variety of ways to enhance overall performance, including averaging, weighted averaging, max voting, and many others.
The motivation behind employing the EL approach is to improve the overall efficiency of the top three optimum models generated by BO. Further, it eliminates the possibility of weak learners focusing on local minima. In this paper, we utilized the Average Ensemble method. With this method, each prediction on the test set is averaged, and finally the prediction is derived from the average prediction model. Each of the three weak learners makes an equal contribution to the overall prediction, as shown in Equation (8):

y(x) = \frac{1}{3} \sum_{m=1}^{3} y_m(x)   (8)

The specific steps of this method can be described as follows: (1) generate three weak learners (top three optimum models) from the BO experiment, (2) train them separately from scratch on the same dataset, and (3) merge these weak learners by averaging their predictions (y1, y2, y3) for each sample of our dataset to calculate y. An illustration of the structure of the proposed Average Ensemble method is displayed in Figure 5.
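A minimal sketch of this averaging step is given below, assuming that the three retrained optimum models (`model_1`, `model_2`, `model_3`) and the test data (`X_test`, `y_test`) are already available; the helper name and variables are illustrative, not the authors' code.

```python
# Illustrative Average Ensemble (Equation (8)): the class probabilities predicted by
# the three weak learners are averaged and the argmax gives the final label.
import numpy as np

def average_ensemble_predict(models, X):
    probs = [m.predict(X, verbose=0) for m in models]   # each: (n_samples, n_classes)
    avg = np.mean(probs, axis=0)                         # y(x) = (1/3) * sum_m y_m(x)
    return np.argmax(avg, axis=1)

y_pred = average_ensemble_predict([model_1, model_2, model_3], X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Ensemble accuracy: {accuracy:.2%}")
```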
corresponds to the z_1 class, N_2 belongs to the z_2 class, and N_c to the z_c class. For good discrimination between classes, we need to determine the separation measure. Equations (9) and (10) give the within-class and between-class scatter measures, respectively.

S_i = \sum_{x \in z_i} (x - \mu_i)(x - \mu_i)^T   (9)

S_B = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T   (10)

where:

S_Z = \sum_{i=1}^{c} S_i, \quad \mu_i = \frac{1}{N_i} \sum_{x \in z_i} x, \quad \mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{c} N_i \mu_i

The total scatter can similarly be represented by the matrix S_T = S_B + S_Z. The scatter matrices \tilde{S}_B and \tilde{S}_Z for the projected samples can be described as shown in Equations (11) and (12). The scatter matrices are a measure of the scatter in multivariate feature space.

\tilde{S}_B = \sum_{i=1}^{c} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T   (11)

\tilde{S}_Z = \sum_{i=1}^{c} \sum_{y \in z_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T   (12)
The SVM optimization problem can be written as in Equation (13):

\min_{w, b, \beta} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \beta_i   (13)

\text{s.t.} \; y_i (w \cdot x_i - b) \geq 1 - \beta_i, \quad \beta_i \geq 0

where the ith input sample is represented by x_i, the class label value belonging to x_i is denoted by y_i, the slack element that permits an instance to be in the margin (0 ≤ β_i ≤ 1) or to be incorrectly classified (β_i > 1) is described by β_i, the input samples' number is indicated by n, and the penalty factor is symbolized by C.
2.5.3. Random Forest (RF)
RF [12] employs an EL approach for classification that involves various decision trees during the training stage and produces the individual trees' average prediction. It constructs forests with an arbitrary number of trees. The flow of this algorithm is described in the following steps: (1) create N trees of bootstrap instances based on the dataset, (2) at each bootstrap instance, grow one unpruned tree, (3) arbitrarily take k-tries of the predictors at each node, (4) select the optimal split from those variables, and (5) predict new data by merging the N trees' predictions. The general operation of the RF classifier is depicted in Figure 6.
Figure 7. MLP classifier.
p(C_i \mid O) = \frac{p(O \mid C_i) \times p(C_i)}{p(O)}   (14)

p(O = o_i \mid C_L) = \frac{1}{\sqrt{2 \pi \sigma_{iL}^2}} \, e^{- \frac{(o_i - \mu_{iL})^2}{2 \sigma_{iL}^2}}   (15)
The class is provided by the prediction result, which is given in Equation (16).
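For reference, the five traditional baselines can be trained with scikit-learn [39] along the lines of the sketch below; the flattened feature matrices and the default hyperparameters shown here are assumptions made for illustration rather than the exact settings used in the experiments.

```python
# Hypothetical baseline comparison with scikit-learn [39]; X_train/X_test are assumed
# to be 2-D feature matrices (n_samples, n_features) and y_* the class labels.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, cohen_kappa_score

baselines = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
    "GNB": GaussianNB(),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.2f}, "
          f"kappa={cohen_kappa_score(y_test, y_pred):.2f}")
```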
Cohen’s Kappa is a metric based on the agreement principle that attempts to account
for assigning the same item in each class at random. It can function on any number
of classes into which the dataset is classified. It can be expressed mathematically as
shown in Equation (18), where pe denotes the expected agreement and po stands for the
observed agreement.
\text{Cohen's Kappa} = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}   (18)
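As a quick sanity check of Equation (18), assume the four MI classes are balanced so that the chance agreement is p_e = 0.25. An observed agreement of p_o = 0.92 then gives

\text{Cohen's Kappa} = \frac{0.92 - 0.25}{1 - 0.25} \approx 0.8933,

which is consistent with the 92% accuracy and 89.33% kappa reported for the proposed method in Section 3.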
3. Results
All models in this investigation were trained and executed using the Google Colabora-
tory platform [40] with a virtual GPU.
Figure 8. The progress of BO.

Figure 9 is a representation of the order in which the points were sampled during optimization, as well as histograms illustrating the distribution of these points for each search space. Each point's color encodes the sequence in which samples were assessed. The best parameters are indicated by a red star. A variety of areas were investigated. Regarding the histogram of the activation function, ReLU was chosen roughly 24 times as the best activation function for the convolutional layers and the first three dense layers, while SELU came in second with 14 sample counts. In terms of the dropout histogram, 0.5 was selected over 15 times. The same sample count was recorded for 0, implying that we can disable the dropout layer with a certain set of hyperparameters. Concerning the histograms of the number of nodes in the first three dense layers, 1024 was the most often chosen value, appearing 7 times in the first dense layer, 10 times in the second dense layer, and 12 times in the third dense layer. When it comes to the optimizer histogram, Nadam was the most frequently selected optimizer with more than 16 sample counts, while Adam and SGD ranked second (12 sample counts each). With respect to the histogram of batch size, the highest value of the search space (128) was picked most often.
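Evaluation plots of this kind can be generated directly from the optimization result; for instance, scikit-optimize offers a ready-made utility, shown below under the assumption that `result` is the object returned by the BO sketch given earlier (the paper does not state which plotting tool was actually used).

```python
# Illustrative only: plot the order of sampled points and per-dimension histograms,
# similar in spirit to Figure 9.
import matplotlib.pyplot as plt
from skopt.plots import plot_evaluations

plot_evaluations(result)   # 'result' is the OptimizeResult returned by gp_minimize
plt.show()
```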
Table 2 provides more insights with a list of the top five best trials. It is noticed
that the top two best trials (Trials 20 and 51) achieve a maximum level of accuracy of
80%, demonstrating that CNN can learn discriminant features for EEG data multi-class
classification. Another note is that the ReLU activation function was chosen in all these
optimal trials, making it the ideal choice for CNN models. Moreover, in the majority of
these optimum trials, dense nodes of over 300 were used indicating that a higher dense
node value leads to better performance. Further, it is evident that the combination of ReLU,
batch size of roughly 60, and AdaMax optimizer is the optimum option.
In order to prepare the top three optimum models (Trials 20, 51, and 31) for the Average Ensemble method, we trained them from scratch. Our approach this time was to execute the Keras checkpoint callback at the end of every epoch, thereby saving the weights of the model whenever the validation accuracy improved. This allowed us to tune another critical hyperparameter, the number of epochs. Further explanation can be found in Figure 10, where the checkpoint callback saved the weights of the first optimum CNN at epoch 32. The accuracy of the first, second, and third optimum models increased by 8%, 8%, and 10%, respectively, after applying this method.
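A minimal sketch of this checkpointing step with the Keras ModelCheckpoint callback is shown below; the file name, epoch budget, and variable names are illustrative assumptions.

```python
# Save the weights whenever the validation accuracy improves, so the best epoch
# (e.g., epoch 32 in Figure 10) is retained even if later epochs degrade.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("optimum_cnn_1.weights.h5", monitor="val_accuracy",
                             save_best_only=True, save_weights_only=True, verbose=1)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=best_batch_size, callbacks=[checkpoint])
model.load_weights("optimum_cnn_1.weights.h5")   # restore the best-performing weights
```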
Figure 9. Representation of the evaluation plot allows us to understand the progression of the search as well as the histograms of explored values. The best point samples are marked with a red star. Dense_nodes1, Dense_nodes2, and Dense_nodes3 refer to the number of nodes in the first, second, and third dense layers, respectively.
Figure 10. Plot of accuracy versus various epochs of the first optimum CNN. The perfect number of epochs (32) is highlighted by the blue circle, which indicates where the checkpoint callback saved the model's weights the last time it was executed.
The hyperparameters for the first iteration of BO were chosen at random (ReLU activation functions, a dropout rate of 50%, 32 nodes for the first three dense layers, a batch size of 64, and Adam as optimizer) to highlight the benefit of hyperparameter optimization and the usage of the checkpoint callback. In Table 3, we compare the performance of the base CNN architecture before and after applying BO and the checkpoint callback. It is evident that the performance of our base CNN model drastically improved after optimizing the activation function, the dropout, the number of nodes, the optimizer, the batch size, and the number of epochs. In terms of the average performance of the three top CNNs, we notice an average increase of 19.3%, 30%, 18.6%, 24.6%, and 27.5% in Accuracy, Precision, Recall, F1-score, and Kappa, respectively.

Table 3. Comparison between the performance of the base CNN architecture before and after applying BO and checkpoint callback.

Base CNN Architecture                              Accuracy  Precision  Recall  F1-Score  Kappa
Before BO + checkpoint callback                    68        58         68      62        54.7
After BO + checkpoint callback: 1st optimum CNN    88        90         88      88        84.01
After BO + checkpoint callback: 2nd optimum CNN    88        88         88      88        83.97
After BO + checkpoint callback: 3rd optimum CNN    86        86         84      84        78.63
Average                                            87.3      88         86.6    86.6      82.2
We combined our three generated CNNs using the Average Ensemble strategy. The
performance evaluation metrics for the proposed models are shown in Table 4. Precision is
described as the positive predictive rate. Based on the Precision results, our Ensemble method surpassed the weak learners' average Precision by 5.7%, which indicates the feasibility and superiority of the Ensemble model designed in this paper. Moreover, it provided 100%
Precision on the left hand and both feet, stressing the importance of aggregating the
performance of multiple weak learners. Our Ensemble method’s Recall exceeded the
average Recall of the weak learners by 5.34%. The F1-score showed a similar increase (5.34%). The "feet" label was identified with 100% confidence based
on the three aforementioned metrics. In terms of Accuracy score, our Ensemble technique
achieved 92%, representing a 5.34% improvement over the weak learners. This is the same
increase rate recorded for Recall and F1-score. Cohen’s kappa is an important statistic that
measures inter-annotator agreement. The employed Ensemble method scored the highest
kappa score, reflecting a 7.13% increase over the weak learners. This significant increase
in Precision (5.7%), Recall (5.34%), Accuracy (5.34%), F1-score (5.34%), and Kappa (7.13%)
demonstrates the feasibility and superiority of the Ensemble model proposed in this study.
Overall, the Ensemble method produces extremely promising results across all five metrics.
CNN can learn complicated features from a heterogeneous dataset, with the ability to
expose the dataset’s hidden characteristics while performing data fitting. Traditional ML
algorithms generally build their models from a fixed set of functions, limiting the model’s
representation capability.
We evaluated our EL-based model against state-of-the-art methods [19–26] using the
same dataset (BCI Competition IV-2a dataset) for fair comparison (Table 6). The accuracy of
our method reached 92%, outperforming all state-of-the-art methods. Average Ensemble,
along with the BO-based hyperparameter optimization, boosted CNN performance.
Table 6. Comparison of the proposed method with state-of-the-art methods on the BCI Competition IV-2a dataset.

Methods                         Accuracy (%)
Yang’s method [19] 69.27
Cheng’s method [20] 70.5
Abbas’s method [21] 70.7
Sakhavi’s method [22] 74.46
Zhao’s method [23] 75
Liu’s method [24] 76.86
Deng’s method [25] 78.9
Dokur's method [26] 79.3
Proposed method 92
To the best of our knowledge, this is the first study to employ BO for optimum
hyperparameter selection in conjunction with the Average Ensemble approach to this
specific EEG MI classification problem. This method was found to be the most effective in
this experiment, with a Precision of 93%, a Recall of 92%, an F1-score of 92%, an accuracy
of 92%, and a kappa score of 89.33%. In the field of medical diagnostics, these values
are considered “very good,” and they can be further enhanced by exploring other EL
strategies. CNN function evaluation is both time-consuming and expensive. In our case, the expensive function is evaluated only in a restricted region of the space because the Expected Improvement (EI) acquisition function helped to locate the promising search region swiftly. The objective function is modeled as a GP, and the model is trained through objective function evaluations. Analytical tracking was possible with the GP model because it provides a posterior distribution over the
objective function. The EL technique utilized in this study automated the randomization
procedure, enabling the researchers and investigators to analyze numerous databases and
collect relevant findings. Instead of being limited to a single classifier, they repeatedly
construct several classifiers while randomly altering the inputs. We can achieve a more
efficient prediction strategy by merging numerous single classifiers into one.
It is true that all three optimum CNNs employ the same topology (two convolutional
layers, two batch norm layers, two max pooling layers, three dropout layers, and four dense
layers). Each model, however, has various hyperparameter values (dropout rate, activation
function, number of dense nodes, batch size, and optimizer) and each model has its own
weaknesses. The idea behind employing an ensemble is that by combining various models
that represent multiple hypotheses about the data, we can come up with a better hypothesis
that is not in the hypothesis space of the models from which the ensemble is constructed.
The three models are complementary in that if one fails, the other succeeds. The accuracy of
this Ensemble method was higher than when a single model was used. This demonstrates
the efficiency of ensemble learning.
4. Conclusions
MI-BCI applications for rehabilitation require a superior performance level and ro-
bustness. The findings of our paper confirm the high potential of our Bayesian optimized
ensembled CNN model for real-life applications. However, more advanced results could
be reached by examining other EL strategies along with better hyperparameter optimizers.
Author Contributions: Conceptualization, S.K.; methodology, S.K.; software, S.K. and I.J.; validation,
S.K., I.A.E.K., I.S.A. and U.K.; formal analysis, S.K. and S.Z.; data curation, S.K.; writing—original
draft preparation, S.K., M.A.A. and M.M.; writing—review and editing, S.K.; supervision, S.Z. All
authors have read and agreed to the published version of the manuscript.
Funding: This article was supported by the Natural Science Foundation of China under Grants 51377045 and 31400844.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Piccione, F.; Giorgi, F.; Tonin, P.; Priftis, K.; Giove, S.; Silvoni, S.; Palmas, G.; Beverina, F. P300-based brain computer interface:
Reliability and performance in healthy and paralysed participants. Clin. Neurophysiol. 2006, 117, 531–537. [CrossRef]
2. Schalk, G.; Mellinger, J. Brain sensors and signals. In A Practical Guide to Brain–Computer Interfacing with BCI2000; Springer:
Berlin/Heidelberg, Germany, 2010. [CrossRef]
3. Mulder, T. Motor imagery and action observation: Cognitive tools for rehabilitation. J. Neural Transm. 2007, 114, 1265–1278.
[CrossRef]
4. Tang, X.; Li, W.; Li, X.; Ma, Z.; Dang, X. Motor imagery EEG recognition based on conditional optimization empirical mode
decomposition and multi-scale convolutional neural network. Expert Syst. Appl. 2020, 149, 113285. [CrossRef]
5. Shoka, A.; Dessouky, M.; El-Sherbeny, A.; El-Sayed, A. Literature Review on EEG Preprocessing, Feature Extraction, and
Classifications Techniques. Menoufia J. Electron. Eng. Res. 2019, 28, 292–299. [CrossRef]
6. Mirowski, P.; Madhavan, D.; LeCun, Y.; Kuzniecky, R. Classification of patterns of EEG synchronization for seizure prediction.
Clin. Neurophysiol. 2009, 120, 149–171. [CrossRef] [PubMed]
7. Ke, H.; Chen, D.; Li, X.; Tang, Y.; Shah, T.; Ranjan, R. Towards brain big data classification: Epileptic EEG identification with a
lightweight VGGNet on global MIC. IEEE Access 2018, 6, 722–733. [CrossRef]
8. Sushkova, O.; Obukhov, Y.; Kershner, I.; Karabanov, A.; Gabova, A. Classification of early-stage Parkinson’s disease in EEG and
tremor time–frequency features space. Parkinsonism Relat. Disord. 2016, 22, e164. [CrossRef]
9. Brunner, C.; Leeb, R.; Muller-Putz, G.; Schlogl, A.; Pfurtscheller, G. BCI Competition 2008—Graz Data Set A; Graz University of
Technology: Graz, Austria, 2008; Volume 16, pp. 1–6.
10. Ye, J.; Li, Q. A Two-Stage Linear Discriminant Analysis via QR Decomposition. IEEE Tran. Pattern Anal. Mach. Intell. 2005,
27, 929–941.
11. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
12. Fraiwan, L.; Lweesy, K.; Khasawneh, N.; Wenz, H.; Dickhaus, H. Automated sleep stage identification system based on
time–frequency analysis of a single EEG channel and random forest classifier. Comput. Methods Programs Biomed. 2012, 108, 10–19.
[CrossRef]
13. Nicolaou, N.; Georgiou, J. Detection of Epileptic Electroencephalogram Based on Permutation Entropy and Support Vector
Machines. Expert Syst. Appl. 2012, 39, 202–209. [CrossRef]
14. Kassam, K.S.; Markey, A.R.; Cherkassky, V.L.; Loewenstein, G.; Just, M.A. Identifying emotions on the basis of neural activation.
PLoS ONE 2013, 8, e66032. [CrossRef] [PubMed]
15. Khan, J.; Bhatti, M.; Khan, U.; Iqbal, R. Multiclass EEG motor-imagery classification with sub-band common spatial patterns.
EURASIP J. Wirel. Commun. Netw. 2019, 2019, 174. [CrossRef]
16. Bhattacharyya, S.; Khasnobish, A.; Konar, A.; Tibarewala, D.N.; Nagar, A.K. Performance analysis of left/right hand movement
classification from EEG signal by intelligent algorithms. In Proceedings of the 2011 IEEE Symposium on Computational
Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), Paris, France, 11–15 April 2011; pp. 1–8. [CrossRef]
17. BCI Competition II. Berlin Brain-Computer Interface (BBCI). Available online: https://www.bbci.de/competition/ii/ (accessed
on 13 January 2022).
18. Wahid, M.; Tafreshi, R. Improved Motor Imagery Classification Using Regularized Common Spatial Pattern with Majority Voting
Strategy. IFAC PapersOnLine 2021, 54, 226–231. [CrossRef]
19. Yang, H.; Sakhavi, S.; Ang, K.K.; Guan, C. On the use of convolutional neural networks and augmented CSP features for
multi-class motor imagery of EEG signals classification. In Proceedings of the 37th Annual International Conference of the IEEE
Engineering in Medicine and Biology Society, Milan, Italy, 25–29 August 2015; pp. 2620–2623.
20. Cheng, P.; Autthasan, P.; Pijarana, B.; Chuangsuwanich, E.; Wilaiprasitporn, T. Towards asynchronous motor imagery-based
brain-computer interfaces: A joint training scheme using deep learning. In Proceedings of the TENCON-IEEE Region 10
Conference, Jeju, Korea, 28–31 October 2018; pp. 1994–1998.
21. Abbas, W.; Khan, N.A. DeepMI: Deep learning for multiclass motor imagery classification. In Proceedings of the 40th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society, Honolulu, HI, USA, 18–21 July 2018;
pp. 219–222.
22. Sakhavi, S.; Guan, C.; Yan, S. Learning temporal information for brain-computer interface using convolutional neural networks.
IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5619–5629. [CrossRef] [PubMed]
23. Zhao, X.; Zhang, H.; Zhu, G.; You, F.; Kuang, S.; Sun, L. A multi-branch 3D convolutional neural network for EEG-based motor
imagery classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 2164–2177. [CrossRef] [PubMed]
24. Liu, M.; Zhou, M.; Zhang, T.; Xiong, N. Semi-supervised learning quantization algorithm with deep features for motor imagery
EEG recognition in smart healthcare application. Appl. Soft Comput. 2020, 89, 106071. [CrossRef]
25. Deng, X.; Zhang, B.; Yu, N.; Liu, K.; Sun, K. Advanced TSGL-EEGNet for Motor Imagery EEG-Based Brain-Computer Interfaces.
IEEE Access 2021, 9, 25118–25130. [CrossRef]
26. Dokur, Z.; Ölmez, T. Classification of Motor Imagery EEG Signals by Using a Divergence Based Convolutional Neural Network.
arXiv 2021, arXiv:2103.10977.
27. Yu, T.; Zhu, H. Hyper-Parameter Optimization: A Review of Algorithms and Applications. arXiv 2020, arXiv:2003.05689.
28. Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based
on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
29. Ke, H.; Chen, D.; Shi, B.; Zhang, J.; Liu, X.; Zhang, X.; Li, X. Improving brain E-health services via high-performance EEG
classification with grouping Bayesian optimization. IEEE Trans. Serv. Comput. 2019, 13, 696–708. [CrossRef]
30. Olivas-Padilla, B.E.; Chacon-Murguia, M.I. Classification of multiple motor imagery using deep convolutional neural networks
and spatial filters. Appl. Soft Comput. 2019, 75, 461–472. [CrossRef]
31. Cheng, D.; Liu, Y.; Zhang, L. Exploring motor imagery EEG patterns for stroke patients with deep neural networks. In Proceedings
of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20
April 2018; pp. 2561–2565.
32. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015,
arXiv:1502.03167.
33. Lee, C.-Y.; Gallagher, P.W.; Tu, Z. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree.
arXiv 2015, arXiv:1509.08985.
34. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks
from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
35. Li, L.; Jamieson, K.; De Salvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparame-
ter Optimization. J. Mach. Learn. Res. 2016, 18, 6765–6816.
36. Betrò, B. Bayesian methods in global optimization. J. Glob. Optim. 1991, 1, 1–14. [CrossRef]
37. Kramer, O.; Ciaurri, D.E.; Koziel, S. Derivative-free optimization. In Computational Optimization, Methods and Algorithms; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 61–83. [CrossRef]
38. Lutins, E. Ensemble Methods in Machine Learning: What Are They and Why Use Them?|by Evan Lutins|Towards Data Science.
2017. Available online: https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-whyuse-
them-68ec3f9fef5f (accessed on 11 December 2020).
39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.;
Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
40. Bisong, E. Building Machine Learning and Deep Learning Models on Google Cloud Platform; Springer: Berlin/Heidelberg,
Germany, 2019.