
Applied Soft Computing 145 (2023) 110558


Multi-objective binary grey wolf optimization for feature selection based on guided mutation strategy

Xiaobo Li a,∗, Qiyong Fu a, Qi Li a, Weiping Ding b,∗, Feilong Lin a, Zhonglong Zheng a

a School of Computer Science and Technology, Zhejiang Normal University, Jinhua, 321004, China
b School of Information Science and Technology, Nantong University, Nantong, 226019, China

Article history: Received 29 November 2022; Received in revised form 28 April 2023; Accepted 15 June 2023; Available online 19 June 2023

Keywords: Feature selection; Multi-objective optimization; Grey wolf optimization; Guided mutation strategy; Dynamic agitation mechanism

Abstract: Feature selection aims to choose a subset of features with minimal feature-feature correlation and maximum feature-class correlation, which can be considered as a multi-objective problem. Grey wolf optimization mimics the leadership hierarchy and group hunting mechanism of grey wolves in nature. However, it can easily fall into local optima in multi-objective optimization. To address this, a novel multi-objective binary grey wolf optimization based on a guided mutation strategy (GMS), called MOBGWO-GMS, is proposed. In the initialization phase, the population is initialized based on feature correlation, and features are selected using a uniform operator. The proposed GMS uses the Pearson correlation coefficient to provide direction for local search, improving the local exploration ability of the population. Moreover, a dynamic agitation mechanism is used for perturbation to prevent population stagnation due to the use of a single strategy. The strategy is dynamically adjusted to maintain population diversity and improve detection ability. To evaluate the classification ability of quasi-optimal subsets, a wrapper-based k-nearest neighbor classifier was employed. The effectiveness of the proposed algorithm was demonstrated through an extensive comparison with eight well-known algorithms on fourteen benchmark datasets. Experimental results showed that the proposed approach is superior in terms of the optimal trade-off between the two fitness evaluation criteria and can more easily jump out of local optima compared to the other algorithms.

© 2023 Elsevier B.V. All rights reserved.

∗ Corresponding authors.
E-mail addresses: [email protected] (X. Li), [email protected] (Q. Fu), [email protected] (Q. Li), [email protected] (W. Ding), [email protected] (F. Lin), [email protected] (Z. Zheng).
URL: http://mypage.zjnu.edu.cn/lxb/zh_CN/index.htm (X. Li).
https://doi.org/10.1016/j.asoc.2023.110558
1568-4946/© 2023 Elsevier B.V. All rights reserved.

1. Introduction

Feature selection (FS) is an essential step in the data preprocessing stage of machine learning and data mining, and has been widely applied in various research fields such as bioinformatics [1], image processing [2], and intrusion detection [3]. The purpose of FS is to select an optimal subset of features related to classification in the original feature space and eliminate irrelevant or redundant features [4]. For an n-dimensional dataset, there are 2^n possible solutions for FS, which is a typical NP-hard problem. Search or optimization techniques are required for finding an optimal feature subset when the dimensionality of the feature space is relatively high. Heuristic search methods such as sequential forward selection and sequential backward selection are therefore adopted. However, since these two methods select features by adding or deleting them, the process cannot be modified after each feature is added or deleted, and both methods suffer from the ''nesting effect''. After that, a group of meta-heuristic (MH) approaches [5–7], such as the genetic algorithm (GA) [8], particle swarm optimization (PSO) [9], Harris hawks optimization (HHO) [10], the artificial bee colony algorithm (ABC) [11], ant colony optimization (ACO) [12], the whale optimization algorithm (WOA) [13], binary grey wolf optimization (BGWO) [14], and the binary dandelion algorithm (BDA) [15], has been proposed as efficient methods for solving FS issues.

To date, the vast majority of researchers treat FS as a single-objective optimization problem (SOP) to maximize classification accuracy (or minimize classification error rate). Some researchers consider the size of the selected feature subset together with the classification error rate, and they utilize a weighted-sum method to combine several objectives (like the number of features and classification accuracy) into a single objective to be optimized [10,16–21]. For the weighted-sum method, a suitable weight parameter is essential. However, determining a suitable weight for each objective is difficult, especially when dealing with high-dimensional data.
Nowadays, the FS problem may be better modeled as a multi-objective optimization problem (MOP), which usually has two main conflicting objectives, namely minimizing the number of selected features and minimizing the classification error rate. At present, the non-dominated sorting genetic algorithm-II (NSGA-II) [22], the multi-objective artificial bee colony algorithm (MOABC) [23], multi-objective particle swarm optimization (MOPSO) [24], and other algorithms have been proposed for multi-objective FS problems.

Grey wolf optimization (GWO) [25] has recently received extensive attention from scholars owing to its strong global search capability, few parameters, and easy implementation. According to previous studies, GWO has good exploration and exploitation ability to update the solutions during optimization, and it has been successfully applied to anomaly detection [26], Arabic text classification [27], disease diagnosis [28], and feature selection [29] problems. Although the algorithm has excellent performance in many fields, many researchers have continuously improved it to address its inherent defects when solving the FS problem. However, from some of the literature, we found that some random mechanisms in GWO may significantly affect the convergence and diversity of the population. In particular, the leader is selected randomly from the Pareto optimal solutions in the iterative process, which has a great influence on the late convergence of the algorithm. Unfortunately, in many multi-objective GWOs, little research has focused on this issue. It should be noted that when an algorithm converges at a later stage, each individual needs to explore near its current location, rather than being influenced by a randomly selected leader to explore a feature space far from its current location.

Most MH algorithms use random initialization strategies in the initial population stage. The disadvantage of this method is that, on the one hand, the initialized population carries less gainful information and wastes computer resources in the early iterations of the algorithm; on the other hand, especially for high-dimensional datasets, the initialized solutions are concentrated together, which cannot adequately explore the search space or expand the search range. To overcome these limitations, Wang et al. [30] introduced a population initialization method with mixed initialization and threshold selection, in which principal component analysis is utilized to sort the importance of features. Song et al. [31] proposed an effective population initialization strategy according to the correlation between features and class labels to accelerate the convergence of the population. The Pearson correlation coefficient (PCC) is both an upgrade of the Euclidean distance and an improvement over cosine similarity in the absence of dimension values, which can avoid the problem of rating scale inflation. In short, the PCC can more accurately calculate the correlation between two sets of data.

Our paper aims to create a novel guided mutation strategy (GMS) that collaborates with GWO to jointly search for optimal solutions and prevents a single search mechanism from getting trapped in a local optimum. The GMS utilizes feature-feature and feature-class correlations to guide the population search, thereby increasing the likelihood of avoiding local optima and guiding the population towards better evolution. Unlike existing hybrid search algorithms, the proposed algorithm is not a mere combination of strategies; instead, it adapts the search strategy based on the current iterative state until an optimal solution is found. The main innovations and contributions of this paper include the following aspects.

1. A novel initialization strategy is proposed. The strategy first assigns a weight to each feature (the weight is calculated by the PCC for feature-class correlation), then applies binary tournament selection to initialize the population. The proposed uniform operator maintains a uniform distribution of individuals.
2. A new GMS is designed to combine with GWO to accelerate convergence. GMS is unique in that it is guided by both the correlation between features and the correlation between features and class labels. For local exploration, this strategy contains a triple-mutation mechanism.
3. A dynamic agitation mechanism is proposed to detect whether the algorithm is trapped in a local optimum and dynamically change the search strategy. The archives are handled concurrently to prevent over-storage and performance impacts.

The remainder of this paper is organized as follows. An introduction to the related work is given in Section 2 and the background information is presented in Section 3. Section 4 characterizes the proposed method. The experimental design and results are demonstrated in Section 5. The conclusions and some perspectives for future work are provided in Section 6.

2. Related work

2.1. Single-objective meta-heuristic algorithm for feature selection

Swarm-based MHs have become increasingly popular for solving FS problems due to their excellent optimization capabilities. Some examples of such MHs are GA, PSO, HHO, ACO, ABC, WOA, and their variants. Most researchers tend to treat the FS problem as a SOP since they may only focus on a certain optimization objective. Tahir et al. [32] introduced a binary chaotic genetic algorithm (BCGA) that utilizes chaotic maps to enhance the population initialization and mutation method of the conventional GA. The BCGA replaces the random variable selection process with chaotic sequences, thus improving the quality of solutions. However, this improvement comes at the cost of greater computational complexity and increased time consumption due to additional algorithmic operations. Li et al. [33] developed a hybrid FS technique for high-dimensional datasets that combines a Q-learning prescreening approach based on the maximum information coefficient with an improved PSO method. Paniri et al. [34] proposed using the PCC to measure both the redundancy and relevance of features. Specifically, the minimum correlation between features is used to measure redundancy, while the maximum correlation between features and class labels is used to measure relevance. Karimi et al. [35] employed a differential reinforcement learning algorithm to accomplish semi-supervised classification tasks, which explores the feature space by conducting a non-linear function search. Hashemi et al. [36] created multiple filter methods for ranking features, which are then used to guide population evolution. Bayati et al. [37] presented an efficient method that combines subspace learning and a memetic algorithm to search the feature space, which demonstrates high efficacy in feature space exploration.

The GWO algorithm is a simple and comprehensible optimization algorithm that is easy to implement. Its limited parameter tuning makes it a suitable choice for FS problems [38]. Recent studies have applied GWO to FS, and Table 1 provides a summary of these applications.

2.2. Multi-objective meta-heuristic algorithm for feature selection

When selecting features, multiple conflicting objectives need to be optimized simultaneously. These objectives may involve maximizing predictive accuracy, minimizing the size of feature subsets, minimizing computational complexity, and other factors. Therefore, numerous researchers frequently regard FS as a MOP. NSGA-II [22], a genetic algorithm designed for classical MOPs, was applied to tackle the multi-objective feature subset selection problem in [46]. Amoozegar et al. [47] proposed a multi-objective FS method based on PSO, which ranks solutions based on their frequency within the archive set; individuals with higher rankings are utilized to guide the evolution of both other individuals and the archives.
Table 1
Summary of research on GWO algorithms for feature selection in recent years.

COGWO2D (2018) [39]. Basic information: Ibrahim et al. integrated GWO with differential evolution. Merits: The COGWO2D enhances the effectiveness of local search; moreover, this method incorporates a disruption operator to preserve population diversity. Limits: The OBL strategy and the disruption operator contribute to an increase in computational complexity.

MEGWO (2019) [40]. Basic information: Tu et al. introduced a multi-strategy integrated FS technique that incorporates several strategies. Merits: The enhanced global-best lead strategy, adaptable cooperative strategy, and disperse foraging strategy collectively balance the trade-off between exploitation and exploration. Limits: Each strategy introduces additional parameters.

GWOCSA (2019) [41]. Basic information: Arora et al. proposed a FS technique that utilizes both the GWO and the crow search algorithm. Merits: The GWOCSA aims to combine the strengths of both algorithms to produce highly favorable quasi-optimal solutions. Limits: The GWOCSA in its simple mode is prone to premature convergence.

EGWO (2019) [42]. Basic information: Zhao et al. improved the GWO algorithm by incorporating a chaotic search technique to identify the optimal feature subset. Merits: The EGWO's effectiveness in diagnosing patients with paraquat poisoning was evaluated, and it achieved AUC, accuracy, sensitivity, and specificity values exceeding 90%. Limits: The search capacity of GWO requires additional enhancement.

Improved BGWO (2020) [43]. Basic information: Hu et al. enhanced the updating equation of the BGWO and implemented it in addressing binary optimization problems. Merits: The improved BGWO methods have better classification accuracy. Limits: The focus of the experiments has been to analyze the essential role of the transfer function in GWO, and the comparison algorithms lack originality.

TMGWO (2020) [20]. Basic information: Abdel-Basset et al. developed a two-stage mutation FS technique that enhances the local search capability of GWO. Merits: The first stage of the mutation method can reduce the number of selected features, while the following stage can enhance the classification accuracy of TMGWO by introducing supplementary information. Limits: The TMGWO takes much time to complete the evaluation process.

IGWO (2021) [44]. Basic information: Shen et al. designed a two-stage method that combines GWO with a neural network for small samples of high-dimensional biomedical data. Merits: The benefits of the IGWO become increasingly prominent in terms of time consumption, classification accuracy, and feature subset size as the dimensionality of the FS problem grows. Limits: The IGWO may have a potential drawback due to its inherent complexity, which could make implementation challenging. Furthermore, selecting suitable parameters could prove to be a difficult task.

GWORS (2021) [45]. Basic information: Sathiyabhama et al. presented a new FS method that utilizes a combination of GWO and rough set theory to identify important features from mammograms. Merits: GWORS integrates both GWO and rough set theory to improve the accuracy of the feature subset. Limits: When dealing with unfamiliar datasets, where the feature size is unknown, the accuracy of GWORS may decrease significantly.

BIGWO (2022) [28]. Basic information: Rajammal et al. created an approach that utilizes GWO to categorize Parkinson's disease using a set of features optimized for performance. Merits: Applying the mutation operation of BIGWO improves exploration capacity in FS, resulting in higher accuracy when classifying Parkinson's disease. Limits: The dataset used in the experiment is low-dimensional, and the approach presents challenges when applied to other complex classification tasks.

Hu et al. [48] developed a fuzzy multi-objective FS technique which utilizes a fuzzy dominance relation to assess the quality of quasi-optimal particles and defines a fuzzy crowding distance metric to select the global leader of the particles. Zhang et al. [49] studied a bi-objective artificial bee colony algorithm that simultaneously considers the feature cost and classification performance, and utilizes a convergence-guided search and a dual external archive strategy, leading to improved retrieval by different species of bees. Wang et al. [50] developed a bi-objective FS framework which incorporates a sample selection strategy and a stepped sample utilization strategy based on K-means clustering to obtain better feature subsets with less running time. Piri et al. [51] introduced a bi-objective optimization technique based on the Harris hawk optimizer, which utilizes the crowding distance as a third criterion for selecting quasi-optimal solutions. Zhou et al. [52] introduced a two-stage particle cooperative strategy for a multi-objective FS method, which includes optimizing a distance metric as the third objective. Their approach combines randomly initialized particles and Relief-filtered particles to form an initial population and designs a reset operation.

In recent years, researchers have investigated the potential of the GWO algorithm for tackling multi-objective FS problems. Mirjalili et al. [58] introduced the first multi-objective version of GWO, and subsequent researchers have further improved their algorithms, leading to the development of MOGWO. Table 2 summarizes recent research on multi-objective FS based on GWO.

From Tables 1 and 2, although researchers have put significant efforts into solving FS problems based on GWO, the algorithms mentioned still have the following limitations and challenges. In particular, the existing research is not sufficient for solving MOPs based on GWO.

1. Random initialization may result in individual distributions within discrete or aggregated groups. The quality of the initial populations cannot be guaranteed, which can limit GWO's ability to explore and exploit the search space effectively.
2. GWO may become trapped in a local optimum if it does not explore the search space adequately, resulting in suboptimal feature subsets.
3. In most of the literature, the lead wolves were randomly selected from the non-dominated solutions. The track of the lead wolves therefore has a certain degree of randomness, which fails to speed up convergence in the later stage.
4. Many improved multi-objective GWO algorithms only prove their effectiveness in low-dimensional spaces and need further research when dealing with complex high-dimensional spaces.

3. Background

In this section, multi-objective optimization, grey wolf optimization and the Pearson correlation coefficient are introduced.
Table 2
Summary of research on MOGWO algorithms for feature selection in recent years.

MO-GWO (2015) [53]. Basic information: Emary et al. proposed a multi-objective FS method that combines GWO and mutual information. Merits: MO-GWO can filter out redundant information early and improve search performance in later stages. Limits: Its effectiveness was only tested on low-dimensional datasets with fewer than 18 features; it is difficult to extend MO-GWO to high-dimensional datasets.

NSGWO (2017) [54]. Basic information: Sahoo and Chandra developed two different multi-objective GWO algorithms, named MOGWO and NSGWO. MOGWO utilizes a linear scalarization method, while NSGWO is an a posteriori approach. Merits: The NSGWO algorithm achieves a remarkable accuracy of 91.1% when used for predicting cervical cancer. Limits: The NSGWO lacks more appropriate and comprehensive indicators to reflect the quality of its solutions, in order to increase practicality and effectiveness.

BMOGWO-S (2020) [55]. Basic information: Al-Tashi et al. proposed a binary multi-objective FS algorithm that utilizes a combination of MOGWO and the sigmoid function. The algorithm employs an artificial neural network to assess the classification performance of different feature subsets. Merits: BMOGWO-S is a cost-effective algorithm for identifying a set of non-dominated solutions; it is computationally efficient compared to other algorithms. Limits: There is a shortage of algorithms available for comparison; the advantage of the BMOGWO-S is primarily dependent on the performance of the original GWO; the datasets used in the experiments have low dimensionality.

MOBGWO (2021) [56]. Basic information: Mendeley et al. proposed a multi-objective binary GWO method for FS, which is tested on a dataset for predicting energy usage in appliances. Merits: The MOBGWO method demonstrates efficacy in predicting the energy consumption of appliances and can be applied to different types of datasets. Limits: The execution time of the selected regressors is heavily reliant on their configuration parameters, which should be adjusted according to the dataset's characteristics.

Aided BMOGWO-S (2023) [57]. Basic information: Ukken et al. developed a multi-objective feature selection algorithm based on the GWO technique. The algorithm utilizes statistical information from the dataset to select relevant features, while also incorporating techniques to improve its runtime efficiency. Merits: The performance of aided BMOGWO-S can be achieved through the utilization of statistical information. Limits: When dealing with datasets containing few features, the performance of the aided BMOGWO-S may be less effective than filter-based methods, especially when considering the time it takes to run the method.

3.1. Multi-objective optimization

A multi-objective optimization problem (MOP) is a mathematical problem involving the simultaneous optimization of multiple objective functions, and the objective functions of such problems are often conflicting. The general mathematical description of a multi-objective optimization problem is as follows:

$\min F(X) = \{f_1(X), f_2(X), \ldots, f_m(X)\}^{T}, \quad \mathrm{s.t.}\ X \in \Omega$   (1)

where F(X) is the objective function, m is the number of objective functions, and X ∈ Ω means X belongs to the decision space.

$\forall i \in \{1, 2, \ldots, m\}: f_i(c) \le f_i(d) \ \ \text{and} \ \ \exists j \in \{1, 2, \ldots, m\}: f_j(c) < f_j(d)$   (2)

where m is the number of objective functions, and c and d are two solutions of a MOP.

In a MOP, the optimal solution has to be determined when two or more conflicting objectives are weighed against each other. To clarify the advantages and disadvantages of MOP solutions, a special relationship needs to be defined to compare solutions. For example, let c and d be two solutions of the MOP. If Eq. (2) is met, then c is better than d (c dominates d). Among all solutions to the MOP, if solution c satisfies Eq. (2) with respect to any other solution, then c is called a Pareto solution. The set of all Pareto solutions is called the Pareto solution set or Pareto front.

3.2. Grey wolf optimization

Mirjalili et al. [25] designed grey wolf optimization (GWO) to imitate the strict social dominance hierarchy and group hunting mechanism of grey wolves in nature. The grey wolves in the GWO algorithm have four levels: α wolves, β wolves, δ wolves, and ω wolves. α wolves are the leaders and are mainly responsible for establishing the rules of behavior for the entire population. β wolves and δ wolves are subordinates of the α wolves, whose primary duties are to help and tail the α wolves. ω wolves are the lowest-ranked wolves and must obey the other lead wolves. The optimization process of GWO includes grey wolf social stratification, tracking prey, encircling prey and attacking prey.

(1) Social hierarchy. The top three fittest solutions are treated as the α, β, and δ wolves, respectively, and the remaining candidate solutions are considered as ω wolves. During the hunting process, the ω wolves follow the α, β and δ wolves.

(2) Encircling prey. The following equation models the prey being encircled by the grey wolves:

$\vec{x}_i^{\,t+1} = \vec{x}_p^{\,t} - \vec{A} \cdot \vec{D}$   (3)

where $\vec{x}_i^{\,t+1}$ is the position vector of the ith wolf at iteration t+1, $\vec{x}_p^{\,t}$ is the position vector of the prey, and $\vec{A}$, $\vec{D}$ are two coefficient vectors which affect the position of the wolves.

The vectors $\vec{A}$ and $\vec{D}$ are obtained through the following calculation:

$\vec{A} = 2\vec{a} \cdot \vec{rand}_1 - \vec{a}$   (4)

$\vec{D} = \left| \vec{C} \cdot \vec{x}_p^{\,t} - \vec{x}^{\,t} \right|$   (5)

where the components of $\vec{a}$ decrease linearly from 2 to 0 during the iteration process, $\vec{rand}_{i=1,2}$ are random vectors in [0, 1], and $\vec{x}^{\,t}$ is the position vector of the ith wolf at iteration t.

The vectors $\vec{C}$ and $\vec{a}$ are calculated as follows:

$\vec{C} = 2\,\vec{rand}_2$   (6)

$\vec{a} = 2 - I \cdot \left( \frac{2}{MaxIteration} \right)$   (7)

where I is the number of iterations and MaxIteration is the total number of iterations.
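To make the update rules above concrete, the following Python sketch (illustrative only; the paper's experiments were implemented in MATLAB, and the function and variable names are ours) computes one encircling move of a wolf towards a single leader using Eqs. (3)–(7). The hunting step of Eqs. (8)–(11), described next, averages three such moves towards the α, β and δ wolves.

```python
import numpy as np

def gwo_move_towards(leader, wolf, iteration, max_iteration, rng):
    """One encircling step of a wolf towards a leader (Eqs. (3)-(7)).

    leader, wolf : 1-D continuous position vectors of equal length.
    Returns the candidate position induced by this leader.
    """
    dim = wolf.shape[0]
    # Eq. (7): a decreases linearly from 2 to 0 over the iterations.
    a = 2.0 - iteration * (2.0 / max_iteration)
    # Eq. (4): A = 2*a*rand1 - a, Eq. (6): C = 2*rand2.
    A = 2.0 * a * rng.random(dim) - a
    C = 2.0 * rng.random(dim)
    # Eq. (5): distance between the (scaled) leader and the current wolf.
    D = np.abs(C * leader - wolf)
    # Eq. (3): move relative to the leader.
    return leader - A * D
```

In the hunting step, this move is computed once per leader and the three candidates are averaged, which is exactly what Eq. (8) below expresses.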
(3) Hunting. This process simulates the hunting behavior of grey wolves mathematically. The top three best solutions are kept, and the other search agents are updated according to these best search agents. The formula is as follows.

$\vec{x}^{\,t+1} = \frac{\vec{x}_1 + \vec{x}_2 + \vec{x}_3}{3}$   (8)

where $\vec{x}^{\,t+1}$ is the position vector of the ith wolf at iteration t+1, and $\vec{x}_1$, $\vec{x}_2$, $\vec{x}_3$ are the positions induced by the three best wolves, computed as follows.

$\vec{x}_{1,2,3} = \vec{x}_{\alpha,\beta,\delta} - \vec{A}_{1,2,3} \cdot \vec{D}_{\alpha,\beta,\delta}$   (9)

$\vec{D}_{\alpha,\beta,\delta} = \left| \vec{C}_{1,2,3} \cdot \vec{x}_{\alpha,\beta,\delta} - \vec{x}^{\,t} \right|$   (10)

$\vec{C}_i = 2 \cdot \vec{rand}_{i=1,2,3}$   (11)

where $\vec{x}_\alpha$, $\vec{x}_\beta$ and $\vec{x}_\delta$ represent the positions of the α wolf, β wolf and δ wolf, respectively.

(4) Attacking prey and searching for prey. During this process, the vector $\vec{A}$ controls whether to attack the prey: when |A| < 1, the wolf is forced to attack the prey; when |A| > 1, the grey wolf is forced to leave the prey in the hope of finding more suitable prey.

3.3. Pearson correlation coefficient

The Pearson correlation coefficient (PCC) is a statistical method to measure the degree of correlation between two variables [59]. In machine learning, it can be used to calculate the similarity between features and classes (or between features), to determine whether the extracted features and classes (features) are positively correlated, negatively correlated, or uncorrelated. The correlation coefficient is expressed by COR(X, Y) as follows.

$COR(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) \cdot (Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \cdot \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$   (12)

where $\bar{X}$ and $\bar{Y}$ are the means of the features X and Y, respectively. The PCC is obtained by dividing the covariance between the variables by the product of their standard deviations.

4. The proposed algorithm

The proposed MOBGWO-GMS utilizes an external archive to save elite solutions. The selection of the elite solutions is based on non-dominated sorting and crowding distance, which helps to achieve a well-distributed Pareto front. Fig. 1 gives the overall framework of the algorithm and the pseudocode is presented in Algorithm 1. If necessary, the authors can be contacted for the source code of the MOBGWO-GMS.

Fig. 1. Overall framework of MOBGWO-GMS.

4.1. Fitness function

The most important challenge of an optimization problem is to define the objective function. FS can be treated as a MOP with two competing objectives: maximize classification accuracy and minimize feature subset size. The following definitions are applied to the fitness functions in this study.

$f_1 = f\_select$   (13)

$f_2 = \frac{FP + FN}{TP + TN + FP + FN}$   (14)

where $f_1$ represents the number of features, and $f_2$ represents the classification error rate. The feature subset size is denoted by f_select. FP, TP, FN and TN indicate false positives, true positives, false negatives, and true negatives, respectively.

4.2. Update of the external archive

The external archive is used to store the non-dominated solutions from the beginning of the iteration process to the completion of the current iteration; the non-dominated solution set found at each iteration is extracted, and the archive is then updated. A larger external archive is not always better, and a dynamic increase of the external archive may affect the performance of the algorithm. Therefore, the external archive is generally set to be equal to the size of the population. When the number of non-dominated solutions is greater than the size of the archive, the crowding distance of all non-dominated solutions is calculated, and a truncation operation is used to delete some of the most crowded non-dominated solutions to maintain the external archive.

4.3. Initialization of the population

Inspired by Song et al. [31], we propose an initial population generation strategy guided by feature-class correlation. In this method, the PCC is used to calculate the correlation between features and classes, and the population is initialized with the newly proposed uniform operator. For a population of size N and dimension M, the initialization formula for each individual is as follows.

$\vec{X}_i = \frac{(ub - lb)}{2} + \frac{(ub - lb)}{2} \cdot \vec{R} \cdot \vec{Sel}$   (15)

where ub and lb are the upper and lower limits of the feature; $\vec{Sel}$ is an M-dimensional vector that indicates whether each feature is selected or not ($Sel_i = 1$ if selected and $-1$ otherwise, $i \in M$); and $\vec{R}$ is an M-dimensional random vector. Since $\vec{R}$ is a random vector in Eq. (15), $\frac{(ub - lb)}{2} \cdot \vec{R} \cdot \vec{Sel}$ will produce random values between [−0.5, 0.5].
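A minimal sketch of the uniform operator in Eq. (15) is given below, assuming ub = 1 and lb = 0 (the bounds used later in the experiments). This Python fragment is an illustration, not the authors' implementation.

```python
import numpy as np

def init_individual(sel, ub=1.0, lb=0.0, rng=np.random.default_rng()):
    """Build one continuous position vector from a +/-1 selection vector (Eq. (15)).

    sel : array of +1 (feature selected) and -1 (feature not selected).
    With ub=1 and lb=0, selected features land in [0.5, 1] and
    unselected ones in [0, 0.5].
    """
    sel = np.asarray(sel, dtype=float)
    r = rng.random(sel.shape[0])                      # M-dimensional random vector R
    return (ub - lb) / 2.0 + (ub - lb) / 2.0 * r * sel

# Example: a 6-dimensional individual where features 0 and 3 are selected.
x = init_individual(np.array([1, -1, -1, 1, -1, -1]))
```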

The individuals produced by Eq. (15) will thus possess the following traits: (i) $X_i \in [0.5, 1]$ if $Sel_i = 1$; (ii) $X_i \in [0, 0.5]$ if $Sel_i = -1$. As can be observed, there is some randomness in the initialized populations produced in this manner. In addition, the initialized population's quality is ensured because $\vec{Sel}$ is a feature vector built using the PCC.

The only values of $\vec{Sel}$ are 1 or −1: a feature is assigned 1 if it is selected, and −1 if it is not. Therefore, we restrict the number of 1s in the corresponding $\vec{Sel}$ for each individual, in order to improve the distribution of the initial population in the feature space. This quantity is controlled by $numF_i$, as shown in Eq. (16). By using binary tournament selection, $numF_i$ features are chosen from the feature space, with the matching $\vec{Sel}$ entries set to 1 and the remaining features set to −1.

The $numF_i$ is calculated as follows:

$numF_i = 1 + \left\lfloor \frac{(i - 1) \times (M - 2)}{N} \right\rfloor$   (16)

where i is the index of the individual, M is the dimension of the problem, N is the population size, and $\lfloor \cdot \rfloor$ represents the largest integer that does not exceed its argument. This equation produces $numF_i$ values that are evenly dispersed. The proposed initialization strategy is used in the right part of Fig. 2, whereas the random initialization strategy is used in the left part. The vertical axis represents the number of individuals, and the horizontal axis denotes the number of attributes that have been chosen. Assuming there are 1000 individuals in the population and the feature dimension is 50, it can be seen that the population generated in the right part is more diverse.

Fig. 3 is a schematic diagram of population initialization. The whole process of initialization is as follows: First, the $numF_i$ value of each individual is calculated by Eq. (16). Then, guided by the correlation between features and class labels, binary tournament selection is used to select $numF_i$ features. The above two points are combined to calculate $\vec{Sel}$. The number of selected 1s in $\vec{Sel}$ is controlled by $numF_i$, and the positions where 1 is chosen are controlled by the correlation between the features and the class label. Finally, Eq. (15) is used to obtain the initialized population.

Algorithm 1 MOBGWO-GMS
Input: Population size (N), dataset and its dimension (M), upper and lower bounds of the search space (ub, lb), threshold (thres), maximum number of iterations (maxIter)
Output: The optimal Pareto front.
1: Divide the dataset into a training set and a testing set;
2: Initialize wolves and parameters;
3: Fitness assessment and non-dominated ranking;
4: Initialize the elite archive (Ar) and the dynamic detection archive (DA) (with maximum size equal to N);
5: Calculate the correlation between features and class labels and between features and features;
6: while stopping criteria not met do
7:   Update the adapter;
8:   if rand > adapter then
9:     Calculate the crowding distance (CD) of each solution in Ar;
10:    Select α, β and δ wolves based on CD;
11:    Update the current wolf's position according to Eq. (8);
12:    Discretize using Eq. (17);
13:    Update parameters;
14:    Evaluate the fitness of individual wolves;
15:    Update the α, β and δ wolves;
16:  else
17:    Calculate the number using Eq. (19);
18:    Select the GoodPop population according to the number;
19:    for each individual in GoodPop do
20:      Construct modGS with dependency bootstrap;
21:      Take the triple-mutation strategies;
22:      Mutate the population using Eq. (21);
23:      Screen the mutated populations and eliminate the disadvantaged populations;
24:    end for
25:  end if
26:  Fitness assessment and non-dominated ranking;
27:  Update archives Ar and DA;
28: end while

4.4. Multi-objective binary grey wolf optimization

GWO has a fine global search ability and shows promising performance in various fields. The original GWO algorithm was proposed for continuous optimization problems and cannot be directly used to solve multi-objective FS problems. In GWO, since the position vector is a continuous value, the candidate solutions can move continuously in the search space. In order to enable the search candidates to move in a binary search space, Al-Tashi et al. [55] developed a binary version based on the sigmoid activation function, called BGWO. The activation function is as follows.

$Sigmoid(x) = \frac{1}{1 + e^{-10(x - 0.5)}}$   (17)

Since a MOP is studied in this paper, three best solutions on the Pareto frontier should be found as leaders. To guide other search members to a promising area in the search space, a method is proposed to select the α, β and δ wolves according to a probability, so as to find similar or better solutions; the equation is shown below.

$X_{\alpha,\beta,\delta} = \begin{cases} \text{Randomly selected} & \text{if } rand > thres \\ \text{Crowding distance selected} & \text{else} \end{cases}$   (18)

where thres is a predefined threshold. ''Randomly selected'' means that three individual solutions are randomly selected from the non-dominated solution set and assigned to the α, β and δ wolves. ''Crowding distance selected'' is similar to the method proposed in [55]: the solutions with the largest crowding distances are allocated to the three lead wolves.
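The discretization of Eq. (17) and the leader selection of Eq. (18) can be sketched as follows. This Python fragment is illustrative only; in particular, comparing the sigmoid output against a uniform random number is our assumption, since the paper only specifies the transfer function itself.

```python
import numpy as np

def discretize(position, rng=np.random.default_rng()):
    """Map a continuous position to a binary feature mask via Eq. (17).

    The stochastic thresholding (sigmoid vs. a uniform random number) is an
    assumption; the text only gives the sigmoid transfer function.
    """
    prob = 1.0 / (1.0 + np.exp(-10.0 * (position - 0.5)))
    return (prob >= rng.random(position.shape[0])).astype(int)

def pick_leaders(archive, crowding, thres=0.5, rng=np.random.default_rng()):
    """Choose alpha, beta, delta indices from the external archive (Eq. (18)).

    archive  : list of non-dominated solutions.
    crowding : crowding distance of each archive member.
    """
    if rng.random() > thres:
        # Randomly selected: three archive members drawn at random.
        return rng.choice(len(archive), size=3, replace=len(archive) < 3)
    # Crowding distance selected: the three least crowded members.
    return np.argsort(crowding)[::-1][:3]
```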
Fig. 2. Comparison of random initialization and uniform initialization.
Fig. 3. Population initialization.
Fig. 4. The principle of individual splitting.
Fig. 5. Linear and non-linear decreasing.

4.5. Guided mutation strategy

A novel guided mutation strategy (GMS) based on the PCC is proposed to guide mutation. The novelty of the proposed strategy is that, in addition to the feature-class correlation, it also uses the feature-feature correlation to guide population evolution. Triple-guided mutation strategies (increase only mutation, decrease only mutation, and simultaneous increase and decrease mutation) are proposed. This proposed strategy can enhance the efficiency of space exploration and improve the convergence speed. Each individual is divided into two parts, as shown in Fig. 4. The first part contains the features that the individual has selected, while the second part contains the features that the individual has not selected. All triple-mutation strategies are guided mutations informed by the unselected features and the selected features. The two primary components of this strategy are: (i) both feature-class and feature-feature correlations can be taken into account for unselected or selected features; (ii) a higher feature-class correlation may be good enough for improved classification accuracy, whereas a higher feature-feature correlation may result in high redundancy. Together, these two characteristics control population evolution and prevent it from occurring blindly without prior knowledge.

GMS is a guided mutation operation that focuses on individuals with different non-dominated solution ranks to speed up the convergence rate and find additional non-dominated solutions. The essential steps in this operation are: (i) choosing the individuals with a non-dominance rank less than or equal to the number, called GoodPop; (ii) carrying out the triple-guided mutation strategies for each selected individual. Eq. (19) gives the formula for updating the number.

$number = \left\lfloor e^{1.5 \times (1 - adapter)} \right\rfloor$   (19)

where adapter is a variable that increases with the number of iterations, as illustrated in Eq. (20). number decreases non-linearly from 4 to 1. The reason why the initial value of number is set to 4 is to maintain population diversity. Since the rate of a linear decline is constant, it easily leads to insufficient local exploration in the late stage, so we use a non-linear decreasing method, which forces the non-dominated solution set to explore the feature space as completely as possible. In Fig. 5, the abscissa is the adapter, and the ordinate represents number. The red curve represents the value of number under the linear method, while the blue curve represents the value of number under the non-linear method. As shown in this figure, the non-linear method explores individuals with lower non-dominance ranks more completely than the linear decreasing method.

The adapter is calculated as follows:

$adapter = \frac{iter}{maxIter}$   (20)

where iter is the current iteration number and maxIter is the maximum number of iterations.
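The schedule defined by Eqs. (19) and (20) can be checked with a few lines of code. The sketch below is illustrative only; the linear schedule is a hypothetical baseline added purely for the comparison discussed around Fig. 5. It prints how number shrinks from 4 to 1 over maxIter = 50 iterations.

```python
import math

def number_nonlinear(it, max_iter):
    """number = floor(exp(1.5 * (1 - adapter))) with adapter = it / max_iter (Eqs. (19)-(20))."""
    adapter = it / max_iter
    return math.floor(math.exp(1.5 * (1.0 - adapter)))

def number_linear(it, max_iter):
    """A hypothetical linear schedule from 4 down to 1, shown only for comparison."""
    return round(4 - 3 * it / max_iter)

for it in range(0, 51, 10):          # maxIter = 50, as in the experiments
    print(it, number_nonlinear(it, 50), number_linear(it, 50))
```

The non-linear value drops to 1 well before the final iterations, so the late search concentrates on the lowest non-dominance rank, which is the behavior motivated above.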

Assuming that the size of GoodPop equals gN, the size of GoodPop obtained after the triple-mutation strategies is 3gN. A truncation method is used to choose N individuals as a new population if 3gN is more than N. If not, some individuals are randomly selected from the previous population and incorporated into the new population. The following equation is used to update the population:

$\vec{X}^{\,t+1} = \vec{X}^{\,t} + \frac{(ub - lb)}{2} \cdot \vec{rand} \cdot \vec{modGS}$   (21)

where $\vec{X}^{\,t}$ is a solution in GoodPop, and $\vec{rand}$ represents an M-dimensional random vector. $\vec{modGS}$ is an M-dimensional vector, which is controlled by number and the triple-mutation strategies. The newly generated population moves mainly based on $\vec{modGS}$, and the second half of the equation is used to control the step vector of the population. The triple-mutation mechanism is described in detail below.

Fig. 6. Increase only mutation strategy.
Fig. 7. Decrease only mutation strategy.
Fig. 8. Simultaneous increase and decrease mutation strategy.

4.5.1. Increase only mutation strategy

This strategy's major goal is to increase the number of selected features by making the proper mutations, as described in the following.

Step 1: Use the binary tournament strategy repeatedly to select number features with a high feature-class correlation among the unselected features, and save the selected features in the increase pool.

Step 2: Use the binary tournament strategy repeatedly to select number features with a low feature-feature correlation from the unselected features, and save the features in the increase pool.

Step 3: Update $\vec{modGS}$ as depicted in Fig. 6. Assuming that number is 3, randomly select 3 feature locations and set their $\vec{modGS}$ values from 0 to 1.

4.5.2. Decrease only mutation strategy

This method requires the construction of a decrease pool. The basic steps are as follows.

Step 1: Use the binary tournament strategy repeatedly; number features with a low feature-class correlation are chosen from the selected features and saved in the decrease pool.

Step 2: The selected features guide the location of mutation based on the feature-feature correlation. By using the binary tournament strategy repeatedly, number features with a high feature-feature correlation are selected from the unselected features, and these features are then stored in the decrease pool.

Step 3: Update $\vec{modGS}$ as depicted in Fig. 7. Assuming that number is 3, randomly select 3 feature locations and set their $\vec{modGS}$ values from 0 to −1.

4.5.3. Simultaneous increase and decrease mutation strategy

There are two mutation pools in this strategy, i.e., the increase pool and the decrease pool. The basic step is a combination of the above two strategies, and $\vec{modGS}$ is updated as shown in Fig. 8.

4.6. Dynamic agitation mechanism

To avoid the population falling into a local optimum due to a particular strategy (BGWO or GMS), we introduce a dynamic agitation mechanism (DAM) to monitor the evolution process of the population. The introduced archive is called the dynamic detection archive. If one strategy is used multiple times and the elite population does not change, the other strategy is employed to agitate the population. The evolution of the standard iterative population is determined by the following formula.

$strategy = \begin{cases} \text{BGWO} & \text{if } rand > adapter \\ \text{GMS} & \text{else} \end{cases}$   (22)

By adding the DAM, the adapter of Eq. (20) is modified as in Eq. (23):

$adapter = \begin{cases} 0 & \text{if } Ttrub \ge 2 \ \& \ useT \le -2 \\ 1 & \text{elseif } Ttrub \ge 2 \ \& \ useT \ge 2 \\ \frac{iter}{maxIter} & \text{else} \end{cases}$   (23)

where Ttrub checks for changes in the Pareto front: Ttrub = 1 if there is no change, otherwise Ttrub = 0. The chosen method is tracked by useT: useT will increase by 1 if BGWO is utilized, else it will drop by 1. Ttrub will increase and useT will increase or decrease accordingly when it is determined that the Pareto front is unchanged and the same method is being utilized in the process.

The adapter will be moved to one of the two extremes (0 or 1) when Ttrub ≥ 2 & useT ≤ −2 or Ttrub ≥ 2 & useT ≥ 2. Accordingly, the search strategy will change. To accomplish the DAM, Ttrub and useT are reset to 0 after the search strategy is modified. According to our experiments, if an algorithm iterates twice and the non-dominated solutions remain the same, it is highly likely that the algorithm has reached a local optimum. Additionally, the setting of 2 or −2 for Ttrub and useT also prevents the algorithm from falling into a local optimum and wasting search resources.
4.7. Analysis of the parameters in the proposed algorithm

This section presents a detailed analysis of the parameters in the proposed algorithm that adjust diversity and convergence to improve global exploitation and local exploration performance. Firstly, we use continuous encoding between 0 and 1 in the encoding process, followed by discretization using the sigmoid function in Eq. (17). In the initialization strategy in Section 4.3, we adjust the number of features selected by each individual using $numF_i$ in Eq. (16), which ensures the diversity of the algorithm in the early stage. Secondly, during the process of updating the population's location: (1) In the early iterations, the population uses the original BGWO operator in Section 4.4 for location updating. To increase the population's diversity, we introduce the threshold thres in Eq. (18) to improve the selection of the three leader wolves from the Pareto solution set. The larger the thres, the better the diversity of the leaders, and the smaller the thres, the better the convergence. Our paper also conducts a sensitivity analysis on thres in Section 5.4. (2) In the later stage of the population iteration, we use the GMS operator proposed in Section 4.5 to perform local exploration and focus more on convergence. The number parameter and adapter parameter in Eqs. (19) and (20) facilitate a smooth transition from the exploitation stage to the exploration stage and regulate the later exploration of the population to ensure its convergence. Section 4.5 details how the proposed triple-mutation strategy unfolds the local exploration task. Finally, the parameters Ttrub and useT in Section 4.6 prevent the population from falling into a local optimum and switch between the two search operators to appropriately increase the diversity of the population.

5. Experimental results

In this section, the effectiveness of the proposed MOBGWO-GMS algorithm was tested on fourteen standard datasets. Meanwhile, eight algorithms were selected for comparison.

5.1. Datasets

To validate the proposed algorithm, fourteen standard datasets from the UCI Machine Learning Repository were used. Table 3 describes these datasets by the number of samples, classes and features. Specifically, the datasets can be divided into three categories. The first category is low-dimensional datasets with fewer than 100 features (Vehicle, German, Ionosphere, Waveform, Sonar, and Libras). The second category is medium-dimensional datasets with more than 100 and fewer than 1000 features (Hill-valley, Musk, Madelon, and Isolet5). The SRBCT, DBword, Leukemia1 and Brain_Tumor1 datasets with more than 1000 features are classified as the third type, high-dimensional datasets. 80% of the samples of each dataset were randomly chosen for training and the remaining 20% were used for testing. The k-nearest neighbors classifier with K = 5 was used to measure the classification accuracy of the selected features.

Table 3
Summary of fourteen standard datasets.

Datasets        No. of samples   No. of classes   No. of features
Vehicle         846              4                18
German          1000             2                20
Ionosphere      351              2                34
Waveform        5000             3                40
Sonar           208              2                60
Libras          360              15               90
Hill-valley     606              2                100
Musk            476              2                166
Madelon         2000             2                500
Isolet5         1559             26               617
SRBCT           83               4                2308
DBword          64               2                4702
Leukemia1       72               3                5327
Brain_Tumor1    90               5                5920

5.2. Parameter settings

To analyze the effectiveness of the proposed algorithm, MOBGWO-GMS is compared with MOABC [60], multi-objective binary grey wolf optimization (MOBGWO) [55], multi-objective binary Harris hawks optimization (MOHHO) [51], NSGA-II [22], a modified NSGA-II (NSGAII-SDR) [61], MOPSO [47], ranked-feature PSO feature selection (CMDPSOFS) [47], and an evolutionary algorithm for large-scale sparse MOPs (SparseEA) [62]. To obtain impartial results, the fitness functions of all the comparison algorithms, which originally followed the corresponding literature, were changed to those given in Section 4.1. The common parameter settings of all algorithms were set to the same values. In the proposed algorithm, the threshold was set to 0.5, which controls how GMS carries out the directional mutation for feature selection. The parameters of all nine algorithms are shown in Table 4.

In all of the algorithms, the population size and the maximum number of iterations were set to 50. In the training process, 10-fold cross-validation with K = 5 was adopted. In addition, for the sake of fairness, all algorithms were run independently 20 times. All algorithms were run on MATLAB R2020 with an Intel Core i7 6700HQ CPU and 32 GB RAM.

5.3. Evaluation metrics

Three commonly adopted evaluation criteria are used to evaluate the quality of the obtained Pareto front: (1) the C-metric; (2) the hypervolume; and (3) the inverted generational distance.

(1) C-metric: The C-metric assesses the dominance relationship between two Pareto fronts. If C(A, B) = 1, then every solution in B is dominated by at least one solution in A; if C(A, B) = 0, no solution in B is dominated by a solution in A [47]. This metric is calculated by Eq. (24).

$C(A, B) = \frac{\left| \{ b \in B \mid \exists a \in A : a \succeq b \} \right|}{|B|}$   (24)

where C(A, B) computes the fraction of solutions in B that are dominated by at least one solution in A [63].

(2) Hypervolume (HV): HV calculates the volume of the region in the objective space bounded by the non-dominated solution set obtained by the algorithm and a reference point. The larger the HV value, the better the comprehensive performance of the algorithm, where performance refers to the convergence and diversity of the algorithm. The setting of the reference point is crucial for the HV calculation. The number of selected features in the solution space was normalized to between 0 and 1, and the HV reference point was set to (1.1, 1.1). HV evaluates both diversity and convergence. HV is calculated according to the following formula.

$HV = \delta\left( \bigcup_{i=1}^{|S|} v_i \right)$   (25)

where δ represents the Lebesgue measure, which is used to measure the volume, |S| denotes the number of non-dominated solutions, and $v_i$ depicts the hypervolume formed by the reference point and the ith solution in the solution set.

(3) Inverted generational distance (IGD): IGD [64] is the average Euclidean distance between the solutions in the true Pareto front and the obtained solution set. However, in multi-objective FS, the true Pareto front is unknown. In this experiment, the non-dominated solutions obtained by all algorithms over 20 executions were selected as the real Pareto frontier.

$IGD(S) = \frac{\sum_{x^* \in F^*} dist(x^*, S)}{|F^*|}$   (26)

where dist(x*, S) represents the Euclidean distance between a point x* ∈ F* and its nearest solution in S.
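For two-objective fronts such as those used here, the three indicators can be computed directly. The Python sketch below is illustrative only; it assumes both objectives are minimized and uses the (1.1, 1.1) reference point mentioned above, and it is not the evaluation code used in the paper.

```python
import numpy as np

def dominates(a, b):
    """a weakly dominates b under minimization (used by the C-metric, Eq. (24))."""
    return np.all(a <= b) and np.any(a < b)

def c_metric(A, B):
    """Eq. (24): fraction of solutions in B dominated by at least one solution in A."""
    return np.mean([any(dominates(a, b) for a in A) for b in B])

def hypervolume_2d(front, ref=(1.1, 1.1)):
    """Eq. (25) for two minimized objectives: area dominated by the front up to ref."""
    pts = sorted((p for p in front if p[0] <= ref[0] and p[1] <= ref[1]),
                 key=lambda p: p[0])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                       # each new point adds a rectangular slab
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

def igd(S, F_star):
    """Eq. (26): mean distance from each reference point in F* to its nearest solution in S."""
    S, F_star = np.asarray(S), np.asarray(F_star)
    d = np.linalg.norm(F_star[:, None, :] - S[None, :, :], axis=2)
    return d.min(axis=1).mean()
```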
Table 4
Parameter settings of nine algorithms.

Common parameters (all algorithms): upper bound ub = 1; lower bound lb = 0; population size N = 50; iteration count maxIter = 50.

MOABC: the epsilon size is set to epsilon = 0.001; the weight controlling the importance of food sources in the production of new food sources is w1 = 0.8; the weight controlling the importance of information provided by an employed bee is w2 = 0.7.
MOBGWO: no other parameters.
MOBHHO: beta = 1.5 during the Levy flight.
NSGA-II: the crossover rate is set to 1; the mutation rate is set to 1/D, where D is the dimension of the feature space.
NSGA-II-SDR: the crossover rate is set to 0.1; the mutation rate is set to 1/D; the distribution index of both is set to 20.
MOPSO: cognitive factor c1 = 1 and social factor c2 = 2; the inertia weight is set to w = 0.9; the maximum velocity (Vmax) is set to (ub − lb)/2.
CMDPSOFS: the inertia weight (w) is chosen randomly from the range [0.1, 0.5], while the acceleration constants (c1 and c2) are randomly chosen from the range [1.5, 2]; the mutation rate Mr = 1/D, where D is the size of the feature space.
MOBGWO-GMS: the threshold for directional mutations is set to thres = 0.5.
SparseEA: the crossover rate is set to 1.0; the mutation rate is set to 1/D, where D is the dimension of the feature space.
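As a sketch of the wrapper evaluation behind Eqs. (13) and (14), the following Python/scikit-learn fragment scores a candidate binary mask by its subset size and its cross-validated KNN (K = 5) error rate. It is illustrative only; the paper's experiments were run in MATLAB, and the 10-fold setting simply mirrors the cross-validation mentioned above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(mask, X_train, y_train, n_folds=10):
    """Return (f1, f2) for a binary feature mask: subset size and CV error rate."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:                       # an empty subset is penalized
        return 0, 1.0
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, X_train[:, idx], y_train, cv=n_folds).mean()
    return idx.size, 1.0 - acc              # f1 = |subset| (Eq. (13)), f2 = error rate (Eq. (14))
```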

Fig. 9. The initial population distribution generated by the proposed initialization and the traditional random initialization.

5.4. Effectiveness of the proposed method

A poor initial population may affect algorithm performance [65]. Previous studies have demonstrated that traditional random initialization may result in slow convergence rates, less informative learning, and significant deviation from the ideal Pareto front [66,67]. To address this issue, we propose a novel initialization method that achieves a uniform distribution of initial individuals and increases the amount of useful information. To evaluate the efficiency of the proposed initialization method, we compare the distribution of the traditional random initialization method with that of the proposed method using three datasets, Waveform, Musk and Madelon, each with a population size of 100. As shown in Fig. 9, the proposed initialization method achieves higher accuracy on the Waveform and Madelon datasets. The traditional random initialization method tends to select a limited range of features, resulting in individuals clumped together. Our proposed method generates individuals that are widely and uniformly distributed in space, which enhances the global search ability by providing a more diverse and balanced population. Therefore, it can be concluded that the proposed initialization method can enhance the global search ability by providing more diverse and balanced populations.

GMS is developed to enhance GWO's ability to solve multi-objective feature selection problems by avoiding local optima. We consider that random factors and limited useful information can cause GWO to become trapped in local optima. To address this, GMS conducts local searches by using feature-feature and feature-class correlations. The GMS search is triggered when GWO falls into a local optimum. To evaluate GMS's effectiveness, we compare the search paths of GWO and GMS on the Madelon dataset over the last ten iterations. We randomly selected five individuals and compared the performance of the two operators under the same conditions, as shown in Fig. 10. It can be seen that the GWO operator has high exploration blindness, resulting in poor local exploration. The repeated paths taken by individuals lead to a significant waste of computer resources. In contrast, the GMS strategy constructs two shared pools that facilitate the triple-mutation of selected and unselected features based on the two types of correlations. This triple-mutation operation enables individuals to search for positions nearby in later iterations, leading to faster convergence of the population.

5.5. Results and discussions

The results were analyzed qualitatively and quantitatively. For the qualitative analysis of the results, the 20 Pareto fronts obtained for each algorithm and dataset were combined into two groups, named ''best'' and ''average''. Section 5.5.1 presents their comparison. For the quantitative analysis of the results, the quality of the solutions is comprehensively evaluated by calculating three significant indicators (C-metric, HV and IGD). In addition, we analyzed the computational complexity and runtime of all algorithms.

5.5.1. Pareto frontier analysis

The UCI Machine Learning Repository provides fourteen datasets that are categorized into three subgroups based on their dimensionality: low, medium, and high. To obtain two sets of non-dominated solutions, twenty independent experiments were conducted, and their results were combined to form the ''best'' and ''average'' sets. Each dataset was randomly split into a training and a testing set, resulting in four sets of experimental results: ''best train'', ''best test'', ''average train'', and ''average test''.
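One plausible reading of how the 20 runs are merged into the ''best'' group is a non-dominated filter over the union of all runs' fronts. The Python sketch below is a hypothetical illustration of that reading, not the authors' procedure.

```python
import numpy as np

def nondominated(points):
    """Keep only the non-dominated points of a set (both objectives minimized)."""
    points = np.asarray(points)
    keep = []
    for i, p in enumerate(points):
        dominated = any(np.all(q <= p) and np.any(q < p)
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            keep.append(p)
    return np.array(keep)

def best_front(runs):
    """Hypothetical ''best'' group: non-dominated filter on the union of all runs' fronts."""
    return nondominated(np.vstack(runs))
```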
Fig. 10. Convergence of GWO and GMS strategies in late iterations.

Fig. 11. Obtained ‘‘best train’’ and ‘‘best test’’ Pareto fronts from the nine methods on low-dimensional datasets.

Each of the four experiment groups underwent individual analysis. Figs. 11 to 13 show the Pareto frontiers of the proposed MOBGWO-GMS algorithm and the other algorithms on the low-, medium-, and high-dimensional datasets in terms of the "best train" and "best test" results. Each figure is divided into two parts, with the left showing the "best train" and the right showing the "best test". The horizontal axis denotes the number of selected features, while the vertical axis represents the classification error rate. The dataset name and group are indicated at the top of each figure, along with the original feature size and the classification error rate using the full feature set in parentheses.

On the low-dimensional datasets (Vehicle, German, Ionosphere, Waveform, Sonar, and Libras), it can be seen from Fig. 11 that all algorithms can effectively reduce the number of features, and the classification error rate is also significantly reduced. In particular, MOBGWO-GMS shows superior performance with fewer features on most datasets, as depicted by the "best train" results on the left side of Fig. 11. For example, when 4 or 5 features are selected on the German dataset, the corresponding classification error rate is lower than for the other algorithms. Similarly, the "best test" analysis on the right side of Fig. 11 confirms that MOBGWO-GMS obtains better non-dominated solution sets than the other algorithms on the Ionosphere and Waveform datasets. In other words, MOBGWO-GMS achieves a better classification error rate with fewer features.

On the medium-dimensional datasets (HillValley, Musk, Madelon, and Isolet5), Fig. 12 displays the performance of the proposed algorithm. The left and right sides of Fig. 12 represent the "best train" and "best test" results for each dataset, respectively. It can be seen that the proposed algorithm achieves the best or second-best results in all cases. MOBGWO-GMS excels in reducing the number of features on the Musk, Madelon, and Isolet5 datasets, making it a favorable choice when the decision maker is cost-conscious. MOBGWO-GMS achieves high classification accuracy, especially on the Madelon dataset, where it reaches around 90%, outperforming the other algorithms. On Madelon and Isolet5, the solution sets obtained by MOBGWO-GMS dominate those of the other algorithms. MOBGWO-GMS also performs better than the other algorithms in terms of the number of features and classification accuracy in the "best test" results, and its solution set outperforms the other algorithms on all four datasets.
Fig. 12. Obtained ‘‘best train’’ and ‘‘best test’’ Pareto fronts from the nine methods on medium-dimensional datasets.
Fig. 13. Obtained ‘‘best train’’ and ‘‘best test’’ Pareto fronts from the nine methods on high-dimensional datasets.
Notably, the proposed initialization strategy allows our algorithm to search for solutions in lower-dimensional feature ranges on datasets such as Musk and Isolet5. Furthermore, the GMS strategy helps the algorithm find more potential non-dominated solutions. MOBGWO-GMS offers a subset of non-dominated features that can satisfy different decision-makers' needs.

When it comes to the experimental results on the high-dimensional datasets (SRBCT, DBWorld, Leukemia1 and Brain_Tumor1), their "best train" and "best test" results are shown in Fig. 13. In the "best train" results on the left side of Fig. 13, we can see that the Pareto solution set obtained by MOBGWO-GMS dominates the solution sets obtained by the other algorithms, which demonstrates that MOBGWO-GMS can not only reduce the feature dimension more effectively on high-dimensional datasets but also achieve excellent classification accuracy. The "best test" results lead to a similar conclusion. Although MOABC achieves a slightly better classification error rate on the SRBCT dataset, MOBGWO-GMS reaches similar classification accuracy using fewer than 30 features. The proposed algorithm achieves the lowest classification error rate, particularly on the SRBCT, DBWorld, Leukemia1, and Brain_Tumor1 datasets, with error rates of 14.28%, 16.67%, 16.67%, and 8.03% (using 19, 162, 46, and 88 features, respectively). Therefore, the performance of the MOBGWO-GMS method in dealing with high-dimensional feature selection problems is remarkable.

Figs. 14 to 16 depict the average Pareto fronts for the low-, medium-, and high-dimensional datasets, respectively. The left side of each graph represents the "average train" and the right side represents the "average test" results. On the low-dimensional datasets, as shown in Fig. 14, the average Pareto frontier obtained by MOBGWO-GMS outperforms the other algorithms on several datasets, for example, the Vehicle, German, Ionosphere and Sonar datasets in the training set, and the German, Waveform and Libras datasets in the test set. On the medium-dimensional datasets, as shown in Fig. 15, we can see that our proposed algorithm outperforms the other algorithms on Madelon (average train and test) and Musk (average test). MOBGWO-GMS is able to restrict the search range to a smaller number of features; for example, on the Musk dataset the search ranges of the various algorithms can be clearly seen. The main reason for this is our proposed initialization strategy. The same conclusion is especially obvious on all high-dimensional datasets, as shown in Fig. 16, which displays the search range of the various algorithms. Meanwhile, MOBGWO-GMS is also significantly competitive in terms of classification error rate. The GMS strategy can explore more possible solutions and make the algorithm converge to the optimal Pareto front quickly.
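For reference, the guidance signal behind GMS is the Pearson correlation coefficient, measured between each feature and the class (relevance) and between pairs of features (redundancy). A minimal sketch of computing these two quantities is shown below; the function name is ours, and how the correlations are actually combined inside the increase and decrease mutations follows the GMS definition given earlier in the paper rather than this sketch.

import numpy as np

def pearson_guidance(X, y):
    """Return |feature-class| Pearson correlations (length-D vector) and the
    |feature-feature| correlation matrix (D x D) for data matrix X and labels y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    feature_class = np.abs((Xc * yc[:, None]).sum(axis=0) / denom)
    feature_feature = np.abs(np.corrcoef(X, rowvar=False))
    return feature_class, feature_feature

# Example with synthetic data: 100 samples, 6 features, binary class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.integers(0, 2, size=100).astype(float)
relevance, redundancy = pearson_guidance(X, y)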
Fig. 14. Obtained ‘‘average train’’ and ‘‘average test’’ Pareto fronts from the nine methods on low-dimensional datasets.
Fig. 15. Obtained ‘‘average train’’ and ‘‘average test’’ Pareto fronts from the nine methods on medium-dimensional datasets.
In conclusion, MOBGWO-GMS demonstrates strong competitiveness in multi-objective feature selection. The algorithm effectively reduces feature dimensionality and minimizes classification error rates. Furthermore, the algorithm holds tremendous potential in handling high-dimensional data.

5.5.2. Performance metrics
The purpose of this section is to assess the performance of the MOBGWO-GMS algorithm using three multi-objective feature selection metrics: C-metric, HV, and IGD.

The C-metric is a widely used measure for evaluating the effectiveness of solution sets by comparing the scores of two Pareto fronts. In our study, we evaluated the performance of the nine algorithms using the C-metric index on both the training and testing sets. Specifically, we calculated C(MOBGWO-GMS, X) and C(X, MOBGWO-GMS) using Eq. (24), where X represents the eight comparison algorithms. If C(MOBGWO-GMS, X) is greater than C(X, MOBGWO-GMS), it confirms that the proposed algorithm has a better convergence ability towards the optimal solution [47]. The computed C-metric values for all datasets in the training and testing sets are displayed in Tables 5 and 6. The first eight columns of each table represent C(MOBGWO-GMS, X), while the last eight columns show C(X, MOBGWO-GMS). Upon analyzing the results, it can be observed that:

(1) In the training set, all algorithms (X) exhibit much higher values of C(MOBGWO-GMS, X) compared to C(X, MOBGWO-GMS). For instance, on the Musk dataset, MOBGWO-GMS dominates 99% of the solutions obtained by MOABC (C(MOBGWO-GMS, MOABC) = 0.99), whereas none of the solutions in MOBGWO-GMS are dominated by MOABC (C(MOABC, MOBGWO-GMS) = 0). Notably, the fact that the mean (Avg) and variance (Std) of C(X, MOBGWO-GMS) are both zero implies that none of the
Fig. 16. Obtained ‘‘average train’’ and ‘‘average test’’ Pareto fronts from the nine methods on high-dimensional datasets.
Table 5
Average and standard deviation of C-metric values between MOBGWO-GMS and the eight other methods in training data.
Training set C(MOBGWO-GMS,X) as X: C(X,MOBGWO-GMS) as X:
MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA
Mean 0.9208 0.7138 0.9395 0.6605 0.2333 0.7259 0.7083 0.8196 0.0393 0.2094 0.0207 0.2295 0.1553 0.2261 0.2600 0.1442
Vehicle
Std 0.1610 0.1816 0.0845 0.1801 0.1567 0.1857 0.1923 0.0807 0.0834 0.1067 0.0426 0.1457 0.0587 0.1460 0.1873 0.0552
Mean 0.8531 0.8244 0.9839 0.8175 0.4167 0.8323 0.8788 0.5945 0.0616 0.1517 0 0.1064 0.2155 0.0999 0.1084 0.3244
German
Std 0.1475 0.1118 0.0394 0.2353 0.1431 0.1876 0.1270 0.1581 0.1055 0.0664 0 0.1445 0.1186 0.1728 0.1068 0.1393
Mean 1 0.7592 1 0.6933 0.2208 0.9117 0.9667 0.7117 0 0.1408 0 0.2092 0.3108 0.0300 0.0083 0.3025
Ionosphere
Std 0 0.2224 0 0.2559 0.1716 0.1515 0.0872 0.1477 0 0.1826 0 0.1870 0.1374 0.1342 0.0373 0.1509
Mean 1 0.9571 1 0.7214 0.3875 0.8291 0.8734 0.8751 0 0.0277 0 0.1926 0.1035 0.0851 0.0346 0.0987
Waveform
Std 0 0.0706 0 0.1910 0.1879 0.1746 0.1293 0.0486 0 0.0474 0 0.1706 0.0414 0.0905 0.0468 0.0356
Mean 0.9275 0.8065 1 0.7508 0.4458 0.8441 0.9168 0.7067 0 0.0975 0 0.1923 0.1380 0.0715 0.0171 0.2639
Sonar
Std 0.1426 0.2097 0 0.2080 0.2392 0.1740 0.1741 0.1493 0 0.1731 0 0.2262 0.0528 0.0709 0.0536 0.0932
Mean 1 0.7631 0.9521 0.7627 0.6208 0.9068 0.9363 0.7983 0 0.1108 0 0.1783 0.1006 0.0619 0.0036 0.2027
Libras
Std 0 0.1298 0.1302 0.2599 0.1130 0.0868 0.1623 0.1175 0 0.0645 0 0.2143 0.0170 0.0534 0.0160 0.0932
Mean 1 0.8312 0.9909 0.5992 0.6042 0.9750 0.8723 0.3417 0 0.0350 0 0.1646 0.2163 0 0.1346 0.5800
HillValley
Std 0 0.2610 0.0407 0.2267 0.0892 0.1118 0.1210 0.2403 0 0.1182 0 0.2478 0.0692 0 0.1156 0.2129
Mean 0.9900 0.9181 0.9944 0.5880 0.7092 0.8800 0.9209 0.8371 0 0.0111 0 0.0744 0.0841 0.0081 0.0205 0.1443
Musk
Std 0.0447 0.1766 0.0248 0.2243 0.1028 0.2225 0.1149 0.0525 0 0.0275 0 0.1356 0.0304 0.0252 0.0375 0.0563
Mean 1 0.8329 1 1 1 1 1 0.6308 0 0.1648 0 0 0 0 0 0.2182
Madelon
Std 0 0.0490 0 0 0 0 0 0.2137 0 0.0196 0 0 0 0 0 0.1044
Mean 0.7579 0.5652 0.9547 0.1157 1 0.3102 0.6907 0.7302 0 0.0782 0 0.0227 0 0.0092 0 0.1493
Isolet5
Std 0.2508 0.3104 0.1019 0.0193 0 0.2751 0.3048 0.1765 0 0.0797 0 0.0430 0 0.0283 0 0.1366
Mean 1 0.9958 1 1 0.9064 1 1 0.8882 0 0 0 0 0.0700 0 0 0.1164
SRBCT
Std 0 0.0186 0 0 0.0366 0 0 0.0594 0 0 0 0 0.0270 0 0 0.0972
Mean 1 0.9917 1 1 0.8977 1 0.9833 0.4458 0 0.0031 0 0 0.0833 0 0.0031 0.5069
DBWorld
Std 0 0.0373 0 0 0.0148 0 0.0745 0.1583 0 0.0140 0 0 0.0227 0 0.0140 0.2482
Mean 0.8822 1 1 1 0.9102 0.8997 0.9950 0.4317 0.0946 0 0 0 0.0875 0.0599 0.0083 0.6189
Leukemia1
Std 0.0226 0 0 0 0.0315 0.0845 0.0224 0.1547 0.0274 0 0 0 0.0324 0.0509 0.0373 0.2211
Mean 0.8713 1 1 1 0.9175 0.9234 0.9524 0.6431 0.1288 0 0 0 0.1004 0.0650 0 0.3554
Brain_Tumor1
Std 0.0467 0 0 0 0.0518 0.0824 0.1286 0.2152 0.0536 0 0 0 0.0763 0.0779 0 0.1873
solutions in X dominates any of the solutions in MOBGWO-GMS. This trend is evident on the Ionosphere, Waveform, Libras, Isolet5, Leukemia1, and Brain_Tumor1 datasets for MOABC, MOBHHO, NSGA-II, and CMDPSOFS.

(2) In the testing set, we observed that the value of C(MOBGWO-GMS, X) is significantly higher than the corresponding value of C(X, MOBGWO-GMS). Moreover, C(X, MOBGWO-GMS) is even closer to zero. These results suggest that MOBGWO-GMS has strong convergence capabilities, allowing it to converge towards optimal solutions.

HV is another popular metric used for assessing the quality of a Pareto front. Its purpose is to calculate the hypervolume of the space formed by the Pareto front and a chosen reference point, which makes it able to measure both the convergence and the diversity of the set of solutions. In this study, the reference point was chosen as (1.1, 1.1). A higher HV value indicates a higher quality of the solution set. Each experiment involved calculating the hypervolume HV using the optimal frontier "best" as the Pareto frontier. The experiment was divided into two groups, the training set and the testing set, and the average of 20 independent algorithm runs was computed. The statistical significance of the HV results was assessed using a paired t-test with a significance level of 0.05, as illustrated in Tables 7 and 8. The results of the t-test comparison are indicated in the tables with the symbols "+", "−", and "=".

1. "+" indicates that the performance of MOBGWO-GMS is significantly better than the corresponding algorithm;
2. "−" indicates that the performance of MOBGWO-GMS is inferior to the corresponding algorithm;
3. "=" indicates that there is no significant difference in the performance of the two algorithms.

Table 7 demonstrates that the proposed MOBGWO-GMS algorithm significantly outperforms the other eight algorithms in terms of the HV metric: except for the HillValley dataset, the MOBGWO-GMS algorithm achieves the highest HV values. This superiority in the HV metric is attributed to its ability to maintain a diverse population. From Table 8, except for the Vehicle, German, Ionosphere, and Leukemia1 datasets, MOBGWO-GMS achieves the highest HV value on the remaining datasets. Our algorithm ranks among the top three in terms of HV value on the three low-dimensional datasets (Vehicle, German, Ionosphere), although it is not optimal there, and on the Leukemia1 dataset the HV value of our algorithm is 0.0847 lower than that of the SparseEA algorithm.
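To make the indicators used in this subsection concrete, minimal two-objective (minimization) versions of the C-metric and HV can be sketched as follows. This is only an illustrative implementation, not the authors' code: it assumes both objectives are scaled to [0, 1] so that the (1.1, 1.1) reference point mentioned above is meaningful, it uses weak dominance for coverage, and Eq. (24) in the paper may differ in such details.

import numpy as np

def dominates(a, b):
    """True if a Pareto-dominates b (both objectives minimized)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def c_metric(A, B):
    """C(A, B): fraction of solutions in B dominated by (or equal to) some solution in A."""
    covered = sum(any(dominates(a, b) or np.array_equal(a, b) for a in A) for b in B)
    return covered / len(B)

def hv_2d(front, ref=(1.1, 1.1)):
    """Hypervolume enclosed by a 2-objective minimization front and a reference point."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(map(tuple, np.asarray(front, dtype=float))):  # sort by first objective
        if x < ref[0] and y < prev_y:  # each non-dominated point adds one rectangular strip
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

# Example: front P1 fully covers front P2 and encloses a larger hypervolume.
P1 = [(0.10, 0.30), (0.20, 0.20), (0.40, 0.10)]
P2 = [(0.15, 0.35), (0.45, 0.15)]
print(c_metric(P1, P2), c_metric(P2, P1))  # 1.0 0.0
print(hv_2d(P1), hv_2d(P2))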
Table 6
Average and standard deviation of C-metric values between MOBGWO-GMS and the eight other methods in testing data.
Testing set C(MOBGWO-GMS,X) as X: C(X,MOBGWO-GMS) as X:
MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA
Mean 0.9083 0.4461 0.8436 0.5658 0.2000 0.2685 0.4475 0.2494 0.0468 0.4685 0.0907 0.2333 0.2416 0.6493 0.4858 0.6495
Vehicle
Std 0.1891 0.2737 0.2279 0.2485 0.1675 0.2197 0.2771 0.2431 0.0974 0.2836 0.1634 0.1399 0.1067 0.2488 0.3159 0.2592
Mean 0.8792 0.1250 0.7750 0.3088 0.2292 0.4193 0.2761 0.7149 0.0403 0.7644 0.1278 0.3990 0.3369 0.4573 0.7682 0.2317
German
Std 0.2067 0.1804 0.2194 0.2797 0.2292 0.3297 0.2447 0.1277 0.1001 0.1855 0.1703 0.1973 0.1734 0.2618 0.2711 0.0949
Mean 0.8417 0.6708 1 0.8017 0.1833 0.8431 0.7803 0.0767 0 0.2950 0 0.1642 0.3683 0.1450 0.0167 0.8400
Ionosphere
Std 0.2386 0.3004 0 0.2555 0.2016 0.1375 0.2869 0.1466 0 0.3212 0 0.2294 0.1728 0.1175 0.0513 0.1887
Mean 1 0.6357 1 0.4169 0.2417 0.8799 0.8705 0.8170 0 0.3145 0 0.5326 0.1344 0.0524 0.0867 0.1446
Waveform
Std 0 0.1809 0 0.2127 0.1597 0.1651 0.1069 0.0727 0 0.2049 0 0.2341 0.0484 0.0772 0.0717 0.0575
Mean 0.9833 0.6808 1 0.7387 0.4083 0.6863 0.6894 0.5770 0 0.2454 0 0.1395 0.2862 0.2302 0.0225 0.3120
Sonar
Std 0.0745 0.2667 0 0.2135 0.2564 0.1701 0.2404 0.2326 0 0.2168 0 0.2353 0.1430 0.0706 0.0697 0.1328
Mean 1 0.5326 0.8776 0.8570 0.6458 0.7537 0.9661 0.4885 0 0.2017 0 0.0929 0.1412 0.1278 0.0125 0.4969
Libras
Std 0 0.2599 0.2199 0.2323 0.0850 0.1573 0.0863 0.2612 0 0.0866 0 0.1654 0.0385 0.0416 0.0385 0.2426
Mean 0.7476 0.5517 0.9524 1 0.6042 0.9373 0.6182 0.4250 0 0.1000 0 0 0.3625 0.1208 0.0792 0.4750
HillValley
Std 0.3082 0.2972 0.1591 0 0.0892 0.0800 0.3578 0.2260 0 0.1446 0 0 0.1022 0.1564 0.1517 0.2479
Mean 1 0.7514 1 0.9333 0.7433 0.9317 0.8975 0.6599 0 0.1498 0 0.0071 0.1498 0.0430 0.1084 0.3039
Musk
Std 0 0.2073 0 0.1502 0.0517 0.0899 0.0695 0.1846 0 0.0785 0 0.0319 0.0633 0.0732 0.0868 0.1566
Mean 1 0.8678 1 0.8073 0.8367 0.9813 1 0.4208 0 0.0797 0 0 0.1642 0.0100 0 0.2845
Madelon
Std 0 0.1955 0 0.2767 0.0353 0.0612 0 0.3261 0 0.0955 0 0 0.0393 0.0447 0 0.1593
Mean 0.9620 0.4698 0.7764 0.3183 0.8532 0.9065 0.5388 0.7910 0 0.0933 0.0087 0.0145 0.0347 0 0.0069 0.1242
Isolet5
Std 0.0801 0.3220 0.2244 0.2823 0.0614 0.1922 0.3260 0.1208 0 0.0890 0.0315 0.0350 0.0087 0 0.0308 0.0810
Mean 0.9150 0.6820 1 0.9625 0.9242 0.8858 0.9917 0.5333 0 0.2580 0 0 0.2580 0.0867 0 0.5852
SRBCT
Std 0.2014 0.2021 0 0.1223 0.0096 0.1341 0.0373 0.1160 0 0.1172 0 0 0.1172 0.1145 0 0.1889
Mean 1 0.8364 0.9633 0.9417 0.9804 0.8165 0.7983 0.5250 0 0.2433 0 0 0.3367 0.1517 0.1100 0.4450
DBWorld
Std 0 0.1361 0.1134 0.1816 0.0000 0.2009 0.2202 0.1118 0 0.1680 0 0 0.1066 0.1849 0.1632 0.1965
Mean 0.7695 1 1 1 0.9804 0.6650 0.8074 0.5000 0.3808 0 0 0 0.3808 0.3808 0.3517 0.5942
Leukemia1
Std 0.0788 0 0 0 0.0000 0.1516 0.0841 0 0.1066 0 0 0 0.1066 0.1066 0.1572 0.1736
Mean 0.3000 0.9000 0.9500 0.9000 0.8302 0.5583 0.5750 0.7000 0.4225 0 0 0 0.4225 0.4225 0.3308 0.1942
Brain_Tumor1
Std 0.2513 0.2052 0.1539 0.2052 0.2419 0.2181 0.2260 0.2513 0.1126 0 0 0 0.1126 0.1126 0.2001 0.2865
Table 7
Average, standard deviation and t-test of HV index values of the nine methods on training data.
HV (Training set) MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOBGWO-GMS
Mean 0.6375 0.7183 0.7215 0.7246 0.5809 0.7172 0.7317 0.7085 0.7410
Vehicle + + + + + + + +
Std 0.0234 0.0050 0.0097 0.0118 0.0051 0.0213 0.0070 0.0066 0.0046
Mean 0.6949 0.7462 0.7394 0.7561 0.7272 0.7442 0.7518 0.7535 0.7727
German + + + + + + + +
Std 0.0332 0.0054 0.0184 0.0066 0.0107 0.0262 0.0177 0.0069 0.0049
Mean 0.7944 0.9124 0.8207 0.9078 0.8541 0.8436 0.8740 0.8995 0.9215
Ionosphere + + + + + + + +
Std 0.0228 0.0122 0.0191 0.0107 0.0236 0.0472 0.0225 0.0080 0.0059
Mean 0.6766 0.7925 0.7288 0.8122 0.6491 0.7271 0.7534 0.8022 0.8313
Waveform + + + + + + + +
Std 0.0249 0.0107 0.0113 0.0192 0.0612 0.0382 0.0218 0.0058 0.0031
Mean 0.7018 0.8643 0.7565 0.8884 0.7335 0.7775 0.7799 0.8745 0.9086
Sonar + + + + + + + +
Std 0.0179 0.0149 0.0171 0.0143 0.0442 0.0274 0.0322 0.0075 0.0120
Mean 0.5667 0.7523 0.6758 0.7915 0.5567 0.6550 0.6526 0.7959 0.8071
Libras + + + + + + + +
Std 0.0144 0.0118 0.0164 0.0088 0.1095 0.0255 0.0188 0.0087 0.0165
Mean 0.4811 0.6349 0.5527 0.6749 0.5941 0.5433 0.5323 0.6649 0.6716
HillValley + + + = + + + +
Std 0.0132 0.0105 0.0124 0.0099 0.0193 0.0210 0.0138 0.0054 0.0090
Mean 0.6368 0.8325 0.7309 0.8683 0.7927 0.7081 0.6852 0.8764 0.9062
Musk + + + + + + + +
Std 0.0126 0.0136 0.0126 0.0131 0.0207 0.0288 0.0145 0.0072 0.0075
Mean 0.4509 0.6357 0.5336 0.6693 0.5211 0.4881 0.4776 0.6946 0.8825
Madelon + + + + + + + +
Std 0.0085 0.0202 0.0069 0.0165 0.0118 0.0142 0.0082 0.0142 0.0266
Mean 0.5690 0.7790 0.6734 0.7588 0.6466 0.6060 0.5895 0.8532 0.8555
Isolet5 + + + + + + + =
Std 0.0083 0.0065 0.0102 0.0092 0.0120 0.0165 0.0072 0.0069 0.0092
Mean 0.5741 0.8602 0.7423 0.6979 0.5979 0.5811 0.6037 0.9592 0.9856
SRBCT + + + + + + + +
Std 0.0063 0.0119 0.0133 0.0116 0.0154 0.0142 0.0059 0.0080 0.0029
Mean 0.4474 0.7443 0.5567 0.5218 0.5361 0.4741 0.4225 0.9041 0.9525
DBWorld + + + + + + + +
Std 0.0118 0.0364 0.0126 0.0599 0.0424 0.0195 0.0037 0.0136 0.0196
Mean 0.8107 0.5466 0.7157 0.6307 0.5642 0.5495 0.5296 0.9486 0.9758
Leukemia1 + + + + + + + +
Std 0.0082 0.0036 0.0094 0.0068 0.0099 0.0063 0.0059 0.0168 0.0109
Mean 0.7692 0.5138 0.6638 0.5857 0.5496 0.5209 0.5423 0.8982 0.8992
Brain_Tumor1 + + + + + + + =
Std 0.0062 0.0035 0.0084 0.0062 0.0065 0.0060 0.0041 0.0055 0.0177
Table 8
Average, standard deviation and t-test of HV index values of the nine methods on testing data.
HV (Testing set) MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOBGWO-GMS
Mean 0.6438 0.7144 0.7058 0.6782 0.6023 0.7676 0.7485 0.7386 0.7344
Vehicle + + + + + − − =
Std 0.0231 0.0166 0.0118 0.0263 0.0096 0.0306 0.0121 0.0064 0.0238
Mean 0.7218 0.8071 0.7757 0.7785 0.7499 0.7917 0.8154 0.7573 0.8042
German + = + + + + − +
Std 0.0250 0.0067 0.0170 0.0228 0.0106 0.0205 0.0090 0.0103 0.0135
Mean 0.7938 0.9088 0.7996 0.9066 0.8452 0.8276 0.8659 0.9447 0.9215
Ionosphere + + + + + + + −
Std 0.0235 0.0172 0.0246 0.0169 0.0499 0.0312 0.0281 0.0109 0.0163
Mean 0.6696 0.8191 0.7220 0.8288 0.6632 0.7538 0.7585 0.8185 0.8379
Waveform + + + + + + + +
Std 0.0217 0.0085 0.0111 0.0147 0.0650 0.0279 0.0168 0.0071 0.0059
Mean 0.6967 0.8848 0.7199 0.8994 0.7869 0.8056 0.8195 0.8602 0.9190
Sonar + + + = + + + +
Std 0.0254 0.0210 0.0182 0.0402 0.0659 0.0289 0.0289 0.0123 0.0236
Mean 0.4102 0.6116 0.5157 0.5961 0.5065 0.5061 0.4680 0.6340 0.6430
Libras + + + + + + + =
Std 0.0098 0.0215 0.0148 0.0284 0.0603 0.0156 0.0186 0.0098 0.0472
Mean 0.4846 0.6578 0.5325 0.5930 0.5562 0.5033 0.5560 0.6413 0.6757
HillValley + + + + + + + +
Std 0.0122 0.0162 0.0148 0.0176 0.0173 0.0190 0.0132 0.0053 0.0298
Mean 0.5699 0.8152 0.6637 0.8397 0.6979 0.6761 0.6334 0.8654 0.8954
Musk + + + + + + + +
Std 0.0114 0.0195 0.0121 0.0142 0.0409 0.0237 0.0155 0.0080 0.0204
Mean 0.4623 0.6538 0.5490 0.6508 0.5013 0.4944 0.4783 0.6824 0.7722
Madelon + + + + + + + +
Std 0.0071 0.0242 0.0076 0.0142 0.0187 0.0143 0.0124 0.0065 0.0363
Mean 0.5211 0.7395 0.6426 0.7092 0.5905 0.5540 0.5624 0.7968 0.8032
Isolet5 + + + + + + + =
Std 0.0051 0.0096 0.0086 0.0126 0.0176 0.0139 0.0080 0.0084 0.0152
Mean 0.5369 0.7150 0.4732 0.6148 0.3817 0.4540 0.4570 0.8757 0.8872
SRBCT + + + + + + + =
Std 0.0216 0.0212 0.0158 0.0188 0.0219 0.0217 0.0129 0.0001 0.0720
Mean 0.4486 0.7718 0.6620 0.6210 0.4749 0.5731 0.5858 0.9172 0.8775
DBWorld + + + + + + + =
Std 0.0267 0.0154 0.0218 0.0035 0.0045 0.0300 0.0151 0.0000 0.2129
Mean 0.5923 0.3526 0.5352 0.4946 0.4720 0.4587 0.4383 0.9585 0.8738
Leukemia1 + + + + + + + −
Std 0.0017 0.0015 0.0132 0.0092 0.0051 0.0299 0.0102 0.0000 0.0576
Mean 0.7396 0.4790 0.5968 0.5611 0.5422 0.4882 0.4945 0.7933 0.8221
Brain_Tumor1 + + + + + + + +
Std 0.0024 0.0017 0.0060 0.0025 0.0129 0.0066 0.0039 0.0000 0.0601
Table 9
Average, standard deviation and t-test of IGD index values of the nine methods on training data.
IGD (Training set) MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOBGWO-GMS
Mean 0.1353 0.0996 0.1112 0.0970 0.2716 0.0991 0.0989 0.1143 0.0856
Vehicle + + + + + + + +
Std 0.0153 0.0088 0.0069 0.0076 0.0059 0.0105 0.0064 0.0072 0.0044
Mean 0.1132 0.1295 0.1149 0.1234 0.1820 0.0994 0.1069 0.1137 0.1010
German + + + + + = = +
Std 0.0114 0.0096 0.0037 0.0098 0.0147 0.0079 0.0075 0.0141 0.0113
Mean 0.1348 0.0551 0.1274 0.0608 0.1236 0.1022 0.0829 0.0737 0.0470
Ionosphere + + + + + + + +
Std 0.0242 0.0109 0.0163 0.0077 0.0288 0.0386 0.0150 0.0124 0.0050
Mean 0.1240 0.0836 0.1062 0.0526 0.2804 0.0960 0.0748 0.0727 0.0464
Waveform + + + + + + + +
Std 0.0187 0.0080 0.0092 0.0077 0.0636 0.0215 0.0101 0.0070 0.0071
Mean 0.2306 0.0920 0.1960 0.0846 0.2620 0.1595 0.1658 0.1183 0.0711
Sonar + + + + + + + +
Std 0.0191 0.0101 0.0132 0.0119 0.0557 0.0227 0.0269 0.0100 0.0097
Mean 0.3043 0.0829 0.1858 0.0392 0.2431 0.1930 0.2112 0.0520 0.0319
Libras + + + = + + + +
Std 0.0190 0.0133 0.0199 0.0170 0.1230 0.0301 0.0242 0.0052 0.0078
Mean 0.3309 0.1015 0.2235 0.0660 0.1402 0.2302 0.2556 0.0788 0.0671
HillValley + + + = + + + +
Std 0.0240 0.0119 0.0213 0.0089 0.0174 0.0308 0.0214 0.0108 0.0115
Mean 0.2799 0.0825 0.1728 0.0563 0.1094 0.1953 0.2244 0.0732 0.0494
Musk + + + + + + + +
Std 0.0176 0.0067 0.0150 0.0076 0.0165 0.0303 0.0191 0.0079 0.0088
Mean 0.4648 0.2090 0.3586 0.2000 0.3362 0.4140 0.4333 0.1763 0.0317
Madelon + + + + + + + +
Std 0.0104 0.0209 0.0120 0.0192 0.0131 0.0207 0.0122 0.0240 0.0112
Mean 0.3742 0.1287 0.2305 0.2034 0.2349 0.3448 0.3493 0.0408 0.0364
Isolet5 + + + + + + + +
Std 0.0090 0.0141 0.0123 0.0106 0.0093 0.0209 0.0079 0.0031 0.0049
Mean 0.4650 0.1439 0.2664 0.3192 0.4026 0.4497 0.4454 0.0466 0.0111
SRBCT + + + + + + + +
Std 0.0037 0.0053 0.0137 0.0079 0.0106 0.0144 0.0049 0.0116 0.0042
Mean 0.5521 0.2154 0.4348 0.4597 0.4521 0.5190 0.5695 0.0857 0.0302
DBWorld + + + + + + + +
Std 0.0125 0.0376 0.0215 0.0567 0.0342 0.0189 0.0041 0.0155 0.0196
Mean 0.1681 0.4929 0.2844 0.3817 0.4421 0.4852 0.4924 0.0353 0.0187
Leukemia1 + + + + + + + +
Std 0.0062 0.0045 0.0137 0.0049 0.0071 0.0083 0.0054 0.0188 0.0077
Mean 0.1593 0.4876 0.2789 0.3846 0.4191 0.4766 0.4645 0.0424 0.0291
Brain_Tumor1 + + + + + + + +
Std 0.0065 0.0030 0.0110 0.0038 0.0066 0.0096 0.0040 0.0111 0.0166
Table 10
Average, standard deviation and t-test of IGD index values of the nine methods on testing data.
IGD (Testing set) MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOBGWO-GMS
Mean 0.1312 0.1024 0.1307 0.1394 0.2470 0.0527 0.0774 0.0753 0.0870
Vehicle + + + + + − = −
Std 0.0119 0.0152 0.0139 0.0327 0.0109 0.0197 0.0098 0.0121 0.0217
Mean 0.0986 0.0903 0.0738 0.1151 0.1651 0.0618 0.0525 0.1155 0.0737
German + + = + + = − +
Std 0.0100 0.0199 0.0081 0.0460 0.0175 0.0150 0.0158 0.0167 0.0229
Mean 0.1559 0.0593 0.1698 0.0648 0.1348 0.1342 0.1060 0.0306 0.0509
Ionosphere + = + + + + + −
Std 0.0267 0.0170 0.0256 0.0146 0.0604 0.0227 0.0303 0.0097 0.0165
Mean 0.1309 0.0691 0.1092 0.0543 0.2700 0.0793 0.0764 0.0601 0.0385
Waveform + + + + + + + +
Std 0.0185 0.0124 0.0088 0.0110 0.0662 0.0167 0.0086 0.0085 0.0095
Mean 0.2409 0.0763 0.2437 0.0774 0.1985 0.1322 0.1377 0.1215 0.0629
Sonar + + + = + + + +
Std 0.0315 0.0142 0.0191 0.0425 0.0756 0.0263 0.0273 0.0157 0.0201
Mean 0.3969 0.1692 0.2963 0.1943 0.2864 0.2797 0.3223 0.1732 0.1551
Libras + = + + + + + =
Std 0.0169 0.0197 0.0205 0.0281 0.0745 0.0195 0.0260 0.0125 0.0440
Mean 0.3438 0.0836 0.2552 0.1554 0.1828 0.2680 0.2500 0.0987 0.0584
HillValley + + + + + + + +
Std 0.0196 0.0180 0.0231 0.0213 0.0184 0.0265 0.0205 0.0071 0.0352
Mean 0.3263 0.0945 0.2374 0.0813 0.2076 0.2176 0.2648 0.0804 0.0614
Musk + + + + + + + +
Std 0.0172 0.0131 0.0156 0.0099 0.0430 0.0255 0.0209 0.0090 0.0191
Mean 0.4641 0.1944 0.3537 0.2249 0.3580 0.4130 0.4427 0.1845 0.0932
Madelon + + + + + + + +
Std 0.0082 0.0206 0.0102 0.0138 0.0209 0.0196 0.0117 0.0073 0.0318
Mean 0.3817 0.1370 0.2416 0.1861 0.2577 0.3370 0.3491 0.0845 0.0727
Isolet5 + + + + + + + +
Std 0.0055 0.0065 0.0067 0.0094 0.0132 0.0157 0.0095 0.0073 0.0116
Mean 0.4821 0.2320 0.5399 0.3398 0.6152 0.5344 0.5403 0.0933 0.0892
SRBCT + + + + + + + =
Std 0.0122 0.0166 0.0288 0.0136 0.0302 0.0218 0.0127 0.0000 0.0802
Mean 0.5532 0.1883 0.3511 0.3577 0.4986 0.4605 0.4467 0.0684 0.0644
DBWorld + + + + + + + =
Std 0.0288 0.0186 0.0303 0.0051 0.0059 0.0193 0.0051 0.0000 0.0611
Mean 0.3908 0.6729 0.4815 0.4840 0.5119 0.5494 0.5695 0.0111 0.1137
Leukemia1 + + + + + + + −
Std 0.0028 0.0069 0.0174 0.0107 0.0068 0.0285 0.0124 0.0000 0.0697
Mean 0.1664 0.4964 0.3167 0.3792 0.4253 0.4890 0.4753 0.1507 0.1171
Brain_Tumor1 + + + + + + + +
Std 0.0028 0.0024 0.0083 0.0035 0.0090 0.0132 0.0051 0.0000 0.0706
IGD is the third metric used to measure the quality of Pareto solution sets in our experiments. Tables 9 and 10 present the IGD analysis of the nine algorithms on the fourteen datasets, reporting the mean and standard deviation obtained from 20 independent runs. A t-test was also performed on the results of the experiments. Based on the findings from Tables 9 and 10, we can observe that: (1) In the training set, MOBGWO-GMS achieves the lowest IGD value on all datasets except for the German dataset. Although MOPSO had the lowest IGD value on the German dataset, a statistical analysis using the t-test showed no significant difference in performance between MOPSO and our algorithm. (2) In the testing set, our algorithm achieves a slightly higher IGD value than other algorithms on low-dimensional datasets such as Vehicle, German, and Ionosphere. However, on the other datasets, our algorithm still obtained the smallest IGD value compared to the other algorithms.

In summary, the above-mentioned metrics demonstrate that the proposed algorithm performs exceptionally well on most datasets. The experiments reveal that the algorithm outperforms other methods by having a lower classification error rate while using the same number of features. Moreover, the proposed approach is superior in achieving an optimal balance between the two fitness evaluation criteria, resulting in a better Pareto front.

5.5.3. Sensitivity analysis of the parameter
In MOBGWO-GMS, a threshold called thres is utilized to guide the leader wolf's selection from the Pareto solution set. To determine an appropriate range of thres values, experiments were conducted, and the results are presented in Fig. 17. The figure shows the correlation between thres and the performance of our proposed method, as measured by the HV metric. We found that the Waveform dataset obtains the largest HV at thres = 0.6, the Isolet5 dataset at thres = 0.5, and the DBWorld dataset at thres = 0.4, as shown in Fig. 17. Based on our experimental data, we find that the optimal value of thres is influenced by the dimension of the dataset, but is typically around 0.5. Therefore, we propose to set thres to 0.5 for the experimental comparisons. Additionally, the implementation of the triple-mutation strategy in GMS introduces only one additional parameter (thres), which significantly improves the portability of our algorithm.
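For completeness, the IGD values reported in Tables 9 and 10 can be understood through the usual definition of inverted generational distance: the average distance from every point of a reference front to its nearest point in the obtained front (lower is better). The following is only a minimal sketch, assuming plain Euclidean distances and a reference front built from the combined non-dominated solutions; the paper's exact formulation may differ slightly.

import numpy as np

def igd(reference_front, obtained_front):
    """Mean Euclidean distance from each reference-front point to its
    closest point in the obtained front."""
    R = np.asarray(reference_front, dtype=float)
    A = np.asarray(obtained_front, dtype=float)
    dists = np.linalg.norm(R[:, None, :] - A[None, :, :], axis=2)  # |R| x |A| distance matrix
    return float(dists.min(axis=1).mean())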
Fig. 17. HV with respect to the value of thres for the Waveform, Isolet5, and DBWorld datasets. The vertical dashed lines indicate the best values of thres.
Table 11
Time complexity analysis of nine algorithms.
Algorithm / overall complexity of the algorithm:
MOABC O(maxIter ∗ (N ∗ D + M ∗ N²))
MOBGWO O(maxIter ∗ (N ∗ D + M ∗ N²))
MOBHHO O(maxIter ∗ (N ∗ D + M ∗ N²))
NSGA-II O(maxIter ∗ M ∗ N²)
NSGAII-SDR O(maxIter ∗ M ∗ N³)
MOPSO O(maxIter ∗ (N ∗ D + M ∗ N²))
CMDPSOFS O(maxIter ∗ (N ∗ D + M ∗ N²))
SparseEA O(maxIter ∗ (N ∗ D + M ∗ N²))
MOBGWO-GMS O(maxIter ∗ (N ∗ D + M ∗ N²))
Complexity of the KNN classifier (all methods): O(N ∗ S ∗ D ∗ log(S) ∗ K)
Explanatory note: M: number of objective functions; N: population size; D: dimension of the problem; S: number of samples; K: K-fold cross validation; maxIter: maximum number of iterations.
Table 12
Average running CPU time of nine algorithms.
Run Time MOABC MOBGWO MOBHHO NSGA-II NSGAII-SDR MOPSO CMDPSOFS SparseEA MOBGWO-GMS
Vehicle 2.9134 1.3587 2.5663 1.6712 1.5866 1.4847 2.5141 1.5991 2.2895
German 2.8879 1.3301 2.5354 11.2481 1.6365 1.4763 2.4255 1.8137 2.4178
Ionosphere 2.5157 1.3009 2.2905 2.0833 1.5665 1.3625 2.1286 1.6711 2.0467
Waveform 4.6683 2.7515 4.7444 5.8407 3.0626 2.6086 4.3276 3.2728 4.4544
Sonar 2.4614 1.2727 2.2803 2.6101 1.3982 1.2521 2.0300 1.6081 2.4126
Libras 2.6164 1.3450 2.4004 2.6169 1.5304 1.3317 2.2436 1.9175 2.6421
HillValley 2.6541 1.3486 2.4454 2.6623 1.6408 1.3102 2.0519 1.6016 2.1065
Musk 2.6732 1.2698 2.4881 2.5252 1.6869 1.3780 2.0815 1.8743 1.9568
Madelon 8.5921 2.2138 9.2071 6.8140 4.2155 4.4407 6.7070 3.4174 3.3393
Isolet5 7.6602 2.1940 8.1016 6.5277 4.1845 3.9585 6.3248 3.8327 2.8599
SRBCT 2.8537 1.4805 2.7919 2.6473 1.8586 1.5690 2.2725 5.7360 2.0007
DBWorld 3.1721 1.4280 3.0847 2.7836 1.7878 1.6315 2.4028 9.1485 1.8639
Leukemia1 4.1380 1.5612 3.6187 3.1399 2.2868 2.0617 3.2094 11.3248 2.0625
Brain_Tumor1 4.5473 1.6424 4.0188 3.8111 2.6046 2.2864 3.4998 12.1604 2.1380
5.5.4. Computational complexity and runtime analysis
Algorithm 1 gives the pseudocode of the MOBGWO-GMS algorithm. The dimension of the dataset is M, the number of populations is N, and the maximum number of iterations is maxIter. According to Algorithm 1, the time complexity analysis is described as follows. The complexity of the algorithm is mainly divided into two stages: stage 1 is concentrated in step 2 and step 3, and stage 2 is concentrated in the while loop of steps 6-28. In the first stage, the complexity of the new initialization strategy proposed in step 2 is O(M ∗ N); the complexity of the fitness evaluation and non-dominated sorting in step 3 is O(M ∗ N + M ∗ N²). In the second stage, the complexity mainly comes from the GMS strategy (steps 17-23) and step 26. In the worst case of GMS, that is, when the size of GoodPop is N, the complexity is as high as O(maxIter ∗ M ∗ N²), which is much higher than the complexity O(maxIter ∗ M ∗ N) of steps 9-15 (the BGWO strategy); the complexity of step 26 is O(maxIter ∗ (M ∗ N + M ∗ N²)). It can be seen that the complexity of the second stage is much higher than that of the first stage. Therefore, the total time complexity of the MOBGWO-GMS algorithm is O(maxIter ∗ M ∗ N²). Table 11 displays the time complexities of the compared algorithms.

For the sake of simplicity, we compare the CPU runtime of the nine algorithms in a single iteration of the main loop. Table 12 shows the average running CPU time of each algorithm when independently run 20 times on the fourteen datasets. As shown in Table 12, the runtime of the MOBGWO-GMS algorithm is significantly higher than that of the MOPSO and MOBGWO algorithms on each dataset. The main reason is that the triple-mutation strategies of MOBGWO-GMS in the local search consume some time. However, compared with the other algorithms, MOBGWO-GMS is still competitive: its runtime is much lower than that of the NSGA-II, MOABC, and MOBHHO algorithms on the German and Sonar datasets. While the calculation process of GMS requires a certain amount of time to execute, the incorporation of GMS does not significantly increase the computational complexity as expressed in big O notation.

6. Conclusions and future works

In this paper, our goal is to achieve a Pareto front with low classification error and a reduced number of features while preserving population diversity. MOBGWO-GMS uses the Pearson correlation coefficient to guide the mutation strategy, adopts the uniform population initialization method, and employs three different increase and decrease mutation strategies to search the search space efficiently. To verify the competitiveness of MOBGWO-GMS, eight well-known algorithms and fourteen
benchmark datasets were used for comparison. The experimental results show that, in most cases, in terms of both dimensionality reduction and classification error, the MOBGWO-GMS method has a significant competitive advantage. Especially on medium- and large-scale datasets, which are more common in the domain of machine learning and data mining, it performs excellently and can find a set of optimal feature subsets with good convergence and diversity.

It should be noted that the algorithm's performance is not exceptional on some low-dimensional datasets. This may be because lower-dimensional datasets have fewer optimal solutions, and other algorithms may also escape local optima. Our next focus is to optimize the efficiency of the algorithm by minimizing its complexity. Furthermore, we intend to utilize the improved algorithm to classify large-scale imbalanced datasets.

CRediT authorship contribution statement

Xiaobo Li: Supervision, Resources, Project administration, Validation, Funding acquisition, Writing – review & editing. Qiyong Fu: Conceptualization, Methodology, Software, Investigation, Writing – original draft. Qi Li: Formal analysis, Data curation, Funding acquisition. Weiping Ding: Formal analysis, Data curation, Software. Feilong Lin: Formal analysis, Data curation, Software. Zhonglong Zheng: Formal analysis, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61373057, 62273310, 62272419; Key Project of Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ22F020010; Zhejiang Provincial Natural Science Foundation of China under Grant No. LY22F030006.

References

[1] Y. Sun, C. Lu, X. Li, The cross-entropy based multi-filter ensemble method for gene selection, Genes 9 (5) (2018) 258.
[2] M. Canayaz, Classification of diabetic retinopathy with feature selection over deep features using nature-inspired wrapper methods, Appl. Soft Comput. 128 (2022) 109462.
[3] S. Maza, M. Touahria, Feature selection for intrusion detection using new multi-objective estimation of distribution algorithms, Appl. Intell. 49 (12) (2019) 4237–4257.
[4] Q. Al-Tashi, S.J. Abdulkadir, H.M. Rais, S. Mirjalili, H. Alhussian, Approaches to multi-objective feature selection: A systematic literature review, IEEE Access 8 (2020) 125076–125096.
[5] S. Li, X. Li, H. Chen, Y. Zhao, J. Dong, A novel hybrid hunger games search algorithm with differential evolution for improving the behaviors of non-cooperative animals, IEEE Access 9 (2021) 164188–164205.
[6] H. Chen, X. Li, S. Li, Y. Zhao, J. Dong, Improved slime mould algorithm hybridizing chaotic maps and differential evolution strategy for global optimization, IEEE Access (2022).
[7] H. Chen, S. Li, X. Li, Y. Zhao, J. Dong, A hybrid adaptive differential evolution based on Gaussian tail mutation, Eng. Appl. Artif. Intell. 119 (2023) 105739.
[8] W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recognit. Lett. 10 (5) (1989) 335–347.
[9] S.B. Sakri, N.B.A. Rashid, Z.M. Zain, Particle swarm optimization feature selection for breast cancer recurrence prediction, IEEE Access 6 (2018) 29637–29647.
[10] Y. Zhang, R. Liu, X. Wang, H. Chen, C. Li, Boosted binary harris hawks optimizer and feature selection, Eng. Comput. 37 (4) (2021) 3741–3770.
[11] A. Al-Ani, Feature subset selection using ant colony optimization, International Journal of Computational Intelligence (2005).
[12] M. Paniri, M.B. Dowlatshahi, H. Nezamabadi-pour, Ant-TD: Ant colony optimization plus temporal difference reinforcement learning for multi-label feature selection, Swarm Evol. Comput. 64 (2021) 100892.
[13] M. Mafarja, S. Mirjalili, Whale optimization approaches for wrapper feature selection, Appl. Soft Comput. 62 (2018) 441–453.
[14] E. Emary, H.M. Zawbaa, A.E. Hassanien, Binary grey wolf optimization approaches for feature selection, Neurocomputing 172 (2016) 371–381.
[15] Y. Zhao, J. Dong, X. Li, H. Chen, S. Li, A binary dandelion algorithm using seeding and chaos population strategies for feature selection, Appl. Soft Comput. (2022) 109166.
[16] P. Agrawal, T. Ganesh, A.W. Mohamed, Chaotic gaining sharing knowledge-based optimization algorithm: an improved metaheuristic algorithm for feature selection, Soft Comput. 25 (14) (2021) 9505–9528.
[17] K. Hussain, N. Neggaz, W. Zhu, E.H. Houssein, An efficient hybrid sine-cosine harris hawks optimization for low and high-dimensional feature selection, Expert Syst. Appl. 176 (2021) 114778.
[18] N. Neggaz, A.A. Ewees, M. Abd Elaziz, M. Mafarja, Boosting salp swarm algorithm by sine cosine algorithm and disrupt operator for feature selection, Expert Syst. Appl. 145 (2020) 113103.
[19] K. Chen, F.-Y. Zhou, X.-F. Yuan, Hybrid particle swarm optimization with spiral-shaped mechanism for feature selection, Expert Syst. Appl. 128 (2019) 140–156.
[20] M. Abdel-Basset, D. El-Shahat, I. El-henawy, V.H.C. de Albuquerque, S. Mirjalili, A new fusion of grey wolf optimizer algorithm with a two-phase mutation to compute a feature selection, Expert Syst. Appl. 139 (2020) 112824.
[21] Y. Xu, X. Li, Q. Li, A discrete teaching–learning based optimization algorithm with local search for rescue task allocation and scheduling, Appl. Soft Comput. 134 (2023) 109980.
[22] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[23] E. Hancer, B. Xue, M. Zhang, D. Karaboga, B. Akay, A multi-objective artificial bee colony approach to feature selection using fuzzy mutual information, in: 2015 IEEE Congress on Evolutionary Computation, CEC, IEEE, 2015, pp. 2420–2427.
[24] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Trans. Cybern. 43 (6) (2012) 1656–1671.
[25] S. Mirjalili, S.M. Mirjalili, A. Lewis, Grey wolf optimizer, Adv. Eng. Softw. 69 (2014) 46–61.
[26] H. Almazini, K. Ku-Mahamud, Grey wolf optimization parameter control for feature selection in anomaly detection, Int. J. Intel. Eng. Syst. 14 (2) (2021) 474–483.
[27] H. Chantar, M. Mafarja, H. Alsawalqah, A.A. Heidari, I. Aljarah, H. Faris, Feature selection using binary grey wolf optimizer with elite-based crossover for arabic text classification, Neural Comput. Appl. 32 (2020) 12201–12220.
[28] R.R. Rajammal, S. Mirjalili, G. Ekambaram, N. Palanisamy, Binary grey wolf optimizer with mutation and adaptive K-nearest neighbour for feature selection in parkinson's disease diagnosis, Knowl.-Based Syst. 246 (2022) 108701.
[29] Q. Al-Tashi, S.J.A. Kadir, H.M. Rais, S. Mirjalili, H. Alhussian, Binary optimization using hybrid grey wolf optimization for feature selection, IEEE Access 7 (2019) 39496–39508.
[30] S. Wang, X. Yu, W. Jia, A new population initialization of particle swarm optimization method based on pca for feature selection, J. Big Data 3 (1) (2021) 1.
[31] X.-f. Song, Y. Zhang, D.-w. Gong, X.-y. Sun, Feature selection using bare-bones particle swarm optimization with mutual information, Pattern Recognit. 112 (2021) 107804.
[32] M. Tahir, A. Tubaishat, F. Al-Obeidat, B. Shah, Z. Halim, M. Waqas, A novel binary chaotic genetic algorithm for feature selection and its utility in affective computing and healthcare, Neural Comput. Appl. (2020) 1–22.
[33] X. Li, J. Ren, MICQ-IPSO: An effective two-stage hybrid feature selection algorithm for high-dimensional data, 2022.
[34] M. Paniri, M.B. Dowlatshahi, H. Nezamabadi-Pour, MLACO: A multi-label feature selection algorithm based on ant colony optimization, Knowl.-Based Syst. 192 (2020) 105285.
[35] F. Karimi, M.B. Dowlatshahi, A. Hashemi, SemiACO: A semi-supervised feature selection based on ant colony optimization, Expert Syst. Appl. 214 (2023) 119130.
[36] A. Hashemi, M. Joodaki, N.Z. Joodaki, M.B. Dowlatshahi, Ant colony optimization equipped with an ensemble of heuristics through multi-criteria decision making: A case study in ensemble feature selection, Appl. Soft Comput. 124 (2022) 109046.
[37] H. Bayati, M.B. Dowlatshahi, A. Hashemi, MSSL: A memetic-based sparse subspace learning algorithm for multi-label classification, Int. J. Mach. Learn. Cybern. 13 (11) (2022) 3607–3624.
[38] Q. Al-Tashi, H. Md Rais, S.J. Abdulkadir, S. Mirjalili, H. Alhussian, A review of grey wolf optimizer-based feature selection methods for classification, Evolut. Mach. Learn. Techn. Algor. Appl. (2020) 273–286.
[39] R.A. Ibrahim, M. Abd Elaziz, S. Lu, Chaotic opposition-based grey-wolf optimization algorithm based on differential evolution and disruption operator for global optimization, Expert Syst. Appl. 108 (2018) 1–27.
[40] Q. Tu, X. Chen, X. Liu, Multi-strategy ensemble grey wolf optimizer and its application to feature selection, Appl. Soft Comput. 76 (2019) 16–30.
[41] S. Arora, H. Singh, M. Sharma, S. Sharma, P. Anand, A new hybrid algorithm based on grey wolf optimization and crow search algorithm for unconstrained function optimization and feature selection, IEEE Access 7 (2019) 26343–26361.
[42] X. Zhao, X. Zhang, Z. Cai, X. Tian, X. Wang, Y. Huang, H. Chen, L. Hu, Chaos enhanced grey wolf optimization wrapped ELM for diagnosis of paraquat-poisoned patients, Comput. Biol. Chem. 78 (2019) 481–490.
[43] P. Hu, J.-S. Pan, S.-C. Chu, Improved binary grey wolf optimizer and its application for feature selection, Knowl.-Based Syst. 195 (2020) 105746.
[44] C. Shen, K. Zhang, Two-stage improved grey wolf optimization algorithm for feature selection on high-dimensional classification, Complex Intell. Syst. (2021) 1–21.
[45] B. Sathiyabhama, S.U. Kumar, J. Jayanthi, T. Sathiya, A. Ilavarasi, V. Yuvarajan, K. Gopikrishna, A novel feature selection framework based on grey wolf optimizer for mammogram image analysis, Neural Comput. Appl. 33 (21) (2021) 14583–14602.
[46] A. Khan, A.R. Baig, Multi-objective feature subset selection using non-dominated sorting genetic algorithm, J. Appl. Res. Technol. 13 (1) (2015) 145–159.
[47] M. Amoozegar, B. Minaei-Bidgoli, Optimizing multi-objective PSO based feature selection method using a feature elitism mechanism, Expert Syst. Appl. 113 (2018) 499–514.
[48] Y. Hu, Y. Zhang, D. Gong, Multiobjective particle swarm optimization for feature selection with fuzzy cost, IEEE Trans. Cybern. 51 (2) (2020) 874–888.
[49] Y. Zhang, S. Cheng, Y. Shi, D.-w. Gong, X. Zhao, Cost-sensitive feature selection using two-archive multi-objective artificial bee colony algorithm, Expert Syst. Appl. 137 (2019) 46–58.
[50] X.-h. Wang, Y. Zhang, X.-y. Sun, Y.-l. Wang, C.-h. Du, Multi-objective feature selection based on artificial bee colony: An acceleration approach with variable sample size, Appl. Soft Comput. 88 (2020) 106041.
[51] J. Piri, P. Mohapatra, An analytical study of modified multi-objective harris hawk optimizer towards medical data feature selection, Comput. Biol. Med. 135 (2021) 104558.
[52] Y. Zhou, J. Kang, H. Guo, Many-objective optimization of feature selection based on two-level particle cooperation, Inform. Sci. 532 (2020) 91–109.
[53] E. Emary, W. Yamany, A.E. Hassanien, V. Snasel, Multi-objective gray-wolf optimization for attribute reduction, Procedia Comput. Sci. 65 (2015) 623–632.
[54] A. Sahoo, S. Chandra, Multi-objective grey wolf optimizer for improved cervix lesion classification, Appl. Soft Comput. 52 (2017) 64–80.
[55] Q. Al-Tashi, S.J. Abdulkadir, H.M. Rais, S. Mirjalili, H. Alhussian, M.G. Ragab, A. Alqushaibi, Binary multi-objective grey wolf optimizer for feature selection in classification, IEEE Access 8 (2020) 106247–106263.
[56] D. Moldovan, A. Slowik, Energy consumption prediction of appliances using machine learning and multi-objective binary grey wolf optimization for feature selection, Appl. Soft Comput. 111 (2021) 107745.
[57] A.F.V. Ukken, A. Bindu Jayachandran, J.K. Punnath Malayathodi, P. Das, Statistically aided binary Multi-Objective Grey Wolf Optimizer: a new feature selection approach for classification, J. Supercomput. (2023) 1–33.
[58] S. Mirjalili, S. Saremi, S.M. Mirjalili, L.d.S. Coelho, Multi-objective grey wolf optimizer: a novel algorithm for multi-criterion optimization, Expert Syst. Appl. 47 (2016) 106–119.
[59] J. Adler, I. Parmryd, Quantifying colocalization by correlation: the pearson correlation coefficient is superior to the Mander's overlap coefficient, Cytometry Part A 77 (8) (2010) 733–742.
[60] R. Akbari, R. Hedayatzadeh, K. Ziarati, B. Hassanizadeh, A multi-objective artificial bee colony algorithm, Swarm Evol. Comput. 2 (2012) 39–52.
[61] Y. Tian, R. Cheng, X. Zhang, Y. Su, Y. Jin, A strengthened dominance relation considering convergence and diversity for evolutionary many-objective optimization, IEEE Trans. Evol. Comput. 23 (2) (2018) 331–345.
[62] Y. Tian, X. Zhang, C. Wang, Y. Jin, An evolutionary algorithm for large-scale sparse multiobjective optimization problems, IEEE Trans. Evol. Comput. 24 (2) (2019) 380–393.
[63] E. Zitzler, L. Thiele, M. Laumanns, C.M. Fonseca, V.G. Da Fonseca, Performance assessment of multiobjective optimizers: An analysis and review, IEEE Trans. Evol. Comput. 7 (2) (2003) 117–132.
[64] H. Chen, G. Wu, W. Pedrycz, P.N. Suganthan, L. Xing, X. Zhu, An adaptive resource allocation strategy for objective space partition-based multiobjective optimization, IEEE Trans. Syst. Man Cybern. Syst. 51 (3) (2019) 1507–1522.
[65] A. Deniz, H.E. Kiziloz, On initial population generation in feature subset selection, Expert Syst. Appl. 137 (2019) 11–21.
[66] C.A. Coello Coello, Evolutionary multi-objective optimization: some current research trends and topics that remain to be explored, Front. Comput. Sci. China 3 (1) (2009) 18–30.
[67] P. Ngatchou, A. Zarei, A. El-Sharkawi, Pareto multi objective optimization, in: Proceedings of the 13th International Conference on Intelligent Systems Application to Power Systems, IEEE, 2005, pp. 84–91.