A Pairwise Comparison Relation-assisted Multi-objective Evolutionary Neural Architecture Search Method with Multi-population Mechanism
Yu Xue, Senior Member, IEEE, Chenchen Zhu, MengChu Zhou, Fellow, IEEE, Mohamed Wahib,
and Moncef Gabbouj, Fellow, IEEE

Abstract—Neural architecture search (NAS) enables researchers to automatically explore vast search spaces and find efficient neural networks. But NAS suffers from a key bottleneck, i.e., numerous architectures need to be evaluated during the search process, which requires a lot of computing resources and time. In order to improve the efficiency of NAS, a series of methods have been proposed to reduce the evaluation time of neural architectures. However, they are not efficient enough and still only focus on the accuracy of architectures. In addition to the classification accuracy, more efficient and smaller network architectures are required in real-world applications. To address the above problems, we propose SMEM-NAS, a pairwise comparison relation-assisted multi-objective evolutionary algorithm based on a multi-population mechanism. In SMEM-NAS, a surrogate model is constructed based on pairwise comparison relations to predict the accuracy ranking of architectures, rather than the absolute accuracy. Moreover, two populations cooperate with each other in the search process, i.e., a main population guides the evolution, while a vice population expands the diversity. Our method aims to provide high-performance models that take into account multiple optimization objectives. We conduct a series of experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets to verify its effectiveness. With only a single GPU searching for 0.17 days, competitive architectures can be found by SMEM-NAS, which achieves 78.91% accuracy with the MAdds of 570M on ImageNet. This work makes a significant advance in the important field of NAS.

Index Terms—Evolutionary neural architecture search, surrogate model, multi-objective optimization.

This work was supported by the National Natural Science Foundation of China (NO.62376127, NO.61876089, NO.61876185, NO.61902281), the Natural Science Foundation of Jiangsu Province (BK20141005), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (14KJB520025), and the Jiangsu Distinguished Professor Programme. (Corresponding author: Yu Xue.)
Yu Xue and Chenchen Zhu are with the School of Software, Nanjing University of Information Science and Technology, Jiangsu, China. E-mails: [email protected]; [email protected].
MengChu Zhou is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA. E-mail: [email protected].
Mohamed Wahib is with RIKEN Center for Computational Science (R-CCS), Japan. E-mail: [email protected].
Moncef Gabbouj is with the Department of Computing Sciences, Tampere University, Tampere, Finland. E-mail: [email protected].
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I. INTRODUCTION

CONVOLUTIONAL neural networks (CNNs) have achieved great success in various machine learning tasks [1]–[4]. However, in order to achieve promising performance, traditional CNNs are usually designed manually by human experts with extensive domain knowledge and experience. Not every interested user has such expertise, and even for experts, designing CNNs is a time-consuming and error-prone process. Therefore, in order to facilitate and automate the design of deep convolutional neural networks, Zoph et al. first proposed the concept of neural architecture search (NAS) [5] and developed a well-known neural architecture search method for CNNs called NASNet [6], which can obtain neural architectures with highly competitive performance compared to hand-crafted neural architectures. In recent years, researchers have developed many NAS methods, which has attracted increasing attention from both industry and academia in a variety of learning tasks [7], [8], such as object detection [9], semantic segmentation [10], and natural language processing [11].

In addition to the accuracy of neural networks, real-world applications also require NAS methods to find computationally efficient network architectures, e.g., low power consumption in mobile applications and low latency in autonomous driving. Generally speaking, the classification accuracy usually increases continuously with the complexity of the network architectures (i.e., the number of layers, the number of channels, etc.) [12]. This implies that maximizing the accuracy and minimizing the network complexity are two competing and conflicting objectives, thus requiring multi-objective optimization for NAS. Existing NAS algorithms can be classified into three categories: reinforcement learning (RL) based [6], [13], evolutionary algorithm (EA) based [14], [15], and gradient-based methods [16], [17]. Despite recent advances in RL-based and gradient-based NAS methods, they are still not easily applicable to multi-objective NAS. Gradient-based methods, in which the search space is continuous, rely on gradient descent to optimize the neural architectures, while other objectives, such as complexity, cannot be optimized by the loss function. So, multiple objectives are not easy to optimize and the diversity of architectures can be missed in gradient-based methods. RL-based methods are more costly and time-consuming than the other two. An evolutionary algorithm is a search method based on evolutionary principles that searches for an optimal solution through evolution and selection [14]. When dealing with multi-objective problems in the NAS field, evolutionary algorithms are more adaptable and flexible.
Regarding multi-objective NAS, researchers have proposed many methods for finding high-performance architectures that satisfy multiple metrics simultaneously [18]. For example, MnasNet [19] can be used to design a multi-objective neural architecture search approach that optimizes both accuracy and real-world latency on mobile devices. MixNet [20] can be used to design a new mixed depthwise convolution with accuracy and computation as the optimization metrics. But these methods have lower efficiency in optimizing architectures. In addition to the problem of search efficiency, there is a lack of improvement in the search strategy.

During the search process, there is often a lack of diversity in the generated architectures, leading to premature convergence to suboptimal solutions. This phenomenon can be attributed to several factors, including the selection operation favouring one particular objective to retain individuals and inadequate evolutionary operators. These constraints often hinder the exploration of the entire Pareto front, thereby limiting the quality of the obtained architectures [21]. For example, NSGANetV1 [15] tends to search for architectures within a small range around a certain FLOPs value, which may lead to convergence to a local optimal solution. Therefore, how to maintain the diversity of the population without sacrificing performance is a challenging problem [22], [23]. There exist a number of diversity preservation approaches in existing multi-objective evolutionary algorithms, which usually select diverse architectures from the current generation and pass them to the next generation to increase the diversity of the population during the search [21]. Inspired by these methods, which have not been well applied in NAS, we improve the selection operation and design a multi-population mechanism to generate more diverse architectures.

Nevertheless, regardless of the search space and search strategy used, one bottleneck in this field is the need to evaluate a large number of network architectures during the search process, which requires huge computational resources and time [24]. Many researchers have proposed methods to improve the efficiency of NAS algorithms, including weight sharing [13], [17], population memory [14], the early stopping mechanism [25], and so on. But most of these methods still have to explicitly or implicitly select numerous architectures and then perform training to evaluate them. Recently, some researchers have established surrogate models to virtually evaluate network architectures, which can execute the environment selection with the predicted accuracy, significantly reducing the evaluation time [26]. However, one of the key challenges in obtaining reliable surrogate models is that obtaining training samples (pairs of architecture and accuracy) is computationally expensive to some extent. In existing methods, the sample utilization of surrogate models is not efficient. For example, PRE-NAS [27] trains and evaluates a large number of architectures to serve as training samples for the surrogate model. But these training samples receive no special data processing and their quantity is very limited. Moreover, we find that surrogate models do not necessarily need to produce reliable accuracy estimates (accuracy in the absolute sense), as long as the predicted results are consistent with the true performance ranking of architectures. In the evolutionary algorithm, the selection operation needs to be performed according to the fitness values of the individuals, and only the individuals with high fitness are selected for subsequent evolution. Therefore, it is more important to obtain the ranking of the candidate architectures than to predict the true accuracy.

To address the above problems, we propose an efficient algorithm, called SMEM-NAS, which is a surrogate-assisted multi-objective evolutionary algorithm with a multi-population mechanism for NAS. Firstly, we establish an efficient surrogate model based on pairwise comparison relations to obtain the accuracy ranking of candidate architectures. Then we conduct practical training on the selected top-ranked architectures. Furthermore, to consider other performance metrics and increase the diversity of the algorithm, we propose a multi-objective evolutionary algorithm based on a multi-population mechanism. Our approach aims to provide a set of high-performance architectures that take into account multiple optimization objectives. We validate the effectiveness of the proposed method on an image classification task with standard datasets, CIFAR-10, CIFAR-100 and ImageNet. The computational results show that our method outperforms most state-of-the-art NAS methods. This work intends to make the following contributions to advance the field of NAS:
1) It proposes a method to optimize both the accuracy and network complexity for NAS. We choose the number of multiply-add operations (MAdds) as the second objective. It can facilitate the ENAS to achieve high-quality architectures with low computational cost.
2) A more efficient surrogate model based on pairwise comparison relations is proposed. Different from existing surrogate models that focus on absolute accuracy, ours focuses more on the relative rankings among network architectures. It is further able to shorten the time consumption of the search process compared to previous surrogate models.
3) A multi-objective evolutionary algorithm based on a multi-population mechanism is proposed, in which a main population guides the evolution and a vice population assists the evolution. The mechanism can greatly prevent the algorithm from falling into local optima and speed up the convergence of the algorithm.

The remainder of this paper is organized as follows: Section II presents the related works and background. Section III describes our proposed method in detail. We present the experimental design and results to verify the effectiveness and efficiency of our method with respect to its peers in Section IV. Finally, the conclusions and future works are drawn in Section V.

II. RELATED WORK

In recent years, with deep neural networks widely deployed in different applications and computing environments, research on and exploration of NAS have become particularly attractive. In the following, Section II-A and Section II-B present studies related to multi-objective NAS and surrogate models for NAS. The introduction and usage of the supernet are presented in Section II-C.
A. Multi-objective NAS

In the early methods of NAS, most studies only focus on the accuracy of neural networks. This often leads to searched networks that suffer from excessive computation time and model size. On the other hand, accuracy is rarely the only metric to be considered in real-world applications. It is often necessary to consider other additional and conflicting goals specific to the deployment scenario in the real world, such as model size, model computation, inference latency, and so on. Thus, some researchers have tried to take these secondary objectives into account while searching for architectures with high accuracy. For example, Lu et al. consider both classification accuracy and model computation, and then use the non-dominated sorting genetic algorithm II (NSGA-II) as the multi-objective optimization method [15], [28]. Besides, Xue et al. propose a multi-objective evolutionary algorithm with a probability stack for NAS, which considers accuracy and time consumption [29]. Wang et al. adopt a multi-objective particle swarm optimization algorithm to simultaneously optimize classification accuracy and FLOPs [30]. Du et al. further refine the multi-objective optimization process in NAS with reference point-based environment selection [31]. Most of these existing approaches rely on the evaluation of architectures, and are not efficient enough and still time-consuming [32], [33]. These methods lack improvements to the multi-objective evolutionary algorithms (MOEAs) themselves and tend to adopt existing MOEAs such as NSGA-II [34]. There also exists the phenomenon of small model traps during the search process [35]. Specifically, during the early search, smaller models with fewer parameters but higher test accuracy tend to dominate larger models that are less accurate but have the potential for higher accuracy [36]. Moreover, unsuitable evolution operators or conflicting objective characteristics may also lead to low diversity of the population and an imbalance between diversity and convergence, which may cause the search results to fall into local optima [37].

B. Surrogate Models

The main computational bottleneck in NAS is the performance evaluation of architectures. In recent years, many surrogate-based methods have been proposed to accelerate the evaluation process in evolutionary algorithms [38]. PNAS, proposed by Liu et al. from Google, uses long short-term memory (LSTM) as a surrogate model to replace the training of deep neural networks [39]. Afterwards, Luo et al. propose a neural network architecture optimization method, NAO, by using MLP as a surrogate model, which works better than PNAS [16]. E2EPP [40] uses an offline surrogate model based on random forest (RF) to predict the performance directly from encodings of architectures. The online surrogate model is further improved in NSGANetV2 [34], which uses the architecture search to guide the construction of the accuracy predictor and significantly reduces the number of training samples. Guo et al. exploit a ranking loss function to learn a predictor model that predicts the relative score of architectures [41]. They can directly score and rank the performance of architectures to assist the evaluation process in NAS. However, the accuracy predictors constructed and the surrogate models adopted in most existing methods are simple and cannot make full use of training samples. Inaccurate accuracy predictions also tend to lead to rank disorder of architecture performance, which affects the environment selection. Recently, Wang et al. propose a neural architecture search method based on particle swarm optimization, which uses a support vector machine (SVM) as a surrogate model to conduct a preliminary exploration of the individual comparison relationship [42]. Despite the impressive progress, they search in a specific space constructed by themselves rather than in a general search space. That makes their method difficult to compare with other related methods, and the method is not general enough.

C. SuperNet

To further improve the search efficiency of the proposed algorithm, we choose to adopt the weight sharing technique. At first, we need a supernet that includes all searchable architectures, which are its sub-networks. In this paper, the supernet we use is OnceForAll [43]. OnceForAll supports variations in four factors: depth, width, convolutional kernel size, and image resolution. In OnceForAll, an overall network is firstly trained by using a large amount of GPU resources, i.e., a maximal network is constructed by taking all the searched hyperparameters of the architectures (depth, width and kernel size) to their maximal values. Then the supernet is trained with the progressive shrinking algorithm [43], in which the largest network is fine-tuned to support sub-networks and they are added to the sampling space by sharing weights. The weights inherited from the trained supernet are used as a warm-up for the gradient descent algorithm during our architecture search.
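To make the weight-sharing warm-up concrete, the following is a deliberately simplified Python sketch of how a sub-network can inherit weights from the largest supernet layer by taking a centred kernel slice and a channel subset. It is only an approximation for illustration: OnceForAll's progressive shrinking additionally uses learned kernel-transformation matrices, and the function and tensor names here are our own, not from the OnceForAll code base.

```python
# Simplified sketch of supernet weight inheritance (weight sharing).
# NOTE: illustrative approximation only; OnceForAll additionally learns
# kernel-transformation matrices, which are omitted here.
import torch

def inherit_subnet_weight(full_weight: torch.Tensor,
                          sub_channels: int,
                          sub_kernel: int) -> torch.Tensor:
    """Slice a sub-network's weight out of the supernet's largest layer.

    full_weight: supernet depthwise-conv weight, shape (C_max, 1, K_max, K_max)
    sub_channels: number of channels kept by the sub-network (<= C_max)
    sub_kernel:   kernel size used by the sub-network (<= K_max, both odd)
    """
    k_max = full_weight.shape[-1]
    start = (k_max - sub_kernel) // 2          # take the centred sub-kernel
    w = full_weight[:sub_channels, :, start:start + sub_kernel,
                    start:start + sub_kernel]
    return w.clone()                           # warm start for later fine-tuning

# Example: a 7x7, 64-channel supernet layer warm-starts a 3x3, 32-channel layer.
full = torch.randn(64, 1, 7, 7)
sub = inherit_subnet_weight(full, sub_channels=32, sub_kernel=3)
print(sub.shape)  # torch.Size([32, 1, 3, 3])
```

In SMEM-NAS such inherited weights only serve as a starting point; each sampled architecture is still briefly fine-tuned with SGD before its accuracy is recorded.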

III. THE PROPOSED METHOD FOR EVOLUTIONARY MULTI-OBJECTIVE NEURAL ARCHITECTURE SEARCH

This section presents the details of our proposed pairwise comparison relation-assisted multi-objective evolutionary algorithm for NAS. We firstly present the framework of the proposed algorithm in Subsection III-A, and the details of the proposed search space and encoding, surrogate model, and multi-objective algorithm with a multi-population mechanism are presented in Subsections III-B to III-D.

A. Overall Framework

An overview of the proposed algorithm is illustrated in Fig. 1. As shown in Fig. 1, our algorithm generally follows the basic process of a genetic algorithm. Compared to existing ENAS methods, it mainly differs in the following two aspects: one is that a surrogate model is constructed to assist in the evaluation; the other is that a multi-population mechanism is added into the process of generating new individuals.

Fig. 1: Overall Framework of the multi-objective algorithm based on a multi-population mechanism (MP-MOEA) and the surrogate model assisting the search process (SMEM-NAS). (a) Offspring are generated with a multi-population mechanism. (b) The illustration of the evaluation assisted by the surrogate model. (c) The flowchart of the proposed algorithm.

Algorithm 1 shows the pseudo code of the proposed algorithm. Firstly, in order to construct an efficient surrogate model, some evaluated architectures are needed to construct the training dataset. So before the search process starts, an empty archive A to store training samples should be initialized (line 3). N individuals are randomly sampled from the search space and then decoded to train. As mentioned above, we employ weight sharing during the training. The weights inherited from the trained supernet are used as a warm-up for the stochastic gradient descent (SGD) algorithm to improve the search efficiency (lines 4-9). Afterwards, we enter the main search loop of the algorithm (line 10). The surrogate model is the first to be constructed using the archive A (line 11). A detailed description of the surrogate model is given later in Subsection III-C. Then we define a set to store individuals to be evaluated. Next, we use the proposed multi-objective evolutionary algorithm with a multi-population mechanism (MP-MOEA for short), combined with our surrogate model, to optimize both accuracy (Acc) and MAdds, i.e., maximize classification accuracy and minimize model complexity (line 12). This process is described in detail later in Subsection III-D. To improve the sample efficiency of our search, we train the surrogate model using an online learning approach. The retained individuals of each generation are decoded and trained, and these retained offspring are used as new training samples to update the surrogate model (lines 14-18). The above steps are repeated until the conditions for the end of the procedure are satisfied. Finally, after the multi-objective evolutionary search, the optimal architectures are selected from the pool of architectures based on the non-dominated sorting (line 21).

Algorithm 1: SMEM-NAS
Input: Search space S, number of initial samples N, SuperNet SW, number of iterations T, complexity objective fc, number of selected candidates K, candidate population Rt.
1 i ← 0 // Initialize a sampling individual counter.
2 t ← 0 // Initialize an iteration counter.
3 A ← ∅ // Initialize an empty archive to store evaluated architectures.
4 while i < N do // Sample N architectures randomly.
5     α ← Sample(S)
6     ωα ← SW(α) // Weights inherited from the trained supernet.
7     acc ← SGD(α, ωα)
8     A ← A ∪ (α, acc)
9 end
10 while t < T do
11     predictor ← Construct surrogate model from A // Online update of the predictor.
12     Solutions ← MP-MOEA(A, predictor, fc)
13     Rt ← Select K elite architectures from Solutions
14     for α in Rt do // Add candidates to A.
15         ωα ← SW(α)
16         acc ← SGD(α, ωα)
17         A ← A ∪ (α, acc)
18     end
19     t ← t + 1
20 end
21 return architectures chosen by NonDominatedSort(A)
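Both Algorithm 1 (line 21) and the environment selection described later rely on non-dominated sorting over the two objectives used in this work, i.e., maximizing accuracy and minimizing MAdds. The following self-contained Python sketch shows how the first non-dominated front can be extracted; the architecture names and numbers are made up purely for illustration and are not results from the paper.

```python
# Minimal sketch of the non-dominated filtering used throughout the paper
# for the two objectives considered here: maximize accuracy, minimize MAdds.
# The example data below are illustrative only.

def dominates(a, b):
    """a and b are (accuracy, madds) tuples; higher accuracy and lower MAdds win."""
    not_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return not_worse and strictly_better

def non_dominated(archive):
    """Return the first non-dominated front of (name, accuracy, madds) triples."""
    front = []
    for i, (_, acc_i, madds_i) in enumerate(archive):
        if not any(dominates((acc_j, madds_j), (acc_i, madds_i))
                   for j, (_, acc_j, madds_j) in enumerate(archive) if j != i):
            front.append(archive[i])
    return front

archive = [("arch_a", 97.9, 305), ("arch_b", 97.5, 520),
           ("arch_c", 98.0, 260), ("arch_d", 96.8, 200)]
print([a[0] for a in non_dominated(archive)])  # ['arch_c', 'arch_d']
```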

B. Search Space and Encoding

The search process of NAS starts with constructing a search space that can contain most CNNs for image tasks. In this paper, the backbone structure is based on MobileNetV3 [44]. The structure consists of three stages. The start stage extracts features, and the final stage outputs categories. These two parts do not need to be searched. The middle stage needs to be searched and consists of five sequentially connected MBConvBlocks [43] that progressively decrease the feature map size and increase the number of channels. Each block is composed of a series of layers, and every layer adopts the inverted bottleneck structure [45]: first, a 1 × 1 convolution, which is used to convert the input channels to the expansion channels; second, a depth-wise convolution, which operates on the expansion channels and contains the stride parameter; and last, a 1 × 1 convolution, which is used to convert from the expansion channels to the output channels. In addition, only the first layer uses stride 2 if the feature map size decreases, and all the other layers in the block use stride 1. The number of layers of each convolution block (depth) is selected from {2, 3, 4}; for each layer, we search the expansion rate of the first 1 × 1 convolution and the kernel size of the depth-wise separable convolution. The expansion rate is selected from {3, 4, 6}, and the kernel size is selected from {3, 5, 7}. Moreover, we also allow the CNNs to take multiple input image sizes (resolutions), ranging from 192 to 256 with a stride of 4. For the encoding strategy, we use a fixed 46-bit integer string. If the encoding length of an architecture with fewer layers is less than 46 bits, the fixed length is achieved by padding with zeros. Fig. 2 shows the encoding strategy, and Fig. 3 shows the search space of this paper. As shown in these figures, the architecture is composed of five blocks. The encoding of an architecture is composed of the image resolution and other parts representing the five blocks. For each block, its encoding consists of the number of layers, the expansion rate, and the kernel size of each layer. In addition, the values in the encoding are all indices of the considered options. As depicted in Fig. 2, the first bit "0" denotes the index of the first candidate resolution, which means the resolution is 192. The remaining 45 bits represent five blocks, each of which is represented by 9 bits, e.g., Block 1 can be encoded as "211200012". The first bit "2" denotes the index of the candidate layer number, which means Block 1 has four layers. Then, the next four bits "1120" denote the kernel size of each layer. Specifically, the kernel sizes of the layers in order are {5, 5, 7, 3}. Similarly, the last four bits denote that the expansion rates of the layers in order are {3, 3, 4, 6}. In Block 5, the gray bit "0" represents no layer and the encoding is padded with zero, which is not an index of the considered options.

Fig. 2: A candidate architecture encoding example. The encoding is divided into five parts by blocks. The parameters we search include the image resolution, the number of layers in each block, and the expansion rate and the kernel size in each layer.

Fig. 3: Search space: The left part shows a complete network stacked by five blocks. The right part represents a block and its internal structure, which consists of multiple layers.
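As a concrete illustration of the fixed-length encoding just described, the short Python sketch below decodes a 46-integer string into the resolution and per-layer settings of each block. The option tables follow the values given above, while the helper names are ours and not taken from the paper's implementation.

```python
# Sketch of the 46-integer encoding: 1 resolution index + 5 blocks x
# (1 depth index + 4 kernel indices + 4 expansion-rate indices).

RESOLUTIONS = list(range(192, 257, 4))   # 17 options, indices 0..16
DEPTHS      = [2, 3, 4]
KERNELS     = [3, 5, 7]
EXPANSIONS  = [3, 4, 6]

def decode(encoding):
    """Turn a 46-integer string into a readable architecture description."""
    assert len(encoding) == 46
    arch = {"resolution": RESOLUTIONS[encoding[0]], "blocks": []}
    for b in range(5):
        gene = encoding[1 + 9 * b : 1 + 9 * (b + 1)]
        depth = DEPTHS[gene[0]]
        layers = [{"kernel": KERNELS[gene[1 + i]],
                   "expansion": EXPANSIONS[gene[5 + i]]}
                  for i in range(depth)]        # unused slots stay zero-padded
        arch["blocks"].append(layers)
    return arch

# The Block-1 example from Fig. 2: "2 1 1 2 0 0 0 1 2" -> 4 layers,
# kernels {5, 5, 7, 3} and expansion rates {3, 3, 4, 6}.
# The remaining blocks are dummy zeros here, purely for illustration.
enc = [0] + [2, 1, 1, 2, 0, 0, 0, 1, 2] + [0] * 36
print(decode(enc)["blocks"][0])
```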

C. Surrogate Model

Unlike most of the existing surrogate-assisted methods, which are used to predict the classification accuracy of architectures, we use the surrogate model to obtain the performance ranking of candidate architectures. In the evolutionary algorithm, the selection operation needs to be carried out according to the fitness values of individuals, and individuals with high fitness will be selected for the subsequent evolution. Therefore, it is more important to obtain the ranking of the candidates than to predict the true accuracy. We propose a surrogate model based on pairwise comparison relations, which is constructed on the basis of a classification model. Its inputs are vectors of the encoding combinations of pairwise architectures, and its outputs are labels indicating which one is the better solution. In this paper, we study the comparison relation between two individuals and transform the comparison relation learning problem into a binary classification problem. Binary classification models can be trained better with fewer samples. For example, n(n−1)/2 training samples can be obtained from a dataset containing n labelled samples. Given a small number of samples, a surrogate model with better performance can be obtained through the extended training samples.

In this paper, SVM is used as the surrogate model. In order to train the surrogate model, we should firstly construct the training dataset. The construction method is the same as that proposed in EffPNet [42]. The specific process is as follows: firstly, the data used for training needs to be built from the archive A. The n individuals in the archive A are matched in pairs sequentially. In a pair of individuals, if the fitness value of the first individual is better than that of the other, the class label will be 1. Otherwise, the class label will be 0. Then n(n − 1)/2 training samples can be obtained to feed into the SVM for training. The above data processing transforms the direct prediction of architecture performance into a binary classification task comparing the performance of a pair of architectures. We take a pair of encodings of two architectures as the input of the surrogate model, and the output is label "1" or "0" as described above. To make it more intuitive, an example is provided in Fig. 4 to present the concatenation of the encodings of two architectures, as well as the output. Label "1" indicates that architecture Arc_1 has better performance, while label "0" indicates that architecture Arc_2 has better performance.

Fig. 4: An illustration of the surrogate model based on pairwise comparison relation: the input contains the concatenation of two architectures, and the output indicates which architecture is better.

Through the above pairwise relations prediction, we can obtain the performance ranking of all architectures. The details are as follows: assuming that there are n architectures, we set a score for each architecture corresponding to its performance. Firstly, we match the first architecture with the other n − 1 architectures and use the trained surrogate model to predict the relationship of these pairs of architectures. For each pair of architectures, if the first architecture outperforms the second one, the score of the first architecture is increased by 1; otherwise the score of the second architecture is increased by 1. Then, the second architecture is paired one by one with the remaining architectures after removing the first architecture, and the above process is repeated until the last architecture is left. Finally, each architecture is compared n − 1 times and its score ranges from 0 to n − 1. The total number of comparisons is n(n − 1)/2. The larger the value of the score, the more times the corresponding architecture wins in the overall comparisons, which implies that the architecture has superior performance. The results from the pairwise relations prediction will be used in the non-dominated sorting for the environment selection during the subsequent evolutionary process.
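The pairwise surrogate and the round-robin scoring described in this subsection can be sketched in a few lines of Python. The snippet below builds the n(n−1)/2 labelled pairs, fits a scikit-learn SVM classifier on concatenated encodings, and ranks a set of candidates by their number of predicted wins; the random encodings and accuracies are placeholders, and the exact SVM settings used in SMEM-NAS are not specified here.

```python
# Minimal, self-contained sketch of the pairwise-comparison surrogate.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dim = 30, 46
encodings = rng.integers(0, 3, size=(n, dim)).astype(float)   # stand-in encodings
accuracies = rng.random(n)                                     # stand-in fitness

# Build n(n-1)/2 training pairs: label 1 if the first architecture is better.
X, y = [], []
for i in range(n):
    for j in range(i + 1, n):
        X.append(np.concatenate([encodings[i], encodings[j]]))
        y.append(1 if accuracies[i] > accuracies[j] else 0)
clf = SVC(kernel="rbf").fit(np.array(X), np.array(y))

def rank_by_wins(cands):
    """Round-robin scoring: each candidate meets every other candidate once."""
    m = len(cands)
    scores = np.zeros(m, dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            pair = np.concatenate([cands[i], cands[j]])[None, :]
            if clf.predict(pair)[0] == 1:
                scores[i] += 1          # first architecture predicted better
            else:
                scores[j] += 1
    return np.argsort(-scores)          # indices sorted from best to worst

candidates = rng.integers(0, 3, size=(10, dim)).astype(float)
print(rank_by_wins(candidates))
```

In the actual algorithm, the resulting ranking, rather than any raw classifier output, is what enters the non-dominated sorting together with MAdds.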

D. Multi-objective Evolutionary Algorithm Based on Multi-population Mechanism

In multi-objective evolutionary algorithms, diversity needs to be considered as well as convergence. During the practical evolution process, it is easy to fall into a local optimum without exploring the region of the optimal solution. To overcome this problem, we propose a multi-objective evolutionary algorithm based on the multi-population mechanism; see Algorithm 2 for details. The multi-population mechanism is mainly used in the selection operation to select parents for the next generation during the evolution process. Firstly, two sub-populations are constructed from the initial population Pt (line 1). The main population is Et, and the vice population is Ft. Specifically, the initial population Pt is non-dominated sorted according to the predicted architectural rankings obtained from the above surrogate model and the calculated MAdds. After that, the non-dominated solutions of the first level are taken as the main population Et (lines 2-5). Then, the individuals in the population Et are removed from the population Pt, and the crowding distances of the remaining solutions are calculated. K individuals with high crowding ranking are selected to form the vice population Ft (line 6). We set a threshold based on the number of generations. When θ is less than this threshold, both parents p1 and p2 come from the main population Et (lines 13-14). When θ is greater than this threshold, parent p1 comes from the main population and parent p2 comes from the vice population (lines 15-16). Offspring architectures are generated until the number of offspring reaches m in the current generation, where m is a number used to limit the number of offspring. The threshold is a value related to the evolution generation, which is generated by the following equation:

Threshold =
    random(δ, 0.7),   if g < (1/4)G
    random(0, δ),     if (1/4)G ≤ g ≤ (3/4)G
    random(δ, 1),     if g > (3/4)G            (1)

where δ is a randomly generated number in [0, 1], g is the number of the current generation, and G is the total number of generations.

Algorithm 2: Proposed MP-MOEA
Input: Initial population Pt, surrogate model predictor, complexity objective fc, number of generations G, vice population size K.
1 Pt ← architectures in A // A from Algorithm 1.
2 Predict the accuracy ranking of architectures in Pt by predictor // predictor from Algorithm 1.
3 Evaluate the complexity fc of each architecture in Pt
4 NonDominatedSort(Pt) according to the predicted ranking and complexity
5 Et ← All non-dominated solutions in Pt // Main population.
6 Ft ← Select K solutions from the rest of the dominated solutions in Pt by crowding distance // Vice population.
7 g ← 0
8 Qt ← ∅ // Offspring population.
9 while g < G do
   // Multi-population mechanism.
10    θ ← 0.5
11    Threshold ← Generated by Eq. (1)
12    while the number of architectures in Qt < m do
13        if θ < Threshold then
14            p1, p2 ← Randomly select two different solutions from Et
15        else
16            p1, p2 ← Randomly select one solution p1 from Et, another solution p2 from Ft
17        end
18        p′1, p′2 ← Crossover(p1, p2)
19        q1, q2 ← Mutation(p′1), Mutation(p′2)
20        Qt ← Qt ∪ q1 ∪ q2
21    end
22    Pt ← Pt ∪ Qt
23    Predict the accuracy ranking of architectures in Pt
24    Evaluate fc of each architecture in Pt
25    NonDominatedSort(Pt)
26    Et ← All non-dominated solutions in Pt
27    Ft ← Update by crowding distance
28    g ← g + 1
29 end
30 return Pt

Integers are used to encode the architectures in this paper. The crossover operator uses two-point crossover. Because the traditional polynomial mutation is based on real numbers, we rewrite it to fit the integer encoding. In the early stage of the search, we should focus on both convergence and diversity. It is vital to keep the whole population evolving in the right direction, so the main population should be the dominant one, and the vice population should play an assisting role in the exploration. In the middle stage of the search, more consideration is given to convergence, and the parents should be the best individuals. So, parents are all selected from the main population Et, which helps to achieve a fast convergence rate. In the later stage, the population evolution tends to be stable, and we use the vice population to explore more solution regions and increase the diversity of the population to avoid the algorithm falling into local optima. In the main loop of the algorithm, the main and vice populations are iteratively updated, and they collaborate to achieve optimization of the correlated metrics, i.e., higher accuracy and lower network complexity.
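The stage-dependent selection above can be summarized in a small sketch: the threshold of Eq. (1) is drawn according to the current generation, parents are then taken from the main population alone or from both populations, and offspring are produced by two-point crossover on the integer encodings. This is our own paraphrase of the selection and crossover steps of Algorithm 2 for illustration, not the authors' code; the integer-adapted polynomial mutation is omitted for brevity.

```python
# Illustrative sketch of Eq. (1) and the multi-population parent selection.
import random

def threshold(g, G, delta):
    """Eq. (1): the sampling interval depends on the search stage."""
    if g < G / 4:
        return random.uniform(delta, 0.7)
    elif g <= 3 * G / 4:
        return random.uniform(0, delta)
    else:
        return random.uniform(delta, 1)

def select_parents(main_pop, vice_pop, g, G, theta=0.5):
    """Both parents from the main population while theta < Threshold,
    otherwise one parent from each population (cf. Algorithm 2)."""
    delta = random.random()
    if theta < threshold(g, G, delta):
        return random.sample(main_pop, 2)
    return [random.choice(main_pop), random.choice(vice_pop)]

def two_point_crossover(p1, p2):
    """Exchange the segment between two cut points of two integer encodings."""
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    c1 = p1[:a] + p2[a:b] + p1[b:]
    c2 = p2[:a] + p1[a:b] + p2[b:]
    return c1, c2

# Toy usage with 46-integer encodings standing in for architectures.
main_pop = [[random.randint(0, 2) for _ in range(46)] for _ in range(5)]
vice_pop = [[random.randint(0, 2) for _ in range(46)] for _ in range(5)]
p1, p2 = select_parents(main_pop, vice_pop, g=3, G=25)
child1, child2 = two_point_crossover(p1, p2)
print(len(child1), len(child2))  # 46 46
```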

TABLE I: Comparison with state-of-the-art image classifiers on the CIFAR-10 dataset. The multi-objectives used for architecture
optimization are performance and model complexity. *: MAdds computed using the genotypes provided by the authors. †
denotes results from the reimplementation. SMEM-NAS(w/o): Architectures found by SMEM-NAS without the multi-population
mechanism. We have bolded the minimum MAdds and maximum accuracy of the searched architectures.
Architecture Accuracy (%) MAdds (M) Search Cost (GPU Days) Search Method
EfficientNet-B0 [12] 98.1 387 - manual
MobileNetV2 [45] 95.74 300 - manual
PC-DARTS [46] 97.5 532* 0.3 GD
P-DARTS [47] 97.43 558* 0.1 GD
MixNet-M [20] 97.9† 359 - GD
NoisyDARTS [48] 97.63 534 0.4 GD
FairDARTS [49] 97.46 373 0.25 GD
sharpDARTS [50] 97.71 357 1.8 GD
AmoebaNet-B [51] 97.5 555 3150 EA
FairNAS-A [52] 98.2 391 12 EA
FairNAS-B [52] 98.1 348 12 EA
FairNAS-C [52] 98.0 324 12 EA
CARS [35] 97.43 728 0.4 EA
NSGA-Net [28] 97.98 817 4 EA
NASNet-A [6] 97.35 608∗ 1800 RL
SMEMNAS(w/o) 97.85 755 1.25 EA
SMEMNAS-S(ours) 97.96 200 1.25 EA
SMEMNAS-M(ours) 98.03 259 1.25 EA
SMEMNAS-L(ours) 98.13 305 1.25 EA

TABLE II: Comparison with state-of-the-art image classifiers on the CIFAR-100 dataset. † denotes results from the
reimplementation. SMEM-NAS(transfer): the architecture, SMEMNAS-L, which searched on CIFAR-10.
Architecture Accuracy (%) MAdds (M) Search Cost (GPU Days) Search Method
EfficientNet-B0 [12] 88.1 400 - manual
MobileNetV2 [45] 80.8 300 - manual
PC-DARTS [46] 83.1 - 0.3 GD
MixNet-M [20] 87.1† 359 - GD
FairNAS-A [52] 87.3 391 12 EA
FairNAS-B [52] 87 348 12 EA
FairNAS-C [52] 86.7 324 12 EA
NSGA-Net [28] 85.58 817 4 EA
ZenNet [53] 84.4 487 0.5 EA
MUXNet-M [54] 86.11 400 11 EA
NASNet-A Large [6] 86.7 12031 1800 RL
NASNet-A mobile [6] 83.9 600 1800 RL
SMEMNAS(w/o) 86.65 348 1.25 EA
SMEMNAS(transfer) 87.18 305 1.25 EA
SMEMNAS-S(ours) 86.71 197 1.25 EA
SMEMNAS-M(ours) 87.25 250 1.25 EA
SMEMNAS-L(ours) 87.95 755 1.25 EA

IV. EXPERIMENTS

In this section, we conduct a series of experiments to validate the effectiveness of our proposed method SMEM-NAS. Specifically, we firstly show our experimental configurations in Section IV-A. Next, Section IV-B presents and analyses our experimental results on benchmark datasets. Then we evaluate the effectiveness of the multi-population mechanism in Section IV-C. The comparison with other algorithms in terms of accuracy, complexity, and search cost is also discussed. Finally, Section IV-D shows some ablation experiments for the surrogate models and the initial population.

A. Experimental Configurations

Regarding the benchmark datasets used in the experiments, we use the CIFAR-10, CIFAR-100 and ImageNet datasets, consistent with existing ENAS works. In this study, the process of searching for architectures is performed separately on these three datasets. The performance of the final searched CNN architectures is also evaluated on the three datasets. We evaluate the effectiveness of different architectures in terms of both classification accuracy and computational complexity.

For each dataset, we start with 100 architectures randomly sampled from the search space as the initial population. Then the evolutionary search is performed, running for a total of 25 iterations. Except for the initial population, for each subsequent iteration, we retain and evaluate eight architectures from the generated offspring. They are used for the training of the surrogate model and for the next iteration. So, in the whole search process, we evaluate 300 architectures in total, which are also used to train the surrogate model. Because OnceForAll is based on the ImageNet dataset, we fine-tune the weights inherited from the supernet for five epochs when searching on CIFAR-10 and CIFAR-100. All our experiments are performed on a single Nvidia RTX 3090 GPU card. Finally, we select three promising architectures from the obtained non-dominated solutions and then train them for 200 epochs by the standard SGD optimizer with momentum, where the initial learning rate, the momentum rate and the batch size are set to 0.01, 0.9 and 128.

B. Results on Standard Datasets

In this section, we show the results of our method on the benchmark datasets to validate its effectiveness. We conduct the search and train the final architectures separately on the three datasets, which is different from existing NAS methods that mostly use a transfer learning setup. We think the transferred architectures may not be optimal on another dataset. Then, in order to validate the effectiveness of the proposed method, we compare the non-dominated architectures obtained by SMEM-NAS with those obtained by other state-of-the-art NAS methods. The selected peer methods can be broadly divided into three categories: manually designed by human experts, EA-based, and non-EA-based (e.g., RL and GD). In addition, we give the results searched by SMEM-NAS without the multi-population mechanism.

Results on CIFAR-10: After completing the search using our method, we selected three promising architectures, and for ease of representation we name them SMEMNAS-S/M/L (sorted by MAdds). Fig. 5 and Table I show the results of SMEM-NAS and the comparison with other methods on CIFAR-10.

Fig. 5: Accuracy and number of multi-adds in millions on CIFAR-10. Models from multi-objective approaches are joined with lines. SMEM-NAS finds more accurate solutions with lower complexity.

Fig. 5 visualizes the dominance relationship between the network models in terms of accuracy and model complexity. It is obvious that our models always outperform most models or are comparable to them. Firstly, from the perspective of time, our method takes 1.25 GPU days and significantly reduces the search time compared with methods based on EA and RL. The surrogate model we used effectively accelerates the search process.
TABLE III: Comparison with state-of-the-art image classifiers on the ImageNet dataset. The search cost excludes the supernet
training cost.
Architecture Top-1 Acc (%) Top-5 Acc (%) MAdds (M) Search Cost (GPU Days) Search Method
EfficientNet-B0 [12] 76.3 93.2 390 - manual
EfficientNet-B1 [12] 78.8 94.4 700 - manual
MobileNetV2 [45] 72.0 91.0 300 - manual
PC-DARTS [46] 75.8 92.7 597 3.8 GD
P-DARTS [47] 75.6 92.6 557 0.3 GD
SNAS [55] 72.7 90.8 522 1.5 GD
β-DARTS [56] 76.1 93.0 609 0.4 GD
NAP [57] 75.5 92.6 574 4 GD
Shapely-NAS [58] 76.1 - 582 4.2 GD
OnceForAll [43] 76.9 93.2 230 2 EA
FairNAS [52] 77.5 - 392 12 EA
NSGANetV1 [15] 76.9 93.0 292.5 12 EA
CARS [35] 75.2 92.5 591 0.4 EA
MUXNet [54] 76.6 93.2 318 11 EA
MixPath [59] 77.2 93.5 378 10.3 EA
SLE-NAS-B [60] 75.7 92.5 412 3 EA
NASNet-A [6] 74.0 91.6 564 1800 RL
MnasNet [19] 76.13 92.85 391 - RL
SMEMNAS-S(ours) 77.75 93.75 360 0.17 EA
SMEMNAS-M(ours) 78.16 93.92 468 0.17 EA
SMEMNAS-L(ours) 78.91 94.23 570 0.17 EA

Compared with the GD-based methods, although there are some deficiencies in time, the architectures we searched have better performance. Our model achieves an accuracy of 98.13%, surpassing almost all peer competitors. Among them, our model is slightly lower than FairNAS-A, but has nearly 100M fewer MAdds, and the search time is much less than that of FairNAS. In general, SMEM-NAS is superior to or consistent with other models on different metrics.

Results on CIFAR-100: Similar to the settings on CIFAR-10, we also conducted experiments on CIFAR-100, and Fig. 6 and Table II show the results of our models compared to networks from other works. We also provide three models on this dataset, large, medium and small, of which the highest accuracy is 87.95%. From Fig. 6, it is clear that our architectures dominate most methods in terms of accuracy and complexity. Table II shows more details. On this dataset, although our results are slightly inferior to EfficientNet, that hand-crafted model requires significant expertise and labour. Additionally, we can see that the architecture with the highest accuracy searched on the CIFAR-10 dataset and transferred to CIFAR-100 does not perform optimally, achieving 87.16% accuracy.

Fig. 6: Accuracy and number of multi-adds in millions on CIFAR-100. Models from multi-objective approaches are joined with lines.

Results on ImageNet: Different from the conclusion on the CIFAR datasets, there is no need for pre-training before searching on the ImageNet dataset. Table III shows the results of the proposed method on the ImageNet dataset. SMEM-NAS outperforms other methods and achieves the highest accuracy of 78.91% with the MAdds of 570M. In addition, our proposed surrogate model effectively reduces the search time, and it only takes 4 hours to complete the search.

In summary, the results on the three benchmark datasets validate the effectiveness of the proposed method in terms of both architectural performance and time consumption.

C. Performance of Multi-population Mechanism

The multi-population mechanism is an important strategy to increase the diversity of the solutions while ensuring convergence in this study. In order to verify its effectiveness, further experiments are conducted on the CIFAR-100 dataset in this section. We separately conduct the architecture search with and without the multi-population mechanism, and the comparison between the two search processes is shown in Fig. 7. The figures show the changes of the pareto front and the distribution of new individuals generated during the evolutionary process. Fig. 7a shows the search process without the multi-population mechanism and Fig. 7b shows the search process using the multi-population mechanism. Fig. 7 shows the distribution of the architectures searched in the archive, which have been realistically evaluated. Specifically, in both figures, the blue dots represent individuals at the initial and early stages of the evolutionary process, and the corresponding blue line is the pareto front at the early stage. In Fig. 7a, it is obvious that, except for the randomly initialized individuals, the new individuals generated in the later evolution are almost all distributed in the same solution region, and there is no obvious change in the pareto front. However, in Fig. 7b, we can see that the new individuals generated are uniformly distributed, unlike without the multi-population mechanism. The pareto front is also further updated in the desirable direction (lower error rate, smaller model size) and more promising individuals are explored.
Fig. 7: The comparison between SMEM-NAS with and without the multi-population mechanism (MP) on CIFAR-100. We plot the distribution of architectures in the archive during the search process in two figures. The blue dots represent all the candidate architectures searched in the 5th iteration, and the red dots represent the new candidates generated till the 20th iteration. In addition, the blue and red lines represent the pareto front of the 5th iteration and 20th iteration respectively. (a) Architectures distribution without MP. (b) Architectures distribution with MP.

Evidently, the multi-population mechanism is better able to ensure the diversity of solutions and prevent them from falling into local optima. Additionally, we show the changes of the pareto front during the search process using our method in Fig. 8. As we can see, the pareto front has stabilized in the early stage of the search, and is constantly changing and updating with the increase of iterations. In general, that is the same as our expected effect, accelerating the convergence speed while maintaining the diversity.

Fig. 8: Pareto Front Update: Changes of the pareto front during the search process on CIFAR-100.

D. Ablation Study

Comparison of classification models: In order to construct more accurate surrogate models, we consider four classification models: SVM, random forest (RF), k-nearest neighbors (KNN), and multi-layer perceptron (MLP). We perform four groups of experiments on the CIFAR-100 dataset. For more intuitive comparisons, we compute the correlation of 300 fully trained and evaluated architectures during the search process with each surrogate model. We choose the Kendall's Tau coefficient (KTau) to show the correlation between their true rankings and predicted rankings. Fig. 9 shows the searched results using different models and the corresponding KTau. Architectures existing in the archive are used to train the surrogate models. After that, the K new architectures retained in each generation are used for testing. The figure shows the results of the last generation. From the KTau and the distribution of individuals in the search process, it can be seen that SVM is the best classifier with very limited training samples. The new individuals generated during the evolution when using KNN and RF as surrogate models are worse and non-uniformly distributed, which may be caused by inaccurate predictions at the early stage. In addition, the performance of MLP is similar to that of SVM, as shown in the figure. But SVM takes less time for training and achieves better results than MLP with the limited samples. Therefore, we choose SVM as the surrogate model.

Efficiency of the surrogate model: According to our experimental setup (an initial population with 100 individuals, evolving for 25 generations), approximately thousands of offspring would be generated during the whole search process. If all these architectures were truly evaluated, it might take more than 10 GPU days. By employing the proposed surrogate model, the search time is significantly reduced.

Compared to traditional surrogate models, we conduct data augmentation, expanding from n samples to n(n−1)/2, which greatly increases the training data. Increased training data can effectively improve the reliability of the surrogate model. Table IV shows the performance comparison of the proposed surrogate model with the traditional regression surrogate models. We use 300 samples for training the models, and in each experiment each surrogate model is trained with the same data. We train three models with the proposed classification-based method (SVM, RF, and KNN), and we train another three models with the traditional regression-based method (MLP, Decision Tree, and AdaBoost).
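For reference, the rank-correlation measures used in this ablation (and reported in Fig. 9 and Table IV) can be computed directly with SciPy, as in the following minimal sketch; the accuracy and score values below are illustrative only and are not data from our experiments.

```python
# Sketch of the rank-correlation metrics: the surrogate's predicted ranking of
# fully evaluated architectures is compared against their true accuracy ranking.
from scipy.stats import kendalltau, spearmanr

true_accuracy   = [74.1, 75.6, 73.2, 76.0, 75.1, 72.8]
predicted_score = [0.40, 0.71, 0.22, 0.90, 0.64, 0.30]   # e.g. pairwise "win" scores

ktau, _ = kendalltau(true_accuracy, predicted_score)
rho, _ = spearmanr(true_accuracy, predicted_score)
print(f"Kendall's Tau = {ktau:.4f}, Spearman's Rho = {rho:.4f}")
```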

Fig. 9: Search with different surrogate models. The top row shows the results of the search using four different models. The horizontal and vertical axes are error and MAdds, respectively. The bottom row shows the correlation between the predicted accuracy ranking and the true ranking using KTau. (a) SVM (KTau = 0.93). (b) MLP (KTau = 0.91). (c) KNN (KTau = 0.88). (d) RF (KTau = 0.85).

Table IV highlights that while the training and prediction time of the proposed surrogate model surpasses that of traditional models, classification-based surrogate models demonstrate superior performance in terms of correlation coefficients. Among these, our SVM achieves the highest KTau coefficient at 0.8113. Therefore, the proposed surrogate model can provide more reliable prediction results and is more capable of retaining and choosing the excellent individuals during the search. Moreover, although it takes more time to train the proposed surrogate model, the training and prediction only take a few more minutes during the actual search, which is acceptable compared to the whole search time.

TABLE IV: Comparison with different surrogate models. We record the Spearman coefficients and KTau coefficients between the predicted accuracy rankings and the true accuracy rankings. The results are the average of ten experiments.
Type Model Training time (s) Prediction time (s) Kendall's Tau Spearman's Rho
Classification SVM (ours) 48.822 0.452 0.8113 0.9494
Classification RF 9.679 0.443 0.7145 0.8731
Classification KNN 62.763 0.425 0.6243 0.8161
Regression MLP 58.785 0.0356 0.5417 0.6319
Regression DecisionTree 0.441 0.0396 0.5605 0.7590
Regression AdaBoost 0.225 0.0347 0.6998 0.8753

TABLE V: Comparison with different sizes of initial samples to train the surrogate model.
Kendall's Tau Spearman's Rho Initial size
0.4465 0.6154 25
0.5846 0.7485 50
0.6955 0.8575 100
0.7472 0.9123 150

Hyper-parameters: The initial population is important for both the training of the surrogate model and the subsequent evolution. In order to achieve a better balance between search efficiency and search quality, we test the appropriate value of the initial population size, N, by conducting the following small-scale experiments on CIFAR-10. We set different sizes of the initial population, i.e., N = 25, 50, 100 and 150. We record the effect of different population sizes on the hypervolume (HV) value, and the comparison results are shown in Fig. 10. From the figure, it can be seen that HV achieves the highest value when the population size is N = 100. Although the HV values obtained at N = 150 and at N = 100 are almost equivalent, 50% more individuals need to be evaluated when N = 150, which is more time-consuming. In addition, we test the prediction performance of the initial surrogate model as well. As seen in Table V, the reliability of the surrogate model is low if the initial population size is too small, which is another reason for the unfavourable search results at N = 25 and 50. Besides, if the surrogate model has low reliability, the search results are prone to be less stable. Therefore, after comprehensive consideration, we choose N = 100 as the size of the initial population.

Fig. 10: The hypervolume values achieved with different sizes of initial samples.
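For completeness, a compact sketch of the two-objective hypervolume indicator used in Fig. 10 is given below. The reference point and the sample front are our own illustrative choices; we also assume, as is common practice, that objectives are normalized before HV values such as those in Fig. 10 are reported, which is not a detail stated explicitly in the paper.

```python
# 2-D hypervolume (HV): for a set of (f1, f2) points, both minimized, HV is the
# area dominated by the non-dominated front up to a reference point.

def hypervolume_2d(points, ref):
    """points: iterable of (f1, f2) to be minimized; ref: reference point."""
    # Keep only points inside the reference box, sorted by f1 ascending.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                      # skip points dominated by earlier ones
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

# Example with error rate (%) and MAdds (M) as the two objectives.
front = [(22.0, 150), (21.0, 250), (20.5, 400)]
print(hypervolume_2d(front, ref=(30.0, 500)))  # 3100.0
```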

neural architecture search. The proposed method utilizes a [12] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for con-
surrogate model to predict accuracy rankings of architectures volutional neural networks,” in Proceedings of the 36th International
Conference on Machine Learning, pp. 6105–6114, 2019.
without focusing on accuracy values. The proposed surrogate [13] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural
model makes full use of the limited training samples and architecture search via parameters sharing,” in Proceedings of the 35th
can greatly reduce the search overhead. Besides, a multi- International Conference on Machine Learning, pp. 4095–4104, 2018.
[14] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated
population mechanism is used to improve the diversity in CNN architecture design based on blocks,” IEEE Transactions on Neural
multi-objective NAS, which not only increases the population Networks and Learning Systems, vol. 31, no. 4, pp. 1242–1254, 2020.
diversity in the evolution process, but also accelerates the [15] Z. Lu, I. Whalen, Y. Dhebar, K. Deb, E. D. Goodman, W. Banzhaf,
and V. N. Boddeti, “Multiobjective evolutionary design of deep convo-
convergence of the algorithm. We have conducted a series of lutional neural networks for image classification,” IEEE Transactions on
experiments on benchmark datasets to verify the effectiveness Evolutionary Computation, vol. 25, no. 2, pp. 277–291, 2021.
of SMEM-NAS. The final architectures found by SMEM-NAS [16] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu, “Neural architecture
optimization,” in Advances in Neural Information Processing Systems,
outperform several existing search methods. vol. 31, 2018.
In this work, our surrogate model only considers the accuracy and ignores other indicators of networks. Single-objective surrogate models are not effectively adapted to multi-objective search frameworks. Our next work plans to explore a multi-objective surrogate model to predict domination relations between architectures. Besides, it can be observed that the initial population is vital to the results of evolutionary algorithms. The proposed method simply uses random sampling to initialize the population, and the sampled architectures are non-uniformly distributed in the objective space, which may not be a proper approach for subsequent evolutionary exploration. Therefore, we intend to investigate efficient sampling methods to initialize the population by considering a given problem's features.