2023 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17–20, 2023, ROME, ITALY
COMPRESSING WAV2VEC2 FOR EMBEDDED APPLICATIONS
Oswaldo Ludwig, Tom Claes
Cerence Inc., Guldensporenpark 32, 9820 Merelbeke, Belgium.
ABSTRACT
Wav2vec2 self-supervised multilingual training learns speech
units common to multiple languages, leading to better generalization capacity. However, Wav2vec2 is larger than other
E2E ASR models such as the Conformer ASR. Therefore,
the objective of this work is to reduce the Wav2vec2 footprint by pruning lines from the intermediate dense layers of the encoder block, since they represent about two-thirds of the encoder parameters. We apply Genetic Algorithms (GA) to solve the combinatorial optimization problem associated with pruning, which requires running many copies of the Wav2vec2 decoder in parallel using multiprocessing on a computer grid; an effort was therefore made to optimize the GA for good performance with few CPUs. The experiments show a small absolute word error rate degradation of 0.21% (1.26% relative) for
a pruning of 40% and compare this value with those of the
usual L1-norm pruning and model restructuring by singular
value decomposition.
Index Terms— ASR, Wav2vec, GA, SVD
1. INTRODUCTION
In recent years, semi-supervised learning [1] has been used
to improve end-to-end automatic speech recognition (ASR),
which now easily outperforms hybrid models on benchmark datasets in many languages in terms of WER
[2]. Multilingual models like Wav2vec2 XLSR53 [3] are
pre-trained in many languages in a self-supervised manner,
making them a good choice for low-resource languages due
to their improved generalization capability [4].
The main disadvantage of Wav2vec2 XLSR53 is the model's footprint: it has 317M trainable parameters, versus 120M for Conformer-CTC large [5] and 30M for its medium version, for example. This implies higher latency. Our experiments using beam decoding with an external language model (LM) showed about 3 times more latency for Wav2vec2 compared to Conformer-CTC large.1
In this paper, we show how evolutionary computation can help improve structured pruning in order to considerably reduce the Wav2vec2 XLSR53 footprint while maintaining its original performance and generalization capacity.

This work has been supported by Cerence Inc.
1 https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html
The paper is organized as follows. Section 2 briefly reviews the state of the art in compressing neural models, and Section 3 justifies and explains the proposed evolutionary structured pruning, detailing the algorithm. Experimental results
are presented in Section 4, and conclusions in Section 5.
2. RELATED WORK
There are four main ways to compress models: quantization [6] (which can be applied on top of the other compression techniques to further improve compression), knowledge distillation [7], model restructuring [8], and pruning.
Knowledge distillation (KD) is a method that trains a
smaller model, called a student, using outputs of one or more
larger pre-trained models, called teachers. The usual technique is to use the logits in the final layer of the teacher
model as a target for the student model, as this target is more
informative than the usual one-hot vector; however, multiple
intermediate results from the teacher model can also be used
in a composite loss function [9]. Model restructuring applies
linear transformations, such as singular value decomposition
(SVD), to the weight matrices and then restructures the model
to take advantage of the inherent sparseness of the original
matrices.
The work [7] applies KD to Wav2vec with a good compression ratio of 4.8 times, but the authors report a WER 3.62 times higher than that of the original model, which is too high
for our applications. KD was our first attempt to compress
Wav2vec2-xlsr-53, but preliminary results were not promising using only in-house training data. Our interpretation
for this poor result was that running the teacher model only
on our fine-tuning data is not sufficient to properly transfer
to the student model the multilingual acoustic knowledge
encoded in the teacher model during self-supervised pretraining. Furthermore, the computational cost is much higher
than restructuring or pruning the model, which is relevant in
the case of a portfolio of dozens of languages.
Pruning identifies and removes redundant or less important weights and/or components and mainly falls into two
categories: unstructured and structured. Unstructured pruning methods include variants such as magnitude weight prun-
ing [10], which simply removes weights close to zero, and
movement-based pruning [11], which removes weights that
move towards zero during fine-tuning. Since unstructured
pruning considers each weight individually, the set of pruned
weights can be irregular across the model blocks, with negligible improvement in runtime memory and speed.
Structured pruning focuses on pruning entire structured blocks of weights, such as fully connected layer lines, attention heads, or even an entire Transformer layer, which can be removed without seriously degrading the final performance thanks to the skip connections of the residual architecture.
It has been empirically shown that high accuracy is possible with only 1 or 2 attention heads per encoder unit, even
when the encoder has 16 attention heads [12]. In fact, a
learned Transformer often has a lot of redundancy [13]. The work [14] found that while the front and back layers of a BERT model [15] play clear roles in extracting either low-level or task-specific linguistic knowledge, the roles of the middle layers are less important.
3. EVOLUTIONARY STRUCTURED PRUNING
Most structured pruning methods analyze the importance of
structures in isolation, such as magnitude weight pruning
[10]. However, pruning is a combinatorial optimization problem, as the impact of pruning on a weight or structure is a
function of other pruned weights (or structures). Therefore,
the research hypothesis of this work is that structured pruning
can be better when performed with an efficient algorithm to
solve the associated combinatorial optimization problem.
Our work addresses structured pruning through evolutionary computation [16], i.e., genetic algorithms (GA) [17]. The proposed method requires only Wav2vec2 decoding, without any gradient-based training; therefore, no GPU is needed. It runs on CPUs with full parallelism: each GA individual is evaluated on a different CPU, and the results are centralized on a single CPU, where the genetic operators are applied to compose the next generation.
The Wav2vec2 XLSR53 encoder is a pipeline of 24 blocks, each modeled by Equations (1) to (5). Each block contains 16 attention heads, Equations (1) to (3), emitting a 1024-dimensional tensor $h_{j-1}$, which is the input of a layer normalization [18] operation, Equation (4), whose output $h_j$ passes through a pair of fully connected (FC) layers, Equation (5): the first FC layer maps this 1024-dimensional input to a 4096-dimensional inner space, and the second maps back to a 1024-dimensional tensor, which is the input to another layer normalization operation.
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_H}}\right) \cdot V \qquad (1)$$
$$\mathrm{head}_i = \mathrm{Att}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right) \qquad (2)$$
$$h_{j-1} = \mathrm{MHA}(Q, K, V) = \left[\mathrm{head}_1, \ldots, \mathrm{head}_N\right] W^{O} \qquad (3)$$
$$h_j = \frac{h_{j-1} - \mathrm{E}\left[h_{j-1}\right]}{\sqrt{\mathrm{Var}\left[h_{j-1}\right] + \epsilon}} \odot \gamma + \beta \qquad (4)$$
$$h_{j+1} = \varphi\!\left(W_1 h_j + b_1\right)^{T} W_2 + b_2 \qquad (5)$$
where $Q = K = V$ represent the query, key, and value tensors that encode the outputs of the previous layer (a sequence of 1D tensors), $d_H$ is the dimension of the hidden representations, $[\cdot, \cdot]$ is the concatenation operator, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ for $i = 1 \ldots N$, $W^{O}$, $\gamma$, $\beta$, $W_1$, $W_2$, $b_1$, and $b_2$ are learnable affine transform parameters, $\varphi$ is the GELU activation, and $N = 16$ is the number of heads.
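For concreteness, the FC pair of Equation (5) can be sketched in PyTorch as below. This is a minimal illustration only, not the authors' code; the class name and the Hugging Face-style layout (GELU between two Linear layers, with the surrounding layer normalizations omitted) are our assumptions.

```python
import torch.nn as nn

class IntermediateFC(nn.Module):
    """Sketch of the FC pair in Equation (5): 1024 -> 4096 -> 1024."""
    def __init__(self, d_model=1024, d_inner=4096):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_inner)   # W1, b1: map to the 4096-dim inner space
        self.fc2 = nn.Linear(d_inner, d_model)   # W2, b2: map back to 1024 dimensions
        self.act = nn.GELU()                     # the phi activation

    def forward(self, h):
        # h: (..., d_model); the surrounding layer normalizations are omitted
        return self.fc2(self.act(self.fc1(h)))
```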
This configuration results in 201.4M parameters in the intermediate FC layers of Equation (5), plus 100.8M in the attention heads and normalization layers. We therefore prune lines of $W_1$ and $W_2$ and the corresponding positions of $b_1$, as these tensors represent about two-thirds of the encoder parameters; pruning them results in a smaller inner dimension.
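To make the structural effect concrete, a minimal sketch of this line pruning is given below, assuming the PyTorch nn.Linear weight layout (out_features, in_features); the function name and this layout assumption are ours, not the authors' implementation.

```python
import torch

def prune_inner_dimension(fc1_weight, fc1_bias, fc2_weight, keep_idx):
    """Keep only the inner-dimension units listed in keep_idx.

    Assumed shapes (PyTorch nn.Linear layout): fc1_weight (4096, 1024),
    fc1_bias (4096,), fc2_weight (1024, 4096).
    """
    keep_idx = torch.as_tensor(keep_idx, dtype=torch.long)
    w1 = fc1_weight[keep_idx, :]   # kept lines of W1
    b1 = fc1_bias[keep_idx]        # matching positions of b1
    w2 = fc2_weight[:, keep_idx]   # matching lines of W2 (columns in this layout)
    return w1, b1, w2
```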
In a simple exhaustive search, the number of combinations to check would be given by the binomial coefficient $\binom{L_{tot}}{L_{pr}}$, where $L_{tot} = 196608$ is the total number of lines in the 24 pairs of FC layers and $L_{pr}$ is the number of pruned lines. This factorial growth makes it difficult to prune individual lines, so we group the lines into $n_{str}$ blocks of consecutive lines and prune $n_{pr}$ blocks. The total number of blocks per FC layer, $n_{str}$, is hereafter called the pruning granularity.
Running the GA with multilingual data would be very computationally intensive, so the idea is to create a pruned seed model for each language for further fine-tuning with our data.
There are three main possibilities for the pruned model architecture, with an increasing degree of performance impact:
1. Pruning a fixed number of line blocks along the model
without any restriction on the number of line blocks
per layer, resulting in an asymmetrical encoder architecture. This approach can lead to better performance,
as previous works, like [14], report that the intermediate layers are less important for Transformer-based encoders, i.e. they can allow for more aggressive pruning.
2. Pruning the same number of line blocks per layer, generating a symmetrical architecture.
3. Pruning entire encoder layers. This is possible due to
the skip connections of the residual encoder architecture. It has been observed that the nonlinear components of some layers have a contribution close to zero
after training, so the layer approaches an identity. This
approach results in the easiest combinatorial problem,
but a greater performance impact is expected.
Symmetric pruning is easier to implement using the Hugging Face framework2 as a starting point. In this case, the number of combinations is given by:
$$\left(\frac{n_{str}!}{n_{pr}!\,(n_{str} - n_{pr})!}\right)^{n_{lay}} \qquad (6)$$
2 https://huggingface.co/docs/transformers/index
where $n_{lay}$ is 24 for Wav2vec2 XLSR53, while the number of combinations for asymmetric pruning is:
$$\frac{(n_{str} \times n_{lay})!}{(n_{pr} \times n_{lay})!\,\left(n_{lay}(n_{str} - n_{pr})\right)!} \qquad (7)$$
Asymmetric pruning is also a more difficult combinatorial problem, as can be seen by substituting the values of $n_{str}$, $n_{pr}$, and $n_{lay}$ used in this work ($n_{lay} = 24$, $n_{str} = 32$, $n_{pr} = 20$) into (6) and (7). Therefore, we adopted symmetric pruning in this work.
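The gap between (6) and (7) can be checked with a few lines of Python; the values below are those used in this work, and the script is only illustrative.

```python
from math import comb, factorial

n_lay, n_str, n_pr = 24, 32, 20   # values used in this work

# Symmetric pruning, Equation (6): choose n_pr of n_str blocks in each layer
sym = comb(n_str, n_pr) ** n_lay

# Asymmetric pruning, Equation (7): choose blocks freely across the whole encoder
asym = factorial(n_str * n_lay) // (
    factorial(n_pr * n_lay) * factorial(n_lay * (n_str - n_pr)))

print(len(str(sym)), "digits (symmetric search space)")    # on the order of 10^200
print(len(str(asym)), "digits (asymmetric search space)")  # even larger
```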
Algorithm 1 describes the main structure of the GA code, while Algorithm 2 details how a new chromosome is assembled under the constraint of not repeating line block indexes within an FC layer, which may require the application of the mutation operator (see Line 15 of Algorithm 2). The indexes of the pruned line blocks belonging to an FC layer are encoded in an $n_{pr}$-dimensional vector. We build a chromosome vector by concatenating $n_{lay}$ of these vectors (see Line 3 of Algorithm 1 for a more detailed description).
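A minimal sketch of this chromosome encoding is given below, assuming NumPy; the helper name is ours.

```python
import numpy as np

def random_chromosome(n_lay, n_str, n_pr, rng):
    """One chromosome: for each of the n_lay layers, n_pr distinct block
    indexes drawn from {0, ..., n_str - 1}, concatenated into one vector."""
    return np.concatenate(
        [rng.choice(n_str, size=n_pr, replace=False) for _ in range(n_lay)])

# Example: a chromosome with 24 x 20 = 480 genes
chrom = random_chromosome(24, 32, 20, np.random.default_rng(0))
```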
There are other particularities of our algorithm:
1. We encode the L1-norm solution in a chromosome and seed it into the initial population, to speed up GA convergence (see Line 4 of Algorithm 1).
2. We apply elitism [19] only when the fitness degrades relative to the previous generation, to avoid attracting the whole population to the best individual and extinguishing diversity, which would result in early convergence to a local minimum (see Lines 18-23 of Algorithm 1).
3. We use a special distribution [17], see Equation (8), that allows controlling the selective pressure, which must be kept very low in this application, also to avoid early convergence, as we work with few CPUs.
$$\mathrm{round}\!\left((n_{pop} - 1)\,\frac{e^{p\vartheta} - 1}{e^{p} - 1}\right) + 1 \qquad (8)$$
where $n_{pop}$ is the number of GA individuals, $p$ is the selective pressure, and $\vartheta \in [0, 1)$ is a random number sampled from a uniform distribution.
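A sketch of how Equation (8) can be used to sample a parent index is shown below; the assumption that index 1 corresponds to the best-ranked individual is ours.

```python
import math
import random

def select_parent(n_pop, p, rng=random):
    """Sample a parent index using the distribution of Equation (8).

    Assumes the individuals are ranked by fitness (index 1 = best). A low
    selective pressure p keeps the distribution close to uniform, which
    helps avoid early convergence with small populations.
    """
    theta = rng.random()   # theta in [0, 1), uniform
    return round((n_pop - 1) * (math.exp(p * theta) - 1) / (math.exp(p) - 1)) + 1
```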
Once we have the optimal pruned model, we fine-tune it using the same training data $U_{tr}$ used for pruning. The better the pruning, the more information from the Wav2vec2 pre-training is kept, resulting in better generalization ability as well as easier fine-tuning, i.e., a better starting point.
4. EXPERIMENTS
This section reports our experimental results. The baseline
methods for our experiments are pruning of FC lines of Equation (5) based on the smallest L1-norm [20] and restructuring
the same equation by applying SVD. In the case of the usual
L1-norm pruning, we report results both for pruning blocks of lines and for pruning isolated lines.

Algorithm 1: Optimal pruning by GA
1: Input: $p$, $M$, $n_{lay}$, $n_{str}$, $n_{pr}$, $n_{pop}$, $U_{tr}$: selective pressure, W2V model, number of encoder layers, pruning granularity, number of pruned blocks per layer, number of GA individuals, and training data, respectively.
2: Output: $I^*$, $fitness^*$: indexes of pruned blocks and the respective fitness value.
3: Generate a set of $n_{pop}$ chromosomes $\{Cr\}$ for the initial population, in which genes encode the indexes of pruned blocks of FC layers. Each chromosome is composed by concatenating $n_{lay}$ vectors encoding $n_{pr}$ indexes in the set $\{i \in \mathbb{N} : 0 \le i < n_{str}\}$ without repetition.
4: Seed the solution from L1-norm pruning (encoded in a chromosome) into the initial population to accelerate GA convergence.
5: $\phi_{best} \leftarrow \infty$ (i.e., numpy.finfo).
6: $C_{best} \leftarrow$ any chromosome from $\{Cr\}$
7: for $generation = 1 : maxgener$ do
8:   Evaluate the population using multiprocessing on a computer grid:
9:   for $i = 1 : n_{pop}$ do
10:    Call the script of the modified W2V decoder in the background (i.e., in parallel) using the pruning configuration encoded in the $i$th chromosome $C_i$ and a randomly sampled subset of $U_{tr}$; the decoder script saves the resulting WER in a file $f_i$ corresponding to the $i$th chromosome.
11:   end for
12:   Search for the $f_i$ files.
13:   while any resulting file $f_i$ is missing do
14:     Wait $t$ seconds and search for the $f_i$ files again.
15:   end while
16:   Load all $f_i$ files, sort the results (WER) by chromosome index, and store them in the fitness vector $\phi \in \mathbb{R}^{n_{pop}}$.
17:   Rank individuals/chromosomes according to their fitness $\phi_i$.
18:   if $\min(\phi) < \phi_{best}$ then
19:     $\phi_{best} \leftarrow \min(\phi)$.
20:     $C_{best} \leftarrow \arg\min(\phi)$ (storing the best chromosome).
21:   else
22:     Seed $C_{best}$ into the current GA population $\{Cr\}$ (elitism).
23:   end if
24:   The special crossover:
25:   $Cr^{child} \leftarrow \{\}$
26:   for $k = 1 : n_{pop}$ do
27:     Randomly select the indexes of the parents using the asymmetric distribution proposed in [17]:
28:     $\vartheta_j \leftarrow$ random number $\in [0, 1)$ with uniform distribution, $j = 1, 2$
29:     $parent_j \leftarrow \mathrm{round}\!\left((n_{pop} - 1)\frac{e^{p\vartheta_j} - 1}{e^{p} - 1}\right) + 1$, $j = 1, 2$
30:     Run Algorithm 2 to assemble the child chromosome $C_k^{child}$.
31:     Add $C_k^{child}$ to $Cr^{child}$.
32:   end for
33:   $Cr \leftarrow Cr^{child}$ (update population)
34: end for
35: $I^* \leftarrow C_{best}$
36: $fitness^* \leftarrow \phi_{best}$
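A minimal sketch of the grid-evaluation step of Algorithm 1 (Lines 8-16) is given below. The script name decode_pruned_w2v.py, its flags, and the WER-file protocol are hypothetical placeholders, not the authors' actual interface.

```python
import os
import subprocess
import time
import numpy as np

def evaluate_population(chromosomes, data_list, out_dir, poll_secs=60):
    """Launch one decoding job per chromosome and collect the resulting WERs."""
    for i, chrom in enumerate(chromosomes):
        # Each job applies the pruning encoded in chromosome i and decodes a
        # random subset of the training data, writing its WER to wer_i.txt.
        subprocess.Popen([
            "python", "decode_pruned_w2v.py",
            "--pruning", ",".join(map(str, chrom)),
            "--data", data_list,
            "--out", os.path.join(out_dir, f"wer_{i}.txt"),
        ])

    # Poll until every individual has written its WER file (the fitness).
    fitness = np.full(len(chromosomes), np.inf)
    while np.isinf(fitness).any():
        time.sleep(poll_secs)
        for i in range(len(chromosomes)):
            path = os.path.join(out_dir, f"wer_{i}.txt")
            if np.isinf(fitness[i]) and os.path.exists(path):
                fitness[i] = float(open(path).read())
    return fitness
```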
For restructuring the model through SVD, we follow the previous work [8]. The weight tensors $W_n$, $n \in \{1, 2\}$, of Equation (5) can be decomposed as follows:
$$W_n^{1024 \times 4096} = U_n^{1024 \times 4096}\, \Sigma_n^{4096 \times 4096}\, \left(V_n^{4096 \times 4096}\right)^{T} \qquad (9)$$
where $\Sigma_n$ is a diagonal matrix with the 4096 singular values on its diagonal in decreasing order, while $U_n$ and $V_n$ are the matrices of left and right singular vectors.
To obtain a model compression of 40%, we keep only the 307 largest singular values by truncating $U_n$, $V_n$, and $\Sigma_n$, yielding the approximation:
$$W_n^{1024 \times 4096} \cong \tilde{U}_n^{1024 \times 307}\, \tilde{\Sigma}_n^{307 \times 307}\, \left(\tilde{V}_n^{4096 \times 307}\right)^{T} \qquad (10)$$
Therefore, Equation (5) can be approximated as follows:
$$h_{j+1} = \varphi\!\left(\tilde{U}_1 N_1 h_j + b_1\right)^{T} N_2^{T}\, \tilde{U}_2^{T} + b_2 \qquad (11)$$
where $N_n = \tilde{\Sigma}_n \tilde{V}_n^{T}$, $n \in \{1, 2\}$. This compression method is illustrated in Figure 1.
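The truncation of Equations (9)-(10) takes a few lines with SciPy; the following is a minimal sketch under our naming, not the authors' exact implementation.

```python
import numpy as np
from scipy import linalg

def restructure_fc(W, k=307):
    """Truncated-SVD restructuring of one FC weight matrix, Eqs. (9)-(10).

    W: (1024, 4096) weight matrix. Returns U_tilde (1024, k) and
    N = Sigma_tilde @ V_tilde^T (k, 4096), so that W ~= U_tilde @ N,
    the factorization used in Equation (11).
    """
    U, s, Vt = linalg.svd(W, full_matrices=False)  # singular values in decreasing order
    U_tilde = U[:, :k]
    N = np.diag(s[:k]) @ Vt[:k, :]
    return U_tilde, N
```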
Algorithm 2: Assembling the chromosome $C_k^{child}$
1: Input: $n_{lay}$, $n_{str}$, $n_{pr}$, $parent_1$, and $parent_2$: number of encoder layers, pruning granularity, number of pruned blocks per layer, and the parent chromosomes, respectively.
2: Output: $C_k^{child}$: child chromosome.
3: Assembling the chromosome $C_k^{child}$:
4: $l \leftarrow 0$
5: $C_k^{child} \leftarrow \{\}$
6: for $m = 1 : n_{lay}$ do
7:   $c_m \leftarrow \{\}$
8:   for $z = 1 : n_{pr}$ do
9:     Randomly select a parent from $parent_1$ and $parent_2$ and take its $l$th gene $g(l)$, also storing the other parent's gene as $\bar{g}(l)$.
10:    if gene $g(l) \notin c_m$ then
11:      Add $g(l)$ to the set $c_m$.
12:    else if the other parent's gene $\bar{g}(l) \notin c_m$ then
13:      Add $\bar{g}(l)$ to the set $c_m$.
14:    else
15:      Mutation: randomly sample an index $ind \in \{i \in \mathbb{N} : 0 \le i < n_{str} \text{ and } i \notin c_m\}$ and add it to $c_m$.
16:    end if
17:    $l \leftarrow l + 1$
18:  end for
19:  Add $c_m$ to the set $C_k^{child}$.
20: end for
21: Cast the set $C_k^{child}$ to a tensor.
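A Python sketch of Algorithm 2 follows; the coin flip used to pick the first parent and the sorting of each layer's genes are our implementation choices.

```python
import random

def assemble_child(parent1, parent2, n_lay, n_str, n_pr, rng=random):
    """Gene-wise crossover that never repeats a block index within the same
    FC layer, falling back to mutation when both parents' genes are taken."""
    child, l = [], 0
    for _ in range(n_lay):
        layer = set()
        for _ in range(n_pr):
            g, g_other = ((parent1[l], parent2[l]) if rng.random() < 0.5
                          else (parent2[l], parent1[l]))
            if g not in layer:
                layer.add(g)
            elif g_other not in layer:
                layer.add(g_other)
            else:
                # mutation: draw an unused block index for this layer
                layer.add(rng.choice([i for i in range(n_str) if i not in layer]))
            l += 1
        child.extend(sorted(layer))
    return child
```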
Fig. 1. Graphical representation of Equation (5) versus Equation (11).
In the case of GA pruning, the first hyper-parameter to
be investigated is the pruning granularity, i.e. the number
of line blocks per FC layer. The greater the granularity, the
greater the potential for good pruning, but the more difficult
the combinatorial optimization problem for the GA. Investigating the optimal granularity by running the GA multiple
times would be very expensive; therefore, we run the lighter
L1-norm pruning for different granularity values, plot the granularity × WER curve, and use it as a reference to choose the GA granularity, as shown in Figure 2.
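For reference, the L1-norm baseline can be sketched as below for a single FC layer; how the L1 criterion is aggregated over a block (here, the sum over a block of consecutive lines of W1) is our assumption for illustration.

```python
import numpy as np

def l1_block_scores(w1, n_str):
    """L1 norm of each block of consecutive lines (rows) of W1."""
    row_l1 = np.abs(w1).sum(axis=1)          # per-line L1 norm
    blocks = np.array_split(row_l1, n_str)   # n_str blocks of consecutive lines
    return np.array([b.sum() for b in blocks])

def l1_pruned_blocks(w1, n_str, n_pr):
    """Indexes of the n_pr blocks with the smallest L1 norm (to be pruned)."""
    return np.argsort(l1_block_scores(w1, n_str))[:n_pr]
```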
In Figure 2 we can see a large drop in WER up to a granularity of 32 and a much less pronounced improvement after this point.
Therefore, we adopted a granularity of 32 for the GA pruning, aiming at a viable combinatorial problem. The other GA
hyper-parameters are: selective pressure p = 3, number of
GA individuals npop = 62 and npr = 20. For all fine-tuning
sessions, we use the Adam optimizer with initial learning rate
of 1e-5, batch size 4, accumulating the loss gradient 8 times
before updating the model. We fine-tune for 9 epochs, by which point the validation loss curve is flat.

Fig. 2. Curve granularity × WER using L1-norm for a 25% pruning (message data).
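The fine-tuning recipe above can be sketched as follows; this is an illustrative loop assuming a Hugging Face-style model whose forward pass returns a .loss attribute, not the authors' training script.

```python
from torch.optim import Adam

def fine_tune(model, loader, epochs=9, lr=1e-5, accum_steps=8):
    """Adam with lr 1e-5, batch size 4 provided by `loader`, and the loss
    gradient accumulated over 8 steps before each model update."""
    optimizer = Adam(model.parameters(), lr=lr)
    optimizer.zero_grad()
    for _ in range(epochs):
        for step, batch in enumerate(loader):
            loss = model(**batch).loss / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```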
Given our company’s interest, rather than using publicly
available data [21], we use in-house training and test data consisting of actual noisy in-car voice commands with varying
degrees of SNR. The usual benchmark datasets do not represent our data and interests, as shown in the first row of Table 1, which reports a high WER on our test sets for a model pre-trained on Librispeech data. Our test sets cover in-car virtual assistant tasks, such as requesting music, sending SMS messages, GPS navigation to points of interest (POI), and phone dialing. The training set $U_{tr}$ has 5.9K hours of audio in the same domains.
We evaluated pruning about 40% of the model size, i.e., 125M trainable parameters. No language models are used
during decoding, i.e. no shallow fusion [22]; therefore, we
use a simple greedy search for both the baseline and pruned
models. Note that due to the conditional independence assumption of models based on CTC loss, like Wav2vec, decoding with a beamwidth greater than one doesn’t improve
WER when there is no external LM.
The baseline model is Wav2vec2 XLSR53 fine tuned on
our data. We prune the model and fine tune the pruned model
with the same in-house training data used for the baseline
model. The result of pruning with L1-norm and GA, as well
as model restructuring by SVD, can be seen in Figure 3. We
seed the L1-norm solution in the GA initial population to accelerate the convergence, as explained in Section 3.
Comparing the pruning methods, the relative WER reduction (WERR) of GA over
the L1-norm is 30.44%, which supports our research hypothesis that pruning results are better when approached with an
efficient algorithm to solve the associated combinatorial optimization problem.
Table 1 shows the WER and the relative WERR after fine-tuning without an external LM, using a greedy decoder.
SVD performs similarly to GA pruning in terms of WER after fine-tuning (17.06% vs 17.07%); however, the restructured model has a chain of weight matrices twice as long in the intermediate FC layers, as can be seen by comparing Equations (5) and (11). Although both techniques result in the same number of operations and parameters (3.1 million parameters in each intermediate FC block), the longer sequence of product operations of the SVD restructuring does not favor parallel computation, as the algorithm must wait for the result of the current product to execute the next one. Therefore, the pruned FC block is about 28% faster than the restructured FC block at runtime.4
Fig. 3. Homogeneous pruning and restructuring of W2V before fine-tuning.
Table 1. Results after fine-tuning without external LM.

                                   WER music   WER message   WER POI   WER dialing   average
W2V out-of-the-box                 32.25%      25.64%        32.25%    48.25%        34.60%
W2V fine tuned                     20.98%      15.00%        13.67%    17.78%        16.86%
pruning by GA + fine tune          22.01%      13.53%        14.57%    18.17%        17.07%
pruning by L1-norm + fine tune     24.97%      14.09%        15.62%    19.04%        18.43%
restructuring by SVD + fine tune   22.01%      13.86%        14.57%    17.78%        17.06%

WERR pruning by GA                 -4.91%      9.80%         -6.58%    -2.19%        -1.26%
WERR pruning by L1-norm            -19.02%     6.07%         -14.26%   -7.09%        -9.33%
WERR restructuring by SVD          -4.91%      7.60%         -6.58%    0.00%         -1.17%
5. CONCLUSIONS
We proposed a new neuroevolution-based method to solve the
combinatorial optimization problem associated with pruning
and compared it with the usual L1-norm pruning and SVDbased model restructuring. This method can be applied to any
pre-trained Transformer-based model to preserve as much information as possible from its pre-training. Here, the idea is
to preserve Wav2vec2’s generalization capacity, which is an
important feature for applications in an extensive language
portfolio that includes poorly resourced languages in the corpus linguistics sense.
The proposed GA-based pruning required 62 CPUs for
about a day to find the optimal pruning setup. The experimental results support our method, showing a small relative
damage of 1.26% for a pruning of about 40% of the model
parameters.
In general, our experiments indicate that focusing on the intermediate FC layers is a good way to achieve a high compression ratio with little impact on performance: an equally small relative damage of 1.17% was achieved by restructuring these layers by SVD, which takes just a few seconds to compute using a SciPy library. However, due to its more serial nature, the resulting restructured model tends not to be as computationally efficient at runtime as the pruned model, which allows for better parallelism.

4 This number refers to CPU runtime and depends on the adopted hardware and software ecosystem for multi-threaded inference.
6. REFERENCES

[1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020.

[2] Y. Zhang, J. Qin, D. S. Park, W. Han, C.-C. Chiu, R. Pang, Q. V. Le, and Y. Wu, “Pushing the limits of semi-supervised learning for automatic speech recognition,” arXiv preprint arXiv:2010.10504, 2020.

[3] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” arXiv preprint arXiv:2006.13979, 2020.

[4] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve et al., “Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training,” arXiv preprint arXiv:2104.01027, 2021.
[5] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang,
J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech
recognition,” arXiv preprint arXiv:2005.08100, 2020.
[6] Y. Boo and W. Sung, “Fixed-point optimization of transformer neural network,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2020, pp. 1753–1757.
[7] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, “Shrinking bigfoot: Reducing wav2vec 2.0 footprint,” arXiv preprint
arXiv:2103.15760, 2021.
[8] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural
network acoustic models with singular value decomposition.” in Interspeech, 2013, pp. 2365–2369.
[9] H.-J. Chang, S.-w. Yang, and H.-y. Lee, “Distilhubert:
Speech representation learning by layer-wise distillation
of hidden-unit bert,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 7087–7091.
[10] M. A. Gordon, K. Duh, and N. Andrews, “Compressing
bert: Studying the effects of weight pruning on transfer
learning,” arXiv preprint arXiv:2002.08307, 2020.
[11] V. Sanh, T. Wolf, and A. Rush, “Movement pruning:
Adaptive sparsity by fine-tuning,” Advances in Neural
Information Processing Systems, vol. 33, pp. 20 378–
20 389, 2020.
[12] P. Michel, O. Levy, and G. Neubig, “Are sixteen heads
really better than one?” Advances in neural information
processing systems, vol. 32, 2019.
[13] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang,
H. Sajjad, P. Nakov, D. Chen, and M. Winslett, “Compressing large-scale transformer-based models: A case
study on bert,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 2021.
[14] I. Tenney, D. Das, and E. Pavlick, “Bert rediscovers the classical nlp pipeline,” arXiv preprint
arXiv:1905.05950, 2019.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[16] J. Poyatos, D. Molina, A. Martinez, J. Del Ser, F. Herrera et al., “Evoprunedeeptl: An evolutionary pruning
model for transfer learning based deep neural networks,”
arXiv preprint arXiv:2202.03844, 2022.
[17] O. Ludwig, U. Nunes, R. Araújo, L. Schnitman, and
H. A. Lepikson, “Applications of information theory,
genetic algorithms, and neural models to predict oil
flow,” Communications in Nonlinear Science and Numerical Simulation, vol. 14, no. 7, pp. 2870–2885, 2009.
[18] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[19] C. W. Ahn and R. S. Ramakrishna, “Elitism-based compact genetic algorithms,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 4, pp. 367–385, 2003.
[20] A. Kumar, A. M. Shaikh, Y. Li, H. Bilal, and B. Yin,
“Pruning filters with l1-norm and capped l1-norm for
cnn compression,” Applied Intelligence, vol. 51, pp.
1152–1160, 2021.
[21] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai,
K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T.
Lin et al., “Superb: Speech processing universal performance benchmark,” arXiv preprint arXiv:2105.01051,
2021.
[22] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C.
Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,”
arXiv preprint arXiv:1503.03535, 2015.