COMPRESSING WAV2VEC2 FOR EMBEDDED APPLICATIONS

2023

https://doi.org/10.1109/MLSP55844.2023.10285964

2023 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 17–20, 2023, Rome, Italy

COMPRESSING WAV2VEC2 FOR EMBEDDED APPLICATIONS

Oswaldo Ludwig, Tom Claes
Cerence Inc., Guldensporenpark 32, 9820 Merelbeke, Belgium

ABSTRACT

Wav2vec2 self-supervised multilingual training learns speech units common to multiple languages, leading to better generalization capacity. However, Wav2vec2 is larger than other E2E ASR models such as the Conformer ASR. Therefore, the objective of this work is to reduce the Wav2vec2 footprint by pruning lines from the intermediate dense layers of the encoder block, since they represent about two thirds of the encoder parameters. We apply Genetic Algorithms (GA) to solve the combinatorial optimization problem associated with pruning, which means running many copies of the Wav2vec2 decoder in parallel using multiprocessing on a computer grid, so an effort was made to optimize the GA for good performance with few CPUs. The experiments show a small absolute word error rate damage of 0.21% (1.26% relative) for a pruning of 40% and compare this value with those of the usual L1-norm pruning and model restructuring by singular value decomposition.

Index Terms: ASR, Wav2vec, GA, SVD

1. INTRODUCTION

In recent years, semi-supervised learning [1] has been used to improve end-to-end automatic speech recognition (ASR), which currently easily outperforms hybrid models in terms of WER on benchmark datasets in many languages [2]. Multilingual models like Wav2vec2 XLSR53 [3] are pre-trained on many languages in a self-supervised manner, making them a good choice for low-resource languages due to their improved generalization capability [4]. The main disadvantage of Wav2vec2 XLSR53 is the model's footprint: it has 317M trainable parameters versus 120M for Conformer-CTC large [5] and 30M for its medium version, for example. This implies higher latency. Our experiments using beam decoding with an external language model (LM) showed about 3 times higher latency for Wav2vec2 compared to Conformer-CTC large (see https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html).

In this paper, we show how evolutionary computation can help improve structured pruning in order to considerably reduce the Wav2vec2 XLSR53 footprint while maintaining its original performance and generalization capacity. (This work has been supported by Cerence Inc.)

The paper is organized as follows. Section 2 briefly reports the state of the art in compressing neural models, Section 3 justifies and explains the proposed evolutionary structured pruning, detailing the algorithm. Experimental results are presented in Section 4, and conclusions in Section 5.

2. RELATED WORK

There are four main ways to compress models: quantization [6], which can be applied on top of other compression techniques to further improve the compression, knowledge distillation [7], model restructuring [8] and pruning.

Knowledge distillation (KD) is a method that trains a smaller model, called a student, using the outputs of one or more larger pre-trained models, called teachers. The usual technique is to use the logits in the final layer of the teacher model as a target for the student model, as this target is more informative than the usual one-hot vector; however, multiple intermediate results from the teacher model can also be used in a composite loss function [9].
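For readers less familiar with the mechanics of logit-based distillation, a minimal, generic sketch of such a soft-target loss is given below; it is an illustration under common KD conventions (temperature T, mixing weight alpha), not the setup evaluated in this paper or in [7].

```python
# Generic sketch of a logit-distillation loss (soft targets from a teacher).
# T is the softmax temperature and alpha weights the distillation term
# against the usual supervised loss of the student (e.g. CTC for ASR).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, supervised_loss,
                      T: float = 2.0, alpha: float = 0.5):
    soft_targets = F.log_softmax(teacher_logits / T, dim=-1)   # teacher distribution
    soft_preds = F.log_softmax(student_logits / T, dim=-1)     # student distribution
    kd = F.kl_div(soft_preds, soft_targets, log_target=True,
                  reduction="batchmean") * (T * T)             # scale back the gradient
    return alpha * kd + (1.0 - alpha) * supervised_loss
```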
Model restructuring applies linear transformations, such as singular value decomposition (SVD), to the weight matrices and then restructures the model to take advantage of the inherent sparseness of the original matrices.

The work [7] applies KD to Wav2vec2 with a good compression ratio of 4.8 times, but the authors report a WER 3.62 times higher than the original model, which is too much for our applications. KD was our first attempt to compress Wav2vec2-xlsr-53, but preliminary results were not promising using only in-house training data. Our interpretation of this poor result is that running the teacher model only on our fine-tuning data is not sufficient to properly transfer to the student model the multilingual acoustic knowledge encoded in the teacher model during self-supervised pretraining. Furthermore, the computational cost is much higher than restructuring or pruning the model, which is relevant in the case of a portfolio of dozens of languages.

Pruning identifies and removes redundant or less important weights and/or components and mainly falls into two categories: unstructured and structured. Unstructured pruning methods include variants such as magnitude weight pruning [10], which simply removes weights close to zero, and movement-based pruning [11], which removes weights that move towards zero during fine-tuning. Since unstructured pruning considers each weight individually, the set of pruned weights can be irregular across the model blocks, with negligible improvement in runtime memory and speed. Structured pruning focuses on pruning entire structured blocks of weights, such as fully connected layer lines, attention heads or even an entire Transformer layer, which can be removed without seriously degrading the final performance thanks to the skip connections of the residual architecture. It has been empirically shown that high accuracy is possible with only 1 or 2 attention heads per encoder unit, even when the encoder has 16 attention heads [12]. In fact, a learned Transformer often has a lot of redundancy [13]. The work [14] found that while the front and back layers in a BERT model [15] play clear roles in extracting either low-level or task-specific linguistic knowledge, the roles of the middle layers are less important.

3. EVOLUTIONARY STRUCTURED PRUNING

Most structured pruning methods analyze the importance of structures in isolation, such as magnitude weight pruning [10]. However, pruning is a combinatorial optimization problem, as the impact of pruning a weight or structure is a function of the other pruned weights (or structures). Therefore, the research hypothesis of this work is that structured pruning can be better when performed with an efficient algorithm to solve the associated combinatorial optimization problem.

Our work addresses structured pruning through evolutionary computation [16], i.e. genetic algorithms (GA) [17]. The proposed method requires only Wav2vec2 decoding without gradient-based training; therefore, no GPU is needed. It runs on CPUs with full parallelism: each GA individual is evaluated on a different CPU, and the results are centralized on a single CPU, where the genetic operators are applied to compose the next generation.
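As a rough illustration of this evaluation scheme (a simplified sketch, not the authors' grid setup, which is formalized later in Algorithm 1), the fitness of a population could be computed in parallel as follows; decode_wer is a hypothetical placeholder for the pruned-model decoding run that returns a WER.

```python
# Minimal sketch of CPU-parallel fitness evaluation for a GA population.
from multiprocessing import Pool

def decode_wer(chromosome):
    # Placeholder for the real fitness function: apply the pruning configuration
    # encoded in `chromosome` to a copy of the Wav2vec2 model, run greedy decoding
    # on a sampled subset of the training data, and return the resulting WER.
    raise NotImplementedError

def evaluate_population(population, n_cpus):
    """Score every chromosome in parallel, one worker process per CPU."""
    with Pool(processes=n_cpus) as pool:
        # Results come back in the same order as the chromosomes.
        return pool.map(decode_wer, population)
```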
The Wav2vec2 XLSR53 encoder is a pipeline of 24 blocks, each modeled by Equations (1) to (5). Each block contains 16 attention heads, Equations (1) to (3), emitting a 1024-dimensional tensor h_{j-1}, which is the input of a layer normalization [18] operation, Equation (4), whose output h_j passes through a pair of fully-connected (FC) layers, Equation (5): the first FC layer maps this 1024-dimensional input to a 4096-dimensional inner space and the second maps back to a 1024-dimensional tensor, which is the input to another layer normalization operation.

\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_H}} \right) \cdot V   (1)

\mathrm{head}_i = \mathrm{Att}\left( Q W_i^Q, K W_i^K, V W_i^V \right)   (2)

h_{j-1} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \ldots, \mathrm{head}_N]\, W^O   (3)

h_j = \frac{h_{j-1} - \mathrm{E}[h_{j-1}]}{\sqrt{\mathrm{Var}[h_{j-1}] + \epsilon}} \odot \gamma + \beta   (4)

h_{j+1} = \varphi\left( W_1 h_j + b_1 \right)^T W_2 + b_2   (5)

where Q = K = V represent the query, key, and value tensors that encode the outputs of the previous layer (a sequence of 1D tensors), d_H is the dimension of the hidden representations, [·, ·] is the concatenation operator, W_i^Q, W_i^K, W_i^V for i = 1 ... N, W^O, γ, β, W_1, W_2, b_1 and b_2 are learnable affine transform parameters, φ is the GELU activation, and N = 16 is the number of heads.

This configuration results in 201.4M parameters in the intermediate FC layers of Equation (5), plus 100.8M in the attention heads and normalization layers. So we prune lines of W_1 and W_2 and the respective positions of b_1, as these tensors represent about two-thirds of the encoder parameters, resulting in a smaller inner dimension. In a simple exhaustive search, the number of combinations to check would be given by the binomial coefficient \binom{L_{tot}}{L_{pr}}, where L_tot = 196608 is the total number of lines in the 24 pairs of FC layers and L_pr is the number of pruned lines. This factorial growth makes it difficult to prune individual lines, so we group the lines into n_str blocks of consecutive lines and prune n_pr blocks. The total number of blocks per FC layer, n_str, is hereafter called the pruning granularity.

Running the GA with multilingual data would be very computationally intensive, so the idea is to create a pruned seed for each language for further fine-tuning with our data. There are 3 main possibilities for the pruned model architecture, with an increasing degree of performance impact:

1. Pruning a fixed number of line blocks along the model without any restriction on the number of line blocks per layer, resulting in an asymmetrical encoder architecture. This approach can lead to better performance, as previous works, like [14], report that the intermediate layers are less important for Transformer-based encoders, i.e. they can allow for more aggressive pruning.

2. Pruning the same number of line blocks per layer, generating a symmetrical architecture.

3. Pruning entire encoder layers. This is possible due to the skip connections of the residual encoder architecture. It has been observed that the nonlinear components of some layers have a contribution close to zero after training, so the layer approaches an identity. This approach results in the easiest combinatorial problem, but a greater performance impact is expected.

Symmetric pruning is easier to implement using the Hugging Face framework (https://huggingface.co/docs/transformers/index) as a starting point. In this case, the number of combinations is given by:

\left( \frac{n_{str}!}{n_{pr}!\,(n_{str} - n_{pr})!} \right)^{n_{lay}}   (6)

where n_lay is 24 for Wav2vec2 XLSR53, while the number of combinations for asymmetric pruning is:

\frac{(n_{str} \times n_{lay})!}{(n_{pr} \times n_{lay})!\,\left( n_{lay} (n_{str} - n_{pr}) \right)!}   (7)

Asymmetric pruning is also a more difficult combinatorial problem, as can be seen by substituting the values of n_str, n_pr and n_lay used in this work (n_lay = 24, n_str = 32, n_pr = 20) into (6) and (7). Therefore, we adopted symmetric pruning in this work.
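As a quick, worked check of Equations (6) and (7) with these values (a sketch using only the Python standard library; the printed counts are orders of magnitude, not a result from the paper):

```python
# Worked check of Equations (6) and (7) with the values used in this work:
# n_lay = 24 encoder layers, n_str = 32 blocks per FC layer, n_pr = 20 pruned blocks.
from math import comb

n_lay, n_str, n_pr = 24, 32, 20

# Eq. (6): symmetric pruning, the same number of blocks pruned in every layer.
symmetric = comb(n_str, n_pr) ** n_lay

# Eq. (7): asymmetric pruning, n_pr * n_lay blocks chosen among n_str * n_lay.
asymmetric = comb(n_str * n_lay, n_pr * n_lay)

print(f"symmetric search space:  ~1e{len(str(symmetric)) - 1}")
print(f"asymmetric search space: ~1e{len(str(asymmetric)) - 1}")
```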
Algorithm 1 describes the main structure of the GA code, while Algorithm 2 details how a new chromosome is assembled given the constraint of not repeating line-block indexes within an FC layer, which may imply the application of the mutation operator (see Line 15 of Algorithm 2). The indexes of the pruned line blocks belonging to an FC layer are encoded in an n_pr-dimensional vector. We build a chromosome vector by concatenating n_lay of these vectors (see Line 3 of Algorithm 1 for a more detailed description). There are other particularities of our algorithm:

1. We encode the L1-norm solution on a chromosome and seed it into the initial population, to speed up GA convergence (see Line 4 of Algorithm 1).

2. We apply elitism [19] only when the fitness degrades relative to the previous generation, to avoid attracting the population to the best individual and extinguishing the diversity, which would result in an early convergence to a local minimum (see Lines 18-23 of Algorithm 1).

3. We use a special distribution [17], see Equation (8), which allows controlling the selective pressure; the pressure must be kept very low in this application, again to avoid early convergence, as we work with few CPUs. A small sampling sketch follows Equation (8).

\mathrm{round}\left( (n_{pop} - 1)\, \frac{e^{p\vartheta} - 1}{e^{p} - 1} \right) + 1   (8)

where n_pop is the number of GA individuals, p is the selective pressure and ϑ ∈ [0, 1) is a random number sampled with uniform distribution.
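For illustration, sampling parent ranks with Equation (8) can be sketched as follows (not code from the paper; it assumes rank 1 is the best, i.e. lowest-WER, individual and uses the hyper-parameters p = 3 and n_pop = 62 adopted later in this work):

```python
# Sketch of rank selection with the distribution of Eq. (8).
# With selective pressure p -> 0 the draw approaches a uniform choice of ranks;
# larger p increasingly favors the top-ranked (lowest-WER) individuals.
import math
import random

def draw_parent_rank(n_pop: int, p: float) -> int:
    """Return a 1-based rank in [1, n_pop], biased towards 1 as p grows."""
    theta = random.random()  # uniform in [0, 1)
    return round((n_pop - 1) * (math.exp(p * theta) - 1) / (math.exp(p) - 1)) + 1

# Example with p = 3 and n_pop = 62: the bias is mild, preserving diversity.
ranks = [draw_parent_rank(62, 3.0) for _ in range(10000)]
print("share of draws in the top 10 ranks:", sum(r <= 10 for r in ranks) / len(ranks))
```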
Once we have the optimal pruned model, we fine-tune it using the same training data U_tr used for pruning. The better the pruning, the more information from the Wav2vec2 pre-training is kept, resulting in a better generalization ability as well as an easier fine-tuning, i.e. a better starting point.

Algorithm 1: Optimal pruning by GA
 1: Input: p, M, n_lay, n_str, n_pr, n_pop, U_tr: selective pressure, W2V model, number of encoder layers, pruning granularity, number of pruned blocks per layer, number of GA individuals, and training data, respectively.
 2: Output: I*, fitness*: indexes of pruned blocks and the respective fitness value.
 3: Generate a set of N_pop chromosomes {Cr} for the initial population, in which genes encode the indexes of pruned blocks of FC layers. Each chromosome is composed by concatenating n_lay vectors encoding n_pr indexes in the set {i ∈ N : 0 ≤ i < n_str} without repetition.
 4: Seed the solution from L1-norm pruning (encoded in a chromosome) into the initial population to accelerate GA convergence.
 5: ϕ_best ← ∞ (i.e. numpy.finfo).
 6: C_best ← any chromosome from {Cr}.
 7: for generation = 1 : max_gener do
 8:   Evaluate the population using multiprocessing on a computer grid:
 9:   for i = 1 : n_pop do
10:     Call the script of the modified W2V decoder in the background (i.e. in parallel) using the pruning configuration encoded in the i-th chromosome C_i and a randomly sampled subset of U_tr; the decoder script saves the resulting WER in a file f_i corresponding to the i-th chromosome.
11:   end for
12:   Search for f_i files.
13:   while any resulting file f_i is missing do
14:     Wait t seconds and search for f_i files again.
15:   end while
16:   Load all f_i files, sort the results (WER) by chromosome index and store them in the fitness vector ϕ ∈ R^{N_pop}.
17:   Rank individuals/chromosomes according to their fitness ϕ_i.
18:   if min(ϕ) < ϕ_best then
19:     ϕ_best ← min(ϕ).
20:     C_best ← argmin(ϕ) (storing the best chromosome).
21:   else
22:     Seed C_best into the current GA population {Cr} (elitism).
23:   end if
24:   The special crossover:
25:   Cr_child ← {}
26:   for k = 1 : n_pop do
27:     Randomly select the indexes of the parents using the asymmetric distribution proposed in [17]:
28:     ϑ_j ← random number ∈ [0, 1) with uniform distribution, j = 1, 2
29:     parent_j ← round((n_pop − 1)(e^{pϑ_j} − 1)/(e^p − 1)) + 1, j = 1, 2
30:     Run Algorithm 2 to assemble the child chromosome C_k^child.
31:     Add C_k^child to Cr_child.
32:   end for
33:   Cr ← Cr_child (update population)
34: end for
35: I* ← C_best
36: fitness* ← ϕ_best

Algorithm 2: Assembling the chromosome C_k^child
 1: Input: n_lay, n_str, n_pr, parent_1 and parent_2: number of encoder layers, pruning granularity, number of pruned blocks per layer, and the parent chromosomes, respectively.
 2: Output: C_k^child: child chromosome.
 3: Assembling the chromosome C_k^child:
 4: l ← 0
 5: C_k^child ← {}
 6: for m = 1 : n_lay do
 7:   c_m ← {}
 8:   for z = 1 : n_pr do
 9:     Randomly select a parent from parent_1 and parent_2 and take its l-th gene g(l), also storing the other parent's gene as ḡ(l).
10:     if gene g(l) ∉ c_m then
11:       Add g(l) to set c_m.
12:     else if the other parent's gene ḡ(l) ∉ c_m then
13:       Add ḡ(l) to set c_m.
14:     else
15:       mutation: randomly sample an index ind ∈ {i ∈ N | 0 ≤ i < n_str and i ∉ c_m} and add it to c_m.
16:     end if
17:     l ← l + 1
18:   end for
19:   Add c_m to set C_k^child.
20: end for
21: Cast set C_k^child to tensor.

4. EXPERIMENTS

This section reports our experimental results. The baseline methods for our experiments are pruning of FC lines of Equation (5) based on the smallest L1-norm [20] and restructuring the same equation by applying SVD. In the case of the usual L1-norm pruning, we report results for pruning blocks of lines and for isolated lines. For restructuring the model through SVD, we follow the previous work [8]. The weight tensors W_n, n ∈ {1, 2}, of Equation (5) can be decomposed as follows:

W_n^{1024 \times 4096} = U_n^{1024 \times 4096}\, \Sigma_n^{4096 \times 4096}\, V_n^{4096 \times 4096\,T}   (9)

where Σ_n is a diagonal matrix with the 4096 singular values on the diagonal in decreasing order, while U_n and V_n are the left and right matrices of singular vectors. To obtain a model compression of 40%, we keep only the 307 biggest singular values by truncating U_n, V_n and Σ_n, yielding the approximation:

W_n^{1024 \times 4096} \cong \tilde{U}_n^{1024 \times 307}\, \tilde{\Sigma}_n^{307 \times 307}\, \tilde{V}_n^{T\,307 \times 4096}   (10)

Therefore, Equation (5) can be approximated as follows:

h_{j+1} = \varphi\left( \tilde{U}_1 N_1 h_j + b_1 \right)^T N_2^T \tilde{U}_2^T + b_2   (11)

where N_n = \tilde{\Sigma}_n \tilde{V}_n^T, n ∈ {1, 2}. This compression method is illustrated in Figure 1.

Fig. 1. Graphical representation of Equation (5) versus Equation (11).

In the case of GA pruning, the first hyper-parameter to be investigated is the pruning granularity, i.e. the number of line blocks per FC layer. The greater the granularity, the greater the potential for good pruning, but the more difficult the combinatorial optimization problem for the GA. Investigating the optimal granularity by running the GA multiple times would be very expensive; therefore, we run the lighter L1-norm pruning for different granularity values to plot the curve granularity × WER and use it as a reference to choose the GA granularity, as shown in Figure 2. In Figure 2 we can see a large decay of WER up to a granularity of 32 and a less intense evolution after this point. Therefore, we adopted a granularity of 32 for the GA pruning, aiming at a viable combinatorial problem. The other GA hyper-parameters are: selective pressure p = 3, number of GA individuals n_pop = 62 and n_pr = 20.
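To make the pruning operation itself concrete: with the adopted granularity of 32 and n_pr = 20, each 4096-dimensional inner layer is divided into 32 blocks of 128 lines and keeps 12 of them, i.e. an inner dimension of 1536 (consistent with the roughly 3.1M parameters per intermediate FC block reported later in this section). The following is a minimal PyTorch sketch of how one such configuration could be applied to a single FC pair; it illustrates the slicing only and is not the authors' implementation.

```python
# Sketch of symmetric block pruning of one FC pair from Eq. (5), assuming the
# intermediate layers are torch.nn.Linear(1024, 4096) and Linear(4096, 1024).
import torch
import torch.nn as nn

def prune_fc_pair(fc1: nn.Linear, fc2: nn.Linear, pruned_blocks, n_str: int = 32):
    """Remove the selected blocks of lines from fc1 (rows) and fc2 (columns)."""
    inner_dim = fc1.out_features                      # 4096 for Wav2vec2 XLSR53
    block = inner_dim // n_str                        # 128 lines per block
    pruned = set(pruned_blocks)
    keep = torch.tensor([i for b in range(n_str) if b not in pruned
                         for i in range(b * block, (b + 1) * block)])

    new_fc1 = nn.Linear(fc1.in_features, len(keep))
    new_fc2 = nn.Linear(len(keep), fc2.out_features)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep, :])     # keep rows of W1
        new_fc1.bias.copy_(fc1.bias[keep])            # and the matching entries of b1
        new_fc2.weight.copy_(fc2.weight[:, keep])     # keep columns of W2
        new_fc2.bias.copy_(fc2.bias)                  # b2 is unchanged
    return new_fc1, new_fc2

# Example: one layer's genes from a chromosome, i.e. 20 pruned block indexes in [0, 32).
fc1, fc2 = nn.Linear(1024, 4096), nn.Linear(4096, 1024)
pruned = list(range(20))                              # hypothetical indexes
new_fc1, new_fc2 = prune_fc_pair(fc1, fc2, pruned)    # inner dimension 4096 -> 1536
```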
For all fine-tuning sessions, we use the Adam optimizer with an initial learning rate of 1e-5 and batch size 4, accumulating the loss gradient 8 times before updating the model. We fine-tune for 9 epochs, when the validation loss curve is flat.

Fig. 2. Curve granularity × WER using L1-norm for a 25% pruning (message data).

Given our company's interest, rather than using publicly available data [21], we use in-house training and test data consisting of actual noisy in-car voice commands with varying degrees of SNR. The usual benchmark datasets do not represent our data/interests, as shown in the first row of Table 1, which reports a high WER on our test sets for a model pre-trained on Librispeech data. Our test sets are related to the in-car virtual assistant, covering requesting music, sending SMS messages, GPS navigation to points of interest (POI) and phone dialing. The training set U_tr has 5.9K hours of audio in the same domains.

We evaluated pruning about 40% of the model size, i.e. 125M trainable parameters. No language models are used during decoding, i.e. no shallow fusion [22]; therefore, we use a simple greedy search for both the baseline and pruned models. Note that, due to the conditional independence assumption of models based on the CTC loss, like Wav2vec2, decoding with a beam width greater than one does not improve WER when there is no external LM.
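As background for this choice, greedy CTC decoding simply takes the most likely token per frame, collapses consecutive repeats and removes the blank symbol; a generic sketch (not tied to the paper's decoder) is shown below.

```python
# Generic greedy CTC decoding sketch: per-frame argmax, collapse repeats, drop blanks.
# `log_probs` is a (time, vocab) array of CTC output scores; blank_id is the CTC blank.
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> list[int]:
    best = log_probs.argmax(axis=-1)            # most likely token per frame
    out, prev = [], None
    for t in best:
        if t != prev and t != blank_id:         # collapse repeats, skip blanks
            out.append(int(t))
        prev = t
    return out

# Toy example: frames predicting "a a <blank> b b" collapse to [a, b].
toy = np.eye(3)[[1, 1, 0, 2, 2]]                # vocab = {0: blank, 1: 'a', 2: 'b'}
print(ctc_greedy_decode(toy))                   # -> [1, 2]
```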
The baseline model is Wav2vec2 XLSR53 fine-tuned on our data. We prune the model and fine-tune the pruned model with the same in-house training data used for the baseline model. The result of pruning with L1-norm and GA, as well as model restructuring by SVD, can be seen in Figure 3. We seed the L1-norm solution in the GA initial population to accelerate convergence, as explained in Section 3. Comparing the pruning methods, the WERR of GA over the L1-norm is 30.44%, which supports our research hypothesis that pruning results are better when approached with an efficient algorithm to solve the associated combinatorial optimization problem. Table 1 shows the WER and relative WERR after fine-tuning without external LM using the greedy decoder.

Fig. 3. Homogeneous pruning and restructuring of W2V before fine-tuning.

Table 1. Results after fine-tuning without external LM.

                                    WER music   WER message   WER POI   WER dialing   average
W2V out-of-the-box                  32.25%      25.64%        32.25%    48.25%        34.60%
W2V fine tuned                      20.98%      15.00%        13.67%    17.78%        16.86%
pruning by GA + fine tune           22.01%      13.53%        14.57%    18.17%        17.07%
pruning by L1-norm + fine tune      24.97%      14.09%        15.62%    19.04%        18.43%
restructuring by SVD + fine tune    22.01%      13.86%        14.57%    17.78%        17.06%
WERR pruning by GA                  -4.91%       9.80%        -6.58%    -2.19%        -1.26%
WERR pruning by L1-norm             -19.02%      6.07%        -14.26%   -7.09%        -9.33%
WERR restructuring by SVD           -4.91%       7.60%        -6.58%     0.00%        -1.17%

SVD performs similarly to GA pruning in terms of WER after fine-tuning (17.06% vs 17.07%); however, the restructured model has a twice as long chain of weight matrices in the intermediate FC layers, as can be seen by comparing Equations (5) and (11). Although both techniques result in the same number of operations and parameters (3.1 million parameters in each intermediate FC block), the longer sequence of product operations of the SVD restructuring does not favor parallel computation, as the algorithm must wait for the result of the current product to execute the next one. Therefore, the pruned FC block is about 28% faster than the restructured FC block at runtime. (This number refers to CPU runtime and depends on the adopted hardware and software ecosystem for multi-threaded inference.)

5. CONCLUSIONS

We proposed a new neuroevolution-based method to solve the combinatorial optimization problem associated with pruning and compared it with the usual L1-norm pruning and SVD-based model restructuring. This method can be applied to any pre-trained Transformer-based model to preserve as much information as possible from its pre-training. Here, the idea is to preserve Wav2vec2's generalization capacity, which is an important feature for applications in an extensive language portfolio that includes poorly resourced languages in the corpus-linguistics sense. The proposed GA-based pruning required 62 CPUs for about a day to find the optimal pruning setup. The experimental results support our method, showing a small relative damage of 1.26% for a pruning of about 40% of the model parameters. In general, our experiments indicate that focusing on the intermediate FC layers is a good way to achieve a high compression ratio with little impact on performance: an equally small relative damage of 1.17% was achieved by restructuring these layers by SVD, which takes just a few seconds to compute using SciPy. However, due to its more serial nature, the resulting restructured model tends not to be as computationally efficient at runtime as the pruned model, which allows for better parallelism.

6. REFERENCES

[1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," 2020.
[2] Y. Zhang, J. Qin, D. S. Park, W. Han, C.-C. Chiu, R. Pang, Q. V. Le, and Y. Wu, "Pushing the limits of semi-supervised learning for automatic speech recognition," arXiv preprint arXiv:2010.10504, 2020.
[3] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," arXiv preprint arXiv:2006.13979, 2020.
[4] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve et al., "Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training," arXiv preprint arXiv:2104.01027, 2021.
[5] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[6] Y. Boo and W. Sung, "Fixed-point optimization of transformer neural network," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1753–1757.
[7] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, "Shrinking bigfoot: Reducing wav2vec 2.0 footprint," arXiv preprint arXiv:2103.15760, 2021.
[8] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Interspeech, 2013, pp. 2365–2369.
[9] H.-J. Chang, S.-w. Yang, and H.-y. Lee, "DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT," in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7087–7091.
[10] M. A. Gordon, K. Duh, and N. Andrews, "Compressing BERT: Studying the effects of weight pruning on transfer learning," arXiv preprint arXiv:2002.08307, 2020.
[11] V. Sanh, T. Wolf, and A. Rush, "Movement pruning: Adaptive sparsity by fine-tuning," Advances in Neural Information Processing Systems, vol. 33, pp. 20378–20389, 2020.
[12] P. Michel, O. Levy, and G. Neubig, "Are sixteen heads really better than one?" Advances in Neural Information Processing Systems, vol. 32, 2019.
[13] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, and M. Winslett, "Compressing large-scale transformer-based models: A case study on BERT," Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 2021.
[14] I. Tenney, D. Das, and E. Pavlick, "BERT rediscovers the classical NLP pipeline," arXiv preprint arXiv:1905.05950, 2019.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] J. Poyatos, D. Molina, A. Martinez, J. Del Ser, F. Herrera et al., "EvoPruneDeepTL: An evolutionary pruning model for transfer learning based deep neural networks," arXiv preprint arXiv:2202.03844, 2022.
[17] O. Ludwig, U. Nunes, R. Araújo, L. Schnitman, and H. A. Lepikson, "Applications of information theory, genetic algorithms, and neural models to predict oil flow," Communications in Nonlinear Science and Numerical Simulation, vol. 14, no. 7, pp. 2870–2885, 2009.
[18] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[19] C. W. Ahn and R. S. Ramakrishna, "Elitism-based compact genetic algorithms," IEEE Transactions on Evolutionary Computation, vol. 7, no. 4, pp. 367–385, 2003.
[20] A. Kumar, A. M. Shaikh, Y. Li, H. Bilal, and B. Yin, "Pruning filters with L1-norm and capped L1-norm for CNN compression," Applied Intelligence, vol. 51, pp. 1152–1160, 2021.
[21] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., "SUPERB: Speech processing universal performance benchmark," arXiv preprint arXiv:2105.01051, 2021.
[22] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On using monolingual corpora in neural machine translation," arXiv preprint arXiv:1503.03535, 2015.