MODULE 3
Much of the recent work in machine learning and computer vision has focused on learning techniques for high-level tasks such as image classification. Many of the state-of-the-art models employ
Convolutional Neural Networks (CNNs) to extract high-level feature representations by processing
the input data using multiple layers of convolutions, usually followed by some non-linear transform.
CNNs have been shown to yield high-quality feature representations that produce state-of-the-art results on a variety of tasks, not only image classification but also semantic segmentation and object detection, among others. These models are trained to produce high-quality features using backpropagation, usually by pretraining on a large dataset (such as ImageNet) and then fine-tuning on the relevant dataset. Unfortunately, supervised learning suffers from certain challenges, especially in terms of scalability, since it requires large amounts of labeled data. Labeling millions of images requires extensive effort and is time consuming. Moreover, supervised training with a predefined set of classes limits the generalizability of the learned feature representations to novel classes.
To overcome the difficulties of labeling large amounts of training data, effort has gone into the
development of semi-supervised and unsupervised learning techniques. The goal of unsupervised learning techniques is to learn representations that are interpretable, easily transferable to novel tasks and novel object categories, and that disentangle the informative factors of the data from nuisance variables (e.g., lighting, viewpoint) purely from unlabeled data. A common and
widely used method for unsupervised learning is to do clustering using k-Means. k-Means clustering
is a simple method that groups input features into different clusters. Traditionally, this approach
mainly used low-level features such as raw pixel intensities, HOG features, GIST features, SIFT
features, etc. Although the performance of k-means on such features is usually poor, Wang et al.
(2015) used deep network features and employed k-means clustering to show strong results on
grouping object parts. But, the deep network that was used to extract the features was pre-trained
on ImageNet using class-label supervision (so, object knowledge was known). It would be a natural
extension to see if one can learn robust features using hierarchical feature learning in a purely
unsupervised manner.
However, since the objectives of unsupervised learning are not as concrete as the objectives of
supervised learning, optimizing deep hierarchical models using backpropagation becomes difficult.
Attempts have been made to come up with “pretext” objective functions, which are usually driven
by “common sense” requirements, to do unsupervised learning. Some examples of these objectives
include minimizing the reconstruction error, training models to identify surrogate classes, predicting the spatial position of image patches, and minimizing the distance in the representation space for objects tracked over a time period in a video sequence.
Recently, much interest has gone into adversarial training. Generative Adversarial Networks (GANs)
are of particular interest in this work. Progress in GANs has enabled significant improvements in the quality of generated images over the past couple of years. While much of the recent effort has gone into developing better architectures and training procedures for modeling and training the generative network, in this work we systematically study the power of the representations learned by the generator's adversary, i.e., the discriminative model.
We next discuss how to learn a deep network using generative adversarial training. We extract features from the discriminative component and fuse them with traditional unsupervised learning algorithms such as k-Means to improve their performance. Experiments over several datasets (CIFAR-10, CIFAR-100 and STL-10) show that the representations learned purely by unsupervised, adversarial training capture meaningful structure in the input data. Our experiments also show that, in situations with minimal amounts of supervised training examples (and large amounts of unsupervised data), the representations learned with adversarial training perform competitively in comparison to supervised training on a similar architecture. We now provide a brief summary of the adversarial training employed by GAN and InfoGAN.
Figure 1: Figure shows the InfoGAN architecture that was used in all the experiments. Notice that
the input to G(.) is a combination of z and c. Also notice that most of the parameters are shared
between the Q(.) network and the D(.) network, thus improving the computational efficiency.
We use the DCGAN architecture from Radford et al. (2015) since it is widely used for generating
images. Figure 1 shows a visualization of the architecture.
Generator: Note that the generator has been slightly modified to accept the structured latent code,
c, in addition to the random noise, z. The first layer is a fully-connected (fc) layer, which is then
reshaped into a 2-D grid of spatial resolution s/16 × s/16, where s is the size of the output image to be produced. Subsequent to this reshaping, the architecture has four layers of transposed convolution (sometimes referred to as deconvolution) with a stride of 2, each of which upsamples
the input features to twice the spatial resolution. These layers are sandwiched by batch norm and
ReLU layers. Finally, we use a tanh non-linearity to map the features into
[−1, 1].
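To make this concrete, the following is a minimal PyTorch sketch of such a generator for 32×32 outputs; the latent dimensions and channel widths are illustrative assumptions, not the exact values used in the experiments.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator sketch: noise z concatenated with a structured
    latent code c, an fc layer, then four stride-2 transposed convolutions."""
    def __init__(self, z_dim=100, c_dim=10, s=32, base_ch=512):
        super().__init__()
        self.init_res = s // 16                      # spatial size after the fc layer
        self.base_ch = base_ch
        self.fc = nn.Linear(z_dim + c_dim, base_ch * self.init_res ** 2)
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.net = nn.Sequential(
            up(base_ch, base_ch // 2),
            up(base_ch // 2, base_ch // 4),
            up(base_ch // 4, base_ch // 8),
            nn.ConvTranspose2d(base_ch // 8, 3, 4, stride=2, padding=1),
            nn.Tanh(),                               # map outputs into [-1, 1]
        )

    def forward(self, z, c):
        x = self.fc(torch.cat([z, c], dim=1))
        x = x.view(-1, self.base_ch, self.init_res, self.init_res)
        return self.net(x)
```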
Discriminator: The discriminator is a standard CNN with a series of convolutional layers followed by
non-linearities. The architecture uses four convolutional layers sandwiched by batch norm and
leakyReLU layers. We don’t use max pooling to reduce the spatial resolution of the input. Instead,
we convolve the feature maps with a stride of two, which results in the output of each convolution layer being half the spatial resolution of the input feature map. This base architecture is shared between D(.) and Q(.). On top of this shared network, we use an fc layer to extract the features, which are then used to predict the categorical distribution. Notice that most of the computational cost is shared between the D(.) and the Q(.) networks, thereby making the entire training process computationally efficient.
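A corresponding sketch of the shared D(.)/Q(.) network is shown below; channel widths and the 32×32 input size are again assumptions for illustration.

```python
import torch.nn as nn

class DiscriminatorQ(nn.Module):
    """Shared D(.)/Q(.) trunk sketch: four stride-2 convolutions with batch
    norm and LeakyReLU, followed by two small fully-connected heads."""
    def __init__(self, c_dim=10, base_ch=64):
        super().__init__()
        def down(cin, cout, bn=True):
            layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
            if bn:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return nn.Sequential(*layers)
        self.trunk = nn.Sequential(                  # shared features phi(x)
            down(3, base_ch, bn=False),
            down(base_ch, base_ch * 2),
            down(base_ch * 2, base_ch * 4),
            down(base_ch * 4, base_ch * 8),
        )
        flat_dim = base_ch * 8 * 2 * 2               # for a 32x32 input
        self.d_head = nn.Linear(flat_dim, 1)         # real/fake logit
        self.q_head = nn.Linear(flat_dim, c_dim)     # categorical code logits

    def forward(self, x):
        phi = self.trunk(x)                          # B x C x 2 x 2
        flat = phi.flatten(1)
        return self.d_head(flat), self.q_head(flat), phi
```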
As mentioned previously, while InfoGAN has the ability to group data into multiple groups
automatically, there is no constraint to enforce that the groups need to correspond to the various
object-level categories that are present in the dataset. While this turned out to be true for the MNIST dataset, we believe that it was possible because the variations in the strokes that produce different digits constitute the largest source of variation in the dataset, which conveniently corresponds to the various digit categories, thereby enabling InfoGAN to act as a category recognition model. In more realistic datasets, the sources of biggest variation need not (and usually do not) correspond to variations in the object-level categories. Our experiments show this to be true. When we trained InfoGAN to automatically group the CIFAR-10 images into 10 categories, we found that while InfoGAN was able to group the images into different groups, the groups did not correspond to object category-level groupings. Figure 2 shows some samples generated by the model.
Each row corresponds to a different category and each column in the row corresponds to a different
sample from that category (obtained by keeping c fixed and varying z). We can see that while the rows look different from one another, they do not correspond to the CIFAR-10 categories.
Therefore, we employ a hybrid approach to unsupervised clustering. We first train the
discriminative network using either the vanilla GAN objective or the InfoGAN objective, until
convergence. Upon convergence, we extract features for each image in the training set from the top of the shared network, denoted φ(x), and average-pool them across the spatial resolution for each feature channel. We then cluster these features using k-means++ into a discrete set of k categories. We set k to be the number of object classes that are present in the respective dataset.
The cluster centers learned by k-means++ clustering act as the templates for the k categories that
are present in the dataset.
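As an illustration, the training-side pipeline could look roughly as follows; `disc_q` (the trained shared network returning φ(x) as its third output) and `train_loader` are assumed to exist and are not part of the original text.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def extract_pooled_features(model, loader, device="cuda"):
    """Extract phi(x) from the shared trunk and average-pool it over the
    spatial dimensions, producing one feature vector per image."""
    feats = []
    for images, _ in loader:
        _, _, phi = model(images.to(device))      # phi: B x C x H x W
        feats.append(phi.mean(dim=(2, 3)).cpu().numpy())
    return np.concatenate(feats)

# Cluster the pooled features with k-means++; k = number of object classes.
train_feats = extract_pooled_features(disc_q, train_loader)
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10).fit(train_feats)
centers = kmeans.cluster_centers_                 # one template per category
```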
During testing, we extract the feature representation of the test images by passing them through
the discriminative network trained using the generator as an adversary, do average pooling on φ(x),
and compute the distance of the test feature vector to each of the centers learned by k-means++ clustering during the training phase. The test image is assigned an index corresponding to the index of the closest center. Our experiments show that clustering on φ(x) produces better results than directly using the recognition model of InfoGAN. Note that while we use the simple k-means++ algorithm for clustering, it could be replaced by more sophisticated unsupervised learning
algorithms. We do not explore further down this route since the scope of this work is to study the
strength of the features learned by adversarial training.
Figure 2: Figure shows samples generated from InfoGAN trained on the CIFAR-10 dataset when the
system was encouraged to identify 10 categories. Each row corresponds to a different cluster
identified by InfoGAN. Each column corresponds to a different sample from that cluster. We can see that while InfoGAN can identify clusters that are different from each other, they do not correspond to the CIFAR-10 categories.
An advantage of the hybrid approach is that it now allows us to use a variety of different “pretext”
objectives. In other words, one can decouple the training objective from the testing requirements. In fact, we experimented by encouraging InfoGAN to identify more groups in the training data than the number of object categories in the dataset. For example, we trained InfoGAN on the CIFAR-10 dataset by encouraging the system to identify {10, 20, 30, 35, 40, 50, 75} groups. Of course, these groups do not correspond to category-level groupings. However, to our surprise, we found that when the features obtained from InfoGANs trained on a large number of categories were used for clustering, they performed better at object categorization than the features obtained from an InfoGAN trained on the same number of object categories as present in the dataset.
We perform experiments on multiple datasets: CIFAR-10, CIFAR-100 and STL-10. We use ground-truth labels only for evaluation purposes and for training the supervised learning baseline. The training procedure is entirely unsupervised. We report results using two standard metrics that are used for evaluating unsupervised learning algorithms: the Adjusted Rand Index (ARI) and the Normalized Mutual Information (NMI) score. We provide three baselines: (i) we report results using simple features such as pixel intensities, HOG and GIST, which we call low-level visual features; (ii) we report results on the features obtained using standard GAN training; (iii) as an upper bound, we report results using supervised learning, where we train the weights of a discriminator network with the same architecture using the category-level labels provided by the datasets.
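Both metrics are available in scikit-learn; a minimal evaluation sketch, assuming `test_labels` holds the ground-truth labels (used only for scoring) and `test_assignments` comes from the clustering step above:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Ground-truth labels are used only here, to evaluate the clustering.
ari = adjusted_rand_score(test_labels, test_assignments)
nmi = normalized_mutual_info_score(test_labels, test_assignments)
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```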
It is important to remember that we are interested in comparing the quality of the learned features
that can be used for transfer to novel images and not just the classification score on a pre-defined set of categories. The classification accuracy captures only how well a test image was correctly classified. If a test image is incorrectly classified, it does not quantify how bad the mistake was. ARI, on the other
hand, is a better metric for evaluating the properties of the features because it measures not only
how accurately pairs of objects were correctly grouped together, but also takes into account how
many pairs of data points were incorrectly grouped. Therefore, when comparing with the model
that was trained using supervised learning, we ignore the top-level classification layer of that model,
and quantify the quality of the representations, i.e., the features extracted from the penultimate
layer, using ARI after clustering on them.
Figure 3: This figure shows all the 64 filters from the first layer of the discriminative network trained
on CIFAR-10. The visualization on the left corresponds to the filters learned using adversarial
training. The visualization on the right corresponds to the filters learned for the same architecture
using supervised learning. It is interesting to see that the filters on the left have more high-frequency components, while the filters on the right are smoother.
Before we go into the quantitative results, we visualize the filters of the first layer of the discriminative
network and compare them across two different training procedures. Figure 3 shows the visualization. On the
left are the filters from the network that was trained using adversarial training. On the right are the filters from
a network with the same architecture but trained using class-level supervision. Both these networks were
trained using the CIFAR-10 dataset. We can see that while some of the filters look similar to each other, many
of them are quite different. It is clear that the filters on the right are smoother than the filters on the left. Recall that the filters on the left are trained to fit both the real images and the generated images. When the
generated images are not as high-quality as the real images, the filters that D(.) learns might not be as
regularized as the ones learnt using only real data. We hypothesize that improving the quality of the
generated images can help regularize the first layer filters in D(.). We leave this route of exploration for future
work.
CIFAR-10
The CIFAR-10 dataset consists of 50k training images and 10k testing images of size 32×32, divided among 10 categories. We trained the model for two different image sizes, 32×32 and 64×64. We trained InfoGAN with different numbers of categories {10, 20, 30, 35, 40, 50, 75}. Figure 4a shows a plot of the performance measures versus the number of groups InfoGAN was trained to identify. We can see from the figure that as we increase the number of categories, the performance of the model goes up to a certain point and drops after that. This indicates that there exist datasets for which grouping into more categories than present in the ground truth might help. We also plot the performance of the InfoGAN model when used directly as a prediction model. We can see from the plots that k-means++ clustering produces better results (ARI-32=0.097; NMI-32=0.18) than direct prediction (ARI-32-InfoGAN: 0.085; NMI-32-InfoGAN: 0.14). We label the direct prediction results with (-InfoGAN).
Figure 4b compares the performance when using different features. We can see that InfoGAN features trained
with 50 clusters beats the features learned using vanilla GAN by a small margin. However, supervised training
does much better (as one might have expected).
Figure 4: CIFAR-10: (a) Plots the performance of the grouping algorithm when using the features
learned from InfoGAN training when trained over multiple categories. Zero groups corresponds to
vanilla GAN. -32 and -64 correspond to the output sizes of the generated images. -InfoGAN
corresponds to the results obtained with direct prediction using the recognition model in InfoGAN.
(b) Note that InfoGAN features perform better than vanilla GAN features. However, supervised
learning outperforms unsupervised learning on this database.
CIFAR-100
In this set of experiments, we use the images from the CIFAR-100 database for training. This database also contains 50k training examples and 10k test images, divided among 100 fine-scale categories and 20 coarse-level categories. We test the performance on the coarse categories. As before, we experiment with InfoGAN training using multiple numbers of categories {10, 20, 35, 50}. While the trend is not as noticeable as in the case of CIFAR-10, the best performance is obtained when we use 50 categories. Also, as before, the k-means++ clustering of the features produces better performance (ARI=0.04) than the recognition model of InfoGAN (ARI=0.036).
Figure 5: CIFAR-100: (a) # of groups used to train InfoGAN has less of an effect on CIFAR-100 than it had
on CIFAR-10. However, the performance of k-means++ clustering is still better than direct prediction
using the recognition model of InfoGAN. Please see Fig. 4a for labeling conventions.
(b) InfoGAN features and GAN features perform similarly on this dataset. However, supervised
learning features are only slightly better than the unsupervised counterparts.
STL-10
Finally, we also perform experiments on the STL-10 dataset. This database consists of 5,000 labeled training images, 100,000 unlabeled training images, and 8,000 test images. The dataset consists of 10 categories (airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck), and all the images are of size 96×96. This dataset brings out the advantages of unsupervised learning algorithms. The database is more than twice as big as the CIFAR-10 and CIFAR-100 datasets in terms of the number of images, and each image is 9 times the size of the CIFAR images. Figure 6b shows that unsupervised learning with adversarial training outperforms the same models trained using supervised learning. From Figure 6a, we also notice that the features learned using vanilla GAN do better than the features learned using InfoGAN. Increasing the complexity of the dataset makes it difficult for InfoGAN to group the images.
Neural architecture search is the task of automatically finding one or more architectures for a neural network
that will yield models with good results (low losses), relatively quickly, for a given dataset. Neural architecture
search is currently an emergent area. There is a lot of research going on; there are many different approaches to the task, and there isn't a single best method generally, or even a single best method for a specialized kind of problem such as object identification in images.
Neural architecture search is an aspect of AutoML, along with feature engineering, transfer learning, and
hyperparameter optimization. It’s probably the hardest machine learning problem currently under active
research; even the evaluation of neural architecture search methods is hard. Neural architecture search
research can also be expensive and time-consuming. The metric for search and training time is often given in GPU-days, sometimes thousands of GPU-days.
The motivation for improving neural architecture search is fairly obvious. Most of the advances in neural
network models, for example in image classification and language translation, have required considerable
hand-tuning of the neural network architecture, which is time-consuming and error-prone. Even compared to
the cost of high-end GPUs on public clouds, the cost of data scientists is very high, and their availability tends
to be low.
As multiple authors (for example Lindauer and Hutter, Yang et al., and Li and Talwalkar) have observed, many
neural architecture search (NAS) studies are irreproducible, for any of several reasons. Additionally, many
neural architecture search algorithms either fail to outperform random search (with early termination criteria
applied) or were never compared to a useful baseline.
Yang et al. showed that many neural architecture search techniques struggle to significantly beat a randomly
sampled average architecture baseline. (They called their paper “NAS evaluation is frustratingly hard.”) They
also provided a repository that includes the code used to evaluate neural architecture search methods on
several different datasets as well as the code used to augment architectures with different protocols.
In neural network research, ablation means removing features from neural networks to determine their
importance. In NAS research, it refers to removing features from the search pipeline and training techniques,
including hidden components, again to determine their importance.
Neural architecture search methods
Elsken et al. (2018) did a survey of neural architecture search methods, and categorized them in terms of
search space, search strategy, and performance estimation strategy. Search spaces can be for whole
architectures, layer by layer (macro search), or can be restricted to assembling pre-defined cells (cell search).
Architectures built from cells use a drastically reduced search space; Zoph et al. (2018) estimate a 7x speedup.
Search strategies for neural architectures include random search, Bayesian optimization, evolutionary
methods, reinforcement learning, and gradient-based methods. There have been indications of success for all
of these approaches, but none have really stood out.
The simplest way of estimating performance for neural networks is to train and validate the networks on data.
Unfortunately, this can lead to computational demands on the order of thousands of GPU-days for neural
architecture search. Ways of reducing the computation include lower fidelity estimates (fewer epochs of
training, less data, and downscaled models); learning curve extrapolation (based on just a few epochs);
warm-started training (initialize weights by copying them from a parent model); and one-shot models with
weight sharing (the subgraphs use the weights from the one-shot model). All of these methods can reduce the
training time to a few GPU-days rather than a few thousands of GPU-days. The biases introduced by these
approximations aren’t yet well understood, however.
Recently, generative adversarial networks (GANs) have shown promise in improving the performance of image-based SSL (semi-supervised learning) problems. In semi-GAN, the authors converted the M-class classification task into an (M + 1)-class problem, where the synthetic (M + 1)-th class is generated by the GAN's generator. Later on, Dai et al. provided theoretical insight that the generated data are able to boost the performance of the classifier under certain assumptions. Our work is motivated by semi-GAN.
GraphSGAN first investigated adversarial learning over graphs, where the graph is embedded into an embedding space and synthetic data are generated in the corresponding space. A multi-layer perceptron (MLP) is trained as the classifier on the embedding vectors. However, to our knowledge, there is still no existing method that combines adversarial learning with convolution-based GNNs on graph-based SSL tasks. In this work, we explore the potential of incorporating convolution-based GNNs and GANs. The challenges of constructing a general framework are threefold. First, the attributed graph data are non-Euclidean, and their distribution contains information about the graph topology as well as the attributes of nodes; hence, it is not trivial to construct a generator to model the distribution. Second, even if the generator can model the graph's distribution, it should be trained properly to boost the performance of the classifier; a poor-quality generator would introduce noise into the existing graph and hurt the classifier. Third, many variants of GCN continue to be proposed, so the framework should be built with the flexibility to adapt to different convolution-based GNNs.
We construct a novel approach called GraphCGAN to deal with the above challenges. First, to model the distribution of the graph, the generator is built sequentially from two sub-generators: one models the attribute information (node attributes) and the other models the graph topology structure (the adjacency relations of nodes). Second, in GraphCGAN, the generator is trained using the feature matching technique, which minimizes the distance between generated nodes and real nodes in a constructed feature space. This technique has shown good performance on SSL tasks in practice. In GCN, the attributes of nodes are aggregated convolutionally over multiple layers, and the representation of the last layer is usually taken as the prediction for the labels. Variants of GCN differ mainly in their layer-aggregation strategies. In our framework, we choose the second-to-last layer of the convolution-based GNN as the feature matching function.
PRELIMINARY
For the objective function of the generator, Salimans et al. (2016) found that minimizing the feature matching loss in Equation 3 achieved superior performance in practice.
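Since Equation 3 is not reproduced here, the sketch below shows the standard feature-matching generator loss from Salimans et al. (2016); `f` stands for the chosen feature-matching function (in this framework, the second-to-last layer of the convolution-based GNN), and the batch-mean formulation is an assumption of this sketch.

```python
import torch

def feature_matching_loss(f, real_x, fake_x):
    """Feature-matching loss: match the mean of an intermediate feature
    map f(.) computed on real data and on generated data."""
    real_feat = f(real_x).mean(dim=0)
    fake_feat = f(fake_x).mean(dim=0)
    return torch.sum((real_feat - fake_feat) ** 2)
```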
Loss Functions
Network Compression
In recent years, deep learning has rapidly grown and begun to show its robust ability in
representation learning, achieving remarkable success in diverse applications. This achievement has
been possible through its ability to discover, learn, and perform automatic representation by
transforming raw data into an abstract representation. The process of deep learning utilizes a
hierarchical level of neural networks of different kinds, such as multilayer perceptron (MLP),
convolutional neural networks (CNNs), and recurrent neural networks (RNNs). This hierarchical
representation allows models to learn features at multiple abstraction levels, meaning that
complicated concepts can be learned from simpler ones. Neurons in earlier layers of a network learn
low-level features, while neurons in later layers learn more complex concepts [1]. The achievement
of neural networks in a variety of applications is accompanied by a dramatic increase in
computational costs and memory requirements. Thanks to abundant data and advanced computing power, neural networks have turned into wider and deeper architectures, driving state-
of-the-art performances in a wide range of applications. Despite their great success, neural networks
have a massive number of parameters, and their significant redundancy in the parameterization has
become a widely-recognized property [2]. The over-parametrized and redundant nature of neural
networks incur expensive computational costs and high storage requirements. To classify a single
image, the VGG-16 model [3], for instance, requires more than 30 billion floating-point operations
(FLOPs), and contains about 138 million parameters with more than 500 MB of storage space. This
presents significant challenges and restricts many CNN applications. Recognizing the importance of
network units can help to reduce the model complexity by discarding less essential units.
Most of the computational complexity originates in the convolutional layers due to massive
multiplication and addition operations, although they contain fewer parameters due to parameter sharing. The number of FLOPs is a popular metric to estimate the complexity of CNN models. The FLOPs in convolutional layers are calculated as follows [4]:

FLOPs = 2HW(C_in K^2 + 1)C_out, (1)

where H, W and C_out refer to the height, width and number of channels of the output tensor, K is the kernel size, C_in denotes the number of input channels, and the +1 accounts for the bias. In contrast, most of the weight parameters live in the fully-connected layers, whose dense vector-matrix multiplications consume substantial resources. Table 1 summarizes the complexity of several CNN architectures, which splits into two parts: (1) the computational complexity, which is essentially determined by the convolutional layers, and (2) the parameter count, which is dominated by the fully-connected layers.
Accordingly, reducing the computational complexity of the convolutional layers became the focus of
most model acceleration methods, while model compression methods mainly target the parameters
of the fully-connected layers.
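As a quick illustration of Equation (1), the helper below computes the FLOPs of a single convolutional layer; the example layer (the first 3×3 convolution of VGG-16 on a 224×224 RGB input, 64 output channels) is given only to show the order of magnitude.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of a convolutional layer per Equation (1):
    2 * H * W * (C_in * K^2 + 1) * C_out."""
    return 2 * h_out * w_out * (c_in * k ** 2 + 1) * c_out

# First 3x3 conv of VGG-16 on a 224x224 RGB input: about 0.18 GFLOPs.
print(conv_flops(224, 224, 3, 64, 3) / 1e9)
```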
These complexities present significant challenges and restrict many applications. For instance,
deploying sizeable deep learning models to a resource-limited device leads to various constraints as
on-device memory is limited [8]. Therefore, reducing computational costs and storage requirements
is critical to widen the applicability of deep learning models in a broader range of applications (e.g.,
mobile devices, autonomous agents, embedded systems, and real-time applications). Reducing the
complexity of models while maintaining their powerful performance creates unprecedented
opportunities for researchers to tackle major challenges in deploying deep learning systems to a
resource-limited device. Network pruning focuses on discarding unnecessary parts of neural
networks to reduce the computational costs and memory requirements associated with deep
models. Pruning approaches have received considerable attention as a way to tackle over-
parameterization and redundancy.
Consequently, over-parameterized networks can be efficiently compressed, allowing a small subset of the whole model to represent the reference model with fewer parameters. There is no authoritative guide for choosing the best network architecture; a model may require a certain level of redundancy during training to guarantee excellent performance. Hence, decreasing the size of a model after training can be an effective solution.
Pruning approaches were conceptualized in the early 1980s and '90s, and can be applied to any part of a deep neural network. Optimal Brain Damage (OBD) by LeCun et al. and Optimal Brain Surgeon (OBS) by Hassibi et al. are considered pioneering works of network pruning, demonstrating that several unimportant weights can be removed from a trained network with little accuracy loss. Due to their expensive computational cost, these methods are not applicable to today's deep models. Obtaining a
sub-network with fewer parameters without reducing accuracy is the main goal of pruning
algorithms. The pruned version, a subset of the whole model, can represent the reference model at
a smaller size or with a smaller number of parameters. Over-parameterized networks can therefore
be efficiently compressed while maintaining the property of better generalization.
When talking about the cost of neural networks, the parameter count is surely one of the most widely used metrics, along with FLOPs (floating-point operations). It is indeed intimidating to see networks displaying astronomical numbers of weights (up to billions for some), often correlated with stellar performance. Therefore, it is quite intuitive to aim at directly reducing this count by removing parameters themselves. Actually, pruning connections is one of the most widespread paradigms in the literature, enough to be considered the default framework when dealing with pruning. The seminal work of Han et al. [26] presented this kind of pruning and served as a basis for numerous contributions [18, 21, 25].
Directly pruning parameters has many advantages. First, it is simple: replacing the value of a weight with zero, within the parameter tensors, is enough to prune a connection. Widespread deep learning frameworks, such as PyTorch, make it easy to access all the parameters of a network, so this is extremely simple to implement. Still, the greatest advantage of pruning connections is that connections are the smallest, most fundamental elements of networks and are therefore numerous enough to be pruned in large quantities without impacting performance. Such a fine granularity allows pruning very subtle patterns, down to parameters within convolution kernels, for example. As pruning weights is not limited by any structural constraint and is the finest way to prune a network, this paradigm is called unstructured pruning.
However, this method presents a major drawback: most frameworks and hardware cannot accelerate sparse matrix computation, meaning that no matter how many zeros you fill the parameter tensors with, the actual cost of the network will not change. What does impact it,
however, is pruning in a way that directly alters the very architecture of the network, which any
framework can handle.
1.2 — Structured pruning
This is the reason why many works have focused on pruning larger structures, such as whole neurons
[36] or, for its direct equivalent within the more modern deep convolutional networks, convolution
filters [40, 41, 66]. Filter pruning allows for an exploitable and yet fine enough granularity, as large
networks tend to include numerous convolution layers, each counting up to hundreds or thousands
of filters. Not only does removing such structures result in sparse layers that can be directly
instantiated as thinner ones, but doing so also eliminates the feature maps that are the outputs of
such filters.
Therefore, not only are such networks lighter to store, due to fewer parameters, but they also require fewer computations and generate lighter intermediate representations, hence needing less memory during runtime. Actually, it is sometimes more beneficial to reduce bandwidth rather than the
parameter count. Indeed, for tasks that involve large images, such as semantic segmentation or
object detection, intermediate representations may be prohibitively memory-consuming, way more
than the network itself. For these reasons, filter pruning is now seen as the default kind of structured
pruning.
Yet, when applying such a pruning, one should pay attention to the following aspects. Let’s consider
how a convolution layer is built: for Cin input channels and Cout output ones, a convolution layer is
made of Cout filters, each counting Cin kernels; each filter outputs one feature map and within each
filter, one kernel is dedicated to each input channel. Considering this architecture, and acknowledging that
a regular convolutional network basically stacks convolution layers, when pruning whole filters, one
may observe that pruning a filter, and then the feature map it outputs, actually results in pruning the
corresponding kernels in the ensuing layer too. That means that, when pruning filters, one may
actually prune twice the amount of parameters thought to be removed in the first place.
Let’s consider too that, when a whole layer happens to get pruned (which tends to happen because of
layer collapse [62], but does not always break the network, depending on the architecture), the
previous layer’s outputs are now totally unconnected, hence pruned too: pruning a whole layer may
actually prune all its previous layers whose outputs are not somehow connected elsewhere (because
of residual connections [28] or whole parallel paths [61]). Therefore, when pruning filters, one should
consider computing the exact number of actually pruned parameters. Indeed, pruning the same number of filters, depending on their distribution within the architecture, may not lead to the same actual number of pruned parameters, making results impossible to compare otherwise.
Another kind of exploitable structure, though, is to turn convolutions into “shift layers” by pruning all
but one parameter in each kernel, which can then be summed up as a combination of a shifting
operation and a 1 × 1 convolution [24].
2 — Pruning criteria
Once one has decided what kind of structure to prune, the next question one may ask could be:
“Now, how do I figure out which ones to keep and which ones to prune?”. To answer that, one needs a proper pruning criterion that will rank the relative importance of the parameters, filters, or other structures.
2.1 — Weight magnitude criterion
One criterion that is quite intuitive and surprisingly efficient is pruning weights whose absolute value
(or “magnitude”) is the smallest. Indeed, under the constraint of a weight-decay, those which do not
contribute significantly to the function are expected to have their magnitude shrink during training.
Therefore, the superfluous weights are expected to be those of lesser magnitude [8].
Notwithstanding its simplicity, the magnitude criterion is still widely used in modern works [21, 26,
58], making it a staple of the domain.
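As an illustration, here is a minimal sketch of the magnitude criterion applied in an unstructured way; the 90% sparsity level is an arbitrary example, not a recommendation from the text.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Unstructured magnitude pruning sketch: zero out the fraction
    `sparsity` of entries with the smallest absolute value."""
    k = int(sparsity * weight.numel())
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask
```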
However, although this criterion seems trivial to implement in the case of unstructured pruning, one
may wonder how to adapt it to structured pruning. One straightforward way is to order filters
depending on their norm (L1 or L2, for example) [40, 70]. While this method is quite straightforward, one may desire to encapsulate multiple sets of parameters within one measure: for example, a
convolutional filter, its bias and its batch-normalization parameters together, or even corresponding
filters within parallel layers whose outputs are then fused and whose channels we would like to
reduce.
One way to do that, without having to compute the combined norm of these parameters, involves
inserting a learnable multiplicative parameter for each feature map after each set of layers you want
to prune. This gate, when reduced to zero, effectively prunes the whole set of parameters responsible
for this channel and the magnitude of this gate accounts for the importance of all of them. The
method hence consists in pruning the gates of lesser magnitude [36, 41].
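A rough sketch of the structured version of this criterion, scoring whole convolution filters by their norm (a learnable per-channel gate would play the same role, with the gate magnitude standing in for the combined norm):

```python
import torch

def rank_filters_by_norm(conv_weight, p=1):
    """Structured criterion sketch: score each filter of a Conv2d weight
    tensor (out_channels x in_channels x k x k) by its L1 or L2 norm;
    filters with the smallest norms are candidates for pruning."""
    return conv_weight.flatten(1).norm(p=p, dim=1)
```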
Magnitude of the weight is not the only popular criterion (or family of criteria) that exists. Actually,
the other main criterion to have lasted up to now is the magnitude of the gradient. Indeed, back in the '80s, some fundamental works [37, 53] theorized, through a Taylor decomposition of the impact of removing a parameter on the loss, that some metrics, derived from the back-propagated gradient,
may provide a good way to determine which parameters could be pruned without damaging the
network.
More modern implementations of this criterion [4, 50] actually accumulate gradients over a
minibatch of training data and prune on the basis of the product between this gradient and the
corresponding weight of each parameter. This criterion can be applied to the aforementioned gates
too [49].
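A hedged sketch of such a first-order, gradient-times-weight criterion follows; the exact accumulation scheme varies between papers, and this version simply scores parameters after one backward pass over a minibatch (gradients are assumed to have been zeroed beforehand).

```python
def taylor_scores(model, loss):
    """First-order Taylor criterion sketch: after backpropagating `loss`
    over a minibatch, score each parameter by |weight * gradient|."""
    loss.backward()
    return {name: (p * p.grad).abs().detach()
            for name, p in model.named_parameters() if p.grad is not None}
```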
One final aspect to take into consideration is whether the chosen criterion is applied globally to all
parameters or filters of the network, or if it is computed independently for each layer. While global
pruning has proven many times to yield better results, it can lead to layer collapse [62]. A simple way to avoid this problem is to resort to layer-wise local pruning, namely pruning at the same rate in each layer, when the method used cannot prevent layer collapse.
3 — Pruning method
Now that we have our pruning structure and criterion, the only choice left is which method to use to prune a network. This is actually the topic on which the literature can be the most confusing, as each paper brings its own quirks and gimmicks, so much so that one may get lost between what is methodologically relevant and what is just a specificity of a given paper.
This is why we will thematically overview some of the most popular families of methods for pruning neural networks, in an order that highlights the evolution of the use of sparsity during training.
While not straying too far, some methods have brought significant modifications to the
aforementioned classic framework by Han et al. Gale et al. have pushed the principle of iterations further by removing an increasing amount of weights progressively throughout the training process, which allows benefiting from the advantages of iterations while removing the whole fine-tuning
process. He et al. reduce prunable filters to 0, at each epoch, while not preventing them from
learning and being updated afterward, in order to let their weights grow back after pruning while
enforcing sparsity during training.
Finally, the method of Renda et al. [58] involves fully retraining a network once it is pruned. Unlike
fine-tuning, which is performed at the lowest learning-rate, retraining follows the same learning-rate
schedule as training, hence its name: “Learning-Rate Rewinding”. This retraining has been shown to yield better performance than mere fine-tuning, at a significantly higher cost.
In order to speed up training, avoid fine-tuning and prevent any alteration of the architecture during
or after training, multiple works have focused on pruning before training. In the wake of SNIP, many works have studied the use of the work of LeCun et al. or of Mozer and Smolensky to prune at initialization, including intensive theoretical studies. However, Optimal Brain Damage relies on multiple approximations, including an “extremal” approximation that “assumes that parameter deletion will be performed after training has converged”; this fact is rarely mentioned, even among works that are based on it. Some works have raised reservations about the ability of such methods to generate masks whose relevance outshines random ones of similar distribution per layer [20].
Another family of methods that study the relationship between pruning and initialization gravitates
around the “Lottery Ticket Hypothesis” [18]. This hypothesis states that a “randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations”. In practice, this literature studies how well a pruning mask, defined using an already converged network, can be applied to the network back when it was just initialized. Multiple works have expanded, stabilized or studied this hypothesis. However, once again, multiple works tend to question the validity of the hypothesis and of the method used to study it, and some even tend to show that its benefits rather came from the principle of fully training with the definitive mask instead of a hypothetical “Winning Ticket”.
Comparison between the classic “train, prune and fine-tune” framework [26], the lottery ticket
experiment [18] and learning rate rewinding [58]. (image by author)
The previous methods are linked by a seemingly shared underlying theme: training under sparsity
constraints. This principle is at the core of a family of methods, called sparse training, which consists in enforcing a constant rate of sparsity during training while its distribution varies and is progressively adjusted. Introduced by Mocanu et al., it involves: 1) initializing the network with a random mask that prunes a certain proportion of the network; 2) training this pruned network during one epoch; 3) pruning a certain amount of weights of lower magnitude; and 4) regrowing the same amount of random weights.
That way, the pruning mask, at first random, is progressively adjusted to target the least important weights while enforcing sparsity all throughout training. The sparsity level can be the same for each layer or global. Other methods have extended sparse training by using a certain criterion to regrow weights instead of choosing them randomly.
Sparse training cuts and grows different weights periodically during training, which leads to an
adjusted mask that should target only relevant parameters.
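A simplified sketch of one prune-and-regrow step in this spirit (not the exact procedure of any specific paper; the swap fraction is an illustrative assumption):

```python
import torch

def prune_and_regrow(weight, mask, swap_fraction=0.2):
    """One sparse-training step in the spirit of Mocanu et al.: drop the
    smallest-magnitude active weights, then regrow the same number of
    randomly chosen previously inactive connections."""
    flat_mask = mask.view(-1)
    active = flat_mask.bool()
    inactive_idx = (~active).nonzero(as_tuple=True)[0]
    n_swap = min(int(swap_fraction * active.sum().item()), inactive_idx.numel())
    # Drop: among active weights, zero out the n_swap smallest magnitudes.
    magnitudes = weight.abs().reshape(-1).masked_fill(~active, float("inf"))
    drop_idx = magnitudes.topk(n_swap, largest=False).indices
    flat_mask[drop_idx] = 0.0
    # Regrow: re-activate n_swap randomly chosen previously inactive positions.
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_swap]]
    flat_mask[grow_idx] = 1.0
    weight.data.mul_(mask.view_as(weight))   # apply the updated mask
    return mask
```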
Many methods, instead of pruning connections manually or penalizing auxiliary parameters, rather
apply various kinds of penalties to weights themselves to make them progressively shrink toward 0.
This notion is actually pretty ancient [57], as weight-decay is already an essential element to the
weight magnitude criterion. Beyond using a mere weight-decay, even back then multiple works focused on elaborating penalties specifically designed to enforce sparsity [55, 65]. Today, various methods apply different regularizations, on top of the weight decay, to further increase sparsity (typically, using the L1 norm [41]).
Among modern works, multiple methods rely on the LASSO (Least Absolute Shrinkage and Selection
Operator) [22, 31, 66] to prune weights or groups. Other methods develop penalties that target weak
connections to increase the gap between the parameters to keep and those to prune, so that their
removal has less impact [7, 16]. Some methods show that targeting a subset of weights with a
penalization that grows all throughout training can progressively prune them and make their removal
seamless [6, 9, 63]. The literature also counts a whole range of methods built around the principle of
“Variational Dropout” [34], a method based on variational inference [5] applied to deep learning [35].
As a pruning method [48], it birthed multiple works that adapt its principle to structured pruning [43,
54].
4 — Available frameworks
While most of these methods have to be implemented from scratch (or reused from the sources provided with each paper, when available), some frameworks exist to apply basic methods or to make such implementations easier.
4.1 — PyTorch
PyTorch provides multiple quality-of-life features to help prune networks. The provided tools make it easy to apply a mask to a network, maintain this mask during training, and revert it if needed. PyTorch also provides some basic pruning methods, such as global or local pruning, whether structured or not. Structured pruning can be applied on any dimension of the weight tensors, which allows pruning filters, rows of kernels, or even rows and columns inside kernels. These built-in methods also allow pruning randomly or according to various norms.
4.2 — TensorFlow
The Keras library from TensorFlow provides some basic tools to prune weights of lower magnitude. As in the work of Han et al., the efficiency of pruning is measured in terms of how much the redundancy introduced by all the inserted zeros allows the model to be compressed (which combines well with quantization).
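A minimal sketch with the TensorFlow Model Optimization toolkit, assuming an existing compiled Keras `model` and training data `x_train`, `y_train` (the sparsity schedule values are illustrative):

```python
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Wrap the model so low-magnitude weights are progressively zeroed during
# training, following a polynomial sparsity schedule.
pruned_model = prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10000),
)
pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```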
4.3 — ShrinkBench
Blalock et al. provide in their work a custom library in an effort to help the community standardize how pruning algorithms are compared. Based on PyTorch, ShrinkBench aims at making the implementation of pruning methods easier while normalizing the conditions under which they are trained and tested. It provides different baselines, such as random pruning, global or layerwise pruning, and weight-magnitude or gradient-magnitude pruning.
3.1.1. Weight-Based Methods
It can be argued that the use of weight-based methods suffers from certain limitations. The need to remove low-weight connections means that important neurons whose activation does not contribute enough, due to low-magnitude incoming or outgoing connections, could be ignored. Moreover, the overall impact of weight-based pruning on network compression is lower than that of neuron-based methods. Pruning a neuron eliminates entire rows or columns of the weight matrices of both the former and later layers connected to that neuron, while weight-based methods only prune the low-weight connections between layers. To process the resulting sparse weight matrices,
some methods also require a particular software/hardware accelerator that off-the-shelf libraries do
not support. Despite these drawbacks, the weight-based methods can be applied in combination
with unit-based methods to add extra compression value.
Unit-based methods represent a pruning approach proposed to eliminate the least important units.
He et al. developed a simple unit-based pruning strategy that involves evaluating the importance of
a neuron by summing the output weights of each one, and eliminating unimportant nodes based on
this. They also apply neuron-based pruning utilizing the entropy of neuron activation. Their entropy
function evaluates the activation distribution of each neuron based on a predefined threshold, which
is only suitable with a sigmoid activation function. Since this method damages the network’s
accuracy, additional fine-tuning is required to obtain satisfactory performance. Alqahtani et al.
proposed a majority voting technique to compare the activation values among neurons and assign a
voting score to quantitatively evaluate their importance, which helps to effectively reduce model
complexity by eliminating the less influential neurons. Their method simultaneously identifies the
critical neurons and prunes the model during training without involving any pre-training or fine-
tuning procedures.
Srinivas et al. also introduced a unit-based pruning method by evaluating the weight similarity of neurons in a layer. A neuron is removed when its weights are similar to those of another neuron in its layer.
Mariet et al. introduced Divnet, which selects a subset of diverse neurons and merges similar
neurons into one. The subset is selected based on activation patterns by defining a probability
measure over subsets of neurons. As with others, these pruning methods require
software/hardware accelerators that are unsupported by off-the-shelf libraries and a multi-step
procedure to prune neurons.
Filter-level pruning strategies have been widely studied. The aim of these strategies is to evaluate
the importance of intermediate units, where pruning is conducted according to the lowest scores. Li
et al. suggested such a pruning method based on the absolute weighted sum, and Liu et al.
proposed a pruning method based on the mean gradient of feature maps in each layer, which
reflects the importance of features extracted by convolutional kernels. Other data-driven pruning
methods have been developed to prune non-informative filters. For instance, Polyak et al. designed
a statistical pruning method that removes filters based on the variance of channels, applying the activation variance of feature maps to identify the critical filters. Unimportant filters can also be pruned according to their level of importance. Luo's pruning method uses the entropy of the channels' output to evaluate the importance of their filters, pruning those with the lowest output entropy, while Hu et al. evaluated the importance of filters based on the average percentage of zero activations (APoZ) in their output feature maps.
Furthermore, Luo et al. proposed the ThiNet method, which applies a greedy strategy for channel
selection. This prunes the target layer by greedily selecting the input channel with the smallest
increase in reconstruction error. A least-squares approach is applied to select a subset of input channels that has the smallest impact on approximating the output feature map. A general channel
pruning approach is also presented by Liu et al., where a layer-grouping algorithm is proposed to
find coupled channels automatically. Then a unified metric based on Fisher information is derived to
evaluate the importance of a single channel and coupled channels. These methods tend to compress
networks by simply adopting straightforward selection criteria based on statistical information.
However, dealing with an individual CNN filter requires an intuitive process to determine selective
and semantically meaningful criteria for filter selection, where each convolution filter responds to a
specific high-level concept associated with different semantic parts. The most recent work is a CNN
pruning method inspired by neural network interpretability. Yeom et al. [46] combined the two
disconnected research lines of interpretability and model compression by basing a pruning method
on layer-wise relevance propagation (LRP), where weights or filters are pruned based on their relevance score. Alqahtani et al. proposed a framework to measure the importance of individual hidden units by computing a measure of relevance to identify the most critical filters, introducing the use of feature-map activations to detect valuable information and essential semantic parts when evaluating the importance of feature maps.
It could be argued that compressing a network via a training process may provide more effective
solutions. Ding et al. presented an optimization method that enforces correlation among filters so that they converge to the same values, creating identical filters of which the redundant ones are safely eliminated during training. He et al. proposed a filter pruning method which prunes convolutional filters in the training phase. After each training epoch, the method measures the importance of filters based on the L2 norm, and the least essential filters are set to zero. He et al. later iteratively measured the importance of filters by calculating the distance between the convolution kernel and the origin or the geometric mean, based on which redundant kernels are identified and pruned
during training. Liu et al. trained an auxiliary network to predict the weights of the pruned networks
and estimate the performance of the remaining filters. Moreover, Zhonghui et al. applied a training
objective to compress the model as a task of learning a scaling factor associated with each filter and
estimating its importance by evaluating the change in the loss function. AutoPruner embedded the
pruning phase into an end-to-end trainable framework. After each activation, an extra layer is added
to estimate a similar scaling effect of activation, which is then binarized for pruning. A significant
drawback of iterative pruning is the extensive computational cost; pruning procedures based on training iterations often change the optimization function and even introduce hyper-parameters, which makes training more challenging to converge.
The disadvantages of binary networks include significant performance drops when dealing with larger CNNs, and the fact that binarization's impact on accuracy is often ignored. To overcome this, Hou et al. employed a proximal Newton algorithm with a diagonal Hessian approximation to minimize the overall loss associated with binary weights, and Lin et al. quantized the representations at each layer when computing parameter gradients, converting multiplications into binary shifts by constraining the values of the neurons to power-of-two integers.
Low-rank factorization has been utilized for model compression and acceleration to achieve further
speedup and obtain small CNN models. Rigamonti et al. [68] postprocessed the learned filters by
employing a shared set of separable 1D filters to approximate convolutional filters with low-rank
filters, and Denton et al. used low-rank approximation and clustering schemes to reduce the
computational complexity of CNNs. Jaderberg et al. suggested using different tensor decomposition
schemes, achieving double speedup for a particular convolutional layer with little drop in model
accuracy. Low-rank factorization has also been used to exploit low-rankness in fully-connected
layers. Denil et al. utilized a low-rank decomposition of the weight matrices which learned from an
auto-encoder to reduce the number of dynamic parameters, while Sainath et al. showed that low-
rank factorization of the final weight layer significantly reduces the number of parameters. Lu et al. adopted SVD to decompose the fully-connected layer, attempting to design compact multi-task
deep learning architectures. Low-rank approximation is made in a layer-by-layer fashion: at each
layer, the layer is fine-tuned based on a reconstruction objective, while keeping all other layers fixed.
Following this approach, Lebedev et al. applied the non-linear least-squares algorithm, a type of
Canonical Polyadic Decomposition (CPD), to approximate the weight tensors of the convolution
kernels. Tai et al. introduced a closed-form solution to obtain the low-rank decomposition by training constrained CNNs from scratch. The Batch Normalization (BN) layer is utilized to
normalize the activations of the latent, hidden layers. This procedure has been shown to be effective
in learning the low-rank constrained networks.