Kurtz 20 A

Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference
Mark Kurtz * 1 Justin Kopinsky * 1 Rati Gelashvili 1 Alexander Matveev 1 John Carr 1 Michael Goin 1
William Leiserson 1 Sage Moore 1 Bill Nell 1 Nir Shavit 1 Dan Alistarh 1 2
Abstract 1. Introduction
Deep neural networks (DNNs) are able to achieve state-of-
Optimizing deep neural networks for inference the-art performance in several application domains, such
has recently become an extremely active area as image classification, speech recognition, and automated
of research. One of the go-to solutions in this decision making, e.g. (Krizhevsky et al., 2012; Vaswani
context is weight pruning, which aims to reduce et al., 2017; Silver et al., 2016). Along with this wide ar-
computational and memory footprint by remov- ray of applications comes the need to reduce the significant
ing large subsets of the connections in a neural computational and memory footprint of DNNs. To this end,
network. Surprisingly, much less attention has several techniques have been designed to obtain optimized,
been given to exploiting sparsity in the activation resource-efficient variants of a given deep model. Prun-
maps, which tend to be naturally sparse in many ing and quantization are arguably the standard methods for
settings thanks to the structure of rectified linear achieving resource-efficient models, which have received
(ReLU) activation functions. In this paper, we considerable attention, e.g. (Liu et al., 2017; Luo et al., 2017;
present an analysis of methods for maximizing Gray et al., 2017; Han et al., 2015; Li et al., 2016; Mishra
the sparsity of the activations in a trained neu- et al., 2017; Zhu et al., 2016). However, the vast majority
ral network, and show that, when coupled with of existing work has focused on compressing the weights
an efficient sparse-input convolution algorithm, (connections) in the neural network, for which several regu-
we can leverage this sparsity for significant per- larization (Molchanov et al., 2017) and thresholding-based
formance gains. To induce highly sparse activa- methods (Han et al., 2015; Gale et al., 2019) are now known.
tion maps without accuracy loss, we introduce a It is therefore perhaps surprising that sparsifying activation
new regularization technique, coupled with a new maps has received relatively little attention. A non-trivial
threshold-based sparsification method based on a fraction of the activations are sparse as a natural conse-
parameterized activation function called Forced- quence of the structure of Rectified Linear Unit (ReLU)
Activation-Threshold Rectified Linear Unit (FA- activation functions. This observation has been leveraged
TReLU). We examine the impact of our methods by hardware accelerators, e.g. (Albericio et al., 2016; Han
on popular image classification models, showing et al., 2016; Parashar et al., 2017), and reference (Rhu et al.,
that most architectures can adapt to significantly 2018) performed an analysis of naturally-occurring acti-
sparser activation maps without any accuracy loss. vation sparsity. Recently, (Georgiadis, 2019) explored L1
Our second contribution is showing that these regularization to increase the number of zeroes in the acti-
these compression gains can be translated into in- vation maps, showing that sparsity can be increased by up
ference speedups: we provide a new algorithm to to 60% for image classification models.
enable fast convolution operations over networks
with sparse activations, and show that it can en- A second gap in the literature is the absence of software
able significant speedups for end-to-end inference support for sparsity, and in particular activation sparsity, on
on a range of popular models on the large-scale common hardware. Currently, running models with higher
ImageNet image classification task on modern activation sparsity rates on common CPU or GPU platforms
Intel CPUs, with relatively low retraining cost. will not result in computational speedups, and improvements
are only reported in relative sparsity percentage, or synthetic
* memory compression rates (Georgiadis, 2019). It is not at
Equal contribution 1 Neural Magic 2 IST Austria. Correspon-
dence to: Dan Alistarh <[email protected]>. all clear how these compression rates will relate to speedup
in real-world implementations, and it is therefore difficult
Proceedings of the 37 th International Conference on Machine to evaluate the practical impact of existing methods.
Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by
the author(s). In this paper, we address both these gaps with respect to
activation sparsity. We begin by performing an in-depth and test it on a range of popular CNNs for image classifica-
analysis of regularization and thresholding methods as a tion on the classic ImageNet ILSVRC2012 dataset (Deng
way to increase activation map sparsity in convolutional et al., 2009). We find that 1) many popular models have
neural networks. Specifically, we present a set of techniques significant “natural” activation sparsity, without any specific
which can significantly boost naturally-occurring activation activation regularization; 2) the natural activation sparsity of
sparsity in CNNs, without loss of accuracy. Our methods these networks can be consistently and significantly boosted
can be both applied statically (requiring no retraining) and using our techniques. We show the resulting boosted models
dynamically (if fine-tuning is possible), and significantly can be executed with speedups of more than 2× compared
improve upon existing regularization-based methods (Geor- to state-of-the-art CPU and GPU inferencing solutions.
giadis, 2019), often by more than 2× in terms of relative
Related Work. The literature on model compression for
improvement to baseline sparsity. We complement these
DNNs is extremely vast, so we restrict our attention to
techniques with negative results, showing that activation
work on analyzing and leveraging activation sparsity. The
sparsification cannot smoothly recover accuracy through
fact that activation sparsity arises naturally is well-known,
re-training (as opposed to weight sparsification (Gale et al.,
and has been leveraged by several architecture proposals,
2019)), and that applying thresholding independently per
e.g. (Albericio et al., 2016; Han et al., 2016; Parashar et al.,
each channel is possible but only yields limited gains. Our
2017); in particular, reference (Rhu et al., 2018) performed
second contribution is a general algorithm which can lever-
an in-depth analysis of activation sparsity on a range of
age activation sparsity for computational gains, and its effi-
convolutional models. We extend this analysis here.
cient CPU-based implementation. The resulting framework
can lead to inference speedups of more than 2x on a range of Another related line of work is that on compressing activa-
popular CNNs for image classification, relative to industrial tion maps. A common technique for reducing the mem-
CPU- and GPU-based inference frameworks, and to our ory footprint of activation maps is quantization, which
optimized dense baseline. has been employed successfully by several references, see
e.g. (Mishra et al., 2017) and references therein. We do not
Our sparsity-boosting methods combine a regularizer fol-
investigate quantization here, and leave a thorough treatment
lowing the Hoyer sparsity metric (Hoyer, 2004), together
of the impact of our sparsification techniques in conjunction
with a variant of the classic ReLU activation, which we call
with quantization for future work. Reference (Gudovskiy
Forced Activation Threshold ReLU (FATReLU). Simply
et al., 2018) proposed a projection technique coupled with
put, FATReLU implements a variable threshold for the com-
non-linear dimensionality reduction, which required mod-
mon ReLU activation function, below which all activations
ifying the network structure, while (Alwani et al., 2016)
are set to zero, based on the intuition is that a non-trivial
proposed to stochastically prune activations as an adversar-
fraction of the positive activations can be eliminated without
ial defense. Both techniques cause significant accuracy loss,
significant impact on the output. We develop techniques to
and are therefore outside the scope of our study. Agostinelli
determine and optimize FATReLU thresholds per layer, and
et al. (2014) propose learning piecewise linear activation
perform an analysis of the interplay between these methods
functions to improve the accuracy of given models. FA-
and the accuracy of the resulting model. In short, we find
TReLU is piecewise linear, but the goals and methods we
that sparsity can be significantly boosted via Hoyer regular-
investigate in this paper are different.
ization and thresholding, with no accuracy loss, beyond L1
regularization. The methods we propose induce negligible The work closest to ours is (Georgiadis, 2019), who pro-
(< 0.3%) accuracy loss on ImageNet-scale models, and can posed and investigated the use of L1-regularization applied
even result in minor accuracy increase. However, contrary to the activation maps, and showed that it can result in a sig-
to weight pruning methods, which can gradually trade off nificant increase (up to 60% relative to naturally-occurring
accuracy for increased sparsity, we find that sharp thresh- activation sparsity) on a range of CNNs for image clas-
olds exist for activations, beyond which accuracy drops, and sification. The paper goes on to explore several efficient
cannot be recovered. This observation simplifies the fine- encoding techniques for the activations, and evaluates them
tuning process, since, up to this threshold, we are usually synthetically in terms of their resulting compression fac-
able to recover full accuracy, and there is little benefit in tors, but provides no inference experiments. We show that
fine-tuning beyond this threshold. Hoyer regularization is superior to L1, in the sense that it
provides higher activation sparsity without accuracy loss
Our second contribution is a computational framework to
on all the models we investigated. The thresholding meth-
leverage activation sparsity for computational gains, tailored
ods we propose are complementary to regularization in the
to CPUs. This framework is based on an algorithm for fast
sense that they can be applied independently of whether the
convolutions on sparse inputs, for which we present an effi-
base model has been regularized or not, or of the regular-
cient vectorized implementation, and back by several non-
ization method. In addition, our paper provides a complete
trivial optimizations. We implement our framework in C++,
framework for leveraging activation sparsity for fast infer- we examine the average activation map sparsity across sev-
ence on CPUs, as well as end-to-end inference speedup for eral batches, we notice that layers which are closer to the
activation-sparsified models. input tend to have activation sparsity that is lower than this
threshold, whereas later layers tend to have higher activa-
To our knowledge, the only reference to explicitly leverage
tion sparsity. One intuitive (but imprecise) explanation for
input sparsity for performance gains is the recent prelimi-
this phenomenon could be that earlier layers adapt to extract
nary publication of (Dong et al., 2019). By contrast, their
more numerous low-level features, whereas the later layers
algorithm is more complex, and requires high input spar-
would extract higher-level features. Please see Figure 4 for
sity to be efficient: in particular, as stated in the reference,
an illustration. The standard deviation of the recorded spar-
the resulting algorithm can only be applied to certain types
sities is under 1% across batches, so we omit confidence in-
of tasks and models, such as LiDAR-based detection, or
tervals for visibility, noting that this stable behaviour across
character recognition. For this reason, we do not directly
batches is somewhat surprising.
compare against it. Our technique is applicable and efficient
in a much wider range of scenarios. The Impact of Network Depth and Width. In this con-
text, it is natural to ask whether wider or deeper networks
Related algorithmic ideas have been investigated in (Park
will tend to have higher activation sparsity. We examined
et al., 2016b;a; Chen, 2018). The critical distinction is
this trend on pre-trained ImageNet models, in particular
that all these references explore leveraging sparsity in the
comparing ResNet50 with its 2x wide variant (Zagoruyko
weights, rather than in the activations, leading to a differ-
& Komodakis, 2016), as well as with the deeper ResNet101
ent algorithm structure and implementation. For example,
and its 2x wide variant. We use the Torchvision pretrained
our procedure critically requires efficient on-the-fly input
models as examples. The results are provided in Table 3.
compression, whereas weight sparsity techniques can pre-
(We observed similar results in a depth-width ablation study
compress the weights offline. Another key difference from
on residual networks on CIFAR-10, which we omit for
these approaches is that they require retraining, since kernel
brevity.) First, average activation sparsity does indeed in-
sparsity is not naturally present in neural networks with-
crease with network depth (e.g. 53% to 57% for ResNet50
out specific regularization or thresholding. Second, the
vs ResNet101), corroborating the intuition that “higher level
speedups achieved by these methods are bound to be lim-
features” develop deeper in the network. Second, wider
ited by the fact that, even with thresholding, kernels cannot
networks do have a higher fraction of zero activations (e.g.
usually be sparsified to the extremely large ratios which can
53% to 58% for ResNet50 vs 2xWideResNet50), matching
be naturally present in activations without loss, e.g > 90%.
the intuition that only a limited subset of the features are
Our work can also be examined in the broader context of necessary to classify a certain input, whose proportion does
model compression methods, which is an extremely active not necessarily increase with layer width. Moreover, as can
research area, e.g. (Wu et al., 2016; Zhu et al., 2016; Mishra be seen from the result for 2xWide ResNet101 (63%), these
et al., 2017; Mellempudi et al., 2017; Zhang et al., 2017; trends compound.
Park et al., 2016b; Han et al., 2016; Polino et al., 2018;
L1 Regularization. In Figure 4(a), we also examine the
Frankle & Carbin, 2018). We develop the first thresholding-
impact of L1 regularization applied to the activations on
based method specifically for activations, along with spe-
the sparsity. We follow the proposal of (Georgiadis, 2019),
cific sensitivity analysis and tuning techniques.
which consists of fine-tuning an accurate pre-trained model
2. Activation Sparsity in CNNs with L1 regularization for a number of epochs, as well
as the carefully optimized regularization parameter values
2.1. Natural and Regularized Activation Sparsity provided, which ensure no accuracy loss. We notice that this
Naturally-Arising Sparse Activations. We begin by ex- method can boost the sparsity of activations by an extra 1%
amining the natural sparsity of activation maps in CNNs. and 4% on average on ResNet50 and Mobilenet, respectively.
For simplicity, we will focus on residual models trained (See Table 1 for values across models.)
on the ImageNet (ILSVRC2012) task, although our find- Hoyer Regularization. We go beyond the L1 sparsity-
ings are generally valid across other datasets (in particular, inducing regularization, and consider the square Hoyer reg-
2
CIFAR-10 and 100 (Krizhevsky et al., 2014)) and architec- ( d |vi |)
P
ularizer, defined for a vector ~v as H(~v ) = Pi=1 d 2 .
tures (ResNet (He et al., 2016), Mobilenet (Howard et al., i=1 vi
2017))–please see Section 5 for full results. This regularizer has a range of desirable properties as a
measure of sparsity (Hoyer, 2004), such as scale-invariance
Activation sparsity is linked with the structure of the ReLU and differentiability almost-everywhere. It is popular for
non-linearity: if input data to this function were completely compressed sensing, and has only recently been applied for
random and zero-centered, then we would expect an output weight sparsification (Yang et al., 2019); to our knowledge,
activation sparsity concentrated around 50%. However, if we are the first to investigate it for activation sparsity.
Figure 1. Illustration of the impact of regularization and boosting on the output distribution of a convolutional layer (ResNet18, layer 5).
The Y axis is log-scale. Notice that all methods significantly narrow the set of non-zero activations; however, Hoyer and boosted Hoyer
allow for more “diversity” in the activations, which explains their better performance.
Figure 4(a) presents the output activation sparsities for each Further, Figure 2 (center, right), shows that regularization
layer of ResNet18, when regularized with square Hoyer may serve to stabilize activations, in the sense that a larger
such that there is no accuracy loss. Specifically, for each fraction can be thresholded on regularized models, with-
ReLU’s output we apply the square Hoyer regularization out accuracy loss. Moreover, we found the benefits from
multiplied by a hyperparameter determined experimentally regularization to be approximately independent from, and
to the cost function. We found values between 10−8 (conser- additive with, the benefits from thresholding. Layers other
vative) to 10−7 (more aggressive) to work for this parameter, than the one depicted exhibited a similar pattern, with some
for all the models we considered. Our initial learning rate variance in the particular sparsity values.
for retraining is 5×10−3 , and we maintain standard momen-
tum and weight decay values. With these parameters, we 3. Boosting Activation Sparsity
retrain for 10 epochs to stabilize weights and recover accu- In this section, we investigate generic ways to systemati-
racy. We note that this recalibration process is significantly cally produce networks with high activation sparsity. We
less expensive than for L1 regularization (Georgiadis, 2019), begin with static methods (which require no retraining), and
which required 90 epochs of training for recovery. The im- then continue with dynamic methods, which are allowed to
provements relative to the additional sparsity induced by retrain in order to recover accuracy.
L1 are of 2.4x and 8x, for Mobilenet and ResNet50, respec-
tively. Our experimental results in Section 5 clearly suggest Forced-Activation Thresholds. Formally, the Forced-
that square Hoyer is superior to classic L1 regularization. Activation Threshold ReLU activation function (FATReLU)
is simply defined as:
2.2. The Distribution of Activations
(
We now focus our attention to the distribution of activa- x when x ≥ T ;
FATReLU T (x) =
tions in the layers of a neural network. We performed a 0 otherwise.
basic histogram analysis for layers of ResNet18, from the
original pre-trained model, as well as from the L1, Hoyer-
regularized, and boosted variants of the same model. We Note that FATReLU cannot be simulated by simply adding
notice that, for all instances, a non-trivial fraction of the ac- a linear bias term to ReLU. Further, not only is FATReLU
tivations are clustered around zero. Next, we implement an not differentiable at T , but it is not even continuous at T ,
activation sensitivity analysis procedure: independently for which renders training neural networks from scratch with
each layer, we fix a threshold T below which all of the acti- FATReLU cumbersome. However, our use case allows us
vations will be set to zero. We then increase this threshold to use it to directly replace ReLU on a pre-trained model
and examine the loss of accuracy. The resulting graph for a whose activations we wish to further sparsify.
set of layers of pretrained ResNet18 is presented in Figure 2. Baseline Model. We assume an accurate pre-trained model
Results suggest that a non-trivial fraction of the activations for the target architecture and task. We first fine-tune the
can be set to zero without affecting the loss. The results provided model using the square Hoyer regularizer, which
presented are averaged over a set of 128 mini-batches. We sets a fraction of the activations to zero, and also “stabi-
found these results to be extremely consistent, and therefore lizes” the other activations, allowing a larger fraction to be
omit error bars for visibility. thresholded via FATReLU.
the exact root cause of this phenomenon is difficult, but we

conjecture that it is related to the fact that the forward and
backward information flow through the layer is break down
due to the high activation threshold.
Dynamic Thresholding. Due to this bi-modal recovery
behavior, we use dynamic thresholding to simplify the pro-
cess of finding the optimal thresholds per layer. We fix a
small accuracy loss tolerance, τ (0.2% in our experiments),
and, for each layer, we refer to the static analysis results
Figure 2. Illustration of the sensitivity analysis results for a single to identify the maximal threshold for which accuracy loss
convolutional layer in ResNet18, for different threshold values (X remained below τ , determined by binary search over the
axis) and different regularizers (panels). range of thresholds. Once this threshold is determined for
each layer, we run one fine-tuning training epoch until either
Activation Sensitivity Analysis. We first aim to find layer- recovery is achieved, or recovery fails. Using this success
wise activation thresholds which sparsify a large fraction of or failure criterion, we can perform binary search on τ to
the activations preserving accuracy. We adapt the weight determine the largest τ for which recovery is possible.
sparsity sensitivity analysis (Han et al., 2015) for the case We adopt this procedure since it has low cost, and similar
of activations. Intuitively, we estimate the “derivative” of outcomes to more complex iterative procedures we have
the loss with respect to the activation sparsity of each layer. investigated, both in terms of sparsity and accuracy. In ad-
The procedure starts by identifying a set of target sparsity dition, Dynamic Thresholding performs particularly well
percentages for the outputs of the different layers. For each when used in conjunction with Hoyer regularization (please
layer L, we pick a maximal percentage TL of the extra see Figure 4(a)). We adopt Hoyer regularization plus Boost-
activations which should be set to zero, in addition to the ing via Dynamic Thresholding as our main method for gen-
natural sparsity. We evaluate and record the loss at discrete erating activation-sparse models.
thresholds t ∈ [0, TL ]. (This procedure exclusively uses
batches from the training set.) Channel-wise Thresholding. Next, we ask whether we
could further increase activation sparsity by performing Dy-
We thus obtain a “sensitivity profile” for each layer, based namic Thresholding channel-wise, setting a distinct thresh-
on which we set a threshold for the activations of the layer. old for each channel of each layer. This procedure is costly,
We usually pick the threshold for each layer to be the largest since it requires fine-grained tuning across each channel, and
value which did not result in accuracy loss, modulo some requires care, since the impact of each individual channel
small error tolerance. A typical set of results is illustrated in on the loss may be small. We proceed as follows.
Figure 2. It is not uncommon for F AT ReLU t to improve
accuracy at low threshold values– one possible explanation We start from the layer-wise FATReLU thresholds deter-
is that this serves to remove some of the noise from the mined above. Next, we perform a one-shot sensitivity anal-
activations close to the zero threshold. ysis for each channel in each layer, by estimating the piece-
wise integral of the cross-entropy loss relative to the channel
Retraining and Sharp Activation Thresholds. The above threshold, obtained from sensitivity analysis. We adopt the
procedure is static, in the sense that the model weights are maximum threshold across all channels as the maximum
not modified, and the model is not retrained. It results in value to integrate for across all channels. A lower integral
consistent, but relatively limited improvements in terms of value suggests that the channel is less sensitive to thresh-
activation sparsity: for instance, for the ResNet18 model, olding. Based on the results of this channel-wise sensitivity
the average increase across layers due to static boosting is procedure, we partition the channels into groups based on
under 3% globally. We wish to achieve higher thresholds their sensitivity. For each channel group (e.g. the 25% least
by allowing retraining of the network to adapt the weights sensitive channels, and so on), we perform binary search on
to the higher thresholds. Such procedures are common for their joint thresholds, attempting to increase their FATReLU
weight sparsification (Zhu & Gupta, 2017; Gale et al., 2019). threshold, until the point where we reach the tolerance in
Perhaps surprisingly, we find that this behavior is bi-modal terms of accuracy difference.
for activations: we can increase activation sparsity within We have performed channel-wise thresholding for ResNet18
a continuous range and still have the model recover full following the above procedure. Please see Figure 3 for a
accuracy through retraining at each level within the range. sample of the results. On the positive side, the procedure
However, each layer appears to have a “sharp” activation does not diverge–we are able to systematically increase
threshold beyond which the model is no longer able to re- thresholds per channels without accuracy loss. On the nega-
cover accuracy, even with significant retraining. Identifying
(a) Channel Thresholds after Channel-Wise Sensitivity Analysis. (b) Sparsity Boosting across Channels. We can only obtain a low
Notice the relatively low proportion of channels which can be average activation sparsity increase, at significantly increased
boosted past the average threshold. computational cost.
Figure 3. Channel-wise boosting results for a fixed “highly-sensitive” layer of ResNet18.
tive side, we can only obtain an average activation sparsity encoding the number of non-zero elements per row, i, deriv-
increase of approximately 2% relative to the coarse-grained able as row pointers[i + 1] − row pointers[i].
dynamic thresholding method, at significantly increased
The Algorithm. Algorithm 1 shows a simple pseudo-code
computational cost. Further analysis of the results suggests
implementation to compute the convolution of a dense ker-
that a fraction of approximately 30% of the channels cannot
nel K with sparse input I given in CSR format to produce
be boosted past the layer threshold, whereas a small fraction
output O. For simplicity, we assume that the input data
of approximately 10% of the channels have negligible im-
has one channel dimension and one spatial dimension. In
pact on the loss and thus can be completely eliminated. The
particular, the input I is a CSR representation of data with
cost of this method outweigh its computational benefits.
dimensionality IC ×Ix , the output O is a matrix with dimen-
sions OC ×Ox , and the kernel K is a tensor with dimensions
4. Leveraging Activation Sparsity
OC × IC × Kx . Extending to more spatial dimensions (as
Background. To make use of activation sparsity at runtime, is typical, e.g., in image processing NNs) is straightforward
we implement an algorithm to perform sparse convolutions and omitted for clarity.
on data that is initially produced (e.g. from a previous layer) AVX Implementation. We implemented Algorithm 1 on
in a standard (i.e. dense) format. We make use of a variant Intel’s Skylake architecture with x86+AVX512 instruction
of the Compressed Sparse Row (CSR) representation (e.g. set. Both CSR compression and Algorithm 1 can be im-
as implemented in (Wang et al., 2014)). Prior work has plemented efficiently using available SIMD instructions.
taken advantage of CSR for computing convolutions when Algorithm 2 demonstrates how to implement Algorithm 1
the kernels are sparse, on both GPUs (Park et al., 2016b) in a SIMD way. Note that ÔC refers to the number of vec-
and CPUs (Park et al., 2016a), where one has the luxury tors of output channels to be computed, i.e. ÔC = OC /r
of being able to pre-compress the sparse kernels prior to when there are r values per vector. In our implementation,
inference with no performance overhead. However, for acti- r = 16, as we use FP32 data stored in 512-bit vector regis-
vations, the location of the non-zero elements is not known ters. Note that we assume that there are ÔC vector registers,
until inference time, and so we must be able to efficiently (0) (^
O )
vout · · · voutC , available to hold intermediate results. Other-
compress the activations at run time. Once compressed, we
wise, we can subdivide the output tensor O along its channel
can apply Algorithm 1 to the compressed input. Importantly,
dimension into blocks small enough to be held in register,
both CSR compression and sparse-input convolution can be
and execute Algorithm 2 independently for each block.
implemented efficiently on modern hardware, i.e. without
the need to branch on zero elements. SIMD Compression. Because we must compress our input
data at runtime, we also require an efficient algorithm to
We use a “3-array” variation of CSR, wherein a sparse ma-
compress a matrix M to CSR format. This can be done as
trix M is represented with the following three arrays:
follows: given a SIMD vector, v, of 16 floats, which we
• values: Element j contains the j th non-zero element want to compress, we use the vcmp instruction to identify
of M in row-major order the locations of the non-zero elements in v stored in a mask
• columns: Element j contains the column index in M register m. Then use the vcompress instruction twice:
of the corresponding element values[j] once applied to v with mask m to produce contiguous non-
• row pointers: Element i contains a pointer to the zero elements to be written to values, and a second time
first element in values which came from row i of M applied to the vector {j, . . . , j + 15} with mask m (where
j is the column index of the first element of v in M ) to
Note that row pointers serves the additional function of
produce column indices to be written to columns. The
popcnt instruction applied to m can be used to keep track loop iterations before it is actually needed, at the cost of
of the number of non-zero elements and thereby maintain requiring s additional registers to hold pending input values.
the offset for writing to values and columns, as well as to
Hot kernels in cache. In some layers of some networks,
record row pointers[i].
convolutional kernels are so large that they do not fit
in cache. For instance, the last several convolutions of
Algorithm 1 Sparse Convolution Resnet50 are either 2048 × 512 × 1 × 1 or 512 × 512 × 3 × 3,
1: for (ox, kx) ∈ [0, Ox ) × [0, Kx ) do which, at 4 bytes per (floating point) value, are 4MB and
2: ix ← ox + kx
3: for in loc ∈ [row pointers[ix], row pointers[ix + 1]) do 9MB respectively, yet L2 cache sizes of Intel machines are
4: ic ← columns[ix][`] commonly only 1MB. Keeping kernels in L2 is critical for
5: for oc ∈ [0, OC ) do performance since every iteration of the inner-most loop
6: O[oc][ox] += values[in loc] ∗ K[oc][ic][kx] accesses a different kernel value (line 4). To ensure that ker-
7: end for nels remain hot, we use a combination of two techniques.
8: end for
9: end for Firstly, if the kernel dimensions are such that the values
associated with a single spatial pixel do fit in cache (i.e.
4IC OC < 1MB), then we can order the outer loops of
Algorithm 2 AVX Sparse Convolution Algorithm 2 so that the loops over the spatial dimensions of
1: for (ox, kx) ∈ [0, Ox ) × [0, Kx ) do the kernels is outermost. That is, for each of the Kx spatial
(0) (^
O )
2: initialize vout · · · voutC to 0 coordinates of the kernel, we will compute partial outputs by
3: ix ← ox + kx performing all of the multiply-adds involving kernel values
4: for in loc ∈ [row pointers[ix], row pointers[ix + 1]) do
5: vin ← vbroadcast(values[in loc]) that share that coordinate, ultimately accumulating all of
6: ic ← columns[ix][`] the partial results together. Thus, we only need to move the
7: for oc = 1 to ÔC do kernel values into L2 cache once and reuse them from there,
8: // K[oc][ic][kx] points to a kernel vector in memory at the cost of a few extra reads and writes of the (typically
9: voc oc
out ← vfmadd(vin , K[oc][ic][kx], vout ) smaller) inputs and outputs.
10: end for
11: end for Hot compression. To save on expensive memory accesses,
(0) (^
O )
12: Store vout · · · voutC to memory locations O[0 · · · ÔC ][ox] we ensure that the results of the input pre-compression are
13: end for used before being evicted from cache. To accomplish this,
we subdivide the Sparse Convolutional operation into sub-
Next, we discuss a number of optimizations which we apply tasks, each of which contains a block of data which fits
to our framework, focusing on CPU-based implementations. entirely in cache. Then, we can process each block by first
Multicore. Our sparse convolution framework is embarrass- running the CSR compressor on only that block, and then
ingly parallel: we partition O into blocks O1 , . . . , On and immediately applying Algorithm 2 to the resulting com-
assign blocks to n threads. Each thread fully computes its pressed data while it is still hot.
corresponding block of O. In order to avoid many threads 5. Experimental Results
having to load the same input data, we minimize the over-
Goals, Setup and Tasks. We experimentally validate our
laps between pre-images of the blocks Oi . Observing that
methods by applying them to a range of classic convolu-
two elements of O with different channel coordinates, but
tional models for image classification. We aim to determine
which share the same spatial coordinate, have identical pre-
the extent to which our techniques can boost activation spar-
images, we partition O spatially as much as possible, rather
sity, and the impact this has in terms of layer-wise and
than partitioning along channels. In some cases, image sizes
end-to-end inference speedup on real models and tasks,
are too small to get spatial partitions with enough work to
compared against optimized baselines which do not lever-
saturate threads, in which case we can choose to additionally
age activation sparsity. We focus on the ResNet (He et al.,
partition along channels after all.
2016), and Mobilenet (Howard et al., 2017) architectures,
Input Pre-loading. We observe that the input broadcast on applied to ImageNet ILSVRC2012 (Deng et al., 2009).
line 4 of Algorithm 2 has the potential to be high latency
We implemented our thresholding methods in Pytorch, mak-
since it must read from memory. Fortunately, modern CPUs
ing use of the provided pre-trained models as starting points
can hide the latency of such memory accesses via pipelining
for the regularization and thresholding procedures. We im-
them, i.e. executing instructions which do not depend on
plemented our sparse-input convolution in C++, on top of
the results of the load while waiting for the values from
an existing fully-dense baseline framework, which uses
memory to become available. In order to take full advantage
optimized direct convolution or general matrix multiply
of this pipeline, we re-order the memory loads to be as
(GeMM) operations for all layers. This framework gets an
early as possible, by issuing each broadcast instruction s
(a) Input Activation Map Sparsities for ResNet18/ImageNet. (b) Layer Latencies and Speedups for ResNet18/ImageNet.
Figure 4. Layerwise sparsities and speedups for ResNet18/ImageNet. The sparsified variant achieves significant speedups since it
significantly reduces overhead in the more computationally-heavy layers.
(a) Activation Map Sparsities for Mobilenet/ImageNet. (b) Layerwise Speedups for Mobilenet/ImageNet. Note: even-
numbered layers are depthwise convolutions to which we do not
apply our sparse algorithms.
Figure 5. A sample of our results for the Mobilenet model trained on the ImageNet dataset.
ONNX file (Bai et al., 2019) describing the network archi- Model Baseline L1 Hoyer Boosted Hoyer
tecture as an input, parses and optimizes the graph, and then ResNet18 53% 55% 62 % 67 %
generates the Just-in Time compiled (JITted) assembly code
ResNet50 53% 54% 61 % 65 %
for each layer. This baseline framework is well-optimized:
as evident in Table 2, inference numbers using Dense match Mobilenet 48% 52 % 58 % 60 %
a state-of-the-art industrial solutions (MXNet 1.3 (Chen
et al., 2015) using Intel MKL-DNN for CPU inference, and Table 1. Average activation sparsities using different methods.
Pytorch/CUDA10 for GPU inference).
We perform our performance experiments on an AWS tivation sparsity of 1) the baseline pre-trained models from
C5.12xlarge instance which sports an Intel Cascade Lake Pytorch, 2) the L1-regularized models following the opti-
chip with 24 physical cores, has 96 GB of memory and runs mized hyperparameter values from (Georgiadis, 2019), 3)
Ubuntu 18.04, as well as on a local server with the same the (square) Hoyer-regularized models whose hyperparam-
configuration. For GPU inference, we used P2.xlarge in- eters we identify through grid search, 4) the dynamically-
stance with one NVIDIA K80 GPU, running Pytorch 1.2.0 boosted variants of the Hoyer-regularized models, following
with CUDA10, using 16bit half precision. the algorithm from Section 3. For methods 2)–4) we per-
formed fine-tuning for 20 epochs to recover or even increase
Boosting Activation Sparsity. Our first experiments evalu- accuracy under regularization. ((Georgiadis, 2019) recom-
ate the ability of various methods to induce a large subset of mends 90 epochs of retraining with regularization, but we
activations to be zero. In particular, we study the average ac- were able to reproduce their results with this compressed
Model MXNet+MKL-DNN NVIDIA K80 Dense Natural Sparsity Hoyer Reg. Boosted Hoyer
ResNet18 113.41 100.16 107.25 68.40 63.67 60.92 (1.86x)
ResNet50 317.49 350.2 256.06 194.86 183.21 180.5 (1.75x)
Mobilenet 88.55 114.3 62.64 58.93 51.80 49.77 (1.78x)
Table 2. Average inference running times in ms for batch size 64 on various models and variants (AWS C5.24xlarge for CPU and AWS
P2.xlarge for GPU). Speedups are presented in brackets relative to the state-of-the art MXNet/MKL-DNN CPU inference framework.
schedule.) We present average values, with the note that Table 3 presents average natural and boosted sparsities for
results are extremely stable across sample batches (stan- deep and wide residual models. For these experiments, we
dard deviation < 1%). For all of the models presented, the found that the MXNet benchmark does not efficiently sup-
accuracy loss relative to the Torchvision baseline is < 0.3%. port the wide/deep models we consider; we therefore present
speedups relative to our own dense implementation, which
Table 1 presents average results for each technique, while
provides a more competitive baseline. All experiments are
Table 3 presents baseline and Boosted Hoyer results for
executed at 12 threads. (Trends for other batch sizes and
wide and deep models.
thread counts are similar, and therefore omitted.)
A sample of layer-wise results are presented in Figures 4
Generally, we find that activation sparsity can lead to sig-
(ResNet18) and 5 (Mobilenet) , while the average sparsities
nificant and consistent speedups across the layers, roughly
are presented in Table 1. One salient trend is that Hoyer
proportional to the amount of activation sparsity. A signif-
and Dynamic Boosting are able to consistently boost sparsi-
icant fraction of the speedup can already be obtained on
ties, significantly beyond the baseline or L1 regularization.
top of the pretrained models, by exploiting their natural
For instance, for the input layer of Mobilenet, they both
sparsity. At the same time, regularization and boosting con-
reduce density by ∼ 2× versus the natural sparsity, and
sistently provide additional speedups, in particular for the
by 50% versus L1 regularization. We note that, across all
computationally-heavy but accurate wide/deep models. In
layers of all networks, there are only two layers where L1
fact, fortunately, the layers with the largest computational
regularization provides higher sparsity (the input layers of
overhead have high input sparsity (especially with boosting).
the residual networks), and by a very narrow margin. The
second noticeable trend is that Dynamic Boosting can con- The end-to-end results are summarized in the last column of
sistently reduce the density of activations without accuracy each table. Experiments confirm that Hoyer with Dynamic
loss: for Mobilenet, these margins are almost negligible, Boosting consistently provides the highest speedups for
but they become significant for the residual models, where ResNets and Mobilenet, in the range of 1.67x (ResNet50)
boosting almost doubles the sparsity improvement of the to 2.57x (WideResNet101), relative to our optimized dense
best regularizer (Hoyer). A third observation (Table 3) is implementation.
that our methods are especially effective in the context of
accurate but heavy wide and deep models, where activation 6. Conclusions and Future Work
density can be effectively halved through boosting, without
accuracy loss. We have presented a framework for augmenting and leverag-
ing activation sparsity in DNNs for computational speedups.
Model Natural AS Boosted Speedup Our framework leverages two new techniques: on the ma-
chine learning side, a set of regularization and thresholding
ResNet50 53% 65% 1.67x
tools to boost the average and peak activation sparsity of
2x Wide ResNet50 58% 81% 2.04x networks; on the technical side, an algorithm for efficiently
ResNet101 57% 79% 1.53x performing convolutions on sparse inputs, along with its op-
2x Wide ResNet101 63% 84% 2.57x timized implementation in C++. Our techniques are imple-
mented in an extensible, modular framework, which could
Table 3. Average activation sparsity and speedup. be leveraged by researchers wishing to extend our results
for both creating models with higher activation sparsity, or
End-to-End Inference Performance. We now turn our at- faster algorithms for sparse convolutions. Our framework is
tention to how well the activation sparsity numbers we saw particularly well-suited for speeding-up inference on accu-
in the previous section translate to actual speedups in end- rate, but heavy, deep and wide models.
to-end inference on the respective models. Figures 4 and 5
In future work, we plan to explore additional strategies
presents execution times layer-by-layer, whereas Tables 2
for memory-bound layers, and investigate the impact of
and 3 presents average total execution times for the models
quantization on sparsity on computational speedups.
at batch size 64 under various configurations and speedup.
References Gray, S., Radford, A., and Kingma, D. P. Gpu kernels for
block-sparse weights. arXiv preprint arXiv:1711.09224,
Agostinelli, F., Hoffman, M., Sadowski, P., and Baldi, P.
2017.
Learning activation functions to improve deep neural
networks. arXiv preprint arXiv:1412.6830, 2014. Gudovskiy, D., Hodgkinson, A., and Rigazio, L. Dnn feature
Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, map compression using learned representation over gf (2).
N. E., and Moshovos, A. Cnvlutin: Ineffectual-neuron- In Proceedings of the European Conference on Computer
free deep neural network computing. ACM SIGARCH Vision (ECCV), pp. 0–0, 2018.
Computer Architecture News, 44(3):1–13, 2016.
Han, S., Pool, J., Tran, J., and Dally, W. Learning both
Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused- weights and connections for efficient neural network. In
layer cnn accelerators. In The 49th Annual IEEE/ACM Advances in neural information processing systems, pp.
International Symposium on Microarchitecture, pp. 22. 1135–1143, 2015.
IEEE Press, 2016.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz,
Bai, J., Lu, F., Zhang, K., et al. Onnx: Open neural M. A., and Dally, W. J. Eie: efficient inference engine
network exchange. https://github.com/onnx/ on compressed deep neural network. In 2016 ACM/IEEE
onnx, 2019. 43rd Annual International Symposium on Computer Ar-
chitecture (ISCA), pp. 243–254. IEEE, 2016.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao,
T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
and efficient machine learning library for heterogeneous ing for image recognition. In Proceedings of the IEEE
distributed systems. arXiv preprint arXiv:1512.01274, conference on computer vision and pattern recognition,
2015. pp. 770–778, 2016.
Chen, X. Escort: Efficient sparse convolutional neural Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
networks on gpus. arXiv preprint arXiv:1802.10280, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:
2018. Efficient convolutional neural networks for mobile vision
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, applications. arXiv preprint arXiv:1704.04861, 2017.
L. Imagenet: A large-scale hierarchical image database.
Hoyer, P. O. Non-negative matrix factorization with sparse-
In 2009 IEEE conference on computer vision and pattern
ness constraints. Journal of machine learning research, 5
recognition, pp. 248–255. Ieee, 2009.
(Nov):1457–1469, 2004.
Dong, X., Liu, L., Li, G., Li, J., Zhao, P., Wang, X.,
and Feng, X. Exploiting the input sparsity to accel- Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
erate deep neural networks: poster. In Hollingsworth, classification with deep convolutional neural networks.
J. K. and Keidar, I. (eds.), Proceedings of the 24th In Advances in neural information processing systems,
ACM SIGPLAN Symposium on Principles and Prac- pp. 1097–1105, 2012.
tice of Parallel Programming, PPoPP 2019, Washing-
Krizhevsky, A., Nair, V., and Hinton, G. The cifar-10 dataset.
ton, DC, USA, February 16-20, 2019, pp. 401–402.
online: http://www. cs. toronto. edu/kriz/cifar. html, 55,
ACM, 2019. ISBN 978-1-4503-6225-2. doi: 10.
2014.
1145/3293883.3295713. URL https://doi.org/
10.1145/3293883.3295713. Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf,
Frankle, J. and Carbin, M. The lottery ticket hypothesis: H. P. Pruning filters for efficient convnets. arXiv preprint
Finding sparse, trainable neural networks. arXiv preprint arXiv:1608.08710, 2016.
arXiv:1803.03635, 2018.
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C.
Gale, T., Elsen, E., and Hooker, S. The state of sparsity in Learning efficient convolutional networks through net-
deep neural networks. arXiv preprint arXiv:1902.09574, work slimming. In Proceedings of the IEEE International
2019. Conference on Computer Vision, pp. 2736–2744, 2017.
Georgiadis, G. Accelerating convolutional neural networks Luo, J.-H., Wu, J., and Lin, W. Thinet: A filter level pruning
via activation map compression. In Proceedings of the method for deep neural network compression. In Proceed-
IEEE Conference on Computer Vision and Pattern Recog- ings of the IEEE international conference on computer
nition, pp. 7085–7095, 2019. vision, pp. 5058–5066, 2017.
Mellempudi, N., Kundu, A., Mudigere, D., Das, D., Kaul, Wu, X., Wu, Y., and Zhao, Y. High performance binarized
B., and Dubey, P. Ternary neural networks with fine- neural networks trained on the imagenet classification
grained quantization. arXiv preprint arXiv:1705.01462, task. CoRR, abs/1604.03058, 2016. URL http://
2017. arxiv.org/abs/1604.03058.
Mishra, A. K., Nurvitadhi, E., Cook, J. J., and Marr, Yang, H., Wen, W., and Li, H. Deephoyer: Learning sparser
D. WRPN: wide reduced-precision networks. CoRR, neural network with differentiable scale-invariant sparsity
abs/1709.01134, 2017. URL http://arxiv.org/ measures. arXiv preprint arXiv:1908.09979, 2019.
abs/1709.01134.
Zagoruyko, S. and Komodakis, N. Wide Residual Networks.
Molchanov, D., Ashukha, A., and Vetrov, D. Variational ArXiv e-prints, May 2016.
dropout sparsifies deep neural networks. In Proceed-
Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang,
ings of the 34th International Conference on Machine
C. Zipml: Training linear models with end-to-end low
Learning-Volume 70, pp. 2498–2507. JMLR. org, 2017.
precision, and a little bit of deep learning. In International
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkate- Conference on Machine Learning, pp. 4035–4043, 2017.
san, R., Khailany, B., Emer, J., Keckler, S. W., and Dally,
Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary
W. J. Scnn: An accelerator for compressed-sparse con-
quantization. CoRR, abs/1612.01064, 2016. URL http:
volutional neural networks. In 2017 ACM/IEEE 44th
//arxiv.org/abs/1612.01064.
Annual International Symposium on Computer Architec-
ture (ISCA), pp. 27–40. IEEE, 2017. Zhu, M. and Gupta, S. To prune, or not to prune: exploring
the efficacy of pruning for model compression. arXiv
Park, J., Li, S., Wen, W., Tang, P. T. P., Li, H., Chen, Y., preprint arXiv:1710.01878, 2017.
and Dubey, P. Faster cnns with direct sparse convolutions
and guided pruning. arXiv preprint arXiv:1608.01409,
2016a.
Park, J., Li, S. R., Wen, W., Li, H., Chen, Y., and Dubey,
P. Holistic sparsecnn: Forging the trident of accuracy,
speed, and size. arXiv preprint arXiv:1608.01409, 1(2),
2016b.
Polino, A., Pascanu, R., and Alistarh, D. Model compres-

sion via distillation and quantization. arXiv preprint
arXiv:1802.05668, 2018.
Rhu, M., O’Connor, M., Chatterjee, N., Pool, J., Kwon, Y.,
and Keckler, S. W. Compressing dma engine: Leverag-
ing activation sparsity for training deep neural networks.
In 2018 IEEE International Symposium on High Perfor-
mance Computer Architecture (HPCA), pp. 78–91. IEEE,
2018.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., et al. Mastering the
game of go with deep neural networks and tree search.
Nature, 529(7587):484–489, 2016.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention
is all you need. arXiv preprint arXiv:1706.03762, 2017.
Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu,
Q., and Wang, Y. Intel math kernel library. In High-
Performance Computing on the Intel R Xeon Phi, pp.
167–188. Springer, 2014.

Kurtz 20 A

Uploaded by

Copyright:

Available Formats

Kurtz 20 A

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kurtz 20 A

Uploaded by

Copyright:

Available Formats

Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference

the exact root cause of this phenomenon is difficult, but we

Figure 3. Channel-wise boosting results for a fixed “highly-sensitive” layer of ResNet18.

Polino, A., Pascanu, R., and Alistarh, D. Model compres-

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

You might also like