NetAdapt
1 MIT, 2 Google Inc.
{tjy,sze}@mit.edu, {howarda,bochen,andypassion,ago,sandler,hadam}@google.com
1 Introduction
[Fig. 1: NetAdapt takes a pretrained network, a resource budget, and a target platform; it iteratively generates network proposals (A ... Z), evaluates them with empirical measurements of direct metrics (e.g., latency, energy) on the platform, and outputs an adapted network (with modified numbers of layers and filters) that meets the budget.]
Indirect metrics, such as the number of MACs, serve only as proxies for direct metrics such as latency and energy consumption. The relationship between an indirect metric and the corresponding direct metric can be highly non-linear and platform-dependent, as observed by [15, 25, 26]. In this work, we will also demonstrate empirically that a network with fewer MACs can be slower when actually running on mobile devices; specifically, we will show that a network with 19% fewer MACs incurs 29% longer latency in practice (see Table 1).
There are two common approaches to designing efficient network architec-
tures. The first is designing a single architecture with no regard to the underlying
platform. It is hard for a single architecture to run optimally on all the platforms
due to the different platform characteristics. For example, the fastest architec-
ture on a desktop GPU may not be the fastest one on a mobile CPU with the
same accuracy. Moreover, there is little guarantee that the architecture could
meet the resource budget (e.g., latency) on all platforms of interest. The second
approach is manually crafting architectures for a given target platform based
on the platform’s characteristics. However, this approach requires deep knowl-
edge about the implementation details of the platform, including the toolchains,
the configuration and the hardware architecture, which are generally unavailable
given the proprietary nature of hardware and the high complexity of modern sys-
tems. Furthermore, manually designing a different architecture for each platform
can be taxing for researchers and engineers.
In this work, we propose a platform-aware algorithm, called NetAdapt, to
address the aforementioned issues and facilitate platform-specific DNN deployment.
2 Related Work
There is a large body of work that aims to simplify DNNs. We refer the readers
to [21] for a comprehensive survey, and summarize the main approaches below.
The most related works are pruning-based methods. [6, 14, 16] aim to remove
individual redundant weights from DNNs. However, most platforms cannot fully
take advantage of unstructured sparse filters [26]. Hu et al. [10] and Srinivas et
al. [20] focus on removing entire filters instead of individual weights. The draw-
back of these methods is the requirement of manually choosing the compression
rate for each layer. MorphNet [5] leverages the sparsifying regularizers to auto-
matically determine the layerwise compression rate. ADC [8] uses reinforcement
learning to learn a policy for choosing the compression rates. The crucial dif-
ference between all the aforementioned methods and ours is that they are not
guided by the direct metrics, and thus may lead to sub-optimal performance, as
we see in Sec. 4.3.
Energy-aware pruning [25] uses an energy model [24] and incorporates the
estimated energy numbers into the pruning algorithm. However, this requires de-
signing models to estimate the direct metrics of each target platform, which re-
quires detailed knowledge of the platform including its hardware architecture [3],
and the network-to-array mapping used in the toolchain [2]. NetAdapt does not
have this requirement since it can directly use empirical measurements.
DNNs can also be simplified by approaches that involve directly designing ef-
ficient network architectures, decomposition or quantization. MobileNets [9, 18]
and ShuffleNets [27] provide efficient layer operations and reference architecture
design. Layer-decomposition-based algorithms [13, 23] exploit matrix decompo-
sition to reduce the number of operations. Quantization [11, 12, 17] reduces
the complexity by decreasing the computation accuracy. The proposed algo-
rithm, NetAdapt, is complementary to these methods. For example, NetAdapt
can adapt MobileNets to further push the frontier of efficient networks as shown
in Sec. 4 even though MobileNets are more compact and much harder to simplify
than the other larger networks, such as VGG [19].
3 Methodology: NetAdapt
We propose an algorithm, called NetAdapt, that will allow a user to automat-
ically simplify a pretrained network to meet the resource budget of a platform
while maximizing the accuracy. NetAdapt is guided by direct metrics for resource
consumption, and the direct metrics are evaluated by using empirical measure-
ments, thus eliminating the requirement of detailed platform-specific knowledge.
Algorithm 1: NetAdapt
Input: Pretrained Network: Net_0 (with K CONV and FC layers), Resource Budget: Bud, Resource Reduction Schedule: ΔR_i
Output: Adapted Network Meeting the Resource Budget: N̂et
1  i = 0;
2  Res_i = TakeEmpiricalMeasurement(Net_i);
3  while Res_i > Bud do
4      Con = Res_i − ΔR_i;
5      for k from 1 to K do
           /* TakeEmpiricalMeasurement is also called inside ChooseNumFilters for
              choosing the correct number of filters that satisfies the constraint
              (i.e., the current budget). */
6          N_Filt_k, Res_Simp_k = ChooseNumFilters(Net_i, k, Con);
7          Net_Simp_k = ChooseWhichFilters(Net_i, k, N_Filt_k);
8          Net_Simp_k = ShortTermFineTune(Net_Simp_k);
9      Net_{i+1}, Res_{i+1} = PickHighestAccuracy(Net_Simp_{1..K}, Res_Simp_{1..K});
10     i = i + 1;
11 N̂et = LongTermFineTune(Net_i);
12 return N̂et;
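The optimization problem that Algorithm 1 targets is referred to in the text as Eq. 1 and Eq. 2 but does not appear in this extract; a reconstruction consistent with the definitions used below (Acc for accuracy, Res_j for the j-th resource, Bud_j for its budget, and ΔR_{i,j} for the per-iteration reduction) is:

  \max_{Net} \; Acc(Net) \quad \text{subject to} \quad Res_j(Net) \le Bud_j, \quad j = 1, \dots, m \tag{1}

and, per iteration,

  \max_{Net_i} \; Acc(Net_i) \quad \text{subject to} \quad Res_j(Net_i) \le Res_j(Net_{i-1}) - \Delta R_{i,j}, \quad j = 1, \dots, m \tag{2}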
where Net_i is the network generated by the i-th iteration, and Net_0 is the initial pretrained network. As the number of iterations increases, the constraints (i.e., the current resource budget Res_j(Net_{i-1}) − ΔR_{i,j}) gradually become tighter. ΔR_{i,j}, which is larger than zero, indicates how much the constraint tightens for the j-th resource in the i-th iteration and can vary from iteration to iteration. This is referred to as the "resource reduction schedule", which is similar to the concept of a learning rate schedule. The algorithm terminates when Res_j(Net_{i-1}) − ΔR_{i,j} is equal to or smaller than Bud_j for every resource type. It outputs the final adapted network and can also generate a sequence of simplified networks (i.e., the highest-accuracy network from each iteration Net_1, ..., Net_i) to provide the efficient frontier of accuracy and resource consumption trade-offs.
For simplicity, we assume that we only need to meet the budget of one resource,
specifically latency. One method to reduce the latency is to remove filters from
the convolutional (CONV) or fully-connected (FC) layers. While there are other
ways to reduce latency, we will use this approach to demonstrate NetAdapt.
The NetAdapt algorithm is detailed in pseudo code in Algorithm 1 and in
Fig. 2. Each iteration solves Eq. 2 by reducing the number of filters in a single
CONV or FC layer (the Choose # of Filters and Choose Which Filters
blocks in Fig. 2). The number of filters to remove from a layer is guided by
empirical measurements. NetAdapt removes entire filters instead of individual
weights because most platforms can take advantage of removing entire filters, and this strategy allows reducing both filters and feature maps, which play an important role in resource consumption [25]. The simplified network is then fine-tuned for a short length of time in order to restore some accuracy (the Short-Term Fine-Tune block).

Fig. 2. This figure visualizes the algorithm flow of NetAdapt. At each iteration, NetAdapt decreases the resource consumption by simplifying (i.e., removing filters from) one layer. In order to maximize accuracy, it tries to simplify each layer individually and picks the simplified network that has the highest accuracy. Once the target budget is met, the chosen network is then fine-tuned again until convergence.
In each iteration, the previous three steps (highlighted in bold) are applied on
each of the CONV or FC layers individually³. As a result, NetAdapt generates
K (i.e., the number of CONV and FC layers) network proposals in one iteration,
each of which has a single layer modified from the previous iteration. The network
proposal with the highest accuracy is carried over to the next iteration (the
Pick Highest Accuracy block). Finally, once the target budget is met, the
chosen network is fine-tuned again until convergence (the Long-Term Fine-
Tune block).
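To make this flow concrete, the following is a minimal Python sketch of the loop described above; all helper callables are placeholders for the blocks in Fig. 2 and are not an implementation provided by the paper.

def netadapt(net, budget, measure, choose_num_filters, choose_which_filters,
             short_term_fine_tune, long_term_fine_tune, evaluate,
             init_reduction, decay_rate, num_layers):
    """Minimal sketch of Algorithm 1 for a single resource (e.g., latency).

    Placeholder callables (assumed interfaces):
      measure(net) -> latency from empirical measurement (or look-up tables)
      choose_num_filters(net, k, constraint) -> (num_filters, resource)
      choose_which_filters(net, k, num_filters) -> simplified network
      short_term_fine_tune / long_term_fine_tune(net) -> fine-tuned network
      evaluate(net) -> accuracy on the holdout set
    """
    i = 0
    res = measure(net)
    while res > budget:
        # Tighten the constraint; delta_r follows the resource reduction schedule.
        delta_r = init_reduction * (decay_rate ** i)
        constraint = res - delta_r

        proposals = []
        for k in range(num_layers):
            # Simplify one layer at a time to generate K network proposals.
            num_filters, prop_res = choose_num_filters(net, k, constraint)
            proposal = choose_which_filters(net, k, num_filters)
            proposal = short_term_fine_tune(proposal)
            proposals.append((evaluate(proposal), proposal, prop_res))

        # Carry the highest-accuracy proposal over to the next iteration.
        _, net, res = max(proposals, key=lambda p: p[0])
        i += 1

    # Final step: fine-tune the chosen network until convergence.
    return long_term_fine_tune(net)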
This section describes the key blocks in the NetAdapt algorithm (Fig. 2).
Choose Number of Filters This step focuses on determining how many
filters to preserve in a specific layer based on empirical measurements. NetAdapt
gradually reduces the number of filters in the target layer and measures the
resource consumption of each of the simplified networks. The maximum number of filters that can satisfy the current resource constraint will be chosen. Note that when some filters are removed from a layer, the associated channels in the following layers should also be removed. Therefore, the change in the resource consumption of other layers needs to be factored in.

³ The algorithm can also be applied to a group of multiple layers as a single unit (instead of a single layer). For example, in ResNet [7], we can treat a residual block as a single unit to speed up the adaptation process.

Fig. 3. This figure illustrates how layer-wise look-up tables are used for fast resource consumption estimation (per-layer latencies, indexed by the number of input channels and filters, are summed to estimate the network latency).
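As a concrete illustration of Fig. 3, the sketch below estimates network latency by summing per-layer look-up-table entries; the (input channels, filters) key format and the example values are assumptions for illustration, not the paper's data.

def estimate_latency(layer_configs, lookup_tables):
    """Estimate network latency from layer-wise look-up tables (cf. Fig. 3).

    layer_configs:  one (in_channels, num_filters) pair per layer.
    lookup_tables:  one dict per layer mapping (in_channels, num_filters)
                    to a measured latency in ms.
    """
    # The network latency is approximated by summing per-layer latencies,
    # e.g., 6 ms (layer 1) + 4 ms (layer 2) = 10 ms in Fig. 3.
    return sum(table[config] for config, table in zip(layer_configs, lookup_tables))

# Hypothetical usage with made-up table entries:
tables = [{(3, 4): 6.0}, {(4, 6): 4.0}]
print(estimate_latency([(3, 4), (4, 6)], tables))  # 10.0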
Choose Which Filters This step chooses which filters to preserve based on
the architecture from the previous step. There are many methods proposed in
the literature, and we choose the magnitude-based method to keep the algorithm
simple. In this work, the N filters that have the largest `2-norm magnitude will
be kept, where N is the number of filters determined by the previous step. More
complex methods can be adopted to increase the accuracy, such as removing the
filters based on their joint influence on the feature maps [25].
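A minimal NumPy sketch of this magnitude-based selection; the (num_filters, in_channels, kH, kW) weight layout is an assumption made for illustration.

import numpy as np

def keep_largest_filters(weights, n):
    """Return the indices of the N filters with the largest L2-norm magnitude.

    weights: 4-D array of shape (num_filters, in_channels, kH, kW).
    n:       number of filters to keep, from the Choose Number of Filters step.
    """
    # L2 norm of each filter, computed over all of its weights.
    norms = np.linalg.norm(weights.reshape(weights.shape[0], -1), axis=1)
    # Indices of the n filters with the largest norms.
    return np.argsort(norms)[-n:]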
Short-/Long-Term Fine-Tune Both the short-term fine-tune and long-
term fine-tune steps in NetAdapt involve network-wise end-to-end fine-tuning.
Short-term fine-tune has fewer iterations than long-term fine-tune.
At each iteration of the algorithm, we fine-tune the simplified networks with a relatively small number of iterations (i.e., short-term) to regain accuracy; this can be done in parallel or in sequence. This step is especially important while adapting small
networks with a large resource reduction because otherwise the accuracy will
drop to zero, which can cause the algorithm to choose the wrong network pro-
posal.
As the algorithm proceeds, the network is continuously trained but does not
converge. Once the final adapted network is obtained, we fine-tune the network
with more iterations until convergence (i.e., long-term) as the final step.
Fig. 4. The comparison between the estimated latency (using layer-wise look-up tables)
and the real latency on a single large core of Google Pixel 1 CPU while adapting the
100% MobileNetV1 with the input resolution of 224 [9].
4 Experiment Results
In this section, we apply the proposed NetAdapt algorithm to MobileNets [9, 18],
which are designed for mobile applications, and experiment on the ImageNet
dataset [4]. We did not apply NetAdapt on larger networks like ResNet [7] and
VGG [19] because networks become more difficult to simplify as they become
smaller; these networks are also seldom deployed on mobile platforms. We bench-
mark NetAdapt against three state-of-the-art network simplification methods:
– Multipliers [9] are simple but effective methods for simplifying networks.
Two commonly used multipliers are the width multiplier and the resolu-
tion multiplier; they can also be used together. Width multiplier scales the
number of filters by a percentage across all convolutional (CONV) and fully-
connected (FC) layers, and resolution multiplier scales the resolution of the
input image. We use the notation “50% MobileNetV1 (128)” to denote ap-
plying a width multiplier of 50% on MobileNetV1 with the input image
resolution of 128.
We perform most of the experiments and studies on MobileNetV1 and detail the
settings in this section.
NetAdapt Configuration MobileNetV1 [9] is based on depthwise separable convolutions, which factorize an m × m standard convolution layer into an m × m depthwise layer and a 1 × 1 standard convolution layer called a pointwise layer. In the experiments, we adapt each depthwise layer together with its corresponding pointwise layer and choose the filters to keep based on the pointwise layer. When adapting the small MobileNetV1 (50% MobileNetV1 (128)), the latency reduction (ΔR_{i,j} in Eq. 2) starts at 0.5 ms and decays at the rate of 0.96 per iteration. When adapting
other networks, we use the same decay rate but scale the initial latency reduction
proportional to the latency of the initial pretrained network.
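As a concrete reading of this schedule (see also Table 2), the per-iteration latency reduction can be computed as below; the exponential form is our interpretation of "decays at the rate of 0.96 per iteration".

def latency_reduction(i, init_ms=0.5, decay_rate=0.96):
    """Latency reduction (delta R) used at iteration i, in ms.

    Starts at init_ms and shrinks multiplicatively each iteration,
    e.g., 0.5, 0.48, 0.4608, ... for the small MobileNetV1 setting.
    """
    return init_ms * (decay_rate ** i)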
Network Training We preserve ten thousand images from the training
set, ten images per class, as the holdout set. The new training set without the
holdout images is used to perform short-term fine-tuning, and the holdout set is
used to pick the highest accuracy network out of the simplified networks at each
iteration. The whole training set is used for the long-term fine-tuning, which is
performed once in the last step of NetAdapt.
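A minimal sketch of this split, assuming the training data is available as (image_path, label) pairs; the representation is ours, not the paper's.

from collections import defaultdict

def split_holdout(train_samples, per_class=10):
    """Hold out the first `per_class` images of each class (10 x 1000 = 10,000
    images for ImageNet); the remaining images are used for short-term fine-tuning."""
    counts = defaultdict(int)
    holdout, remaining = [], []
    for path, label in train_samples:
        if counts[label] < per_class:
            holdout.append((path, label))
            counts[label] += 1
        else:
            remaining.append((path, label))
    return holdout, remaining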
Because the training configuration can have a large impact on the accuracy, we apply the same training configuration to all the networks unless otherwise stated, for a fairer comparison. We adopt the same training configuration as MorphNet [5] (except that the batch size is 128 instead of 96). The learning rate for the long-term fine-tuning is 0.045 and that for the short-term fine-tuning is 0.0045. This configuration improves the ADC network's top-1 accuracy by 0.3% and almost all multiplier networks' top-1 accuracy by up to 3.8%, except for one data point, whose accuracy is reduced by 0.2%. We use these numbers in the following
analysis. Moreover, all accuracy numbers are reported on the validation set to
show the true performance.
Mobile Inference and Latency Measurement We use Google’s Tensor-
Flow Lite engine [22] for inference on a mobile CPU and Qualcomm’s Snap-
dragon Neural Processing Engine (SNPE) for inference on a mobile GPU. For
experiments on mobile CPUs, the latency is measured on a single large core of
Google Pixel 1 phone. For experiments on mobile GPUs, the latency is measured on the mobile GPU of Samsung Galaxy S8 with SNPE's benchmarking tool. For each latency number, we report the median of 11 latency measurements.

Fig. 5. The figure compares NetAdapt (adapting the small MobileNetV1) with the multipliers [9] and MorphNet [5] on a mobile CPU of Google Pixel 1. (Plot annotations: up to 1.7× faster with 0.3% higher accuracy, and 1.6× faster with 0.3% higher accuracy, relative to the baselines; axes: top-1 accuracy vs. latency in ms.)
Fig. 6. The figure compares NetAdapt (adapting the large MobileNetV1) with the multipliers [9] and ADC [8] on a mobile CPU of Google Pixel 1. Moreover, the accuracy of the adapted networks can be further increased by up to 1.3% through using a better training configuration (simply adding dropout and label smoothing). (Plot annotations: 1.4× faster with 0.2% higher accuracy, and 1.2× faster with 0.4% higher accuracy; axes: top-1 accuracy vs. latency in ms.)
Fig. 7. This figure compares NetAdapt (adapting the large MobileNetV1) with the multipliers [9] and ADC [8] on a mobile GPU of Samsung Galaxy S8. Moreover, the accuracy of the adapted networks can be further increased by up to 1.3% through using a better training configuration (simply adding dropout and label smoothing). (Plot annotations: 1.2× faster with 0.2% higher accuracy, and 1.1× faster with 0.1% higher accuracy; axes: top-1 accuracy vs. latency in ms.)
Fig. 8. The accuracy of different short-term fine-tuning iterations when adapting the small MobileNetV1 (without long-term fine-tuning) on a mobile CPU of Google Pixel 1. Zero iterations means no short-term fine-tuning.

Fig. 9. The comparison between before and after long-term fine-tuning when adapting the small MobileNetV1 on a mobile CPU of Google Pixel 1. Although the short-term fine-tuning preserves the accuracy well, the long-term fine-tuning gives the extra 3.4% on average (from 1.8% to 4.5%).
Initialization (ms) | Decay Rate | # of Total Iterations | Top-1 Accuracy (%) | Latency (ms)
0.5 | 0.96 | 28 | 47.7 | 4.63
0.5 | 1.0  | 20 | 47.4 | 4.71
0.8 | 0.95 | 20 | 46.7 | 4.65
Table 2. The influence of resource reduction schedules.
Fig. 10. NetAdapt and the multipliers generate different simplified networks when adapting the small MobileNetV1 to match the latency of 25% MobileNetV1 (128). (Axes: number of filters vs. Conv2d layer index.)
As shown in Fig. 8, skipping short-term fine-tuning (zero iterations) gives poor performance. After fine-tuning a network for a short amount of time (ten thousand iterations), the accuracy is always kept above 20%, which allows the algorithm to make a better decision. Although further increasing the number of iterations improves the accuracy, we find that using forty thousand iterations leads to a good accuracy-versus-speed trade-off for the small MobileNetV1.
Fig. 10 reveals two interesting patterns in the adapted architecture. First, NetAdapt removes more filters in layers 7 to 10, but fewer in layer 6.
Since the feature map resolution is reduced in layer 6 but not in layers 7 to 10,
we hypothesize that when the feature map resolution is reduced, more filters are
needed to avoid creating an information bottleneck.
The second observation is that NetAdapt keeps more filters in layer 13 (i.e., the last CONV layer). One possible explanation is that the ImageNet dataset contains one thousand classes, so more feature maps are needed by the last FC layer to perform the classification correctly.
5 Conclusion
In summary, we proposed an automated algorithm, called NetAdapt, to adapt a
pretrained network to a mobile platform given a real resource budget. NetAdapt
can incorporate direct metrics, such as latency and energy, into the optimization
to maximize the adaptation performance based on the characteristics of the
platform. By using empirical measurements, NetAdapt can be applied to any
platform as long as we can measure the desired metrics, without any knowledge
of the underlying implementation of the platform. We demonstrated empirically
that the proposed algorithm can achieve a better accuracy-versus-latency trade-off (up to 1.7× faster with equal or higher accuracy) than other state-of-the-art network simplification algorithms. In this work, we aimed to highlight
the importance of using direct metrics in the optimization of efficient networks;
we hope that future research efforts will take direct metrics into account in order
to further improve the performance of efficient networks.
Bibliography
[1] Audet, C., Dennis Jr., J.E.: A progressive barrier for derivative-free nonlinear programming. SIAM Journal on Optimization 20(1), 445–472 (2009)
[2] Chen, Y.H., Emer, J., Sze, V.: Eyeriss: A Spatial Architecture for Energy-
Efficient Dataflow for Convolutional Neural Networks. In: Proceedings of the
43rd Annual International Symposium on Computer Architecture (ISCA)
(2016)
[3] Chen, Y.H., Krishna, T., Emer, J., Sze, V.: Eyeriss: An Energy-Efficient
Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE
Journal of Solid-State Circuits 52, 127–138 (2016)
[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A
large-scale hierarchical image database. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 248–255. IEEE (2009)
[5] Gordon, A., Eban, E., Nachum, O., Chen, B., Yang, T.J., Choi, E.: Mor-
phnet: Fast & simple resource-constrained structure learning of deep net-
works. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2018)
[6] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections
for efficient neural network. In: Advances in Neural Information Processing
Systems. pp. 1135–1143 (2015)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image
Recognition. In: IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR) (2016)
[8] He, Y., Han, S.: Adc: Automated deep compression and acceleration with
reinforcement learning. arXiv preprint arXiv:1802.03494 (2018)
[9] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand,
T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural
networks for mobile vision applications. arXiv preprint arXiv:1704.04861
(2017)
[10] Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network Trimming: A Data-
Driven Neuron Pruning Approach towards Efficient Deep Architectures.
arXiv preprint arXiv:1607.03250 (2016)
[11] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized
neural networks. In: Advances in Neural Information Processing Systems.
pp. 4107–4115 (2016)
[12] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H.,
Kalenichenko, D.: Quantization and training of neural networks for efficient
integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877 (2017)
[13] Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of
deep convolutional neural networks for fast and low power mobile applica-
tions. arXiv preprint arXiv:1511.06530 (2015)
[14] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances
in Neural Information Processing Systems (1990)
[15] Lai, L., Suda, N., Chandra, V.: Not all ops are created equal! In: SysML (2018)
[16] Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolu-
tional neural networks for resource efficient transfer learning. arXiv preprint
arXiv:1611.06440 (2016)
[17] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet
classification using binary convolutional neural networks. In: European Con-
ference on Computer Vision (ECCV) (2016)
[18] Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: Inverted
residuals and linear bottlenecks: Mobile networks for classification, detection
and segmentation. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2018)
[19] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-
Scale Image Recognition. In: International Conference on Learning Repre-
sentations (ICLR) (2014)
[20] Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural net-
works. arXiv preprint arXiv:1507.06149 (2015)
[21] Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep
neural networks: A tutorial and survey. Proceedings of the IEEE 105(12),
2295–2329 (Dec 2017). https://doi.org/10.1109/JPROC.2017.2761740
[22] TensorFlow Lite: https://www.tensorflow.org/mobile/tflite/
[23] Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L.,
Wang, Z.: Deep fried convnets. In: Proceedings of the IEEE International
Conference on Computer Vision. pp. 1476–1483 (2015)
[24] Yang, T.J., Chen, Y.H., Emer, J., Sze, V.: A Method to Estimate the Energy Consumption of Deep Neural Networks. In: Asilomar Conference on Signals, Systems and Computers (2017)
[25] Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
[26] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel:
Customizing dnn pruning to the underlying hardware parallelism. In: Pro-
ceedings of the 44th Annual International Symposium on Computer Archi-
tecture (2017)
[27] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely ef-
ficient convolutional neural network for mobile devices. arXiv preprint
arXiv:1707.01083 (2017)