Szegedy Rethinking The Inception CVPR 2016 Paper


Rethinking the Inception Architecture for Computer Vision

Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jon Shlens


Google Inc.

Zbigniew Wojna
University College London

Abstract

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error on the validation set and 3.6% top-5 error on the official test set.

1. Introduction

Since the 2012 ImageNet competition [16] winning entry by Krizhevsky et al [9], their network “AlexNet” has been successfully applied to a larger variety of computer vision tasks, for example to object-detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and superresolution [3].

These successes spurred a new line of research that focused on finding higher performing convolutional neural networks. Starting in 2014, the quality of network architectures significantly improved by utilizing deeper and wider networks. VGGNet [18] and GoogLeNet [20] yielded similarly high performance in the 2014 ILSVRC [16] classification challenge. One interesting observation was that gains in the classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architecture can be utilized for improving performance for most other computer vision tasks that are increasingly reliant on high quality, learned visual features. Also, improvements in the network quality resulted in new application domains for convolutional networks in cases where AlexNet features could not compete with hand engineered, crafted solutions, e.g. proposal generation in detection [4].

Although VGGNet [18] has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. On the other hand, the Inception architecture of GoogLeNet [20] was also designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet employed around 7 million parameters, which represented a 9× reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about 3× more parameters than AlexNet.

The computational cost of Inception is also much lower than VGGNet or its higher performing successors [6]. This has made it feasible to utilize Inception networks in big-data scenarios [17], [13], where huge amounts of data need to be processed at reasonable cost, or in scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings. It is certainly possible to mitigate parts of these issues by applying specialized solutions to target memory use [2], [15] or by optimizing the execution of certain operations via computational tricks [10]. However, these methods add extra complexity. Furthermore, these methods could be applied to optimize the Inception architecture as well, widening the efficiency gap again.

Still, the complexity of the Inception architecture makes it more difficult to make changes to the network. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost. Also, [20] does not provide a clear description about the contributing factors that lead to the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt it to new use-cases while maintaining its efficiency. For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of just doubling the number of all filter bank sizes will lead to a 4× increase in both computational cost and number of parameters. This might prove prohibitive or unreasonable in a lot of practical scenarios, especially if the associated gains are modest. In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context as the generic structure of the Inception style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimensional reduction and parallel structures of the Inception modules, which allows for mitigating the impact of structural changes on nearby components. Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models.

2. General Design Principles

Here we will describe a few design principles based on large-scale experimentation with various architectural choices with convolutional networks. At this point, the utility of the principles below is speculative and additional future experimental evidence will be necessary to assess their domain of validity. Still, grave deviations from these principles tended to result in deterioration in the quality of the networks, and fixing situations where those deviations were detected resulted in improved architectures.

1. Avoid representational bottlenecks, especially early in the network. Feed-forward networks can be represented by an acyclic graph from the input layer(s) to the classifier or regressor. This defines a clear direction for the information flow. For any cut separating the inputs from the outputs, one can assess the amount of information passing through the cut. One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. Theoretically, information content can not be assessed merely by the dimensionality of the representation as it discards important factors like correlation structure; the dimensionality merely provides a rough estimate of information content.

2. Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.

3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for that is that the strong correlation between adjacent units results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.

4. Balance the width and depth of the network. Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network. Increasing both the width and the depth of the network can contribute to higher quality networks. However, the optimal improvement for a constant amount of computation can be reached if both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and width of the network.

Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of box. The idea is to use them judiciously in ambiguous situations only.

3. Factorizing Convolutions with Large Filter Size

Much of the original gains of the GoogLeNet network [20] arise from a very generous use of dimension reduction, just like in the “Network in network” architecture by Lin et al [?]. This can be viewed as a special case of factorizing convolutions in a computationally efficient manner. Consider for example the case of a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer. In a vision network, it is expected that the outputs of near-by activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

Here we explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in reduced number of parameters. This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training. Also, we can use the computational and memory savings to increase the filter-bank sizes of our network while maintaining our ability to train each model replica on a single computer.

3.1. Factorization into smaller convolutions

Convolutions with larger spatial filters (e.g. 5 × 5 or 7 × 7) tend to be disproportionally expensive in terms of computation. For example, a 5 × 5 convolution with n filters over a grid with m filters is 25/9 = 2.78 times more computationally expensive than a 3 × 3 convolution with the same number of filters. Of course, a 5 × 5 filter can capture dependencies between signals between activations of units further away in the earlier layers, so a reduction of the geometric size of the filters comes at a large cost of expressiveness. However, we can ask whether a 5 × 5 convolution could be replaced by a multi-layer network with fewer parameters with the same input size and output depth. If we zoom into the computation graph of the 5 × 5 convolution, we see that each output looks like a small fully-connected network sliding over 5 × 5 tiles over its input (see Figure 1). Since we are constructing a vision network, it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture: the first layer is a 3 × 3 convolution, the second is a fully connected layer on top of the 3 × 3 output grid of the first layer (see Figure 1). Sliding this small network over the input activation grid boils down to replacing the 5 × 5 convolution with two layers of 3 × 3 convolution (compare Figure 4 with 5).

Figure 1. Mini-network replacing the 5 × 5 convolutions.

This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational cost savings, we will make a few simplifying assumptions that apply for the typical situations: We can assume that $n = \alpha m$, that is, that we want to change the number of activations/unit by a constant factor $\alpha$. Since the 5 × 5 convolution is aggregating, $\alpha$ is typically slightly larger than one (around 1.5 in the case of GoogLeNet). Having a two layer replacement for the 5 × 5 layer, it seems reasonable to reach this expansion in two steps: increasing the number of filters by $\sqrt{\alpha}$ in both steps. To simplify our estimate, we choose $\alpha = 1$ (no expansion). Sliding this network can then be represented by two 3 × 3 convolutional layers which reuse the activations between adjacent tiles. This way, we end up with a net $\frac{9+9}{25}\times$ reduction of computation, a relative gain of 28% by this factorization. The exact same saving holds for the parameter count as each parameter is used exactly once in the computation of the activation of each unit. Still, this setup raises two general questions: Does this replacement result in any loss of expressiveness? If our main goal is to factorize the linear part of the computation, would it not suggest to keep linear activations in the first layer? We have run several control experiments (for example see Figure 2) and using linear activation was always inferior to using rectified linear units in all stages of the factorization. We attribute this gain to the enhanced space of variations that the network can learn, especially if we batch-normalize [7] the output activations. One can see similar effects when using linear activations for the dimension reduction components.

[Figure 2 plot, titled "Factorization with Linear vs ReLU activation": top-1 accuracy versus iteration (×10^6) for the ReLU and Linear variants.]

Figure 2. One of several control experiments between two Inception models, one of them uses factorization into linear + ReLU layers, the other uses two ReLU layers. After 3.86 million operations, the former settles at 76.2%, while the latter reaches 77.2% top-1 Accuracy on the validation set.

3.2. Spatial Factorization into Asymmetric Convolutions

The above results suggest that convolutions with filters larger than 3 × 3 might not be generally useful as they can always be reduced into a sequence of 3 × 3 convolutional layers. Still we can ask the question whether one should factorize them into smaller, for example 2 × 2 convolutions. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two layer network with the same

receptive field as in a 3 × 3 convolution (see Figure 3). Still the two-layer solution is 33% cheaper for the same number of output filters, if the number of input and output filters is equal. By comparison, factorizing a 3 × 3 convolution into two 2 × 2 convolutions represents only an 11% saving of computation.

In theory, we could go even further and argue that one can replace any n × n convolution by a 1 × n convolution followed by an n × 1 convolution, and the computational cost saving increases dramatically as n grows (see Figure 6). In practice, we have found that employing this factorization does not work well on early layers, but it gives very good results on medium grid-sizes (on m × m feature maps, where m ranges between 12 and 20). On that level, very good results can be achieved by using 1 × 7 convolutions followed by 7 × 1 convolutions.

Figure 3. Mini-network replacing the 3 × 3 convolutions. The lower layer of this network consists of a 3 × 1 convolution with 3 output units.

Figure 4. Original Inception module as described in [20].

Figure 5. Inception modules where each 5 × 5 convolution is replaced by two 3 × 3 convolutions, as suggested by principle 3 of Section 2.

Figure 6. Inception modules after the factorization of the n × n convolutions. In our proposed architecture, we chose n = 7 for the 17 × 17 grid. (The filter sizes are picked using principle 3.)

Figure 7. Inception modules with expanded filter bank outputs. This architecture is used on the coarsest (8 × 8) grids to promote high dimensional representations, as suggested by principle 2 of Section 2. We are using this solution only on the coarsest grid, since that is the place where producing high dimensional sparse representation is the most critical as the ratio of local processing (by 1 × 1 convolutions) is increased compared to the spatial aggregation.
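The savings quoted in Sections 3.1 and 3.2 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes equal numbers of input and output filters in every layer (the α = 1 case), counts multiply-adds per output position, and uses hypothetical helper names (`conv_cost`, `relative_saving`) that do not come from the paper. The 7 × 7 row is not a figure quoted in the text; it simply applies the same arithmetic to the 1 × 7 and 7 × 1 pair.

```python
def conv_cost(kernel_h, kernel_w, channels_in=1, channels_out=1):
    """Multiply-adds per output position for one convolutional layer."""
    return kernel_h * kernel_w * channels_in * channels_out


def relative_saving(original, factorized):
    """Fraction of multiply-adds saved when the `original` kernel is replaced
    by the stack of kernels in `factorized` (equal channel counts assumed)."""
    original_cost = conv_cost(*original)
    factorized_cost = sum(conv_cost(*kernel) for kernel in factorized)
    return 1.0 - factorized_cost / original_cost


savings = {
    "5x5 -> 3x3, 3x3": relative_saving((5, 5), [(3, 3), (3, 3)]),
    "3x3 -> 3x1, 1x3": relative_saving((3, 3), [(3, 1), (1, 3)]),
    "3x3 -> 2x2, 2x2": relative_saving((3, 3), [(2, 2), (2, 2)]),
    "7x7 -> 1x7, 7x1": relative_saving((7, 7), [(1, 7), (7, 1)]),
}

for name, saving in savings.items():
    print(f"{name}: {saving:.0%} fewer multiply-adds")
```

Under these assumptions the first three rows reproduce the 28%, 33% and 11% figures from the text, and the last row shows how the saving of the asymmetric factorization grows with n, since the cost ratio is 2n/n² = 2/n.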

4. Utility of Auxiliary Classifiers

[20] has introduced the notion of auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers to make them immediately useful and improve the convergence during training by combating the vanishing gradient problem in very deep networks. Also Lee et al [11] argue that auxiliary classifiers promote more stable learning and better convergence. Interestingly, we found that auxiliary classifiers did not result in improved convergence early in the training: the training progression of the network with and without side head looks virtually identical before both models reach high accuracy. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.

Also [20] used two side-heads at different stages in the network. The removal of the lower auxiliary branch did not have any adverse effect on the final quality of the network. Together with the earlier observation in the previous paragraph, this means that the original hypothesis of [20] that these branches help evolving the low-level features is most likely misplaced. Instead, we argue that the auxiliary classifiers act as a regularizer. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized [7] or has a dropout layer. This also gives weak supporting evidence for the conjecture that batch normalization acts as a regularizer.

[Figure 8 diagram: 17×17×768 → 5×5 average pooling with stride 3 → 5×5×768 → 1×1 convolution → 5×5×128 → fully connected → 1×1×1024 → ...]

Figure 8. Auxiliary classifier on top of the last 17 × 17 layer. Batch normalization [7] of the layers in the side head results in a 0.4% absolute gain in top-1 accuracy. The lower axis shows the number of iterations performed, each with batch size 32.

5. Efficient Grid Size Reduction

Traditionally, convolutional networks used some pooling operation to decrease the grid size of the feature maps. In order to avoid a representational bottleneck, before applying maximum or average pooling the activation dimension of the network filters is expanded. For example, starting with a $d \times d$ grid with $k$ filters, if we would like to arrive at a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters, we first need to compute a stride-1 convolution with $2k$ filters and then apply an additional pooling step. This means that the overall computational cost is dominated by the expensive convolution on the larger grid using $2d^2k^2$ operations. One possibility would be to switch to pooling followed by convolution, resulting in $2(\frac{d}{2})^2k^2$ operations and reducing the computational cost by a factor of four. However, this creates a representational bottleneck as the overall dimensionality of the representation drops to $(\frac{d}{2})^2k$, resulting in less expressive networks (see Figure 9). Instead of doing so, we suggest another variant that reduces the computational cost even further while removing the representational bottleneck (see Figure 10). We can use two parallel stride 2 blocks: P and C. P is a pooling layer (either average or maximum pooling) and C is a convolutional layer; both of them have stride 2, and their filter banks are concatenated as in Figure 10.

[Figure 9 diagram, left: 35×35×320 → pooling → 17×17×320 → Inception → 17×17×640. Right: 35×35×320 → Inception → 35×35×640 → pooling → 17×17×640.]

Figure 9. Two alternative ways of reducing the grid size. The solution on the left violates principle 1 of Section 2, of not introducing a representational bottleneck. The version on the right is 3 times more expensive computationally.

[Figure 10 diagram, right panel: 35×35×320 → stride-2 conv (17×17×320) in parallel with stride-2 pool (17×17×320) → concat → 17×17×640.]

Figure 10. Inception module that reduces the grid-size while expanding the filter banks. It is both cheap and avoids the representational bottleneck, as is suggested by principle 1. The diagram on the right represents the same solution but from the perspective of grid sizes rather than the operations.

6. Inception-v3

Here we are connecting the dots from above and propose a new architecture with improved performance on the ILSVRC 2012 classification benchmark. The layout of our

network is given in Table 1. Note that we have factorized the traditional 7 × 7 convolution into three 3 × 3 convolutions based on the same ideas as described in Section 3.1. For the Inception part of the network, we have 3 traditional inception modules at the 35 × 35 grid with 288 filters each. This is reduced to a 17 × 17 grid with 768 filters using the grid reduction technique described in Section 5. This is followed by 5 instances of the factorized inception modules as depicted in Figure 6. This is reduced to an 8 × 8 × 1280 grid with the grid reduction technique depicted in Figure 10. At the coarsest 8 × 8 level, we have two Inception modules as depicted in Figure 7, with a concatenated output filter bank size of 2048 for each tile. The detailed structure of the network, including the sizes of filter banks inside the Inception modules, is given in the supplementary material, given in the model.txt that is in the tar-file of this submission. However, we have observed that the quality of the network is relatively stable to variations as long as the principles from Section 2 are observed. Although our network is 42 layers deep, our computation cost is only about 2.5 times higher than that of GoogLeNet and it is still much more efficient than VGGNet.

type          patch size/stride or remarks   input size
conv          3×3/2                          299×299×3
conv          3×3/1                          149×149×32
conv padded   3×3/1                          147×147×32
pool          3×3/2                          147×147×64
conv          3×3/1                          73×73×64
conv          3×3/2                          71×71×80
conv          3×3/1                          35×35×192
3×Inception   As in figure 5                 35×35×288
5×Inception   As in figure 6                 17×17×768
2×Inception   As in figure 7                 8×8×1280
pool          8×8                            8×8×2048
linear        logits                         1×1×2048
softmax       classifier                     1×1×1000

Table 1. The outline of the proposed network architecture. The output size of each module is the input size of the next one. We are using variations of the reduction technique depicted in Figure 10 to reduce the grid sizes between the Inception blocks whenever applicable. We have marked the convolution with 0-padding, which is used to maintain the grid size. 0-padding is also used inside those Inception modules that do not reduce the grid size. All other layers do not use padding. The various filter bank sizes are chosen to observe principle 4 from Section 2.

7. Model Regularization via Label Smoothing

Here we propose a mechanism to regularize the classifier layer by estimating the marginalized effect of label-dropout during training.

For each training example $x$, our model computes the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$. Here, $z_i$ are the logits or unnormalized log-probabilities. Consider the ground-truth distribution over labels $q(k|x)$ for this training example, normalized so that $\sum_k q(k|x) = 1$. For brevity, let us omit the dependence of $p$ and $q$ on example $x$. We define the loss for the example as the cross entropy: $\ell = -\sum_{k=1}^{K} \log(p(k))\, q(k)$. Minimizing this is equivalent to maximizing the expected log-likelihood of a label, where the label is selected according to its ground-truth distribution $q(k)$. Cross-entropy loss is differentiable with respect to the logits $z_k$ and thus can be used for gradient training of deep models. The gradient has a rather simple form: $\frac{\partial \ell}{\partial z_k} = p(k) - q(k)$, which is bounded between $-1$ and $1$.

Consider the case of a single ground-truth label $y$, so that $q(y) = 1$ and $q(k) = 0$ for all $k \neq y$. In this case, minimizing the cross entropy is equivalent to maximizing the log-likelihood of the correct label. For a particular example $x$ with label $y$, the log-likelihood is maximized for $q(k) = \delta_{k,y}$, where $\delta_{k,y}$ is Dirac delta, which equals 1 for $k = y$ and 0 otherwise. This maximum is not achievable for finite $z_k$ but is approached if $z_y \gg z_k$ for all $k \neq y$, that is, if the logit corresponding to the ground-truth label is much greater than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial \ell}{\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.

We propose a mechanism for encouraging the model to be less confident. While this may not be desired if the goal is to maximize the log-likelihood of training labels, it does regularize the model and makes it more adaptable. The method is very simple. Consider a distribution over labels $u(k)$, independent of the training example $x$, and a smoothing parameter $\epsilon$. For a training example with ground-truth label $y$, we replace the label distribution $q(k|x) = \delta_{k,y}$ with

$q'(k|x) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\, u(k)$

which is a mixture of the original ground-truth distribution $q(k|x)$ and the fixed distribution $u(k)$, with weights $1 - \epsilon$ and $\epsilon$, respectively. This can be seen as the distribution of the label $k$ obtained as follows: first, set it to the ground-truth label $k = y$; then, with probability $\epsilon$, replace $k$ with a sample drawn from the distribution $u(k)$. We propose to use the prior distribution over labels as $u(k)$. In our experiments, we used the uniform distribution $u(k) = 1/K$, so that

$q'(k) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K}.$
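The smoothed target distribution and the resulting loss can be illustrated with a small, dependency-free sketch. This is not the authors' training code: the function names, the four-class example logits and the choice of label index are assumptions made for illustration, and only the formulas for the softmax, the cross entropy, the gradient p(k) − q(k) and the uniform-prior smoothing are taken from the text above.

```python
import math


def softmax(logits):
    """p(k) = exp(z_k) / sum_i exp(z_i), shifted by max(z) for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(label, num_classes, eps=0.1):
    """q'(k) = (1 - eps) * delta_{k,label} + eps / K, with uniform u(k) = 1/K."""
    return [
        (1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def cross_entropy(q, p):
    """-sum_k q(k) * log p(k); terms with q(k) = 0 contribute nothing."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0.0)


def logit_gradient(q, p):
    """Gradient of the cross entropy w.r.t. the logits: p(k) - q(k)."""
    return [pk - qk for pk, qk in zip(p, q)]


# Hypothetical example: 4 classes, ground-truth label 0, eps = 0.1.
example_logits = [2.0, 0.5, -1.0, 0.0]
p = softmax(example_logits)
q_hard = smoothed_targets(0, 4, eps=0.0)    # one-hot targets
q_smooth = smoothed_targets(0, 4, eps=0.1)  # label-smoothed targets

print("smoothed targets:", q_smooth)
print("loss with hard targets:    ", cross_entropy(q_hard, p))
print("loss with smoothed targets:", cross_entropy(q_smooth, p))
print("gradient w.r.t. logits:    ", logit_gradient(q_smooth, p))
```

Because the smoothed targets never reach 0 for any class, the gradient for a non-target class only vanishes when its predicted probability equals ε/K rather than 0, which is what keeps the logit gap from growing without bound.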

We refer to this change in ground-truth label distribution as label-smoothing regularization, or LSR.

Note that LSR achieves the desired goal of preventing the largest logit from becoming much larger than all others. Indeed, if this were to happen, then a single $q(k)$ would approach 1 while all others would approach 0. This would result in a large cross-entropy with $q'(k)$ because, unlike $q(k) = \delta_{k,y}$, all $q'(k)$ have a positive lower bound.

Another interpretation of LSR can be obtained by considering the cross entropy:

$H(q', p) = -\sum_{k=1}^{K} \log p(k)\, q'(k) = (1 - \epsilon) H(q, p) + \epsilon H(u, p)$

Thus, LSR is equivalent to replacing a single cross-entropy loss $H(q, p)$ with a pair of such losses $H(q, p)$ and $H(u, p)$. The second loss penalizes the deviation of predicted label distribution $p$ from the prior $u$, with the relative weight $\frac{\epsilon}{1 - \epsilon}$. Note that this deviation could be equivalently captured by the KL divergence, since $H(u, p) = D_{KL}(u \| p) + H(u)$ and $H(u)$ is fixed. When $u$ is the uniform distribution, $H(u, p)$ is a measure of how dissimilar the predicted distribution $p$ is to uniform, which could also be measured (but not equivalently) by negative entropy $-H(p)$; we have not experimented with this approach.

In our ImageNet experiments with $K = 1000$ classes, we used $u(k) = 1/1000$ and $\epsilon = 0.1$. For ILSVRC 2012, we have found a consistent improvement of about 0.2% absolute both for top-1 error and the top-5 error (cf. Table 3).

8. Training Methodology

We have trained our networks with stochastic gradient utilizing the TensorFlow [1] distributed machine learning system using 50 replicas, each running on an NVidia Kepler GPU, with batch size 32 for 100 epochs. Our earlier experiments used momentum [19] with a decay of 0.9, while our best models were achieved using RMSProp [21] with decay of 0.9 and $\epsilon = 1.0$. We used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. In addition, gradient clipping [14] with threshold 2.0 was found to be useful to stabilize the training. Model evaluations are performed using a running average of the parameters computed over time.

9. Performance on Lower Resolution Input

A typical use-case of vision networks is for the post-classification of detection, for example in the Multibox [4] context. This includes the analysis of a relatively small patch of the image containing a single object with some context. The task is to decide whether the center part of the patch corresponds to some object and determine the class of the object if it does. The challenge is that objects tend to be relatively small and low-resolution. This raises the question of how to properly deal with lower resolution input.

The common wisdom is that models employing higher resolution receptive fields tend to result in significantly improved recognition performance. However it is important to distinguish between the effect of the increased resolution of the first layer receptive field and the effects of larger model capacity and computation. If we just change the resolution of the input without further adjustment to the model, then we end up using computationally much cheaper models to solve more difficult tasks. Of course, it is natural that these solutions lose out already because of the reduced computational effort. In order to make an accurate assessment, the model needs to analyze vague hints in order to be able to “hallucinate” the fine details. This is computationally costly. The question remains therefore: how much does higher input resolution help if the computational effort is kept constant? One simple way to ensure constant effort is to reduce the strides of the first two layers in the case of lower resolution input, or to simply remove the first pooling layer of the network.

For this purpose we have performed the following three experiments:

1. 299 × 299 receptive field with stride 2 and maximum pooling after the first layer.

2. 151 × 151 receptive field with stride 1 and maximum pooling after the first layer.

3. 79 × 79 receptive field with stride 1 and without pooling after the first layer.

All three networks have almost identical computational cost. Although the third network is slightly cheaper, the cost of the pooling layer is marginal (within 1% of the total cost of the network). In each case, the networks were trained until convergence and their quality was measured on the validation set of the ImageNet ILSVRC 2012 classification benchmark. The results can be seen in Table 2. Although the lower-resolution networks take longer to train, the quality of the final result is quite close to that of their higher resolution counterparts.

However, if one would just naively reduce the network size according to the input resolution, then the network would perform much more poorly. However this would be an unfair comparison as we would be comparing a 16 times cheaper model on a more difficult task.

Also, these results of Table 2 suggest that one might consider using dedicated high-cost low resolution networks for smaller objects in the R-CNN [5] context.

10. Experimental Results and Comparisons

Table 3 shows the experimental results about the recognition performance of our proposed architecture (Inception-

Receptive Field Size   Top-1 Accuracy (single frame)
79 × 79                75.2%
151 × 151              76.4%
299 × 299              76.6%

Table 2. Comparison of recognition performance when the size of the receptive field varies, but the computational cost is constant.

Network                                 Top-1 Error   Top-5 Error   Cost (Bn Ops)
GoogLeNet [20]                          29%           9.2%          1.5
BN-GoogLeNet                            26.8%         -             1.5
BN-Inception [7]                        25.2%         7.8%          2.0
Inception-v3-basic                      23.4%         -             3.8
Inception-v3-rmsprop (RMSProp)          23.1%         6.3%          3.8
Inception-v3-smooth (Label Smoothing)   22.8%         6.1%          3.8
Inception-v3-fact (Factorized 7 × 7)    21.6%         5.8%          4.8
Inception-v3 (BN-auxiliary)             21.2%         5.6%          4.8

Table 3. Single crop experimental results comparing the cumulative effects on the various contributing factors. We compare our numbers with the best published single-crop inference for Ioffe et al [7]. For the “Inception-v3-” lines, the changes are cumulative and each subsequent line includes the new change in addition to the previous ones. The last line, which includes all the changes, is what we refer to as “Inception-v3” below. Unfortunately, He et al [6] report only 10-crop evaluation results, but not single crop results; these are reported in Table 4 below.

Network            Crops Evaluated   Top-1 Error   Top-5 Error
GoogLeNet [20]     10                -             9.15%
GoogLeNet [20]     144               -             7.89%
VGG [18]           -                 24.4%         6.8%
BN-Inception [7]   144               22%           5.82%
PReLU [6]          10                24.27%        7.38%
PReLU [6]          -                 21.59%        5.71%
Inception-v3       12                19.47%        4.48%
Inception-v3       144               18.77%        4.2%

Table 4. Single-model, multi-crop experimental results comparing the cumulative effects on the various contributing factors. We compare our numbers with the best published single-model inference results on the ILSVRC 2012 classification benchmark.

Network            Models Evaluated   Crops Evaluated   Top-1 Error   Top-5 Error
VGGNet [18]        2                  -                 23.7%         6.8%
GoogLeNet [20]     7                  144               -             6.67%
PReLU [6]          -                  -                 -             4.94%
BN-Inception [7]   6                  144               20.1%         4.9%
Inception-v3       4                  144               17.2%         3.58%*

Table 5. Ensemble evaluation results comparing multi-model, multi-crop reported results. Our numbers are compared with the best published ensemble inference results on the ILSVRC 2012 classification benchmark. * All results, except the top-5 ensemble result reported, are on the validation set. The ensemble yielded 3.46% top-5 error on the validation set.

11. Conclusions
v2) as described in Section 6. Each Inception-v2 line shows
the result of the cumulative changes including the high- We have provided several design principles to scale up
lighted new modification plus all the earlier ones. Label convolutional networks and studied them in the context of
Smoothing refers to method described in Section 7. Fac- the Inception architecture. This guidance can lead to high
torized 7 × 7 includes a change that factorizes the first performance vision networks that have a relatively mod-
7 × 7 convolutional layer into a sequence of 3 × 3 convo- est computation cost compared to simpler, more monolithic
lutional layers. BN-auxiliary refers to the version in which architectures. Our highest quality version of Inception-v2
the fully connected layer of the auxiliary classifier is also reaches 21.2%, top-1 and 5.6% top-5 error for single crop
batch-normalized, not just the convolutions. We are refer- evaluation on the ILSVR 2012 classification, setting a new
ring to the model in last row of Table 3 as Inception-v3 and state of the art. This is achieved with relatively modest
evaluate its performance in the multi-crop and ensemble set- (2.5×) increase in computational cost compared to the net-
tings. work described in Ioffe et al [7]. Still our solution uses
All our evaluations are done on the 48238 non- much less computation than the best published results based
blacklisted examples on the ILSVRC-2012 validation set, on denser networks: our model outperforms the results of
as suggested by [16]. We have evaluated all the 50000 ex- He et al [6] – cutting the top-5 (top-1) error by 25% (14%)
amples as well and the results were roughly 0.1% worse in relative, respectively – while being six times cheaper com-
top-5 error and around 0.2% in top-1 error. In the upcom- putationally and using at least five times less parameters
ing version of this paper, we will verify our ensemble result (estimated). The combination of lower parameter count
on the test set, but at the time of our last evaluation of BN- and additional regularization with batch-normalized auxil-
Inception in spring [7] indicates that the test and validation iary classifiers and label-smoothing allows for training high
set error tends to correlate very well. quality networks on relatively modest sized training sets.
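The relative figures quoted in the conclusions can be checked against the tables with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: the paper does not say which rows were compared, so the choice of the PReLU [6] single-model row (21.59% / 5.71%) and the 144-crop Inception-v3 row of Table 4, and of the BN-Inception and Inception-v3 cost entries of Table 3, is an assumption, and the helper name relative_reduction is hypothetical.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Error rates in percent, copied from Table 4 (assumed comparison rows).
prelu = {"top1": 21.59, "top5": 5.71}        # PReLU [6], best single-model row
inception_v3 = {"top1": 18.77, "top5": 4.2}  # Inception-v3, 144 crops

# Cost in billions of operations, copied from Table 3.
bn_inception_cost = 2.0
inception_v3_cost = 4.8

top5_cut = relative_reduction(prelu["top5"], inception_v3["top5"])
top1_cut = relative_reduction(prelu["top1"], inception_v3["top1"])
cost_ratio = inception_v3_cost / bn_inception_cost

print(f"top-5 relative reduction: {top5_cut:.1%}")
print(f"top-1 relative reduction: {top1_cut:.1%}")
print(f"cost ratio vs BN-Inception: {cost_ratio:.1f}x")
```

Under these assumptions the relative reductions come out at roughly 26% (top-5) and 13% (top-1), and the cost ratio at 2.4, which is in the same range as the rounded 25% (14%) and 2.5× figures quoted in the text. A different choice of rows, for example the 12-crop Inception-v3 result, would give smaller reductions.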

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of The 32nd International Conference on Machine Learning, 2015.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014, pages 184–199. Springer, 2014.
[4] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[10] A. Lavin. Fast algorithms for convolutional neural networks. arXiv preprint arXiv:1509.09308, 2015.
[11] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv preprint arXiv:1409.5185, 2014.
[12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[13] Y. Movshovitz-Attias, Q. Yu, M. C. Stumpe, V. Shet, S. Arnoud, and L. Yatziv. Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1693–1702, 2015.
[14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
[15] D. C. Psichogios and L. H. Ungar. Svd-net: an algorithm that automatically selects network structure. IEEE transactions on neural networks/a publication of the IEEE Neural Networks Council, 5(3):513–515, 1993.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.
[17] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[21] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
[22] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653–1660. IEEE, 2014.
[23] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809–817, 2013.
