Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks
A. Recasens*, P. Kellnhofer*, S. Stent, W. Matusik and A. Torralba
{recasens,pkellnho,wojciech,torralba}@csail.mit.edu
Toyota Research Institute, Cambridge, MA, 02139, USA
[email protected]
1 Introduction
Many modern neural network models used in computer vision have input size
constraints [1,2,3,4]. These constraints exist for various reasons. By restricting
the input resolution, one can control the time and computation required during
both training and testing, and benefit from efficient batch training on GPU.
On certain datasets, limiting the input feature dimensionality can also empiri-
cally increase performance by improving training sample coverage over the input
space.
(Fig. 1 diagram: I. Typical approach — high-resolution original image, low-resolution input image, task network, task output, e.g. gaze vector, object class, attribute class; in our approach, a sampler precedes the task network.)
When the target input size is smaller than the images in the original dataset,
the standard approach is to uniformly downsample the input images. Perhaps the
best-known example is the commonly used 224 × 224 pixel input when training
classifiers on the ImageNet Large Scale Visual Recognition Challenge [5], despite
the presence of a range of image sizes – up to several megapixels – within the
original dataset.
While uniform downsampling is simple and effective in many situations, it
can be lossy for tasks which require information from different spatial resolutions
and locations. In such cases, sampling the salient regions at the necessary (and
possibly diverse) scales and locations is essential. Humans perform such tasks
by saccading their gaze in order to gather the necessary information with a mix-
ture of high-acuity foveal vision and coarser peripheral vision. Attempts have also
been made to endow machines with similar forms of sampling behavior. One pop-
ular example from traditional computer vision is SIFT [6], in which keypoints are
localised within space and image scale before feature extraction. More recently,
region proposal networks have been used widely in object detection [7]. Mim-
icking the human vision system more closely, mechanisms for task-dependent
sequential attention are being developed to allow numerous scene regions to be
processed in high resolution (see e.g. [8,9,10]). However, these approaches sur-
render some of the processing speed that makes machine vision attractive, and
add complexity for proposal generation and evaluating task completion.
In this work we introduce a saliency-based sampling layer: a simple plug-in
module that can be appended to the start of any input-constrained network and
used to improve downsampling in a task-specific manner. As shown in Fig. 1,
given a target image input size, our saliency sampler learns to allocate pixels
in that target to regions in the underlying image which are found to be partic-
ularly important for the task at hand. In doing so, the layer warps the input
image, creating a deformed version in which task-relevant parts of the image are
emphasized and irrelevant parts are suppressed, similar to how a caricature of a
face tries to magnify those parts of a person’s identity which make them stand
out from the average.
Our layer consists of a saliency map estimator connected to a sampler which
varies sampling density for image regions depending on their relative saliency
values. Since the layer is designed to be fully differentiable, it can be inserted
before any conventional network and trained end-to-end. Unlike sequential at-
tention models [9,10,11,12], the computation is performed in a single pass of the
saliency sampler at constant computational cost.
We apply our approach to tasks where the discovery of small objects or fine-
grained details is important (see Fig. 2), and consistently find that adding our
layer results in performance improvements over baseline networks.
2 Related Work
We divide the related work into three main categories: attention mechanisms,
saliency-based methods, and adaptive image sampling methods.
Attention mechanisms: Attention has been extensively used to improve the per-
formance of CNNs. Jaderberg et al. [13] introduced the Spatial Transformer
Network (STN), a layer that estimates a parametrized transformation from an
input image in an effort to undo nuisance image variation (such as from object
pose in the task of rigid object classification) and thereby improve model gen-
eralization. In their work, the authors proposed three types of transformation
that could be learned: affine, projective and thin plate spline (TPS). Although
our method also applies a transformation to the input image, our application is
quite different: we do not attempt to undo variation such as local translation
or rotation; rather we try to vary the resolution dynamically to favor regions
Fig. 2. Examples of resampled input images for various tasks using our pro-
posed saliency sampler. Our module is able to discover saliency according to the
task: for gaze estimation in (a), the sampler learns to zoom in on the subject’s eyes
to allow for higher precision gaze estimation; for fine-grained classification in (b), the
sampler zooms in on important parts of the bird’s anatomy while cropping out much of
the empty image; in (c), when no clear salient area is detected, the sampler defaults to
a near-uniform sampling.
of the input image which are more task salient. While our method could be
encapsulated within the TPS approach of [13], we implicitly prevent extreme
transformations and fold-overs, which can easily occur for a TPS-based spatial
transformer (and which also makes direct estimation of a non-parametrized sam-
pling map intractable). We believe that this helps to prevent dramatic failures
and therefore helps to make the module easier to learn.
Deformable convolutional networks (DCNs), introduced by Dai et al. [14], fol-
low a similar motivation to STNs. They show that convolutional layers can learn
to dynamically adjust their receptive fields to adapt to the input features and im-
prove invariance to nuisance factors. Their proposal involves the replacement of
any standard convolutional layer in a CNN with a deformable layer which learns
to estimate offsets to the standard kernel sampling locations, conditioned on the
input. We note four main differences with our work. First, while their method
samples from the same low-resolution input as the original CNN architecture,
our saliency sampler is designed to sample from any available resolution, allowing
it to take advantage of higher resolution data when available. Second, our ap-
proach estimates the sample field through saliency maps which have been shown
to emerge naturally when training fully convolutional neural networks [15]. We
found that estimating local spatial offsets directly, as in a DCN, is much harder.
Third, our method can be applied to existing trained networks without mod-
ification, while DCNs require changing network configurations by swapping in
the deformable convolutions. Finally, our approach produces human readable
outputs in the form of the saliency map and the deformed image which allow
for easy visual inspection and debugging. We note that our proposed saliency
sampler and DCNs are not mutually exclusive: our saliency sampler is designed
to sample efficiently across scale space and could potentially make use of de-
formable convolutional layers to help model local geometric variations. In the
same spirit as deformable networks, Li et al. [16] propose an encoder-decoder
structure to use non-square convolutions. As in [13], they directly predict a
parametrization of these transformations instead of using a saliency map.
Attending to multiple objects recursively has also been previously explored.
Eslami et al. [11] proposed a method to iteratively attend to multiple objects in
an image. In the same direction, [12] introduced a method for fine-grained classi-
fication which recursively locates an object in a low-resolution image followed by
cropping from a high-resolution image. More recently, [17] expanded this idea
to multiple attention locations in the image, instead of a single one. Finally,
[18] describe a method where multiple crops are proposed and then filtered by
a CNN. We note that these methods are designed specifically for classification
and are not as general as our proposed sampling layer.
Adaptive image sampling methods: This approach is usually taken when solving a particular problem where the features to use are
very clear for humans. For instance, to solve the problem of gaze-tracking on a
mobile device display, Khosla et al. [21] proposed the iTracker method, a gaze
estimation system based on RGB images. Their system uses the image from the
device’s front-facing camera, and extracts high resolution crops of both eyes and
face using separate detectors. Another example along this line was presented by
Wang et al. [22], who generate the features of the input image at different scales
to then select the best features and produce the final output.
Adaptive image sampling is also used for image retargeting in computer
graphics [23]. Unlike in our case where the sampled image only serves as an
intermediate representation for solving another problem, the goal of retargeting
is to deform an image to fit a new shape while preserving content important for
a human observer and avoiding visible deformations. Similarly to our concept,
this can be driven by saliency [24] and formulated as an energy minimization
[25] or Finite Element Method [26] problem.
3 Saliency Sampler
Let I be a high-resolution image of an arbitrary size and let Il be a low-resolution
image bounded by size M × N pixels suitable for a task network ft (Fig. 1).
Typically, CNNs rescale the input image I to Il without exploiting the relative
importance of I’s pixels. However, if our task requires information from a certain
image region more than others, it may be advantageous to sample this region
more densely. The saliency sampler executes this by first analyzing Il before
sampling areas of I proportionally to their perceived importance. In doing so, the
model can capture some of the benefit of increased resolution without significant
additional computational burden or risk of overfitting.
The sampling process can be divided into two stages. In the first stage, a
CNN is used to produce a saliency map. This map is task specific, since different
tasks may require focus on different image regions. In the second stage, the most
important image regions are sampled according to the saliency map.
Fig. 3. Saliency Sampler. The saliency map S (center, top) describes saliency as a
mass attracting neighboring pixels (arrows). Each pixel (red square) of the output low-
resolution image J samples from a location (cyan square) in the input high-resolution
image I which is offset by this attraction (yellow arrow) as defined by the Saliency Sam-
pler g(I, S). This distorts the coordinate system of the image and magnifies important
regions which get sampled more often than others.
We now consider the possible forms that the sampling function g can take, and which one is most suitable for CNNs. In
all cases, we compute a mapping between the sampled image and the original
image and then use the grid sampler introduced in [13]. This mapping can be
written in the standard form as two functions u(x, y) and v(x, y) such that
J(x, y) = I(u(x, y), v(x, y)).
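For concreteness, this step can be realized with PyTorch's torch.nn.functional.grid_sample (used here as a stand-in for the grid sampler of [13]; this is an illustrative assumption, and grid_sample expects coordinates in [-1, 1], so u and v in [0, 1] must first be rescaled):

import torch
import torch.nn.functional as F

I = torch.rand(1, 3, 512, 512)                  # high-resolution input image
u = torch.rand(1, 224, 224)                     # u(x, y) in [0, 1] (illustrative values)
v = torch.rand(1, 224, 224)                     # v(x, y) in [0, 1] (illustrative values)

grid = torch.stack([u, v], dim=-1) * 2.0 - 1.0  # (B, H_out, W_out, 2) in [-1, 1]
J = F.grid_sample(I, grid, align_corners=True)  # J(x, y) = I(u(x, y), v(x, y))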
The main goal for the design of u and v is to map pixels proportionally to the
normalized weight assigned to them by the saliency map. Assuming that u(x, y),
v(x, y), x and y range from 0 to 1, an exact solution to this problem would
be to find u and v such that:
\int_0^{u(x,y)} \int_0^{v(x,y)} S(x', y') \, dx' \, dy' = xy \qquad (1)
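In practice, solving Eq. (1) exactly is impractical inside a network. A convolutional approximation (sketched here; its exact form is an assumption consistent with the references to Eq. 2, Eq. 3 and the kernel k below) computes each sampling coordinate as a saliency-weighted average of nearby source coordinates:

u(x,y) = \frac{\sum_{x',y'} S(x',y') \, k((x,y),(x',y')) \, x'}{\sum_{x',y'} S(x',y') \, k((x,y),(x',y'))} \qquad (2)

v(x,y) = \frac{\sum_{x',y'} S(x',y') \, k((x,y),(x',y')) \, y'}{\sum_{x',y'} S(x',y') \, k((x,y),(x',y'))} \qquad (3)

where k is a distance kernel between pixel locations.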
This formulation holds certain desirable properties for our functions u and
v, notably:
Sampled areas: Areas of higher saliency are sampled more densely, since
those pixels with higher saliency mass will attract other pixels to them. Note
that kernel k can act as a regularizer to avoid corner cases where all the pixels
converge to the same value. In all our experiments, we use a Gaussian kernel
with σ set to one third of the width of the saliency map, which we found to work
well in various settings.
Convolutional form: This formulation allows us to compute u and v with
simple convolutions, which is key for the efficiency of the full system. This layer
can be easily added to a standard CNN and preserves the differentiability needed for
training by backpropagation.
Note that the formulation in Eq. 2 and Eq. 3 has an undesirable bias to
sample towards the image center. We avoid this effect by padding the saliency
map with its border values.
The saliency sampler can be plugged into any convolutional neural network ft
where more informative subsampling of a higher resolution input is desired.
Since the module is end-to-end differentiable, we can train the full pipeline with
standard optimization techniques. Our complete pipeline consists of four steps
(see Fig. 1): (i) the high-resolution image I is downsampled to the low-resolution image Il; (ii) the saliency network fs computes a task-specific saliency map S = fs(Il); (iii) the sampler g produces the resampled image J = g(I, S); and (iv) the task network ft computes the task output from J.
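For concreteness, a minimal PyTorch sketch of the sampler of step (iii) follows; the kernel construction, the replicate padding and the bilinear resizing of the sampling grid mirror the description in this section, while the exact sizes and coordinate conventions are assumptions rather than a reference implementation.

# Minimal sketch of the sampler g(I, S), assuming PyTorch; names and sizes are
# illustrative, not the authors' reference code.
import torch
import torch.nn.functional as F


def make_gaussian_kernel(size, sigma):
    # 2D Gaussian kernel of shape (1, 1, size, size), normalized to sum to 1.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)


def saliency_to_grid(saliency, out_size):
    # saliency: (B, 1, h, w) non-negative map; returns a grid for F.grid_sample.
    # Implements the convolutional form of Eqs. (2)-(3): each output coordinate
    # is a saliency-weighted average of nearby source coordinates under a
    # Gaussian kernel k with sigma = one third of the map width. Replicate
    # padding of the saliency map counters the center bias noted in the text.
    b, _, h, w = saliency.shape
    ksize = w if w % 2 == 1 else w - 1          # odd kernel, roughly map-sized
    k = make_gaussian_kernel(ksize, sigma=w / 3.0).to(saliency.device)
    pad = ksize // 2

    # Source coordinates x', y' in [-1, 1] (the grid_sample convention).
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=saliency.device),
        torch.linspace(-1, 1, w, device=saliency.device),
        indexing="ij")
    xs = xs.expand(b, 1, h, w)
    ys = ys.expand(b, 1, h, w)

    def conv(t):
        return F.conv2d(F.pad(t, [pad] * 4, mode="replicate"), k)

    denom = conv(saliency)
    u = conv(saliency * xs) / denom             # Eq. (2)
    v = conv(saliency * ys) / denom             # Eq. (3)

    grid = torch.stack([u, v], dim=-1).squeeze(1)             # (B, h, w, 2)
    grid = F.interpolate(grid.permute(0, 3, 1, 2), size=out_size,
                         mode="bilinear", align_corners=True)
    return grid.permute(0, 2, 3, 1)                           # (B, M, N, 2)


def saliency_sample(image_hr, saliency, out_size=(224, 224)):
    # J = g(I, S): resample the high-resolution image with the saliency grid.
    return F.grid_sample(image_hr, saliency_to_grid(saliency, out_size),
                         align_corners=True)

# Usage (shapes illustrative):
#   Il = F.interpolate(I, size=(224, 224), mode="bilinear")  # step (i)
#   S  = saliency_net(Il)                                    # step (ii)
#   J  = saliency_sample(I, S, (224, 224))                   # step (iii)
#   y  = task_net(J)                                         # step (iv)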
Both fs and ft have learnable parameters and so can be trained jointly for a
particular task. We found it helpful to blur the resampled input image of the task
network for some epochs at the beginning of the training procedure. This forces
the saliency sampler to zoom further into the image in order to magnify small
details that would otherwise be destroyed by the blur. This is beneficial even
for the final performance of the model once the blur is removed.
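One way to implement this warm-up is sketched below (the kernel size, sigma and epoch threshold are assumptions rather than the values used in our experiments):

import torchvision.transforms.functional as TF

BLUR_EPOCHS = 5      # assumed warm-up length
BLUR_SIGMA = 1.5     # assumed blur strength

def maybe_blur(J, epoch):
    # Blur the resampled batch J (B, C, H, W) only during the warm-up epochs.
    if epoch < BLUR_EPOCHS:
        return TF.gaussian_blur(J, kernel_size=9, sigma=BLUR_SIGMA)
    return J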
(Figure: for each example, the low-resolution input, saliency map, sampling grid and sampled image.)
4 Experiments
In this section we apply the saliency sampler to two important problems in
computer vision: gaze-tracking and fine-grained object recognition. In each case,
we examine the benefit of augmenting standard methods on commonly used
datasets with our sampling module. We also compare against the closest com-
parable methods. As the architecture for the saliency network fs, in all tasks
we use ablations of ResNet-18 [4] pretrained on the ImageNet dataset [28], followed by
one final 1 × 1 convolutional layer to reduce the dimensionality of the saliency
map S. We found this network to work particularly well for classification and
regression problems.
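One possible construction of fs is sketched below (the truncation point after the second residual stage and the spatial softmax normalization of S are assumptions consistent with the description above):

import torch
import torch.nn as nn
from torchvision.models import resnet18

class SaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18()  # initialize with ImageNet-pretrained weights in practice
        # Keep the stem plus the first two residual stages (roughly 10 conv layers).
        self.features = nn.Sequential(*list(backbone.children())[:6])
        self.to_saliency = nn.Conv2d(128, 1, kernel_size=1)   # final 1x1 convolution

    def forward(self, x_lowres):
        s = self.to_saliency(self.features(x_lowres))         # (B, 1, h, w)
        b, _, h, w = s.shape
        # Normalize to a non-negative map that sums to one over space.
        return torch.softmax(s.view(b, -1), dim=1).view(b, 1, h, w)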
(Figure: for each example, the input image, saliency map, transformation and sampled image.)
original input image and rescaled it to 227 × 227 resolution. These crops were
used as inputs for the ResNet-101 227 × 227 network for the final classification.
Table 2 shows the classification accuracy for the various models compared.
Our model is able to significantly outperform the ResNet-101 baseline by 5% and
3% for top-1 and top-5 accuracies respectively. The performance of the CAM-based
method is closer to ours, which is expected since it benefits from the
same idea of emphasizing image details. However, our method still performs
several points better, perhaps because of its greater flexibility to focus on local
image regions non-uniformly and to selectively zoom in on certain features more
than others. It also has the major benefit of being able to zoom in on an arbitrary
number of non-collocated image locations, whereas doing so with crops involves
determining the number of crops beforehand or having a proposal mechanism.
The performance of the spatial transformers, the grid estimator and the deformable
convolutions is similar to or slightly better than that of the ResNet-101 baseline.
Like our method, those methods benefit from the ability to focus attention on a
particular region of the image. However, the affine version of the spatial trans-
formers applies a uniform deformation across the whole image, which may not
be particularly well suited to the task, while the more flexible TPS version and
the grid estimator, which in theory could more closely mimic the sampling intro-
duced by our method, were found to be harder to optimize and were consistently
found to perform worse. Finally, the deformable convolutions method does not
have access to the full resolution image and uses a complex parametrization
which makes its training very unstable. In contrast, our method benefits from
the fact that neural networks have a natural ability to predict salient image
elements [30] and thus the optimization may be significantly easier.
To justify our claim that the saliency sampler can benefit different task net-
work architectures, we repeat our experiment using an Inception V3 architecture
[31]. The original performance is already very high (64% and 86% for top-1 and
top-5), as it uses a higher input resolution (299 × 299) and a deeper network, but our
sampler still improves performance to 66% top-1 and 87% top-5.
Saliency network importance: In Tbl. 3, we retrained ResNet-101 with
different depths of the saliency network fs. We used different ablations of ResNet-18
with 6, 10 or 14 layers (which corresponds to adding one block at a time to build
(Figure: for each example, the input image, saliency map, transformation and sampled image.)
ResNet-18) for the experiment. The performance of the overall network increases
with the complexity of the saliency model but with diminishing returns.
4.3 CUB-200
To further prove that our model is useful across different datasets, we evaluated
it on the CUB-200 dataset [32] (Tbl. 4). Although CUB-200 is also a fine-
grained recognition dataset, it is significantly smaller and the images are better
framed around subjects than in the iNaturalist dataset (see Fig. 6).
We used ResNet-50 as our task network and the initial 14 layers of ResNet-
18 as our saliency network. By adding our sampling layer we achieve a 2.9%
accuracy boost, which is less than the boost in iNaturalist, perhaps because
objects of interest are more tightly cropped in CUB-200. Compared to DT-RAM
[33], one of the top performing models in CUB-200, our approach outperforms
the comparable 224 × 224 version of RN-50 DT-RAM by 1.7%, using a simpler
model. Our method is not as accurate as the 448 × 448 resolution version of DT-RAM,
but the latter uses approximately two passes through an RN-50 on average
and a larger input size, leading to a higher computational cost.
5 Discussion
Adding our saliency sampler is most beneficial for image tasks where the impor-
tant features are small and sparse, or appear across multiple image scales. The
deformation introduced in the vicinity of the magnified regions could potentially
discourage the network from strong deformations if another point of interest
were affected. This could be harmful for tasks such as text recognition. In
practice, we observed that the learning process is able to deal with such situations
well, as it was capable of magnifying both closely spaced eyes without hindering the
gaze prediction performance. That is particularly interesting as this task requires
preservation of geometric information in the image. The method proved to be eas-
ier to train than other approaches which modify spatial sampling, such as Spatial
Transformer Networks [13] or Deformable Convolutional Networks [14]. These
methods ofter performed closer to the baseline as they failed to find suitable
parameters for their sampling strategy. The non-uniform approach to the mag-
nification introduced by our saliency map also enables variability of zoom over
the spatial domain. This together with the end-to-end optimization results in a
performance benefit over uniformly magnified area-of-interest crops as observed
in our fine-grained classification task. Unlike in the case of the iTracker [21], we
do not require prior knowledge about the relevant image features in the task.
6 Conclusion
We have presented the saliency sampler – a novel layer for CNNs that can adapt
the image sampling strategy to improve task performance while preserving mem-
ory allocation and computational efficiency for a given image processing task.
We have shown our technique’s effectiveness in locating and focusing on im-
age features important for the tasks of gaze tracking and fine-grained object
recognition. The method is simple to integrate into existing models and can
be efficiently trained in an end-to-end fashion. Unlike some of the other image
transformation techniques, our method is not restricted to a predefined number
or size of important regions and it can redistribute sampling density across the
entire image domain. At the same time, the parametrization of our technique by
a single scalar attention map makes it robust against irrecoverable image degra-
dation due to fold-overs or singularities. This leads to superior performance in
problems that require the recovery of small image features such as eyes or subtle
differences between related animal species.
References
1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep
convolutional neural networks. In: Conference on Neural Information Processing
Systems. (2012)
2. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model
size. arXiv preprint arXiv:1602.07360 (2016)
3. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
5. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV) 115(3) (2015) 211–252
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna-
tional Journal of Computer Vision (IJCV) 60(2) (2004) 91–110
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in Neural Information Processing
Systems. (2015) 91–99
8. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence
20(11) (1998) 1254–1259
9. Mnih, V., Heess, N., Graves, A.: Recurrent models of visual attention. In: Advances
in Neural Information Processing Systems. (2014) 2204–2212
10. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual atten-
tion. arXiv preprint arXiv:1412.7755 (2014)
11. Eslami, S.A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Hinton, G.E., et al.:
Attend, infer, repeat: Fast scene understanding with generative models. In: Ad-
vances in Neural Information Processing Systems. (2016) 3225–3233
12. Fu, J., Zheng, H., Mei, T.: Look closer to see better: Recurrent attention convo-
lutional neural network for fine-grained image recognition. In: Conf. on Computer
Vision and Pattern Recognition. (2017)
13. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks.
In: Advances in Neural Information Processing Systems. (2015) 2017–2025
14. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. In: IEEE Conference on Computer Vision and Pattern Recog-
nition. (2017) 764–773
15. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features
for discriminative localization. In: IEEE Conference on Computer Vision and
Pattern Recognition, IEEE (2016) 2921–2929
16. Li, J., Chen, Y., Cai, L., Davidson, I., Ji, S.: Dense transformer networks. arXiv
preprint arXiv:1705.08881 (2017)
17. Zheng, H., Fu, J., Mei, T., Luo, J.: Learning multi-attention convolutional neural
network for fine-grained image recognition. In: IEEE International Conference on
Computer Vision (ICCV). (2017)
18. Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., Zhang, Z.: The application of two-
level attention models in deep convolutional neural network for fine-grained image
classification. In: IEEE Conference on Computer Vision and Pattern Recognition,
IEEE (2015) 842–850
19. Rosenfeld, A., Ullman, S.: Visual concept recognition and localization via iterative
introspection. In: Asian Conference on Computer Vision (ACCV), Springer (2016)
264–279
20. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-
cam: Visual explanations from deep networks via gradient-based localization. In:
IEEE Conference on Computer Vision and Pattern Recognition. (2017) 618–626
21. Khosla∗ , A., Krafka∗ , K., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik,
W., Torralba, A.: Eye tracking for everyone. In: IEEE Conference on Computer
Vision and Pattern Recognition, Las Vegas, USA (June 2016) ∗ indicates equal
contribution.
22. Wang, S., Luo, L., Zhang, N., Li, J.: AutoScaler: Scale-Attention Networks for
Visual Correspondence. In: British Machine Vision Conference (BMVC). (2017)
23. Rubinstein, M., Gutierrez, D., Sorkine, O., Shamir, A.: A comparative study of
image retargeting. In: ACM Transactions on Graphics (TOG). Volume 29., ACM
(2010) 160
24. Wolf, L., Guttmann, M., Cohen-Or, D.: Non-homogeneous content-driven video-
retargeting. In: IEEE International Conference on Computer Vision (ICCV), IEEE
(2007) 1–6
25. Karni, Z., Freedman, D., Gotsman, C.: Energy-based image deformation. In:
Computer Graphics Forum. Volume 28., Wiley Online Library (2009) 1257–1268
26. Kaufmann, P., Wang, O., Sorkine-Hornung, A., Sorkine-Hornung, O., Smolic, A.,
Gross, M.: Finite element image warping. In: Computer Graphics Forum. Vol-
ume 32., Wiley Online Library (2013) 31–39
27. Chen, R., Freedman, D., Karni, Z., Gotsman, C., Liu, L.: Content-aware image
resizing by quadratic programming. In: Computer Vision and Pattern Recognition
Workshops (CVPRW), IEEE (2010) 1–8
28. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: IEEE Conference on Computer Vision and
Pattern Recognition, IEEE (2009) 248–255
29. Van Horn, G., Mac Aodha, O., Song, Y., Shepard, A., Adam, H., Perona, P., Be-
longie, S.: The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642
(2017)
30. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features
for discriminative localization. In: IEEE Conference on Computer Vision and
Pattern Recognition, IEEE (2016) 2921–2929
31. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the incep-
tion architecture for computer vision. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. (2016) 2818–2826
32. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd
birds-200-2011 dataset. (2011)
33. Li, Z., Yang, Y., Liu, X., Zhou, F., Wen, S., Xu, W.: Dynamic computational time
for visual attention. arXiv preprint arXiv:1703.10332 (2017)