
Learning to Compare Image Patches via Convolutional Neural Networks

Sergey Zagoruyko and Nikos Komodakis
Université Paris-Est, École des Ponts ParisTech
[email protected]   [email protected]
arXiv:1504.03641v1 [cs.CV] 14 Apr 2015

Abstract

In this paper we show how to learn directly from image data (i.e., without resorting to manually-designed features) a general similarity function for comparing image patches, which is a task of fundamental importance for many computer vision problems. To encode such a function, we opt for a CNN-based model that is trained to account for a wide variety of changes in image appearance. To that end, we explore and study multiple neural network architectures, which are specifically adapted to this task. We show that such an approach can significantly outperform the state-of-the-art on several problems and benchmark datasets.

Figure 1. Our goal is to learn a general similarity function for image patches. To encode such a function, here we make use of and explore convolutional neural network architectures. (The diagram shows patch 1 and patch 2 entering a ConvNet, followed by a decision network that outputs their similarity.)

1. Introduction

Comparing patches across images is probably one of the most fundamental tasks in computer vision and image analysis. It is often used as a subroutine that plays an important role in a wide variety of vision tasks. These can range from low-level tasks such as structure from motion, wide baseline matching, building panoramas, and image super-resolution, up to higher-level tasks such as object recognition, image retrieval, and classification of object categories, to mention a few characteristic examples.

Of course, the problem of deciding if two patches correspond to each other or not is quite challenging, as there exist far too many factors that affect the final appearance of an image [17]. These can include changes in viewpoint, variations in the overall illumination of a scene, occlusions, shading, differences in camera settings, etc. In fact, this need for comparing patches has given rise to the development of many hand-designed feature descriptors over the past years, including SIFT [15], which had a huge impact in the computer vision community. Yet, such manually designed descriptors may be unable to take into account in an optimal manner all of the aforementioned factors that determine the appearance of a patch. On the other hand, nowadays one can easily gain access to (or even generate using available software) large datasets that contain patch correspondences between images [22]. This begs the following question: can we make proper use of such datasets to automatically learn a similarity function for image patches?

The goal of this paper is to affirmatively address the above question. Our aim is thus to be able to generate a patch similarity function from scratch, i.e., without attempting to use any manually designed features, but instead directly learning this function from annotated pairs of raw image patches. To that end, inspired also by the recent advances in neural architectures and deep learning, we choose to represent such a function in terms of a deep convolutional neural network [14, 13] (Fig. 1). In doing so, we are also interested in addressing the issue of what network architecture should best be used in a task like this. We thus explore and propose various types of networks, having architectures that exhibit different trade-offs and advantages. In all cases, to train these networks, we use as sole input a large database that contains pairs of raw image patches (both matching and non-matching). This allows us to further improve the performance of our method simply by enriching this database with more samples (as software for automatically generating such samples is readily available [21]).

(Source code and trained models are available online at http://imagine.enpc.fr/~zagoruys/deepcompare.html; this work was supported by EC project FP7-ICT-611145 ROBOSPECT.)

To conclude this section, the paper's main contributions are as follows:

(i) We learn directly from image data (i.e., without any manually-designed features) a general similarity function for patches that can implicitly take into account various types of transformations and effects (due to, e.g., a wide baseline, illumination, etc.). (ii) We explore and propose a variety of different neural network models adapted for representing such a function, highlighting at the same time network architectures that offer improved performance. (iii) We apply our approach on several problems and benchmark datasets, showing that it significantly outperforms the state-of-the-art and that it leads to feature descriptors with much better performance than manually designed descriptors (e.g., SIFT, DAISY) or other learnt descriptors as in [19]. Importantly, due to their convolutional nature, the resulting descriptors are very efficient to compute even in a dense manner.

2. Related work

The conventional approach to compare patches is to use descriptors and a squared euclidean distance. Most feature descriptors are hand-crafted, such as SIFT [15] or DAISY [26]. Recently, methods for learning a descriptor have been proposed [27] (e.g., DAISY-like descriptors that learn pooling regions and dimensionality reduction [3]). Simonyan et al. [19] proposed a convex procedure for training on both tasks.

Our approach, however, is inspired by the recent success of convolutional neural networks [18, 25, 24, 9]. Although these models involve a highly non-convex objective function during training, they have shown outstanding results in various tasks [18]. Fischer et al. [10] analysed the performance of convolutional descriptors from the AlexNet network (trained on the Imagenet dataset [13]) on the well-known Mikolajczyk dataset [16] and showed that these convolutional descriptors outperform SIFT in most cases except blur. They also proposed an unsupervised training approach for deriving descriptors that outperform both SIFT and the Imagenet-trained network.
Zbontar and LeCun in [28] have recently proposed a CNN-based approach to compare patches for computing the matching cost in small-baseline stereo, and have shown the best performance on the KITTI dataset. However, the focus of that work was only on comparing pairs that consist of very small patches like the ones in narrow baseline stereo. In contrast, here we aim for a similarity function that can account for a broader set of appearance changes and can be used in a much wider and more challenging set of applications, including, e.g., wide baseline stereo, feature matching and image retrieval.

3. Architectures

As already mentioned, the input to the neural network is considered to be a pair of image patches. Our models do not impose any limitations with respect to the number of channels in the input patches, i.e., given a dataset with colour patches the networks could be trained to further increase performance. However, to be able to compare our approach with state-of-the-art methods on existing datasets, we chose to use only grayscale patches during training. Furthermore, with the exception of the SPP model described in section 3.2, in all other cases the patches given as input to the network are assumed to have a fixed size of 64 × 64 (this means that original patches may need to be resized to the above spatial dimensions).

There are several ways in which patch pairs can be processed by the network and in which the information sharing can take place in this case. For this reason, we explored and tested a variety of models. We start in section 3.1 by describing the three basic neural network architectures that we studied, i.e., 2-channel, siamese, and pseudo-siamese (see Fig. 2), which offer different trade-offs in terms of speed and accuracy (note that, as usually applied, patch-matching techniques imply testing a patch against a big number of other patches, so re-using computed information is always useful). Essentially these architectures stem from the different way that each of them attempts to address the following question: when composing a similarity function for comparing image patches, do we first choose to compute a descriptor for each patch and then create a similarity on top of these descriptors, or do we perhaps choose to skip the part related to the descriptor computation and directly proceed with the similarity estimation?

In addition to the above basic models, we also describe in section 3.2 some extra variations concerning the network architecture. These variations, which are not mutually exclusive to each other, can be used in conjunction with any of the basic models described in section 3.1. Overall, this leads to a variety of models that can be used for the task of comparing image patches.
Figure 2. Three basic network architectures: 2-channel on the left, siamese and pseudo-siamese on the right (the difference between siamese and pseudo-siamese is that the latter does not have shared branches). Color code used: cyan = Conv+ReLU, purple = max pooling, yellow = fully connected layer (ReLU exists between fully connected layers as well).

Figure 3. A central-surround two-stream network that uses a siamese-type architecture to process each stream. This results in 4 branches in total that are given as input to the top decision layer (the two branches in each stream are shared in this case).

3.1. Basic models

Siamese: This type of network resembles the idea of having a descriptor [2, 6]. There are two branches in the network that share exactly the same architecture and the same set of weights. Each branch takes as input one of the two patches and then applies a series of convolutional, ReLU and max-pooling layers. Branch outputs are concatenated and given to a top network that consists of linear fully connected and ReLU layers. In our tests we used a top network consisting of 2 linear fully connected layers (each with 512 hidden units) that are separated by a ReLU activation layer.

Branches of the siamese network can be viewed as descriptor computation modules and the top network as a similarity function. For the task of matching two sets of patches at test time, descriptors can first be computed independently using the branches and then matched with the top network (or even with a distance function like l2).

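To make the description concrete, below is a minimal PyTorch sketch of such a siamese model (an illustration only, not the authors' Torch implementation; the layer sizes follow the siam entry of Table 1):

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Two weight-shared branches plus a fully connected decision network."""
    def __init__(self):
        super().__init__()
        # Branch: C(96,7,3)-ReLU-P(2,2)-C(192,5,1)-ReLU-P(2,2)-C(256,3,1)-ReLU
        self.branch = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=7, stride=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(96, 192, kernel_size=5, stride=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Decision network F(512)-ReLU-F(1) on the concatenated descriptors.
        self.top = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, p1, p2):  # p1, p2: (B, 1, 64, 64) grayscale patches
        d1 = self.branch(p1).flatten(1)  # (B, 256) descriptor of patch 1
        d2 = self.branch(p2).flatten(1)  # shared weights: same descriptor function
        return self.top(torch.cat([d1, d2], dim=1)).squeeze(1)  # similarity score
```

A pseudo-siamese variant would simply instantiate two branch modules with the same layout but independent weights.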
Pseudo-siamese: In terms of complexity, this architecture can be considered as being in-between the siamese and the 2-channel networks. More specifically, it has the structure of the siamese net described above, except that the weights of the two branches are uncoupled, i.e., not shared. This increases the number of parameters that can be adjusted during training and provides more flexibility than a restricted siamese network, but not as much as the 2-channel network described next. On the other hand, it maintains the efficiency of the siamese network at test time.
2-channel: Unlike the previous models, here there is no direct notion of a descriptor in the architecture. We simply consider the two patches of an input pair as a 2-channel image, which is directly fed to the first convolutional layer of the network. In this case, the bottom part of the network consists of a series of convolutional, ReLU and max-pooling layers. The output of this part is then given as input to a top module that consists simply of a fully connected linear decision layer with 1 output. This network provides greater flexibility compared to the above models, as it starts by processing the two patches jointly. Furthermore, it is fast to train, but in general at test time it is more expensive, as it requires all combinations of patches to be tested against each other in a brute-force manner.
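A corresponding sketch of the 2-channel model (again illustrative; the layer sizes follow the 2ch entry of Table 1, whose top module is F(256)-ReLU-F(1) rather than a bare one-output layer):

```python
import torch.nn as nn

class TwoChannelNet(nn.Module):
    """The patch pair is treated as one 2-channel image from the first layer on."""
    def __init__(self):
        super().__init__()
        # C(96,7,3)-ReLU-P(2,2)-C(192,5,1)-ReLU-P(2,2)-C(256,3,1)-ReLU
        self.features = nn.Sequential(
            nn.Conv2d(2, 96, kernel_size=7, stride=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(96, 192, kernel_size=5, stride=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.top = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, pair):  # pair: (B, 2, 64, 64); the channels are the two patches
        return self.top(self.features(pair).flatten(1)).squeeze(1)
```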
3.2. Additional models

Deep network. We apply the technique proposed by Simonyan and Zisserman in [20], advising to break up bigger convolutional layers into smaller 3×3 kernels, separated by ReLU activations, which is supposed to increase the non-linearities inside the network and make the decision function more discriminative. They also report that it might be difficult to initialise such a network; we, however, do not observe this behavior and train the network from scratch as usual. In our case, when applying this technique to our model, the convolutional part of the final architecture turns out to consist of one convolutional 4×4 layer and 6 convolutional layers with 3×3 kernels, separated by ReLU activations. As we shall also see later in the experimental results, such a change in the network architecture can contribute to further improving performance, which is in accordance with analogous observations made in [20].

Central-surround two-stream network. As its name suggests, the proposed architecture consists of two separate streams, central and surround, which enable processing in the spatial domain over two different resolutions. More specifically, the central high-resolution stream receives as input two 32 × 32 patches that are generated by cropping (at the original resolution) the central 32 × 32 part of each input 64 × 64 patch. Furthermore, the surround low-resolution stream receives as input two 32 × 32 patches, which are generated by downsampling the original pair of input patches at half resolution. The resulting two streams can then be processed by using any of the basic architectures described in section 3.1 (see Fig. 3 for an example that uses a siamese architecture for each stream).

One reason to make use of such a two-stream architecture is that multi-resolution information is known to be important in improving the performance of image matching. Furthermore, by considering the central part of a patch twice (i.e., in both the high-resolution and low-resolution streams) we implicitly put more focus on the pixels closer to the center of a patch and less focus on the pixels in the periphery, which can also help for improving the precision of matching (essentially, since pooling is applied to the downsampled image, pixels in the periphery are allowed to have more variance during matching). Note that the total input dimensionality is reduced by a factor of two in this case. As a result, training proceeds faster, which is also one other practical advantage.
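The two stream inputs can be produced in a few lines; the sketch below assumes average pooling for the half-resolution downsampling, a detail the text leaves unspecified:

```python
import torch.nn.functional as F

def central_surround_inputs(patch):
    """Turn a (B, 1, 64, 64) patch into the two 32x32 stream inputs."""
    central = patch[:, :, 16:48, 16:48]                      # central crop, full resolution
    surround = F.avg_pool2d(patch, kernel_size=2, stride=2)  # whole patch, half resolution
    return central, surround
```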
Figure 4. SPP network for a siamese architecture: SPP layers (orange) are inserted immediately after the 2 branches of the network so that the top decision layer has an input of fixed dimensionality for any size of the input patches.

Spatial pyramid pooling (SPP) network for comparing patches. Up to this point we have been assuming that the network requires the input patches to have a fixed size of 64 × 64. This requirement comes from the fact that the output of the last convolutional layer of the network needs to have a predefined dimensionality. Therefore, when we need to compare patches of arbitrary sizes, this means that we first have to resize them to the above spatial dimensions. However, if we look at the example of descriptors like SIFT, for instance, we can see that another possible way to deal with patches of arbitrary sizes is via adjusting the size of the spatial pooling regions to be proportional to the size of the input patch, so that we can still maintain the required fixed output dimensionality for the last convolutional layer without deteriorating the resolution of the input patches. This is also the idea behind the recently proposed SPP-net architecture [11], which essentially amounts to inserting a spatial pyramid pooling layer between the convolutional layers and the fully-connected layers of the network. Such a layer aggregates the features of the last convolutional layer through spatial pooling, where the size of the pooling regions is dependent on the size of the input. Inspired by this, we propose to also consider adapting the network models of section 3.1 according to the above SPP-architecture. This can be easily achieved for all the considered models (e.g., see Fig. 4 for an example with a siamese model).
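The following sketch shows the key mechanism of such a layer: pooling regions that scale with the input, so that the descriptor length stays fixed (the single 4 × 4 level mirrors the siam-SPP setting used in section 5.3; a multi-level pyramid would simply extend the levels tuple):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBranch(nn.Module):
    """Wrap a convolutional branch with spatial pyramid pooling so that patches
    of arbitrary size yield a descriptor of fixed dimensionality."""
    def __init__(self, branch, levels=(4,)):
        super().__init__()
        self.branch, self.levels = branch, levels

    def forward(self, x):
        fmap = self.branch(x)  # (B, C, H, W); H and W depend on the input size
        pooled = [F.adaptive_max_pool2d(fmap, l).flatten(1) for l in self.levels]
        return torch.cat(pooled, dim=1)  # length C * sum(l*l), input-independent
```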
4. Learning

Optimization. We train all models in a strongly supervised manner. We use a hinge-based loss term and squared l2-norm regularization, which leads to the following learning objective function:

    min_w  (λ/2) ‖w‖² + Σ_{i=1}^{N} max(0, 1 − y_i · o_i^net) ,   (1)

where w are the weights of the neural network, o_i^net is the network output for the i-th training sample, and y_i ∈ {−1, 1} is the corresponding label (with −1 and 1 denoting a non-matching and a matching pair, respectively).

ASGD with constant learning rate 1.0, momentum 0.9 and weight decay λ = 0.0005 is used to train the models. Training is done in mini-batches of size 128. Weights are initialised randomly and all models are trained from scratch.

Data augmentation and preprocessing. To combat overfitting we augment the training data by flipping both patches in a pair horizontally and vertically, and by rotating them by 90, 180 and 270 degrees. As we do not notice overfitting while training in this manner, we train the models for a certain number of iterations, usually for 2 days, and then test performance on the test set.

The training dataset size allows us to store all the images directly in GPU memory and very efficiently retrieve patch pairs during training. Images are augmented "on the fly". We use a Titan GPU in Torch [7], and the convolution routines are taken from the Nvidia cuDNN library [5]. Our siamese descriptors on GPU are just 2 times slower to compute than SIFT descriptors on CPU, and 2 times faster than Imagenet descriptors on GPU, according to [10].
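In code, one optimization step under objective (1) might look as follows (a sketch: the hinge term is averaged over the mini-batch, the λ/2 ‖w‖² term is folded into the optimizer's weight decay, and momentum SGD stands in for ASGD):

```python
import torch

def train_step(model, optimizer, p1, p2, y):
    """One mini-batch step for objective (1); y is +1 for matching pairs, -1 otherwise."""
    optimizer.zero_grad()
    o = model(p1, p2)                            # raw network outputs o_i^net
    loss = torch.clamp(1 - y * o, min=0).mean()  # hinge term max(0, 1 - y_i * o_i^net)
    loss.backward()
    optimizer.step()                             # weight_decay supplies the L2 term
    return loss.item()

# e.g.: optimizer = torch.optim.SGD(model.parameters(), lr=1.0,
#                                   momentum=0.9, weight_decay=0.0005)
```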
5. Experiments

We applied our models to a variety of problems and datasets. In the following we report results, and also provide comparisons with the state-of-the-art.

5.1. Local image patches benchmark

For a first evaluation of our models, we used the standard benchmark dataset from [3] that consists of three subsets, Yosemite, Notre Dame, and Liberty, each of which contains more than 450,000 image patches (64 × 64 pixels) sampled around Difference of Gaussians feature points. The patches are scale and orientation normalized. Each of the subsets was generated using actual 3D correspondences obtained via multi-view stereo depth maps. These maps were used to produce 500,000 ground-truth feature pairs for each dataset, with an equal number of positive (correct) and negative (incorrect) matches.

For evaluating our models, we use the evaluation protocol of [4] and generate ROC curves by thresholding the distance between feature pairs in the descriptor space. We report the false positive rate at 95% recall (FPR95) on each of the six combinations of training and test sets, as well as the mean across all combinations. We also report the mean, denoted as mean(1,4), for only those 4 combinations that were used in [1], [3] (in which case training takes place on Yosemite or Notre Dame, but not Liberty).
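For reference, the FPR95 metric can be computed from raw scores as in this sketch (assuming larger scores mean "more similar"; for descriptor distances one would negate them first):

```python
import numpy as np

def fpr95(scores, labels):
    """False positive rate at the first threshold reaching 95% recall."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # decreasing similarity
    labels = np.asarray(labels)[order]                    # 1 = match, 0 = non-match
    recall = np.cumsum(labels) / labels.sum()
    false_pos = np.cumsum(1 - labels)
    idx = np.searchsorted(recall, 0.95)                   # first index with recall >= 0.95
    return false_pos[idx] / (1 - labels).sum()
```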
Table 1 reports the performance of several models, and also details their architecture (we have also experimented with smaller kernels, fewer max-pooling layers, as well as adding normalisations, without noticing any significant improvement in performance). We briefly summarize some of the conclusions that can be drawn from this table. A first important conclusion is that 2-channel-based architectures (e.g., 2ch, 2ch-deep, 2ch-2stream) exhibit clearly the best performance among all models. This indicates that it is important to jointly use information from both patches right from the first layer of the network.

2ch-2stream was the top-performing network on this dataset, with 2ch-deep following closely (this verifies the importance of multi-resolution information during matching, and that increasing the network depth also helps). In fact, 2ch-2stream managed to outperform the previous state-of-the-art by a large margin, achieving a 2.45 times better score than [19]! The difference with SIFT was even larger, with our model giving a 6.65 times better score in this case (the SIFT score on mean(1,4) was 31.2 according to [3]).

Regarding siamese-based architectures, these too manage to achieve better performance than existing state-of-the-art systems. This is quite interesting because, e.g., none of these siamese networks tries to learn the shape, size or placement of the pooling regions (like, e.g., [19, 3] do), but instead utilizes just standard max-pooling layers. Among the siamese models, the two-stream network (siam-2stream) had the best performance, verifying once more the importance of multi-resolution information when it comes to comparing image patches. Furthermore, the pseudo-siamese network (pseudo-siam) was better than the corresponding siamese one (siam).
We also conducted additional experiments, in which we tested the performance of siamese models when their top decision layer is replaced with the l2 Euclidean distance of the two convolutional descriptors produced by the two branches of the network (denoted with the suffix l2 in the name). In this case, prior to applying the Euclidean distance, the descriptors are l2-normalized (we also tested l1 normalization). For pseudo-siamese only one branch was used to extract descriptors. As expected, in this case the two-stream network (siam-2stream-l2) computes better distances than the siamese network (siam-l2), which, in turn, computes better distances than the pseudo-siamese model (pseudo-siam-l2). In fact, the siam-2stream-l2 network manages to outperform even the previous state-of-the-art descriptor [19], which is quite surprising given that these siamese models have never been trained using l2 distances.
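At test time these l2 variants therefore need only the branch, e.g. (a sketch):

```python
import torch.nn.functional as F

def l2_distances(branch, patches1, patches2):
    """siam-l2-style matching: l2-normalized branch descriptors compared with the
    Euclidean distance (smaller distance = better match)."""
    d1 = F.normalize(branch(patches1).flatten(1), p=2, dim=1)
    d2 = F.normalize(branch(patches2).flatten(1), p=2, dim=1)
    return (d1 - d2).norm(p=2, dim=1)
```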
For a more detailed comparison of the various models, we provide the corresponding ROC curves in Fig. 5. Furthermore, we show in Table 2 the performance of imagenet-trained CNN features (these were l2-normalized to improve results). Among these, conv4 gives the best FPR95 score, which is equal to 17.98. This makes it better than SIFT but still much worse than our models.

             conv3 (3456)   conv4 (3456)   conv5 (2304)
Notredame    12.22          9.64           19.38
Liberty      16.25          14.26          21.59
Yosemite     33.25          30.22          43.26
mean         20.57          17.98          28.08

Table 2. FPR95 for imagenet-trained features (the dimensionality of each feature is given in parentheses).

Figure 6. (a) Filters of the first convolutional layer of the siam network. (b) Rows correspond to first-layer filters from the 2ch network (only a subset shown), depicting the left and right part of each filter.

Figure 7. Top-ranking false and true matches by 2ch-deep: (a) true positives, (b) false negatives, (c) true negatives, (d) false positives.

Fig. 6(a) displays the filters of the first convolutional layer learnt by the siamese network. Furthermore, Fig. 6(b) shows the left and right parts for a subset of the first-layer filters of the 2-channel network 2ch. It is worth mentioning that corresponding left and right parts look like being negative to each other, which basically means that the network has learned to compute differences of features between the two patches (note, though, that not all first-layer filters of 2ch look like this). Last, we show in Fig. 7 some top-ranking false and correct matches as computed by the 2ch-deep network. We observe that false matches could be easily mistaken even by a human (notice, for instance, how similar the two patches in the false positive examples look).

For the rest of the experiments, we note that we use models trained on the Liberty dataset.
Train  Test  2ch-2stream  2ch-deep  2ch   siam   siam-l2  pseudo-siam  pseudo-siam-l2  siam-2stream  siam-2stream-l2  [19]
Yos    ND    2.11         2.52      3.05  5.75   8.38     5.44         8.95            5.29          5.58             6.82
Yos    Lib   7.20         7.40      8.59  13.48  17.25    10.35        18.37           11.51         12.84            14.58
ND     Yos   4.10         4.38      6.04  13.23  15.89    12.64        15.62           10.44         13.02            10.08
ND     Lib   4.85         4.55      6.05  8.77   13.24    12.87        16.58           6.45          8.79             12.42
Lib    Yos   5.00         4.75      7.00  14.89  19.91    12.50        17.83           9.02          13.24            11.18
Lib    ND    1.90         2.01      3.03  4.33   6.01     3.93         6.58            3.05          4.54             7.22
mean         4.19         4.27      5.63  10.07  13.45    9.62         13.99           7.63          9.67             10.38
mean(1,4)    4.56         4.71      5.93  10.31  13.69    10.33        14.88           8.42          10.06            10.98

Table 1. Performance of several models on the "local image patches" benchmark. The model architectures are as follows: (i) 2ch-2stream consists of two branches C(95, 5, 1)-ReLU-P(2, 2)-C(96, 3, 1)-ReLU-P(2, 2)-C(192, 3, 1)-ReLU-C(192, 3, 1)-ReLU, one for the central and one for the surround parts, followed by F(768)-ReLU-F(1); (ii) 2ch-deep = C(96, 4, 3)-Stack(96)-P(2, 2)-Stack(192)-F(1), where Stack(n) = C(n, 3, 1)-ReLU-C(n, 3, 1)-ReLU-C(n, 3, 1)-ReLU; (iii) 2ch = C(96, 7, 3)-ReLU-P(2, 2)-C(192, 5, 1)-ReLU-P(2, 2)-C(256, 3, 1)-ReLU-F(256)-ReLU-F(1); (iv) siam has two branches C(96, 7, 3)-ReLU-P(2, 2)-C(192, 5, 1)-ReLU-P(2, 2)-C(256, 3, 1)-ReLU and decision layer F(512)-ReLU-F(1); (v) siam-l2 reduces to a single branch of siam; (vi) pseudo-siam is the uncoupled version of siam; (vii) pseudo-siam-l2 reduces to a single branch of pseudo-siam; (viii) siam-2stream has 4 branches C(96, 4, 2)-ReLU-P(2, 2)-C(192, 3, 1)-ReLU-C(256, 3, 1)-ReLU-C(256, 3, 1)-ReLU (coupled in pairs for the central and surround streams) and decision layer F(512)-ReLU-F(1); (ix) siam-2stream-l2 consists of one central and one surround branch of siam-2stream. The shorthand notation used was the following: C(n, k, s) is a convolutional layer with n filters of spatial size k × k applied with stride s, P(k, s) is a max-pooling layer of size k × k applied with stride s, and F(n) denotes a fully connected linear layer with n output units.
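As an illustration of this shorthand, the 2ch-deep entry expands to something like the following sketch (assumptions: no padding, under which the feature map is 1 × 1 before the final F(1) layer, and a ReLU after the first convolution, which the shorthand leaves implicit):

```python
import torch.nn as nn

def stack(cin, n):  # Stack(n) = C(n,3,1)-ReLU-C(n,3,1)-ReLU-C(n,3,1)-ReLU
    return [nn.Conv2d(cin, n, 3, 1), nn.ReLU(),
            nn.Conv2d(n, n, 3, 1), nn.ReLU(),
            nn.Conv2d(n, n, 3, 1), nn.ReLU()]

# 2ch-deep = C(96,4,3)-Stack(96)-P(2,2)-Stack(192)-F(1) on a 2-channel 64x64 input;
# spatial sizes: 64 -> 21 (conv 4x4, stride 3) -> 15 (stack) -> 7 (pool) -> 1 (stack).
two_ch_deep = nn.Sequential(
    nn.Conv2d(2, 96, 4, 3), nn.ReLU(), *stack(96, 96),
    nn.MaxPool2d(2, 2), *stack(96, 192),
    nn.Flatten(), nn.Linear(192, 1),
)
```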
Figure 5. ROC curves for various models (including the state-of-the-art descriptor [19]) on the local image patches benchmark; numbers in the legends are the corresponding FPR95 values. (Six panels of true positive rate versus false positive rate, one per train→test combination: yosemite→notredame, yosemite→liberty, notredame→yosemite, notredame→liberty, liberty→yosemite, liberty→notredame.)

5.2. Wide baseline stereo evaluation

For this evaluation we chose the dataset by Strecha et al. [23], which contains several image sequences with ground truth homographies and laser-scanned depthmaps. We used the "fountain" and "herzjesu" sequences to produce 6 and 5 rectified stereo pairs respectively. The baselines we chose in both sequences increase with each image, making matching more difficult. Our goal was to show that a photometric cost computed with a neural network competes favorably against costs produced by a state-of-the-art hand-crafted feature descriptor, so we chose to compare with DAISY [26].

Since our focus was not on efficiency, we used an unoptimized pipeline for computing the photometric costs. More specifically, for 2-channel networks we used a brute-force approach, where we extract patches on corresponding epipolar lines with subpixel estimation, construct batches (containing a patch from the left image I1 and all patches on the corresponding epipolar line from the right image I2) and compute the network outputs, resulting in the cost:

    C(p, d) = −o_net(I1(p), I2(p + d)) .   (2)

Here, I(p) denotes a neighbourhood intensity matrix around a pixel p, o_net(P1, P2) is the output of the neural network given a pair of patches P1 and P2, and d is the distance between points on the epipolar line.

For siamese-type networks, we compute the descriptors for each pixel in both images once and then match them with the decision top layer or the l2 distance. In the first case the formula for the photometric cost is the following:
    C(p, d) = −o_top(D1(I1(p)), D2(I2(p + d))) ,   (3)

where o_top is the output of the top decision layer, and D1, D2 are the outputs of the branches of the siamese or pseudo-siamese network, i.e. the descriptors (in the case of a siamese network, D1 = D2). For l2 matching, it holds:

    C(p, d) = ‖D1(I1(p)) − D2(I2(p + d))‖2 .   (4)

It is worth noting that all of the above costs can be computed a lot more efficiently using speed optimizations similar to [28]. This essentially means treating all fully connected layers as 1 × 1 convolutions, computing the branches of the siamese network only once, and furthermore computing the outputs of these branches as well as the final outputs of the network at all locations using a number of forward passes on full images (e.g., for a 2-channel architecture such an approach of computing the photometric costs would only require feeding the network with s² · d_max full 2-channel images of size equal to the input image pair, where s is the stride at the first layer of the network and d_max is the maximum disparity).

Once computed, the photometric costs are subsequently used as unary terms in the following pairwise MRF energy

    E({d_p}) = Σ_p C(p, d_p) + Σ_{(p,q)∈E} (λ1 + λ2 · e^(−‖∇I1(p)‖²/σ²)) · |d_p − d_q| ,

minimized using the algorithm of [8] based on FastPD [12] (we set λ1 = 0.01, λ2 = 0.2, σ = 7, and E is a 4-connected grid).

We show in Fig. 9 some qualitative results in terms of computed depth maps (with and without global optimization) for the "fountain" image set (results for "herzjesu" appear in the supplementary material due to lack of space). The global MRF optimization results visually verify that the photometric cost computed with a neural network is much more robust than with hand-crafted features, as well as the high quality of the depth maps produced by 2-channel architectures. Results without global optimization also show that the estimated depth maps contain much more fine detail than DAISY. They may exhibit a very sparse set of errors for the case of siamese-based networks, but these errors can be very easily eliminated during global optimization.

Fig. 8 also shows a quantitative comparison, focusing in this case on siamese-based models as they are more efficient. The first plot of that figure shows (for a single stereo pair) the distribution of deviations from the ground truth across the whole range of error thresholds (expressed here as a fraction of the scene's depth range). Furthermore, the other plots of the same figure summarize the corresponding distributions of errors for the six stereo pairs of increasing baseline (in this case we also show separately the error distributions when only unoccluded pixels are taken into account). The error thresholds were set to 3 and 5 pixels in these plots (note that the maximum disparity is around 500 pixels in the largest baseline). As can be seen, all siamese models perform much better than DAISY across all error thresholds and all baseline distances (e.g., notice the difference in the curves of the corresponding plots).
Figure 8. Quantitative comparison for wide-baseline stereo on the "fountain" dataset. (Leftmost plot) Distribution of deviations from the ground truth, expressed as a fraction of the scene's depth range. (Other plots) Distribution of errors for stereo pairs of increasing baseline (horizontal axis), both with and without taking into account occluded pixels (error thresholds were set equal to 1 and 3 pixels in these plots; maximum disparity is around 500 pixels). Compared methods: 2ch, siam-2stream-l2, siam, DAISY.

Figure 9. Wide baseline stereo evaluation. From left to right: DAISY, siam-2stream-l2, siam, 2ch. First row: "winner takes all" depthmaps; second row: depthmaps after MRF optimization.

5.3. Local descriptors performance evaluation

We also test our networks on the Mikolajczyk dataset for local descriptor evaluation [16]. The dataset consists of 48 images in 6 sequences with camera viewpoint changes, blur, compression, lighting changes and zoom, with a gradually increasing amount of transformation. There are known ground truth homographies between the first and each other image in a sequence.

The testing technique is the same as in [16]. Briefly, to test a pair of images, detectors are applied to both images to extract keypoints. Following [10], we use the MSER detector. The ellipses provided by the detector are used to extract patches from the input images. The ellipse size is magnified by a factor of 3 to include more context. Then, depending on the type of network, either descriptors, meaning the outputs of siamese or pseudo-siamese branches, are extracted, or all patch pairs are given to a 2-channel network to assign a score.

A quantitative comparison on this dataset is shown for several models in Fig. 10. Here we also test the network siam-SPP-l2, which is an SPP-based siamese architecture (note that siam-SPP is the same as siam but with the addition of two SPP layers - see also Fig. 4). We used an inserted SPP layer that had a spatial dimension of 4 × 4. As can be seen, this provides a big boost in matching performance, suggesting the great utility of such an architecture when comparing image patches. Regarding the rest of the models, the observed results in Fig. 10 reconfirm the conclusions already drawn from previous experiments. We simply note again the very good performance of siam-2stream-l2, which (although not trained with l2 distances) is able to significantly outperform SIFT and to also match the performance of imagenet-trained features (using, though, a much lower dimensionality of 512).
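Depending on the network type, the scoring step of this protocol can be sketched as follows (an illustrative helper, not the authors' evaluation code; branch and net are models in the sense of section 3.1):

```python
import torch

def match_scores(patches_a, patches_b, branch=None, net=None):
    """Score keypoint patch pairs: either compare branch descriptors with the l2
    distance, or let a 2-channel network score each pair directly."""
    if branch is not None:                            # siamese / pseudo-siamese route
        d1 = branch(patches_a).flatten(1)
        d2 = branch(patches_b).flatten(1)
        return -(d1 - d2).norm(p=2, dim=1)            # higher = better match
    pairs = torch.cat([patches_a, patches_b], dim=1)  # (N, 2, 64, 64)
    return net(pairs)                                 # 2-channel similarity scores
```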
Figure 10. Evaluation on the Mikolajczyk dataset [16] showing the mean average precision (mAP) averaged over all types of transformations in the dataset (as usual, the mAP score measures the area under the precision-recall curve). More detailed plots are provided in the supplemental material due to lack of space. (The plot compares, on MSER regions, SIFT, siam-2stream-l2, Imagenet features, siam-SPP-l2, 2ch-deep and 2ch-2stream, showing matching mAP against transformation magnitudes 1-5, averaged over all sequences.)

6. Conclusions

In this paper we showed how to learn directly from raw image pixels a general similarity function for patches, which is encoded in the form of a CNN model. To that end, we studied several neural network architectures that are specifically adapted to this task, and showed that they exhibit extremely good performance, significantly outperforming the state-of-the-art on several problems and benchmark datasets.

Among these architectures, we note that 2-channel-based ones were clearly superior in terms of results. It is, therefore, worth investigating how to further accelerate the evaluation of these networks in the future. Regarding siamese-based architectures, 2-stream multi-resolution models turned out to be extremely strong, providing always a significant boost in performance and verifying the importance of multi-resolution information when comparing patches. The same conclusion applies to SPP-based siamese networks, which also consistently improved the quality of results¹.

Last, we should note that simply the use of a larger training set can potentially benefit and improve the overall performance of our approach even further (as the training set that was used in the present experiments can actually be considered rather small by today's standards).

¹ In fact, SPP performance can improve even further, as no multiple aspect ratio patches were used during the training of SPP models (such patches appear only at test time).
References

[1] X. Boix, M. Gygli, G. Roig, and L. Van Gool. Sparse quantization for patch description. In CVPR, 2013.
[2] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, 1994.
[3] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[4] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):43-57, 2011.
[5] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[7] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[8] B. Conejo, N. Komodakis, S. Leprince, and J.-P. Avouac. Inference by learning: Speeding-up graphical model optimization via a coarse-to-fine cascade of pruning classifiers. In NIPS, 2014.
[9] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[10] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. CoRR, abs/1405.5769, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[12] N. Komodakis, G. Tziritas, and N. Paragios. Fast, approximately optimal solutions for single and dynamic MRFs. In CVPR, 2007.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.
[14] Y. LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, pages 21-28, 1988.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[16] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615-1630, 2005.
[17] E. Nowak and F. Jurie. Learning visual similarity measures for comparing never seen objects. In CVPR, 2007.
[18] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
[19] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[21] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. ACM Transactions on Graphics, 25(3):835-846, 2006.
[22] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. International Journal of Computer Vision, 80(2):189-210, 2008.
[23] C. Strecha, W. von Hansen, L. J. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, 2008.
[24] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[26] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In CVPR, 2008.
[27] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua. Learning image descriptors with the boosting-trick. In NIPS, 2012.
[28] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. CoRR, abs/1409.4326, 2014.
