Feature Propagation On Image Webs For Enhanced Image Retrieval

Eric Brachmann, Marcel Spehr, Stefan Gumhold
TU Dresden, 01062 Dresden, Germany
ABSTRACT
Keywords
content-based image retrieval; bag-of-features; image similarity; image webs; co-segmentation
1. INTRODUCTION

2. RELATED WORK
Figure 1: Illustration of BOF failure. Three images of the same object are depicted along with their visual word histograms. The viewpoint changes increasingly from left to right. The first and second histograms still overlap to a large extent (marked in gray). However, the appearance of the object has changed too much in the third image: the overlap is minimal, and the similarity measure fails. Transitive feature exchange can prevent the failure.
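One way to formalize the gray overlap in the figure is histogram intersection of the two visual word histograms; a minimal sketch:

```python
import numpy as np

def histogram_overlap(h1, h2):
    """Intersection of two L1-normalized visual word histograms (1 = identical)."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return float(np.minimum(h1, h2).sum())
```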
Numerous improvements to basic BOF search systems have been published. With the resulting performance boost, BOF search systems still produce state-of-the-art results on many image retrieval data sets [5]. Query expansion [3] takes the top retrieval results for an image and re-runs the search by treating them as new queries. The approach is motivated by the observation that the top results are often relevant to the search query. The expanded query set is regarded as an enriched query representation. The retrieval results of all queries are combined and ranked by similarity. By performing multiple searches for each query, query expansion multiplies retrieval times.
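The scheme can be sketched as follows; the engine interface (search_fn and get_signature) is a hypothetical stand-in for a BOF search system:

```python
def query_expansion(query_sig, search_fn, get_signature, n_expand=5):
    """search_fn(signature) -> ranked [(image_id, score), ...];
    get_signature(image_id) -> BOF signature of a database image."""
    results = search_fn(query_sig)
    scores = dict(results)
    # Treat the top results as new queries and merge all ranked lists,
    # keeping the best score observed per image.
    for image_id, _ in results[:n_expand]:
        for rid, score in search_fn(get_signature(image_id)):
            scores[rid] = max(scores.get(rid, 0.0), score)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```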
Spatial re-ranking [12] adds a verification step to BOF retrieval. It checks whether the locations of matching features between the query and each top retrieval result are consistent by searching for affine transformations between the feature sets. Due to its computational cost, spatial re-ranking can only be applied to a small set of top retrieval results. Even so, it impairs the online response time of a retrieval system. We use the same approach of spatial verification to establish reliable image connections during image web construction. In our case, however, it is done in a pre-processing step that does not influence online query times.
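A sketch of such a verification loop, using OpenCV's RANSAC affine estimator; get_matches is a hypothetical helper returning corresponding point arrays between the query and one result image:

```python
import cv2
import numpy as np

def affine_inliers(query_pts, result_pts, reproj_thresh=5.0):
    """Number of matches consistent with one affine transformation (RANSAC)."""
    if len(query_pts) < 3:  # an affine model needs at least 3 correspondences
        return 0
    _, mask = cv2.estimateAffine2D(np.float32(query_pts), np.float32(result_pts),
                                   method=cv2.RANSAC,
                                   ransacReprojThreshold=reproj_thresh)
    return 0 if mask is None else int(mask.sum())

def spatial_rerank(ranked_ids, get_matches, top_k=100):
    """Re-order the top_k BOF results by their affine inlier count."""
    head = ranked_ids[:top_k]
    head.sort(key=lambda i: affine_inliers(*get_matches(i)), reverse=True)
    return head + ranked_ids[top_k:]
```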
Jégou et al. [5] present a complete state-of-the-art BOF retrieval configuration. They augment BOF image signatures with binary strings that prevent wrong visual word matches even with coarse visual dictionaries. Instead of expensive spatial re-ranking after retrieval, they exploit weak geometric consistency (WGC). Simplified geometric information is embedded directly into the inverted file. During retrieval, it penalizes images whose matching features are inconsistent with the query in terms of characteristic scale and dominant orientation. A multiple assignment strategy prevents missing valid matches of similar features due to assignment to different visual words. The modifications of Jégou et al. require an adapted version of the inverted file structure with increased memory demand. Although cheaper than full spatial re-ranking, WGC constraints slow down retrieval. We adopt WGC constraints, but instead of penalizing inconsistent images during retrieval, we use WGC checks during image web construction to further increase the probability of correctness when establishing image relations. The result of our approach is a set of enriched image signatures that may be used in any classical BOF retrieval system.
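The binary-string idea can be sketched as follows. This is a simplified illustration, not Jégou et al.'s exact Hamming embedding: the signature length, threshold, and inverted-file layout are assumptions.

```python
def hamming(a, b):
    """Hamming distance between two binary signatures stored as integers."""
    return bin(a ^ b).count("1")

def score_images(query_features, inverted_file, max_hamming=24):
    """Vote for database images, but only count word-level matches whose binary
    signatures agree; query_features is a list of (visual_word, signature)."""
    scores = {}
    for word, sig in query_features:
        for image_id, db_sig in inverted_file.get(word, []):
            if hamming(sig, db_sig) <= max_hamming:  # filter false word matches
                scores[image_id] = scores.get(image_id, 0) + 1
    return scores
```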
3. IMAGE WEBS
4. FEATURE PROPAGATION
Features are propagated along the edges of the image web. Let $W$ be the weight matrix of the web with entries $W_{ij}$, let $D$ be the corresponding degree matrix, and let $I$ denote the identity. Propagation operates on the normalized weights with added self-loops,

$$\tilde{W} = (D + I)^{-1}(W + I), \qquad (1)$$

and iteratively updates the image signatures $Y$, starting from the original signatures $Y^{(0)}$ and damped by a parameter $\alpha$:

$$Y^{(t+1)} = \frac{1}{1+\alpha}\left(\alpha \tilde{W} Y^{(t)} + Y^{(0)}\right). \qquad (2)$$
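A minimal numpy sketch of this update as reconstructed above; the function name and the toy web below are illustrative:

```python
import numpy as np

def propagate_signatures(W, Y0, alpha=0.5, iterations=3):
    """Propagate BOF signatures over an image web.

    W  : (n, n) symmetric edge weight matrix of the image web
    Y0 : (n, k) original visual word histograms, one row per image
    """
    n = W.shape[0]
    I = np.eye(n)
    D = np.diag(W.sum(axis=1))              # degree matrix
    W_norm = np.linalg.solve(D + I, W + I)  # (D + I)^{-1} (W + I), Eq. (1)
    Y = Y0.astype(float).copy()
    for _ in range(iterations):
        Y = (alpha * W_norm @ Y + Y0) / (1.0 + alpha)  # Eq. (2)
    return Y

# Toy web: three images, images 0 and 1 strongly connected.
W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
Y0 = np.array([[3, 0, 1],
               [0, 2, 1],
               [1, 1, 0]], dtype=float)
print(propagate_signatures(W, Y0))
```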
5. EVALUATION

5.1 Datasets
We tested our approach on two data sets: INRIA holidays [5] and Oxford buildings [12].

Oxford buildings contains 5063 photos of several prominent buildings of Oxford along with some clutter images. Since certain objects are covered by many photos and some photos depict multiple objects, this data set is especially suited for image web construction. Because of its size, it represents a realistic application scenario. For each image, the authors provide pre-calculated features and pre-assigned visual words. The ground truth consists of 55 queries with associated relevant images. The relevant images are divided into two groups, good and ok, depending on how much of the query object is visible; we do not differentiate between these two groups. An additional group, junk, consists of images that we ignore during evaluation, as suggested in [12]. The data set refines queries with regions of interest, which we do not use.
INRIA holidays contains 1491 personal holiday photos covering diverse natural scenes and man-made environments. The structure of this data set differs considerably from Oxford buildings: it includes much more diverse, discontiguous scenes. Since only very few images belong together, it is much less suited for image web construction. Ground truth is given in the form of 500 disjoint groups of related images. Each group contains only a small number of images, 3 on average. The first image of each group serves as query, and the remaining images are the relevant retrieval results. Similar to the Oxford data set, the authors provide pre-calculated features for each image, but no pre-assigned visual words. Instead, they provide generic visual dictionaries ranging from 100 to 200k visual words. We assign visual words using FLANN [11] and the 200k dictionary. Furthermore, pre-calculated features of 1M random Flickr images are available on the INRIA holidays website. We use them to assemble distractor image signatures to test the robustness of our BOF implementation.
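The assignment step can be sketched with OpenCV's FLANN matcher; the index parameters below are illustrative, as the original FLANN settings are not stated:

```python
import cv2
import numpy as np

def assign_visual_words(descriptors, dictionary):
    """Map each local descriptor to its (approximate) nearest visual word.

    descriptors: (n, 128) float32 SIFT descriptors of one image
    dictionary : (k, 128) float32 visual word centers, e.g. k = 200000
    """
    index_params = dict(algorithm=1, trees=8)  # 1 = FLANN_INDEX_KDTREE
    search_params = dict(checks=64)
    matcher = cv2.FlannBasedMatcher(index_params, search_params)
    matches = matcher.match(descriptors, dictionary)
    return np.array([m.trainIdx for m in matches])  # visual word ids
```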
5.2 BOF Baseline
5.3
Web Construction
We used affine co-segmentation with the following parameters: We deploy the software of Mikolajczyk[10] to extract Hessian-Affine, Harris-Affine and MSER features. The
ASIFT demo code[14] adds ASIFT features. In the case
of ASIFT, we rescaled images by a factor of 0.4 to decrease the computational load. We perform feature matching with FLANN[11] and use Lowes ratio criterion[7] with
r = 0.7 for match filtering. The RANSAC implementation
of OpenCV[2] determines feature sets related by affine trans-
28
Figure 2: Part of the dense Oxford buildings image web. The largest cluster is clearly visible at the center.
Smaller clusters are located towards the left and right margins.
formations with a reprojection error of 5 pixels. We accept
feature sets if they consist of at least 20 features in both
images.
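The matching and verification stage can be sketched as follows with OpenCV; keypoint extraction is assumed to have happened already, and the helper is a simplification of the full affine co-segmentation:

```python
import cv2
import numpy as np

def affine_coseg_pair(kp1, desc1, kp2, desc2,
                      ratio=0.7, reproj=5.0, min_features=20):
    """Return verified matches between two images, or None if the pair fails."""
    matcher = cv2.FlannBasedMatcher()  # KD-tree index for float descriptors
    knn = [p for p in matcher.knnMatch(desc1, desc2, k=2) if len(p) == 2]
    good = [m for m, n in knn if m.distance < ratio * n.distance]  # ratio test
    if len(good) < min_features:
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    _, mask = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                   ransacReprojThreshold=reproj)
    if mask is None or mask.sum() < min_features:  # need >= 20 inliers
        return None
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```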
For WGC checks after affine co-segmentation we allowed a maximal variance $\sigma_o^2$ of 1.0 for orientation differences, and a maximal variance $\sigma_s^2$ of 0.1 for scale differences. We defined these thresholds after manually examining cases where affine co-segmentation yielded wrong results. We found that the variance of orientation differences is much more expressive than the variance of scale differences.
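A sketch of such a check; whether the original uses raw scale ratios or log ratios, and degrees or radians, is not stated, so those choices here are assumptions:

```python
import numpy as np

def wgc_consistent(matched_pairs, max_var_orient=1.0, max_var_scale=0.1):
    """matched_pairs: [((scale_a, angle_a), (scale_b, angle_b)), ...] per match.
    Accept the image pair if orientation and scale differences are concentrated."""
    d_orient = np.array([b[1] - a[1] for a, b in matched_pairs])
    d_orient = (d_orient + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    d_scale = np.array([np.log(b[0] / a[0]) for a, b in matched_pairs])
    return d_orient.var() <= max_var_orient and d_scale.var() <= max_var_scale
```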
We tested our configuration of affine co-segmentation by manually validating its outcome on approximately 1300 image pairs of Oxford buildings. Only 13 of them were flawed. Based on affine co-segmentation, we constructed dense image webs of Oxford buildings and INRIA holidays. We stopped the initial sparse web construction when fewer than 20 co-segmentations were successful per 1000 image pairs processed. We stopped densification when the algebraic connectivity improved by less than 5% of its initial rise. We found that a reasonable stopping criterion for densification is imperative: if all possible image connections are established, local image neighbourhoods become too big and generic for feature propagation, which results in decreased retrieval performance.
The Oxford buildings web consists of 363 distinct image clusters dominated by one large cluster with 547 images. The second largest cluster contains 100 images, and most clusters consist of 5 images or fewer. Altogether, ca. 40% of the images appear in the image web. For all other images, co-segmentation found no reliable partner. Reasons include the depiction of a singular object, large changes in acquisition conditions, or image clutter. Figure 2 shows a part of the Oxford buildings web. The INRIA holidays web consists of 328 clusters covering ca. 50% of all images. All clusters are small, with 2 to 10 images. Figure 3 shows one cluster of the INRIA holidays web in detail.
5.4 Propagation

Based on the image webs, we propagate features according to Section 4. Propagation depends on two parameters, the propagation variant and the weight α (see the sketch after this list):

1. In the default variant, we substitute the original feature set of an image with the feature set after propagation.
2. In the augmented variant, we keep the original feature set and add the propagated features to it.
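A minimal sketch of the two variants; the surviving text does not spell out how the augmented variant combines the feature sets, so simple addition of the histograms is an assumption:

```python
def apply_variant(Y0, Y_propagated, augmented=False):
    """Default: replace original signatures; augmented: keep them and add on top."""
    return Y0 + Y_propagated if augmented else Y_propagated
```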
5.5 Results
Table 1 summarizes our evaluation results on INRIA holidays in conjunction with a dictionary of 200k visual words. With a basic BOF implementation, we achieve a baseline mAP of 0.554 without distractor images. This is comparable to the baseline value reported in [5]. The results clearly show the benefit of feature propagation. Default feature propagation with α = 0.5 results in a mAP of 0.594, i.e. an improvement of 7.1% over the baseline value. No propagation variant harms retrieval. We observe the benefit of a large α, although there is no further improvement beyond α = 0.5.
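For reference, mAP is the mean of per-query average precision; a minimal sketch (without the Oxford-specific handling of junk images):

```python
def average_precision(ranked_ids, relevant):
    """AP of one ranked result list given the set of relevant image ids."""
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps)
```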
The impact of distractor images is straightforward: with more distractors added to the image collection, the chance increases that they are confused with relevant images. mAP values drop for the baseline BOF search as well as for all propagation variants. However, the performance decrease is much smaller after feature propagation, see Figure 4. With 100,000 distractors, the relative improvement over the basic BOF search rises to 30%. Image signatures clearly became more robust. Note that we used unaltered query signatures. Hence, mutual adaptations of queries and database images through feature propagation are ruled out.
For the most part, we can reproduce our observations for Oxford buildings. The baseline mAP is much smaller at 0.320. The data set contains more images of homogeneous objects, so there is more room for confusion. Furthermore, the homogeneous images exploit the expressiveness of the generic INRIA 200k dictionary only to some extent. Although performance is lower for Oxford buildings in absolute terms, the relative improvement through feature propagation is higher than for INRIA holidays. The best results are again achieved with default propagation and α = 0.5. Without distractors, we boost the mAP to 0.409, an improvement of 27.7%. The improvement is stable with regard to distractors, see Figure 5. With 100,000 distractor images, our best result is a mAP of 0.360 compared to the baseline of 0.223, a significant improvement of 60%.
We also performed feature propagation on the pre-assigned visual words of Oxford buildings. They are based on a much larger dictionary of 1M words that was, furthermore, learned on Oxford buildings itself. Naturally, it is much more expressive for this data set. We observe a high baseline mAP of 0.545. Here, we noticed dropping retrieval performance through feature propagation, see Table 3. With default propagation, mAP drops by 9.3% for α = 0.5, and by 3.3% for α = 0.9. We attribute this to the sparseness of visual words with the 1M dictionary. Sparse visual words are more likely to vanish through default propagation. This can happen to an extent where the expressiveness of image signatures suffers. Augmented propagation prevents such effects. Indeed, with α = 0.5 we achieve a mAP of 0.571, an improvement of 4.8%. We were not able to test the robustness with distractor signatures here, because the 1M word dictionary was not published.
5.6 Performance

6. CONCLUSION
Table 1: Evaluation of feature propagation (mAP) on INRIA holidays in conjunction with a generic 200k word dictionary. The best performance per row is marked with an asterisk.

                       propagation default         propagation augmented
distractors  baseline  α=0.1   α=0.5    α=0.9      α=0.1   α=0.5   α=0.9
0            0.554     0.566   0.594*   0.592      0.570   0.576   0.575
1,000        0.530     0.546   0.576*   0.574      0.548   0.562   0.563
10,000       0.463     0.487   0.533*   0.532      0.486   0.526   0.529
100,000      0.382     0.423   0.498*   0.498*     0.422   0.489   0.495
Figure 4: Impact of the number of distractor images on retrieval performance for INRIA holidays. Two panels plot mAP over the number of distractors (up to 100,000): left, default propagation; right, augmented propagation. Each panel shows the baseline and α = 0.1, 0.5, 0.9.
Table 2: Evaluation of feature propagation (mAP) on Oxford buildings in conjunction with a generic 200k word dictionary. The best performance per row is marked with an asterisk.

                       propagation default         propagation augmented
distractors  baseline  α=0.1   α=0.5    α=0.9      α=0.1   α=0.5   α=0.9
0            0.320     0.370   0.409*   0.338      0.334   0.365   0.368
1,000        0.315     0.366   0.407*   0.338      0.327   0.359   0.363
10,000       0.294     0.345   0.400*   0.333      0.306   0.340   0.344
100,000      0.223     0.284   0.360*   0.309      0.235   0.271   0.284
Figure 5: Impact of the number of distractor images on retrieval performance for Oxford buildings. Two panels plot mAP over the number of distractors: left, default propagation; right, augmented propagation. Each panel shows the baseline and α = 0.1, 0.5, 0.9.
7. ACKNOWLEDGMENTS

8. REFERENCES