Harvesting Image Databases From The Web
1 INTRODUCTION
The availability of image databases has proven invaluable for training and testing object class models during
the recent surge of interest in object recognition. However,
producing such databases containing a large number of
images and with high precision is still an arduous manual
task. Image search engines apparently provide an effortless
route, but currently are limited by poor precision of the
returned images and restrictions on the total number of
images provided. For example, with Google Image Search,
the precision is as low as 32 percent on one of the classes
tested here (shark) and averages 39 percent, and downloads
are restricted to 1,000 images.
Fergus et al. [11], [12], Lin et al. [20], Li et al. [19], Collins
et al. [8], Vijayanarasimhan and Grauman [33], and Fritz
and Schiele [14] dealt with the precision problem by
reranking the images downloaded from an image search.
The method in [12] involved visual clustering of the images
by using probabilistic Latent Semantic Analysis (pLSA) [16]
over a visual vocabulary, [19] used a Hierarchical Dirichlet
Process instead of pLSA, and [33] used multiple instance
learning to learn the visual models. Lin et al. [20] reranked
using the text on the original page from which the image
was obtained. However, for all of these methods, the yield
is limited by the restriction on the total number of images
provided by the image search.
Berg and Forsyth [5] overcome the download restriction
by starting from a Web search instead of an image search.
This search can generate thousands of images. Their method
then proceeds in two stages: First, topics are discovered
based on words occurring on the Webpages using Latent
Dirichlet Allocation (LDA) [6] on text only. Image clusters
for each topic are formed by selecting images where nearby
text is top ranked by the topic. A user then partitions the
clusters into positive and negative for the class. Second,
images and the associated text from these clusters are used
as exemplars to train a classifier based on voting on visual
(shape, color, and texture) and text features. The classifier is
then used to rerank the downloaded data set. Note that the
user labeling of clusters avoids the problem of polysemy, as
well as providing good training data for the classifier. The
method succeeds in achieving a greater yield, but at the cost
of manual intervention.
Our objective in this work is to harvest a large number of
images of a particular class automatically, and to achieve this
with high precision. Our motivation is to provide training
databases so that a new object model can be learned
effortlessly. Following [5], we also use Web search to obtain
a large pool of images and the Webpages that contain them.
The low precision does not allow us to learn a class model
from such images using vision alone. The challenge then is
how best to combine text, metadata, and visual information
in order to achieve the best image reranking.
The two main contributions are: First, we show in Section 3 that metadata and text attributes on the Webpage containing the image provide a useful estimate of the probability that the image is in class, and thence can be used to successfully rank images in the downloaded pool. Second, we show in Section 4
that this probability is sufficient to provide (noisy) training
data for a visual classifier, and that this classifier delivers a
superior reranking to that produced by text alone. Fig. 1
visualizes this two-stage improvement over the initially
downloaded images. The class-independent text ranker
significantly improves this unranked baseline and is itself
improved by quite a margin when the vision-based ranker
(trained on the text ranker results) is employed. We
compared our proposed discriminative framework (SVM)
to unsupervised methods (topic models), concluding that the
discriminative approach is better suited for this task and is thus the focus of this work.
Others have used text and images together, however, in a
slightly different setting. For example, Barnard et al. [2] use
ground-truth annotated images as opposed to noisy annota-
tion stemming from Webpages, as in our case. Other work by
Berg et al. [4] uses text from the Internet, but focuses on
identifying a specific class rather than general object classes.
We show in Section 5.1 that our automatic method
achieves superior ranking results to those produced by the
method of Berg and Forsyth [5] and also to that of Google
Image Search.
This paper is an extended version of [28]. The extensions
include: a comparison of different text ranking methods,
additional visual features (HOG), an investigation of the robustness of the cross validation to noise in the training data, and a
comparison of different topic models (for the visual features).
2 THE DATABASES
This section describes the methods for downloading the
initial pool of images (together with associated metadata)
from the Internet, and the initial filtering that is applied. For
the purpose of training classifiers and for assessing
precision and recall, the downloaded images are annotated
manually for 18 classes: airplane (ap), beaver (bv), bikes
(bk), boat (bt), camel (cm), car (cr), dolphin (dp), elephant
(ep), giraffe (gf), guitar (gr), horse (hs), kangaroo (kg),
motorbikes (mb), penguin (pg), shark (sk), tiger (tr),
wristwatch (ww), and zebra (zb).
Data collection. We compare three different approaches
to downloading images from the Web. The first approach,
named WebSearch, submits the query word to Google Web
search and all images that are linked within the returned
Webpages are downloaded. Google limits the number of
returned Webpages to 1,000, but many of the Webpages
contain multiple images, so in this manner, thousands of
images are obtained. The second approach, ImageSearch,
starts from Google image search (rather than Web search).
Google image search limits the number of returned images
to 1,000, but here, each of the returned images is treated as a seed: further images are downloaded from the Webpage
where the seed image originated. The third approach,
GoogleImages, includes only the images directly returned
by Google image search (a subset of those returned by
ImageSearch). The query can consist of a single word or more specific descriptions such as "penguin animal" or "penguin OR penguins". Images smaller than 120 x 120 pixels are discarded. In addition to the images, text surrounding the
image HTML tag is downloaded, together with other
metadata such as the image filename.
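To make the collection step concrete, here is a minimal Python sketch of the WebSearch-style harvesting: it fetches one seed Webpage, extracts the image links it contains, and keeps a window of surrounding HTML from which the textual features of Section 3 can later be derived. This is not the crawler used in the paper; the Google query itself, the actual image download, the 120 x 120 size filter, and de-duplication are omitted, and the helper names and regular expressions are our own.

```python
import re
import urllib.request
from urllib.parse import urljoin

IMG_TAG = re.compile(r'<img[^>]*\bsrc=["\']([^"\']+)["\']', re.IGNORECASE)
TAG_STRIP = re.compile(r'<[^>]+>')  # crude removal of HTML tags

def harvest_page(page_url):
    """Return (image_url, surrounding_html_text) pairs for one seed Webpage."""
    html = urllib.request.urlopen(page_url, timeout=10).read().decode('utf-8', 'ignore')
    results = []
    for match in IMG_TAG.finditer(html):
        img_url = urljoin(page_url, match.group(1))
        # keep a window of the page around the image tag; the textual
        # features of Section 3 (context10, contextR, filename, ...) can
        # later be derived from this window and from img_url itself
        start, end = max(0, match.start() - 2000), match.end() + 2000
        context = TAG_STRIP.sub(' ', html[start:end])
        results.append((img_url, context))
    return results
```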
Ground-truth annotation. In a similar manner to Fergus
et al. [12], images are divided into three categories:
in-class-good. Images that contain one or many class
instances in a clearly visible way (without major occlusion,
lighting deterioration, or background clutter, and of
sufficient size).
in-class-ok. Images that show parts of a class instance, or
obfuscated views of the object due to lighting, clutter,
occlusion, and the like.
nonclass. Images not belonging to in-class.
The good and ok sets are further divided into two
subclasses:
abstract. Images that do not resemble realistic natural
objects (e.g., drawings, nonrealistic paintings, comics, casts,
or statues).
nonabstract. Images not belonging to the previous class.
Example annotations for the class penguin are shown
in Fig. 2. The full data set is published in [29]. As is usual
in annotation, there are ambiguous cases, e.g., deciding
when occlusion is sufficiently severe to classify as ok
rather than good, or when the objects are too small. The
annotations were made as consistent as possible by a final
check from one person. Note that the abstract versus
nonabstract categorization is not general but is suitable for
the object classes we consider in this paper. For example,
it would not be useful if the class of interest were "graph" or "statue" or a similar, more abstract category.
Fig. 1. Text and visual ranking versus unranked baseline: precision-recall plot for the text reranking, the visual ranking trained on the text ranking, and the (unranked) original downloaded images, for the shark query.

Fig. 2. Image annotations: example images corresponding to annotation categories for the class penguin.

Table 1 details the statistics for each of the three retrieval techniques (WebSearch, ImageSearch, and GoogleImages). Note that some images are common
between the methods. ImageSearch gives a very low
precision (only about 4 percent) and is not used for the
harvesting experiments. This low precision is probably due
to the fact that Google selects many images from Web-
gallery pages which contain images of all sorts. Google is
able to select the in-class images from those pages, e.g., the
ones with the object-class in the filename; however, if we
use those Webpages as seeds, the overall precision greatly
decreases. Therefore, we only use WebSearch and
GoogleImages, which are merged into one data set per
object class. Table 2 lists the 18 categories downloaded and
the corresponding statistics for in-class and nonclass images.
The overall precision of the images downloaded for all
18 classes is about 29 percent.
Due to the great diversity of images available on the
Internet and because of how we retrieve the images, it is
difficult to make general observations on how these
databases look. However, it is clear that polysemy affects
the returned images. Interestingly, this is not a problem that
could be predicted directly from the English word, since most of the classes we search for don't have direct polysemous meanings, i.e., they are not polysemous in the sense of "bank" (as in a place to get money, or a river bank), for example. It is rather that the words correspond to brands or product names ("leopard tank"), or team names (the NHL ice hockey team "San Jose Sharks"), or are used as attributes ("tiger shark"). Apart from that, the in-class images occur in almost all variations imaginable, such as sharks crashed into
houses or other oddities. Even though context [31] can
clearly be important in reranking the images (e.g., camel
and kangaroo in desert-like images), it will have its
limitations due to the variety of occurrences of the object.
2.1 Removing Drawings and Symbolic Images
Since we are mostly interested in building databases for
natural image recognition, we ideally would like to remove
all abstract images from the downloaded images. However,
separating abstract images from all others automatically is
very challenging for classifiers based on visual features.
Instead, we tackle the easier visual task of removing
drawings and symbolic images. These include: comics,
graphs, plots, maps, charts, drawings, and sketches, where
the images can be fairly simply characterized by their visual
features (see below). Example images are shown in Fig. 3.
Their removal significantly reduces the number of nonclass
images, improving the resulting precision of the object class
data sets as shown in Table 2 (overall precision goes from 29
to 35 percent). Filtering out such images also has the aim of
removing this type of abstract image from the in-class images.
Learning the filter. We train a radial basis function
Support Vector Machine (SVM) on a hand-labeled data set
(examples in Fig. 3). After the initial training, no further user
interaction is required. In order to obtain this data set, images
were downloaded using ImageSearch with one level of
recursion (i.e., Webpages linked from seed Webpages are
also used) with queries such as "sketch", "drawing", or "draft". The goal was to retrieve many images and then select
suitable training images manually. The resulting data set
consists of approximately 1,400 drawings and symbolic
images and 2,000 nondrawings and symbolic images.
Three simple visual-only features are used: 1) a color histogram, 2) a histogram of the L2-norm of the gradient, and 3) a histogram of the angles (0 ... pi) weighted by the L2-norm of the corresponding gradient. In all cases, 1,000 equally spaced bins are used. The motivation behind this choice of features is that drawings and symbolic images are characterized by sharp edges in certain orientations and/or a distinctive color distribution (e.g., only a few colors in large areas). The method achieves around 90 percent classification accuracy on
the drawings and symbolic images database (using two-fold cross-validation).

TABLE 1. Statistics by source: the statistics of downloaded images for different retrieval techniques for 18 classes.

TABLE 2. Image class statistics of the original downloaded images using WebSearch and GoogleImages only, and after applying the drawing and symbolic images removal filter.

Fig. 3. Drawings and symbolic images: examples of positive and negative training images.
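As a concrete illustration of the filter just described, the sketch below computes the three histogram features and trains an RBF SVM on them. The 1,000-bin counts follow the text; the color space, the 10 x 10 x 10 binning of the color histogram, the normalization, and the SVM parameters are our assumptions rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.svm import SVC

def filter_features(img):
    """img: HxWx3 float array in [0, 1]; returns a 3,000-D feature vector."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.sqrt(gx ** 2 + gy ** 2)             # L2-norm of the gradient
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # angles folded into [0, pi)

    col_hist, _ = np.histogramdd(img.reshape(-1, 3),
                                 bins=(10, 10, 10), range=[(0, 1)] * 3)
    mag_hist, _ = np.histogram(mag, bins=1000, range=(0, mag.max() + 1e-8))
    ang_hist, _ = np.histogram(ang, bins=1000, range=(0, np.pi), weights=mag)

    feats = np.concatenate([col_hist.ravel(), mag_hist, ang_hist]).astype(float)
    return feats / (np.linalg.norm(feats) + 1e-8)

# Training on the hand-labeled drawings/non-drawings set (X, y assumed given):
clf = SVC(kernel='rbf', gamma='scale')           # RBF-kernel SVM, as in the text
# clf.fit(X, y); keep_image = clf.predict(filter_features(new_img)[None]) == 0
```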
This classifier is applied to the entire downloaded
image data set to filter out drawing and symbolic images,
before further processing. The total number of images that
are removed for each class is shown in Table 2. In total,
39 percent of nonclass images are removed over all classes.
The remaining images are those used in our experiments.
As well as successfully removing nonclass images, the filter also succeeds in removing an average of 60 percent (123 images) of the in-class abstract images, with a range between
45 percent (for motorbikes, 40 images) and 85 percent (for
wristwatch, 11 images). There is some loss of the desired
in-class nonabstract images, with, on average, 13 percent
(90 images) removed, though particular classes lose a
relatively high percentage (28 percent for shark and
wristwatch). Even though this seems to be a high loss,
the precision of the resulting data sets is improved in all
cases except for the class shark.
3 RANKING ON TEXTUAL FEATURES
We now describe the reranking of the returned images
based on text and metadata alone. Here, we follow and
extend the method proposed by Frankel et al. [13] in using a
set of textual attributes whose presence is a strong
indication of the image content.
Textual features. We use seven features from the text
and HTML tags on the Webpage: contextR, context10, filedir,
filename, imagealt, imagetitle, and Websitetitle.
Filedir, filename, and Websitetitle are self-explanatory.
Context10 includes the 10 words on either side of the image
link. ContextR describes the words on the Webpage between
11 and 50 words away from the image link. Imagealt and
imagetitle refer to the alt and title attribute of the image
tag. The features are intended to be conditionally indepen-
dent given the image content (we address this indepen-
dence below). It is difficult to compare directly with the
features in [13] since no precise definition of the features
actually used is given.
Context here is defined by the HTML source, not by the rendered page, since the latter depends on screen resolution and browser type, and rendering is an expensive operation. In the text processing, a standard stop list [24] and the Porter stemmer [25] are used. In addition, HTML tags and domain-specific stop words (such as "html") are ignored.
We also experimented with a number of other features,
such as the image MIME type (gif, jpeg, etc.), but found
that they did not help discrimination.
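A small sketch of how the seven attributes can be turned into the binary feature vector used in the next section. The Porter stemming and stop-list filtering described above are omitted, and the parsing is our own simplification; only the feature definitions themselves follow the text.

```python
import re

FEATURES = ['contextR', 'context10', 'filedir', 'filename',
            'imagealt', 'imagetitle', 'websitetitle']

def tokenize(text):
    return re.findall(r'[a-z]+', text.lower())

def binary_features(query, fields, words_around_link):
    """fields: dict of raw strings per attribute; words_around_link: words on
    the page ordered by their distance from the image link."""
    fields = dict(fields)
    fields['context10'] = ' '.join(words_around_link[:10])   # nearest 10 words
    fields['contextR'] = ' '.join(words_around_link[10:50])  # words 11..50 away
    q = query.lower()
    return [int(q in tokenize(fields.get(f, ''))) for f in FEATURES]
```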
3.1 Image Ranking
Using these seven textual features, the goal is to rerank the
retrieved images. Each feature is treated as binary: true if it contains the query word (e.g., "penguin") and false otherwise. The seven features define a binary feature vector for each image, $\mathbf{a} = (a_1, \ldots, a_7)$, and the ranking is then based on the posterior probability $P(y = \text{in-class} \mid \mathbf{a})$ of the image being in-class, where $y \in \{\text{in-class}, \text{nonclass}\}$ is the class label of an image.
We learn a class-independent ranker in order to rerank the images based on the posterior $P(y \mid \mathbf{a})$. To rerank images for one particular class (e.g., "penguin"), we do not employ the ground-truth data for that class. Instead, we train the Bayes classifier (specifically, we learn $P(\mathbf{a} \mid y)$, $P(y)$, and $P(\mathbf{a})$)
using all available annotations except the class we want to
rerank. This way, we evaluate performance as a completely
automatic class independent image ranker, i.e., for any new
and unknown class, the images can be reranked without
ever using labeled ground-truth knowledge of that class.
3.1.1 Ranking Models
We compare different Bayesian posterior models for $P(y = \text{in-class} \mid \mathbf{a})$. Specifically, we looked at the suitability of the following decompositions:
Chow-Liu dependence tree decomposition [7]:

$$P(\mathbf{a} \mid y) \;\propto\; \prod_{i=1}^{8} P\bigl(x_i \mid x_{\pi(i)}\bigr), \qquad (1)$$

with $\mathbf{x} = (a_1, \ldots, a_7, y)$ and $\pi$ being a permutation of $1, \ldots, 8$. The Chow-Liu model approximates the full joint dependency graph as a tree by retaining the edges between variables with the highest mutual information.
Naive Bayes model:

$$P(\mathbf{a} \mid y) \;\propto\; \prod_{i=1}^{7} P(a_i \mid y). \qquad (2)$$
Given the class label for an image, the text features are
assumed to be independent. For our application, this is
obviously not the case, e.g., filename and image alter-
native tag are highly correlated.
Pairwise dependencies:

$$P(\mathbf{a} \mid y) \;\propto\; \prod_{i,j=1}^{7} P(a_i, a_j \mid y). \qquad (3)$$
Only pairwise dependencies are modeled. This is similar to
the Chow-Liu model, but less sparse.
Full joint:

$$P(\mathbf{a} \mid y) \;\propto\; P(a_1, \ldots, a_7 \mid y). \qquad (4)$$
The full joint probability distribution is learned. If the
amount of available training data is too small, the learned
model can be inaccurate.
Mixed naive Bayes:

$$P(\mathbf{a} \mid y) \;\propto\; P(a_1, \ldots, a_4 \mid y) \prod_{i=5}^{7} P(a_i \mid y), \qquad (5)$$

where $P(a_1, \ldots, a_4 \mid y)$ is the joint probability of the first four textual features (contextR, context10, filedir, and filename).
Logistic regression. Additionally, we evaluate the performance using the discriminative logistic regression model, where

$$P(y \mid \mathbf{a}) \;=\; \frac{1}{1 + e^{-\mathbf{w}^{T}\mathbf{a}}}. \qquad (6)$$
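As an illustration of how the mixed model (5) can serve as a class-independent ranker, the sketch below estimates the required probabilities from the annotations of the other classes and scores each image by $P(y = \text{in-class} \mid \mathbf{a})$. The Laplace smoothing and the data structures are our choices, not taken from the paper.

```python
import numpy as np
from collections import Counter

def train_mixed_nb(A, y, alpha=1.0):
    """A: N x 7 binary matrix, y: N labels (1 = in-class, 0 = nonclass),
    taken from the annotated classes other than the one being ranked."""
    model = {}
    for c in (0, 1):
        Ac = A[y == c]
        joint = Counter(map(tuple, Ac[:, :4]))       # counts for (a1..a4) | y
        model[c] = {
            'prior': (np.sum(y == c) + alpha) / (len(y) + 2 * alpha),
            'joint': {k: (v + alpha) / (len(Ac) + 16 * alpha) for k, v in joint.items()},
            'floor': alpha / (len(Ac) + 16 * alpha),  # unseen (a1..a4) configurations
            'indep': (Ac[:, 4:].sum(axis=0) + alpha) / (len(Ac) + 2 * alpha),
        }
    return model

def posterior_in_class(model, a):
    """P(y = in-class | a) for a length-7 binary numpy vector a."""
    score = {}
    for c in (0, 1):
        m = model[c]
        p = m['prior'] * m['joint'].get(tuple(a[:4]), m['floor'])
        p *= np.prod(np.where(a[4:] == 1, m['indep'], 1 - m['indep']))
        score[c] = p
    return score[1] / (score[0] + score[1])
```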
3.1.2 Text Reranking Results
We assess the performance by reporting precision at
various points of recall as well as average precision in
Table 3. Figs. 4 and 5 give an overview of the different
methods and their performance. Fig. 6 shows the precision-
recall curves for selected classes using the mixed naive
model. It can clearly be seen that precision is greatly increased at the lower recall levels compared to the precision of the unranked downloads given in Table 2.
The mixed model (5) gave slightly better performance than the other factorizations at 100 images recall, reflecting the fact that the first four features are less independent of each other than the remaining three. Overall, all models perform comparably and the differences are negligible, except for the Chow-Liu dependence tree, which performs slightly worse. We chose the mixed naive model for our experiments, as precision at 100 images recall is more closely related to our selection of training data than average precision or precision at 15 percent recall.
A separate set of experiments was carried out to measure
how the performance of the text ranker varies with the
number and choice of classes used for training. Ideally, we
would like to compare $P(\mathbf{a} \mid y)$, $P(y)$, and $P(\mathbf{a})$ learned using
different numbers of training classes. However, given our
goal of ranking images, we instead compare these prob-
abilities indirectly by assessing precision at 15 percent
recall. We find that the performance is almost unaltered by
the choice of training classes provided more than five
classes (chosen randomly) are used for training.
Fig. 4. Average precision-recall of text rankers: the precision-recall curve averaged over all 18 classes for each of the described ranking methods.

Fig. 5. Comparison of text rankers: the precision at 15 percent recall for all 18 classes and four different models. See Table 3 for details and the average precision over all classes.

TABLE 3. Precision of textual reranking: the performance of the textual reranking for all 18 classes over different models: precision at 15 percent recall, precision at 100 images recall, and average precision. The precision is given as a percentage. The last column gives the average over all classes. Mixed naive Bayes performs best and was picked for all subsequent experiments.

Discussion. As can be seen in Fig. 6, the text reranker performs well, on average, and significantly improves the precision up to quite a high recall level (see Figs. 8 and 9 for top-ranked images). In Section 4, we will show that this is sufficient to train a visual classifier. For some classes, the text ranker performs very well (e.g., wristwatch, giraffe); for others, it performs rather poorly (e.g., airplane, beaver, camel, and tiger). Visual inspection of the highly ranked
outlier (nonclass) images in the text-ranked lists gives some
explanation for these performances. Classes that perform
well (wristwatch, giraffe) generally have outliers that are
unrelated to each other. In contrast, for the classes that
perform poorly, the outlier images are related and result from lack of discriminativity of the query word: for example, for
airplanes, there are images of airplane food, airports, toy
airplanes, paper airplanes, airplane interiors, and advertise-
ments with comic airplanes. Other classes suffer from the
type of polysemy described in Section 2: For camels, there are
brand and cigarette-related outliers, and for tiger, there is the
attribute problem with images of helicopters, tanks, fish
(sharks), boxing, golf, stones, and butterflies.
These examples cover very different classes, from animals
to various man-made objects. Generalization to other well-
defined object classes is demonstrated in the experiments of
Section 5.1 on the object classes of [5] and [12].
We investigated two alternative text-based classifiers,
pLSA and SVM, in addition to the Bayes estimator finally
adopted, but found that they had inferior performance. For
the SVM, the same binary text features $\mathbf{a}$ were used. It is
possible that the binary features led to the poor perfor-
mance of the SVM. For the pLSA, we used context10 and
contextR (similar to [5]). Due to problems with the pLSA clustering, the difficulty of selecting the right topic without user interaction as in [5], and the question of how
to use the additional binary features (e.g., filename) in a
principled manner, we adopted the Bayes estimator instead.
The text reranking is not meant to compete with the
performance of Internet image search engines, which could
also be used in our algorithm and, given their recent
performance improvements, would be a reasonable choice.
Instead, we decided to use this simple setup to gather the training data for our visual models, with a focus on independence and a more controlled setup, given the fast-changing quality of image search engines.
4 RANKING ON VISUAL FEATURES
The text reranking of Section 3 associates a posterior
probability with each image as to whether it contains the
query class or not. The problem we are now faced with is
how to use this information to train a visual classifier that
would improve the ranking further. The problem is one of
training from noisy data: We need to decide which images
to use for positive and negative training data and how to
select a validation set in order to optimize the parameters of
the classifier.
We first describe the visual features used and then how
the classifier is trained.
Visual features. We follow the approach of [12] and
use a variety of region detectors with a common visual
vocabulary in the bag of visual words model framework
(BOW). All images are first resized to 300 pixels in width.
Regions are detected using difference of Gaussians,
Multiscale-Harris [22], Kadir's saliency operator [18], and
points sampled from Canny edge points. Each image
region is represented as a 72-dimensional SIFT [21]
descriptor. A separate vocabulary consisting of 100 visual
words is learned for each detector using k-means, and
these vocabularies are then combined into a single one of
400 words. Finally, the descriptor of each region is assigned to its nearest word in the vocabulary. The software for the detectors
is obtained from [32]. Fuller implementation details are
given in [12] and are reproduced in our implementation.
The 72-dimensional SIFT is based on the work in [12] and
driven by the motivation that a coarser spatial binning can
improve generalization, as opposed to a finer binning that
is more suitable for particular object matching.
In addition to the descriptors used in [12], we add the
widely used HOG descriptor [9], computed over the whole
image to incorporate some spatial layout into the model. It
was also used in a similar setting in [14]. We use a cell size
of 8 pixels, a block size of one cell, and 9 contrast invariant
gradient bins. This results in a 900-dimensional feature
vector, as the images were resized to 80 x 80 pixels. The two
descriptors are concatenated resulting in a 1,300-dimen-
sional feature vector per image.
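The sketch below assembles this representation: a 400-bin bag-of-visual-words histogram from pre-extracted region descriptors, concatenated with a whole-image HOG computed on an 80 x 80 version of the image (8-pixel cells, one-cell blocks, 9 orientation bins, giving 900 dimensions). We use scikit-image's HOG for illustration; the region detectors, the 72-dimensional descriptors, and the k-means vocabulary are assumed to be computed elsewhere and will not match the implementation of [12] exactly.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def bow_histogram(descriptors, vocabulary):
    """descriptors: M x 72 region descriptors, vocabulary: 400 x 72 centres."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # nearest visual word per region
    h = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return h / (h.sum() + 1e-8)

def image_feature(gray_image, descriptors, vocabulary):
    """400-D BOW histogram concatenated with a 900-D whole-image HOG."""
    small = resize(gray_image, (80, 80))
    h = hog(small, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(1, 1))               # 10 x 10 cells x 9 bins = 900-D
    return np.concatenate([bow_histogram(descriptors, vocabulary), h])
```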
4.1 Training the Visual Classifier
At this point, we can select the $n^+$ top text-ranked images (on the order of 100) as positive training data for the visual classifier, even though some of them will be nonclass outliers. However, we can assume that the nonclass images are not visually consistent, an assumption verified to some extent by the results in Section 4.2. The case of negative images is more favorable: We select $n^-$ negative (background) training images. An SVM is then trained by solving
$$\min_{\mathbf{w},\, \boldsymbol{\xi}} \;\; \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w} \;+\; C^{+}\!\sum_{i:\, y_i = 1}\xi_i \;+\; C^{-}\!\sum_{j:\, y_j = -1}\xi_j \qquad (7)$$

subject to

$$y_l\,\bigl(\mathbf{w}^{T}\mathbf{x}_l + b\bigr) \;\ge\; 1 - \xi_l, \qquad (8)$$

$$\xi_l \;\ge\; 0, \qquad l = 1, \ldots, n, \qquad (9)$$
. 9
where x
|
are the training vectors and y
|
2 f1. 1g the class
labels. C
and C
, and C
background
images. We require a performance measure for the cross
validation and use precision at 15 percent recall, computed
on the validation subset of the i
images (treated as
positive) and the i
and C
and i
and i
and i
since these
numbers are very class dependent. Table 4 indicates that
using more images in the background class i
tends to
improve the performance but there is no real difference
between using 150,1.000 and 250,1.000 (i
/i
), which
perform at 68.4%1.9 and 68.0%2.0, and thus are not
significantly different. All numbers in this section report
precision at 15 percent recall.
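A minimal sketch of the classifier in (7)-(9) using scikit-learn, where the asymmetric penalties are expressed through per-class weights (the effective costs being $C^+ = C \cdot w_+$ and $C^- = C \cdot w_-$). The kernel choice and the default values are placeholders to be set by the cross validation described here.

```python
import numpy as np
from sklearn.svm import SVC

def train_visual_svm(X_pos, X_neg, c_pos=1.0, c_neg=1.0):
    """X_pos: n+ x D features of the top text-ranked images (noisy positives),
    X_neg: n- x D features of background images."""
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos), dtype=int),
                        -np.ones(len(X_neg), dtype=int)])
    clf = SVC(kernel='linear', C=1.0,
              class_weight={1: c_pos, -1: c_neg})  # asymmetric slack penalties
    clf.fit(X, y)
    return clf

# Images are then reranked by the signed SVM response:
# scores = clf.decision_function(X_all); order = np.argsort(-scores)
```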
It can be seen (Table 4) that HOG alone performs significantly worse than the bag of visual words (57.9% ± 1.8), but the combination of BOW and HOG improves the overall performance to 69.8% ± 2.1, compared to 68.0% ± 2.0 for BOW alone.
TABLE 4. ... 1,000 background images as before. The second to last column (avg.) gives the average over all classes. The last column states the mean of the classwise standard deviations over five runs of cross validation, as well as the standard deviation of the means over all classes, in parentheses.

Fig. 7. Overview of the text+vision (t+v) image harvesting algorithm.

In order to select the appropriate parameter values, we use cross validation, where the validation set is part of the $n^+$ and $n^-$ training images. Two failure modes can occur: 1) All images are classified as foreground; this is detectable on the validation set due to a low precision at 15 percent recall. 2) All images (training and validation sets) are classified as background. This leads to bad, but detectable, performance as well.
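The cross-validation measure itself is easy to compute from the classifier responses; the following is a straightforward reimplementation of precision at a given recall level (15 percent here), not the authors' code.

```python
import numpy as np

def precision_at_recall(scores, labels, recall_level=0.15):
    """scores: higher = more confident in-class; labels: 1 in-class, 0 otherwise."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    recall = tp / max(labels.sum(), 1)
    precision = tp / np.arange(1, len(labels) + 1)
    idx = np.searchsorted(recall, recall_level)   # first rank reaching the recall level
    return precision[min(idx, len(precision) - 1)]
```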
Here, we describe a slight adjustment to this method,
which ignores difficult images. Parameter settings that
classify (almost) all images as foreground or background are not
useful; neither are those that overfit to the training data. We
reject those parameter settings. We then use the good
parameter settings to train and classify all images. By
looking at the distribution of SVM responses (over all
parameter settings), we are able to eliminate intermediate
images, i.e., images that are not classified as positive or
negative images in the majority of cases. We assume that
those images are difficult to classify and we don't use them in our model selection step, because we cannot rely on their class labels being correct due to the noise. Note that training is still performed on all images; the difficult images are only discarded during the computation of the precision at 15 percent recall during the cross validation.

Fig. 8. Comparing the top-ranked 36 images using the text ranker and the full system: red boxes indicate false positives.
This method does not give a significant improvement over all classes, but improves the performance dramatically for some classes, e.g., penguin to 90.2% ± 5.0 from 78.9% ± 13.2. This modified version of the cross validation is denoted by C in Table 4.
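A sketch of this adjusted model selection: degenerate parameter settings are rejected, and only images whose classification is stable across the remaining settings are kept when computing precision at 15 percent recall. The thresholds are illustrative; the paper does not give exact values.

```python
import numpy as np

def stable_images(responses, agree=0.8, min_frac=0.05, max_frac=0.95):
    """responses: S x N SVM outputs (S parameter settings, N images).
    Returns a boolean mask of images whose label is stable across settings."""
    signs = responses > 0
    pos_frac = signs.mean(axis=1)
    keep = (pos_frac > min_frac) & (pos_frac < max_frac)  # drop degenerate settings
    if not keep.any():                                    # fall back if all rejected
        keep[:] = True
    votes = signs[keep].mean(axis=0)
    return (votes >= agree) | (votes <= 1 - agree)

# Precision at 15 percent recall is then computed only over images for which
# stable_images(...) is True during the model selection step.
```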
Fig. 9. Comparing the top-ranked 36 images using the text ranker and the full system: red boxes indicate false positives.

We next determine how much the performance is affected by the noise in the training data by training the SVM on ground-truth positive data, i.e., instead of selecting $n^+$ images from the text-ranked images, we select $n^+$ in-class images using the ground-truth labeling. We find that the text+vision system performs well for the classes where the text ranking performs sufficiently well (see Figs. 8 and 9 for the top images returned by the text ranker): HOG+BOW versus ground truth gives, e.g., car 93.5% ± 7.1 versus 98.6% ± 1.1, or giraffe 88.3% ± 2.7 versus 91.4% ± 4.8. If the text ranking fails, the ground truth performs, as is to be expected, much better than the text-ranking-based training, e.g., for airplane, camel, and kangaroo. These experiments show that the SVM-based classifier is relatively insensitive to noise in the training data, as long as the noise level is reasonable.
As a baseline comparison, we investigate the performance if no text reranking is used, but the $n^+$ = 150, $n^-$