
Information Sciences 430–431 (2018) 331–348


Large-scale semantic web image retrieval using bimodal deep learning techniques

Changqin Huang a,b,∗, Haijiao Xu a,∗, Liang Xie c, Jia Zhu b, Chunyan Xu d, Yong Tang b

a School of Information Technology in Education, South China Normal University, Guangzhou, China
b Guangdong Engineering Research Center for Smart Learning, South China Normal University, Guangzhou, China
c School of Science, Wuhan University of Technology, Wuhan, China
d School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Article history: Received 24 July 2017; Revised 18 November 2017; Accepted 20 November 2017; Available online 21 November 2017

Keywords: Convolutional neural networks; Multi-concept scene classifiers; Concept based image retrieval; Bimodal learning

Abstract: Semantic web image retrieval is useful to end-users for semantic image searches over the Internet. This paper aims
to develop image retrieval techniques for large-scale web image databases. An advanced retrieval system, termed Multi-concept
Retrieval using Bimodal Deep Learning (MRBDL), is proposed and implemented using Convolutional Neural Networks (CNNs),
which can effectively capture semantic correlations between a visual image and its free contextual tags. Different from existing
approaches using multiple and independent concepts in a query, MRBDL considers multiple concepts as a holistic scene for
retrieval model learning. In particular, we first use a bimodal CNN to train a holistic scene classifier in two modalities, and then
semantic correlations of the sub-concepts included in the images are leveraged to boost holistic scene recognition. The predicted
semantic scores obtained from the holistic scene classifier are combined with complementary information on web images to
improve the retrieval performance. Experiments have been carried out over two publicly available web image databases. The
results show that our proposed approach performs favorably compared with several other state-of-the-art methods.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

It is well known that social media data has become an indispensable part of our lives, and large-scale web information retrieval
has become an important topic in information technology. A common characteristic of social media is that users commonly add extra
information, directly or indirectly, to the media. This metadata associates multi-concept semantics with the media that can be
used for retrieval. For example, web users often assign a variety of social tags [8] or comments to web images for sharing,
and then retrieve and utilize them via social websites. Therefore, for large-scale retrieval over semantic image databases, an
efficient approach to processing users’ semantic queries is vital.
The tags or textual descriptions of web images are informative but mostly noisy. Fig. 1 shows some query samples
from the NUS-WIDE dataset [5]. Obviously, the social tags associated with web images, conveying semantic information, can be
taken as a semantic description. Hence, a straightforward idea for implementing such a retrieval system is to use the web image
tags in the retrieval system design. However, these tags are often ambiguous and noisy, leading to undesirable retrieval
performance. For example, as shown in Fig. 1, the underlined tags such as “interestingness”, “bravo” and “national” in the
first row do not directly reflect the visual content, and the tags such as “Oregon” and “speed” in the second row only
partially depict the visual content. Although the noisy social tags with weak semantic information cannot be directly used
as retrieval concepts, they can be considered as auxiliary low-level features (i.e. text modality) for image retrieval.

Fig. 1. Query samples from the NUS-WIDE dataset. Given a multi-concept query Q = “clouds, person, snow”, example images from the training set and the
test set are shown in the top and bottom rows, respectively.

∗ Corresponding author.
E-mail addresses: [email protected] (C. Huang), [email protected] (H. Xu).
https://doi.org/10.1016/j.ins.2017.11.043
0020-0255/© 2017 Elsevier Inc. All rights reserved.
For web image retrieval, conventional unimodal approaches employ either visual modality [32] or text modality [8]. To
boost web image retrieval performance, multi-modal and cross-modal retrieval approaches, exploring the correlations of
these two modalities, have been proposed in [30]. Most existing work concentrates on single-concept-based image retrieval,
where each query is assumed to have only one concept. This is inconsistent with real-world scenarios, where a user often
conducts retrieval with multiple concepts, namely Multi-Concept-based Image Retrieval (MCIR). As shown in Fig. 1, given
a multi-concept query Q = “clouds, person, snow”, web images simultaneously describing the three concepts are returned
from the database. To tackle the MCIR problem, traditional approaches are ineffective, as a multi-concept scene may contain
unique visual characteristics that are difficult to identify solely by single-concept classifiers [11]. Thus, further research on
multi-concept-based image retrieval is significant and useful.
Recently, CNNs have been applied for single-concept-based image retrieval [23]. The achieved performance is promising,
and indicates that deep descriptors learned by CNN can well capture the underlying semantic structures of images. Inspired
by this work, we propose a multi-concept retrieval approach using deep learning techniques to resolve the MCIR problem.
In our proposed framework, MRBDL effectively combines the single-concept classifier and the holistic scene classifier in
the visual and text modalities, respectively. From our observations on experimental results, such a design schema can sub-
stantially improve the discriminative power of the classifier for multi-concept scene recognition. In our proposed MRBDL,
we first devise a bimodal CNN where the training images and associated texts are separately fed into the corresponding
convolution block layers and the Fully-Connected (FC) classifier layers. The FC classifier layer is composed of two types of
classifiers: the single-concept FC classifier, which best suits single-concept recognition, and the multi-concept scene
FC classifier, which contributes to holistic scene recognition. Next, a two-phase training strategy is proposed to train the bimodal
CNN. Finally, the semantic correlations among concepts are utilized to estimate the semantic scores of the concepts in or-
der to enhance the discriminative capability of FC classifiers. If a concept Cj and its related semantic concepts Cr have a
high co-occurrence frequency in the image set, we boost the semantic score of predicting this concept Cj . To combine the
complementary information from the visual and text modalities, we make an ensemble of these predicted semantic scores
by a fusion operator. To compensate for the varying frequencies of concepts in imbalanced image datasets [29], the
gradient descent algorithm is applied to maximize the log-likelihood of the semantic scores over the training images.
The rest of the paper is organized as follows. Section 2 briefly reviews some related work. Section 3 details the proposed
MRBDL framework. Section 4 describes our experimental setup. Section 5 reports the experiments with results and analysis.
Finally, Section 6 concludes this paper.

2. Related work

This section provides some background knowledge, including unimodal, multi-modal and cross-modal image retrieval
techniques, and deep learning concepts associated with CNNs.

2.1. Unimodal learning

Unimodal image retrieval systems can be roughly grouped into two categories: Content-Based Image Retrieval (CBIR)
and COncept-based Image Retrieval (COIR). The design of CBIR systems is usually based on local and global visual descriptors,
with an image as a query, and certain similarity metrics to measure the closeness of feature vectors [1]. COIR
systems can be grouped into generative, discriminative and nearest-neighbor approaches. Generative approaches
learn a joint probability distribution over images and concepts. Discriminative approaches learn a mapping from images to
concepts, such as a Support Vector Machine (SVM) [2] and Stochastic Configuration Networks (SCN) [27]. In contrast, the non-
parametric nearest-neighbor-based methods are simpler than generative and discriminative algorithms: the
nearest-neighbor classifier assigns a class to an image by a majority vote of its nearest neighbors. In [15], a greedy label
transfer approach was proposed to annotate images from their nearest neighbors. The Tag Propagation approach (TagProp)
[11] propagates labels from annotated images to unannotated images via a weighted nearest-neighbor
graph.
To reduce the complexity in system design, many of the proposed approaches ignore the valuable semantics among
concepts. Alternatively, some others consider a unified learning framework to integrate the semantics for image retrieval.
Cilibrasi et al. [6] proposed a Google semantic distance to extract rich semantics of words and phrases. However, neither
the common WordNet ontology nor the Google distance based on the common Google corpus fully reflects the characteristics
of the specific image sets. This may impact the performance of concept-based image retrieval. In [16], an ontology-
based image retrieval approach was proposed, which utilizes domain-specific ontology to retrieve semantically relevant im-
ages. Most previous shallow learning approaches focus on single-concept-based image retrieval. To perform multi-concept-
based image retrieval, they consider a combination of single-concept techniques. One typical example is TagProp [11],
which utilizes the product over the single-concept predictions to predict the relevance of images for a scene multi-concept
query.

2.2. Multi-modal and cross-modal learning

Web images usually contain two modalities, i.e. visual modality and text modality. Many multi-modal and cross-modal
image retrieval approaches have been proposed, which explore the correlations between visual image and contextual
text.
Rasiwasia et al. [22] utilized Canonical Correlation Analysis (CCA) to learn shared subspaces, maximizing the corre-
lation between the visual and text modalities. In [35], a unified multi-modal feature generation approach was proposed, where a
coding step of the general bag-of-words framework is used to obtain one representative vector of two types of tags. In [21],
a multi-label CCA approach was proposed to learn the shared subspaces, considering high-level semantics in the form of
multi-label annotation. Zhu et al. [34] proposed a unified multi-modal framework for image retrieval, which simultaneously
preserves visual similarity and semantic assistance from the text modality. Although these approaches are effective for CBIR,
their power has not been completely demonstrated for concept-based image retrieval.
Guillaumin et al. [26] employed Flickr social tags as textual descriptors and significantly improved the accuracy of con-
cept learning. To effectively leverage the semantics latently embedded in the social tags, a Multiple Kernel Learning (MKL)
approach was proposed in [12] to boost image classification accuracy. It employs the MKL framework to combine a visual
kernel with a textual kernel encoding the related social tags. Based on visual and textual modalities, Wu et al. [31] and
Wang et al. [28] proposed to learn a distance metric to better capture image similarity.

2.3. Deep learning

Deep learning based on CNNs has achieved great success in single-concept-based image tasks and has attracted increasing
interest recently [3]. Krizhevsky et al. [14] trained a deep CNN that achieved promising results on the large-scale dataset
ImageNet. This learner model was later improved by the VGGNet network [23] and the GoogLeNet network [25]. He et al.
[13] proposed a residual learning framework to ease the training of neural networks. Its layers are explicitly reformulated
as learning residual functions with reference to the layer inputs. A deep learning approach for the active classification of
electrocardiogram signals [20] was proposed, which learns a feature representation from the raw data in an unsupervised
way using a stacked autoencoder with sparsity constraint.
Some studies have considered multi-modal and cross-modal approaches based on deep learning. A multi-modal deep Boltz-
mann machine was proposed in [24], which produces a fused descriptor of visual and textual data and then utilizes the
logistic regression classifier to recognize the semantic concepts in images. Wang et al. [30] proposed a mapping function
learning approach based on a stacked auto-encoder and deep CNN, which projects data from different modalities into a com-
mon metric space. It can effectively capture both intra-modal and inter-modal semantic correlations of heterogeneous data,
thus achieving promising retrieval performance. Although these deep learning approaches show effectiveness in concept-
based image retrieval, they are limited to handling single-concept-based image retrieval, in which each user query is assumed
to have only one concept. This is inconsistent with real-world scenarios, where a user often conducts image retrieval with
multiple concepts. This motivates us to design MRBDL to effectively exploit the multi-concept scene FC classifier to recognize
a holistic scene.

Fig. 2. Illustration of the framework of MRBDL.

3. Multi-concept retrieval using bimodal deep learning

3.1. Problem statement

To fully exploit the visual cues, text cues and semantic correlations of concepts in large-scale image datasets, our
image retrieval approach adopts multi-concept semantic retrieval using a bimodal CNN model, i.e. MRBDL. Before detailing
this model of concept-based image retrieval, we first introduce some notation used in this paper.

Let D = D^I ∪ D^T be the web image dataset. An image and its associated contextual tags from a tag vocabulary L are
denoted by the descriptor vectors x_i^I ∈ D^I and x_i^T ∈ D^T respectively, and a bimodal image is then denoted as an image-text
pair (x_i^I, x_i^T).
Given a vocabulary V = {c_1, ..., c_j, ..., c_M} consisting of M unique semantic concepts, each concept c_j ∈ V is a single-
concept (e.g. “snow” or “person”). D is split into two sets: a training set A and a test set B. Each image-text pair (x_i^I, x_i^T) ∈ A
is annotated with a few semantic single-concepts c_j, while image-text pairs in B have no semantic annotations. Each
semantic multi-concept C_j = {c_1, ..., c_m} is an element of the power set of V, i.e. C_j ∈ 2^V or C_j ⊆ V, where m is the length |C_j|
of C_j.
Given a retrieval multi-concept Q = {c_1, ..., c_t} ∈ 2^V and a target domain B, the goal is to find a set O ⊂ B with the K most
relevant images describing all t = |Q| target single-concepts, i.e., ∀o ∈ O and o′ ∈ B \ O, r(Q, o) ≥ r(Q, o′), where r(Q, ·) denotes
the relevance score for the concept Q.

3.2. The framework of MRBDL

Fig. 2 depicts the basic framework of MRBDL with working mechanisms. The system contains three main components:
offline convolution block layers, offline FC classifier layers and online retrieval. The following gives some details about the
functionalities of these components.

Offline convolution block layers. This component aims to learn two types of deep descriptors of web images: the visual and
textual descriptors, which are respectively fed into the visual and textual FC classifier layers. Through the visual convo-
lutional block layer, visual descriptors of images are learned to transform image pixels to a feature vector. Textual tags of
images are transformed to word embedding [19] and textual descriptors are then learned via the textual convolutional block
layer.

Offline FC classifier layers. This component corresponds to two types of classifiers: single-concept FC classifier and multi-
concept scene FC classifier. They can learn a set of mapping functions between the concepts and the deep descriptors in the
visual and textual modalities. First, a multi-concept vocabulary V + is constructed with the concept co-occurrence approach.
Then, all concepts in V + are memorized through training the visual and textual FC classifiers. Since each concept in V +
has two choices: single-concept and multi-concept, four types of mapping functions can be built, namely single-concept
visual mapping function, multi-concept visual mapping function, single-concept textual mapping function, and multi-concept
textual mapping function.

Online retrieval. Given a multi-concept scene query Q, the multi-concept neighbor set Q + is first generated via the seman-
tically nearest neighbor approach. Then, each concept in Q + is mapped into relevance scores with mapping functions from
two modalities of images. Finally, the relevance scores are fused according to the semantic correlation between the two
modalities, the semantic correlation among concepts, and the concept frequencies, and images are returned in the descend-
ing order of relevance scores.

3.3. Multi-concept vocabulary generation

Each multi-concept C_j ∈ 2^V, such as “clouds, person, snow”, can be used for two different classifiers. When the concepts
are considered as a whole, C_j can be taken as a holistic scene concept, and it is fed into the multi-concept scene FC classifier in
two modalities. On the other hand, when we focus on a single concept, C_j is split into multiple single-concepts c_k ∈ C_j, which
are then fed into the single-concept FC classifier in two modalities.
To avoid meaningless concept permutations, MRBDL selects the meaningful C_j to generate a multi-concept vocabulary V^+
according to the following co-occurrence rule over the training set A:

$$|C_j| \le \alpha, \quad (1)$$

$$\phi(C_j) \ge \beta, \quad (2)$$

where |C_j| denotes the cardinality of the multi-concept C_j, and φ(C_j) denotes the number of images in A in which all m
single-concepts c_k ∈ C_j co-occur, namely the multi-concept frequency; α and β are two thresholds. If the size of the set V^+ is
very large, we can raise the co-occurrence threshold β in (2) to reduce the computational cost. In this way, a multi-concept
vocabulary V^+ can be produced. The multi-concept vocabulary generation algorithm is given in Algorithm 1.

Algorithm 1 Summary of multi-concept vocabulary generation.


Input: Training set A with annotation set V .
Output: Multi-concept vocabulary V + .
1: Initialize the vocabulary, i.e., V + (1 ) = V ;
2: for i = 2 to α do
3: Calculate the multi-concept set W = {C_j | |C_j| = i, φ(C_j) ≥ β} via (2);
4: V^+(i) = V^+(i − 1) ∪ W;
5: end for
6: return V + (α ).
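As a concrete illustration, the following Python sketch implements the co-occurrence rule in (1)–(2) and the loop of Algorithm 1 by counting concept co-occurrences over the training annotations. It is a minimal sketch under our own naming (annotations, alpha and beta stand for A, α and β); the paper does not prescribe an implementation.

```python
from itertools import combinations
from collections import Counter

def build_multi_concept_vocabulary(annotations, alpha, beta):
    """Algorithm 1 sketch: annotations is a list of concept sets, one per training image;
    alpha bounds |C_j| as in (1), beta bounds the frequency phi(C_j) as in (2)."""
    vocab = set()
    # Single concepts (V) are always kept, i.e. V+(1) = V.
    for concepts in annotations:
        vocab.update((c,) for c in concepts)
    # Count the co-occurrence frequency phi(C_j) of every candidate of size 2..alpha.
    freq = Counter()
    for concepts in annotations:
        ordered = sorted(concepts)
        for size in range(2, alpha + 1):
            for combo in combinations(ordered, size):
                freq[combo] += 1
    # Keep only multi-concepts whose frequency reaches the threshold beta.
    vocab.update(combo for combo, count in freq.items() if count >= beta)
    return vocab

# Toy usage: three images, each annotated with single concepts.
images = [{"clouds", "person", "snow"}, {"clouds", "snow"}, {"clouds", "person"}]
print(build_multi_concept_vocabulary(images, alpha=3, beta=2))
```

Representing each multi-concept as a sorted tuple keeps the enumeration bounded by α, which is what keeps the offline vocabulary generation tractable.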

3.4. Network structure

3.4.1. Visual network structure


Deep CNN has been successfully applied in image tasks [23], with a specialized connectivity structure consisting of the
convolution block layer followed by the FC classifier layer. The convolutional block layer forms multiple-staged descriptor
extractors with higher sub-layers generating more abstract descriptors from lower ones, on top of which there is the FC
classifier layer. The convolutional block layer can be instantiated by any existing CNN architecture. Some typical CNN models
include AlexNet [14], VGGNet [23], etc. Without loss of generality, we choose VGGNet to instantiate our convolutional block
layer in this paper. To effectively recognize the holistic scene, we then add an FC classifier layer on top of the convolution
block layer of MRBDL. The FC classifier layer includes two types of FC classifiers: a single-concept FC classifier and a multi-
concept FC classifier, as illustrated in Fig. 2. For the two types of FC classifiers, we employ one shared convolutional block
layer in the structure of MRBDL, since the convolutional block layer forms the general hierarchical representations for visual
images, which should be shared by all FC classifiers.
The objective of the original CNN such as VGGNet is to predict an annotation for each unseen image, whereas in our case,
one image is annotated with multiple annotations. We thus follow Gong et al. [10] to extend the softmax loss function J to
learn the multi-concept C_j as follows:

$$p(C_j \mid x_i^I) = \frac{\exp(f_j(x_i^I))}{\sum_k \exp(f_k(x_i^I))}, \quad (3)$$

$$J = -\frac{1}{|A|} \sum_{i=1}^{|A|} \sum_{j=1}^{|V^+|} \hat{p}(C_j \mid x_i^I) \log(p(C_j \mid x_i^I)), \quad (4)$$

where f_j(x_i^I) and p̂(C_j|x_i^I) denote the activation value and the ground-truth probability for the image x_i^I ∈ A and the multi-
concept C_j ∈ V^+, respectively. p̂(C_j|x_i^I) = 1 if C_j occurs in x_i^I, and p̂(C_j|x_i^I) = 0 otherwise.
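To make the loss concrete, the following NumPy sketch evaluates (3)–(4) given the raw activations f_j(x_i^I) and a multi-hot ground-truth matrix; the array names and the small numerical-stability term are our own.

```python
import numpy as np

def multi_concept_softmax_loss(activations, targets):
    """Eq. (3)-(4) sketch.
    activations: (n_images, n_concepts) raw scores f_j(x_i^I).
    targets:     (n_images, n_concepts) binary ground truth, 1 if C_j occurs in x_i^I."""
    # Softmax over the extended vocabulary V+, Eq. (3).
    shifted = activations - activations.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Cross-entropy between ground truth and prediction, averaged over |A|, Eq. (4).
    return -(targets * np.log(probs + 1e-12)).sum(axis=1).mean()

f = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])   # activations for 2 images, 3 concepts
y = np.array([[1, 0, 0], [0, 1, 1]])                 # multi-hot ground truth
print(multi_concept_softmax_loss(f, y))
```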

3.4.2. Text network structure


The text modality of an image contains rich semantic information compared with its visual modality. It is informative
but mostly noisy. Hence it cannot be directly used in retrieval exercises. Semantic annotations associated with images are
carefully annotated by humans and thus are more accurate compared to the textual tags. Hence, training the tags with
annotation information can learn robust textual descriptors against noisy text (tags). Moreover, textual tags can provide
complementary information compared to visual images. Their combination can mitigate the noisy text and improve the
performance of concept-based image retrieval.
A neural language model can learn a dense feature vector for each word (tag) or phrase, called a word embedding. The
learned dense vectors can be used to construct a dense vector for a sentence, e.g., by average pooling. We integrate the pre-
trained GloVe word embeddings [19] of the contextual tags of images into our textual convolutional block layer so as to learn
robust textual descriptors. This consists of three main steps: (i) construct an embedding layer, which takes all contextual
tags associated with each image as an input sentence; (ii) calculate the word embedding for each tag based on
the pre-trained GloVe vectors,1 which are loaded into the embedding layer; and (iii) feed the output of the embedding layer
into the convolutional block to extract the textual descriptors.
Finally, we employ learner models such as a multi-layer perceptron (MLP), an SVM or the newly developed randomized learner
model SCN to build the single-concept FC classifier and the multi-concept scene FC classifier on top of the textual convolu-
tional layer. In this work, we employ an MLP model with one hidden layer of 1024 hidden nodes to label the concepts.
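The textual branch described above could be assembled, for example, as in the PyTorch sketch below: frozen GloVe vectors feed an embedding layer, a convolutional block extracts a textual descriptor, and an MLP with one hidden layer of 1024 nodes serves as the FC classifier. The channel width, the global max pooling, the 1D convolution of width 5 standing in for the paper's 5 × 5 filter, and the random placeholder for the GloVe matrix are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Sketch of the textual convolutional block + FC classifier layer."""
    def __init__(self, glove_weights, n_concepts, conv_channels=256):
        super().__init__()
        # glove_weights: (vocab_size, embed_dim) tensor of pre-trained GloVe vectors.
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        embed_dim = glove_weights.shape[1]
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=5, padding=2)
        self.classifier = nn.Sequential(           # MLP with one hidden layer of 1024 nodes
            nn.Linear(conv_channels, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, n_concepts))

    def forward(self, tag_indices):
        # tag_indices: (batch, seq_len) padded sequences of tag IDs.
        x = self.embedding(tag_indices)            # (batch, seq_len, embed_dim)
        x = self.conv(x.transpose(1, 2))           # (batch, channels, seq_len)
        x = torch.relu(x).max(dim=2).values        # global max pooling over the tags
        return self.classifier(x)                  # concept activations

glove = torch.randn(5000, 100)                     # placeholder for real GloVe vectors
model = TextBranch(glove, n_concepts=2084)         # 2084 = |V+| on NUS-WIDE
scores = model(torch.randint(0, 5000, (4, 20)))    # 4 images, 20 padded tags each
print(scores.shape)
```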

3.5. FC classifier layer

The FC classifier layer includes two types of classifiers, i.e., the single-concept FC classifier and the multi-concept scene FC
classifier, which learn a mapping from images to concepts. The single-concept FC classifier is similar to a traditional clas-
sifier. Different from existing approaches treating multiple concepts as independent ones, our multi-concept scene FC classifier
considers multiple concepts as a holistic scene for retrieval model learning. First, MRBDL learns a set of mapping functions
to project data from different modalities to the multi-concepts C_j ∈ V^+. After the learning of the two modalities, a modal-
ity ensemble is made by using fusion techniques to obtain the relevance scores. To compensate for the varying frequencies of
semantic concepts, we maximize a log-likelihood function of the relevance scores.

3.5.1. Semantic nearest neighbor generation


For a multi-concept scene query Q ∈ 2V , multiple neighbor multi-concepts Cj are involved in concept mapping learning to
improve holistic scene recognition. The generation of the multi-concept neighbor set Q + can be completed in three steps.
Firstly, we generate a semantic candidate pool P ⊂ V + by selecting correlative multi-concepts Cj with probabilities p(Q|Cj ) > 0.
The semantic probability p(Q|Cj ) represents the correlation between two concepts Q and Cj , and it is defined over the training
set A as follows:
$$p(Q \mid C_j) = \frac{\phi(Q \cup C_j)}{\phi(Q)}, \quad (5)$$
where φ (Q) and φ (Cj ) denote the occurrence frequency of multi-concept Q and Cj respectively and φ (Q ∪ Cj ) denotes the
number of images simultaneously containing two multi-concepts Q and Cj . Each multi-concept Cj is regarded as its own
semantic neighbor and consequently p(C j |C j ) = 1.
Secondly, we add all sub-concepts C_j ⊆ Q of the scene concept Q to the nearest neighbor set Q^+. Finally,
we select the most correlative scene concepts C_r ∈ P from the rest and add them to Q^+. Thus, we generate a multi-concept nearest
neighbor set Q^+ which consists of K_Q+ elements.
To keep the probabilistic attribute of the semantic correlation, the semantic link probabilities p(Q|Cj ) are normalized
according to (6) below:

$$p(Q \mid C_j) = \begin{cases} \dfrac{p(Q \mid C_j)}{\sum_{r=1}^{K_{Q^+}} p(Q \mid C_r)} & \text{if } C_r, C_j \in Q^+, \\ 0 & \text{elsewhere.} \end{cases} \quad (6)$$
The corresponding generation algorithm is summarized in Algorithm 2.

1 https://nlp.stanford.edu/projects/glove/

Algorithm 2 Summary of semantic nearest neighbor generation.


Input: Multi-concept vocabulary V + and multi-concept query Q.
Output: Semantic nearest neighbor set Q + = {C1 , · · · , CKQ + }.
1: Calculate semantic candidate pool P = {C j |C j ∈ V + , p(Q |C j ) > 0} via Eqn. (5);
2: Calculate sub-concept set S = {C j |C j ⊆ Q, φ (C j ) > 0};
3: Calculate correlative scene candidate pool R = {C_r | C_r ∈ P, C_r ∉ S};
4: Perform heap sort over all C_r ∈ R and calculate the correlative scene concept set T ⊆ R by selecting the top K_Q+ − |S| concepts
C_r ∈ R;

5: Calculate semantic neighbor set Q^+ = S ∪ T with the semantic link probabilities p(Q|C_j) via Eqn. (6);
6: Return semantic nearest neighbor set Q + = {C1 , · · · , CKQ + }.
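A possible realization of Algorithm 2, assuming multi-concepts are stored as frozensets and φ is the co-occurrence counter built during vocabulary generation, is sketched below; p(Q|C_j) follows (5) and the final normalization follows (6).

```python
def semantic_neighbors(query, vocab, phi, k):
    """Algorithm 2 sketch.
    query: frozenset of single-concepts; vocab: iterable of frozensets (V+);
    phi: dict mapping a frozenset to its co-occurrence count; k: neighborhood size K_Q+."""
    def p_q_given_c(c):                       # Eq. (5): phi(Q ∪ C_j) / phi(Q)
        return phi.get(query | c, 0) / max(phi.get(query, 0), 1)

    # Step 1: candidate pool P with p(Q|C_j) > 0.
    pool = [c for c in vocab if p_q_given_c(c) > 0]
    # Step 2: sub-concepts of Q that occur in the training set.
    subs = [c for c in vocab if c <= query and phi.get(c, 0) > 0]
    # Steps 3-4: fill the remaining slots with the most correlated scene concepts.
    rest = sorted((c for c in pool if c not in subs), key=p_q_given_c, reverse=True)
    neighbors = subs + rest[:max(k - len(subs), 0)]
    # Step 5, Eq. (6): normalize the link probabilities over the neighborhood.
    total = sum(p_q_given_c(c) for c in neighbors) or 1.0
    return {c: p_q_given_c(c) / total for c in neighbors}

phi = {frozenset({"snow"}): 40, frozenset({"clouds"}): 60,
       frozenset({"clouds", "snow"}): 25, frozenset({"clouds", "person", "snow"}): 5}
print(semantic_neighbors(frozenset({"clouds", "snow"}), list(phi), phi, k=3))
```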

3.5.2. Multi-concept bimodal mapping


When a query concept Q and its interrelated concepts C_j have a high co-occurrence frequency in the image set, we
consider boosting the relevance score r_m(Q, X) of predicting this concept Q (X ∈ {x_i^I, x_i^T}), if there is strong evidence p(C_j|X)
for the interrelated concepts C_j in the image set. Then, the relevance scores for the visual and text modalities can be written
as

$$r_m(Q, x_i^I) = \sum_{j=1}^{K_{Q^+}} p(Q \mid C_j)\, p(C_j \mid x_i^I), \quad C_j \in Q^+, \quad (7)$$

$$r_m(Q, x_i^T) = \sum_{j=1}^{K_{Q^+}} p(Q \mid C_j)\, p(C_j \mid x_i^T), \quad C_j \in Q^+, \quad (8)$$

where r_m(Q, X) denotes the relevance scores predicted by the multi-concept scene FC classifiers for the two modalities. The
posterior probability p(C_j|X) predicted by the multi-concept scene FC classifier can be taken as evidence of C_j in images,
while the semantic correlation p(Q|C_j) can be seen as the weight of the probability p(C_j|X).
Based on the performance in single-concept learning reported in [15,23], we integrate the single-concept FC classifiers
into the FC classifier layer of each modality. Specifically, the relevance score between Q and image x_i^s (s ∈ {I, T}) is calculated
as follows:

$$r_s(Q, x_i^I) = \prod_{j=1}^{t} p(c_j \mid x_i^I), \quad c_j \in Q, \quad (9)$$

$$r_s(Q, x_i^T) = \prod_{j=1}^{t} p(c_j \mid x_i^T), \quad c_j \in Q, \quad (10)$$

where r_s(Q, x_i^s) denotes the relevance scores predicted by the single-concept FC classifiers for the two modalities. The posterior
probability p(c_j|x_i^s) (s ∈ {I, T}) predicted by the single-concept FC classifier can be regarded as evidence of c_j in images.
To combine the complementary information from the visual and text modalities, we fuse these predicted semantic scores
as follows:
$$r(Q, x_i^I, x_i^T) = r(Q, x_i^I) + r(Q, x_i^T), \quad (11)$$

$$r(Q, x_i^I) = w_1^Q \cdot r_s(Q, x_i^I) + w_2^Q \cdot r_m(Q, x_i^I), \quad (12)$$

$$r(Q, x_i^T) = w_3^Q \cdot r_s(Q, x_i^T) + w_4^Q \cdot r_m(Q, x_i^T), \quad (13)$$

where r(Q, x_i^I) and r(Q, x_i^T) denote the unimodal relevance scores of the query concept Q for the visual and text modality
respectively, and r(Q, x_i^I, x_i^T) denotes the final relevance score. w^Q = [w_1^Q, w_2^Q, w_3^Q, w_4^Q] are the MRBDL parameters to be
estimated, subject to Σ_l w_l^Q = 1 and w_l^Q > 0, l = 1, 2, 3, 4, which are taken as the weights for the four types of relevance scores
in (12) and (13). In this way, the relevance scores r(Q, x_i^I, x_i^T) can be evaluated, and all testing images can be ranked and
listed in descending order.
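Putting (7)–(13) together, a per-image relevance score could be computed as in the sketch below; the classifier posteriors are assumed to be given as dictionaries, and the weight vector [w_1^Q, w_2^Q, w_3^Q, w_4^Q] is assumed to be already estimated (Section 3.7). The names are illustrative.

```python
import numpy as np

def relevance_score(query, neighbors, post_multi, post_single, w):
    """Eqs. (7)-(13) sketch for one image.
    neighbors:   dict C_j -> p(Q|C_j) from Algorithm 2.
    post_multi:  dict modality -> {C_j: p(C_j|x)} multi-concept scene posteriors.
    post_single: dict modality -> {c_j: p(c_j|x)} single-concept posteriors.
    w: length-4 weight vector [w1, w2, w3, w4], summing to one."""
    scores = []
    for modality in ("visual", "text"):
        # Single-concept score, Eqs. (9)-(10): product over the query concepts.
        r_s = np.prod([post_single[modality].get(c, 0.0) for c in query])
        # Multi-concept scene score, Eqs. (7)-(8): correlation-weighted evidence.
        r_m = sum(p_qc * post_multi[modality].get(c, 0.0) for c, p_qc in neighbors.items())
        scores.extend([r_s, r_m])
    return float(np.dot(w, scores))            # Eqs. (11)-(13)
```

The intermediate vector [r_s(Q, x_i^I), r_m(Q, x_i^I), r_s(Q, x_i^T), r_m(Q, x_i^T)] built inside the loop is exactly the v^Q used later in (19).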

3.6. Two-phase training schema

The training of MRBDL consists of two phases, as illustrated in Fig. 2. For each modality, the first phase trains the shared
convolutional block layer to extract the corresponding deep descriptors. After this, the deep descriptors are fed into the FC
classifier layer for mapping function training in the second phase. The training is conducted by back-propagation using the
mini-batch stochastic gradient descent learning algorithm to minimize the objective loss function (4).

Visual modality training. Training a deep CNN is a time-consuming process. In this work, we exploit a pre-trained network
to reduce the number of trainable network parameters. We start by initializing the convolutional layer with all the pre-
trained parameters of the VGGNet convolutional block layer, discarding the VGGNet FC layer parameters. This process is
executed once, and its output, the deep visual descriptors X(1), is recorded. Then, we turn to training the network parameters E(1)
of the FC classifier layer on top of the stored deep descriptors X(1) . The FC layer comprises two types of classifiers: a multi-
concept scene FC classifier and a single-concept FC classifier, which are simultaneously trained with the back-propagation
algorithm according to (4). This is followed by a process to enhance the generalization capability of the deep descriptors based on
the trained X(1) and E(1): all convolutional blocks up to the last convolutional block are frozen, and the specialized descriptors X(2)
are learned after the retraining is done. The descriptors trained by the low-level convolutional blocks are less abstract than
those found higher up, so we keep the first few blocks fixed for more general descriptors and only fine-tune the last one to
obtain the more specialized descriptors X(2). Finally, we train the multi-concept FC layer to obtain the fine-tuned FC parameters E(2)
with the back-propagation algorithm, based on the trained X(2) and E(1).
From our offline experiments, it is observed that the optimization algorithm moves very slowly when large gradient up-
dates are triggered by the randomly initialized parameters. In each iteration, the neurons in the dropout layer are stochas-
tically selected with a probability of 0.5 to forward their activations to the output units. Only the selected neurons
participate in the back-propagation. Correspondingly, all the neurons are used for prediction, with their activation values multiplied
by 0.5 for normalization.
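As an illustration of the two-phase schema, the PyTorch fragment below freezes the pre-trained VGG convolutional blocks in phase one and unfreezes only the last block in phase two. The layer-index range used to select that block, the FC layer sizes, and the single classifier head standing in for the paper's pair of single-/multi-concept FC classifiers are assumptions, not details taken from the paper.

```python
import torch.nn as nn
from torchvision import models

def build_visual_branch(n_concepts, phase):
    """Two-phase sketch: phase 1 trains only the new FC layers; phase 2 fine-tunes the last block."""
    vgg = models.vgg16(pretrained=True)            # warm start from pre-trained VGGNet
    # Replace the original VGG FC layers with our FC classifier layer (1024 units + dropout 0.5).
    vgg.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 1024), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1024, n_concepts))
    for p in vgg.features.parameters():
        p.requires_grad = False                    # phase 1: freeze all convolutional blocks
    if phase == 2:
        # Phase 2: unfreeze (roughly) the last convolutional block to specialize X(2).
        # The index range is an assumption about torchvision's VGG16 layer ordering.
        for p in vgg.features[-7:].parameters():
            p.requires_grad = True
    return vgg

model = build_visual_branch(n_concepts=15970, phase=1)   # 15,970 = |V+| on MIR Flickr 2011
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))                 # only the FC classifier in phase 1
```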

Text modality training. The associated contextual tags contain more semantics compared with visual images. A neural lan-
guage model such as GloVe can learn a word embedding vector for each textual tag. The learned word embedding vectors
can be used to construct a dense vector for a sentence, where all tags associated with an image are regarded as an input
sentence. After this, we obtain one word embedding for each tag. The word embeddings of all the tags of one image con-
stitute a 2D tensor, which is fed into a textual convolutional block layer to extract the textual descriptor. Thus we create
one text descriptor vector for these tags. Finally, the textual descriptor is fed into the textual FC classifier layer for mapping
function training.
In our experiments, we consider the popular GloVe word embedding technique which factorizes a matrix of word co-
occurrence statistics in the text modality. We first convert all the text tags in the image set into sequences of word indices.
A word index denotes an integer ID for the word (tag). Then, based on the pre-trained GloVe vectors we construct an
embedding matrix X for textual tags associated with each image. At index i, X(i) includes the embedding vector for the
word with index i. Third, we load X into the embedding layer. An embedding layer is fed with sequences of integers, i.e. a
2D input. The input sequences are padded to the same length in a batch of input data. The integer inputs are mapped to
the word embedding vectors at the corresponding index in X. In other words, the output of the embedding layer is a 3D
tensor which is fed into the textual convolutional block layer to extract the textual descriptors.
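For instance, the embedding matrix X and the padded integer sequences could be prepared as in the following sketch; the GloVe file path and the tag-to-index mapping are placeholders.

```python
import numpy as np

def load_glove_matrix(path, tag_to_index, dim=100):
    """Build the embedding matrix X: row i holds the GloVe vector of the tag with index i."""
    X = np.zeros((len(tag_to_index) + 1, dim))        # row 0 reserved for padding
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
            if word in tag_to_index:
                X[tag_to_index[word]] = vec
    return X

def pad_sequences(tag_lists, tag_to_index, max_len):
    """Map each image's tag list to a fixed-length sequence of integer IDs (0 = padding)."""
    batch = np.zeros((len(tag_lists), max_len), dtype=np.int64)
    for i, tags in enumerate(tag_lists):
        ids = [tag_to_index[t] for t in tags if t in tag_to_index][:max_len]
        batch[i, :len(ids)] = ids
    return batch

vocab = {"clouds": 1, "snow": 2, "person": 3}
print(pad_sequences([["clouds", "snow"], ["person"]], vocab, max_len=4))
```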

Multi-concept training. In total, |V^+| concepts C_j from the multi-concept vocabulary V^+ are involved in the training. The image-
text pairs (x_i^I, x_i^T) with V^+ are used as the training input. The concept mapping functions, which produce the relevance score
between an image and a concept, are the training output. For each scene multi-concept C_j ∈ V^+ of the two modalities, the training images
annotated with all single-concepts c_k ∈ C_j are taken as positive samples and the rest as negative samples. Considering the
imbalance between the size N_+^{C_j} of the positive example set and the size N_-^{C_j} of the negative example set, the weights 1/N_+^{C_j}
and 1/N_-^{C_j} are respectively given to the positive samples and the negative samples for training C_j. Such an approach also applies to
single-concept training when |C_j| = 1.
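The per-sample weights described above can be computed as in this small sketch, where labels (our variable name) marks which training images contain all single-concepts of C_j.

```python
import numpy as np

def imbalance_weights(labels):
    """Weight 1/N+ for positives and 1/N- for negatives of one (multi-)concept."""
    labels = np.asarray(labels)
    n_pos, n_neg = max(labels.sum(), 1), max((1 - labels).sum(), 1)
    return np.where(labels == 1, 1.0 / n_pos, 1.0 / n_neg)

print(imbalance_weights([1, 0, 0, 1, 0]))   # positives get 1/2, negatives 1/3
```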

3.7. Parameter optimization

Real-world datasets are often imbalanced [9], posing a significant challenge to developing retrieval models. In general,
classifiers commonly employ error metrics to find optimal parameters of a model. During the learning phase, it is very
common to over-classify the frequent concepts with high occurrence frequencies, making it hard to derive suitable models
for rare concepts with low occurrence frequencies. As a result, the frequent concepts are predicted with a very low error
rate, while the rare concepts, due to their infrequent occurrence, are predicted with a very high error rate.
The query Q and its associated concepts c_j ∈ Q and C_j ∈ Q^+ have varying occurrence frequencies, which affect the four
types of relevance scores predicted by the classifiers in (12) and (13). To compensate for the varying frequencies of concepts, we
maximize a log-likelihood function of the semantic scores over the training set A.
Let y_i^Q ∈ {0, 1} denote the absence/presence of a multi-concept scene query Q for the image pair (x_i^I, x_i^T). The prediction
value of p(y_i^Q) can be evaluated by

$$p(y_i^Q = 1) = r(Q, x_i^I, x_i^T), \quad (14)$$

$$p(y_i^Q = 0) = 1 - r(Q, x_i^I, x_i^T), \quad (15)$$

$$p(y_i^Q) = y_i^Q\, p(y_i^Q = 1) + (1 - y_i^Q)\, p(y_i^Q = 0). \quad (16)$$


Hence, the log-likelihood function of the multi-concept scene query Q can be represented as
$$J_Q = \sum_{i=1}^{|A|} n_i^Q \log p(y_i^Q), \quad (17)$$

where n_i^Q is a cost which accounts for the imbalance between the number of positive samples, N_+^Q, and that of the
negative samples, N_-^Q, for the query concept Q. We set n_i^Q = 1/N_+^Q if y_i^Q = 1 and n_i^Q = 1/N_-^Q otherwise.
By substituting p(y_i^Q) in (17) with (14)–(16), the log-likelihood function can be rewritten as

$$J_Q = \sum_{i=1}^{|A|} n_i^Q \log\{y_i^Q\, r(Q, x_i^I, x_i^T) + (1 - y_i^Q)(1 - r(Q, x_i^I, x_i^T))\}. \quad (18)$$

Denote v^Q = [r_s(Q, x_i^I), r_m(Q, x_i^I), r_s(Q, x_i^T), r_m(Q, x_i^T)]^T. According to (11)–(13), the log-likelihood function for optimizing
w^Q becomes

$$J_Q = \sum_{i=1}^{|A|} n_i^Q \log\{y_i^Q\, (w^Q)^{\top} v^Q + (1 - y_i^Q)(1 - (w^Q)^{\top} v^Q)\}. \quad (19)$$

The log-likelihood function J_Q is maximized by using the gradient descent approach. The gradient of the log-likelihood
with respect to w_l^Q can be expressed as follows:

$$\frac{\partial J_Q}{\partial w_l^Q} = \sum_{i=1}^{|A|} n_i^Q \left\{ \frac{(2y_i^Q - 1)\, \partial\big((w^Q)^{\top} v^Q\big)/\partial w_l^Q}{1 - y_i^Q + (2y_i^Q - 1)\,(w^Q)^{\top} v^Q} \right\} = \sum_{i=1}^{|A|} n_i^Q \left\{ y_i^Q \frac{v_l^Q}{(w^Q)^{\top} v^Q} + (y_i^Q - 1) \frac{v_l^Q}{1 - (w^Q)^{\top} v^Q} \right\}, \quad (20)$$

where w_l^Q and v_l^Q denote the l-th elements of w^Q and v^Q, respectively.
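A compact sketch of this optimization is given below: gradient ascent on (19) using the gradient (20), followed by clipping and renormalization so that the weights stay positive and sum to one. The projection step and the learning-rate/iteration settings are our simplifications; the paper only specifies the gradient.

```python
import numpy as np

def optimize_weights(V, y, n, steps=500, lr=0.01):
    """Maximize Eq. (19) by gradient ascent.
    V: (n_images, 4) matrix whose rows are v^Q = [r_s^I, r_m^I, r_s^T, r_m^T];
    y: (n_images,) binary relevance labels; n: (n_images,) costs n_i^Q."""
    w = np.full(4, 0.25)
    for _ in range(steps):
        s = np.clip(V @ w, 1e-6, 1 - 1e-6)                    # (w^Q)^T v^Q per image
        # Eq. (20): y*v/(w^T v) + (y-1)*v/(1 - w^T v), weighted by n_i^Q.
        grad = (n * (y / s + (y - 1) / (1 - s))) @ V
        w += lr * grad
        w = np.clip(w, 1e-6, None)
        w /= w.sum()                                          # keep sum(w) = 1 and w > 0
    return w

V = np.random.rand(100, 4) * 0.25
y = np.random.randint(0, 2, 100)
n = np.where(y == 1, 1.0 / max(y.sum(), 1), 1.0 / max((1 - y).sum(), 1))
print(optimize_weights(V, y, n))
```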

3.8. Testing with MRBDL

MRBDL has four types of mapping functions: a multi-concept visual mapping function, a single-concept visual mapping
function, a multi-concept textual mapping function and a single-concept textual mapping function. Given a test set of web
images, image raw descriptors are extracted from each modality and mapped into the deep visual and textual descriptors
by our shared convolutional block layers, respectively. Then they are fed into the visual and textual FC classifier layers
respectively and mapped into the different relevance scores by the learned mapping functions.
MRBDL makes an ensemble of these multi-concept and single-concept relevance scores from the different modalities
after the concept mapping. To compensate for the varying frequencies of concepts, MRBDL maximizes the log-likelihood
function of the relevance scores to produce the ultimate scores. Ultimately, the relevant images are returned in descending
order of the relevance scores.
The main procedures of the MRBDL-based MCIR are summarized in Algorithm 3. From Algorithm 3, we present time
and space complexity analysis. A multi-concept vocabulary V + is constructed in an offline mode, with the one-time cost
of O(1). Furthermore, it enjoys a high degree of parallelism and can be efficiently implemented with advanced parallel
computing techniques. In the offline training, learning deep descriptors and concept mapping functions for each modality
are convolution operations, and the time complexity is O(E × N), where E denotes the number of trainable neural network
parameters. We warm start the visual convolutional layer by initializing it with the pre-trained network and employ a very
small FC classifier layer, decreasing the number of network parameters to be trained (i.e. E) and boosting training efficiency.
In online retrieval, the calculation of the multi-concept nearest neighbor set Q + can be completed in O(1) time. For each
test image, it takes O(1) time to calculate the relevance score, so the computational complexity in both time and
space is O(N) for the loop of the relevance estimation. At the last step, the ranking list of images is returned
after a heap sort is performed over all relevance scores r(Q, x_i^I, x_i^T), with time and space complexity being O(N log N) and
O(1) respectively. Consequently, the time and space complexity of Algorithm 3 are O(N log N) and O(N), respectively.

4. Experimental setup

In this section, we first introduce the experimental datasets, and then the evaluation metrics. After this, the compared
approaches are introduced. Finally, we describe the experimental implementation in detail. All the experiments are con-
ducted on an AMAX computer with 128 GB of memory, a 2.1 GHz E5-2620 CPU and NVIDIA GPUs.

Algorithm 3 Summary of semantic web image retrieval using bimodal deep CNN.
Input: Training set A with annotation set V , test set B and multi-concept query Q.
Output: Top-K relevant images for multi-concept query Q.
// Offline Learning
1: Construct multi-concept vocabulary V + via Algorithm 1;
2: Learn deep visual descriptors in visual convolutional layer;
3: Learn multi-concept and single-concept mapping functions in visual FC layer;
4: Learn deep textual descriptors in textual convolutional layer;
5: Learn multi-concept and single-concept mapping functions in textual FC layer;
// Online Retrieval
6: Calculate semantic candidate pool P ⊂ V + and construct multi-concept nearest neighbor set Q + via Algorithm 2;
7: for each (xIi , xTi ) ∈ B do
8: Map concept Cr ∈ Q + and c j ∈ Q and calculate relevance scores rm (Q, xIi ) and rs (Q, xIi ) for visual modality via Eq. (7)
and (9);
9: Map concept Cr ∈ Q + and c j ∈ Q and calculate relevance scores rm (Q, xTi ) and rs (Q, xTi ) for textual modality via Eq. (8)
and (10);
10: Fuse relevance scores for two modalities and obtain ultimate relevance score r (Q, xIi , xTi ) via Eq. (11);
11: end for
12: Perform heap sort over all r (Q, xIi , xTi ) and return top K images.

4.1. Datasets

The experiments were carried out on two publicly available image datasets: MIR Flickr 2011 [18] and NUS-WIDE [5]. We
choose these since they include large vocabularies that are discriminative enough to evaluate the performance of multi-
concept-based image retrieval methods. The two datasets include the visual images, the associated tags and the ground
truth for single-concept (class label) task evaluation, which are publicly available.
MIR Flickr 2011 2 is comprised of 18,000 images from Flickr along with social tags, 8000 images of which are annotated
with 99 semantic concepts from the vocabulary V. Following [12], we employ GIST, HOG, SIFT and RGB color histograms as
the visual descriptors, using the L2 distance for GIST, HI for HOG, χ2 for the SIFT histogram and the L1 distance for the RGB color
histogram to calculate the visual distance between two samples. The text modality is represented by textual tags from a tag vocabulary
L containing 53,296 tags. They are encoded as a 53,296-dimensional binary vector, where each dimension indicates the pres-
ence of one of the 53,296 tags. The inner product is employed to calculate the text similarity between two samples [12]. The dataset is
randomly split into two parts. All 8000 image-text pairs with annotations are employed for training, while the rest are used
for testing.
NUS-WIDE 3 consists of 269,648 web images annotated with 1–13 semantic concepts from a vocabulary containing 81
concepts. Each image is associated with contextual tags from Flickr, with a total of 5018 unique tags. Due to broken links,
a total of 230,708 web images are downloadable for the experiments. The visual descriptors provided in [5] are employed,
including LAB, HSV, edge direction, wavelet texture and SIFT histograms. Similarly, we employ L1 distance for the LAB and
HSV color histograms, HI for the edge direction histogram, L2 for the wavelet texture histogram, and χ2 for the SIFT histogram to
calculate the visual distance. For the textual descriptor, following [5], 1000-dimensional binary textual vectors are employed,
where each dimension indicates the presence of one of the 1000 most frequent tags, which constitute the tag vocabulary L. Similarly,
the inner product is employed to calculate the text distance between two images. This dataset is randomly split into two
parts. The first part consists of 138,375 images to be employed for training and the second part contains 92,333 images to
be used for testing.
On MIR Flickr 2011, the minimum, mean and maximum number of images per concept are 12, 940.1 and 7,484,
respectively. The minimum, mean and maximum number of concepts per image are 3, 11.4 and 25, respectively. The
annotation vocabulary contains several dozen semantic concepts and about 2/3 of the semantic concepts have frequencies
less than the average concept frequency. Table 1 summarizes the key statistics of the image datasets. Image and concept
counts are given in the format minimum / mean / maximum.

4.2. Evaluation metrics

Non-interpolated mean Average Precision (mAP) over retrieved concepts is taken as a metric to evaluate the retrieval
performance of web images. Given a multi-concept scene query Q, Average Precision (AP) can be computed as
AP = (1/N_R) Σ_r p(r) δ(r), where N_R is the total number of relevant images, r is the rank in the sequence of retrieved images, p(r) is
the precision at cut-off r in the list, which is defined as the ratio between the number of relevant images and the number of

2 http://imageclef.org/2011/Photo/
3 http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

Table 1
Statistics of the datasets.

Datasets              MIR Flickr 2011       NUS-WIDE
Testing Size          10,000                92,333
Training Size         8000                  138,375
Dataset Size          18,000                230,708
Vocabulary Size       99                    81
Number of Tags        53,296                5,018
Mean Text Length      8.6                   5.8
Concepts per Image    3 / 11.4 / 25         1 / 2.5 / 13
Images per Concept    12 / 940.1 / 7,484    51 / 5,380.7 / 64,509

retrieved images, and δ(r) is the indicator function which equals 1 if the rth image is relevant to Q, and zero otherwise. mAP
for a query set C can be computed as mAP = (1/|C|) Σ_{Q∈C} AP, where |C| is the total number of queries. mAP takes into account
the rank of retrieved images when calculating AP, and hence heavily penalizes the retrieval results when the relevant images
are returned far down the list. The higher the mAP score, the better the retrieval performance. Moreover, the precision-scope curve is
reported to reflect retrieval performance variations with respect to the number of returned images as well. Given a retrieval
multi-concept Q = {c1 , · · · , ct } ∈ 2V , the ground truth of multi-concept retrieval is defined as: if a web image describes all
t = |Q | target single-concepts cj ∈ Q, it is considered a relevant image; otherwise, it is irrelevant.
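For reference, the non-interpolated AP and mAP defined above can be computed as in the sketch below; it assumes the full ranked test list is scored, so the number of retrieved relevant images equals N_R.

```python
def average_precision(relevant):
    """Non-interpolated AP over a ranked list of 0/1 relevance indicators."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank          # p(r) at each relevant position
    return precision_sum / hits if hits else 0.0  # hits equals N_R for a full ranking

def mean_average_precision(ranked_lists):
    """mAP over a query set; each element is the ranked 0/1 list of one query."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

print(average_precision([1, 0, 1, 1, 0]))         # approximately 0.806
```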

4.3. Evaluation approaches

MRBDL is specially designed for concept-based image retrieval. We hence compare MRBDL with several state-of-the-
art concept-based unimodal and multi-modal approaches. Given a multi-concept scene query Q, the relevance score rp of

images is calculated by the compared approaches according to the product rule r p = c ∈Q p(ci |. ) [11], where p(ci |.) denotes
i
the relevance score predicted by a traditional single-concept classifier.
Unimodal approaches used for evaluation include ML-KNN [33], LIBSVM [2], LinearSVM [7], FastTag [4] and VGGNet [23].
ML-KNN and LIBSVM are classical nonlinear and adaptive approaches that efficiently handle rare concepts in imbalanced
datasets. LinearSVM is a linear SVM classifier for large-scale datasets, which is employed as the textual classifier in our ex-
periments. FastTag learns two linear mapping functions co-regularized in a joint convex loss function that can be efficiently
optimized in closed form updates on large-scale datasets. VGGNet is a typical deep learning approach in concept-based im-
age retrieval due to its promising discriminability for concept detection. Following previous work [26], each unimodal score
is combined with an equal weight for the final relevance score.
The multi-modal approaches used for comparison include MKL [12], v+p+t+TagProp [26], deep autoencoder [17], multi-
modal Deep Belief Network (DBN) [24] and a Deep Boltzmann Machine (DBM) [24]. MKL utilizes the multiple kernel learning
framework to combine the visual kernel with the textual kernel. Multi-modal v+p+t+TagProp learns a discriminative met-
ric for the nearest neighbors and employs a linear kernel of the visual and text modality. The deep autoencoder and DBN
employ a deep autoencoder and a deep belief network to fuse two modalities, respectively. DBM learns a generative model
of the joint space of visual and text modality through deep Boltzmann machines. The implementation codes of the com-
pared approaches are publicly available.4 The parameters in the compared approaches are adjusted according to the relevant
literature and the best performance is reported.

4.4. Configurations

MRBDL has two parameters, α and β in (1) and (2), to adjust the maximum length and the minimum frequency of
the multi-concepts on multi-concept vocabulary generation. Since the highest frequency appears in images with about 11
and 3 concepts, the maximum length α is set to 11 and 3 on MIR Flickr 2011 and NUS-WIDE respectively. To reduce the
training time overhead, we confine V + to an acceptable size by filtering concepts according to their occurrence frequencies
in the image set. More specifically, if the occurrence frequency of a concept is equal to or greater than the threshold β , we
include it in the V + ; otherwise we discard it. In the experiments, we empirically set β to 200 and 50 for MIR Flickr 2011
and NUS-WIDE, respectively. Hence, the vocabulary V + consists of 15,970 and 2084 multi-concepts respectively.
In the experiments, 5-fold cross-validation is adopted to determine the size KQ + of the multi-concept nearest neighbor
set for a query Q. Fig. 3 shows the performance comparisons by varying KQ + from 2 to 20. It can be observed that better
performance can be achieved when KQ + = 8 on MIR Flickr 2011, and a very small value of KQ + such as KQ + = 2 may lead
to limited performance. A similar variation trend in performance can be observed on NUS-WIDE when KQ + changes, and the
optimal value of KQ + is about 7. Therefore, we set KQ + = 8 for MIR Flickr 2011 and KQ + = 7 for NUS-WIDE in the experiments,
respectively.

4 We use the implementation codes from the authors’ websites: http://lamda.nju.edu.cn/code_MLkNN.ashx , http://www.csie.ntu.edu.tw/∼cjlin/libsvm/ ,
http://www.csie.ntu.edu.tw/∼cjlin/liblinear/ , http://www.cse.wustl.edu/∼mchen/code/FastTag/ , http://asi.insa-rouen.fr/enseignants/∼arakoto/code/mklindex.
html , http://lear.inrialpes.fr/people/guillaumin/code.php , http://www.cs.toronto.edu/∼nitish/multimodal/ , etc.

Fig. 3. Performance on the (a) MIR Flickr 2011 and (b) NUS-WIDE in terms of mAP (%) with respect to the multi-concept neighborhood size K.

For the visual convolution block layer, we follow the same network definition used in [23]. The convolution filter size
is set to 3 × 3. The FC layer has an output size of 1024, followed by dropout layers with a dropout ratio of 0.5. For all the
layers, Rectified Linear Units (ReLU) are employed as the nonlinear activation function. For the textual convolution block
layer, the convolution filter size is set to 5 × 5. The size of the FC layer is 1024, followed by the ReLU nonlinearity and
dropout (0.5) layer.
The optimization of the whole network is achieved by the stochastic gradient descent method with the mini-batch size
of 128 at a fixed momentum value of 0.9. The global learning rate for the whole network is set to 0.01 at the beginning,
and a staircase weight decay is applied after 20 epochs.
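These settings correspond, for example, to the following PyTorch configuration; the dummy model, the dummy data, and the decay factor of the staircase schedule (0.1) are illustrative, since the paper does not state the decay factor.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy model and data standing in for the bimodal network and image batches.
model = nn.Linear(512, 10)
data = TensorDataset(torch.randn(1024, 512), torch.randint(0, 2, (1024, 10)).float())
loader = DataLoader(data, batch_size=128, shuffle=True)       # mini-batch size 128
criterion = nn.BCEWithLogitsLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()                                          # staircase decay after every 20 epochs
```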
In the experiments, we construct the test query set Q for evaluation. All single-concepts c_j ∈ V are first selected
into Q as single-concept queries, and then 1500 multi-concept scene queries are randomly picked out. There are 500 2-concepts, 500 3-concepts
and 500 4-concepts, where an n-concept means a multi-concept of length n. Furthermore, to validate the discriminative
capability of the FC classifiers for a semantic multi-concept scene, Q is split into two query groups for evaluation. One group includes
a multi-concept query set F, containing all queries Q ∈ Q with |Q| > 1, and a single-concept query set G = Q − F. The other consists of a difficult
query set U with no more than 99 relevant images in B and an easy query set V = Q − U with 100 or more relevant images
in B.

5. Experiments and analysis

5.1. Multi-concept semantic scene retrieval experiments

Each dataset is randomly separated into two collections. One collection, consisting of 8000 images for MIR Flickr 2011
and 138,375 images for NUS-WIDE, is used as training data, and the other collection, consisting of 10,000 images for MIR
Flickr 2011 and 92,333 images for NUS-WIDE, is used as testing data. We repeat the experiment 10 times. Each run adopts a
new separation of the collections. We report the result based on the average over the 10 trials. In total, 1599 test queries
are evaluated on MIR Flickr 2011 which contains 1500 multi-concepts and 99 single-concepts, while on NUS-WIDE a total of
1581 test queries are evaluated which is comprised of 1500 multi-concepts and 81 single-concepts. The mAP scores and their
standard deviations of MRBDL for the compared approaches on the two datasets are presented in Tables 2 and 3 respectively.
VGGNet+KNN means using the visual VGGNet classifier and textual KNN classifier, while VGGNet+SVM means using the
visual VGGNet classifier and textual linear SVM classifier. We illustrate the results on two evaluation groups F&G: a single-
concept query set G and a multi-concept query set F, and U&V: an easy query set V and a difficult query set U, shown in
Figs. 4 and 5, respectively. The precision-scope curves on the two datasets are reported in Fig. 6, demonstrating the precision
variation with the number of returned relevant images. We experimented with ML-KNN, LIBSVM, MKL, v+p+t+TagProp on
NUS-WIDE, but found that they cannot easily scale to a large-scale image dataset because of their O(n^2) time complexity.
Hence, we do not compare them in Table 3.
From the reported results, it can be clearly observed that the proposed MRBDL surpasses the compared approaches. For
instance, on MIR Flickr 2011, the highest mAP of MRBDL is 21.7%, which is more than 10.7% better than the second best
mAP 19.6% achieved by DBM. On NUS-WIDE MRBDL obtains a remarkable improvement of about 18.6%. Compared to the
state-of-the-art approach (i.e. DBM), MRBDL shows better performance on the precision-scope curves as well. Furthermore,
we can obtain several insightful observations as follows.

Table 2
Multi-concept retrieval performance (mAP %) over all 1599 test queries on
MIR Flickr 2011.

Approaches All Concepts 2-Concept 3-Concept 4-Concept

Random 3.0 ± 0.5 4.1 ± 0.4 1.8 ± 0.3 1.5 ± 0.4


ML-KNN 14.0 ± 0.3 16.7 ± 0.2 10.8 ± 0.3 10.6 ± 0.3
LIBSVM 16.6 ± 0.4 19.8 ± 0.5 12.9 ± 0.3 12.4 ± 0.4
FastTag 17.1 ± 0.3 20.2 ± 0.5 13.6 ± 0.3 12.9 ± 0.4
VGGNet+KNN 18.0 ± 0.4 21.1 ± 0.4 14.6 ± 0.4 14.4 ± 0.5
VGGNet+SVM 17.7 ± 0.5 21.0 ± 0.3 13.9 ± 0.4 13.6 ± 0.4
MKL 15.9 ± 0.4 18.7 ± 0.4 12.5 ± 0.4 11.9 ± 0.5
v+p+t+TagProp 17.6 ± 0.4 20.8 ± 0.4 14.1 ± 0.4 13.7 ± 0.2
Autoencoder 19.3 ± 0.4 22.6 ± 0.6 15.6 ± 0.4 15.0 ± 0.5
DBN 19.1 ± 0.6 22.4 ± 0.5 15.9 ± 0.4 15.2 ± 0.5
DBM 19.6 ± 0.3 22.9 ± 0.4 16.0 ± 0.3 15.6 ± 0.4
MRBDL 21.7 ± 0.4 24.7 ± 0.4 18.0 ± 0.3 17.5 ± 0.3

Table 3
Multi-concept retrieval performance (mAP %) over all 1581 test queries on
NUS-WIDE.

Approaches All Concepts 2-Concept 3-Concept 4-Concept

Random 0.3 ± 0.7 0.2 ± 0.6 0.4 ± 0.4 0.3 ± 0.5


FastTag 20.3 ± 0.5 24.9 ± 0.6 17.4 ± 0.7 13.7 ± 0.5
VGGNet+SVM 21.1 ± 0.6 24.1 ± 0.5 18.2 ± 0.6 14.8 ± 0.5
Autoencoder 22.0 ± 0.4 26.0 ± 0.6 19.2 ± 0.6 15.7 ± 0.5
DBN 21.9 ± 0.5 25.7 ± 0.4 18.9 ± 0.4 15.4 ± 0.5
DBM 24.2 ± 0.6 28.7 ± 0.5 21.3 ± 0.3 17.3 ± 0.6
MRBDL 28.7 ± 0.5 32.4 ± 0.5 25.4 ± 0.4 22.5 ± 0.5

Fig. 4. Multi-concept retrieval performance (mAP %) over (a) the query group F&G: a single-concept query set G and a multi-concept query set F, and (b)
the query group U&V: an easy query set V and a difficult query set U on MIR Flickr 2011.

On the two datasets, it is interesting to find that multi-concept-based retrieval performs worse than single-concept-based
retrieval, as shown in Figs. 4 and 5. Especially for the conventional approaches, the mAP for multi-concept retrieval sharply
declines compared to single-concept retrieval. For instance, the SVM achieves a mAP of 40.3% on a single-concept, whereas it
only obtains a mAP of about 19.8%, 12.9% and 12.4%, respectively on 2-concept, 3-concept and 4-concept on MIR Flickr 2011.
This experimental phenomenon validates our analysis on multi-concept scene recognition introduced in Section 1. Actually,
a multi-concept scene may have unique visual characteristics, while the traditional single-concept approaches aim to achieve
accurate results for single-concept recognition. This is possibly because minimizing the loss function of single-concept scene
recognition is the main objective which may not be consistent with the objective of multi-concept scene recognition. To
retrieve a holistic scene, existing approaches employ a combination of single-concept technologies. But in some cases, this
may lose the valuable semantics latently embedded in the holistic scene, which makes it difficult to identify a multi-concept
scene solely by a single-concept classifier. Single-concept approaches may not be suitable for multi-concept-based image
retrieval. This observation also motivates us to design MRBDL to effectively exploit the multi-concept scene FC classifier to
recognize the holistic scene.

Fig. 5. Multi-concept retrieval performance (mAP %) over (a) the query group F&G: a single-concept query set G and a multi-concept query set F, and (b)
the query group U&V: an easy query set V and a difficult query set U on NUS-WIDE.

Fig. 6. Precision-scope curves on (a) MIR Flickr 2011 and (b) NUS-WIDE varying the number of returned relevant images.
In MRBDL, multi-concept retrieval shows relatively steady performance across query concept lengths, obtaining a mAP of about 21.7% and 28.7% on MIR Flickr 2011 and NUS-WIDE, respectively, whereas the performance of many of the compared methods is unstable with respect to query concept length. This is because MRBDL comprehensively considers the visual modality, the text modality and semantics in multi-concept scene learning. For the visual and text modalities, MRBDL utilizes four types of FC classifier, namely visual single-concept, visual multi-concept, textual single-concept and textual multi-concept classifiers, to enhance the performance of FC classifiers for multi-concept scenes. The multi-concept scene FC classifier employs the loss function of a multi-concept scene to recognize a multi-concept scene C_j ∈ V^+, as shown in (4). This design ensures that MRBDL is discriminative for both single-concept and multi-concept scene recognition.
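To make the four-head design more tangible, the following is a minimal, heavily simplified sketch of a bimodal network with {visual, textual} × {single-concept, multi-concept scene} FC heads. The backbone choice, hidden sizes and the bag-of-tags text representation are assumptions for illustration only and do not reproduce the exact architecture or the loss in (4).

```python
import torch
import torch.nn as nn
from torchvision import models

class BimodalMultiConceptNet(nn.Module):
    """Illustrative two-branch network: a visual CNN branch and a textual (tag) branch,
    each with a single-concept FC head and a multi-concept scene FC head."""
    def __init__(self, n_concepts, n_scenes, tag_vocab_size, hidden=512):
        super().__init__()
        vgg = models.vgg16(weights=None)                     # stand-in visual backbone
        self.visual = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                    nn.Linear(512 * 7 * 7, hidden), nn.ReLU())
        self.textual = nn.Sequential(nn.Linear(tag_vocab_size, hidden), nn.ReLU())
        # four FC classifiers: {visual, textual} x {single-concept, multi-concept scene}
        self.v_single = nn.Linear(hidden, n_concepts)
        self.v_scene = nn.Linear(hidden, n_scenes)
        self.t_single = nn.Linear(hidden, n_concepts)
        self.t_scene = nn.Linear(hidden, n_scenes)

    def forward(self, image, tag_bow):
        v = self.visual(image)        # image: (B, 3, H, W) RGB input
        t = self.textual(tag_bow)     # tag_bow: (B, tag_vocab_size) bag-of-tags vector
        return (self.v_single(v), self.v_scene(v),
                self.t_single(t), self.t_scene(t))
```

In this reading, the single-concept heads are trained with a per-concept multi-label loss, while the scene heads are trained against holistic multi-concept scene labels, mirroring the combination described above.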
From Figs. 4(b) and 5(b), MRBDL is the best approach over both the easy query set V and the difficult query set U. The difficult queries are indeed harder, since the average concept frequencies of U (66 and 60) are far lower than those of V (472 and 604) on MIR Flickr 2011 and NUS-WIDE, respectively. The advantage of MRBDL is greater over the difficult query set: on U, the mAP improvement over DBM, the second best approach, is 21.6% and 30.9%, compared with 5.3% and 12.9% over the easy query set on MIR Flickr 2011 and NUS-WIDE, respectively. This improvement may be due to the combination of a single-concept FC classifier and a multi-concept scene FC classifier in our multi-concept bimodal CNN model.

Fig. 7. Effects of contextual tags on (a) MIR Flickr 2011 and (b) NUS-WIDE in terms of mAP (%).

Furthermore, to compensate for the varying frequencies of concepts, we maximize the log-likelihood of the semantic scores over the dataset.
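As a rough illustration of this frequency compensation (the exact objective is defined earlier in the paper and is not reproduced here), one common way to realize it is to weight each concept's log-likelihood term inversely to its training frequency; the weighting scheme below is an assumption.

```python
import torch

def frequency_weighted_log_likelihood(scores, targets, concept_freq, eps=1e-7):
    """scores: (B, C) predicted semantic scores in (0, 1); targets: (B, C) 0/1 labels;
    concept_freq: (C,) tensor with each concept's training frequency."""
    weights = 1.0 / (concept_freq.float() + 1.0)   # rarer concepts receive larger weights
    weights = weights / weights.mean()             # keep the average weight at 1
    ll = targets * torch.log(scores + eps) + (1 - targets) * torch.log(1 - scores + eps)
    return (weights * ll).sum(dim=1).mean()        # objective to be maximized during training
```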
Deep learning approaches such as DBM and the deep autoencoder achieve higher mAP than shallow learning approaches such as KNN and SVM in many cases, which indicates that deep models capture the underlying semantic structure of images better during descriptor learning. Combinations of deep and shallow learning, such as VGGNet+SVM, also perform well.

5.2. Effects of textual tags

We conduct experiments to validate the effectiveness of the contextual tags for multi-concept-based image retrieval. More specifically, we compare the performance of MRBDL over all 1599 test queries when using the visual modality alone and when using both the visual and text modalities.
Fig. 7 reports the detailed experimental results on the two datasets. The performance of concept-based image retrieval is clearly improved by the auxiliary text, because the relationship between visual images and semantic concepts can be modeled and correlated better when contextual data is available.
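The visual-only versus visual-plus-text comparison in Fig. 7 can be viewed as scoring with and without the textual branch. A minimal late-fusion sketch is shown below; the fixed fusion weight is an illustrative assumption rather than the paper's learned combination.

```python
import numpy as np

def fuse_modality_scores(visual_scores, text_scores=None, alpha=0.6):
    """Combine per-concept semantic scores from the two branches.
    With text_scores=None this reduces to the visual-only setting of Fig. 7."""
    visual_scores = np.asarray(visual_scores, dtype=float)
    if text_scores is None:
        return visual_scores
    return alpha * visual_scores + (1 - alpha) * np.asarray(text_scores, dtype=float)
```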

5.3. Effects of training data ratio

Image retrieval approaches, especially deep learning ones, perform well with a large number of training examples; with limited training samples they generally overfit to the training images, leading to limited retrieval performance. In addition, real-world datasets are usually imbalanced and contain a large number of rare concepts with few training samples. Therefore, in this section we investigate how performance varies with the training set size on MIR Flickr 2011 and NUS-WIDE.
We employ data ratios r ∈ {10%, 20%, 25%, 30%, 40%, 50%} of the training examples in the experiments. For instance, employing r = 10% of the training images means that we use a total of 800 and 13,837 training images on MIR Flickr 2011 and NUS-WIDE, respectively. The 30%, 40% and 50% settings contain many more training images than the 10%, 20% and 25% settings; the first three groups of experiments therefore reflect retrieval performance with a small number of training examples, while the last three reflect performance with a medium number of training samples.
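How such subsets can be drawn is sketched below; random sampling and the 8,000-image MIR Flickr 2011 training pool (implied by the 800-image example above) are assumptions, since the paper does not state its exact sampling protocol.

```python
import random

def sample_training_subset(train_ids, ratio, seed=0):
    """Randomly draw a `ratio` fraction of the training image ids."""
    rng = random.Random(seed)
    k = int(round(len(train_ids) * ratio))
    return rng.sample(train_ids, k)

# r = 10% of an assumed 8,000-image MIR Flickr 2011 training pool -> 800 images
subsets = {r: sample_training_subset(list(range(8000)), r)
           for r in (0.10, 0.20, 0.25, 0.30, 0.40, 0.50)}
```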
Fig. 8 shows the main results on the two datasets. We can observe that the mAP of MRBDL increases as more training data is used. In particular, when only limited training images are available, MRBDL shows relatively stable performance compared with the other approaches, which illustrates the stability of the proposed MRBDL with a reasonably small training set. Furthermore, MRBDL outperforms the other compared approaches, achieving mAP scores of 18.0, 18.7, 19.0, 19.7, 19.9 and 20.1 (%) on MIR Flickr 2011 and 22.6, 23.9, 24.4, 24.6, 25.4 and 26.0 (%) on NUS-WIDE. This further validates that our approach can effectively improve the performance of multi-concept retrieval when training images are limited.

Fig. 8. Performance variations with training data ratio on (a) MIR Flickr 2011 and (b) NUS-WIDE in terms of mAP (%).

Table 4
Multi-concept retrieval performance (mAP %) for rare concepts and frequent concepts on MIR Flickr 2011.

Approaches R1 R2 R3 F1 F2 F3

ML-KNN 23.3 ± 0.8 4.9 ± 0.9 3.3 ± 0.7 43.8 ± 0.7 48.6 ± 0.8 26.4 ± 0.6
LIBSVM 28.4 ± 0.8 6.5 ± 0.9 2.8 ± 0.8 50.0 ± 0.7 54.0 ± 0.5 30.8 ± 0.7
FastTag 29.1 ± 0.8 7.2 ± 0.5 5.0 ± 0.7 49.9 ± 0.7 54.0 ± 0.7 31.3 ± 0.6
VGGNet+KNN 28.7 ± 0.8 7.7 ± 0.9 4.3 ± 0.5 50.6 ± 0.9 54.3 ± 0.7 32.1 ± 0.8
VGGNet+SVM 30.2 ± 0.6 7.1 ± 0.8 3.8 ± 0.9 51.6 ± 0.7 55.4 ± 0.7 31.5 ± 0.7
MKL 27.8 ± 0.8 7.0 ± 0.7 5.7 ± 0.6 46.2 ± 0.8 49.4 ± 0.8 27.1 ± 0.7
v+p+t+TagProp 28.6 ± 1.0 7.6 ± 0.7 5.8 ± 0.7 49.9 ± 0.7 54.0 ± 0.7 31.0 ± 0.6
Deep Autoencoder 32.1 ± 0.8 8.5 ± 0.7 6.0 ± 0.6 53.0 ± 0.9 56.8 ± 0.6 34.5 ± 0.5
DBN 32.0 ± 0.8 8.6 ± 0.5 5.9 ± 0.4 52.8 ± 0.8 56.6 ± 0.8 34.4 ± 0.8
DBM 32.8 ± 0.7 8.9 ± 0.4 5.7 ± 0.7 53.4 ± 0.7 57.2 ± 0.7 34.4 ± 0.8
MRBDL 34.5 ± 0.8 10.9 ± 0.7 9.2 ± 0.7 54.0 ± 0.7 57.6 ± 0.6 35.2 ± 0.7

5.4. Rare concept experiments

This section investigates the retrieval performance on imbalanced datasets containing a high percentage of rare concepts. Real-world datasets are commonly imbalanced, which poses a significant challenge for developing retrieval models. On MIR Flickr 2011 and NUS-WIDE, about 2/3 and 4/5 of the semantic concepts, respectively, have frequencies below the average concept frequency. Concept classifiers tend to over-classify the frequent concepts with high occurrence frequencies in the learning stage, which makes it hard to derive suitable models for rare concepts with low occurrence frequencies. In such cases, classifiers usually perform well on frequent concepts but very poorly on rare concepts. Consequently, the advantages of many previous retrieval approaches may not hold on imbalanced datasets for concept-based image retrieval. This also underscores the importance of explicitly accounting for the varying frequencies of concepts when developing classifiers.
We consider two groups of experiments. In the first group, the 50 rarest single-concepts, 2-concepts and 3-concepts from Q are selected as the single-concept rare query set R1, the 2-concept rare query set R2 and the 3-concept rare query set R3, respectively. Similarly, in the second group, the top-50 frequent single-concepts, 2-concepts and 3-concepts from Q are chosen as the single-concept frequent query set F1, the 2-concept frequent query set F2 and the 3-concept frequent query set F3, respectively.
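The construction of these query sets can be sketched as follows, assuming the ground-truth label set of each image is available; the data structures and names are illustrative, not taken from the paper's implementation.

```python
from collections import Counter

def build_query_sets(queries, image_label_sets, top_k=50):
    """queries: candidate concept tuples of length 1, 2 or 3 (the set Q);
    image_label_sets: list of ground-truth concept sets, one per training image."""
    freq = Counter()
    for q in queries:
        freq[q] = sum(1 for labels in image_label_sets if set(q) <= labels)
    rare, frequent = {}, {}
    for length in (1, 2, 3):
        group = sorted((q for q in queries if len(q) == length), key=lambda q: freq[q])
        rare[length] = group[:top_k]          # R1, R2, R3: the 50 rarest queries
        frequent[length] = group[-top_k:]     # F1, F2, F3: the 50 most frequent queries
    return rare, frequent
```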
The mAP scores on the different query sets are reported in Tables 4 and 5 for MIR Flickr 2011 and NUS-WIDE, respectively. It can be observed that the concept classifiers obtain much lower mAP scores on the rare concept sets R1, R2 and R3 and higher mAP scores on the frequent concept sets F1, F2 and F3, which confirms the classifiers' bias toward frequent concepts. In practice, this means that concept-based image retrieval may perform worse when a user submits a rare multi-concept scene query, which degrades the user experience.
From Table 5, on the frequent concept sets our approach achieves mAP values of about 66.2%, 52.3% and 40.9%, outperforming the best compared approach, DBM, by about 5.8%, 5.9% and 8.8% in relative terms. On the rare concept sets, the proposed approach achieves even greater relative gains of about 8.4%, 44.4% and 43.2%. The cause of the improvements on rare concepts is as follows. When detecting a rare concept Q, a group of weighted concept classifiers of its semantic neighbors Q^+ is involved in the detection. Among these semantic neighbors, some concepts C_j ∈ Q^+ may be frequent concepts, which boosts the relevance score of Q.

Table 5
Multi-concept retrieval performance (mAP %) for rare concepts and frequent concepts on NUS-WIDE.

Approaches R1 R2 R3 F1 F2 F3

FastTag 49.0 ± 0.8 12.9 ± 0.8 15.0 ± 0.8 55.5 ± 0.9 40.3 ± 0.9 29.6 ± 0.6
VGGNet+SVM 53.4 ± 1.0 11.8 ± 0.8 12.9 ± 0.9 62.3 ± 1.2 45.9 ± 1.0 32.8 ± 1.1
Deep Autoencoder 52.9 ± 0.6 12.0 ± 0.7 12.3 ± 0.9 61.1 ± 0.8 47.3 ± 0.9 35.5 ± 1.0
DBN 52.2 ± 1.0 12.0 ± 0.8 12.5 ± 1.0 61.1 ± 1.0 47.7 ± 0.8 35.0 ± 0.8
DBM 55.0 ± 1.0 13.5 ± 1.0 14.6 ± 0.8 62.6 ± 0.8 49.4 ± 0.5 37.6 ± 0.8
MRBDL 59.6 ± 0.8 19.5 ± 0.8 20.9 ± 0.8 66.2 ± 1.1 52.3 ± 0.9 40.9 ± 1.0

Fig. 9. Variations of objective function value in (18) with the number of iterations on NUS-WIDE.

Thus, the rare concept Q becomes easier to detect. Furthermore, we maximize the log-likelihood function of the relevance scores over the training samples to compensate for the varying frequencies of concepts. Therefore, MRBDL can mitigate the concept imbalance and improve retrieval performance.
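The neighbor-boosting mechanism described above can be pictured as a weighted combination of the scores produced by the classifiers of the semantic neighbors Q^+. The sketch below uses similarity-normalized weights purely as an assumption; the paper defines its own weighting.

```python
import numpy as np

def neighbor_boosted_relevance(query_scores, neighbor_scores, neighbor_sims):
    """query_scores: (n_images,) scores from the rare query's own classifier;
    neighbor_scores: (n_neighbors, n_images) scores from classifiers of concepts in Q^+;
    neighbor_sims: (n_neighbors,) semantic similarity of each neighbor to the query."""
    w = np.asarray(neighbor_sims, dtype=float)
    w = w / (w.sum() + 1e-12)                                  # normalized similarity weights
    boost = (w[:, None] * np.asarray(neighbor_scores, dtype=float)).sum(axis=0)
    # frequent, well-trained neighbors in Q^+ lift the relevance score of the rare query
    return np.asarray(query_scores, dtype=float) + boost
```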

5.5. Convergence analysis

This section analyzes the convergence of MRBDL. Fig. 9 plots the objective function value in (18) as a function of the number of iterations on NUS-WIDE. We can observe that the objective function value first increases with the number of iterations and then becomes steady after a certain number of iterations, which shows that the convergence of MRBDL can be assured with the gradient descent approach.
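A convergence criterion consistent with the behavior in Fig. 9 can be implemented by monitoring the relative change of the objective value across iterations; the tolerance and window below are assumptions, not the paper's stopping rule.

```python
def has_converged(objective_history, tol=1e-4, window=5):
    """Return True once the objective has changed by less than `tol` (relatively)
    over the last `window` iterations."""
    if len(objective_history) <= window:
        return False
    old, new = objective_history[-window - 1], objective_history[-1]
    return abs(new - old) / (abs(old) + 1e-12) < tol
```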

6. Conclusion

This paper contributes to the development of concept-based image retrieval techniques. We propose a novel approach, MRBDL, to tackle the problem of MCIR within a unified deep learning framework. To boost the discriminability for multi-concept scenes, MRBDL utilizes multi-concept scene FC classifiers in the visual and text CNNs to detect holistic scenes with unique visual characteristics. Moreover, the semantic correlations of concepts are used to improve multi-concept scene detection. To compensate for the varying frequencies of concepts, the log-likelihood function of the relevance scores is maximized over the training images, which mitigates concept imbalance and thus enhances retrieval performance. Comprehensive experiments on two public web image datasets show that multi-concept-based image retrieval benefits from three kinds of data, namely visual, textual and semantic information, and that MRBDL achieves superior performance compared with several state-of-the-art approaches.
This research opens up several promising directions for further exploration. We plan to further enhance the performance of MRBDL with additional associated information, such as user comments, social relationships between images and the geographical locations of images. Furthermore, we will look into semantic approaches based on deep recurrent neural networks for holistic scene detection. The denoising of social tags is also an interesting issue for future work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61370229), the S&T Projects of Guangdong Province (Nos. 2014B010117007, 2015B010110002, 2015A030401087 and 2016B010109008), the S&T Projects of Guangzhou Municipality (No. 201604010003), the GDUPS (2015), and the China Postdoctoral Science Foundation (Nos. 2016M600657 and 2017T100637).
