arXiv:1905.12203v1 [cs.LG] 29 May 2019
Flexible Cross-Modal Hashing
Xuanwu Liu, Jun Wang
Southwest University, China
[email protected]

Guoxian Yu
Southwest University, China and KAUST, SA
[email protected]

Carlotta Domeniconi
George Mason University, USA
[email protected]

Xiangliang Zhang
KAUST, SA
[email protected]
ABSTRACT
Hashing has been widely adopted for large-scale data retrieval in many domains, due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, these methods generally require the same number of samples across different modalities, which restricts their flexibility.
We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly-paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the local structure of each cluster, and thus to find the potential correspondence between clusters (and the samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes, in a unified objective function, the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantitative loss. An alternating optimization technique is also introduced to coordinate the correspondence and the hash functions, and to reinforce the reciprocal effects of the two objectives. Experiments on public multi-modal datasets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it indeed offers a high degree of flexibility for practical cross-modal hashing tasks.

KEYWORDS
Cross-modal hashing, weakly-paired, flexibility, optimization

ACM Reference Format:
Xuanwu Liu, Jun Wang, Guoxian Yu, Carlotta Domeniconi, and Xiangliang Zhang. 2019. Flexible Cross-Modal Hashing. In Proceedings of ACM Conference (Conference'19). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Hashing has attracted increasing interest from both academia and industry, due to its low storage cost and high retrieval speed on big data [4, 8, 24, 26]. Hashing aims at compressing high-dimensional vectorial data into short binary codes that preserve the structure of the original data, thus facilitating efficient retrieval with significantly reduced storage. Based on the index constructed from hashing codes, big data retrieval can be performed in constant or sub-linear time [12, 16, 19, 24-26, 29].
With the wide adoption of the Internet of Things, the rapid influx of multi-modal data calls for efficient cross-modal hashing solutions. For example, given an image or video about a historic event, one may want to cross-modally retrieve texts describing the event in detail. How to perform cross-modal hashing on such widely-witnessed multi-modal data has therefore become a topic of interest in hashing [10, 24, 26, 28]. Based on whether the labels of training samples are used or not, existing cross-modal hashing solutions can be roughly divided into unsupervised and supervised ones. Unsupervised solutions seek hash coding functions by taking into account the underlying data structure, distributions, or topological information [2, 21]. Supervised (and semi-supervised) approaches try to leverage supervised information (i.e., semantic labels) to improve the performance [3, 7, 13, 23, 28].
Existing cross-modal hashing methods optimistically assume that the correspondence between samples of different modalities is known [9]. However, in real applications, some objects are only available in one modality, or their corresponding (paired) objects in another modality are only partially (or even totally) unknown. This can happen, for example, when one wants to search images using text, and there are 100 images and 200 documents, but the correspondence is only partially known (say, between 50 of the images and 80 of the documents). In other words, the image-text collection is weakly-paired, and only the semantic labels are shared across modalities. To the best of our knowledge, how to flexibly learn hashing codes from such weakly-paired data is still an untouched and challenging topic in cross-modal hashing.
Some attempts have been made to tackle weakly-paired multi-view data [11, 15, 30]. To name a few, Weakly-paired Maximum Covariance Analysis (WMCA) extends maximum covariance analysis to the weakly-paired case by jointly learning the latent pairs and the subspace for dimensionality reduction and transfer learning [11]. Multi-modal Projection Dictionary Learning (MMPDL) jointly learns the projective dictionary and the pairing matrix for fusion classification [15]. Zong et al. [30] assume that the cluster indicator vectors of two samples from two different views should be similar if
they belong to the same cluster and dissimilar otherwise, and then tackle multi-view clustering on unpaired data via nonnegative matrix factorization. Mandal et al. [17] learn coupled dictionaries from the respective data views and sparse representation coefficients with respect to their own dictionaries. They then maximize the correlation between the sample coefficients of the same class, and simultaneously minimize the correlation between different classes, to seek the matching between samples and to fuse weakly-paired multi-view data. However, these approaches still handle the weakly-paired problem in a non-flexible setting. For example, WMCA requires the number of samples in different modalities to be the same, and MMPDL needs the same number of samples for each class across the modalities. These requirements are violated in many cases, where samples across different modalities are partially-paired and the numbers of member samples of matched clusters (or classes) across modalities are not the same.
In this paper, we propose a Flexible Cross-Modal Hashing (FlexCMH) solution (as illustrated in Fig. 1) to handle partially-paired
(and even completely unpaired) multi-modal data. Our main contributions are summarized as follows:
(1) We design a novel matching strategy that uses the centroids of clusters, the neighborhood structure of the centroids, and the (possibly incomplete) correspondence between samples to seek a matching between samples in different modalities. The matching strategy requires neither the same number of samples within the matched clusters, nor the same number of samples across different modalities. Therefore, FlexCMH can be applied with flexibility in general cross-modal hashing settings.
(2) We propose a unified objective function that simultaneously considers the cross-modal matching loss, the intra-modal representation loss, and the quantitative loss to learn adaptive hashing codes. We also introduce an alternating optimization technique to jointly optimize the correspondence and the hash functions, and to reinforce the reciprocal effects of these two objectives.
(3) Experiments on benchmark multi-modal datasets show that FlexCMH significantly outperforms related and representative cross-modal hashing approaches [2, 11, 13, 15, 28] in weakly-paired cases, and it maintains a competitive performance in different open settings.
The rest of this paper is organized as follows. Section 2 introduces
the objective function of FlexCMH, and its optimization. Section
3 presents the experimental setup, results, and analysis. Section 4
draws the conclusions and provides directions for future work.
2 PROPOSED METHOD
Suppose we have $M$ modalities, and the number of training samples for the $m$-th modality is $N_m$. $\mathbf{X}^m \in \mathbb{R}^{N_m \times d_m}$ represents the data matrix of the $m$-th modality, where both $N_m$ and $d_m$ are modality-dependent. $\mathbf{Y} \in \mathbb{R}^{N_m \times l}$ stores the label information of the $N_m$ samples, where $l$ is the number of labels. $\mathbf{Y}_{ik} \in \{0,1\}$; $\mathbf{Y}_{ik} = 1$ indicates that $\{\mathbf{x}^m_i\}_{m=1}^{M}$ is annotated with the $k$-th label, and $\mathbf{Y}_{ik} = 0$ otherwise. For example, in a two-modality Wiki image-search application, $\mathbf{x}^1_i$ is the image feature vector of sample $i$, and $\mathbf{x}^2_i$ is the tag vector of this sample. To enable cross-modal hashing, we need to learn two hashing functions, $F^1: \mathbb{R}^{d_1} \to \{0,1\}^b$ and $F^2: \mathbb{R}^{d_2} \to \{0,1\}^b$, where $b$ is the length of the binary hash codes. These two hashing functions are expected to map $\mathbf{x}^1_i$ and $\mathbf{x}^2_i$ from the respective modalities onto a common Hamming space while preserving the proximity of the original data.
Canonical cross-modal hashing assumes that training samples in different modalities have a complete correspondence. However, the samples may only be weakly-paired. For example, consider the scenario in which, due to a temporary sensor failure, $\mathbf{x}^1_i$ and $\mathbf{x}^2_i$ do not describe the same object from different feature views; instead, $\mathbf{x}^1_i$ and $\mathbf{x}^2_j$ ($i \ne j$) depict the same object. An intuitive solution is to use only the paired samples. However, the structure information jointly reflected by the paired and unpaired samples may then be distorted, and the performance may be heavily compromised. Moreover, if the pair information between two modalities is totally unknown, the canonical solutions cannot be applied at all.
To achieve effective cross-modal hashing on such weakly-paired (or totally unpaired) multi-modal data, we introduce a flexible solution (FlexCMH), whose overall workflow is shown in Fig. 1. FlexCMH first introduces a clustering-based matching strategy that leverages the cluster centroids and the local structure around the centroids to explore the potential correspondence between clusters (and the samples within them) across different modalities. Next, it defines a permutation matrix based on the explored correspondence to unify the indexes of the same samples across modalities. Based on the unified indexes, it introduces a unified objective function to simultaneously account for the cross-modal similarity preserving loss, the intra-modal representation loss, and the quantitative hashing loss. An alternating optimization technique is also proposed to jointly optimize the correspondence and the hash functions, and to reinforce the reciprocal effects of these two objectives. The following subsections elaborate on this process.
2.1 Clustering-based cross-modal matching strategy
Unlike single-modal hashing, the correspondence between samples is crucial for multi-modal data fusion and retrieval. For completely matched samples, the correspondence is fully known and can be used, along with the inter- and intra-modality similarity between samples, to learn cross-modal hashing functions. But for weakly-paired data, since the correspondence is only partially known, it is non-trivial to quantify the similarity between samples from different modalities. A remedy is to divide the samples into different groups based on their labels and to impose some constraints (i.e., concerning the similarity between different classes) on the coding vectors [14, 27]. In the representation space, the within-class data should cluster together even though they come from different modalities, and the between-class data should be placed far apart from each other. In other words, all the data vectors of the same class (different classes) from different modalities should be similar (dissimilar) [22]. We can approximate the similarity between different classes using the centroids of the respective groups [15]. However, considering only the centroids may not be sufficient, and the neighborhood objects around a centroid may also be helpful.
Figure 1: Workflow of the proposed FlexCMH (Flexible Cross-Modal Hashing). FlexCMH includes two parts: (1) a clustering-based matching strategy to explore the matched clusters and the samples therein across modalities; (2) a unified objective function to jointly account for the inter-modal representation loss, the intra-modal representation loss, and the quantitative loss, in order to learn adaptive hashing functions. The intra-modal representation loss aims at exploring the clusters and centroids of the respective modalities. The inter-modal representation loss aims at preserving the proximity between samples of different modalities using the matched samples. The quantitative loss quantifies the loss incurred when mapping the high-dimensional vectors to binary codes.
Furthermore, incomplete labels of training data restrict the quality
of groups.
Given these observations, we introduce a novel clustering-based matching strategy that leverages the centroids of clusters and the local structure around the centroids. This strategy can explore the correspondence between clusters (and the samples therein) across different modalities. We illustrate the clustering-based matching strategy in the center of Fig. 1, where the stars represent the centroids of clusters in different modalities, and the red points indicate the objects with known correspondence in the other modality. The likelihood that two clusters match increases with the similarity of their centroids and with the similarity of the local structure around the centroids. To quantify this, we define a match function as follows:

$$s_{cc'}^{mm'} = \sum_{g=1}^{n_s}\left(\|\mathbf{x}^m_{cg} - \mathbf{z}^m_c\|_F^2 - \alpha\|\mathbf{x}^{m'}_{c'g} - \mathbf{z}^{m'}_{c'}\|_F^2\right)^2 \qquad (1)$$

where $\mathbf{z}^m_c$ and $\mathbf{z}^{m'}_{c'}$ are the centroids of the $c$-th cluster in the $m$-th modality and of the $c'$-th cluster in the $m'$-th modality, $n_s$ is the user-specified number of nearest samples of the centroids, $\mathbf{x}^m_{cg}$ is the $g$-th nearest sample of $\mathbf{z}^m_c$, and $\alpha = \|\mathbf{z}^m_c\|_F^2 / \|\mathbf{z}^{m'}_{c'}\|_F^2$ is a scalar coefficient that balances the scale difference between the two modalities.

To seek the correspondence between clusters of different modalities, Eq. (1) accounts not only for the centroids, but also for the neighborhood samples around the centroids. As such, it can explore the correspondence between the neighborhood samples of the respective centroids to facilitate the follow-up cross-modal hashing. In contrast, existing solutions only match centroids using labeled samples and ignore these important local patterns [11, 15]. Our match function requires neither that two matched clusters have the same number of samples, nor that different modalities have the same number of samples. It can also be applied to multi-modality data whose label information and correspondence are completely unknown. These advantages contribute to the flexibility of FlexCMH.
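To make Eq. (1) concrete, the following sketch (our illustration under the notation above; the function and variable names are ours and not from the released FlexCMH code) computes the match score between a cluster of modality $m$ and a cluster of modality $m'$.

```python
import numpy as np

def cluster_match_score(Xc_m, z_m, Xc_m2, z_m2, ns=5):
    """Match score s_{cc'}^{mm'} of Eq. (1); smaller values indicate a better match.

    Xc_m  : (n_c, d_m)   members of cluster c in modality m (rows are samples)
    z_m   : (d_m,)       centroid of cluster c in modality m
    Xc_m2 : (n_c', d_m') members of cluster c' in modality m'
    z_m2  : (d_m',)      centroid of cluster c' in modality m'
    ns    : number of nearest samples around each centroid to compare
    """
    # squared distances of the cluster members to their own centroid
    d1 = np.sort(np.sum((Xc_m - z_m) ** 2, axis=1))
    d2 = np.sort(np.sum((Xc_m2 - z_m2) ** 2, axis=1))
    ns = min(ns, d1.size, d2.size)          # guard against small clusters

    # alpha = ||z_c^m||^2 / ||z_{c'}^{m'}||^2 balances the scale of the two modalities
    alpha = np.sum(z_m ** 2) / np.sum(z_m2 ** 2)

    # Eq. (1): squared gap between the (scale-balanced) local distance profiles
    return float(np.sum((d1[:ns] - alpha * d2[:ns]) ** 2))
```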
Two clusters ($c$ and $c'$) and their respective centroids $\mathbf{z}^m_c$ and $\mathbf{z}^{m'}_{c'}$ are matched if $s^{mm'}_{cc'}$ is the smallest among all pairwise clusters from the two modalities. We can align the objects in the respective modalities by reordering their indexes, and then use the 'matched' (aligned) objects in different modalities for cross-modal hashing. To this end, we define a permutation matrix $\Gamma^{mm'} \in \mathbb{R}^{N_m \times N_{m'}}$ to align samples as follows:

$$\Gamma^{mm'}_{ij} = \begin{cases} 1, & s^{mm'}_{cc'} \text{ is the smallest or } P^{mm'}_{ij} = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $P^{mm'}_{ij} = 1$ means that the $i$-th sample in the $m$-th modality is paired with the $j$-th sample in the $m'$-th modality. In this way, our clustering-based matching strategy also incorporates the known matched samples from different modalities. $\Gamma^{mm'}_{ij} = 1$ if $\mathbf{x}^m_i$ belongs to the $c$-th cluster, $\mathbf{x}^{m'}_j$ belongs to the $c'$-th cluster, and $s^{mm'}_{cc'}$ is the smallest among all pairwise clusters from the two modalities. These conditions indicate that the indexes of $\mathbf{x}^m_i$ and $\mathbf{x}^{m'}_j$ should be reordered for alignment. We observe that our matching strategy is different from typical network alignment, which aims at finding
identical sub-networks [18, 20]. In contrast, we aim at matching
samples within the explored clusters, which describe the same
object from different feature views. In addition, a sample in one
modality can be paired with more than one sample in another
modality. The follow-up cross-modal hashing functions can be
learned using the found correspondence.
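As an illustration of how Eq. (2) can be instantiated, the sketch below greedily pairs each cluster of modality $m$ with its lowest-score cluster in modality $m'$, and then aligns the samples of matched clusters together with any already-known pairs; the greedy pairing and all names here are our assumptions, not the authors' exact procedure.

```python
import numpy as np

def build_alignment(S, labels_m, labels_m2, P=None):
    """Alignment matrix Gamma^{mm'} of Eq. (2).

    S         : (k, k) match scores s_{cc'}^{mm'} from Eq. (1); smaller = better
    labels_m  : (N_m,)  cluster index of every sample in modality m
    labels_m2 : (N_m',) cluster index of every sample in modality m'
    P         : optional (N_m, N_m') 0/1 matrix of already-known matched samples
    """
    Gamma = np.zeros((labels_m.size, labels_m2.size), dtype=np.int8)
    for c in range(S.shape[0]):
        c2 = int(np.argmin(S[c]))                  # best-matching cluster in modality m'
        rows = np.where(labels_m == c)[0]
        cols = np.where(labels_m2 == c2)[0]
        Gamma[np.ix_(rows, cols)] = 1              # samples of matched clusters are aligned
    if P is not None:
        Gamma = np.maximum(Gamma, (P > 0).astype(np.int8))   # keep known pairs (P_{ij} = 1)
    return Gamma
```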
2.2 Cross-modal hashing
To compute the matching loss, we should first identify the centroids of the respective clusters. WMCA [11] and MMPDL [15] both aim at addressing cross-modal learning with weakly-paired samples, but they obtain clusters using only labeled samples. In practice, the labels of samples may not be sufficient, or may even be unavailable. As such, these methods have restricted flexibility. To find the centroids, we adopt Semi-Nonnegative Matrix Factorization (SemiNMF) [5] as follows:

$$\mathcal{L}_s = \sum_{m=1}^{M}\|\mathbf{X}^m - \mathbf{Z}^m\mathbf{H}^m\|_F^2, \quad s.t.\ \mathbf{H}^m \ge 0 \qquad (3)$$

where $\mathbf{Z}^m \in \mathbb{R}^{d_m \times k}$ can be viewed as the latent representation of the $k$ cluster centroids of the $m$-th modality, and $\mathbf{H}^m \in \mathbb{R}^{k \times N_m}$ stores the soft cluster assignments of the samples in the latent space. The above equation calculates the intra-modality representation loss and the clustering loss simultaneously. Therefore, $\mathbf{Z}^m$ can be used for the clustering-based matching. $\mathbf{H}^m$ is the indicator matrix, which represents the probability that each of the $N_m$ samples belongs to the different classes, and can be used for hashing code learning.

To achieve sample-to-sample cross-modal retrieval, based on the matched clusters and samples from Eq. (2), we further minimize the difference between the matched pairs to encourage them to be as similar as possible. Specifically, the indicator vectors ($\mathbf{H}^m$) of two samples from two different modalities should be similar if they have the same cluster label, and dissimilar otherwise. To this end, we quantify the relationship between two different modalities by minimizing the deviation of the indicator vectors of pairwise objects from different modalities as follows:

$$\mathcal{L}_c = \sum_{c=1}^{k}\sum_{m=1}^{M}\sum_{m' \ne m}\|\mathbf{H}^m_c - \mathbf{H}^{m'}_c\Gamma^{mm'}_c\|_F^2 \qquad (4)$$

where $\mathbf{H}^m_c \in \mathbb{R}^{N_m}$ reorders the samples in $\mathbf{X}^m$ in descending order based on their association probabilities with respect to the $c$-th class. $\Gamma^{mm'}_c \in \mathbb{R}^{N \times N}$ is the permutation matrix, which shuffles the sample indexes in $\mathbf{H}^{m'}_c$ to align the samples with the same indexes in $\mathbf{H}^m_c$; it can be obtained using Eq. (2). As such, the samples of $\mathbf{H}^m_c$ can be matched with those of $\mathbf{H}^{m'}_c$. In practice, we choose the top $N$ samples that belong to the $c$-th ($c'$-th) class to set up $\mathbf{H}^m_c$ and $\mathbf{H}^{m'}_{c'}$, and to achieve the cross-modal matching. As a result, our matching strategy can accommodate the case in which the numbers of samples belonging to the same class in different modalities are different. In this way, we can achieve cross-modal retrieval on multi-modal data whose matched samples are partially or completely unknown, even with different numbers of samples in the matched clusters.

$\mathbf{H}^m$ can be viewed as the soft cluster assignments of the samples in the $m$-th modality with respect to $k$ clusters in a latent space. The assignments are also coordinated by the assignments in the other data modalities (see Eq. (4)). For cross-modal hashing, we transform the soft assignments into hard clusters $\tilde{\mathbf{H}}^m \in \{0,1\}^{b \times N}$ using k-means clustering, and then we seek the binary hash coding matrix $\mathbf{B} \in \{-1,+1\}^{b \times N}$ as follows:

$$\mathcal{L}_q = \sum_{m=1}^{M}\|\mathbf{B} - \tilde{\mathbf{H}}^m\|_F^2 \qquad (5)$$

$\mathbf{B}$ can be viewed as the common Hamming space shared across all data modalities. It can be used for cross-modal retrieval, along with the $\mathbf{H}^m$ of the respective modalities. Eq. (5) is also called the hashing quantitative loss.

2.3 Unified objective function
Based on the above analysis, we can assemble the three losses into a unified objective function, formulated as:

$$\min_{\mathbf{Z}^m,\mathbf{H}^m,\mathbf{B}}\ \sum_{c=1}^{k}\sum_{m=1}^{M}\sum_{m' \ne m}\|\mathbf{H}^m_c - \mathbf{H}^{m'}_c\Gamma^{mm'}_c\|_F^2 + \sum_{m=1}^{M}\|\mathbf{X}^m - \mathbf{Z}^m\mathbf{H}^m\|_F^2 + \lambda\sum_{m=1}^{M}\|\mathbf{B} - \tilde{\mathbf{H}}^m\|_F^2 \qquad (6)$$

where the first term quantifies the cross-modal matching loss and the inter-modal representation loss, the second term measures the intra-modal representation loss, and the third term measures the hashing quantitative loss. $\lambda$ is a scalar parameter that balances the cross-modal hashing loss and the quantitative loss. By simultaneously optimizing the above three losses, we jointly account for the correspondence and the hash functions, and thus reinforce the reciprocal effects of these two objectives. This joint optimization avoids the misleading impact of initially poorly-matched clusters and samples on the subsequent cross-modal hashing. Our experimental results confirm this advantage.
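For concreteness, a minimal sketch of how the three terms of Eq. (6) can be evaluated is given below. It assumes samples are stored as columns so that the matrix products are conformable, omits the per-class top-$N$ truncation of Eq. (4) for brevity, and uses our own helper names; it is an illustration, not the authors' implementation.

```python
import numpy as np

def flexcmh_objective(X, Z, H, H_tilde, B, Gamma, lam=1.0):
    """Value of the unified objective of Eq. (6).

    X[m]       : (d_m, N_m) data matrix, samples as columns (assumed layout)
    Z[m]       : (d_m, k)   latent cluster centroids
    H[m]       : (k, N_m)   soft cluster indicators
    H_tilde[m] : (b, N)     hard 0/1 assignments
    B          : (b, N)     binary codes in {-1, +1}
    Gamma      : dict {(m, m2): (N_m, N_m2) alignment matrix from Eq. (2)}
    """
    M = len(X)
    # inter-modal matching loss: aligned indicator vectors should agree
    match = sum(np.linalg.norm(H[m] - H[m2] @ Gamma[(m, m2)].T, 'fro') ** 2
                for m in range(M) for m2 in range(M) if m2 != m)
    # intra-modal representation loss of the SemiNMF factorization, Eq. (3)
    recon = sum(np.linalg.norm(X[m] - Z[m] @ H[m], 'fro') ** 2 for m in range(M))
    # quantitative loss between the common codes and the hard assignments, Eq. (5)
    quant = sum(np.linalg.norm(B - Ht, 'fro') ** 2 for Ht in H_tilde)
    return match + recon + lam * quant
```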
2.4 Optimization
We observe that the loss function in Eq. (6) is a sum of the cross-modal matching and retrieval loss, the intra-modal representation loss, and the hashing quantitative loss. Once $\mathbf{Z}^m$ is fixed, we can directly obtain $\Gamma^{mm'}_c$ using Eq. (2). We can solve Eq. (6) via the Alternating Direction Method of Multipliers (ADMM) [1], alternately optimizing one of $\mathbf{Z}^m$, $\mathbf{H}^m$, and $\mathbf{B}$ while keeping the other two fixed.
Optimize $\mathbf{H}^m$ with $\mathbf{Z}^m$ and $\mathbf{B}$ fixed: We utilize stochastic gradient descent (SGD) to learn $\mathbf{H}^m$ using the back-propagation (BP) algorithm. Here, Eq. (6) is decomposed into $k$ independent optimization problems, where the $c$-th sub-problem minimizes:

$$\min\ \sum_{m=1}^{M}\sum_{m' \ne m}\|\mathbf{H}^m_c - \mathbf{H}^{m'}_c\Gamma^{mm'}_c\|_F^2 + \lambda\sum_{m=1}^{M}\|\mathbf{X}^m_c - \mathbf{Z}^m\mathbf{H}^m_c\|_F^2 \qquad (7)$$

where $\mathbf{X}^m_c$ contains the same samples, in the same order, as $\mathbf{H}^m_c$. For any class $c$, the derivative of Eq. (7) with respect to the indicator matrix $\mathbf{H}^m_c$ of the $m$-th modality is:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{H}^m_c} = \sum_{m' \ne m} 2(\mathbf{H}^m_c - \mathbf{H}^{m'}_c\Gamma^{mm'}_c) + 2\lambda(\mathbf{Z}^{mT}\mathbf{Z}^m\mathbf{H}^m_c - \mathbf{Z}^{mT}\mathbf{X}^m_c) \qquad (8)$$
We can then take $\frac{\partial\mathcal{L}}{\partial\mathbf{H}^m_c}$ to update the indicator matrix $\mathbf{H}^m_c$ using SGD. Similarly, we can update $\mathbf{H}^{m'}_c$ based on the derivative $\frac{\partial\mathcal{L}}{\partial\mathbf{H}^{m'}_c}$.
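One SGD step on $\mathbf{H}^m_c$ following Eqs. (7)-(8) can be sketched as below (two modalities shown); the learning rate and the projection onto $\mathbf{H} \ge 0$ are our choices and not prescribed by the paper.

```python
import numpy as np

def sgd_step_Hc(Hc_m, Hc_m2, Gamma_c, Z_m, Xc_m, lam=1.0, lr=1e-3):
    """One gradient step on the class-c indicator block H^m_c.

    Hc_m    : (k, N)   indicators of class c in modality m
    Hc_m2   : (k, N)   indicators of class c in modality m'
    Gamma_c : (N, N)   class-level alignment matrix from Eq. (2)
    Z_m     : (d_m, k) centroids of modality m
    Xc_m    : (d_m, N) class-c samples of modality m, same order as Hc_m
    """
    # gradient of the cross-modal matching term of Eq. (7)
    grad = 2.0 * (Hc_m - Hc_m2 @ Gamma_c)
    # gradient of the weighted reconstruction term lam * ||Xc - Z Hc||_F^2
    grad += 2.0 * lam * (Z_m.T @ Z_m @ Hc_m - Z_m.T @ Xc_m)
    # descend and keep the SemiNMF nonnegativity constraint (our projection choice)
    return np.maximum(Hc_m - lr * grad, 0.0)
```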
Optimize $\mathbf{Z}^m$ with $\mathbf{H}^m$ and $\mathbf{B}$ fixed: Since $\Gamma^{mm'}_c$ depends on $\mathbf{z}^m_c$ and $\mathbf{z}^{m'}_{c'}$, we compute the derivative of Eq. (6) with respect to $\Gamma^{mm'}_c$ and $\mathbf{Z}^m$ as follows:

$$\frac{d\mathcal{L}}{d\mathbf{Z}^m} = \frac{\partial\mathcal{L}}{\partial\mathbf{Z}^m} + \frac{\partial\mathcal{L}}{\partial\Gamma^{mm'}_c}\frac{\partial\Gamma^{mm'}_c}{\partial\mathbf{Z}^m} = 2\mathbf{Z}^m\mathbf{H}^m_c\mathbf{H}^{mT}_c - 4\lambda\mathbf{X}^m_c\mathbf{H}^{mT}_c + 2\mathbf{X}^{m'}_c\Gamma^{mm'T}_c\mathbf{H}^{m'T}_c \qquad (9)$$

We can then use these derivatives to update the centroid matrix $\mathbf{Z}^m$. In each iteration, after the centroids in $\mathbf{Z}^m$ are updated, we consequently update $\Gamma^{mm'}_c$ based on Eqs. (1) and (2).
Optimize $\mathbf{B}$ with $\mathbf{H}^m$ and $\mathbf{Z}^m$ fixed: Once $\mathbf{Z}^m$ and $\mathbf{H}^m$ are fixed, $\tilde{\mathbf{H}}^m$ is also determined, and the minimization in Eq. (6) is equivalent to the following maximization:

$$\max_{\mathbf{B}}\ tr\Big(\mathbf{B}^T\big(\lambda\sum_{m=1}^{M}\tilde{\mathbf{H}}^m\big)\Big) = tr(\mathbf{B}^T\mathbf{U}) = \sum_{i,j}\mathbf{B}_{ij}\mathbf{U}_{ij} \qquad (10)$$

where $\mathbf{B} \in \{-1,+1\}^{b \times N}$ and $\mathbf{U} = \lambda\sum_{m=1}^{M}\tilde{\mathbf{H}}^m$. It is easy to observe that the binary code $\mathbf{B}_{ij}$ should keep the same sign as $\mathbf{U}_{ij}$. Therefore, we have:

$$\mathbf{B} = sign(\mathbf{U}) = sign\Big(\lambda\sum_{m=1}^{M}\tilde{\mathbf{H}}^m\Big) \qquad (11)$$

where $sign(x) = 1$ if $x > 0$, and $sign(x) = -1$ otherwise.
By iteratively applying Eqs. (8)-(11), we can jointly optimize the correspondence and the hash functions, thus reinforcing the reciprocal effects of these two objectives. The whole procedure of FlexCMH and the alternating optimization for solving Eq. (6) are summarized in Algorithm 1.
Algorithm 1 FlexCMH: Flexible Cross-Modal Hashing
Input: $M$ modality data matrices $\mathbf{X}^m$, $m \in \{1, 2, \cdots, M\}$; the matched-samples indicator matrix $\mathbf{P}^{mm'}$ (optional).
Output: Clustering centroid matrices $\mathbf{Z}^m$, indicator matrices $\mathbf{H}^m$, and the binary code matrix $\mathbf{B}$.
1: Initialize the centroid matrices $\mathbf{Z}^m$, the indicator matrices $\mathbf{H}^m$, the number of classes $k$, and the number of iterations $iter$; set $t = 1$.
2: while $t < iter$ or Eq. (6) has not converged do
3:   for $c = 1 \to k$ do
4:     Update $\mathbf{H}^m_c$ using Eq. (8);
5:   end for
6:   Update $\mathbf{Z}^m$ using Eq. (9);
7:   Update the permutation matrix $\Gamma^{mm'}$ using Eqs. (1)-(2);
8:   Update $\mathbf{B}$ using Eq. (11);
9:   $t = t + 1$.
10: end while
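The alternation of Algorithm 1 can be sketched as follows. For brevity, the sketch covers only the completely-paired special case, so the cluster-matching step of Eqs. (1)-(2) is skipped ($\Gamma$ is the identity); the k-means initialization, the learning rate, and the least-squares refit of $\mathbf{Z}$ (used here in place of the gradient step of Eq. (9)) are our choices rather than the authors' released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def flexcmh_paired_sketch(X, k, b, lam=1.0, iters=30, lr=1e-3):
    """Simplified Algorithm 1 for completely-paired data (illustrative sketch only).

    X : list of M matrices, X[m] of shape (d_m, N); columns are the aligned samples
    k : number of clusters/classes;  b : hash code length
    """
    M = len(X)
    fits = [KMeans(n_clusters=k, n_init=10).fit(x.T) for x in X]
    Z = [f.cluster_centers_.T for f in fits]               # (d_m, k) centroids
    H = [np.eye(k)[f.labels_].T + 1e-2 for f in fits]      # (k, N) soft indicators, H >= 0

    B = None
    for _ in range(iters):
        for m in range(M):
            # SGD step on H^m: matching term (Gamma = identity) + lam * SemiNMF term, cf. Eq. (8)
            grad = sum(2.0 * (H[m] - H[m2]) for m2 in range(M) if m2 != m)
            grad = grad + 2.0 * lam * (Z[m].T @ Z[m] @ H[m] - Z[m].T @ X[m])
            H[m] = np.maximum(H[m] - lr * grad, 0.0)
            # refit the centroids by least squares (our stand-in for the Eq. (9) gradient step)
            Z[m] = X[m] @ np.linalg.pinv(H[m])
        # hard assignments in the latent space (Eq. (5)) and binary codes (Eqs. (10)-(11))
        H_tilde = [np.eye(b)[KMeans(n_clusters=b, n_init=4).fit_predict(h.T)].T for h in H]
        B = np.sign(lam * sum(H_tilde))
        B[B == 0] = 1                                       # break ties toward +1
    return Z, H, B
```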
2.5 Complexity analysis
To facilitate the time complexity analysis, we consider a simple extreme case with $M$ modalities, $k$ classes, $n$ samples per modality, $t$ iterations, and the extreme pairing case. The time complexity of the proposed method is composed of three parts. First, the time cost of updating $\mathbf{H}^m_c$ in Eq. (8) is $O(kM(k^2d + k^2n + kdn + (k^2d)(M-1)/2))$. Second, the time cost of updating $\mathbf{Z}^m$ in Eq. (9) is $O(M(4dkn + nk^2))$. Third, the time cost of updating $\Gamma^{mm'}$ in Eq. (2) is $O(k^2n^2d^2M(M-1)/2)$. Since the third part dominates the other two in each iteration, the overall complexity of FlexCMH is $O(tk^2n^2d^2M(M-1)/2)$. An empirical study (configuration: Ubuntu 16.04.1, Intel(R) Xeon(R) CPU E5-2650, 256 GB RAM) on the three adopted multi-modal datasets shows that FlexCMH takes 8.532 seconds on Wiki, 43.244 seconds on Mirflickr, and 1768.196 seconds on Nus-wide.
3 EXPERIMENTS
3.1 Experimental setup
Datasets: Three widely used benchmark datasets (Nus-wide, Wiki, and Mirflickr) are used to evaluate the performance of FlexCMH. Each dataset includes two modalities, image and text, although FlexCMH can also be directly applied to cases with more than two data modalities. Nus-wide (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) contains 260,648 web image-text pairs. Each image is annotated with one or more labels taken from 81 concept labels. Each text is represented as a 1,000-dimensional bag-of-words vector, and the hand-crafted feature of each image is a 500-dimensional bag-of-visual-words (BOVW) vector. Wiki (https://www.wikidata.org/wiki/Wikidata) is generated from a group of 2,866 Wikipedia documents. Each document is an image-text pair, is annotated with one of 10 semantic labels, and its image is represented by a 128-dimensional SIFT feature vector. The text articles are represented as probability distributions over 10 topics, derived from a Latent Dirichlet Allocation (LDA) model. Mirflickr (http://press.liacs.nl/mirflickr/mirdownload.html) originally contained 25,000 instances collected from Flickr. Each instance consists of an image and its associated textual tags, and is manually annotated with one or more labels from a total of 24 semantic labels. Each text is represented as a 1,386-dimensional bag-of-words vector, and each image is represented by a 512-dimensional GIST feature vector.
Comparing methods: Six related and representative methods
are adopted for comparison. (i) CMSSH (Cross-modal Similarity
Sensitive Hashing) [2] treats hash code learning as a binary classification problem, and efficiently learns the hash functions using
a boosting method. (ii) SCM (Semantic Correlation Maximization)
[28] optimizes the hashing functions by maximizing the correlation
between two modalities with respect to semantic labels; it includes
two versions, SCM-orth and SCM-seq. SCM-orth learns hash functions by direct eigen-decomposition with orthogonal constraints
for balancing coding functions, and SCM-seq can more efficiently
learn hash functions in a sequential manner without the orthogonal
constraints. (iii) CMFH (Collective matrix factorization hashing)
[6] learns unified binary codes using collective matrix factorization
with a latent factor model on multi-modal data. (iv) SePH (Semantics Preserving Hashing) [13] is a probability-based hashing method,
which generates one unified hash code for all observed views by
considering the semantic consistency between views. (v) WMCA
(Weakly-paired Maximum Covariance Analysis) [11] adopts maximum covariance analysis to perform the joint learning of the latent matching and the subspace. (vi) MMPDL (Multi-modal Projection Dictionary Learning) [15] is a unified projective dictionary learning method, which jointly learns the projective dictionaries and the matching matrix for fusion classification. The source code of the baselines is provided by the authors, and the input parameter values are set according to the guidelines given by the authors in their respective papers. Since WMCA and MMPDL are not hashing methods, we obtain their hashing codes by replacing the classification step with the ordinary hashing function sign(·). For FlexCMH, we fix λ in Eq. (6) to 1, and set k = 10 on Wiki, k = 25 on Mirflickr, and k = 80 on Nus-wide; the number of nearest neighbors $n_s$ in Eq. (1) is fixed to 5, and $N$ in Eq. (4) is fixed to 50% of $\min\{N_m, N_{m'}\}$. Our preliminary study shows that FlexCMH is robust to the input values of $n_s$ and $N$. The number of iterations for optimizing Eq. (6) is set to 500, although we empirically found that FlexCMH generally converges in fewer iterations on all the datasets. The parameter sensitivity of λ and k is studied in Section 3.3. The datasets and the code of FlexCMH will be made publicly available.
3.2 Results in different practical settings
To study thoroughly the performance of FlexCMH and of the
comparing methods, we conduct three types of experiments: (1)
completely-paired, (2) weakly-paired, and (3) completely-unpaired.
In each type of experiment, all methods are run ten times, and we
report the average MAP (mean average precision) results. Since the
MAP standard deviations of all methods are quite small (less than
2%) on all datasets, to save space the standard deviations are not
reported. The best results are boldfaced.
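Mean average precision over Hamming-ranking retrieval can be computed as in the following sketch (a generic implementation under our assumptions, not the exact evaluation script used in the paper).

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """MAP for cross-modal retrieval: rank the database by Hamming distance to each
    query and average the precision at the rank of every relevant item.

    query_codes, db_codes   : (n_q, b), (n_db, b) binary codes in {-1, +1}
    query_labels, db_labels : (n_q, l), (n_db, l) multi-hot label matrices;
                              an item is relevant if it shares at least one label
    """
    b = query_codes.shape[1]
    aps = []
    for q, ql in zip(query_codes, query_labels):
        hamming = 0.5 * (b - db_codes @ q)           # Hamming distance from inner product
        order = np.argsort(hamming)
        relevant = (db_labels[order] @ ql) > 0        # shared-label relevance along the ranking
        if not relevant.any():
            continue
        ranks = np.where(relevant)[0] + 1.0
        precisions = np.arange(1, relevant.sum() + 1) / ranks
        aps.append(precisions.mean())
    return float(np.mean(aps))
```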
For the completely-paired experiments, the clustering-based matching process of FlexCMH is excluded, and each method uses 70% of the paired samples for training and the remaining 30% for validation. Table 1 reports the MAP results on the Mirflickr, Nus-wide, and Wiki datasets. In the table, 'Image vs. Text' denotes the setting where the query is an image and the database is text, and vice versa for 'Text vs. Image'.
For the weakly-paired experiments, we investigate three different settings: (2a) 50% of the image-text pairs of the training set (70% of the whole dataset) are kept, and the other pairs are randomly shuffled; in this setting, the first four comparing methods (CMSSH, SCM-seq, SCM-orth, and SePH) can only use the paired image-text instances for training and disregard the unpaired ones, whereas the last two (WMCA and MMPDL) and FlexCMH can use all the training data. (2b) As in (2a), but both the paired (50%) and the unpaired training samples are used to train all the comparing methods. (2c) As in (2a), all the images in the training set are used for training, but 10% of the text samples in the training set are randomly removed; as such, the number of images differs from the number of text samples across modalities and clusters. For setting (2c), none of the comparing methods can be applied, so we only report the MAP results of our FlexCMH and its variants, FlexCMH(nJ) and FlexCMH(nC). FlexCMH(nJ) first seeks the matched image-text pairs, and then executes the follow-up cross-modal hashing, without jointly optimizing the matched clusters (samples) and the hashing functions. FlexCMH(nC) uses the label information to obtain the correspondence between samples (as done by MMPDL), instead of our proposed clustering-based matching strategy. Table 2 reports the MAP values of the compared methods in these settings.
For the completely-unpaired experiments, besides randomly partitioning the data into training (70%) and testing (30%) sets, we randomly shuffle the indexes of the images and of the text samples in the training set. As a result, the images and the text samples are almost completely unpaired. For this type of experiment, only WMCA and MMPDL can be used for comparison. Table 3 reports the MAP values of the three methods.
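To illustrate how the weakly-paired and completely-unpaired settings can be simulated, a minimal sketch is given below; the construction and parameter names are our assumptions, not necessarily the authors' exact protocol.

```python
import numpy as np

def make_weakly_paired(n_train, paired_ratio=0.5, seed=0):
    """Return an index permutation for the text modality so that only a fraction of
    the image-text pairs keeps its original correspondence (setting (2a));
    paired_ratio=0.0 approximates the completely-unpaired setting."""
    rng = np.random.default_rng(seed)
    perm = np.arange(n_train)
    unpaired = rng.choice(n_train, size=int((1 - paired_ratio) * n_train), replace=False)
    perm[unpaired] = rng.permutation(unpaired)   # shuffle correspondence of the unpaired part
    return perm

# usage (hypothetical variable): X_text_train = X_text_train[make_weakly_paired(len(X_text_train), 0.5)]
```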
From Table 1, we can see that our FlexCMH achieves the best performance in most cases. This is because FlexCMH not only jointly models the cross-modal similarity preserving loss and the intra-modal similarity preserving loss, to build a more faithful semantic projection, but also models the quantitative loss to learn adaptive hashing codes. We observe that SePH obtains better results for 'Text vs. Image' retrieval on Wiki. This is possibly due to the adaptability of its probability-based strategy on small datasets. An unexpected observation is that the performance of CMSSH and SCM-orth decreases as the length of the hash codes increases. This might be caused by the imbalance between bits in the hash codes learned by singular value or eigenvalue decomposition. These experimental results show the effectiveness of FlexCMH for canonical cross-modal hashing, where training samples from different modalities are completely paired.
From Table 2, we can see that the MAP results are similar to those of Table 1, even though only 50% of the pairs of the training set are used for training by CMSSH, SCM, and SePH. This observation suggests that this pair fraction is sufficient to train the cross-modal hashing functions. In practice, we observed a significant reduction in the MAP values when less than 10% of the training data is paired. We also observe that the MAP results of CMSSH, SCM, and SePH drop sharply when all the paired and unpaired samples are used for the experiment. This is because CMSSH, SCM, and SePH are misled by 'incorrectly paired' (in fact, not-paired) samples. WMCA, MMPDL, and FlexCMH do not manifest such a sharp reduction in performance, because they adopt different techniques to augment matched samples, which boost the performance of cross-modal hashing. In addition, FlexCMH still achieves the best performance, thanks to (1) its novel clustering-based matching approach for exploring the matched clusters and the samples therein, and (2) a unified objective function that optimizes, in a coordinated manner, the matching between clusters and samples, and the cross-modal hashing functions based on the matched clusters and samples.
FlexCMH shows only slightly reduced results when the numbers of samples (images and texts) in different modalities are not the same and only 50% of the image-text pairs are matched. In 'Image vs. Text' retrieval, the MAP results of FlexCMH are generally lower than those in Table 1, because 10% of the text samples in the text modality are removed; as a result, the retrieval results may be incorrect when the corresponding images are used to query the removed texts. We also observe that the results of FlexCMH(nJ) are inferior to those of FlexCMH.
Table 1: Results (MAP) on three datasets with completely-paired data.

Image vs. Text
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
CMSSH       | 0.5616  0.5555  0.5513  0.5484  | 0.3414  0.3336  0.3282  0.3261  | 0.1694  0.1523  0.1447  0.1434
SCM-seq     | 0.5721  0.5607  0.5535  0.5482  | 0.3623  0.3646  0.3703  0.3721  | 0.1577  0.1434  0.1376  0.1358
SCM-orth    | 0.6041  0.6112  0.6176  0.6232  | 0.4651  0.4714  0.4822  0.4851  | 0.2341  0.2411  0.2443  0.2564
CMFH        | 0.6232  0.6256  0.6268  0.6293  | 0.4752  0.4793  0.4812  0.4866  | 0.2578  0.2591  0.2603  0.2612
SePH        | 0.6573  0.6603  0.6616  0.6637  | 0.4787  0.4869  0.4888  0.4932  | 0.2836  0.2859  0.2879  0.2863
WMCA        | 0.5834  0.5847  0.5856  0.5873  | 0.4396  0.4415  0.4433  0.4436  | 0.2243  0.2271  0.2283  0.2312
MMPDL       | 0.6126  0.6135  0.6141  0.6128  | 0.4635  0.4658  0.4661  0.4672  | 0.2731  0.2745  0.2768  0.2801
FlexCMH     | 0.6639  0.6674  0.6691  0.6724  | 0.4901  0.4935  0.4987  0.5012  | 0.2846  0.2889  0.2912  0.2935

Text vs. Image
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
CMSSH       | 0.5616  0.5551  0.5506  0.5475  | 0.3392  0.3321  0.3272  0.3256  | 0.1578  0.1384  0.1331  0.1256
SCM-seq     | 0.5694  0.5611  0.5544  0.5497  | 0.3412  0.3459  0.3472  0.3539  | 0.1521  0.1561  0.1371  0.1261
SCM-orth    | 0.6055  0.6154  0.6238  0.6299  | 0.4370  0.4428  0.4504  0.2235  | 0.2257  0.2459  0.2482  0.2518
CMFH        | 0.6205  0.6237  0.6259  0.6286  | 0.4349  0.4387  0.4412  0.4425  | 0.2872  0.2891  0.2907  0.2923
SePH        | 0.6481  0.6521  0.6545  0.6534  | 0.4489  0.4539  0.4587  0.4621  | 0.5345  0.5351  0.5471  0.5506
WMCA        | 0.5847  0.5861  0.5886  0.5903  | 0.4179  0.4192  0.4221  0.4235  | 0.2089  0.2104  0.2131  0.2156
MMPDL       | 0.6124  0.6142  0.6156  0.6172  | 0.4225  0.4232  0.4237  0.4256  | 0.2821  0.2824  0.2836  0.2861
FlexCMH     | 0.6601  0.6632  0.6648  0.6676  | 0.4639  0.4653  0.4688  0.4712  | 0.2812  0.2836  0.2857  0.2869
Table 2: Results (MAP) on three datasets with weakly-paired data.

Setting (2a): 50% of the image-text pairs are paired; CMSSH, SCM, and SePH use only the paired data for training.

Image vs. Text
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
CMSSH       | 0.5614  0.5551  0.5512  0.5482  | 0.3411  0.3337  0.3278  0.3256  | 0.1689  0.1520  0.1436  0.1432
SCM-seq     | 0.5720  0.5606  0.5532  0.5479  | 0.3620  0.3644  0.3704  0.3722  | 0.1574  0.1436  0.1374  0.1361
SCM-orth    | 0.6037  0.6111  0.6166  0.6235  | 0.4648  0.4716  0.4820  0.4853  | 0.2344  0.2410  0.2445  0.2567
CMFH        | 0.6225  0.6249  0.6261  0.6290  | 0.4748  0.4789  0.4805  0.4835  | 0.2576  0.2588  0.2596  0.2608
SePH        | 0.6571  0.6609  0.6618  0.6636  | 0.4785  0.4862  0.4881  0.4928  | 0.2825  0.2853  0.2881  0.2862
WMCA        | 0.5833  0.5848  0.5852  0.5836  | 0.4378  0.4416  0.4429  0.4437  | 0.2246  0.2278  0.2281  0.2316
MMPDL       | 0.6123  0.6131  0.6145  0.6121  | 0.4632  0.4654  0.4659  0.4677  | 0.2729  0.2743  0.2765  0.2793
FlexCMH     | 0.6635  0.6673  0.6689  0.6721  | 0.4903  0.4936  0.4982  0.5010  | 0.2844  0.2885  0.2907  0.2931

Text vs. Image
CMSSH       | 0.5612  0.5541  0.5502  0.5474  | 0.3396  0.3318  0.3269  0.3253  | 0.1573  0.1382  0.1326  0.1253
SCM-seq     | 0.5696  0.5612  0.5541  0.5486  | 0.3408  0.3455  0.3471  0.3536  | 0.1517  0.1556  0.1361  0.1264
SCM-orth    | 0.6053  0.6151  0.6235  0.6294  | 0.4364  0.4427  0.4501  0.2233  | 0.2256  0.2456  0.2487  0.2521
CMFH        | 0.6201  0.6233  0.6248  0.6291  | 0.4342  0.4382  0.4403  0.4426  | 0.2869  0.2883  0.2902  0.2915
SePH        | 0.6479  0.6516  0.6541  0.6533  | 0.4487  0.4536  0.4585  0.4618  | 0.5343  0.5346  0.5468  0.5494
WMCA        | 0.5843  0.5857  0.5848  0.5889  | 0.4171  0.4186  0.4215  0.4232  | 0.2086  0.2089  0.2128  0.2153
MMPDL       | 0.6121  0.6138  0.6154  0.6167  | 0.4221  0.4228  0.4231  0.4259  | 0.2813  0.2821  0.2832  0.2863
FlexCMH     | 0.6578  0.6623  0.6641  0.6672  | 0.4636  0.4657  0.4683  0.4708  | 0.2810  0.2831  0.2855  0.2864

Setting (2b): 50% of the image-text pairs are paired; all methods use all the training data.

Image vs. Text
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
CMSSH       | 0.5216  0.5238  0.5244  0.5249  | 0.2715  0.2731  0.2757  0.2766  | 0.1011  0.1023  0.1035  0.1031
SCM-seq     | 0.5398  0.5401  0.5406  0.5412  | 0.2953  0.2968  0.2991  0.3012  | 0.1107  0.1112  0.1125  0.1128
SCM-orth    | 0.5404  0.5413  0.5430  0.5442  | 0.3343  0.3358  0.3372  0.3395  | 0.1126  0.1138  0.1149  0.1168
CMFH        | 0.5405  0.5422  0.5438  0.5447  | 0.3409  0.3428  0.3442  0.3462  | 0.1157  0.1165  0.1179  0.1182
SePH        | 0.5411  0.5436  0.5467  0.5501  | 0.3561  0.3582  0.3610  0.3612  | 0.1235  0.1267  0.1284  0.1302
WMCA        | 0.5456  0.5463  0.5471  0.5489  | 0.3721  0.3746  0.3758  0.3761  | 0.1575  0.1593  0.1611  0.1635
MMPDL       | 0.5778  0.5792  0.5814  0.5846  | 0.4117  0.4136  0.4137  0.4136  | 0.2342  0.2361  0.2375  0.2341
FlexCMH     | 0.5867  0.5891  0.5925  0.5973  | 0.4273  0.4296  0.4315  0.4331  | 0.2629  0.2647  0.2655  0.2687

Text vs. Image
CMSSH       | 0.5121  0.5135  0.5142  0.5136  | 0.2563  0.2607  0.2622  0.2741  | 0.0989  0.1002  0.1011  0.1020
SCM-seq     | 0.5211  0.5226  0.5237  0.5242  | 0.2855  0.2879  0.2893  0.2921  | 0.1118  0.1124  0.1121  0.1128
SCM-orth    | 0.5235  0.5238  0.5241  0.5250  | 0.3211  0.3234  0.3269  0.3274  | 0.1206  0.1209  0.1214  0.1221
CMFH        | 0.5314  0.5335  0.5356  0.5372  | 0.3382  0.3397  0.3421  0.3442  | 0.1231  0.1255  0.1269  0.1293
SePH        | 0.5431  0.5441  0.5453  0.5459  | 0.3531  0.3554  0.3560  0.3579  | 0.1238  0.1242  0.1247  0.1264
WMCA        | 0.5456  0.5461  0.5458  0.5472  | 0.3612  0.3648  0.3679  0.3712  | 0.1437  0.1445  0.1458  0.1473
MMPDL       | 0.5631  0.5647  0.5648  0.5655  | 0.3872  0.3891  0.3911  0.3924  | 0.2132  0.2141  0.2155  0.2135
FlexCMH     | 0.5801  0.5825  0.5836  0.5859  | 0.4031  0.4056  0.4079  0.4112  | 0.2538  0.2541  0.2557  0.2563

Setting (2c): 50% of the image-text pairs are paired; the numbers of image samples and text samples are different.

Image vs. Text
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
FlexCMH(nJ) | 0.6121  0.6135  0.6167  0.6185  | 0.3878  0.3892  0.3905  0.3936  | 0.2231  0.2245  0.2256  0.2273
FlexCMH(nC) | 0.5859  0.5886  0.5904  0.5927  | 0.3618  0.3643  0.3666  0.3647  | 0.2015  0.2057  0.2076  0.2088
FlexCMH     | 0.6435  0.6441  0.6453  0.6468  | 0.4233  0.4251  0.4267  0.4283  | 0.2577  0.2593  0.2612  0.2635

Text vs. Image
FlexCMH(nJ) | 0.6224  0.6237  0.6244  0.6256  | 0.4115  0.4123  0.4143  0.4150  | 0.2435  0.2456  0.2471  0.2493
FlexCMH(nC) | 0.5983  0.6004  0.6031  0.6042  | 0.3832  0.3845  0.3873  0.3907  | 0.2256  0.2274  0.2293  0.2308
FlexCMH     | 0.6589  0.6624  0.6643  0.6658  | 0.4627  0.4635  0.4654  0.4688  | 0.2803  0.2815  0.2834  0.2842
Table 3: Results (MAP) on three datasets with completely-unpaired data.

Image vs. Text
Method      | Mirflickr (16/32/64/128 bits)   | Nus-wide (16/32/64/128 bits)    | Wiki (16/32/64/128 bits)
WMCA        | 0.5214  0.5231  0.5245  0.5263  | 0.3559  0.3574  0.3591  0.3604  | 0.1276  0.1295  0.1310  0.1336
MMPDL       | 0.5535  0.5542  0.5567  0.5588  | 0.3963  0.3984  0.4004  0.4015  | 0.2210  0.2231  0.2254  0.2268
FlexCMH     | 0.5693  0.5704  0.5723  0.5749  | 0.4115  0.4135  0.4159  0.4173  | 0.2511  0.2534  0.2548  0.2563

Text vs. Image
WMCA        | 0.5256  0.5263  0.5278  0.5293  | 0.3414  0.3438  0.3467  0.3481  | 0.1335  0.1344  0.1358  0.1381
MMPDL       | 0.5489  0.5503  0.5531  0.5547  | 0.3635  0.3678  0.3691  0.3713  | 0.2015  0.2038  0.2074  0.2098
FlexCMH     | 0.5631  0.5652  0.5681  0.5694  | 0.4031  0.4058  0.4083  0.4112  | 0.2437  0.2459  0.2483  0.2501
Figure 2: MAP vs. λ on the Mirflickr and Wiki datasets.
Figure 3: MAP vs. k on different datasets.
The inferior results of FlexCMH(nJ) show that jointly optimizing the hashing functions and the matched clusters and samples enables a mutual boosting of the two objectives. Furthermore, FlexCMH(nC) is also outperformed by FlexCMH, which shows that our proposed clustering-based matching strategy can more reliably find the matching between samples across modalities.
In Table 3, the MAP results of all methods are inferior to those in Tables 1 and 2. Still, FlexCMH achieves the best results, which confirms its effectiveness on completely-unpaired data. From these results, we can state that the matching information of samples across modalities is crucial for cross-modal hashing. Our clustering-based matching strategy can reliably explore paired samples, and it boosts the performance of cross-modal hashing on weakly-paired (or completely unpaired) samples.
In addition, we present some example cross-modal retrieval results (images and texts) in the supplementary file to visually support the advantages of FlexCMH.
In summary, our experimental results show that FlexCMH learns cross-modal hashing functions more effectively than the representative comparing methods. FlexCMH is flexible in a variety of practical settings, where the paired samples across modalities are only partially available or even completely unknown, and the numbers of samples in different modalities (and matched clusters) differ. To the best of our knowledge, no existing cross-modal hashing method can work in all of these scenarios.
3.3 Parameter sensitivity analysis
We further explore the sensitivity of the scalar parameter λ in Eq. (6), and report the results in Fig. 2, where the code length is fixed to 16 bits. We can see that FlexCMH is only slightly sensitive to λ for λ ∈ [0.001, 100], and achieves the best performance when λ = 1. Over-weighting or under-weighting the quantitative loss has a negative, but not significant, impact on the performance. In summary, an effective λ can be easily selected for FlexCMH.
In addition, we investigate the sensitivity of the number of clusters k, and report the results in Fig. 3, with the code length fixed to 16 bits. We can see that FlexCMH is sensitive to k and achieves the best results with k = 10 on Wiki, k = 25 on Mirflickr, and k = 80 on Nus-wide. These preferred values of k are close to the number of distinct labels of the corresponding datasets. Given this, we suggest fixing k around the number of labels l.

3.4 Results on three modalities
In this section, we evaluate the effectiveness of FlexCMH on the Wiki dataset with three modalities, fixing the code length to 16. To the best of our knowledge, there are currently no publicly available datasets with three or more modalities. To simulate a three-modality setting, we divide the 128-dimensional image modality into two sub-modalities: the 64-dimensional Image1 and the 64-dimensional Image2 modalities. Since the comparing methods cannot directly handle more than two modalities, we adapt them by learning hash functions between each pair of modalities, and then merge the retrieved results from the respective pairs. For example, if Image1 serves as the query modality, then the comparing methods separately optimize two cross-modal hashing mappings (i.e., Image1 → Text and Image1 → Image2). The experimental setting is the same as in (2b).
Table 4 shows the MAP values on the Wiki dataset in the three-modality case. FlexCMH again outperforms the compared methods, providing evidence of the broad applicability of our proposed approach.

Table 4: Results on three modalities on Wiki.

Methods     | Text    | Image1  | Image2
CMSSH       | 0.1015  | 0.0896  | 0.0937
SCM-seq     | 0.1114  | 0.0975  | 0.1034
SCM-orth    | 0.1215  | 0.1025  | 0.1151
SePH        | 0.1241  | 0.1056  | 0.1104
WMCA        | 0.1428  | 0.1156  | 0.1241
MMPDL       | 0.2023  | 0.1676  | 0.1896
FlexCMH     | 0.2331  | 0.1876  | 0.2031

4 CONCLUSIONS
In this paper, we proposed a flexible cross-modal hashing (FlexCMH) solution to learn effective hashing functions from weakly-paired (or completely-unpaired) data across modalities. FlexCMH introduces a clustering-based matching strategy to explore the potential correspondence between clusters and their member samples. In addition, it jointly optimizes the potential correspondence, the cross-modal hashing functions derived from the correspondence, and the hashing quantitative loss in a unified objective function, to coordinately learn compact hashing codes. Extensive experiments demonstrate that FlexCMH outperforms state-of-the-art hashing methods on completely-paired, weakly-paired, and completely-unpaired multi-modality data. In the future, we will incorporate deep feature learning into cross-modal hashing on weakly-paired data. The code and data (those that are not available yet) will be made publicly available.

REFERENCES
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[2] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through
cross-modality metric learning using similarity-sensitive hashing. In CVPR, pages
3594–3601, 2010.
[3] G. Chen, Y. Song, F. Wang, and C. Zhang. Semi-supervised multi-label learning
by solving a sylvester equation. In ICDM, pages 410–419, 2008.
[4] L. Chen, D. Xu, W. H. Tsang, and X. Li. Spectral embedded hashing for scalable
image retrieval. IEEE Transactions on Cybernetics, 44(7):1180–1190, 2014.
[5] C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. TPAMI, 32(1):45–55, 2010.
[6] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, pages 2083–2090, 2014.
[7] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for
modeling internet images, tags, and their semantics. IJCV, 106(2):210–233, 2014.
[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a
procrustean approach to learning binary codes for large-scale image retrieval.
TPAMI, 35(12):2916–2929, 2013.
[9] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan. Learning consistent feature representation for cross-modal multimedia retrieval. TMM, 17(3):370–381, 2015.
[10] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search.
In IJCAI, pages 1360–1365, 2011.
[11] C. H. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for
multimodal dimensionality reduction and transfer learning. In ECCV, 2010.
[12] L. Li, Y. Mengyang, and S. Ling. Multiview alignment hashing for efficient image
search. IEEE Transactions on Image Processing, 24(3):956–966, 2015.
[13] Z. Lin, G. Ding, J. Han, and J. Wang. Cross-view retrieval via probability-based
semantics-preserving hashing. IEEE Transactions on Cybernetics, 47(12):4342–
4355, 2017.
[14] H. Liu, W. Feng, X. Zhang, and F. Sun. Weakly-paired deep dictionary learning
for cross-modal retrieval. Pattern Recognition Letters, 2018.
[15] H. Liu, Y. Wu, F. Sun, B. Fang, and G. Di. Weakly paired multimodal fusion for
object recognition. IEEE Transactions on Automation Science and Engineering,
PP(99):1–12, 2017.
[16] X. Liu, Y. Mu, D. Zhang, B. Lang, and X. Li. Large-scale unsupervised hashing
with shared structure learning. IEEE Transactions on Cybernetics, 45(9):1811–1822,
2017.
[17] D. Mandal and S. Biswas. Generalized coupled dictionary learning approach with
applications to cross-modal matching. TIP, 25(8):3826–3837, 2016.
[18] L. Meng and A. Striegel. Local versus global biological network alignment.
Bioinformatics, 32(20):btw348, 2015.
[19] F. Shen, C. Shen, L. Wei, and H. T. Shen. Supervised discrete hashing. In CVPR,
2015.
[20] Z. Si and H. Tong. FINAL: Fast attributed network alignment. In ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2016.
[21] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for
large-scale retrieval from heterogeneous data sources. In SIGMOD, pages 785–796,
2013.
[22] C. Wang, S. Yan, L. Zhang, and H.-J. Zhang. Multi-label sparse coding for automatic image annotation. In CVPR, pages 1643–1650, 2009.
[23] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for scalable image
retrieval. In CVPR, pages 3424–3431, 2010.
[24] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big
data – a survey. Proc. of the IEEE, 104(1):34–57, 2016.
[25] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey.
Computer Science, 2014.
[26] J. Wang, T. Zhang, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI,
40(4):769–790, 2018.
[27] L. Xiao, M. Song, D. Tao, X. Zhou, C. Chen, and J. Bu. Semi-supervised coupled
dictionary learning for person re-identification. In CVPR, 2014.
[28] D. Zhang and W. J. Li. Large-scale supervised multimodal hashing with semantic
correlation maximization. In AAAI, pages 2177–2183, 2014.
[29] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao. Linear cross-modal hashing for
efficient multimedia search. In ACM MM, pages 143–152, 2013.
[30] L. Zong, X. Zhang, and X. Liu. Multi-view clustering on unmapped data via
constrained non-negative matrix factorization. Neural Networks, 108:155–171,
2018.