
Weakly-paired Cross-Modal Hashing

2019, arXiv (Cornell University)

Hashing has been widely adopted for large-scale data retrieval in many domains, due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, these methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly-paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the local structure of each cluster, and thus to find the potential correspondence between clusters (and the samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes, in a unified objective function, the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantitative loss. An alternating optimization technique is also proposed to coordinate the correspondence and the hash functions, and to reinforce the reciprocal effects of the two objectives. Experiments on public multi-modal datasets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it indeed offers a high degree of flexibility for practical cross-modal hashing tasks.

Flexible Cross-Modal Hashing
Xuanwu Liu, Jun Wang (Southwest University, China); Guoxian Yu (Southwest University, China, and KAUST, SA); Carlotta Domeniconi (George Mason University, USA); Xiangliang Zhang (KAUST, SA)
arXiv:1905.12203v1 [cs.LG] 29 May 2019

KEYWORDS: cross-modal hashing, weakly-paired, flexibility, optimization

1 INTRODUCTION

Hashing has attracted an increasing interest from both research and industry, due to its low storage cost and high retrieval speed on big data [4, 8, 24, 26]. Hashing aims at compressing high-dimensional vectorial data into short binary codes that preserve the structure of the data, so as to facilitate efficient retrieval with significantly reduced storage. Based on the index constructed from the hashing codes, big-data retrieval can be performed in constant or sub-linear time [12, 16, 19, 24–26, 29]. With the wide application of the Internet of Things, the rapid influx of multi-modal data calls for efficient cross-modal hashing solutions. For example, given an image/video about a historic event, one may want to cross-modally retrieve texts describing the event in detail. How to perform cross-modal hashing on such widely-witnessed multi-modal data has therefore become a topic of interest in hashing [10, 24, 26, 28]. Depending on whether the labels of training samples are used, existing cross-modal hashing solutions can be roughly divided into unsupervised and supervised ones. Unsupervised methods seek hash coding functions by taking into account the underlying data structure, distributions, or topological information [2, 21]. Supervised (and semi-supervised) approaches try to leverage supervised information (i.e., semantic labels) to improve performance [3, 7, 13, 23, 28]. Existing cross-modal hashing methods optimistically assume that the correspondence between samples of different modalities is known [9].
However, in real applications, some objects are only available in one modality, or their corresponding (paired) objects in another modality are only partially (or even totally) unknown. This can happen, for example, when one wants to search images from text: there may be 100 images and 200 documents, and only the correspondence between 50 images and 80 documents is known. In other words, the image-text collection is weakly-paired, and only the semantic labels are shared across modalities. To the best of our knowledge, how to flexibly learn hashing codes from weakly-paired data is still an untouched and challenging topic in cross-modal hashing.

Some attempts have been made to tackle weakly-paired multi-view data [11, 15, 30]. To name a few, Weakly-paired Maximum Covariance Analysis (WMCA) extends maximum covariance analysis to the weakly-paired case by jointly learning the latent pairs and a subspace for dimensionality reduction and transfer learning [11]. Multi-modal Projection Dictionary Learning (MMPDL) jointly learns the projective dictionary and a pairing matrix for fusion classification [15]. Zong et al. [30] assume that the cluster indicator vectors of two samples from two different views should be similar if they belong to the same cluster and dissimilar otherwise, and then tackle multi-view clustering on unpaired data via nonnegative matrix factorization. Mandal et al. [17] learn coupled dictionaries from the respective data views, along with sparse representation coefficients with respect to each view's own dictionary; they then maximize the correlation between sample coefficients of the same class, and simultaneously minimize the correlation of different classes, to seek the matching between samples and to fuse weakly-paired multi-view data. However, these approaches still handle the weakly-paired problem in a non-flexible setting. For example, WMCA requires the number of samples in different modalities to be the same, and MMPDL needs the same number of samples for each class across modalities. These requirements are violated in many cases, where samples across different modalities are partially-paired and the numbers of member samples of matched clusters (or classes) across modalities differ. In this paper, we propose a Flexible Cross-Modal Hashing (FlexCMH) solution (as illustrated in Fig. 1) to handle partially-paired (and even completely unpaired) multi-modal data.
Our main contributions are summarized as follows:
(1) We design a novel matching strategy that uses the centroids of clusters, the neighborhood structure around the centroids, and any (incomplete) known correspondence between samples to seek a matching between samples in different modalities. The matching strategy requires neither the same number of samples within the matched clusters, nor the same number of samples across different modalities. Therefore, FlexCMH can be applied with flexibility in general cross-modal hashing settings.
(2) We propose a unified objective function that simultaneously considers the cross-modal matching loss, the intra-modal representation loss, and the quantitative loss to learn adaptive hashing codes. We also introduce an alternating optimization technique to jointly optimize the correspondence and the hash functions, and to reinforce the reciprocal effects of these two objectives.
(3) Experiments on benchmark multi-modal datasets show that FlexCMH significantly outperforms related and representative cross-modal hashing approaches [2, 11, 13, 15, 28] in weakly-paired cases, and it maintains a competitive performance in different open settings.

The rest of this paper is organized as follows. Section 2 introduces the objective function of FlexCMH and its optimization. Section 3 presents the experimental setup, results, and analysis. Section 4 draws conclusions and provides directions for future work.

2 PROPOSED METHOD

Suppose we have M modalities, and the number of training samples for the m-th modality is N_m. X^m ∈ R^{N_m×d_m} represents the data matrix of the m-th modality, where both N_m and d_m are modality-dependent. Y ∈ R^{N_m×l} stores the label information of the N_m samples, where l is the number of labels; Y_{ik} ∈ {0,1}, and Y_{ik} = 1 indicates that {x_i^m}_{m=1}^M is annotated with the k-th label, Y_{ik} = 0 otherwise. For example, in a two-modality Wiki image-search application, x_i^1 is the image feature vector of sample i, and x_i^2 is the tag vector of this sample. To enable cross-modal hashing, we need to learn two hashing functions, F^1: R^{d_1} → {0,1}^b and F^2: R^{d_2} → {0,1}^b, where b is the length of the binary hash codes. These two hashing functions are expected to map x_i^1 and x_i^2 from their respective modalities onto a common Hamming space, while preserving the proximity of the original data.

This canonical cross-modal hashing assumes that training samples in different modalities have a complete correspondence. However, the samples may be only weakly-paired. For example, consider the scenario in which, due to a temporary sensor failure, x_i^1 and x_i^2 do not describe the same object from different feature views; instead, x_i^1 and x_j^2 (i ≠ j) depict the same object. An intuitive solution is to use only the paired samples. However, the structure information jointly reflected by paired and unpaired samples may then be distorted, and the performance may be heavily compromised. Moreover, if the pair information between the two modalities is totally unknown, the canonical solutions cannot be applied.

To achieve effective cross-modal hashing on such weakly-paired (or totally unpaired) multi-modal data, we introduce a flexible solution (FlexCMH) and provide its overall workflow in Fig. 1.
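As a concrete illustration of the notation above, the following minimal sketch (our own, not from the paper) builds a toy weakly-paired two-modality dataset; the array sizes, label density, and the partial pairing list are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two modalities with different sample counts and dimensions (N1 != N2, d1 != d2).
N1, d1 = 100, 128    # e.g., image modality with SIFT-like features
N2, d2 = 200, 10     # e.g., text modality with topic proportions
X1 = rng.standard_normal((N1, d1))   # X^1 in R^{N1 x d1}
X2 = rng.standard_normal((N2, d2))   # X^2 in R^{N2 x d2}

# Multi-label matrix Y in {0,1}^{N x l}; the label vocabulary is shared across modalities.
l = 10
Y1 = (rng.random((N1, l)) < 0.2).astype(int)

# Weak pairing: only a few cross-modal correspondences are known.
# Each entry (i, j) says sample i of modality 1 depicts the same object as sample j of modality 2.
known_pairs = [(0, 3), (5, 17), (42, 99)]   # hypothetical partial correspondence

# Goal: learn F1: R^{d1} -> {0,1}^b and F2: R^{d2} -> {0,1}^b, i.e., b-bit codes in a common Hamming space.
b = 16
```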
FlexCMH first introduces a clustering-based matching strategy that leverages the cluster centroids and the local structure around the centroids to explore the potential correspondence between clusters (and the samples within them) across different modalities. Next, it defines a permutation matrix based on the explored correspondence to unify the indexes of the same samples across modalities. Based on the unified indexes, it introduces a unified objective function to simultaneously account for the cross-modal similarity preserving loss, the intra-modal representation loss, and the quantitative hashing loss. An alternating optimization technique is also proposed to jointly optimize the correspondence and the hash functions, and to reinforce the reciprocal effects of these two objectives. The following subsections elaborate on this process.

Figure 1: Workflow of the proposed FlexCMH (Flexible Cross-Modal Hashing). FlexCMH includes two parts: (1) a clustering-based matching strategy to explore the matched clusters and the samples therein across modalities; (2) a unified objective function to jointly account for the inter-modal representation loss, the intra-modal representation loss, and the quantitative loss to learn adaptive hashing functions. The intra-modal representation loss aims at exploring the clusters and centroids of the respective modalities. The inter-modal representation loss aims at preserving the proximity between samples of different modalities using matched samples. The quantitative loss quantifies the hashing loss incurred when mapping the high-dimensional vectors to binary codes.

2.1 Clustering-based cross-modal matching strategy

Unlike single-modal hashing, the correspondence between samples is crucial for multi-modal data fusion and retrieval. For completely matched samples, the correspondence is fully known and can be used, along with the inter- (intra-) modality similarity between samples, to learn cross-modal hashing functions. For weakly-paired data, however, the correspondence is only partially known, so it is non-trivial to quantify the similarity between samples from different modalities. A remedy is to divide the samples into different groups based on their labels and impose some constraints (i.e., concerning the similarity between different classes) on the coding vectors [14, 27]. In the representation space, within-class data should cluster together even though they come from different modalities, and between-class data should be placed far apart. In other words, all data vectors of the same class (different classes) from different modalities should be similar (dissimilar) [22]. We can approximate the similarity between different classes using the centroids of the respective groups [15]. However, considering only the centroids may not be sufficient; the neighborhood objects around a centroid may also be helpful. Furthermore, incomplete labels of the training data restrict the quality of the groups. Given these observations, we introduce a novel clustering-based matching strategy that leverages the centroids of clusters and the local structure around the centroids. This strategy can explore the correspondence between clusters (and the samples therein) across different modalities. We illustrate the clustering-based matching strategy in the center of Fig. 1, where the stars represent centroids of clusters in different modalities, and the red points indicate the objects with known correspondence in another modality.
The likelihood that two clusters match increases with the similarity of their centroids and with the similarity of the local structure around the centroids. To capture this, we define a quantitative match function as follows:

$$s_{cc'}^{mm'} = \sum_{g=1}^{n_s} \left( \|\mathbf{x}_{cg}^{m} - \mathbf{z}_{c}^{m}\|_F^2 - \alpha \, \|\mathbf{x}_{c'g}^{m'} - \mathbf{z}_{c'}^{m'}\|_F^2 \right)^2 \qquad (1)$$

where z_c^m and z_{c'}^{m'} are the centroids of the c-th cluster in the m-th modality and of the c'-th cluster in the m'-th modality, n_s is the user-specified number of nearest samples of the centroids, x_{cg}^m is the g-th nearest sample of z_c^m, and α = ||z_c^m||_F^2 / ||z_{c'}^{m'}||_F^2 is a scalar coefficient that balances the scale difference between the two modalities. To seek the correspondence between clusters of different modalities, Eq. (1) accounts not only for the centroids, but also for the neighborhood samples around the centroids. As such, it can explore the correspondence between the neighborhood samples of the respective centroids to facilitate the follow-up cross-modal hashing. In contrast, existing solutions only match centroids using labeled samples and ignore these important local patterns [11, 15]. Our match function requires neither that two matched clusters have the same number of samples, nor that different modalities have the same number of samples. It can also be applied to multi-modality data whose label information and correspondence are completely unknown. These advantages contribute to the flexibility of FlexCMH.

Two clusters c and c', with respective centroids z_c^m and z_{c'}^{m'}, are matched if s_{cc'}^{mm'} is the smallest among all pairwise clusters from the two modalities. We can align the objects in the respective modalities by reordering their indexes, and then use the 'matched' (aligned) objects in different modalities for cross-modal hashing. To this end, we define a permutation matrix Γ^{mm'} ∈ R^{N_m×N_{m'}} to align samples as follows:

$$\Gamma_{ij}^{mm'} = \begin{cases} 1, & s_{cc'}^{mm'} \text{ is the smallest, or } P_{ij}^{mm'} = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where P_{ij}^{mm'} = 1 means that the i-th sample in the m-th modality is paired with the j-th sample in the m'-th modality. In this way, our clustering-based matching strategy also incorporates the known matched samples from different modalities. Γ_{ij}^{mm'} = 1 if x_i^m belongs to the c-th cluster, x_j^{m'} belongs to the c'-th cluster, and s_{cc'}^{mm'} is the smallest among all pairwise clusters from the two modalities. These conditions indicate that the indexes of x_i^m and x_j^{m'} should be reordered for alignment. We observe that our matching strategy is different from typical network alignment, which aims at finding identical sub-networks [18, 20]. In contrast, we aim at matching samples within the explored clusters, which describe the same object from different feature views. In addition, a sample in one modality can be paired with more than one sample in another modality. The follow-up cross-modal hashing functions can then be learned using the found correspondence.
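The sketch below is a minimal numpy rendering of the matching step in Eqs. (1)-(2); all function and variable names are ours, the centroids are assumed to be given as row vectors, and the per-sample permutation Γ is only sketched at the cluster level.

```python
import numpy as np

def match_score(X1, X2, z1, z2, ns=5):
    """Matching score s_{cc'}^{mm'} of Eq. (1): compare the ns samples closest to
    each centroid, with alpha compensating the scale gap between modalities."""
    d1 = np.linalg.norm(X1 - z1, axis=1)
    d2 = np.linalg.norm(X2 - z2, axis=1)
    near1 = X1[np.argsort(d1)[:ns]]            # ns samples closest to centroid z1
    near2 = X2[np.argsort(d2)[:ns]]            # ns samples closest to centroid z2
    alpha = np.sum(z1 ** 2) / np.sum(z2 ** 2)  # scale-balancing coefficient
    diff = np.sum((near1 - z1) ** 2, axis=1) - alpha * np.sum((near2 - z2) ** 2, axis=1)
    return float(np.sum(diff ** 2))

def match_clusters(X1, X2, Z1, Z2, ns=5):
    """Pick, for every cluster c of modality 1, the cluster c' of modality 2 with
    the smallest matching score (the cluster-level part of Eq. (2)).
    Z1: k1 x d1 centroids (rows); Z2: k2 x d2 centroids (rows)."""
    k1, k2 = Z1.shape[0], Z2.shape[0]
    matches = {}
    for c in range(k1):
        scores = [match_score(X1, X2, Z1[c], Z2[c2], ns) for c2 in range(k2)]
        matches[c] = int(np.argmin(scores))    # cluster c' with the smallest s_{cc'}
    return matches
```

A sample-level permutation matrix Γ can then be filled in by pairing samples inside matched clusters and by adding any correspondences that are already known through P, as Eq. (2) prescribes.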
2.2 Cross-modal hashing

To compute the matching loss, we first have to identify the centroids of the respective clusters. WMCA [11] and MMPDL [15] both aim at addressing cross-modal learning with weakly-paired samples, but they obtain clusters using only labeled samples. In practice, the labels of samples may be insufficient, or even unavailable; as such, these methods have restricted flexibility. To find the centroids, we adopt Semi-Nonnegative Matrix Factorization (SemiNMF) [5] as follows:

$$L_s = \sum_{m=1}^{M} \|\mathbf{X}^m - \mathbf{Z}^m \mathbf{H}^m\|_F^2, \quad \text{s.t. } \mathbf{H}^m \geq 0 \qquad (3)$$

where Z^m ∈ R^{d_m×k} can be viewed as the latent representation of the k cluster centroids of the m-th modality, and H^m ∈ R^{k×N_m} is the soft cluster assignment of the samples in the latent space (here X^m is used column-wise, i.e., each column is a sample). The above equation calculates the intra-modality representation loss and the clustering loss simultaneously. Therefore, Z^m can be used for the clustering-based matching. H^m is the indicator matrix, which represents the probability that the N_m samples belong to the different classes, and can be used for learning the hashing codes.

To achieve sample-to-sample cross-modal retrieval, based on the matched clusters and samples from Eq. (2), we further minimize the difference between the matched pairs to encourage them to be as similar as possible. Specifically, the indicator vectors (H^m) of two samples from two different modalities should be similar if they have the same cluster label, and dissimilar otherwise. To this end, we quantify the relationship between two different modalities by minimizing the deviation of the indicator vectors of pairwise objects from different modalities as follows:

$$L_c = \sum_{c=1}^{k} \sum_{m=1}^{M} \sum_{m' \neq m} \|\mathbf{H}_c^m - \mathbf{H}_c^{m'} \Gamma_c^{mm'}\|_F^2 \qquad (4)$$

where H_c^m ∈ R^{N_m} reorders the samples in X^m in descending order based on their association probabilities with respect to the c-th class. Γ_c^{mm'} ∈ R^{N×N} is the permutation matrix, obtained via Eq. (2), which shuffles the sample indexes in H_c^{m'} to align the samples with the corresponding indexes in H_c^m. As such, the samples of H_c^m can be matched with those of H_c^{m'}. In practice, we choose the top N samples that belong to class c (c') to set up H_c^m and H_c^{m'}, and thus achieve cross-modal matching. As a result, our matching strategy can accommodate the case in which the numbers of samples belonging to the same class in different modalities are different. In this way, we can achieve cross-modal retrieval on multi-modal data whose matched samples are partially or completely unknown, even with different numbers of samples in the matched clusters.

H^m can be viewed as the soft cluster assignment of the samples in the m-th modality with respect to k clusters in a latent space. The assignments are also coordinated by the assignments in the other data modalities (see Eq. (4)). For cross-modal hashing, we transform the soft assignments into hard assignments H̃^m ∈ {0,1}^{b×N} using k-means clustering, and then we seek the binary hash coding matrix B ∈ {0,1}^{b×N} as follows:

$$L_q = \sum_{m=1}^{M} \|\mathbf{B} - \tilde{\mathbf{H}}^m\|_F^2 \qquad (5)$$

B can be viewed as the common Hamming space across all data modalities. It can be used for cross-modal retrieval, along with the H^m of the respective modalities. Eq. (5) is also called the hashing quantitative loss.

2.3 Unified objective function

Based on the above analysis, we can assemble the three losses into a unified objective function, formulated as:

$$\min_{\mathbf{Z}^m, \mathbf{H}^m, \mathbf{B}} \; \sum_{c=1}^{k} \sum_{m=1}^{M} \sum_{m' \neq m} \|\mathbf{H}_c^m - \mathbf{H}_c^{m'} \Gamma_c^{mm'}\|_F^2 + \sum_{m=1}^{M} \|\mathbf{X}^m - \mathbf{Z}^m \mathbf{H}^m\|_F^2 + \lambda \sum_{m=1}^{M} \|\mathbf{B} - \tilde{\mathbf{H}}^m\|_F^2 \qquad (6)$$

where the first term quantifies the cross-modal matching loss and the inter-modal representation loss, the second term measures the intra-modal representation loss, and the third term measures the hashing code quantitative loss. λ is a scalar parameter that balances the cross-modal hashing loss and the quantitative loss.
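For concreteness, the sketch below evaluates the three terms of Eq. (6) for fixed Z^m, H^m, Γ and B; it follows the column-wise convention X^m ∈ R^{d_m×N_m}, H^m ∈ R^{k×N_m} used in Eq. (3), folds the per-class reordering of H_c^m into a single permutation per modality pair for readability, and uses names of our own choosing.

```python
import numpy as np

def flexcmh_objective(X, Z, H, H_hard, B, Gamma, lam=1.0):
    """Unified objective of Eq. (6) for a list of modalities.
    X[m]: d_m x N_m data, Z[m]: d_m x k centroids, H[m]: k x N_m soft assignments,
    H_hard[m]: b x N hard (binarized) assignments, B: b x N binary codes,
    Gamma[(m, m2)]: N_{m2} x N_m permutation aligning modality m2 to modality m."""
    M = len(X)
    # intra-modal representation loss, Eq. (3)
    L_s = sum(np.linalg.norm(X[m] - Z[m] @ H[m], 'fro') ** 2 for m in range(M))
    # cross-modal matching loss, Eq. (4): compare aligned soft assignments
    L_c = 0.0
    for m in range(M):
        for m2 in range(M):
            if m2 == m:
                continue
            L_c += np.linalg.norm(H[m] - H[m2] @ Gamma[(m, m2)], 'fro') ** 2
    # hashing quantitative (quantization) loss, Eq. (5)
    L_q = sum(np.linalg.norm(B - H_hard[m], 'fro') ** 2 for m in range(M))
    return L_c + L_s + lam * L_q
```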
By simultaneously optimizing the above three losses, we jointly account for the correspondence and the hash functions, and thus reinforce the reciprocal effects of these two objectives. This joint optimization avoids the misleading impact that initially poorly-matched clusters and samples would otherwise have on the subsequent cross-modal hashing. Our experimental results confirm this advantage.

2.4 Optimization

We observe that the loss function in Eq. (6) is a sum of the cross-modal matching and retrieval loss, the intra-modal representation loss, and the hashing quantitative loss. Once Z^m is fixed, we can directly obtain Γ_c^{mm'} using Eq. (2). We can solve Eq. (6) via the Alternating Direction Method of Multipliers (ADMM) [1], which alternately optimizes one of Z^m, H^m, and B while keeping the other two fixed.

Optimize H^m with Z^m and B fixed: We utilize stochastic gradient descent (SGD) to learn H^m using the back-propagation (BP) algorithm. Here, Eq. (6) is transformed into k independent optimization problems, where the c-th sub-problem minimizes:

$$\min \; \sum_{m=1}^{M} \sum_{m' \neq m} \|\mathbf{H}_c^m - \mathbf{H}_c^{m'} \Gamma_c^{mm'}\|_F^2 + \lambda \sum_{m=1}^{M} \|\mathbf{X}_c^m - \mathbf{Z}^m \mathbf{H}_c^m\|_F^2 \qquad (7)$$

where X_c^m has the same size and sample order as H_c^m. For any class, the derivative of Eq. (7) with respect to the indicator matrix H_c^m of the m-th modality is:

$$\frac{\partial L}{\partial \mathbf{H}_c^m} = 2\mathbf{Z}^{mT}\mathbf{Z}^m \mathbf{H}_c^m - \mathbf{Z}^{mT}\mathbf{X}_c^m + \lambda \sum_{m' \neq m} 2\left(\mathbf{H}_c^m - \mathbf{H}_c^{m'} \Gamma_c^{mm'}\right) \qquad (8)$$

We can then take ∂L/∂H_c^m to update the indicator matrix H_c^m using SGD. Similarly, we can update H_c^{m'} based on the derivative ∂L/∂H_c^{m'}.

Optimize Z^m with H^m and B fixed: Since Γ_c^{mm'} depends on z_c^m and z_{c'}^{m'}, we compute the derivative of Eq. (6) with respect to Γ_c^{mm'} and Z^m as follows:

$$\frac{\partial L}{\partial \mathbf{Z}^m} = \frac{\partial L}{\partial \mathbf{Z}^m} + \frac{\partial L}{\partial \Gamma_c^{mm'}} \frac{\partial \Gamma_c^{mm'}}{\partial \mathbf{Z}^m} = 2\mathbf{Z}^m \mathbf{H}_c^m \mathbf{H}_c^{mT} - 4\lambda \mathbf{X}_c^m \mathbf{H}_c^{mT} + 2\mathbf{X}_c^{m'} \Gamma_c^{mm'T} \mathbf{H}_c^{m'T} \qquad (9)$$

We can then use these derivatives to update the centroid matrix Z^m. In each iteration, after the centroids in Z^m are updated, we consequently update Γ_c^{mm'} based on Eqs. (1) and (2).

Optimize B with H^m and Z^m fixed: Once Z^m and H^m are fixed, H̃^m is also determined; the minimization in Eq. (6) is then equivalent to the following maximization:

$$\max_{\mathbf{B}} \; \mathrm{tr}\!\left(\mathbf{B}^T \Big(\lambda \sum_{m=1}^{M} \tilde{\mathbf{H}}^m\Big)\right) = \mathrm{tr}(\lambda \mathbf{B}^T \mathbf{U}) = \sum_{i,j} B_{ij} U_{ij} \qquad (10)$$

where B ∈ {−1, +1}^{N×b} and U = λ Σ_{m=1}^{M} H̃^m. It is easy to observe that the binary code B_{ij} should keep the same sign as U_{ij}. Therefore, we have:

$$\mathbf{B} = \mathrm{sign}(\mathbf{U}) = \mathrm{sign}\!\left(\lambda \sum_{m=1}^{M} \tilde{\mathbf{H}}^m\right) \qquad (11)$$

where sign(x) = 1 if x > 0, and sign(x) = 0 otherwise.
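A condensed sketch of one round of these alternating updates is given below. It is our own illustration under simplifying assumptions: the per-class sub-problems of Eq. (7) are merged into a single matrix update, the Z-step uses a plain reconstruction gradient rather than the full chain-rule expression of Eq. (9), the learning rate is arbitrary, the re-estimation of Γ via Eqs. (1)-(2) is only indicated by a comment, and the binarized assignments stand in for H̃^m (so the number of clusters k plays the role of the code length b).

```python
import numpy as np

def sign01(U):
    # sign function of Eq. (11): 1 for positive entries, 0 otherwise
    return (U > 0).astype(float)

def alternating_step(X, Z, H, B, Gamma, lam=1.0, lr=1e-3):
    """One pass of the alternating optimization for M modalities (column-wise data).
    Assumes the modalities have been aligned to a common sample count N."""
    M = len(X)
    # --- update H^m with Z^m and B fixed (gradient of Eq. (8), SGD-style step)
    for m in range(M):
        grad = 2 * Z[m].T @ Z[m] @ H[m] - Z[m].T @ X[m]
        for m2 in range(M):
            if m2 != m:
                grad += lam * 2 * (H[m] - H[m2] @ Gamma[(m, m2)])
        H[m] = np.maximum(H[m] - lr * grad, 0.0)   # keep H^m nonnegative (SemiNMF constraint)
    # --- update Z^m with H^m and B fixed (reconstruction gradient step)
    for m in range(M):
        grad_Z = 2 * (Z[m] @ H[m] - X[m]) @ H[m].T
        Z[m] = Z[m] - lr * grad_Z
    # --- Gamma would be re-estimated here from the new centroids via Eqs. (1)-(2)
    # --- update B with H^m fixed, Eq. (11); soft H[m] stands in for the hard H~^m
    U = lam * sum(H[m] for m in range(M))
    B = sign01(U)
    return Z, H, B
```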
By iteratively applying Eqs. (8)-(11), we can jointly optimize the correspondence and the hash functions, thus reinforcing the reciprocal effects of these two objectives. The whole procedure of FlexCMH and the alternating optimization for solving Eq. (6) are summarized in Algorithm 1.

Algorithm 1 FlexCMH: Flexible Cross-Modal Hashing
Input: M modality data matrices X^m, m ∈ {1, 2, ..., M}; the matched-samples indicator matrix P^{mm'} (optional).
Output: Clustering centroid matrices Z^m, indicator matrices H^m, and the binary code matrix B.
1: Initialize the centroid matrices Z^m, the indicator matrices H^m, the number of classes k, and the number of iterations iter; set t = 1.
2: while t < iter or Eq. (6) has not converged do
3:   for c = 1 → k do
4:     Update H_c^m using Eq. (8);
5:   end for
6:   Update Z^m using Eq. (9);
7:   Update the permutation matrix Γ^{mm'} using Eqs. (1)-(2);
8:   Update B using Eq. (11);
9:   t = t + 1.
10: end while

2.5 Complexity analysis

To facilitate the time complexity analysis, we assume a simple extreme case with M modalities, k classes, and t iterations. For any modality, we have n samples, and the extreme pairing case is considered. The time complexity of the proposed method is composed of three parts. First, the time cost of updating H_c^m via Eq. (8) is O(kM(k²d + k²n + kdn + (k²d)(M−1)/2)). Second, the time cost of updating Z^m via Eq. (9) is O(M(4dkn + nk²)). Third, the time cost of updating Γ^{mm'} via Eq. (2) is O(k²n²d²M(M−1)/2). Since the complexity of the third part is larger than that of the other two parts in each iteration, the overall complexity of FlexCMH is O(tk²n²d²M(M−1)/2). An empirical study (configuration: Ubuntu 16.04.1, Intel(R) Xeon(R) CPU E5-2650, 256 GB RAM) on the three adopted multi-modal datasets shows that FlexCMH costs 8.532 seconds on Wiki, 43.244 seconds on Mirflickr, and 1768.196 seconds on Nus-wide.

3 EXPERIMENTS

3.1 Experimental setup

Datasets: Three widely used benchmark datasets (Nus-wide, Wiki, and Mirflickr) are used to evaluate the performance of FlexCMH. Each dataset includes two modalities, image and text, although FlexCMH can also be directly applied to cases with more than two data modalities. Nus-wide (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) contains 260,648 web image-text pairs. Each image is annotated with one or more labels taken from 81 concept labels; each text is represented as a 1,000-dimensional bag-of-words vector, and the hand-crafted feature of each image is a 500-dimensional bag-of-visual-words (BOVW) vector. Wiki (https://www.wikidata.org/wiki/Wikidata) is generated from a group of 2,866 Wikipedia documents. Each document is an image-text pair annotated with one of 10 semantic labels; each image is represented by a 128-dimensional SIFT feature vector, and the text articles are represented as probability distributions over 10 topics, derived from a Latent Dirichlet Allocation (LDA) model. Mirflickr (http://press.liacs.nl/mirflickr/mirdownload.html) originally contains 25,000 instances collected from Flickr. Each instance consists of an image and its associated textual tags, and is manually annotated with one or more labels from a total of 24 semantic labels; each text is represented as a 1,386-dimensional bag-of-words vector, and each image is represented by a 512-dimensional GIST feature vector.

Comparing methods: Six related and representative methods are adopted for comparison. (i) CMSSH (Cross-Modal Similarity Sensitive Hashing) [2] treats hash code learning as a binary classification problem, and efficiently learns the hash functions using a boosting method. (ii) SCM (Semantic Correlation Maximization) [28] optimizes the hashing functions by maximizing the correlation between two modalities with respect to the semantic labels; it includes two versions, SCM-orth and SCM-seq: SCM-orth learns hash functions by direct eigen-decomposition with orthogonal constraints for balancing the coding functions, whereas SCM-seq learns hash functions more efficiently in a sequential manner without the orthogonal constraints. (iii) CMFH (Collective Matrix Factorization Hashing) [6] learns unified binary codes using collective matrix factorization with a latent factor model on multi-modal data. (iv) SePH (Semantics Preserving Hashing) [13] is a probability-based hashing method,
which generates one unified hash code for all observed views by considering the semantic consistency between views. (v) WMCA (Weakly-paired Maximum Covariance Analysis) [11] adopts maximum covariance analysis to jointly learn the latent matching and the subspace. (vi) MMPDL (Multi-modal Projection Dictionary Learning) [15] is a unified projective dictionary learning method, which jointly learns the projective dictionaries and the matching matrix for classification fusion. The source code of the baselines is provided by the authors, and the input parameter values are set according to the guidelines given in their respective papers. Since WMCA and MMPDL are not hashing methods, we obtain hashing codes from them by replacing their classification output with the ordinary hashing function sgn(·). For FlexCMH, we fix λ in Eq. (6) to 1, and set k = 10 on Wiki, k = 25 on Mirflickr, and k = 80 on Nus-wide; the number of nearest neighbors n_s in Eq. (1) is fixed to 5, and N in Eq. (4) is fixed to 50% of min{N_m, N_{m'}}. Our preliminary study shows that FlexCMH is robust to the input values of n_s and N. The number of iterations for optimizing Eq. (6) is set to 500; we empirically found that FlexCMH generally converges in fewer iterations on all the datasets. The parameter sensitivity of λ and k is studied in Section 3.3. The datasets and the code of FlexCMH will be made publicly available.

3.2 Results in different practical settings

To thoroughly study the performance of FlexCMH and of the comparing methods, we conduct three types of experiments: (1) completely-paired, (2) weakly-paired, and (3) completely-unpaired. In each type of experiment, all methods are run ten times, and we report the average MAP (mean average precision) results (a sketch of how MAP is computed for Hamming ranking is given after the description of the experimental settings). Since the MAP standard deviations of all methods are quite small (less than 2%) on all datasets, they are not reported, to save space.

For the completely-paired experiments, the clustering-based matching process of FlexCMH is excluded, and each comparing method uses all the paired samples for training (70%) and the rest for validation (30%). Table 1 reports the MAP results on the Mirflickr, Nus-wide, and Wiki datasets. In the table, 'Image vs. Text' denotes the setting where the query is an image and the database is text, and vice versa for 'Text vs. Image'.

For the weakly-paired experiments, we investigate three different settings: (2a) 50% of the image-text pairs of the training set (70% of the whole dataset) are kept, and the other pairs are randomly shuffled; in this setting the first four comparing methods (CMSSH, SCM-seq, SCM-orth, and SePH) can only use paired image-text instances for training and disregard the unpaired ones, whereas the last two (WMCA and MMPDL) and FlexCMH can use all the training data. (2b) As in (2a), but both the paired (50%) and the unpaired training samples are used to train all the comparing methods. (2c) As in (2a), all the images in the training set are used for training, but 10% of the text samples in the training set are randomly removed; as such, the number of images differs from the number of text samples across modalities and clusters.
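For reference, the following is a minimal sketch of how MAP is typically computed for Hamming-ranking cross-modal retrieval; it is our own utility (the paper does not provide evaluation code), and it assumes binary 0/1 codes and multi-label relevance defined by sharing at least one semantic label.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """MAP for cross-modal Hamming ranking (illustrative utility, not from the paper).
    query_codes: Q x b, db_codes: N x b binary arrays; labels: multi-hot matrices."""
    aps = []
    for q_code, q_lab in zip(query_codes, query_labels):
        dist = np.count_nonzero(db_codes != q_code, axis=1)    # Hamming distances
        order = np.argsort(dist, kind='stable')                # rank database by distance
        relevant = (db_labels[order] @ q_lab) > 0              # shared-label relevance
        if not relevant.any():
            continue
        ranks = np.flatnonzero(relevant) + 1                   # 1-based positions of hits
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hits.mean())                   # average precision per query
    return float(np.mean(aps))
```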
For setting (2c), none of the comparing methods can be applied, so we only report the MAP results of FlexCMH and its variants, FlexCMH(nJ) and FlexCMH(nC). FlexCMH(nJ) first seeks the matched image-text pairs and then executes the follow-up cross-modal hashing, without jointly optimizing the matched clusters (samples) and the hashing functions. FlexCMH(nC) uses the label information to obtain the correspondence between samples (as done by MMPDL), instead of our proposed clustering-based matching strategy. Table 2 reports the MAP values of the compared methods in these settings.

For the completely-unpaired experiments, besides randomly partitioning the data into training (70%) and testing (30%) sets, we randomly shuffle the indexes of the images and of the text samples in the training set. As a result, the images and the text samples are almost completely unpaired. For this type of experiment, only WMCA and MMPDL can be used for comparison. Table 3 reports the MAP values of the three methods.

From Table 1, we can see that FlexCMH achieves the best performance in most cases. This is because FlexCMH not only jointly models the cross-modal similarity preserving loss and the intra-modal similarity preserving loss, to build a more faithful semantic projection, but also models the quantitative loss to learn adaptive hashing codes. We observe that SePH obtains better results for 'Text vs. Image' retrieval on Wiki, possibly because of the adaptability of its probability-based strategy on small datasets. An unexpected observation is that the performance of CMSSH and SCM-orth decreases as the length of the hash codes increases. This might be caused by the imbalance between bits in hash codes learned by singular value or eigenvalue decomposition. These experimental results show the effectiveness of FlexCMH for canonical cross-modal hashing, where training samples from different modalities are completely paired.

From Table 2, we can see that the MAP results are similar to those of Table 1, even though only 50% of the pairs of the training set are used for training by CMSSH, SCM, and SePH. This observation suggests that this fraction of pairs is sufficient to train the cross-modal hashing functions. In practice, we observed a significant reduction in the MAP values when less than 10% of the training data is paired. We also observe that the MAP results of CMSSH, SCM, and SePH drop sharply when all the paired and unpaired samples are used for training. This is because CMSSH, SCM, and SePH are misled by the 'incorrectly paired' (in fact, not-paired) samples. WMCA, MMPDL, and FlexCMH do not manifest such a sharp reduction in performance, because they adopt different techniques to augment the matched samples, which boosts the performance of cross-modal hashing. In addition, FlexCMH still achieves the best performance, thanks to (1) its novel clustering-based matching approach for exploring the matched clusters and the samples therein, and (2) a unified objective function that optimizes, in a coordinated manner, the matching between clusters and samples and the cross-modal hashing functions built on the matched clusters and samples. FlexCMH exhibits slightly reduced results when the numbers of samples (images and texts) in different modalities are not the same and only 50% of the image-text pairs are paired. In 'Image vs. Text' retrieval, the MAP results of FlexCMH are generally lower than those in Table 1, because 10% of the samples in the text modality were removed; as a result, the retrieval results may be incorrect when the corresponding images are used to query the removed texts. We also observe that the results of FlexCMH(nJ) are inferior to those of FlexCMH.
This observation proves that jointly optimizing the hashing functions and the matched clusters and samples enables a mutual boosting of the two objectives. Furthermore, FlexCMH(nC) is also outperformed by FlexCMH, which proves that our proposed clustering-based matching strategy can more reliably find the matching between samples across modalities.

Table 1: Results (MAP) on three datasets with completely-paired data. Each row lists the MAP values of CMSSH, SCM-seq, SCM-orth, CMFH, SePH, WMCA, MMPDL, and FlexCMH (in this order).

Mirflickr, Image vs. Text:
  16 bits:  0.5616 0.5721 0.6041 0.6232 0.6573 0.5834 0.6126 0.6639
  32 bits:  0.5555 0.5607 0.6112 0.6256 0.6603 0.5847 0.6135 0.6674
  64 bits:  0.5513 0.5535 0.6176 0.6268 0.6616 0.5856 0.6141 0.6691
  128 bits: 0.5484 0.5482 0.6232 0.6293 0.6637 0.5873 0.6128 0.6724
Mirflickr, Text vs. Image:
  16 bits:  0.5616 0.5694 0.6055 0.6205 0.6481 0.5847 0.6124 0.6601
  32 bits:  0.5551 0.5611 0.6154 0.6237 0.6521 0.5861 0.6142 0.6632
  64 bits:  0.5506 0.5544 0.6238 0.6259 0.6545 0.5886 0.6156 0.6648
  128 bits: 0.5475 0.5497 0.6299 0.6286 0.6534 0.5903 0.6172 0.6676
Nus-wide, Image vs. Text:
  16 bits:  0.3414 0.3623 0.4651 0.4752 0.4787 0.4396 0.4635 0.4901
  32 bits:  0.3336 0.3646 0.4714 0.4793 0.4869 0.4415 0.4658 0.4935
  64 bits:  0.3282 0.3703 0.4822 0.4812 0.4888 0.4433 0.4661 0.4987
  128 bits: 0.3261 0.3721 0.4851 0.4866 0.4932 0.4436 0.4672 0.5012
Nus-wide, Text vs. Image:
  16 bits:  0.3392 0.3412 0.4370 0.4349 0.4489 0.4179 0.4225 0.4639
  32 bits:  0.3321 0.3459 0.4428 0.4387 0.4539 0.4192 0.4232 0.4653
  64 bits:  0.3272 0.3472 0.4504 0.4412 0.4587 0.4221 0.4237 0.4688
  128 bits: 0.3256 0.3539 0.2235 0.4425 0.4621 0.4235 0.4256 0.4712
Wiki, Image vs. Text:
  16 bits:  0.1694 0.1577 0.2341 0.2578 0.2836 0.2243 0.2731 0.2846
  32 bits:  0.1523 0.1434 0.2411 0.2591 0.2859 0.2271 0.2745 0.2889
  64 bits:  0.1447 0.1376 0.2443 0.2603 0.2879 0.2283 0.2768 0.2912
  128 bits: 0.1434 0.1358 0.2564 0.2612 0.2863 0.2312 0.2801 0.2935
Wiki, Text vs. Image:
  16 bits:  0.1578 0.1521 0.2257 0.2872 0.5345 0.2089 0.2821 0.2812
  32 bits:  0.1384 0.1561 0.2459 0.2891 0.5351 0.2104 0.2824 0.2836
  64 bits:  0.1331 0.1371 0.2482 0.2907 0.5471 0.2131 0.2836 0.2857
  128 bits: 0.1256 0.1261 0.2518 0.2923 0.5506 0.2156 0.2861 0.2869

Table 2: Results (MAP) on three datasets with weakly-paired data. For settings (2a) and (2b), each row lists the MAP values in the same method order as Table 1; for setting (2c), each row lists FlexCMH(nJ), FlexCMH(nC), and FlexCMH.
Setting (2a): 50% of the image-text pairs are paired; CMSSH, SCM, and SePH use only the paired data for training.

Mirflickr, Image vs. Text:
  16 bits:  0.5614 0.5720 0.6037 0.6225 0.6571 0.5833 0.6123 0.6635
  32 bits:  0.5551 0.5606 0.6111 0.6249 0.6609 0.5848 0.6131 0.6673
  64 bits:  0.5512 0.5532 0.6166 0.6261 0.6618 0.5852 0.6145 0.6689
  128 bits: 0.5482 0.5479 0.6235 0.6290 0.6636 0.5836 0.6121 0.6721
Mirflickr, Text vs. Image:
  16 bits:  0.5612 0.5696 0.6053 0.6201 0.6479 0.5843 0.6121 0.6578
  32 bits:  0.5541 0.5612 0.6151 0.6233 0.6516 0.5857 0.6138 0.6623
  64 bits:  0.5502 0.5541 0.6235 0.6248 0.6541 0.5848 0.6154 0.6641
  128 bits: 0.5474 0.5486 0.6294 0.6291 0.6533 0.5889 0.6167 0.6672
Nus-wide, Image vs. Text:
  16 bits:  0.3411 0.3620 0.4648 0.4748 0.4785 0.4378 0.4632 0.4903
  32 bits:  0.3337 0.3644 0.4716 0.4789 0.4862 0.4416 0.4654 0.4936
  64 bits:  0.3278 0.3704 0.4820 0.4805 0.4881 0.4429 0.4659 0.4982
  128 bits: 0.3256 0.3722 0.4853 0.4835 0.4928 0.4437 0.4677 0.5010
Nus-wide, Text vs. Image:
  16 bits:  0.3396 0.3408 0.4364 0.4342 0.4487 0.4171 0.4221 0.4636
  32 bits:  0.3318 0.3455 0.4427 0.4382 0.4536 0.4186 0.4228 0.4657
  64 bits:  0.3269 0.3471 0.4501 0.4403 0.4585 0.4215 0.4231 0.4683
  128 bits: 0.3253 0.3536 0.2233 0.4426 0.4618 0.4232 0.4259 0.4708
Wiki, Image vs. Text:
  16 bits:  0.1689 0.1574 0.2344 0.2576 0.2825 0.2246 0.2729 0.2844
  32 bits:  0.1520 0.1436 0.2410 0.2588 0.2853 0.2278 0.2743 0.2885
  64 bits:  0.1436 0.1374 0.2445 0.2596 0.2881 0.2281 0.2765 0.2907
  128 bits: 0.1432 0.1361 0.2567 0.2608 0.2862 0.2316 0.2793 0.2931
Wiki, Text vs. Image:
  16 bits:  0.1573 0.1517 0.2256 0.2869 0.5343 0.2086 0.2813 0.2810
  32 bits:  0.1382 0.1556 0.2456 0.2883 0.5346 0.2089 0.2821 0.2831
  64 bits:  0.1326 0.1361 0.2487 0.2902 0.5468 0.2128 0.2832 0.2855
  128 bits: 0.1253 0.1264 0.2521 0.2915 0.5494 0.2153 0.2863 0.2864

Setting (2b): 50% of the image-text pairs are paired; all methods use all the training data.

Mirflickr, Image vs. Text:
  16 bits:  0.5216 0.5398 0.5404 0.5405 0.5411 0.5456 0.5778 0.5867
  32 bits:  0.5238 0.5401 0.5413 0.5422 0.5436 0.5463 0.5792 0.5891
  64 bits:  0.5244 0.5406 0.5430 0.5438 0.5467 0.5471 0.5814 0.5925
  128 bits: 0.5249 0.5412 0.5442 0.5447 0.5501 0.5489 0.5846 0.5973
Mirflickr, Text vs. Image:
  16 bits:  0.5121 0.5211 0.5235 0.5314 0.5431 0.5456 0.5631 0.5801
  32 bits:  0.5135 0.5226 0.5238 0.5335 0.5441 0.5461 0.5647 0.5825
  64 bits:  0.5142 0.5237 0.5241 0.5356 0.5453 0.5458 0.5648 0.5836
  128 bits: 0.5136 0.5242 0.5250 0.5372 0.5459 0.5472 0.5655 0.5859
Nus-wide, Image vs. Text:
  16 bits:  0.2715 0.2953 0.3343 0.3409 0.3561 0.3721 0.4117 0.4273
  32 bits:  0.2731 0.2968 0.3358 0.3428 0.3582 0.3746 0.4136 0.4296
  64 bits:  0.2757 0.2991 0.3372 0.3442 0.3610 0.3758 0.4137 0.4315
  128 bits: 0.2766 0.3012 0.3395 0.3462 0.3612 0.3761 0.4136 0.4331
Nus-wide, Text vs. Image:
  16 bits:  0.2563 0.2855 0.3211 0.3382 0.3531 0.3612 0.3872 0.4031
  32 bits:  0.2607 0.2879 0.3234 0.3397 0.3554 0.3648 0.3891 0.4056
  64 bits:  0.2622 0.2893 0.3269 0.3421 0.3560 0.3679 0.3911 0.4079
  128 bits: 0.2741 0.2921 0.3274 0.3442 0.3579 0.3712 0.3924 0.4112
Wiki, Image vs. Text:
  16 bits:  0.1011 0.1107 0.1126 0.1157 0.1235 0.1575 0.2342 0.2629
  32 bits:  0.1023 0.1112 0.1138 0.1165 0.1267 0.1593 0.2361 0.2647
  64 bits:  0.1035 0.1125 0.1149 0.1179 0.1284 0.1611 0.2375 0.2655
  128 bits: 0.1031 0.1128 0.1168 0.1182 0.1302 0.1635 0.2341 0.2687
Wiki, Text vs. Image:
  16 bits:  0.0989 0.1118 0.1206 0.1231 0.1238 0.1437 0.2132 0.2538
  32 bits:  0.1002 0.1124 0.1209 0.1255 0.1242 0.1445 0.2141 0.2541
  64 bits:  0.1011 0.1121 0.1214 0.1269 0.1247 0.1458 0.2155 0.2557
  128 bits: 0.1020 0.1128 0.1221 0.1293 0.1264 0.1473 0.2135 0.2563

Setting (2c): 50% of the image-text pairs are paired, and the numbers of image and text samples differ. Each row lists FlexCMH(nJ), FlexCMH(nC), and FlexCMH for Image vs. Text, then for Text vs. Image.

Mirflickr:
  16 bits:  0.6121 0.5859 0.6435 | 0.6224 0.5983 0.6589
  32 bits:  0.6135 0.5886 0.6441 | 0.6237 0.6004 0.6624
  64 bits:  0.6167 0.5904 0.6453 | 0.6244 0.6031 0.6643
  128 bits: 0.6185 0.5927 0.6468 | 0.6256 0.6042 0.6658
Nus-wide:
  16 bits:  0.3878 0.3618 0.4233 | 0.4115 0.3832 0.4627
  32 bits:  0.3892 0.3643 0.4251 | 0.4123 0.3845 0.4635
  64 bits:  0.3905 0.3666 0.4267 | 0.4143 0.3873 0.4654
  128 bits: 0.3936 0.3647 0.4283 | 0.4150 0.3907 0.4688
Wiki:
  16 bits:  0.2231 0.2015 0.2577 | 0.2435 0.2256 0.2803
  32 bits:  0.2245 0.2057 0.2593 | 0.2456 0.2274 0.2815
  64 bits:  0.2256 0.2076 0.2612 | 0.2471 0.2293 0.2834
  128 bits: 0.2273 0.2088 0.2635 | 0.2493 0.2308 0.2842
Table 3: Results (MAP) on three datasets with completely-unpaired data. Each row lists the MAP values of WMCA, MMPDL, and FlexCMH for Image vs. Text, then for Text vs. Image.

Mirflickr:
  16 bits:  0.5214 0.5535 0.5693 | 0.5256 0.5489 0.5631
  32 bits:  0.5231 0.5542 0.5704 | 0.5263 0.5503 0.5652
  64 bits:  0.5245 0.5567 0.5723 | 0.5278 0.5531 0.5681
  128 bits: 0.5263 0.5588 0.5749 | 0.5293 0.5547 0.5694
Nus-wide:
  16 bits:  0.3559 0.3963 0.4115 | 0.3414 0.3635 0.4031
  32 bits:  0.3574 0.3984 0.4135 | 0.3438 0.3678 0.4058
  64 bits:  0.3591 0.4004 0.4159 | 0.3467 0.3691 0.4083
  128 bits: 0.3604 0.4015 0.4173 | 0.3481 0.3713 0.4112
Wiki:
  16 bits:  0.1276 0.2210 0.2511 | 0.1335 0.2015 0.2437
  32 bits:  0.1295 0.2231 0.2534 | 0.1344 0.2038 0.2459
  64 bits:  0.1310 0.2254 0.2548 | 0.1358 0.2074 0.2483
  128 bits: 0.1336 0.2268 0.2563 | 0.1381 0.2098 0.2501

Figure 2: MAP vs. λ on the Mirflickr and Wiki datasets.
Figure 3: MAP vs. k on the different datasets.

In Table 3, the MAP results of all methods are inferior to those of Tables 1 and 2. Still, FlexCMH achieves the best results, which proves the effectiveness of FlexCMH on completely-unpaired data. From these results, we can state that the matching information of samples across modalities is crucial for cross-modal hashing. Our clustering-based matching strategy can reliably explore paired samples, and it boosts the performance of cross-modal hashing on weakly-paired (or completely unpaired) samples. In addition, we present some exemplar cross-modal retrieved images (texts) in the supplementary file to visually support the advantages of FlexCMH.

In summary, our experimental results prove that FlexCMH can learn cross-modal hashing functions more effectively than representative comparing methods. FlexCMH is flexible in a variety of practical settings, where the paired samples across modalities are partially available or even completely unknown, and the numbers of samples in different modalities (and matched clusters) are also different.
To the best of our knowledge, no existing cross-modal hashing methods can work in these scenarios.

3.3 Parameter sensitivity analysis

We further explore the sensitivity of the scalar parameter λ in Eq. (6), and report the results on the three datasets in Fig. 2, where the code length is fixed to 16 bits. We can see that FlexCMH is only slightly sensitive to λ for λ ∈ [10^{-3}, 10^2], and achieves the best performance when λ = 1. Over-weighting or under-weighting the quantitative loss has a negative, but not significant, impact on the performance. In summary, an effective λ can be easily selected for FlexCMH. In addition, we investigate the sensitivity of the number of clusters k, and report the results in Fig. 3, with the code length fixed to 16 bits. We can see that FlexCMH is sensitive to k and achieves the best results with k = 10 on Wiki, k = 25 on Mirflickr, and k = 80 on Nus-wide. These preferred values of k are close to the number of distinct labels of the corresponding datasets. Given this, we suggest fixing k around the number of labels l.

3.4 Results on three modalities

In this section, we evaluate the effectiveness of FlexCMH on the Wiki dataset with three modalities, fixing the code length to 16. Currently, to the best of our knowledge, there are no publicly available datasets with three or more modalities. To simulate a three-modality setting, we divide the 128-dimensional image modality into two sub-modalities: the 64-dimensional i1 and the 64-dimensional i2 modalities. Since the comparing methods cannot directly handle more than two modalities, we adapt them by learning hash functions between each pair of modalities, and then merging the retrieved results from the respective pairs. For example, if Image1 serves as the query modality, then the comparing methods separately optimize two cross-modal hashing mappings (i.e., Image1 → Text and Image1 → Image2). The experimental setting is the same as in (2b). Table 4 shows the MAP values on the Wiki dataset in the three-modality case. FlexCMH again outperforms the compared methods, providing evidence of the broad applicability of our proposed approach.

Table 4: Results (MAP) on three modalities on Wiki. Each row lists the MAP values of CMSSH, SCM-seq, SCM-orth, SePH, WMCA, MMPDL, and FlexCMH (in this order).
  Text:   0.1015 0.1114 0.1215 0.1241 0.1428 0.2023 0.2331
  Image1: 0.0896 0.0975 0.1025 0.1056 0.1156 0.1676 0.1876
  Image2: 0.0937 0.1034 0.1151 0.1104 0.1241 0.1896 0.2031

4 CONCLUSIONS

In this paper, we proposed a flexible cross-modal hashing (FlexCMH) solution to learn effective hashing functions from weakly-paired (or completely-unpaired) data across modalities. FlexCMH introduces a clustering-based matching strategy to explore the potential correspondence between clusters and their member samples. In addition, it jointly optimizes the potential correspondence, the cross-modal hashing functions derived from the correspondence, and the hashing quantitative loss in a unified objective function, so as to learn compact hashing codes in a coordinated manner. Extensive experiments demonstrate that FlexCMH outperforms state-of-the-art hashing methods on completely-paired, weakly-paired, and completely-unpaired multi-modality data. In the future, we will incorporate deep feature learning into cross-modal hashing on weakly-paired data. The code and data (those that are not yet available) will be made publicly available.

REFERENCES

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[2] M. M.
Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In CVPR, pages 3594–3601, 2010.
[3] G. Chen, Y. Song, F. Wang, and C. Zhang. Semi-supervised multi-label learning by solving a sylvester equation. In ICDM, pages 410–419, 2008.
[4] L. Chen, D. Xu, W. H. Tsang, and X. Li. Spectral embedded hashing for scalable image retrieval. IEEE Transactions on Cybernetics, 44(7):1180–1190, 2014.
[5] C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. TPAMI, 32(1):45–55, 2010.
[6] G. Ding, Y. Guo, and J. Zhou. Collective matrix factorization hashing for multimodal data. In CVPR, pages 2083–2090, 2014.
[7] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106(2):210–233, 2014.
[8] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. TPAMI, 35(12):2916–2929, 2013.
[9] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan. Learning consistent feature representation for cross-modal multimedia retrieval. TMM, 17(3):370–381, 2015.
[10] S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, pages 1360–1365, 2011.
[11] C. H. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In ECCV, 2010.
[12] L. Li, Y. Mengyang, and S. Ling. Multiview alignment hashing for efficient image search. IEEE Transactions on Image Processing, 24(3):956–966, 2015.
[13] Z. Lin, G. Ding, J. Han, and J. Wang. Cross-view retrieval via probability-based semantics-preserving hashing. IEEE Transactions on Cybernetics, 47(12):4342–4355, 2017.
[14] H. Liu, W. Feng, X. Zhang, and F. Sun. Weakly-paired deep dictionary learning for cross-modal retrieval. Pattern Recognition Letters, 2018.
[15] H. Liu, Y. Wu, F. Sun, B. Fang, and G. Di. Weakly paired multimodal fusion for object recognition. IEEE Transactions on Automation Science and Engineering, PP(99):1–12, 2017.
[16] X. Liu, Y. Mu, D. Zhang, B. Lang, and X. Li. Large-scale unsupervised hashing with shared structure learning. IEEE Transactions on Cybernetics, 45(9):1811–1822, 2017.
[17] D. Mandal and S. Biswas. Generalized coupled dictionary learning approach with applications to cross-modal matching. TIP, 25(8):3826–3837, 2016.
[18] L. Meng and A. Striegel. Local versus global biological network alignment. Bioinformatics, 32(20):btw348, 2015.
[19] F. Shen, C. Shen, L. Wei, and H. T. Shen. Supervised discrete hashing. In CVPR, 2015.
[20] Z. Si and H. Tong. FINAL: Fast attributed network alignment. In ACM SIGKDD International Conference, 2016.
[21] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, pages 785–796, 2013.
[22] C. Wang, S. Yan, L. Zhang, and H.-J. Zhang. Multi-label sparse coding for automatic image annotation. In CVPR, pages 1643–1650, 2009.
[23] J. Wang, S. Kumar, and S. F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, pages 3424–3431, 2010.
[24] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data – a survey. Proc. of the IEEE, 104(1):34–57, 2016.
[25] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. Computer Science, 2014.
[26] J. Wang, T. Zhang, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI, 40(4):769–790, 2018.
[27] L. Xiao, M. Song, D. Tao, X. Zhou, C. Chen, and J. Bu. Semi-supervised coupled dictionary learning for person re-identification. In CVPR, 2014.
[28] D. Zhang and W. J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, pages 2177–2183, 2014.
[29] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao. Linear cross-modal hashing for efficient multimedia search. In ACM MM, pages 143–152, 2013.
[30] L. Zong, X. Zhang, and X. Liu. Multi-view clustering on unmapped data via constrained non-negative matrix factorization. Neural Networks, 108:155–171, 2018.