
Targeted Attack for Deep Hashing Based Retrieval

2020, Computer Vision – ECCV 2020

The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the ℓ∞ restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval.

Targeted Attack for Deep Hashing based Retrieval

Jiawang Bai1,2 ⋆, Bin Chen1,2 ⋆, Yiming Li1 ⋆, Dongxian Wu1,2, Weiwei Guo3, Shu-tao Xia1,2, and En-hui Yang4

1 Tsinghua University  2 Peng Cheng Laboratory  3 Vivo  4 University of Waterloo

arXiv:2004.07955v2 [cs.CR] 8 May 2020

Keywords: targeted attack, deep hashing, adversarial attack, similarity retrieval

1 Introduction

Approximate nearest neighbor (ANN) retrieval over high-dimensional, large-scale data has been widely adopted in online search engines, e.g., Google or Bing, due to its efficiency and effectiveness. Among ANN retrieval methods, hashing-based methods [38] have attracted considerable attention due to their compact binary representations and the rapid computation of similarity between hash codes via the Hamming distance.
In particular, deep learning based hashing methods [1,2,6,25,16,31] have shown superior performance, since they generally learn more meaningful semantic hash codes through learnable hashing functions built on deep neural networks (DNNs). Recent studies [13,19,35] revealed that DNNs are vulnerable to adversarial examples, which are crafted by adding intentionally small perturbations to benign examples and fool DNNs into confidently making incorrect predictions.

⋆ Equal contribution

Fig. 1. The comparison between the P2P attack paradigm and the proposed P2S paradigm. In this top-3 retrieval example, there are two object classes (i.e., ‘Cat’ and ‘Dog’), and the target label of the attack is ‘Cat’. In the P2P paradigm, an object with the target label is randomly selected as the reference to generate the adversarial query; when the selected object is close to the category boundary (dotted lines in the figure) or is an outlier, the attack performance is poor. In this example, the targeted attack success rates of P2P and P2S are 33.3% and 100%, respectively.

While deep retrieval systems take advantage of the power of DNNs, they also inherit the vulnerability to adversarial examples [11,21,36,46]. Previous research [46] only paid attention to designing a non-targeted attack on deep hashing based retrieval, i.e., making the system return retrieval objects with incorrect labels. Compared with non-targeted attacks, targeted attacks are more malicious, since they cause adversarial examples to be misidentified as a predefined label and can be exploited for specific malicious purposes [4,10,28]. For example, a hashing based retrieval system may return violent images when a child queries with a cartoon image intentionally perturbed by the adversary.
Accordingly, it is desirable to study targeted adversarial attacks on deep hashing models and address their security concerns. This paper focuses on the targeted attack in hashing based retrieval. Different from classification, retrieval aims at returning multiple relevant objects instead of a single result, which means that the query bears a closer relationship to the set of relevant objects than to any individual object. Motivated by this fact, we formulate the targeted attack as a point-to-set (P2S) optimization, which minimizes the average distance between the compressed representations (e.g., hash codes in Hamming space) of the adversarial example and those of a set of objects with the target label. Compared with the point-to-point (P2P) paradigm [36], which directs the adversarial example to generate a representation similar to that of a randomly chosen object with the target label, our proposed point-to-set attack paradigm is more stable and efficient. A detailed comparison between the P2S and P2P attack paradigms is shown in Figure 1. In particular, when minimizing the average Hamming distance between a hash code and those of an object set, we prove that the globally optimal solution (dubbed the anchor code) can be obtained through a simple component-voting scheme, which is a gift from the nature of hashing-based retrieval. The anchor code can therefore be naturally chosen as the targeted hash code to direct the generation of the adversarial query. To further balance the attack performance and the imperceptibility, we propose a novel attack method, dubbed deep hashing targeted attack (DHTA), which minimizes the Hamming distance between the hash code of the adversarial query and the anchor code under an ℓ∞ restriction on the adversarial perturbations.
In summary, the main contribution of this work is four-fold:

– We formulate the targeted attack on hashing based retrieval as a point-to-set optimization instead of the common point-to-point paradigm, considering the characteristics of retrieval tasks.
– We propose a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, and derive its theoretical optimality under the average-case point-to-set metric.
– We develop a simple yet effective targeted attack, DHTA, which efficiently balances the attack performance and the perceptibility. This is the first attempt to design a targeted attack on hashing based retrieval.
– Extensive experiments verify that DHTA is effective in attacking both image and video hashing.

2 Related Work

2.1 Deep Hashing based Similarity Retrieval

Hashing methods map semantically similar objects to similar compact binary codes in Hamming space and are widely adopted to accelerate ANN retrieval [38] for large-scale databases. Classical data-dependent hashing consists of two parts: hash function learning and binary inference [22,12,30]. Recently, more and more deep learning techniques have been introduced into traditional hashing-based retrieval methods, reaching state-of-the-art performance thanks to the powerful feature extraction of deep neural networks. The first deep hashing method, focusing on image retrieval, was proposed in [42]. Recent works showed that learning the hashing mapping in an end-to-end manner can greatly improve the quality of the binary codes [20,25,2,1]. The above-mentioned methods can be easily extended to multi-label image retrieval, e.g., [49,40]. To cope with the lack of labeled images, other researchers devoted themselves to designing novel hashing methods, e.g., unsupervised deep hashing [45] and semi-supervised deep hashing [44].
Different from deep image hashing methods, deep video hashing usually first extracts frame features with a convolutional neural network (CNN) and then fuses them to learn a global hashing function. Among the various fusion methods, recurrent neural network (RNN) architectures are the most common choice, as they model the temporal structure of videos well [14]. Moreover, unsupervised video hashing methods have also been proposed [41,23], which organize hash code learning in a self-taught manner to reduce time- and labor-consuming labeling.

2.2 Adversarial Attack

DNNs can easily be fooled into confidently making incorrect predictions by intentional, human-imperceptible perturbations. The process of generating adversarial examples is called adversarial attack, and was initially studied by Szegedy et al. [35] in the image classification task. To craft such adversarial examples, the fast gradient sign method (FGSM) [13] maximizes the loss along the gradient direction. After that, projected gradient descent (PGD) [19] was proposed to reach better performance. DeepFool finds the smallest perturbation by exploring the nearest decision boundary [26]. Beyond the aforementioned attacks, many other methods [3,8,47] have been developed to find adversarial perturbations in the image classification problem. Besides, other DNN based tasks also inherit the vulnerability to adversarial examples [43,9,39]. Deep learning based similarity retrieval, in particular, raises wide security concerns. For feature-based retrieval, Li et al. [21] focused on non-targeted attacks by adding universal adversarial perturbations (UAPs), while a targeted mismatch adversarial attack was explored in [36]. In [11], adversarial queries for a deep product quantization network are generated by perturbing the overall soft-quantized distributions.
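The gradient-sign idea behind FGSM can be illustrated with a small sketch (our own toy example, not from the paper: a logistic-regression model with an analytic input gradient stands in for a DNN and backpropagation):

```python
import numpy as np

def xent_loss(x, w, y):
    """Cross-entropy loss of a toy logistic model p = sigmoid(w.x), label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, w, y, eps):
    """One FGSM step: x' = x + eps * sign(dL/dx), where dL/dx = (p - y) * w."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    grad_x = (p - y) * w              # analytic input gradient of the loss
    return x + eps * np.sign(grad_x)  # maximize the loss within an l_inf ball

rng = np.random.default_rng(0)
w = rng.normal(size=8)                # hypothetical model weights
x = rng.normal(size=8)                # hypothetical benign input
x_adv = fgsm(x, w, y=1, eps=0.05)     # perturbation bounded by eps in l_inf norm
```

By construction the perturbation stays inside the ε-ball and moves the input in the direction that increases the loss.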
However, for hashing based retrieval, one of the most important retrieval methods, robustness analysis lags far behind. There is only one previous work on attacking deep hashing based retrieval [46], and it addresses the non-targeted attack, i.e., returning retrieval objects with incorrect labels. The targeted attack on such retrieval systems remains unexplored.

3 The Proposed Method

3.1 Preliminaries

In this section, we briefly review the process of deep hashing based retrieval. Suppose X = {(x_i, y_i)}_{i=1}^N is a collection of N samples labeled with C classes, where x_i denotes a retrieval object, e.g., an image or a video, and y_i ∈ {0, 1}^C is the corresponding label vector. The c-th component of the indicator vector being y_{ic} = 1 means that the sample x_i belongs to class c. Let X^(t) = {(x, y) ∈ X | y = y_t} be the subset of X consisting of the objects with label y_t.

Deep Hashing Model. The hash code of a query object x under a deep hashing model is generated as follows:

h = F(x) = sign(f_θ(x)), (1)

where f_θ(·) is a DNN. In general, f_θ(·) consists of a feature extractor followed by fully-connected layers. Specifically, the feature extractor is usually a CNN for image retrieval [2,1,5], while a CNN stacked with an RNN is widely adopted for video retrieval [14,33,23]. In particular, the sign(·) function is approximated by the tanh(·) function during training in deep hashing based retrieval methods to alleviate the vanishing gradient problem [2].

Similarity-based Retrieval. Given a deep hashing model F(·), a query object x, and an object database {x_i}_{i=1}^M, the retrieval process is as follows. First, the query x is fed into the deep hashing model and its binary code F(x) is obtained through Eq. (1). Second, the Hamming distance between the hash code of the query x and that of each object x_i in the database is calculated, denoted as d_H(F(x), F(x_i)).
Finally, the retrieval system returns a list of objects produced by sorting these Hamming distances.

3.2 Deep Hashing Targeted Attack

Problem Formulation. In general, given a benign query x, the objective of a targeted attack in retrieval is to generate an attacked version x′ of x that causes the targeted model to retrieve objects with the target label y_t. This objective can be achieved by minimizing the distance between the hash code of the attacked sample x′ and those of the object subset X^(t) with the target label y_t, i.e.,

min_{x′} d(F(x′), F(X^(t))), (2)

where F(X^(t)) = {F(x) | x ∈ X^(t)} and d(·, ·) denotes a point-to-set metric. Once the problem is formulated as objective (2), the remaining question is how to define the point-to-set metric. In this paper, we use the most widely used point-to-set metric, the average-case metric, given in Definition 1.

Definition 1. Given a point h_0 ∈ {−1, +1}^K, a set of points A ⊆ {−1, +1}^K, and a point-to-point metric d_H, the average-case point-to-set metric is defined as

d_Ave(h_0, A) ≜ (1/|A|) Σ_{h∈A} d_H(h_0, h). (3)

Remark 1. If the average-case point-to-set metric is adopted, the objective (2) becomes

min_{h′} (1/|A|) Σ_{h∈F(X^(t))} d_H(h′, h), (4)

where h′ is the hash code corresponding to the adversarial example x′. In particular, there exists an analytical optimal solution (dubbed the anchor code) of the optimization problem (4), obtained through a component-voting scheme, which is a gift from the nature of the Hamming distance in hashing-based retrieval. The component-voting scheme is shown in Algorithm 1, and the optimality of the anchor code is established in Theorem 1; the proof is given in Appendix A.

Theorem 1. The anchor code h_a computed by Algorithm 1 is the binary code achieving the minimal sum of Hamming distances to h_i, i = 1, …, n_t, i.e.,

h_a = arg min_{h∈{+1,−1}^K} Σ_{i=1}^{n_t} d_H(h, h_i). (5)
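Algorithm 1 below is an entry-wise majority vote; a minimal numpy sketch (our own illustration, using three hypothetical 4-bit codes) also brute-force checks the optimality stated in Theorem 1 and the Hamming/inner-product identity d_H(a, b) = (K − a⊤b)/2:

```python
import numpy as np
from itertools import product

def anchor_code(H):
    """Component-wise majority vote over hash codes H (shape [n_t, K], entries ±1).

    Ties (column sum == 0) resolve to -1, matching the 'otherwise' branch of Eq. (7).
    """
    return np.where(H.sum(axis=0) > 0, 1, -1)

def hamming(a, b):
    return int(np.sum(a != b))

# Three hypothetical 4-bit codes of objects sharing the target label.
H = np.array([[+1, -1, +1, -1],
              [+1, +1, -1, -1],
              [+1, +1, -1, +1]])
h_a = anchor_code(H)

# Theorem 1: the anchor code minimizes the sum of Hamming distances to H
# (brute-force check over all 2^K candidate binary codes).
best = min(sum(hamming(np.array(h), hi) for hi in H)
           for h in product([-1, 1], repeat=H.shape[1]))
```

The per-component vote is optimal because the Hamming distance decomposes coordinate-wise, so each bit can be chosen independently to minimize its own mismatch count.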
Algorithm 1 Component-voting Scheme
Input: K-bit hash codes {h_i}_{i=1}^{n_t} of objects with the target label t.
Output: Anchor code h_a.
1: for j = 1 : K do
2:   For the j-th component among {h_i}_{i=1}^{n_t}, conduct a voting process by counting the numbers of +1 and −1 entries, denoted by N^j_{+1} and N^j_{−1}, respectively:
       N^j_{+1} = Σ_{i=1}^{n_t} I(h^j_i = +1),  N^j_{−1} = Σ_{i=1}^{n_t} I(h^j_i = −1), (6)
     where I(·) is an indicator function.
3:   Determine the j-th component of the anchor code h^j_a as
       h^j_a = +1 if N^j_{+1} > N^j_{−1}, and h^j_a = −1 otherwise. (7)
4: end for
5: return Anchor code h_a.

Overall Objective Function. Owing to the optimal representative property of the anchor code for the set of hash codes of objects with the target label (Theorem 1), we can naturally choose the anchor code as the targeted hash code to direct the generation of the adversarial query. However, the attacked object corresponding to the anchor code may look very different from the original one, which would make the attacked object easily detectable. To solve this problem, we introduce an ℓ∞ restriction on the adversarial perturbation while minimizing the Hamming distance between the hash code of the attacked object and the anchor code:

min_{x′} d_H(sign(f_θ(x′)), h_a)  s.t. ‖x′ − x‖∞ ≤ ε, (8)

where ε denotes the maximum perturbation strength and h_a is the anchor code of the object set with the target label. Besides, given a pair of binary codes h_i and h_j, since d_H(h_i, h_j) = (1/2)(K − h_i^⊤ h_j), we can equivalently replace the Hamming distance with an inner product in the objective function. In particular, similar to deep hashing methods [2], we adopt the hyperbolic tangent (tanh) function to approximate the sign function during adversarial generation. Similar to [46], we also introduce a factor α to address the vanishing gradient problem. In summary, the overall optimization objective of the proposed method is

min_{x′} −(1/K) h_a^⊤ tanh(α f_θ(x′))  s.t.
‖x′ − x‖∞ ≤ ε, (9)

where the hyper-parameter α ∈ [0, 1] and h_a is the anchor code. The overall process of the proposed DHTA is shown in Figure 2.

Fig. 2. The pipeline of the proposed DHTA method, where the gray and orange arrows indicate forward and backward propagation, respectively. The adversarial query is generated by minimizing the loss computed from its hash code and the anchor code of the set of objects with the target label. The anchor code h_a is calculated through the component-voting scheme (i.e., an entry-wise voting process). In this toy example, h_1, h_2 and h_3 are three 4-bit hash codes of objects with the target label “Cat”.

4 Experiments

4.1 Benchmark Datasets and Evaluation Metrics

Four retrieval benchmark datasets are used in our experiments: the first two for image retrieval, the last two for video retrieval. They are described in detail below.

– ImageNet [29] consists of 1.2M training samples and 50,000 testing samples in 1,000 classes. Following [2], we build a subset containing 130K images from 100 classes. We use images from the training set as the database and images from the testing set as queries, and sample 100 images per class from the database to train the deep hashing model.
– NUS-WIDE [7] contains 269,648 images from 81 classes. We select only the subset of images with the 20 most frequent labels. As suggested in [50], we randomly sample 5,000 images as the query set and take the remaining images as the database. Besides, we randomly sample 10,000 images from the database to train the hashing model.
– JHMDB [17] consists of 928 videos in 21 categories.
We randomly choose 10 videos per category as queries, 10 videos per category as training samples, and the rest as the retrieval database.
– UCF-101 [34] is an action recognition dataset containing 13,320 videos in 101 classes. We use 30 videos per category for training, 30 videos per category for querying, and the remaining 7,260 videos as the database.

Table 1. t-MAP (%) of targeted attack methods and MAP (%) of query with benign objects (Original) with various code lengths on two image datasets.

Method    Metric | ImageNet: 16bits 32bits 48bits 64bits | NUS-WIDE: 16bits 32bits 48bits 64bits
Original  t-MAP  |   3.80   1.36   1.64   1.98           |  37.62  36.03  38.32  38.69
Noise     t-MAP  |   3.29   1.24   1.89   2.10           |  37.34  36.15  38.25  38.57
P2P       t-MAP  |  44.35  58.32  62.50  65.61           |  75.45  78.59  81.40  81.28
DHTA      t-MAP  |  63.68  77.76  82.31  82.10           |  82.35  85.66  86.80  88.84
Original  MAP    |  51.02  62.70  67.80  70.11           |  76.93  80.37  82.06  81.62

Fig. 3. Precision-recall and precision curves under 48-bit code length in image retrieval. The P2P attack and DHTA are evaluated based on the target label, while the result of ‘Original’ is calculated based on the label of the query object.

For the evaluation of targeted attacks, we define the targeted mean average precision (t-MAP) as the evaluation metric, analogous to the mean average precision (MAP) widely used in information retrieval [51]. Specifically, the reference label for t-MAP is the target label instead of the original label of the query object as in MAP. The higher the t-MAP, the better the targeted attack performance. In image hashing, we evaluate t-MAP on the top 5,000 and 1,000 retrieved images on NUS-WIDE and ImageNet, respectively.
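t-MAP follows the standard average-precision computation, with relevance judged against the target label rather than the query's original label. A minimal sketch under one common AP convention (precision averaged over the ranks of relevant hits; our own illustration with hypothetical labels):

```python
def average_precision(ranked_labels, reference_label):
    """AP of a ranked retrieval list against a reference label.

    For t-MAP the reference label is the attack's target label;
    for ordinary MAP it is the query's original label.
    """
    hits = 0
    precisions = []
    for k, label in enumerate(ranked_labels, start=1):
        if label == reference_label:
            hits += 1
            precisions.append(hits / k)  # precision@k at each relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical top-4 retrieval result, scored against target label 'cat'.
ap = average_precision(['cat', 'dog', 'cat', 'dog'], 'cat')
```

Hits at ranks 1 and 3 give precisions 1/1 and 2/3, so the AP here is 5/6; t-MAP averages this quantity over all adversarial queries.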
We evaluate t-MAP on all retrieved videos in video hashing. Besides, we also present the precision-recall curves (PR curves) of different methods for a more comprehensive comparison.

4.2 Overall Results on Image Retrieval

Evaluation Setup. For image hashing, we adopt VGG-11 [32], pre-trained on ImageNet, as the backbone network to extract features, and replace the last fully-connected layer of the softmax classifier with the hashing layer. The detailed settings for training the image hashing models are presented in Appendix B. For each dataset, we randomly select 100 samples from the query set as benign queries to evaluate the attack performance. For each generation, we randomly select a target label different from the label of the query. When generating an anchor code, we randomly sample images from the database objects with the target label to form the hash code set. For all adversarial examples, the perturbation magnitude ε on normalized data and n_t are set to 0.032 and 9, respectively. Stochastic gradient descent (SGD) [48] is adopted to optimize the proposed attack. We attack image hashing models with learning rate 1, and the number of iterations is set to 2,000. Following [46], the parameter α is set to 0.1 during the first 1,000 iterations, and is updated every 200 iterations through [0.2, 0.3, 0.5, 0.7, 1.0] during the last 1,000 iterations. We compare DHTA with the targeted attack under the P2P paradigm [36], which corresponds to DHTA with n_t = 1.

Fig. 4. An example of image retrieval with a benign query and its corresponding adversarial query on ImageNet (benign query label ‘pencil box’; target ‘otter’). Retrieved objects with top-10 similarity are shown. The ticks and crosses indicate whether each retrieved object is consistent with the desired label (the original label for the benign query and the target label for the adversarial query).
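The attack optimization can be sketched as projected gradient descent on the surrogate loss of Eq. (9). This is our own toy illustration, not the authors' code: a fixed linear map and a constant α stand in for the deep network f_θ and the staged α schedule described above:

```python
import numpy as np

def dhta_sketch(x, W, h_a, eps, alpha=0.1, lr=1.0, iters=200):
    """Minimize -(1/K) h_a . tanh(alpha * W x') s.t. ||x' - x||_inf <= eps.

    W is a toy stand-in for f_theta; the real attack backpropagates through
    a CNN (image hashing) or a CNN stacked with an LSTM (video hashing).
    """
    K = len(h_a)
    x_adv = x.copy()
    for _ in range(iters):
        z = alpha * (W @ x_adv)
        # gradient of the surrogate loss with respect to x_adv
        grad = -(alpha / K) * (W.T @ (h_a * (1.0 - np.tanh(z) ** 2)))
        x_adv = x_adv - lr * grad                  # gradient descent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the l_inf ball
    return x_adv

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))                   # hypothetical 32-d input -> 16-bit code
x = rng.normal(size=32)                         # hypothetical benign query
h_a = np.where(rng.normal(size=16) > 0, 1, -1)  # hypothetical anchor code

x_adv = dhta_sketch(x, W, h_a, eps=0.1)
surrogate = lambda v: -np.mean(h_a * np.tanh(0.1 * (W @ v)))
```

Driving the surrogate loss down pushes tanh(α f_θ(x′)) toward agreement with the anchor code, and hence the resulting hash code sign(f_θ(x′)) toward small Hamming distance from h_a, while the projection keeps the perturbation within the ε-ball.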
We also report the t-MAP of images with additive noise sampled from the uniform distribution U(−ε, +ε).

Results. The general attack performance of the different methods is shown in Table 1. The t-MAP values of queries with benign objects (dubbed Original) or with noisy objects (dubbed Noise) are relatively small on both the ImageNet and NUS-WIDE datasets. On ImageNet in particular, the t-MAP values of these two baselines are close to 0, which indicates that queries with benign or noisy images cannot retrieve objects with the target labels. In contrast, the designed targeted attack methods (i.e., P2P and DHTA) significantly improve the t-MAP values. For example, compared with the t-MAP of the benign query on ImageNet, the improvement of the P2P method is over 40% in all cases; under the relatively large code length (64 bits), the improvement even reaches 63%. Between the two targeted attacks, the proposed DHTA achieves the best performance: compared with P2P, the t-MAP improvement of DHTA is over 16% (usually over 19%) in all cases on the ImageNet dataset. Moreover, the t-MAP values of the targeted attacks increase with the number of bits, probably because of the extra information carried by longer codes. In particular, an interesting phenomenon is that the t-MAP of DHTA is even significantly higher than the MAP of ‘Original’, which suggests that the attack performance of DHTA is, to some extent, not limited by the performance of the original hashing model (i.e., the threat model). An example of querying with a benign image and with an adversarial image is displayed in Figure 4. Furthermore, we also provide the precision-recall and precision curves for a more comprehensive comparison. As shown in Figure 3, the curves of DHTA are always above all other curves, which demonstrates that DHTA outperforms all other methods.

Table 2.
t-MAP (%) of targeted attack methods and MAP (%) of query with benign objects (Original) with various code lengths on two video datasets.

Method    Metric | JHMDB: 16bits 32bits 48bits 64bits | UCF-101: 16bits 32bits 48bits 64bits
Original  t-MAP  |   6.73   6.26   6.48   6.89        |   1.69   1.67   1.79   1.86
Noise     t-MAP  |   6.67   6.13   6.50   6.94        |   1.69   1.72   1.87   1.85
P2P       t-MAP  |  39.67  42.37  44.78  44.38        |  55.57  53.49  55.27  51.88
DHTA      t-MAP  |  56.47  62.04  63.02  66.06        |  67.84  66.18  69.72  67.83
Original  MAP    |  35.18  42.46  45.80  45.50        |  55.16  55.25  56.56  56.79

Fig. 5. Precision-recall and precision curves under 48-bit code length in video retrieval. The P2P attack and DHTA are evaluated based on the target label, while the result of the ‘Original’ method is calculated based on the label of the query object.

4.3 Overall Results on Video Retrieval

Evaluation Setup. Following the model architectures of state-of-the-art deep video retrieval methods [14,33,23], we adopt AlexNet [18] to extract spatial features and an LSTM [15] to fuse the temporal information. The detailed settings for training the video hashing model are presented in Appendix B. For attacking video hashing, the number of iterations is 500 and the parameter α is fixed at 0.1. Other settings are the same as those used in Section 4.2.

Results. The attack performance in video retrieval is shown in Table 2. Similar to the image scenario, queries with benign or noisy videos cannot retrieve objects with the target label and thus fail to attack the deep hashing based retrieval. In contrast, deep hashing based video retrieval can be easily attacked by the designed targeted attacks, especially the DHTA proposed in this paper.
For example, the t-MAP of DHTA is about 59% higher than that of the query with benign videos, and about 21% higher than that of the P2P attack paradigm, on the JHMDB dataset with 64-bit codes. The precision-recall and precision curves also verify the superiority of DHTA over the other methods, as shown in Figure 5. On the JHMDB dataset in particular, there is a significantly large gap between the PR curve of DHTA and those of the other methods. In addition, the t-MAP of DHTA is again significantly larger than the MAP of the benign query (the ‘Original’). An example of querying with a benign video and an adversarial video is displayed in Figure 6.

Fig. 6. An example of video retrieval with a benign query and its corresponding adversarial query on JHMDB (benign query label ‘golf’; target ‘wave’). Retrieved objects with top-10 similarity are shown. The ticks and crosses indicate whether each retrieved object is consistent with the desired label (the original label for the benign query and the target label for the adversarial query).

Fig. 7. t-MAP (%) of DHTA and MAP (%) of query with benign objects (‘Original’) with different n_t and code lengths on ImageNet and JHMDB.

4.4 Discussion

Effect of n_t. To analyze the effect of the size of the object set used to generate the anchor code (i.e., n_t), we evaluate the t-MAP of DHTA for n_t ∈ {1, 3, 5, 7, 9, 11, 13}. Other settings are the same as those used in Sections 4.2-4.3. We use ImageNet and JHMDB as representatives for this analysis. As shown in Figure 7, the t-MAP value increases with n_t under different code lengths. The MAP of the corresponding query with benign objects (i.e.
the ‘Original’) can be regarded as a reference for the retrieval performance. We observe that the t-MAP is higher than the MAP of the corresponding ‘Original’ in all cases when n_t ≥ 3. In other words, DHTA still performs satisfactorily with a relatively small n_t. This advantage is critical for attackers, since the bigger n_t is, the higher the cost of data collection and adversarial generation for an attack. It is worth noting that the attack performance degrades significantly when n_t = 1, which corresponds exactly to the P2P attack paradigm.

Table 3. t-MAP (%) of DHTA with different numbers of iterations on ImageNet.

Iterations | 16bits 32bits 48bits 64bits
100        | 52.99  66.29  70.65  72.43
500        | 55.18  68.30  74.47  76.15
1000       | 56.96  68.36  75.03  76.25
1500       | 62.81  74.11  79.28  78.71
2000       | 63.68  77.76  82.31  82.10

Table 4. t-MAP (%) of DHTA with different numbers of iterations on JHMDB.

Iterations | 16bits 32bits 48bits 64bits
10         | 28.51  23.88  22.84  23.21
50         | 48.69  48.18  47.01  48.97
100        | 53.21  54.91  55.94  58.28
500        | 56.47  62.04  63.02  66.06

Table 5. MAP (%) of different methods on ImageNet and JHMDB. The best results are marked with boldface, while the second best results are marked with underline.

Method   | ImageNet: 16bits 32bits 48bits 64bits | JHMDB: 16bits 32bits 48bits 64bits
Original |  51.02  62.70  67.80  70.11           |  35.18  42.46  45.80  45.50
Noise    |  50.94  62.52  66.69  69.85           |  35.04  42.15  45.67  45.63
P2P      |   3.36   2.48   2.45   3.93           |   7.71   8.20   8.14  10.19
HAG      |   1.88   4.96   3.89   2.34           |   3.52   3.58   3.42   3.34
DHTA     |   0.54   5.64   2.30   1.70           |   6.76   7.23   6.56   7.55

Effect of the Number of Iterations. Tables 3-4 present the t-MAP of DHTA with different numbers of iterations on the ImageNet and JHMDB datasets. Except for the number of iterations, the settings are the same as those used in Sections 4.2-4.3. As expected, the t-MAP values increase with the number of iterations. Even with relatively few iterations, the proposed DHTA still achieves satisfying performance: for example, with 100 iterations the t-MAP values are over 50% under all code lengths.
Especially on the ImageNet dataset, the t-MAP is over 70% with relatively large code lengths (≥48 bits). These results consistently verify the high efficiency of our DHTA method.

Evaluation from the Perspective of Non-targeted Attack. A targeted attack can be regarded as a special non-targeted attack, since the target label usually differs from the label of the query object. In this part, we compare the targeted attacks (P2P and DHTA) with other methods, including additive noise and HAG [46] (the state-of-the-art non-targeted attack), in the non-targeted attack scenario. The MAP results of the different methods are reported in Table 5; the lower the MAP, the better the non-targeted attack performance. As shown in the table, although the targeted attacks are not designed for the non-targeted scenario, they still achieve competitive performance. For example, the MAP values of DHTA are more than 50% smaller than those of ‘Original’ under all code lengths on ImageNet. The proposed DHTA even achieves better non-targeted attack performance (i.e., smaller MAP) than HAG on ImageNet in most cases.

Perceptibility. Besides the attack performance, the perceptibility of the adversarial perturbations is also important. Following the setting suggested in [35,37], given a benign query x, the perceptibility of its corresponding adversarial query x′ is defined as √((1/n)‖x′ − x‖₂²), where n is the size of the object and pixel values are scaled to be in the range [0, 1]. For each dataset, we calculate the average perceptibility over all generated adversarial objects.

Fig. 8. Visualization examples of generated adversarial examples in image hashing: (a) ImageNet, (b) NUS-WIDE. Per-example perceptibility values (roughly 8-9 × 10⁻³) are shown in the original figure.
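The perceptibility measure above is simply the root-mean-square pixel difference; a one-function sketch:

```python
import numpy as np

def perceptibility(x, x_adv):
    """sqrt((1/n) * ||x' - x||_2^2): RMS difference over the n pixels,
    with pixel values scaled to [0, 1]."""
    diff = (np.asarray(x_adv) - np.asarray(x)).ravel()
    return float(np.sqrt(np.mean(diff ** 2)))
```

For a uniform per-pixel shift of 0.01 this evaluates to exactly 0.01, so values around 8 × 10⁻³ correspond to sub-1% average pixel changes.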
The average perceptibility on the ImageNet and NUS-WIDE datasets is 8.35 × 10−3 and 9.07 × 10−3, respectively. In video retrieval tasks, it is 5.81 × 10−3 on JHMDB and 7.72 × 10−3 on UCF-101. These results indicate that the adversarial queries are very similar to their original versions. Some adversarial images are shown in Figure 8, while examples for video retrieval are shown in Appendix C.

4.5 Open-set Targeted Attack

Evaluation Setup. In the above experiments, the target label is selected from those of the training set. In this section, we use the ImageNet dataset as an example to further evaluate the proposed DHTA under a tougher open-set scenario, where an out-of-sample class is assigned as the target label. This setting is more realistic, since the attacker may not be able to access the training set of the attacked deep hashing model. For example, the deep hashing model may be downloaded from a third-party open-source platform where the training set is unavailable. Specifically, we randomly select 10 additional classes different from those used for training the deep hashing model in Section 4.1. Images from these 10 additional classes are treated as an open set for our evaluation. When generating the anchor code of objects with the target label (within the open set), we keep the deep hashing model trained on the previous 100 classes.

Table 6. t-MAP (%) of DHTA with out-of-sample target labels on ImageNet.

Method         16bits  32bits  48bits  64bits
DHTA (nt = 5)  33.67   46.34   48.91   48.27
DHTA (nt = 7)  34.77   50.92   51.68   49.18
DHTA (nt = 9)  37.34   54.13   55.12   52.17
DHTA (nt = 11) 38.00   54.05   56.93   54.12

Results. As shown in Table 6, DHTA still has a certain attack effect even when the target label is out-of-sample. In particular, when nt and the code length become larger, the t-MAP values of DHTA are over 50%.
This phenomenon may suggest that the learned feature extractor captures some useful low-level features, which map objects of the same class to nearby locations in Hamming space, regardless of whether the class was seen during training. In addition, the attack performance also increases with nt and the code length. We leave further discussion and deeper insights into this phenomenon to future work.

5 Conclusions

In this paper, we explore the landscape of the targeted attack for deep hashing based retrieval. Based on the characteristics of the retrieval task, we formulate the attack as a point-to-set optimization, which minimizes the average distance between the hash code of the adversarial example and those of a set of objects with the target label. Theoretically, we propose a component-voting scheme to obtain the optimal representative, the anchor code, for the code set of the point-to-set optimization. Based on the anchor code, we propose a novel targeted attack method, DHTA, which balances performance and perceptibility by minimizing the Hamming distance between the hash code of the adversarial example and the anchor code under an ℓ∞ restriction on the adversarial perturbation. Extensive experiments verify the effectiveness of DHTA in attacking both deep hashing based image retrieval and video retrieval.

References

1. Cao, Y., Long, M., Liu, B., Wang, J.: Deep cauchy hashing for hamming space retrieval. In: CVPR (2018)
2. Cao, Z., Long, M., Wang, J., Yu, P.S.: Hashnet: Deep learning to hash by continuation. In: ICCV (2017)
3. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: IEEE S&P (2017)
4. Carlini, N., Wagner, D.: Audio adversarial examples: Targeted attacks on speech-to-text. In: IEEE S&P Workshops (2018)
5. Chen, Y., Lai, Z., Ding, Y., Lin, K., Wong, W.K.: Deep supervised hashing with anchor graph. In: CVPR (2019)
6.
Chen, Z., Yuan, X., Lu, J., Tian, Q., Zhou, J.: Deep hashing via discrepancy minimization. In: CVPR (2018)
7. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: Nus-wide: A real-world web image database from national university of singapore. In: ICMR (2009)
8. Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J.: Boosting adversarial attacks with momentum. In: CVPR (2018)
9. Dong, Y., Su, H., Wu, B., Li, Z., Liu, W., Zhang, T., Zhu, J.: Efficient decision-based black-box adversarial attacks on face recognition. In: CVPR (2019)
10. Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., Song, D.: Robust physical-world attacks on deep learning visual classification. In: CVPR (2018)
11. Feng, Y., Chen, B., Dai, T., Xia, S.t.: Adversarial attack on deep product quantization network for image retrieval. In: AAAI (2020)
12. Ge, T., He, K., Sun, J.: Graph cuts for supervised binary coding. In: ECCV (2014)
13. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: ICLR (2015)
14. Gu, Y., Ma, C., Yang, J.: Supervised recurrent hashing for large scale video retrieval. In: ACM MM (2016)
15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
16. Hu, D., Nie, F., Li, X.: Deep binary reconstruction for cross-modal hashing. IEEE Transactions on Multimedia 21(4), 973–985 (2018)
17. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: ICCV (2013)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NeurIPS (2012)
19. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. In: ICLR (2017)
20. Lai, H., Pan, Y., Liu, Y., Yan, S.: Simultaneous feature learning and hash coding with deep neural networks. In: CVPR (2015)
21.
Li, J., Ji, R., Liu, H., Hong, X., Gao, Y., Tian, Q.: Universal perturbation attack against image retrieval. In: ICCV (2019)
22. Li, P., Wang, M., Cheng, J., Xu, C., Lu, H.: Spectral hashing with semantically consistent graph for image indexing. IEEE Transactions on Multimedia 15(1), 141–152 (2012)
23. Li, S., Chen, Z., Lu, J., Li, X., Zhou, J.: Neighborhood preserving hashing for scalable video retrieval. In: ICCV (2019)
24. Liong, V.E., Lu, J., Tan, Y.P., Zhou, J.: Deep video hashing. IEEE Transactions on Multimedia 19(6), 1209–1219 (2016)
25. Liu, H., Wang, R., Shan, S., Chen, X.: Deep supervised hashing for fast image retrieval. In: CVPR (2016)
26. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: A simple and accurate method to fool deep neural networks. In: CVPR (2016)
27. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
28. Qin, Y., Carlini, N., Cottrell, G., Goodfellow, I., Raffel, C.: Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In: ICML (2019)
29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)
30. Shen, F., Shen, C., Liu, W., Tao Shen, H.: Supervised discrete hashing. In: CVPR (2015)
31. Shen, F., Xu, Y., Liu, L., Yang, Y., Huang, Z., Shen, H.T.: Unsupervised deep hashing with similarity-adaptive and discrete optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), 3034–3044 (2018)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
33.
Song, J., Zhang, H., Li, X., Gao, L., Wang, M., Hong, R.: Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing 27(7), 3210–3221 (2018)
34. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
35. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: ICLR (2014)
36. Tolias, G., Radenovic, F., Chum, O.: Targeted mismatch adversarial attack: Query with a flower to retrieve the tower. In: ICCV (2019)
37. Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses. In: ICLR (2018)
38. Wang, J., Zhang, T., Sebe, N., Shen, H.T., et al.: A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 769–790 (2017)
39. Wiyatno, R.R., Xu, A.: Physical adversarial textures that fool visual object tracking. In: ICCV (2019)
40. Wu, D., Lin, Z., Li, B., Ye, M., Wang, W.: Deep supervised hashing for multi-label and large-scale image retrieval. In: ICMR (2017)
41. Wu, G., Han, J., Guo, Y., Liu, L., Ding, G., Ni, Q., Shao, L.: Unsupervised deep video hashing via balanced code for large-scale video retrieval. IEEE Transactions on Image Processing 28(4), 1993–2007 (2018)
42. Xia, R., Pan, Y., Lai, H., Liu, C., Yan, S.: Supervised hashing for image retrieval via image representation learning. In: AAAI (2014)
43. Xu, Y., Wu, B., Shen, F., Fan, Y., Zhang, Y., Shen, H.T., Liu, W.: Exact adversarial attack to image captioning via structured output learning with latent variables. In: CVPR (2019)
44. Yan, X., Zhang, L., Li, W.J.: Semi-supervised deep hashing with a bipartite graph. In: IJCAI (2017)
45.
Yang, E., Liu, T., Deng, C., Liu, W., Tao, D.: Distillhash: Unsupervised deep hashing by distilling data pairs. In: CVPR (2019)
46. Yang, E., Liu, T., Deng, C., Tao, D.: Adversarial examples for hamming space search. IEEE Transactions on Cybernetics 50(4), 1473–1484 (2018)
47. Yao, Z., Gholami, A., Xu, P., Keutzer, K., Mahoney, M.W.: Trust region based adversarial attack on neural networks. In: CVPR (2019)
48. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: ICML (2004)
49. Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multi-label image retrieval. In: CVPR (2015)
50. Zhu, H., Long, M., Wang, J., Cao, Y.: Deep hashing network for efficient similarity retrieval. In: AAAI (2016)
51. Zuva, K., Zuva, T.: Evaluation of information retrieval systems. International Journal of Computer Science & Information Technology 4(3), 35 (2012)

Appendix

A Proof of Theorem 1

Theorem 1. The anchor code $h^a$ calculated by Algorithm 1 is the binary code achieving the minimal sum of Hamming distances with respect to $h^i$, $i = 1, \ldots, n_t$, i.e.,
$$h^a = \arg\min_{h \in \{+1,-1\}^K} \sum_{i=1}^{n_t} d_H(h, h^i). \quad (10)$$

Proof. We only need to prove that for any $h \in \{+1,-1\}^K$ with $h \neq h^a$, the following inequality holds:
$$\sum_{i=1}^{n_t} d_H(h^a, h^i) \leq \sum_{i=1}^{n_t} d_H(h, h^i). \quad (11)$$

Denote by $D = \{j_1, j_2, \ldots, j_{K_0}\}$, $1 \leq K_0 \leq K$, the index set where $h$ and $h^a$ differ. Then we have
$$\begin{aligned}
\sum_{i=1}^{n_t} d_H(h^a, h^i)
&= \sum_{j \in D} \sum_{i=1}^{n_t} d_H(h^a_j, h^i_j) + \sum_{j \in \{1,\ldots,K\} \setminus D} \sum_{i=1}^{n_t} d_H(h^a_j, h^i_j) & (12) \\
&= \sum_{j \in D} \Big( n_t - \sum_{i=1}^{n_t} \mathbb{I}(h^a_j = h^i_j) \Big) + \sum_{j \in \{1,\ldots,K\} \setminus D} \Big( n_t - \sum_{i=1}^{n_t} \mathbb{I}(h^a_j = h^i_j) \Big) & (13) \\
&\overset{(a)}{\leq} \sum_{j \in D} \Big( n_t - \sum_{i=1}^{n_t} \mathbb{I}(h_j = h^i_j) \Big) + \sum_{j \in \{1,\ldots,K\} \setminus D} \Big( n_t - \sum_{i=1}^{n_t} \mathbb{I}(h_j = h^i_j) \Big) & (14) \\
&= \sum_{i=1}^{n_t} d_H(h, h^i), & (15)
\end{aligned}$$
where (a) holds since the anchor code $h^a$ is obtained through a component-wise voting process (as shown in Algorithm 1), i.e., $\forall j \in D$,
$$\sum_{i=1}^{n_t} \mathbb{I}(h^a_j = h^i_j) \geq \sum_{i=1}^{n_t} \mathbb{I}(h_j = h^i_j), \quad (16)$$
while $h^a_j = h_j$ for every $j \notin D$, so the corresponding terms are equal. □
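As a sanity check on Theorem 1, the component-voting idea can be sketched and its optimality brute-force verified on a toy case (our own implementation, not the authors' code; ties are broken toward +1 here, which Algorithm 1 may handle differently):

```python
import itertools
import numpy as np

def anchor_code(H):
    """Component-wise majority vote over hash codes H (an n_t x K
    array with entries in {+1, -1}); ties are broken toward +1."""
    H = np.asarray(H)
    return np.where(H.sum(axis=0) >= 0, 1, -1)

def hamming(a, b):
    """Hamming distance between two +/-1 codes."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

# Toy set of n_t = 3 codes of length K = 4.
H = np.array([[ 1,  1, -1,  1],
              [ 1, -1, -1,  1],
              [-1,  1, -1, -1]])
ha = anchor_code(H)

# Brute-force check of Theorem 1: no code in {+1,-1}^K achieves a
# smaller sum of Hamming distances to the set than the anchor code.
best = min(sum(hamming(h, hi) for hi in H)
           for h in itertools.product([1, -1], repeat=H.shape[1]))
assert sum(hamming(ha, hi) for hi in H) == best
```

The exhaustive search over all 2^K codes is only feasible for tiny K; the point of the theorem is that the per-component vote attains the same optimum in O(n_t · K) time.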
B Threat Models

All experiments are implemented with the PyTorch framework [27]. The detailed training settings are as follows.

Image Hashing. We adopt VGG-11 [32] pre-trained on ImageNet as the backbone network to extract features, and replace the last fully-connected layer of the softmax classifier with the hashing layer. We fine-tune the base model and train the hashing layer from scratch using the pairwise loss function in [46]. We employ stochastic gradient descent (SGD) [48] with momentum 0.9 as the optimizer. The weight decay is set to 0.0005, the learning rate is fixed at 0.01, and the batch size is 24.

Video Hashing. We extract frame features using AlexNet [18] pre-trained on the ImageNet dataset. We then employ the objective function in [24] to train an LSTM [15] with the hashing layer from scratch. The parameter in the objective function balancing the discriminative loss and the quantization loss is set to 0.0001. SGD with momentum 0.9 and a fixed learning rate of 0.05 is used to optimize the model parameters. The weight decay is set to 0.0001, the batch size is 100, and the maximum length of input videos is 40. Since the two video datasets differ in video length, we adopt different frame sampling strategies. For the JHMDB dataset, we keep all frames of videos shorter than 40 frames and the first 40 frames of longer videos. For the UCF-101 dataset, we sample frames with an equal stride (set to 3) for each video.

C Visualization

In this section, we provide some visual examples of DHTA in video hashing and in the open-set scenario.

Video Hashing. Some generated adversarial videos and their corresponding benign videos are shown in Figure 9. Due to space limitations, for each video we present frames {3, 6, 9, 12, 15, 18, 21}.
Similar to the image scenario, these results indicate that the adversarial queries are very similar to their original versions. In other words, the adversarial objects generated by the proposed DHTA are human-imperceptible.

Open-set Targeted Attack. We show generated adversarial examples and the corresponding retrieved images under the open-set scenario in Figure 10. Even though this setting is tougher, many images with the target label still appear among the top-10 retrieved images. This result indicates that our proposed DHTA can successfully fool the deep hashing model into returning objects from an out-of-sample class.

Fig. 9. Visualization examples of generated adversarial examples in video hashing: (a) JHMDB (per-example perceptibility 5.76 × 10−3 and 5.48 × 10−3), (b) UCF-101 (per-example perceptibility 6.84 × 10−3 and 8.77 × 10−3). Benign videos are shown above the corresponding adversarial videos.

Fig. 10. Examples of image retrieval with adversarial queries on ImageNet. Queries labeled ‘fig’, ‘lighter’, and ‘restaurant’ are attacked toward the target labels ‘crash helmet’, ‘valley’, and ‘coho’, respectively; all target labels are randomly selected from the out-of-sample class labels. Retrieved objects with top-10 similarity are shown on the right, with a tick or cross indicating whether the retrieved object is consistent with the target label.
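The top-10 lists in Fig. 10 come from ranking database items by Hamming distance to the query's hash code. A minimal sketch of such ranking (our own helper names, assuming codes in {+1, -1}^K):

```python
import numpy as np

def hamming_rank(query_code, db_codes, k=10):
    """Return indices of the k database items closest to the query
    in Hamming distance. For codes in {+1,-1}^K, the Hamming
    distance satisfies d_H(a, b) = (K - a.b) / 2, so ranking can
    use a single matrix-vector product."""
    K = db_codes.shape[1]
    dists = (K - db_codes @ query_code) / 2
    return np.argsort(dists, kind="stable")[:k]

# Toy database of 1000 random 32-bit codes; querying with a code
# taken from the database should rank that item at (or near) the top.
rng = np.random.default_rng(0)
db = rng.choice([1, -1], size=(1000, 32))
q = db[42].copy()
top = hamming_rank(q, db, k=10)
```

Relevance of each returned index against the target label is then what the tick/cross marks in Fig. 10 (and the t-MAP metric) capture.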