2018-15 FP
2018-15 FP
2018-15 FP
Adversarial Deformations
____________________________________________________________________________________________________
ADef: an Iterative Algorithm to Construct
Adversarial Deformations
Abstract
While deep neural networks have proven to be a powerful tool for many recognition
and classification tasks, their stability properties are still not well understood. In
the past, image classifiers have been shown to be vulnerable to so-called adversarial
attacks, which are created by additively perturbing the correctly classified image.
In this paper, we propose the ADef algorithm to construct a different kind of
adversarial attack created by iteratively applying small deformations to the image,
found through a gradient descent step. We demonstrate our results on MNIST with a
convolutional neural network and on ImageNet with Inception-v3 and ResNet-101.
1 Introduction
In the last years, deep neural networks have outperformed all previous techniques in many machine
learning tasks, including image classification. This is made possible by two main ingredients. First,
the depth of the network which is in general composed of many layers, and second, the high degree of
freedom in choosing the parameters in each affine transformation. However, the advantage brought by
these two aspects comes at a price: if the norms of the weight matrices are not small, the application
of several layers of operations results in networks that are in general unstable [25]. In other words,
two very close signals (e.g., two visually indistinguishable images) may have very different outputs,
resulting in one of them being misclassified even if the other one is correctly classified with high
confidence.
Since the first observation in [25], a lot of research has been done to investigate this issue through the
construction of adversarial examples: given a correctly classified image x, we look for an image y
which is visually indistinguishable from x but is misclassified by the network. Typically, the image y
is constructed as y = x + r, where r is an adversarial perturbation that is supposed to be small in a
suitable sense (normally, with respect to an ℓp norm). Several algorithms have been developed to
construct adversarial perturbations, see [9, 18, 13, 16, 6].
Even though such pathological cases are very unlikely to occur in practice, their existence is relevant
since malicious attackers may exploit this drawback to fool classifiers or other automatic systems.
Further, adversarial perturbations may be constructed in a black-box setting (i.e., without knowing
the architecture of the DNN but only its outputs) [19, 17] and also in the physical world [13, 3, 4, 23].
This has motivated the investigation of defenses, i.e., how to make the network invulnerable to such
attacks, see [14, 5, 16, 26, 12, 20, 2, 11]. In most cases, adversarial examples are artificially created
and then used to retrain the network, which becomes more stable under these types of perturbations.
Most of the work on the construction of adversarial examples and on the design of defense strategies
has been conducted in the context of small perturbations r measured in the ℓ∞ norm. However, this is
not necessarily a good measure of image similarity: e.g., for two translated images x and y, the norm
of x − y is not small in general, even though x and y will look indistinguishable if the translation is
2 Adversarial perturbations
Let K be a classifier of images consisting of P pixels into L ≥ 2 categories, i.e. a function from the
space of images X = RcP , where c = 1 (for grayscale images) or c = 3 (for color images), and into
the set of labels L = {1, . . . , L}. Suppose x ∈ X is an image that is correctly classified by K and
suppose y ∈ X is another image that is imperceptible from x and such that K(y) 6= K(x), then y is
said to be an adversarial example. The meaning of imperceptibility varies, but generally, proximity
in ℓp -norm (with 1 ≤ p ≤ ∞) is considered to be a sufficient substitute. Thus, an adversarial
perturbation for an image x ∈ X is a vector r ∈ X such that K(x + r) 6= K(x) and krkp is small,
where
1/p
cP
p
X
krkp = |rj | if 1 ≤ p < ∞, and krk∞ = max |rj | . (1)
j=1,...,cP
j=1
Given such a classifier K and an image x, an adversary may attempt to find an adversarial example y
by minimizing kx − ykp subject to K(y) 6= K(x), or even subject to K(y) = k for some target label
k 6= K(x).
Different methods for finding minimal adversarial perturbations have been proposed, most notably
FGSM [9] and PGD [16] for ℓ∞ , and the DeepFool algorithm [18] for general ℓp -norms. We shall
now briefly describe the DeepFool algorithm for p = 2, as it provides useful insight for the remainder
of the present work.
Let F = (F1 , . . . , FL ) : X → RL be the underlying model for the classifier K, such that
K(x) = arg max Fk (x).
k=1,...,L
Let x ∈ X be the image of interest with assigned label l = K(x). Let k 6= l in L be a target label.
For the moment, our goal will be to find an image y ∈ X such that Fk (y) ≥ Fl (y) and with minimal
kx − yk2 .
Define the function f = Fk − Fl : X → R. Then f (x) < 0 and our goal is accomplished if we find
an image y, close to x, with f (y) ≥ 0. Assuming that f is differentiable at x, we can consider the
linear approximation
f (x + r) ≈ f (x) + r · ∇f (x)
for small enough r ∈ X. If ∇f (x) is non-zero, then with
f (x)
r=− 2 ∇f (x),
k∇f (x)k2
2
we get
f (x)
f (x) + r · ∇f (x) = f (x) − 2 ∇f (x) · ∇f (x) = 0.
k∇f (x)k2
Thus, if krk2 = −f (x)/ k∇f (x)k2 is small enough, then
f (x + r) ≈ 0,
and the classifier K has approximately equal confidence for the image y = x + r to have either label
l or k, bringing us near to our goal.
The DeepFool algorithm works by iterating the above approximation, setting x(0) = x and x(n) =
x(n−1) + r(n) with
Fk (x(n) ) − Fl (x(n) )
r(n) = − 2 ∇F k (x (n)
) − ∇F l (x (n)
)
∇Fk (x(n) ) − ∇Fl (x(n) ) 2
and allowing the target label k to change in each iteration to minimize the norm of r(n) . The process
is terminated when K(x(n) ) 6= l and the algorithm outputs an adversarial example y = x(n) =
x + r(1) + . . . + r(n) . The process may also converge to a decision boundary of the classifier, in
which case the total perturbation r∗ = r(1) + . . . + r(n) can be multiplied by a factor 1 + η where η
is some small positive real number, to ensure that y = x + (1 + η)r∗ lies on the opposite side of the
boundary from x.
Digital images have pixel values in a bounded range such as from 0 to 255. Perturbing an image with
an algorithm such as DeepFool does not guarantee that the resulting vector is a valid image with pixel
values constrained to that range. Hence, an adversarial example from DeepFool may well lie outside
the region in X on which the classifier has been trained and can realistically receive inputs. In [1],
we discuss the impact on the success rate of the DeepFool algorithm when the perturbed image is
projected on the set of valid images.
3 Adversarial deformations
We now make clear what is meant by a deformation of an image. It simplifies the discussion to model
images as functions ξ : [0, 1]2 → Rc (with c = 1 or c = 3) instead of discrete vectors x in RcP .
In this setting, perturbing an image ξ : [0, 1]2 → Rc corresponds to adding to it another function
ρ : [0, 1]2 → Rc with a small Lp -norm, defined analogously to the discrete case (1) by
Z !1/p
p
kρkLp = kρ(u)kp du if 1 ≤ p < ∞, and kρkL∞ = ess sup kρ(u)k∞ . (2)
[0,1]2 u∈[0,1]2
3
Figure 1: First row: The original 28 × 28 pixel image from the MNIST database, and the same
image translated by (−2, 1), rotated by an angle of 10◦ , and deformed w.r.t. an arbitrary smooth
vector field τ . The ℓ∞ -norm of the corresponding perturbation is shown under each deformed image.
The pixel values range from 0 (white) to 1 (black), so the deformed images all lie far from the original
image in the l∞ -norm. Second row: The vector fields corresponding to the above deformations and
their T -norms.
the set T of vector fields that do not move points on the grid {1, . . . , W }2 outside of [1, W ]2 . More
precisely,
T := τ : {1, . . . , W }2 → R2 | τ (s, t) + (s, t) ∈ [1, W ]2 for all s, t ∈ {1, . . . , W } .
as a proxy. The ℓp -norms defined in (1), adapted to vector fields, can be used as well. (We remark,
however, that none of these norms define a distance between x and xτ , since two vector fields
τ, σ ∈ T with kτ kT 6= kσkT may produce the same deformed image xτ = xσ .)
As in section 2, we let K : X → L be a classifier and let F = (F1 , . . . , FL ) : X → RL be the
underlying model, such that
K(x) = arg max Fk (x).
k=1,...,L
Let x ∈ X be an image and fix ξ : [0, 1]2 → Rc obtained by interpolation from x, such that
xi,s,t = ξi (s/(W +1), t/(W +1)) for i = 1, . . . , c and s, t = 1, . . . , W . Let l = K(x) denote the true
label of x, let k ∈ L be a target label and set f = Fk − Fl . We assume that x does not lie on a
decision boundary, so that we have f (x) < 0.
4
We define the function g : T → R, τ 7→ f (xτ ) and note that g(0) = f (x0 ) = f (x) < 0. Our goal
is to find a small vector field τ ∈ T such that g(τ ) = f (xτ ) ≥ 0. Under suitable assumptions on
ξ and f , the derivative of g at 0, D0 g : X → R, is expressible in terms of the known quantities
∇f (x) and ∇ξ1 , . . . , ∇ξc . Here, ξ1 , . . . , ξc are the color components of ξ and their derivatives can
be approximated by finite differences. Similarly to the derivation in section 2, we can use a linear
approximation of g around the zero vector field as a guide:
g(τ ) ≈ g(0) + (D0 g) τ (5)
for small enough τ ∈ T . Hence, if τ is a vector field such that
(D0 g) τ = −g(0) (6)
and kτ kT is small, then the classifier K has approximately equal confidence for the deformed image
xτ to have either label l or k.
This is a scalar equation with unknown in T , and so has infinitely many solutions. In order to
select τ with small norm, we solve it in the least-squares sense. As derived in [1], we have that
PW
(D0 g) τ = s,t=1 α(s, t) · τ (s, t), where the discrete vector field α : {1, . . . , W }2 → R2 is given
by
c
1 X s t
α(s, t) = ∇f (x) i,s,t · ∇ξi , . (7)
W + 1 i=1 W +1 W +1
Let
W
2
X
A= kα(s, t)k2 = kαk2ℓ2
s,t=1
be the sum of the squares of the entries of α = (α1 , α2 ). Then, if A 6= 0, the discrete vector field
τ : {1, . . . , W }2 → R2 , chosen as
f (x)
τ =− α, (8)
A
corresponds to a solution of equation (6). Finally, we define the deformed image xτ ∈ X according
to (3).
One might like to impose some degree of smoothness on the deforming vector field. In fact, it suffices
to search in the range of a smoothing operator S : T → T . However, this essentially amounts to
applying S to the solution from the larger search space T . Let α̃ = Sα = (ϕ ∗ α1 , ϕ ∗ α2 ), where
S denotes the componentwise application of a two-dimensional Gaussian filter ϕ (of any standard
deviation), and let à = kα̃k2ℓ2 . Then the vector field
f (x) f (x) 2
τ̃ = − S α̃ = − S α
à Ã
also satisfies (6), since S is self-adjoint. We can hence replace τ by τ̃ to obtain a smooth deformation
of the image x.
We proceed as in section 2, and iterate the deformation process until the deformed image is misclassi-
fied. More explicitly, let x(0) = x and for n ≥ 1 let τ (n) solve (8) for x(n−1) . Then we can define
the iteration as x(n) = x(n−1) ◦ (id + τ (n) ). The algorithm terminates and outputs an adversarial
example y = x(n) if K(x(n) ) 6= l. The iteration also terminates if x(n) lies on a decision boundary
of K, in which case we propose to introduce an overshoot factor 1 + η on the total deforming
vector field. Provided that the number of iterations is moderate, the total vector field can be well
approximated by τ ∗ = τ (1) + · · · + τ (n) and the process can be altered to output the deformed image
y = x ◦ (id + (1 + η)τ ∗ ) instead.
The target label k may be chosen in each iteration to minimize the vector field to obtain a better
approximation in the linearization (5). More precisely, for a candidate set of labels k1 , . . . , km , we
compute the corresponding vectors fields τ1 , . . . , τm and select
k = arg min kτj kT .
j=1,...,m
The candidate set consists of the labels corresponding to the indices of the m smallest entries of
F − Fl , in absolute value.
5
Algorithm ADef
Input: Classification model F , image x, correct label l, candidate labels k1 , . . . , km
Output: Deformed image y
Initialize y ← x
while K(y) = l do
for j = 1, P
. . . , m do
c
αj ← i=1 (∇Fkj )i − (∇Fl )i · ∇yi
Fkj (y)−Fl (y) 2
τj ← − kSαk22
S αj
ℓ
end for
i ← arg minj=1,...,m kτj kT
y ← y ◦ (id + τi )
end while
return y
By equation (7), given that ∇f is moderate, the deforming vector field takes small values wherever
ξ has a small derivative. This means that the vector field will be concentrated on the edges in the
image x (see e.g. the first row of figure 3). Further, note that the result of a deformation is always a
valid image in the sense that it does not violate the pixel value bounds. This is not guaranteed for the
perturbations computed with DeepFool (cf. Section 2).
4 Experiments
We evaluate the performance of ADef by applying the algorithm to classifiers trained on the MNIST
[15] and ImageNet [22] datasets. Below, we briefly describe the setup of the experiments and in table
1 we summarize their results.
MNIST: We train a four-layer convolutional neural network. The first layer consists of 32 convo-
lutional filters of size 5 × 5, which are followed by 2 × 2 max-pooling and a rectifier activation
function. The second layer has 64 filters and is otherwise identical to the first. The third layer is a
fully connected layer into dimension 1024 with a rectifier activation function, and the final layer is
linear with output dimension 10. We shall call this model MNIST-CNN. We use ADef to produce
adversarial deformations of the images in the test set. The algorithm is configured to pursue any label
different from the correct label (all incorrect labels are candidate labels). It performs smoothing by a
Gaussian filter of standard deviation 1/2, and it overshoots by η = 2/10 whenever it converges to a
decision boundary.
ImageNet: We apply ADef to pretrained Inception-v3 [24] and ResNet-101 [10] models to generate
adversarial deformations for the images in the ILSVRC2012 validation set. The images are prepro-
cessed by first scaling so that the smaller axis has 299 pixels for the Inception model and 224 pixels
for ResNet, and then they are center-cropped to a square image. The algorithm is set to focus only on
the label of second highest probability. It employs a Gaussian filter of standard deviation 1 and an
overshoot factor η = 1/10.
We only consider inputs that are correctly classified by the model in question, and, since τ ∗ =
τ (1) + · · · + τ (n) approximates the total deforming vector field, we declare ADef to be successful if its
output is misclassified and kτ ∗ kT ≤ ε, where we choose ε = 3. Observe that, by (4), a deformation
with respect to a vector field τ does not displace any pixel further away from its original position
than kτ kT . Hence, for high resolution images, the choice ε = 3 indeed produces small deformations
if the vector fields are smooth. From figure 2, we can see how the success rate of ADef depends on
the choice of ε.
When searching for an adversarial example, one usually searches for a perturbation with ℓ∞ -norm
smaller than some small number ε > 0. Common choices of ε range from 1/10 to 3/10 for MNIST
classifiers [9, 16, 12, 26, 11] and 2/255 to 16/255 for ImageNet classifiers [9, 14, 26, 11]. Table 1 shows
that on average, the perturbations obtained by ADef are quite large compared to those constraints.
However, as can be seen in figure 3, the relatively high resolution images of the ImageNet dataset
can be deformed into adversarial examples that, while corresponding to large perturbations, are not
visibly different to the original images.
6
Model Accuracy ADef success Avg. kτ ∗ kT Avg krk∞ Avg. # iterations
MNIST-CNN 99.41% 99.90% 1.0864 0.6927 9.779
Inception-v3 77.56% 98.94% 0.5984 0.2039 4.050
ResNet-101 76.97% 99.78% 0.5561 0.1882 4.176
Table 1: The results of applying ADef to the images in the MNIST test set (first row) and the
ILSVRC2012 validation set (second and third rows). The accuracy of the Inception and ResNet
models is defined as the top-1 accuracy on the center-cropped and resized images. The success rate of
ADef is shown as a percentage of the correctly classified inputs. The pixel range is scaled to [0, 1], so
the perturbation r = y − x, where x is the input and y the output of ADef, has values in [−1, 1]. The
averages in the three last columns are computed over the set of images on which ADef is successful.
Figure 2: The (normalized) distribution of kτ ∗ kT from the experiment. Deformations that fall to the
left of the vertical line at ε = 3 are considered successful.
Figure 3: Sample deformations for the Inception-v3 model. The vector fields and perturbations have
been amplified for visualization. First row: An image from the ILSVRC2012 validation set, the
output of ADef with a Gaussian filter of standard deviation 1, the corresponding vector field and
perturbation. The rightmost image is a close-up of the vector field around the nose of the ape. Second
row: A larger deformation of the same image, obtained by using a wider Gaussian filter (standard
deviation 6) for smoothing.
7
Figure 4: First row: The original image and deformed images produced by restricting ADef to the
target labels 0 to 8. The ℓ∞ -norms of the corresponding perturbations are shown under the deformed
images. Second row: The vector fields corresponding to the deformations and their T -norms.
Figure 5: Untargeted vs. targeted attack on the ResNet-101 model. An image from the ILSVRC2012
validation set deformed to the labels of second highest (first row) and lowest (second row) probabilities
(out of 1,000) for the original image. The vector fields and perturbations have been amplified for
visualization.
ADef can also be used for targeted adversarial attacks, by restricting the deformed image to have a
particular target label instead of any label which yields the optimal deformation. Figure 4 demonstrates
the effect of choosing different target labels for a given MNIST image, and figure 5 shows the result
of targeting the label of lowest probability for an image from the ImageNet dataset.
5 Conclusion
In this work, we proposed a new efficient algorithm, ADef, to construct a new type of adversarial
attacks for DNN image classifiers. The procedure is iterative and in each iteration takes a gradient
descent step to deform the previous iterate in order to push to a decision boundary.
We demonstrated that with almost imperceptible deformations, state-of-the art classifiers can be
fooled to misclassify with a high success rate of ADef. This suggests that networks are vulnerable
to different types of attacks and that simply training the network on a specific class of adversarial
examples might not form a sufficient defense strategy.
Acknowledgments. The authors would like to thank Helmut Bölcskei and Thomas Wiatowski for
fruitful discussions.
8
References
[1] R. Alaifari, G. S. Alberti, and T. Gauksson. Adversarial deformations for deep neural networks.
in preparation, 2018.
[2] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security:
Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
[3] A. Athalye and I. Sutskever. Synthesizing robust adversarial examples. arXiv preprint
arXiv:1707.07397, 2017.
[4] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv preprint
arXiv:1712.09665, 2017.
[5] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection
methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security,
pages 3–14. ACM, 2017.
[6] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE
Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[7] L. Engstrom, D. Tsipras, L. Schmidt, and A. Madry. A Rotation and a Translation Suffice:
Fooling CNNs with Simple Transformations. arXiv preprint arXiv:1712.02779, 2017.
[8] T. Gauksson. Adversarial perturbations and deformations for convolutional neu-
ral networks. https://www.research-collection.ethz.ch/handle/20.500.11850/
258550/, 2017.
[9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
arXiv preprint arXiv:1412.6572, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR),
pages 770–778, 2016.
[11] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial logit pairing. arXiv preprint
arXiv:1803.06373, 2018.
[12] J. Z. Kolter and E. Wong. Provable defenses against adversarial examples via the convex outer
adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
[13] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv
preprint arXiv:1607.02533, 2016.
[14] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint
arXiv:1611.01236, 2016.
[15] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/
mnist/.
[16] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models
resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[17] S. M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturba-
tions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
86–94, July 2017.
[18] S. M. Moosavi Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method
to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
[19] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box
attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on
Computer and Communications Security, pages 506–519. ACM, 2017.
[20] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples.
arXiv preprint arXiv:1801.09344, 2018.
[21] A. Rozsa, E. M. Rudd, and T. E. Boult. Adversarial diversity and hard positive generation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
pages 25–32, 2016.
9
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition
Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[23] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and
stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016.
[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception archi-
tecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 2818–2826, 2016.
[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus.
Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[26] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training:
Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
[27] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial
examples. arXiv preprint arXiv:1801.02612, 2018.
10