Lec1&2 Final
Adversarial attacks
Improving Robustness
Machine Learning: The Success Story
Deep nets are the state-of-the-art solution for many problems
An Old, Perceptible Adversarial Distortion
● Here, the adversary made changes to the image that are perceptible to the human eye, yet the category is unchanged.
● Modern models can be made robust to imperceptible distortions, but they are still not robust to perceptible distortions.
Fooling a Binary Classifier
Input x:     2    -1     3    -2     2     2     1    -4     5     1
Weight w:   -1    -1     1    -1     1    -1     1     1    -1     1
Fooling a Binary Classifier
Input x:         2     -1      3     -2      2      2      1     -4      5      1
Adv Input x+𝜀:   1.5   -1.5    3.5   -2.5    1.5    1.5    1.5   -3.5    4.5    1.5
Weight w:       -1     -1      1     -1      1     -1      1      1     -1      1
Lessons from Fooling a Binary Classifier
Input x:         2     -1      3     -2      2      2      1     -4      5      1
Adv Input x+𝜀:   1.5   -1.5    3.5   -2.5    1.5    1.5    1.5   -3.5    4.5    1.5
Weight w:       -1     -1      1     -1      1     -1      1      1     -1      1
The cumulative effect of many small changes made the adversary powerful enough to change the classification decision.
Adversarial examples exist even for simple, non-deep-learning models.
Something more fundamental is going on!
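Since the slide shows only the tables, here is a quick plain-Python check of the arithmetic (assuming, as the slide implies, a classifier that decides by the sign of the score w·x):

```python
# Numbers copied from the table above; each coordinate is nudged by only 0.5.
x     = [2,   -1,   3,   -2,   2,   2,   1,   -4,   5,   1]
x_adv = [1.5, -1.5, 3.5, -2.5, 1.5, 1.5, 1.5, -3.5, 4.5, 1.5]
w     = [-1,  -1,   1,   -1,   1,   -1,  1,    1,  -1,   1]

score     = sum(wi * xi for wi, xi in zip(w, x))      # w . x     = -3  -> negative class
score_adv = sum(wi * xi for wi, xi in zip(w, x_adv))  # w . x_adv = +1  -> positive class
print(score, score_adv)
# Each coordinate moved by at most 0.5, yet the score crossed zero and the decision flipped.
# (If every coordinate moved exactly 0.5 in the direction of sign(w), the adversarial score
# would be -3 + 10 * 0.5 = 2; with the table's values as printed it is +1.)
```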
Adversarial Examples from Overfitting
Adversarial Examples from Excessive Linearity
Modern deep nets are very piecewise linear
Review: $\ell_p$ norm
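For reference, the standard $\ell_p$ norms this review covers (a generic restatement; the slides' own presentation may differ):

$$\|v\|_p = \Big(\sum_i |v_i|^p\Big)^{1/p}, \qquad \|v\|_1 = \sum_i |v_i|, \qquad \|v\|_2 = \Big(\sum_i v_i^2\Big)^{1/2}, \qquad \|v\|_\infty = \max_i |v_i|$$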
An Adversary Threat Model
A simple threat model is to assume the adversary has an attack distortion budget $\|x_{\text{adv}} - x\|_p \le \varepsilon$ (assuming a fixed $p$ and $\varepsilon$).
Not all distortions have a small $\ell_p$ norm (e.g., rotations). This simplistic threat model is common since it is a more tractable subproblem.
The adversary’s goal is usually to find a distortion that maximizes the loss subject to its budget:
$$x_{\text{adv}} = \arg\max_{\|x' - x\|_p \le \varepsilon} \mathcal{L}(f(x'), y)$$
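As a concrete illustration of the common $\ell_\infty$ case of this budget, here is a minimal PyTorch-style sketch (function and variable names are my own, not the slide's):

```python
import torch

def project_linf(x_adv, x, eps):
    # Enforce the L-infinity budget: every pixel of x_adv may differ from x by at most eps.
    delta = torch.clamp(x_adv - x, min=-eps, max=eps)
    # Also keep the result a valid image (pixels in [0, 1]).
    return torch.clamp(x + delta, 0.0, 1.0)
```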
Fast Gradient Sign Method (FGSM)
How do we generate adversarial examples algorithmically?
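One standard answer is FGSM itself: take a single step of size $\varepsilon$ in the direction of the sign of the input gradient, $x_{\text{adv}} = x + \varepsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(f(x), y))$. Below is a minimal PyTorch-style sketch (model, x, y, and eps are assumed to be given):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # the loss the adversary wants to increase
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # single step in the gradient-sign direction
    return x_adv.clamp(0.0, 1.0).detach()  # stay within the valid pixel range
```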
Untargeted vs Targeted Attacks
An untargeted attack maximizes the loss, whereas a targeted attack optimizes examples to be misclassified as a predetermined target class.
Untargeted attacks for AT are standard for CIFAR, but targeted attacks for AT are standard for ImageNet, since it has many similar classes.
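In the notation of the threat model above, the two goals can be written as (a restatement of the standard objectives, not the slide's own formulas):

$$\text{untargeted: } \arg\max_{\|x'-x\|_p \le \varepsilon} \mathcal{L}(f(x'), y) \qquad\qquad \text{targeted: } \arg\min_{\|x'-x\|_p \le \varepsilon} \mathcal{L}(f(x'), y_{\text{target}})$$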
[Figure: original image, labeled great_white_shark]
Defenses against Adversarial Attacks
● Defensive Distillation
● Adversarial Training
Defensive Distillation
Define a new softmax associated with a temperature T > 0:
$$\text{softmax}_T(z)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
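A minimal NumPy sketch of this temperature-scaled softmax (the function name and the max-subtraction for numerical stability are my additions):

```python
import numpy as np

def softmax_T(logits, T):
    # Softmax at temperature T > 0; larger T yields softer (more uniform) probabilities.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()
```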
Defensive Distillation
Defensive distillation proceeds in four steps:
1) Train a network, the teacher network, by setting the temperature of the softmax to T during the training phase.
2) Compute soft labels by applying the teacher network to each instance in the training set, again evaluating the softmax at temperature T.
3) Train the distilled network (a network with the same shape as the teacher network) on the soft labels, using softmax at temperature T.
4) Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
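A minimal PyTorch-style sketch of these four steps (make_net, the data loader, the optimizer, and all hyperparameters are illustrative assumptions, not the slide's specification):

```python
import torch
import torch.nn.functional as F

def train_at_temperature(model, loader, get_targets, T, epochs=10, lr=1e-3):
    """Train `model` with the softmax/cross-entropy evaluated at temperature T."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            log_probs = F.log_softmax(model(x) / T, dim=1)
            targets = get_targets(x, y)                      # (batch, classes) probabilities
            loss = -(targets * log_probs).sum(dim=1).mean()  # cross-entropy vs. soft targets
            opt.zero_grad()
            loss.backward()
            opt.step()

def defensive_distillation(make_net, loader, num_classes, T=20.0):
    # 1) Train the teacher network at temperature T on the (one-hot) training labels.
    teacher = make_net()
    train_at_temperature(teacher, loader,
                         lambda x, y: F.one_hot(y, num_classes).float(), T)
    # 2) + 3) Compute soft labels with the teacher (softmax at temperature T) and train a
    #         same-architecture distilled network on them, again at temperature T.
    distilled = make_net()
    train_at_temperature(distilled, loader,
                         lambda x, y: F.softmax(teacher(x).detach() / T, dim=1), T)
    # 4) At test time, classify with temperature 1: predictions = distilled(x).argmax(dim=1).
    return distilled
```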
Can we break defensive distillation? Let us find adversarial examples for distilled networks.
Why other attacks fail on defensive distillation
● When the distilled network is evaluated at temperature 1, its logits are much larger than during training, so the softmax saturates and the gradient of the loss with respect to the input all but vanishes; attacks that follow this gradient therefore fail.
● “This clearly demonstrates the fragility of using the loss function as the objective to minimize.”
Effect of temperature
● Does high distillation temperature increase the robustness of the network?
Defenses against Adversarial Attacks
● Defensive Distillation
● Adversarial Training
Adversarial Training (AT)
The best-known way to make models more robust to adversarial examples is adversarial training.
A common adversarial training procedure is sketched below:
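A minimal PyTorch-style sketch of one common variant (FGSM as the inner attack; all names, the optimizer, and the hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Craft an L-infinity bounded adversarial example with one gradient-sign step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x_adv + eps * grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_train(model, loader, eps=8 / 255, epochs=10, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x_adv = fgsm(model, x, y, eps)           # inner step: attack the current model
            loss = F.cross_entropy(model(x_adv), y)  # outer step: train on the adversarial batch
            opt.zero_grad()
            loss.backward()
            opt.step()
```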
Transferability
An adversarial example crafted for one model can be used to attack many different models.
Data Augmentation
Beyond using more data, models can also squeeze more out of the existing data using data augmentation.
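A minimal torchvision sketch of a typical augmentation pipeline (the particular transforms and parameters are illustrative choices, not the slide's):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # random shifts via padded cropping
    transforms.RandomHorizontalFlip(),         # mirror the image half of the time
    transforms.ColorJitter(0.2, 0.2, 0.2),     # mild brightness/contrast/saturation jitter
    transforms.ToTensor(),
])
# Usage: pass `train_transform` as the `transform` argument of a dataset, e.g.
# torchvision.datasets.CIFAR10(root="data", train=True, transform=train_transform).
```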
Data Augmentation Results and Caveat