Lec1&2 Final


Trustworthy ML

Mohammad Hossein Rohban, Ph.D.


Content of this session and next

Adversarial attacks

Improving Robustness

2
Machine Learning: The Success Story

3
Deep nets are the state-of-the-art solution to many problems

But what do these results really mean?


4
Can We Truly Rely on ML?

5
An Old Imperceptible Adversarial Distortion

[Figure: a cat image plus an adversarial distortion is classified as guacamole]
The adversarial distortion is optimized to cause the (undefended, off-the-shelf)
neural network to make a mistake.
But now models can be defended against such imperceptible distortions.
6
Modern Adversarial Examples

[Figure: two adversarial examples, each labeled "Perceptible"]

● Here, the adversary made changes to the image that are perceptible to the
human eye, yet the category is unchanged.
● Modern models can be made robust to imperceptible distortions, but they are still
not robust to perceptible distortions.
7
Fooling a Binary Classifier

Input x 2 -1 3 -2 2 2 1 -4 5 1

Weight w -1 -1 1 -1 1 -1 1 1 -1 1

8
Fooling a Binary Classifier

Input x 2 -1 3 -2 2 2 1 -4 5 1

Adv Input x+𝜀 1.5 -1.5 3.5 -2.5 1.5 1.5 1.5 -3.5 4.5 1.5

Weight w -1 -1 1 -1 1 -1 1 1 -1 1

9
Lessons from Fooling a Binary Classifier
Input x 2 -1 3 -2 2 2 1 -4 5 1

Adv Input x+𝜀 1.5 -1.5 3.5 -2.5 1.5 1.5 1.5 -3.5 4.5 1.5

Weight w -1 -1 1 -1 1 -1 1 1 -1 1
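Working through the dot products with the values above:

\[
w \cdot x = -2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
\]
\[
w \cdot (x + \varepsilon) = -1.5 + 1.5 + 3.5 + 2.5 + 1.5 - 1.5 + 1.5 - 3.5 - 4.5 + 1.5 = +1
\]

A change of only 0.5 per coordinate is enough to flip the sign of the score.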

The cumulative effect of many small changes made the adversary powerful
enough to change the classification decision.
Adversarial examples exist even for simple, non-deep-learning models.
Something more fundamental is going on!
10
Adversarial Examples from Overfitting

11
Adversarial Examples from Excessive Linearity

12
Modern deep nets are very piecewise linear

13
Review: Norms

14
Review: Norms
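The norm-review figures are not reproduced here; presumably these slides cover the standard ℓp norms, which for a vector v are:

\[
\|v\|_p = \Big(\sum_i |v_i|^p\Big)^{1/p}, \qquad
\|v\|_2 = \sqrt{\sum_i v_i^2}, \qquad
\|v\|_\infty = \max_i |v_i|
\]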

15
An Adversary Threat Model
A simple threat model is to assume the adversary has an attack
distortion budget ‖δ‖_p ≤ ε (for a fixed ε and p).

Not all distortions have a small ℓ_p norm (e.g., rotations). This simplistic threat
model is common since it is a more tractable subproblem.

The adversary’s goal is usually to find a distortion δ that maximizes the loss
subject to its budget:
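Written out (using L for the training loss and θ for the model parameters, which is an assumption about the slide's notation):

\[
\max_{\|\delta\|_p \le \varepsilon} \; L(\theta,\, x + \delta,\, y)
\]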

16
Fast Gradient Sign Method (FGSM)
How do we generate adversarial examples algorithmically?

A simple adversarial attack is the FGSM attack:

x_adv = x + ε · sign(∇_x J(θ, x, y))

This performs a single step of gradient ascent to increase the loss,
and it obeys an ℓ∞ attack budget: ‖x_adv − x‖_∞ ≤ ε.

This attack is “fast” because it uses a single gradient step.

This is no longer a strong attack and is easy to defend against
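A minimal PyTorch-style sketch of this update (the model, the [0, 1] input range, and the cross-entropy loss are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM: perturb each input entry by epsilon in the
    direction (sign of the input gradient) that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)          # J(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]       # gradient w.r.t. the input
    x_adv = x + epsilon * grad.sign()            # single gradient-sign step
    return x_adv.clamp(0.0, 1.0).detach()        # assumes inputs live in [0, 1]
```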


17
Fast Gradient Sign Method (FGSM)

J is the loss function


18
Projected Gradient Descent (PGD)
Unlike the single-step FGSM attack, the PGD attack uses multiple
gradient ascent steps and is thus far more powerful.
Pseudocode for PGD subject to an ℓ∞ budget ε:

    x_0 = x + δ, where δ is a uniformly randomly sampled value from [−ε, ε]
        (the random first step gives more diverse examples)
    for t = 1, ..., T:    (for CIFAR-10, T is often 7)
        x_t = Proj_{‖x′ − x‖_∞ ≤ ε}( x_{t−1} + α · sign(∇_x L(θ, x_{t−1}, y)) )
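A minimal PyTorch-style sketch of this loop (the step size alpha, the [0, 1] input range, and the cross-entropy loss are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, T=7):
    """Multi-step PGD attack under an l_inf budget epsilon."""
    # Random start: uniform noise in [-epsilon, epsilon] for more diverse examples.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0).detach()
    for _ in range(T):                                   # T is often 7 for CIFAR-10
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                             # gradient ascent step
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)   # project onto the l_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                                   # stay in the valid input range
    return x_adv.detach()
```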

19
Untargeted vs Targeted Attacks
An untargeted attack maximizes the loss, whereas a targeted attack
optimizes examples to be misclassified as a predetermined target class

Untargeted attacks for AT are standard for CIFAR, but targeted attacks
for AT are standard for ImageNet since it has many similar classes
[Figure: original image (Labrador Retriever); untargeted adversary → Golden Retriever; targeted adversary with target great_white_shark → Great White Shark]
20


White Box vs Black Box Testing
When adversaries do not have access to the model parameters, the network
is considered a “black box,” and only model outputs are observed

Some researchers prefer “white box” assumptions because relying on
“security through obscurity” can be a fragile strategy

21
Defenses against Adversarial Attacks

● Defensive Distillation
● Adversarial Training

22
Defensive Distillation
Define a new softmax associated with a temperature T > 0:
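For logits z, this temperature softmax is:

\[
\mathrm{softmax}(z; T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
\]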

23
Defensive Distillation
Defensive distillation proceeds in four steps (sketched in code below):
1) Train a network, the teacher network, by setting the temperature of the softmax to T
during the training phase.
2) Compute soft labels by applying the teacher network to each instance in the training set,
again evaluating the softmax at temperature T.
3) Train the distilled network (a network with the same shape as the teacher network) on the
soft labels, using softmax at temperature T.
4) Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
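A minimal PyTorch-style sketch of the four steps (the temperature value, loss forms, and training-loop details are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

T = 20.0  # distillation temperature (placeholder value)

# 1) Train the teacher with its softmax evaluated at temperature T (hard labels y).
def teacher_loss(teacher, x, y):
    return F.cross_entropy(teacher(x) / T, y)

# 2) Compute soft labels by applying the teacher at temperature T.
def soft_labels(teacher, x):
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

# 3) Train the distilled network on the soft labels, again at temperature T.
def student_loss(student, teacher, x):
    log_p = F.log_softmax(student(x) / T, dim=1)
    return -(soft_labels(teacher, x) * log_p).sum(dim=1).mean()  # cross-entropy with soft targets

# 4) At test time, classify new inputs with the distilled network at temperature 1.
def predict(student, x):
    return student(x).argmax(dim=1)
```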
Can we break defensive distillation?
Let us find adversarial examples for distilled networks.

24
Why other attacks fail on defensive distillation

● By applying T during training, we force the logits in the distilled network to be roughly T times
larger than in the normal case.
● When we then set T = 1 at test time, the softmax saturates and the gradients w.r.t. the input
become very close to zero.
● This causes L-BFGS to fail (why?)
● As an alternative, one could take the following function as the objective to attack the network:
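The formula on the slide is not reproduced here; in Carlini and Wagner's paper (see the references), an effective alternative is an objective defined directly on the logits Z(·) rather than on the softmax loss, e.g. for target class t:

\[
f(x') = \max\Bigl(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\Bigr)
\]

where κ controls the desired confidence margin; unlike post-softmax gradients, the gradients of this logit-based objective do not vanish.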

● “This clearly demonstrates the fragility of using the loss function as the objective to minimize.”

25
Effect of temperature
● Does high distillation temperature increase the robustness of the network?

26
Defenses against Adversarial Attacks

● Defensive Distillation
● Adversarial Training

27
Adversarial Training (AT)
The best-known way to make models more robust to adversarial
examples is adversarial training (AT).
A common adversarial training procedure is as follows (a code sketch is given below):

    sample a minibatch (x, y) from the dataset
    create x_adv from x for all examples in the minibatch
        (for adversarial training to be successful, x_adv should come from a multistep attack such as PGD)
    optimize the loss on (x_adv, y)
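A minimal PyTorch-style sketch of this procedure, reusing the pgd_attack sketch from earlier (the optimizer, data loader, and cross-entropy loss are assumptions for illustration):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon, alpha, T=7):
    model.train()
    for x, y in loader:                                      # sample minibatch from the dataset
        x_adv = pgd_attack(model, x, y, epsilon, alpha, T)   # create x_adv with a multistep attack (PGD)
        loss = F.cross_entropy(model(x_adv), y)              # loss on the adversarial examples
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```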

Currently, AT can reduce accuracy on clean examples by 10%+


28
Adversarial Training, Impact on Accuracy

Robustness gain comes at the expense of accuracy on benign data


29
Proper Evaluation Is Difficult
A nonexhaustive checklist for evaluating defenses

30
Transferability
An adversarial example crafted for one model can be used to attack
many different models

Given models f and g, an adversarial example designed for f sometimes gives
a high loss on g, even if g has a different architecture.

Even though transfer rates can vary widely, transferability demonstrates
that adversarial failure modes are somewhat shared across models.

Consequently, an attacker does not always need access to a model’s
parameters or architectural information to attack it.
31
Improving Robustness With Data
Adversarial robustness scales slowly with dataset size

One can adversarially pretrain using adversarially
distorted data for different tasks.

For example, to increase CIFAR-100 adversarial
robustness, first adversarially pretrain on
ImageNet, a larger dataset.

32
Data Augmentation
Beyond using more data, models can also squeeze more out of the existing
data using data augmentation

CutMix is a data augmentation technique that can
be combined with adversarial training; a code sketch is given below.

[Figure: CutMix pseudocode; example images for Original, Mixup, Cutout, and CutMix]
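The CutMix pseudocode figure is not reproduced here; below is a minimal PyTorch-style sketch of the core idea (paste a random rectangle from a shuffled copy of the batch and mix the labels in proportion to the pasted area); the exact sampling details of the CutMix paper may differ:

```python
import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    """Replace a random rectangle of each image with the same rectangle from a
    partner image in the batch; labels are mixed by area ratio (modifies x in place)."""
    lam = np.random.beta(alpha, alpha)                 # area mixing ratio
    idx = torch.randperm(x.size(0))                    # partner image for each example
    H, W = x.size(2), x.size(3)
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    x1, x2 = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)          # correct lam for clipping at the border
    return x, y, y[idx], lam  # train with lam * loss(pred, y) + (1 - lam) * loss(pred, y[idx])
```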

33
Data Augmentation Results and Caveat

Note that a parameter exponential moving average (EMA) is needed for data
augmentation to be beneficial.

Given parameters θ, at each iteration the model is updated with an
exponential moving average of θ.
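A minimal PyTorch-style sketch of such a weight EMA (the decay value is an assumption):

```python
import torch

def update_ema(model, ema_model, decay=0.999):
    """theta_ema <- decay * theta_ema + (1 - decay) * theta, applied after each update step."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Usage: create ema_model as a deep copy of model once, call update_ema after every
# optimizer step, and evaluate robustness with ema_model.
```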
34
Automatic Text-Based Attacks
NLP models can also be attacked with automatic algorithms

Some specific attacks include TextBugger (typos), CompAttack
(composition), BERT-ATTACK (context), and so on.
35
Robustness Guarantees
Sometimes the test set will not find important faults in a model, and some
think empirical evidence is insufficient for having high confidence

One idea is to create provable guarantees (“certificates”) for
how a model behaves given just the model weights.

One line of robustness guarantees research studies
classifiers whose prediction at any example x is verifiably
constant within some set around x.

These guarantees are demonstrated using mathematical
properties of networks.
36
References
● STAT240 - Robust and Nonparametric Statistics - Jacob Steinhardt (berkeley.edu)
● Tutorial_NeurIPS_2018
● Intro to ML Safety
● 2017-05-30-Stanford-cs213n
● Towards Evaluating the Robustness of Neural Networks (arxiv.org)
● Universal adversarial perturbations (arxiv.org)

37
