Lec1&2 Final


Trustworthy ML

Mohammad Hossein Rohban, Ph.D.


Content of this session and next

Adversarial attacks

Improving Robustness

2
Machine Learning: The Success Story

3
Deep nets are the state-of-the-art solution to many problems

But what do these results really mean?


4
Can We Truly Rely on ML?

5
An Old Imperceptible Adversarial Distortion

[Figure: a cat image plus an adversarial distortion is classified as guacamole]
The adversarial distortion is optimized to cause the (undefended, off-the-shelf)
neural network to make a mistake.
But now models can be defended against such imperceptible distortions.
6
Modern Adversarial Examples

[Figure: two adversarial examples, each labeled "Perceptible"]

● Here, the adversary made changes to the image that are perceptible to the
human eye, yet the category is unchanged.
● Modern models can be made robust to imperceptible distortions, but they are still
not robust to perceptible distortions.
7
Fooling a Binary Classifier

Input x 2 -1 3 -2 2 2 1 -4 5 1

Weight w -1 -1 1 -1 1 -1 1 1 -1 1

8
Fooling a Binary Classifier

Input x 2 -1 3 -2 2 2 1 -4 5 1

Adv Input x+𝜀 1.5 -1.5 3.5 -2.5 1.5 1.5 1.5 -3.5 4.5 1.5

Weight w -1 -1 1 -1 1 -1 1 1 -1 1

9
Lessons from Fooling a Binary Classifier
Input x 2 -1 3 -2 2 2 1 -4 5 1

Adv Input x+𝜀 1.5 -1.5 3.5 -2.5 1.5 1.5 1.5 -3.5 4.5 1.5

Weight w -1 -1 1 -1 1 -1 1 1 -1 1
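Working through the dot products with the values above:

\[
w \cdot x = -2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
\]
\[
w \cdot (x + \varepsilon) = -1.5 + 1.5 + 3.5 + 2.5 + 1.5 - 1.5 + 1.5 - 3.5 - 4.5 + 1.5 = +1
\]

A change of only 0.5 per coordinate is enough to flip the sign of the score.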

The cumulative effect of many small changes made the adversary powerful
enough to change the classification decision.
Adversarial examples exist even for simple, non-deep-learning models.
Something more fundamental is going on!
10
Adversarial Examples from Overfitting

11
Adversarial Examples from Excessive Linearity

12
Modern deep nets are very piecewise linear

13
Review: Norms

14
Review: Norms
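The norm-review figures are not reproduced here; presumably these slides cover the standard ℓp norms, which for a vector v are:

\[
\|v\|_p = \Big(\sum_i |v_i|^p\Big)^{1/p}, \qquad
\|v\|_2 = \sqrt{\sum_i v_i^2}, \qquad
\|v\|_\infty = \max_i |v_i|
\]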

15
An Adversary Threat Model
A simple threat model is to assume the adversary has an attack
distortion budget ‖δ‖_p ≤ ε (for a fixed ε and p).

Not all distortions have a small ℓ_p norm (e.g., rotations). This simplistic threat
model is common since it is a more tractable subproblem.

The adversary’s goal is usually to find a distortion δ that maximizes the loss
subject to its budget:
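Written out (using L for the training loss and θ for the model parameters, which is an assumption about the slide's notation):

\[
\max_{\|\delta\|_p \le \varepsilon} \; L(\theta,\, x + \delta,\, y)
\]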

16
Fast Gradient Sign Method (FGSM)
How do we generate adversarial examples algorithmically?

A simple adversarial attack is the FGSM attack:

x_adv = x + ε · sign(∇_x J(θ, x, y))

This performs a single step of gradient ascent to increase the loss,
and it obeys an ℓ∞ attack budget: ‖x_adv − x‖_∞ ≤ ε.

This attack is “fast” because it uses a single gradient step.

This is no longer a strong attack and is easy to defend against
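A minimal PyTorch-style sketch of this update (the model, the [0, 1] input range, and the cross-entropy loss are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step FGSM: perturb each input entry by epsilon in the
    direction (sign of the input gradient) that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)          # J(theta, x, y)
    grad = torch.autograd.grad(loss, x)[0]       # gradient w.r.t. the input
    x_adv = x + epsilon * grad.sign()            # single gradient-sign step
    return x_adv.clamp(0.0, 1.0).detach()        # assumes inputs live in [0, 1]
```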


17
Fast Gradient Sign Method (FGSM)

J is the loss function


18
Projected Gradient Descent (PGD)
Unlike the single-step FGSM attack, the PGD attack uses multiple
gradient ascent steps and is thus far more powerful.
Pseudocode for PGD subject to an ℓ∞ budget ε:

    x_0 = x + δ, where δ is a uniformly randomly sampled value from [−ε, ε]
        (the random first step gives more diverse examples)
    for t = 1, ..., T:    (for CIFAR-10, T is often 7)
        x_t = Proj_{‖x′ − x‖_∞ ≤ ε}( x_{t−1} + α · sign(∇_x L(θ, x_{t−1}, y)) )
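A minimal PyTorch-style sketch of this loop (the step size alpha, the [0, 1] input range, and the cross-entropy loss are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, T=7):
    """Multi-step PGD attack under an l_inf budget epsilon."""
    # Random start: uniform noise in [-epsilon, epsilon] for more diverse examples.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0).detach()
    for _ in range(T):                                   # T is often 7 for CIFAR-10
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                             # gradient ascent step
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)   # project onto the l_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                                   # stay in the valid input range
    return x_adv.detach()
```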

19
Untargeted vs Targeted Attacks
An untargeted attack maximizes the loss, whereas a targeted attack
optimizes examples to be misclassified as a predetermined target class

Untargeted attacks for AT are standard for CIFAR, but targeted attacks
for AT are standard for ImageNet since it has many similar classes
[Figure: original image (Labrador Retriever); untargeted adversary → Golden Retriever; targeted adversary with target great_white_shark → Great White Shark]
20


White Box vs Black Box Testing
When adversaries do not have access to the model parameters, the network
is considered a “black box,” and only model outputs are observed

Some researchers prefer “white box” assumptions because relying on
“security through obscurity” can be a fragile strategy

21
Defenses against Adversarial Attacks

● Defensive Distillation
● Adversarial Training

22
Defensive Distillation
Define a new softmax associated with a temperature T > 0:
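For logits z, this temperature softmax is:

\[
\mathrm{softmax}(z; T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
\]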

23
Defensive Distillation
Defensive distillation proceeds in four steps (sketched in code below):
1) Train a network, the teacher network, by setting the temperature of the softmax to T
during the training phase.
2) Compute soft labels by applying the teacher network to each instance in the training set,
again evaluating the softmax at temperature T.
3) Train the distilled network (a network with the same shape as the teacher network) on the
soft labels, using softmax at temperature T.
4) Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
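A minimal PyTorch-style sketch of the four steps (the temperature value, loss forms, and training-loop details are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

T = 20.0  # distillation temperature (placeholder value)

# 1) Train the teacher with its softmax evaluated at temperature T (hard labels y).
def teacher_loss(teacher, x, y):
    return F.cross_entropy(teacher(x) / T, y)

# 2) Compute soft labels by applying the teacher at temperature T.
def soft_labels(teacher, x):
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

# 3) Train the distilled network on the soft labels, again at temperature T.
def student_loss(student, teacher, x):
    log_p = F.log_softmax(student(x) / T, dim=1)
    return -(soft_labels(teacher, x) * log_p).sum(dim=1).mean()  # cross-entropy with soft targets

# 4) At test time, classify new inputs with the distilled network at temperature 1.
def predict(student, x):
    return student(x).argmax(dim=1)
```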
Can we break defensive distillation?
Let us find adversarial examples for distilled networks.

24
Why other attacks fail on defensive distillation

● By applying T during training, we force the logits in the distilled network to be roughly T times
larger than in the normal case.
● When we then set T = 1 at test time, the softmax saturates and the gradients w.r.t. the input
become very close to zero.
● This causes L-BFGS to fail (why?)
● As an alternative, one could take the following function as the objective to attack the network:
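The formula on the slide is not reproduced here; in Carlini and Wagner's paper (see the references), an effective alternative is an objective defined directly on the logits Z(·) rather than on the softmax loss, e.g. for target class t:

\[
f(x') = \max\Bigl(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\Bigr)
\]

where κ controls the desired confidence margin; unlike post-softmax gradients, the gradients of this logit-based objective do not vanish.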

● “This clearly demonstrates the fragility of using the loss function as the objective to minimize.”

25
Effect of temperature
● Does high distillation temperature increase the robustness of the network?

26
Defenses against Adversarial Attacks

● Defensive Distillation
● Adversarial Training

27
Adversarial Training (AT)
The best-known way to make models more robust to adversarial
examples is adversarial training (AT).
A common adversarial training procedure is as follows (a code sketch is given below):

    sample a minibatch (x, y) from the dataset
    create x_adv from x for all examples in the minibatch
        (for adversarial training to be successful, x_adv should come from a multistep attack such as PGD)
    optimize the loss on (x_adv, y)
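A minimal PyTorch-style sketch of this procedure, reusing the pgd_attack sketch from earlier (the optimizer, data loader, and cross-entropy loss are assumptions for illustration):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon, alpha, T=7):
    model.train()
    for x, y in loader:                                      # sample minibatch from the dataset
        x_adv = pgd_attack(model, x, y, epsilon, alpha, T)   # create x_adv with a multistep attack (PGD)
        loss = F.cross_entropy(model(x_adv), y)              # loss on the adversarial examples
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```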

Currently, AT can reduce accuracy on clean examples by 10%+


28
Adversarial Training, Impact on Accuracy

Robustness gain comes at the expense of accuracy on benign data


29
Proper Evaluation Is Difficult
A nonexhaustive checklist for evaluating defenses

30
Transferability
An adversarial example crafted for one model can be used to attack
many different models

Given models f and g, an adversarial example designed for f sometimes gives
a high loss on g, even if g has a different architecture.

Even though transfer rates can vary widely, transferability demonstrates
that adversarial failure modes are somewhat shared across models.

Consequently, an attacker does not always need access to a model’s
parameters or architectural information to attack it.
31
Improving Robustness With Data
Adversarial robustness scales slowly with dataset size

One can adversarially pretrain using adversarially
distorted data for different tasks.

For example, to increase CIFAR-100 adversarial
robustness, first adversarially pretrain on
ImageNet, a larger dataset.

32
Data Augmentation
Beyond using more data, models can also squeeze more out of the existing
data using data augmentation

CutMix is a data augmentation technique that can
be combined with adversarial training; a code sketch is given below.

[Figure: CutMix pseudocode; example images for Original, Mixup, Cutout, and CutMix]
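The CutMix pseudocode figure is not reproduced here; below is a minimal PyTorch-style sketch of the core idea (paste a random rectangle from a shuffled copy of the batch and mix the labels in proportion to the pasted area); the exact sampling details of the CutMix paper may differ:

```python
import numpy as np
import torch

def cutmix(x, y, alpha=1.0):
    """Replace a random rectangle of each image with the same rectangle from a
    partner image in the batch; labels are mixed by area ratio (modifies x in place)."""
    lam = np.random.beta(alpha, alpha)                 # area mixing ratio
    idx = torch.randperm(x.size(0))                    # partner image for each example
    H, W = x.size(2), x.size(3)
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    x1, x2 = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)          # correct lam for clipping at the border
    return x, y, y[idx], lam  # train with lam * loss(pred, y) + (1 - lam) * loss(pred, y[idx])
```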

33
Data Augmentation Results and Caveat

Note that a parameter exponential moving average (EMA) is needed for data
augmentation to be beneficial.

Given parameters θ, at each iteration the model is updated with an
exponential moving average of θ.
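A minimal PyTorch-style sketch of such a weight EMA (the decay value is an assumption):

```python
import torch

def update_ema(model, ema_model, decay=0.999):
    """theta_ema <- decay * theta_ema + (1 - decay) * theta, applied after each update step."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Usage: create ema_model as a deep copy of model once, call update_ema after every
# optimizer step, and evaluate robustness with ema_model.
```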
34
Automatic Text-Based Attacks
NLP models can also be attacked with automatic algorithms

Some specific attacks include TextBugger (typos), CompAttack
(composition), BERT-ATTACK (context), and so on.
35
Robustness Guarantees
Sometimes the test set will not find important faults in a model, and some
think empirical evidence is insufficient for having high confidence

One idea is to create provable guarantees (“certificates”) for
how a model behaves given just the model weights.

One line of robustness guarantees research studies
classifiers whose prediction at any example x is verifiably
constant within some set around x.

These guarantees are demonstrated using mathematical
properties of networks.
36
References
● STAT240 - Robust and Nonparametric Statistics - Jacob Steinhardt (berkeley.edu)
● Tutorial_NeurIPS_2018
● Intro to ML Safety
● 2017-05-30-Stanford-cs213n
● Towards Evaluating the Robustness of Neural Networks (arxiv.org)
● Universal adversarial perturbations (arxiv.org)

37
