
Introduction to Deep Neural Networks (1)

Piyush Rai

Introduction to Machine Learning (CS771A)

October 23, 2018



Linear Models (and their limitations..)

Linear models: Output produced by taking a linear combination of the input features, possibly passed through some monotonic function (e.g., a sigmoid)

Already seen several examples (linear regression, logistic regression, linear SVM, multi-output linear regression, softmax regression, and several others)

This basic architecture is classically also known as the "Perceptron" (not to be confused with the Perceptron "algorithm", which learns a linear classification model)

However, this simple model can't learn complex functions (e.g., nonlinear decision boundaries)

We have already seen a way of handling nonlinearities: kernel methods (invented in the 90s)

Something existed in the pre-kernel-methods era, too..

Neural Networks a.k.a. Multi-layer Perceptron (MLP)

Became very popular in the early 80s and went to slumber in the late 80s (people didn't know how to train them well), but woke up again in the late 2000s (now we do). Rechristened as "Deep Learning"

An MLP consists of an input layer, an output layer, and one or more hidden layers

A very simple MLP: an input layer with D = 3 nodes and a single hidden layer with K = 2 nodes

[Figure: an input layer (with D = 3 visible units) feeding a hidden layer (with K = 2 hidden units), which feeds an output layer producing a scalar-valued output]

Each node (a.k.a. unit) in the hidden layer computes a nonlinear transform of the inputs it receives

Hidden layer nodes act as "features" for the final layer (a linear model) to produce the output

The overall effect is a nonlinear mapping from inputs to outputs (justification later)

Illustration: A Neural Network with One Hidden Layer

Each input x_n is transformed into several "pre-activations" using linear models:
a_{nk} = w_k^T x_n = \sum_{d=1}^{D} w_{dk} x_{nd}

A nonlinear "activation" function g is applied to each pre-activation:
h_{nk} = g(a_{nk})

A linear model is applied to the new "features" h_n:
s_n = v^T h_n = \sum_{k=1}^{K} v_k h_{nk}

Finally, the output is produced as y_n = o(s_n); the output function o can even be the identity (e.g., for regression, y_n = s_n)

The unknowns of the model (w_1, ..., w_K and v) are learned by minimizing a loss L(W, v) = \sum_{n=1}^{N} \ell(y_n, o(s_n)), e.g., squared, logistic, softmax, etc. (depending on the output)
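The forward computation above is easy to express directly. Below is a minimal NumPy sketch of it (not from the lecture; the dimensions D = 3 and K = 2, the sigmoid activation, and the identity output are assumptions chosen to match the earlier toy example).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed dimensions matching the earlier toy example: D = 3 inputs, K = 2 hidden units.
D, K = 3, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((D, K))   # W = [w_1, ..., w_K]; column k feeds hidden unit k
v = rng.standard_normal(K)        # output-layer weights

x_n = rng.standard_normal(D)      # a single input x_n

a_n = W.T @ x_n                   # pre-activations: a_nk = w_k^T x_n
h_n = sigmoid(a_n)                # activations:     h_nk = g(a_nk)
s_n = v @ h_n                     # score:           s_n  = v^T h_n
y_n = s_n                         # output o(.) taken as the identity (regression)
```
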



Neural Networks: A Basic Pictorial Representation

Note: Hidden layer pre-activations a_{nk} and post-activations h_{nk} will be shown together for brevity

[Figure: the same single-hidden-layer network drawn more succinctly]

We will only show h_{nk} to denote the value computed by the k-th hidden unit

Likewise, for the output layer, we will directly show the final output y_n

Each node in the hidden/output layers computes a linear transformation of its inputs and applies a nonlinearity

Different layers can use different types of activations (the output layer may have none)

MLPs Can Learn Nonlinear Functions: A Justification

An MLP can be seen as a composition of multiple linear models combined nonlinearly

Let's look at a simple example of 2-dim inputs that are not linearly separable

[Figure: a standard single "Perceptron" classifier (no hidden units) cannot learn nonlinear boundaries, whereas a multi-layer Perceptron classifier (one hidden layer with 2 units, combining the two hidden-unit scores with weights 1.0 and -1.0) is capable of learning nonlinear boundaries]

An MLP with a single, sufficiently wide hidden layer can approximate any function (Hornik, 1991)
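As a concrete, hand-constructed instance of the claim above (not a learned model), the sketch below shows a one-hidden-layer MLP with 2 hidden units computing XOR, a classic 2-dim problem that no single linear classifier can solve. The hard-threshold activation and the bias terms are assumptions added for this example.

```python
import numpy as np

def step(a):
    # A hard-threshold activation, used here only to make the example exact.
    return (a > 0).astype(float)

# Hand-picked (not learned) weights for a 2-unit hidden layer that computes XOR.
W = np.array([[1.0, 1.0],    # hidden unit 1: fires if x1 + x2 > 0.5 (logical OR)
              [1.0, 1.0]])   # hidden unit 2: fires if x1 + x2 > 1.5 (logical AND)
b = np.array([-0.5, -1.5])   # biases for the two hidden units
v = np.array([1.0, -1.0])    # output combines the two scores: OR minus AND = XOR
c = -0.5                     # output bias

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(np.array(x) @ W + b)
    y = step(h @ v + c)
    print(x, "->", int(y))   # prints the XOR truth table: 0, 1, 1, 0
```
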
Examples of some NN/MLP architectures




Neural Networks with One Hidden Layer

Already saw the special case of a single-hidden-layer NN (D = 3, K = 2)

In general, an NN with D input units and a single hidden layer with K units

Note: w_{dk} is the weight of the edge between input layer node d and hidden layer node k

W = [w_1, ..., w_K], with w_k being the weights incident on the k-th hidden unit

Each w_k acts as a "feature detector" or "filter" (and there are K such filters in the above NN)



Neural Networks with One Hidden Layer and Multiple Outputs

Very common in multi-class or multi-output/multi-label learning problems

An NN with D input units, a single hidden layer with K units, and multiple outputs

Similar to multi-output regression or softmax regression with h_n as the features



Neural Networks: Multiple Hidden Layers and Multiple Outputs

An NN with D input units, multiple hidden layers, and multiple outputs

[Figure: a network with hidden layers 1, 2, ..., L stacked between the input layer and the output layer]

Note: This (and also the previous simpler ones) is called a fully-connected feedforward network

Fully connected: All pairs of units in adjacent layers are connected to each other

Feedforward: No backward connections. Also, only nodes in adjacent layers are connected
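A fully-connected feedforward pass is just the single-hidden-layer computation repeated layer by layer. The sketch below is a minimal NumPy illustration; the layer widths, the ReLU activation, and the weight scale are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

# Sketch of a fully-connected feedforward pass (layer widths are made up).
layer_sizes = [3, 100, 32, 24, 5]   # D inputs, three hidden layers, 5 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    h = x
    for l, W in enumerate(weights):
        a = h @ W                                   # linear transform of the previous layer
        h = relu(a) if l < len(weights) - 1 else a  # nonlinearity on hidden layers only
    return h

scores = forward(rng.standard_normal(3), weights)   # one score per output unit
```
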
Neural Networks are Feature Learners!

An NN (with single or multiple hidden layers) tries to learn features that can predict the output well

[Figure: the multi-hidden-layer network again; the hidden layers define a learned mapping, unlike kernel methods where the mapping was pre-defined by the choice of kernel]



Neural Networks: The Features Learned..

Deep neural networks are good at detecting features at multiple layers of abstraction

The connection weights between layers can be thought of as feature detectors or "filters"

[Figure: a face-recognition network with K_1 = 100 low-level feature detectors (e.g., detect edges), K_2 = 32 higher-level feature detectors (e.g., parts of a face), and K_3 = 24 even higher-level feature detectors (which make classification easy)]

Lowest-layer weights detect generic features; higher-level weights detect more specific features

Features learned in one layer are composed of features learned in the layer below

Why Are Neural Network Learned Features Helpful?

A single-layer model will learn an "average" feature detector, so it can't capture subtle variations in the inputs

An MLP (even one with a single hidden layer) can learn multiple feature detectors; therefore even a single hidden layer helps capture subtle variations in the inputs



Learning Neural Networks via Backpropagation

Backpropagation = gradient descent using the chain rule of derivatives

(Currently) the ideal way of learning deep neural networks

Chain rule of derivatives: For example, if y = f_1(x) and x = f_2(z), then \partial y / \partial z = (\partial y / \partial x)(\partial x / \partial z)

Since neural networks have a "recursive" architecture, backprop is especially useful

Basic idea: Start taking derivatives from the output layer and proceed backwards (hence the name), reusing the gradients already computed for the layer above

Using backprop in neural nets enables us to reuse previous computations efficiently
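To make the chain-rule idea concrete, here is a hand-derived backward pass for the single-hidden-layer network from earlier. The sigmoid activation and the squared loss L = 0.5 (s_n - y_n)^2 are assumptions made for this example (the lecture lists several possible losses).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, K = 3, 2
W, v = rng.standard_normal((D, K)), rng.standard_normal(K)
x, y = rng.standard_normal(D), 1.0

# forward pass
a = W.T @ x
h = sigmoid(a)
s = v @ h
err = s - y                        # dL/ds for the squared loss

# backward pass: chain rule, reusing quantities computed in the forward pass
grad_v = err * h                   # dL/dv = dL/ds * ds/dv
grad_h = err * v                   # dL/dh = dL/ds * ds/dh
grad_a = grad_h * h * (1 - h)      # dL/da = dL/dh * dh/da (sigmoid derivative)
grad_W = np.outer(x, grad_a)       # dL/dW = dL/da * da/dW

# gradient-descent update
lr = 0.1
W -= lr * grad_W
v -= lr * grad_v
```
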

Learning Neural Networks via Backpropagation

Backprop iterates between a forward pass and a backward pass

The forward pass computes the errors e_n using the current parameters

The backward pass computes the gradients and updates the parameters, starting with the parameters at the top layer and then moving backwards

Implementing backprop by hand can be very cumbersome for complex, very deep NNs

Good news: Many software frameworks (e.g., TensorFlow and PyTorch) already implement backprop, so you don't need to do it by hand (compute derivatives using the chain rule, etc.)
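As a sketch of what "the framework does backprop for you" looks like, here is a minimal PyTorch example; the network sizes, loss, and learning rate are illustrative assumptions. Calling loss.backward() applies the chain rule through the whole network automatically, and optimizer.step() performs the gradient-descent update.

```python
import torch
import torch.nn as nn

# A small single-hidden-layer MLP for a toy regression problem.
model = nn.Sequential(
    nn.Linear(3, 2),   # input layer (D = 3) -> hidden layer (K = 2)
    nn.ReLU(),
    nn.Linear(2, 1),   # hidden layer -> scalar output
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)   # a toy batch of N = 8 inputs
y = torch.randn(8, 1)   # toy regression targets

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass
    loss.backward()               # backward pass: gradients via the chain rule
    optimizer.step()              # gradient-descent update
```
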

Activation Functions

Some common activation functions: sigmoid, tanh, ReLU, Leaky ReLU

Sigmoid: h = \sigma(a) = \frac{1}{1 + \exp(-a)}

tanh (hyperbolic tangent): h = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} = 2\sigma(2a) - 1

ReLU (Rectified Linear Unit): h = \max(0, a)

Leaky ReLU: h = \max(\beta a, a), where \beta is a small positive number

Several others, e.g., Softplus h = \log(1 + \exp(a)), exponential ReLU, maxout, etc.

Sigmoid and tanh can have issues during backprop (saturating gradients; sigmoid outputs are also not zero-centered)

ReLU/leaky ReLU are currently among the most popular choices (also cheap to compute)
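The formulas above translate directly to code; a small NumPy sketch follows (the default \beta = 0.01 for leaky ReLU is an arbitrary choice).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)                  # equals 2 * sigmoid(2a) - 1

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):          # beta: a small positive slope (value arbitrary)
    return np.maximum(beta * a, a)

def softplus(a):
    return np.log1p(np.exp(a))         # log(1 + exp(a))
```
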

Neural Networks: Some Aspects..

Much of the magic lies in the hidden layer(s)

As we've seen, hidden layers learn and detect good features

However, we need to consider a few aspects:

Number of hidden layers, and number of units in each hidden layer

Why bother with many hidden layers rather than a single very wide hidden layer (recall Hornik's universal function approximation theorem)?

Complex network (several, very wide hidden layers) or simple network (few, moderately wide hidden layers)?

Aren't deep neural networks prone to overfitting (since they contain a huge number of parameters)?



Representational Power of Hidden Layers

Consider an NN with a single hidden layer

[Figure: decision boundaries learned with K = 3, K = 6, and K = 20 hidden units]

Recall that each hidden unit "adds" a function to the overall function

Increasing the number of hidden units lets the network learn more and more complex functions

Very large K seems to overfit. Should we instead prefer small K?

No! It is better to use large K and regularize well. Here is a reason/justification:

A simple NN with small K will have a few local optima, some of which may be bad

A complex NN with large K will have many local optima, all roughly equally good

Note: This interesting behavior of NNs has some theoretical justifications (won't discuss here)

We can also use multiple hidden layers (each sufficiently large) and regularize well

(Preventing) Overfitting in Neural Networks

Complex single/multiple hidden layer NNs can overfit

Many ways to avoid overfitting, such as:

Standard regularization on the weights, such as \ell_2, \ell_1, etc. (\ell_2 regularization is also called weight decay) [Figure: decision boundary of a single-hidden-layer NN with K = 20 hidden units and L2 regularization]

Early stopping (traditionally used): Stop when the validation error starts increasing

Dropout: Randomly remove units (with some probability p \in (0, 1)) during training (see the sketch below) [Figure: a network with an input layer, two hidden layers, and an output layer, with some units dropped]
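Two of the options above are easy to sketch in NumPy: an \ell_2 (weight decay) penalty added to the loss, and an (inverted) dropout mask applied to the hidden activations during training. The values of \lambda and p below are arbitrary choices for illustration.

```python
import numpy as np

lam, p = 1e-3, 0.5
rng = np.random.default_rng(0)

# l2 regularization (weight decay): add lam * ||W||^2 to the loss, which adds
# 2 * lam * W to the gradient of W before each update.
def l2_penalty(W, lam=lam):
    return lam * np.sum(W ** 2)

# Dropout: during training, zero out each hidden unit independently with probability p
# (and rescale the survivors so the expected activation is unchanged).
def dropout(h, p=p, training=True):
    if not training:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)
```
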



Wide or Deep?

While a very wide single hidden layer can approximate any function, we often prefer many hidden layers

[Figure: the face-recognition network again, with K_1 = 100 low-level feature detectors (e.g., detect edges), K_2 = 32 higher-level feature detectors (e.g., parts of a face), and K_3 = 24 even higher-level feature detectors (which make classification easy)]

Higher layers help learn more directly useful/interpretable features (also useful for compressing data into a small number of features)



Kernel Methods vs Deep Neural Nets

Recall the prediction rule for a kernel method (e.g., kernel SVM):
y = \sum_{n=1}^{N} \alpha_n k(x_n, x)

This is analogous to a single-hidden-layer NN with fixed/pre-defined hidden nodes {k(x_n, x)}_{n=1}^{N} and output layer weights {\alpha_n}_{n=1}^{N}

The prediction rule for a deep neural network:
y = \sum_{k=1}^{K} v_k h_k

Here, the h_k's are learned from data (possibly after multiple layers of nonlinear transformations)

Both kernel methods and deep NNs can be seen as using nonlinear basis functions for making predictions. Kernel methods use fixed basis functions (defined by the kernel), whereas NNs learn the basis functions adaptively from data
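The contrast between the two prediction rules can be made concrete with a toy sketch; the RBF kernel, random weights, and dimensions below are all illustrative assumptions rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 3, 2
X_train = rng.standard_normal((N, D))
x = rng.standard_normal(D)

# Kernel method: fixed basis functions k(x_n, .) with learned weights alpha_n.
def rbf(u, w, gamma=1.0):
    return np.exp(-gamma * np.sum((u - w) ** 2))

alpha = rng.standard_normal(N)                       # would be learned (e.g., by a kernel SVM)
y_kernel = sum(alpha[n] * rbf(X_train[n], x) for n in range(N))

# Neural net: the basis functions h_k(x) themselves are learned (one hidden layer here).
W, v = rng.standard_normal((D, K)), rng.standard_normal(K)
h = np.tanh(W.T @ x)                                 # learned, adaptive basis functions
y_nn = v @ h
```
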
Deep Neural Nets for Unsupervised Learning

Can use neural nets for dimensionality reduction

A popular approach is to use autoencoders

Autoencoder: Compress the input and then uncompress it to reconstruct the input

An encoder (a neural net) does the compression and a decoder (another neural net) does the decompression

In an NN-based autoencoder, the target output is the input itself!
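A minimal autoencoder sketch in PyTorch follows; the layer sizes, ReLU encoder, and Adam optimizer are illustrative choices. The reconstruction target is the input itself, and the encoder output gives the compressed representation.

```python
import torch
import torch.nn as nn

D, H = 50, 5                                   # input dim and bottleneck (compressed) dim
encoder = nn.Sequential(nn.Linear(D, H), nn.ReLU())
decoder = nn.Sequential(nn.Linear(H, D))
model = nn.Sequential(encoder, decoder)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(128, D)                        # toy data

for step in range(200):
    opt.zero_grad()
    X_recon = model(X)                         # compress, then reconstruct
    loss = nn.functional.mse_loss(X_recon, X)  # target = the input itself
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = encoder(X)                         # H-dimensional learned representation
```
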



Deep Neural Nets: Some Comments

Highly effective in learning good feature representations from data in an "end-to-end" manner

The objective functions of these models are highly non-convex

Lots of recent work on non-convex optimization, so non-convexity doesn't scare us (that much) anymore

Training these models is computationally very expensive

But GPUs can help speed up many of the computations

Training these models can be tricky; a proper initialization matters especially

Several ways to intelligently initialize these models (e.g., unsupervised layer-wise pre-training)

Deep learning models can also be probabilistic and generative (we will look at some of this in the next class)


Next Class

Other types of deep neural networks, e.g.,


Convolutional Neural Networks (especially suited for images); not “fully connected”

Neural networks for sequence data (e.g., text)

Some optimization methods especially popular for neural networks

An overview of other recent advances in deep learning

