Intro to Machine Learning (CS771A), Lecture 20: Introduction to Deep Neural Networks (1)
Piyush Rai
Already seen several examples (linear regression, logistic regression, linear SVM, multi-output linear
regression, softmax regression, and several others)
This basic architecture is classically also known as the “Perceptron” (not to be confused with the
Perceptron “algorithm”, which learns a linear classification model)
However, this simple model can't learn complex functions (e.g., nonlinear decision boundaries)
We have already seen a way of handling nonlinearities: Kernel Methods (invented in the 90s)
Something existed in the pre-kernel methods era, too..
Neural Networks a.k.a. Multi-layer Perceptron (MLP)
Became very popular in early 80s and went to slumber in late 80s (people didn’t know how to train
them well), but woke up again in the late 2000s (now we do). Rechristened as “Deep Learning”
[Figure: a network with an input layer (D = 3 visible units) feeding a hidden layer (K = 2 hidden units)]
Each node (a.k.a. unit) in the hidden layer computes a nonlinear transform of inputs it receives
Hidden layer nodes act as “features” in the final layer (a linear model) to produce the output
The overall effect is a nonlinear mapping from inputs to outputs (justification later)
Illustration: A Neural Network with One Hidden Layer
[Figure: each hidden unit applies a linear model to the inputs to compute its "pre-activation", then passes it through a nonlinear "activation" function; the output layer is again a linear model on the hidden units]
More succinctly..
[Figure: compact view of the single hidden layer network]
We will only show h_nk to denote the value computed by the k-th hidden unit
Likewise, for the output layer, we will directly show the final output y_n
Each node in the hidden/output layers computes a linear transformation of its inputs and then applies a nonlinearity
Different layers can use different types of activations (the output layer may have none)
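To make this concrete, here is a minimal NumPy sketch of the forward pass of a single-hidden-layer network. It is only an illustration under assumed names: the output-layer weights v and the bias terms b, c are not named in the slides, and ReLU is just one possible choice of nonlinearity.

import numpy as np

def forward(X, W, b, v, c):
    # X: (N, D) inputs, W: (D, K) input-to-hidden weights, b: (K,) hidden biases
    # v: (K,) hidden-to-output weights, c: scalar output bias
    A = X @ W + b            # pre-activations a_nk
    H = np.maximum(0.0, A)   # hidden values h_nk, here using a ReLU nonlinearity
    y = H @ v + c            # output layer: a linear model on the learned features h_n
    return y, H

# Tiny usage example with D = 3 inputs and K = 2 hidden units, as in the earlier figure
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y, H = forward(X, rng.normal(size=(3, 2)), np.zeros(2), rng.normal(size=2), 0.0)
print(y.shape, H.shape)      # (5,) and (5, 2)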
MLPs Can Learn Nonlinear Functions: A Justification
[Figure: pairs of hidden units combined with output weights +1.0 and −1.0 are summed to build up the overall score function]
An MLP with a single, sufficiently wide hidden layer can approximate any continuous function to arbitrary accuracy (Hornik, 1991)
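As a hedged illustration of this idea (in the spirit of the figure above), the sketch below combines pairs of sigmoid hidden units with output weights +1.0 and -1.0 to form localized "bumps", and sums many such bumps to track a 1-D target; the sin target, the interval, and the steepness value are arbitrary choices for the demo.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bump(x, left, right, steep=100.0):
    # +1.0 * (shifted sigmoid) - 1.0 * (shifted sigmoid) ~ indicator of [left, right]
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

# Approximate a target function on [0, 1] by summing many narrow bumps,
# each scaled by the target's value at the bump's centre
target = np.sin
edges = np.linspace(-0.05, 1.05, 45)   # 44 bumps, i.e., 88 sigmoid hidden units
x = np.linspace(0.0, 1.0, 500)
approx = sum(target((l + r) / 2) * bump(x, l, r) for l, r in zip(edges[:-1], edges[1:]))
print("max abs error:", np.max(np.abs(approx - target(x))))   # shrinks as the layer gets wider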
Examples of some NN/MLP architectures
Note: w_dk is the weight of the edge between input layer node d and hidden layer node k
W = [w_1, ..., w_K], with w_k being the vector of weights incident on the k-th hidden unit
Each w_k acts as a "feature detector" or "filter" (and there are K such filters in the above NN)
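Putting this notation together (with g denoting the hidden layer's activation and v_1, ..., v_K the output-layer weights, the latter symbol being an assumption since the slides don't name it; bias terms omitted):

h_{nk} = g(\mathbf{w}_k^\top \mathbf{x}_n) = g\Big( \sum_{d=1}^{D} w_{dk}\, x_{nd} \Big), \quad k = 1, \dots, K
y_n = \sum_{k=1}^{K} v_k\, h_{nk}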
[Figure: a deep feedforward network with hidden layers 1, 2, ..., L]
Note: This (and also the previous simpler ones) is called a fully-connected feedforward network
Fully connected: All pairs of units between adjacent layers are connected to each other
Feedforward: No backward connections. Also, only nodes in adjacent layers are connected
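Here is a minimal sketch of the forward pass through such a fully-connected feedforward network, assuming one weight matrix and bias vector per layer (the names Ws, bs and the layer sizes are illustrative), ReLU activations in every hidden layer, and a plain linear output layer.

import numpy as np

def mlp_forward(X, Ws, bs):
    # Ws, bs: lists with one weight matrix / bias vector per layer (L hidden + 1 output)
    H = X
    for W, b in zip(Ws[:-1], bs[:-1]):   # hidden layers 1, ..., L
        H = np.maximum(0.0, H @ W + b)   # linear transformation + ReLU nonlinearity
    return H @ Ws[-1] + bs[-1]           # output layer: linear, no activation here

# Example: D = 5 inputs, hidden layers of sizes 100, 32, 24, and a scalar output
rng = np.random.default_rng(0)
sizes = [5, 100, 32, 24, 1]
Ws = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
print(mlp_forward(rng.normal(size=(4, 5)), Ws, bs).shape)   # (4, 1)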
Neural Networks are Feature Learners!
An NN (single/multiple hidden layers) tries to learn features that can predict the output well
A learned mapping, unlike kernel methods where the mapping was pre-defined by the choice of kernel
[Figure: a deep network with hidden layers of K1 = 100, K2 = 32, and K3 = 24 units; the middle-layer weights act as higher-level feature detectors (e.g., parts of a face) and the top-layer weights as even higher-level feature detectors that make classification easy]
Lowest layer weights detect generic features, higher level weights detect more specific features
Features learned in one layer are composed of features learned in the layer below
Why Are Neural Network Learned Features Helpful?
[Figure: without learned features the model can't capture subtle variations in the inputs; even a single hidden layer helps capture subtle variations in the inputs]
Basic idea: Start taking the derivatives from the output layer and proceed backwards (hence the name)
Using backprop in neural nets enables us to reuse previous computations efficiently
Learning Neural Networks via Backpropagation
Backprop iterates between a forward pass and a backward pass
[Figure: the forward pass and the backward pass through the network]
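To make the forward/backward pattern concrete, here is a hedged sketch of backprop for a single-hidden-layer regression network with squared-error loss; the ReLU hidden layer, the linear output, and the parameter names are assumptions for illustration, not the lecture's exact setup.

import numpy as np

def forward_backward(X, y, W, b, v, c):
    # Forward pass: compute and cache pre-activations A, hidden values H, prediction yhat
    A = X @ W + b
    H = np.maximum(0.0, A)
    yhat = H @ v + c
    loss = 0.5 * np.sum((yhat - y) ** 2)

    # Backward pass: start from d(loss)/d(yhat) and apply the chain rule backwards,
    # reusing the quantities cached during the forward pass
    d_yhat = yhat - y                 # shape (N,)
    d_v = H.T @ d_yhat                # gradient w.r.t. hidden-to-output weights
    d_c = d_yhat.sum()
    d_H = np.outer(d_yhat, v)         # propagate through the output layer
    d_A = d_H * (A > 0)               # through the ReLU (its derivative is 0 or 1)
    d_W = X.T @ d_A                   # gradient w.r.t. input-to-hidden weights
    d_b = d_A.sum(axis=0)
    return loss, (d_W, d_b, d_v, d_c)

The returned gradients can then drive a gradient-based update such as W = W - lr * d_W.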
[Figure: plots of the hidden value h versus the pre-activation a for the activation functions below]
Sigmoid: h = σ(a) = 1/(1 + exp(−a))
tanh (hyperbolic tangent): h = (exp(a) − exp(−a))/(exp(a) + exp(−a)) = 2σ(2a) − 1
ReLU (Rectified Linear Unit): h = max(0, a)
Leaky ReLU: h = max(βa, a), where β is a small positive number
Several others, e.g., Softplus h = log(1 + exp(a)), exponential ReLU, maxout, etc.
Sigmoid and tanh can have issues during backprop (saturating gradients; sigmoid is also not zero-centered)
ReLU/leaky ReLU are currently among the most popular (also cheap to compute)
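The activations above are easy to write down directly; a small NumPy sketch (the leaky-ReLU slope 0.01 is just an example value for β):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)                  # equals 2*sigmoid(2a) - 1

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):          # beta: a small positive number
    return np.maximum(beta * a, a)

def softplus(a):
    return np.log1p(np.exp(a))         # log(1 + exp(a)), a smooth approximation to ReLU

a = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, leaky_relu, softplus):
    print(f.__name__, np.round(f(a), 3))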
Neural Networks: Some Aspects..
Recall that each hidden unit "adds" a function to the overall function
Increasing the number of hidden units K lets the network learn more and more complex functions
Very large K seems to overfit. Should we instead prefer small K?
No! It is better to use a large K and regularize well. Here is a reason/justification:
A simple NN with small K will have a few local optima, some of which may be bad
A complex NN with large K will have many local optima, all equally good
Note: This interesting behavior of NNs has some theoretical justification (won't discuss it here)
We can also use multiple hidden layers (each sufficiently large) and regularize well
(Preventing) Overfitting in Neural Networks
Early stopping (traditionally used): Stop when validation error starts increasing
Dropout: Randomly remove units (with some probability p ∈ (0, 1)) during training; a minimal sketch follows the figure below
[Figure: dropout illustrated on a network with an input layer, two hidden layers, and an output layer]
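A minimal sketch of dropout applied to a layer's activations during training; the "inverted dropout" scaling used here (dividing by 1 - p so no rescaling is needed at test time) is a common convention, not something specified in the slides.

import numpy as np

def dropout(H, p, training=True, rng=np.random.default_rng()):
    # Randomly zero out each unit's activation with probability p during training;
    # surviving activations are scaled by 1/(1 - p), and the layer is untouched at test time
    if not training or p == 0.0:
        return H
    mask = rng.random(H.shape) >= p    # keep each unit with probability 1 - p
    return H * mask / (1.0 - p)

H = np.ones((2, 5))
print(dropout(H, p=0.5))               # roughly half the entries zeroed, the rest scaled to 2.0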
While a very wide single hidden layer can approximate any function, we often prefer many hidden layers
[Figure: the feature-detector hierarchy shown earlier (K1 = 100, K2 = 32, K3 = 24)]
Higher layers help learn more directly useful/interpretable features (also useful for compressing data using a small number of features)
A kernel method's prediction is analogous to a single hidden layer NN with fixed/pre-defined hidden nodes {k(x_n, x)}_{n=1}^N and output layer weights {α_n}_{n=1}^N
Here, in contrast, the h_k's are learned from data (possibly after multiple layers of nonlinear transformations)
Both kernel methods and deep NNs can be seen as using nonlinear basis functions for making predictions: kernel methods use fixed basis functions (defined by the kernel), whereas an NN learns the basis functions adaptively from data
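Writing the two prediction rules side by side makes the analogy explicit (g and v_k are assumed symbols for the hidden activation and the output-layer weights, as before):

Kernel method:   f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \, k(\mathbf{x}_n, \mathbf{x})    (fixed basis functions, one per training point)
Neural network:  f(\mathbf{x}) = \sum_{k=1}^{K} v_k \, h_k(\mathbf{x}), \;\; h_k(\mathbf{x}) = g(\mathbf{w}_k^\top \mathbf{x})    (K basis functions learned from data)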
Deep Neural Nets for Unsupervised Learning
Highly effective in learning good feature representations from data in an “end-to-end” manner
The objective functions of these models are highly non-convex
Lots of recent work on non-convex optimization, so non-convexity doesn't scare us (that much) anymore
Deep learning models can also be probabilistic and generative (will look at some of it in next class)