Intro to Machine Learning (CS771A), Lecture 20: Introduction to Deep Neural Networks (1)
Piyush Rai
Already seen several examples (linear regression, logistic regression, linear SVM, multi-output linear
regression, softmax regression, and several others)
This basic architecture is classically also known as the “Perceptron” (not to be confused with the
Perceptron “algorithm”, which learns a linear classification model)
However, this simple model can't learn complex functions (e.g., nonlinear decision boundaries)
We have already seen a way of handling nonlinearities: Kernel Methods (invented in the 90s)
Something existed in the pre-kernel methods era, too..
Neural Networks a.k.a. Multi-layer Perceptron (MLP)
Became very popular in early 80s and went to slumber in late 80s (people didn’t know how to train
them well), but woke up again in the late 2000s (now we do). Rechristened as “Deep Learning”
[Figure: a network with an input layer (D = 3 visible units) feeding a hidden layer (K = 2 hidden units)]
Each node (a.k.a. unit) in the hidden layer computes a nonlinear transform of inputs it receives
Hidden layer nodes act as “features” in the final layer (a linear model) to produce the output
The overall effect is a nonlinear mapping from inputs to outputs (justification later)
Illustration: A Neural Network with One Hidden Layer
[Figure: each hidden unit applies a linear model to the inputs to compute its "pre-activation", then passes it through a nonlinear "activation" function; the output layer is again a linear model on the hidden units]
More succinctly..
[Figure: compact view of the single hidden layer network]
We will only show h_nk to denote the value computed by the k-th hidden unit
Likewise, for the output layer, we will directly show the final output y_n
Each node in the hidden/output layers computes a linear transformation of its inputs and then applies a nonlinearity
Different layers can use different types of activations (the output layer may have none)
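To make this concrete, here is a minimal NumPy sketch of the forward pass of a single-hidden-layer network. It is only an illustration under assumed names: the output-layer weights v and the bias terms b, c are not named in the slides, and ReLU is just one possible choice of nonlinearity.

import numpy as np

def forward(X, W, b, v, c):
    # X: (N, D) inputs, W: (D, K) input-to-hidden weights, b: (K,) hidden biases
    # v: (K,) hidden-to-output weights, c: scalar output bias
    A = X @ W + b            # pre-activations a_nk
    H = np.maximum(0.0, A)   # hidden values h_nk, here using a ReLU nonlinearity
    y = H @ v + c            # output layer: a linear model on the learned features h_n
    return y, H

# Tiny usage example with D = 3 inputs and K = 2 hidden units, as in the earlier figure
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y, H = forward(X, rng.normal(size=(3, 2)), np.zeros(2), rng.normal(size=2), 0.0)
print(y.shape, H.shape)      # (5,) and (5, 2)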
MLPs Can Learn Nonlinear Functions: A Justification
[Figure: pairs of hidden units combined with output weights +1.0 and −1.0 are summed to build up the overall score function]
An MLP with a single, sufficiently wide hidden layer can approximate any continuous function to arbitrary accuracy (Hornik, 1991)
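As a hedged illustration of this idea (in the spirit of the figure above), the sketch below combines pairs of sigmoid hidden units with output weights +1.0 and -1.0 to form localized "bumps", and sums many such bumps to track a 1-D target; the sin target, the interval, and the steepness value are arbitrary choices for the demo.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bump(x, left, right, steep=100.0):
    # +1.0 * (shifted sigmoid) - 1.0 * (shifted sigmoid) ~ indicator of [left, right]
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

# Approximate a target function on [0, 1] by summing many narrow bumps,
# each scaled by the target's value at the bump's centre
target = np.sin
edges = np.linspace(-0.05, 1.05, 45)   # 44 bumps, i.e., 88 sigmoid hidden units
x = np.linspace(0.0, 1.0, 500)
approx = sum(target((l + r) / 2) * bump(x, l, r) for l, r in zip(edges[:-1], edges[1:]))
print("max abs error:", np.max(np.abs(approx - target(x))))   # shrinks as the layer gets wider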
Examples of some NN/MLP architectures
Note: w_dk is the weight of the edge between input layer node d and hidden layer node k
W = [w_1, ..., w_K], with w_k being the vector of weights incident on the k-th hidden unit
Each w_k acts as a "feature detector" or "filter" (and there are K such filters in the above NN)
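Putting this notation together (with g denoting the hidden layer's activation and v_1, ..., v_K the output-layer weights, the latter symbol being an assumption since the slides don't name it; bias terms omitted):

h_{nk} = g(\mathbf{w}_k^\top \mathbf{x}_n) = g\Big( \sum_{d=1}^{D} w_{dk}\, x_{nd} \Big), \quad k = 1, \dots, K
y_n = \sum_{k=1}^{K} v_k\, h_{nk}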
[Figure: a deep feedforward network with hidden layers 1, 2, ..., L]
Note: This (and also the previous simpler ones) is called a fully-connected feedforward network
Fully connected: All pairs of units between adjacent layers are connected to each other
Feedforward: No backward connections. Also, only nodes in adjacent layers are connected
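Here is a minimal sketch of the forward pass through such a fully-connected feedforward network, assuming one weight matrix and bias vector per layer (the names Ws, bs and the layer sizes are illustrative), ReLU activations in every hidden layer, and a plain linear output layer.

import numpy as np

def mlp_forward(X, Ws, bs):
    # Ws, bs: lists with one weight matrix / bias vector per layer (L hidden + 1 output)
    H = X
    for W, b in zip(Ws[:-1], bs[:-1]):   # hidden layers 1, ..., L
        H = np.maximum(0.0, H @ W + b)   # linear transformation + ReLU nonlinearity
    return H @ Ws[-1] + bs[-1]           # output layer: linear, no activation here

# Example: D = 5 inputs, hidden layers of sizes 100, 32, 24, and a scalar output
rng = np.random.default_rng(0)
sizes = [5, 100, 32, 24, 1]
Ws = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
print(mlp_forward(rng.normal(size=(4, 5)), Ws, bs).shape)   # (4, 1)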
Neural Networks are Feature Learners!
An NN (single/multiple hidden layers) tries to learn features that can predict the output well
A learned mapping, unlike kernel methods where the mapping was pre-defined by the choice of kernel
[Figure: a deep network with hidden layers of K1 = 100, K2 = 32, and K3 = 24 units; the middle-layer weights act as higher-level feature detectors (e.g., parts of a face) and the top-layer weights as even higher-level feature detectors that make classification easy]
Lowest layer weights detect generic features, higher level weights detect more specific features
Features learned in one layer are composed of features learned in the layer below
Why Are Neural Network Learned Features Helpful?
[Figure: without learned features the model can't capture subtle variations in the inputs; even a single hidden layer helps capture subtle variations in the inputs]
Basic idea: Start taking the derivatives from the output layer and proceed backwards (hence the name)
Using backprop in neural nets enables us to reuse previous computations efficiently
Learning Neural Networks via Backpropagation
Backprop iterates between a forward pass and a backward pass
[Figure: the forward pass and the backward pass through the network]
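To make the forward/backward pattern concrete, here is a hedged sketch of backprop for a single-hidden-layer regression network with squared-error loss; the ReLU hidden layer, the linear output, and the parameter names are assumptions for illustration, not the lecture's exact setup.

import numpy as np

def forward_backward(X, y, W, b, v, c):
    # Forward pass: compute and cache pre-activations A, hidden values H, prediction yhat
    A = X @ W + b
    H = np.maximum(0.0, A)
    yhat = H @ v + c
    loss = 0.5 * np.sum((yhat - y) ** 2)

    # Backward pass: start from d(loss)/d(yhat) and apply the chain rule backwards,
    # reusing the quantities cached during the forward pass
    d_yhat = yhat - y                 # shape (N,)
    d_v = H.T @ d_yhat                # gradient w.r.t. hidden-to-output weights
    d_c = d_yhat.sum()
    d_H = np.outer(d_yhat, v)         # propagate through the output layer
    d_A = d_H * (A > 0)               # through the ReLU (its derivative is 0 or 1)
    d_W = X.T @ d_A                   # gradient w.r.t. input-to-hidden weights
    d_b = d_A.sum(axis=0)
    return loss, (d_W, d_b, d_v, d_c)

The returned gradients can then drive a gradient-based update such as W = W - lr * d_W.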
[Figure: plots of the hidden value h versus the pre-activation a for the activation functions below]
Sigmoid: h = σ(a) = 1/(1 + exp(−a))
tanh (hyperbolic tangent): h = (exp(a) − exp(−a))/(exp(a) + exp(−a)) = 2σ(2a) − 1
ReLU (Rectified Linear Unit): h = max(0, a)
Leaky ReLU: h = max(βa, a), where β is a small positive number
Several others, e.g., Softplus h = log(1 + exp(a)), exponential ReLU, maxout, etc.
Sigmoid and tanh can have issues during backprop (saturating gradients; sigmoid is also not zero-centered)
ReLU/leaky ReLU are currently among the most popular (also cheap to compute)
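The activations above are easy to write down directly; a small NumPy sketch (the leaky-ReLU slope 0.01 is just an example value for β):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)                  # equals 2*sigmoid(2a) - 1

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):          # beta: a small positive number
    return np.maximum(beta * a, a)

def softplus(a):
    return np.log1p(np.exp(a))         # log(1 + exp(a)), a smooth approximation to ReLU

a = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, relu, leaky_relu, softplus):
    print(f.__name__, np.round(f(a), 3))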
Neural Networks: Some Aspects..
Recall that each hidden unit "adds" a function to the overall function
Increasing the number of hidden units K lets the network learn more and more complex functions
Very large K seems to overfit. Should we instead prefer small K?
No! It is better to use a large K and regularize well. Here is a reason/justification:
A simple NN with small K will have a few local optima, some of which may be bad
A complex NN with large K will have many local optima, all equally good
Note: This interesting behavior of NNs has some theoretical justification (won't discuss it here)
We can also use multiple hidden layers (each sufficiently large) and regularize well
(Preventing) Overfitting in Neural Networks
Early stopping (traditionally used): Stop when validation error starts increasing
Dropout: Randomly remove units (with some probability p ∈ (0, 1)) during training; a minimal sketch follows the figure below
[Figure: dropout illustrated on a network with an input layer, two hidden layers, and an output layer]
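A minimal sketch of dropout applied to a layer's activations during training; the "inverted dropout" scaling used here (dividing by 1 - p so no rescaling is needed at test time) is a common convention, not something specified in the slides.

import numpy as np

def dropout(H, p, training=True, rng=np.random.default_rng()):
    # Randomly zero out each unit's activation with probability p during training;
    # surviving activations are scaled by 1/(1 - p), and the layer is untouched at test time
    if not training or p == 0.0:
        return H
    mask = rng.random(H.shape) >= p    # keep each unit with probability 1 - p
    return H * mask / (1.0 - p)

H = np.ones((2, 5))
print(dropout(H, p=0.5))               # roughly half the entries zeroed, the rest scaled to 2.0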
While a very wide single hidden layer can approximate any function, we often prefer many hidden layers
[Figure: the feature-detector hierarchy shown earlier (K1 = 100, K2 = 32, K3 = 24)]
Higher layers help learn more directly useful/interpretable features (also useful for compressing data using a small number of features)
A kernel method's prediction is analogous to a single hidden layer NN with fixed/pre-defined hidden nodes {k(x_n, x)}_{n=1}^N and output layer weights {α_n}_{n=1}^N
Here, in contrast, the h_k's are learned from data (possibly after multiple layers of nonlinear transformations)
Both kernel methods and deep NNs can be seen as using nonlinear basis functions for making predictions: kernel methods use fixed basis functions (defined by the kernel), whereas an NN learns the basis functions adaptively from data
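Writing the two prediction rules side by side makes the analogy explicit (g and v_k are assumed symbols for the hidden activation and the output-layer weights, as before):

Kernel method:   f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \, k(\mathbf{x}_n, \mathbf{x})    (fixed basis functions, one per training point)
Neural network:  f(\mathbf{x}) = \sum_{k=1}^{K} v_k \, h_k(\mathbf{x}), \;\; h_k(\mathbf{x}) = g(\mathbf{w}_k^\top \mathbf{x})    (K basis functions learned from data)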
Deep Neural Nets for Unsupervised Learning
Highly effective in learning good feature representations from data in an “end-to-end” manner
The objective functions of these models are highly non-convex
Lots of recent work on non-convex optimization, so non-convexity doesn't scare us (that much) anymore
Deep learning models can also be probabilistic and generative (will look at some of it in next class)