
Multilayer neural networks
Agenda

 Some historical notes

 Multilayer neural networks

 Backpropagation

 Error surfaces and feature mapping

 Speed ups

 Conclusions
Some historical notes
Rosenblatt’s perceptron 1958: a single neuron
with adjustable synaptic weights and bias
Perceptron convergence theorem (Rosenblatt,
1962): if the patterns used to train the
perceptron are drawn from two linearly
separable classes, then the algorithm
converges and positions the decision surface as
a hyperplane between the two classes.
The limitations of networks implementing linear
discriminants were well known in the 1950s and
1960s.
Some historical notes
 A single neuron -> the class of solutions that
can be obtained is not general enough ->
multilayer perceptron

 Widrow & Hoff (1960): least-mean-square
algorithm (LMS) = delta rule; three-layer
networks designed by hand (fixed input-to-
hidden weights, trained hidden-to-output
weights).
 The development of backpropagation: Kalman
filtering (Kalman 1960; Bryson, Denham, and
Dreyfus 1963, 1969).
Some historical notes
 Werbos: Beyond Regression: New
Tools for Prediction and Analysis in
the Behavioral Sciences, Ph.D. thesis,
Harvard University, 1974
 Parker 1982, 1985
 Rumelhart, Hinton, Williams: Learning
representations by back-propagating
errors, Nature 323, pp. 533-536, 1986
Multilayer Neural Networks
 Input layer
 Hidden layer
 Output layer
 Bias unit = a neuron with constant output 1
 net_j = w_j^t x
 y_j = f(net_j)
 f(net) = sgn(net)
 z_k = f(net_k)
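A minimal Python sketch of this feedforward computation (the use of NumPy, the sign nonlinearity for every unit, and the appended bias inputs are illustrative assumptions, not prescribed by the slides):

import numpy as np

def forward(x, W_hidden, W_output, f=np.sign):
    # x: input vector; a constant 1 is appended for the bias unit
    x = np.append(x, 1.0)
    net_j = W_hidden @ x                    # net_j = w_j^t x
    y = f(net_j)                            # y_j = f(net_j)
    net_k = W_output @ np.append(y, 1.0)    # bias unit for the output layer
    z = f(net_k)                            # z_k = f(net_k)
    return z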
Multilayer Neural Networks
 The XOR problem:
(x1 OR x2) AND NOT (x1 AND x2)
 y1: decision boundary x1 + x2 + 0.5 = 0;
y1 = 1 if x1 + x2 + 0.5 > 0,
otherwise -1
 y2: decision boundary x1 + x2 - 1.5 = 0;
y2 = 1 if x1 + x2 - 1.5 > 0,
otherwise -1
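A small sketch of this XOR network with +/-1 inputs and sgn units; the hidden-unit weights are the ones above, while the output weights (0.7, -0.4, bias -1) are one workable choice assumed here, not taken from the slide:

import numpy as np

def xor_net(x1, x2):
    y1 = np.sign(x1 + x2 + 0.5)     # OR unit:  x1 + x2 + 0.5 > 0
    y2 = np.sign(x1 + x2 - 1.5)     # AND unit: x1 + x2 - 1.5 > 0
    # Output unit realizing (x1 OR x2) AND NOT (x1 AND x2)
    return np.sign(0.7 * y1 - 0.4 * y2 - 1.0)

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, int(xor_net(a, b)))     # prints -1, 1, 1, -1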
Multilayer Neural Networks
 Any desired
continuous function
can be implemented by
a three-layer network
given a sufficient number
of hidden units, proper
nonlinearities, and
weights (Kolmogorov)
 Feedforward
 Learning
 (Demo chapter 11)
Multilayer Neural Networks
 Nonlinear multilayer
networks

 How to set the
weights of a three-layer
neural network
based on training
patterns and the
desired output?
Backpropagation Algorithm

 The LMS
algorithm exists
for linear
systems
 Training error: J(w) = ½ Σ_k (t_k - z_k)²
 Learning rule: gradient descent on J(w),
Δw = -η ∂J/∂w
Backpropagation Algorithm
 Learning rule
 Hidden-to-output: Δw_kj = η δ_k y_j,
where δ_k = (t_k - z_k) f'(net_k)
 Input-to-hidden: Δw_ji = η δ_j x_i,
where δ_j = f'(net_j) Σ_k w_kj δ_k
(see the sketch below)
 Note that the weights w_ij are
initialized with random values
 Demo Chapter 11
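A minimal sketch of one backpropagation update implementing these two rules, assuming tanh units and a squared-error criterion (the layer shapes, learning rate, and bias handling are illustrative):

import numpy as np

def backprop_step(x, t, W_ji, W_kj, eta=0.1):
    # Forward pass
    x = np.append(x, 1.0)                   # bias unit on the input layer
    y = np.tanh(W_ji @ x)                   # hidden activations y_j
    y_b = np.append(y, 1.0)                 # bias unit on the hidden layer
    z = np.tanh(W_kj @ y_b)                 # outputs z_k

    # Hidden-to-output rule: delta_k = (t_k - z_k) f'(net_k)
    delta_k = (t - z) * (1.0 - z**2)
    # Input-to-hidden rule: delta_j = f'(net_j) * sum_k w_kj delta_k
    delta_j = (1.0 - y**2) * (W_kj[:, :-1].T @ delta_k)

    W_kj += eta * np.outer(delta_k, y_b)    # update hidden-to-output weights
    W_ji += eta * np.outer(delta_j, x)      # update input-to-hidden weights
    return 0.5 * np.sum((t - z)**2)         # training error for this pattern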
Backpropagation Algorithm
 Compare with the LMS
algorithm
 1) Method of Steepest
Descent
 The direction of
steepest descent is
opposite to the
gradient vector g = ∇E(w)
 w(n+1) = w(n) - ηg(n)
 η is the stepsize or
learning-rate parameter
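A tiny sketch of the update w(n+1) = w(n) - ηg(n); the quadratic example function and the learning rate are assumptions for illustration:

import numpy as np

def steepest_descent(w, grad_E, eta=0.1, n_steps=100):
    for _ in range(n_steps):
        w = w - eta * grad_E(w)     # step opposite to the gradient g = grad E(w)
    return w

# Example: E(w) = ||w||^2 has gradient 2w; the iterate approaches the minimum at 0.
print(steepest_descent(np.array([1.0, -2.0]), lambda w: 2.0 * w))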
Backpropagation Algorithm
 Training set = a set of patterns with known
labels
 Stochastic training = patterns are chosen
randomly from the training set
 Batch training = all patterns are presented to
the network before learning takes place
 On-line protocol = each pattern is presented
once and only once, no memory in use
 Epoch = a single presentation of all patterns in
the training set. The number of epochs is an
indication of the relative amount of learning.
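A sketch contrasting the stochastic and batch protocols; update_on_pattern and update_on_batch are placeholder callbacks standing in for a weight-update rule such as backpropagation:

import numpy as np

def train_stochastic(X, T, update_on_pattern, n_epochs, seed=0):
    # Stochastic training: patterns are chosen randomly from the training set.
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):                 # one epoch = n presentations
        for i in rng.permutation(len(X)):
            update_on_pattern(X[i], T[i])

def train_batch(X, T, update_on_batch, n_epochs):
    # Batch training: all patterns are presented before the weights change.
    for _ in range(n_epochs):
        update_on_batch(X, T)                 # one update per epoch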
Backpropagation Algorithm
 Stopping criterion
Backpropagation Algorithm

 Learning set
 Validation set
 Test set
 Stopping criterion (see the sketch below)
 Learning curve: the
average error per
pattern
 Cross-validation
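A sketch of a validation-based stopping rule; train_one_epoch and average_validation_error are placeholder callbacks, and the patience threshold is an assumed choice:

def train_with_validation(train_one_epoch, average_validation_error,
                          max_epochs=1000, patience=10):
    best_err, best_epoch, learning_curve = float("inf"), 0, []
    for epoch in range(max_epochs):
        train_one_epoch()                        # one pass over the learning set
        err = average_validation_error()         # average error per pattern
        learning_curve.append(err)
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:     # stopping criterion
            break
    return learning_curve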
Error Surfaces and Feature
Mapping
 Note that error backpropagation is based
on gradient descent in a criterion
function J(w), which can be represented
as an error surface over the weight space.
Error Surfaces and Feature
Mapping
 The total training
error is minimized.
 It usually
decreases
monotonically,
even though this is
not the case for the
error on each
individual pattern.
Error Surfaces and Feature
Mapping
 Hidden-to-output
weights ~ a linear
discriminant
 Input-to-hidden
weights ~ ”matched
filter”
Practical Techniques for
Improving Backpropagation
 How to improve
convergence,
performance, and
results?
 Neuron: use a sigmoid
transfer function that is
centered at zero and
antisymmetric
Practical Techniques for
Improving Backpropagation
Scaling input variables
= the input patterns should be shifted so
that the average over the training set of
each feature is zero.
= the full data set should be scaled to have
the same variance in each feature
component
Note, this standardization can only be done
for stochastic and batch learning!
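A minimal sketch of this standardization (columns of X are the features; the same shift and scale must be reused on any later data):

import numpy as np

def standardize(X):
    mean = X.mean(axis=0)            # shift each feature to zero mean
    std = X.std(axis=0)              # scale each feature to unit variance
    return (X - mean) / std, mean, std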
Practical Techniques for
Improving Backpropagation
 When the training set is small, one can generate
surrogate training patterns (see the sketch below).
 In the absence of problem-specific information,
the surrogate patterns should be made by
adding d-dimensional Gaussian noise to true
training points, the category label should be left
unchanged.
 If we know about the source of variation among
patterns we can manufacture training data.
 The number of hidden units should be less than
the total number of training points n, say roughly
n/10.
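A sketch of generating surrogate patterns by adding d-dimensional Gaussian noise to the true training points; the noise level sigma and the number of copies are assumed parameters:

import numpy as np

def add_surrogate_patterns(X, labels, n_copies=3, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    noisy = [X + rng.normal(0.0, sigma, X.shape) for _ in range(n_copies)]
    X_aug = np.vstack([X] + noisy)
    labels_aug = np.concatenate([labels] * (n_copies + 1))   # labels unchanged
    return X_aug, labels_aug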
Practical Techniques for
Improving Backpropagation
 We cannot initialize the weights to 0.
 Initializing weights = uniform learning ->
choose weights randomly from a
single uniform distribution
 Input-to-hidden weights: -1/√d < w_ji < +1/√d,
where d is the number of input units
 Hidden-to-output weights: -1/√n_H < w_kj < +1/√n_H,
where n_H is the number of hidden units
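A sketch of this uniform initialization (folding the bias weights into the same ranges is an assumption made here):

import numpy as np

def init_weights(d, n_H, c, seed=0):
    rng = np.random.default_rng(seed)
    # Input-to-hidden: uniform in (-1/sqrt(d), +1/sqrt(d)), d = number of inputs
    W_ji = rng.uniform(-1.0, 1.0, (n_H, d + 1)) / np.sqrt(d)
    # Hidden-to-output: uniform in (-1/sqrt(n_H), +1/sqrt(n_H))
    W_kj = rng.uniform(-1.0, 1.0, (c, n_H + 1)) / np.sqrt(n_H)
    return W_ji, W_kj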
Practical Techniques for
Improving Backpropagation
 Learning Rates
 Demo Chapter 9
 The optimal rate
Practical Techniques for
Improving Backpropagation
 Momentum: allows
the network to learn
more quickly when
plateaus in the
error surface exist.
Demo chapter 12.
 A typical momentum value is α ≈ 0.9.
Practical Techniques for
Improving Backpropagation
 Weight decay to avoid overfitting = start
with a network with too many
weights and decay all the weights
during training: w_new = w_old(1 - ε),
where 0 < ε < 1
 The weights that are not needed for
reducing the error function become
smaller and smaller -> eliminated
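A sketch of the decay step applied after each weight update; the value of ε is an assumed choice:

def decay_weights(W, eps=1e-3):
    # w_new = w_old * (1 - eps), 0 < eps < 1: unneeded weights shrink toward zero
    return W * (1.0 - eps)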
Practical Techniques for
Improving Backpropagation
 If we have insufficient
training data for the
desired classification
accuracy, learning
with hints can help:
add output units for
addressing an ancillary
problem, one different
from but related to the
specific classification
problem at hand.
Practical Techniques for
Improving Backpropagation
 Stopped Training
 Number of hidden layers: typically 2-3.
 Criterion function: cross entropy,
Minkowski error
Practical Techniques for
Improving Backpropagation
 Second-order methods
to speed up
learning:
 Newton’s Method,
demo chapter 9
 Quickprop
 Conjugate Gradient
Descent requires batch
training, demo chapter
9
Practical Techniques for
Improving Backpropagation
 2) Newton's method
 The idea is to minimize the quadratic
approximation of the cost function E(w) around
the current point w(n),
using a second-order Taylor series expansion
of the cost function around the point w(n):
 ΔE(w(n)) ≈ g^T(n)Δw(n) + ½Δw^T(n)H(n)Δw(n)
 g(n) is the m-by-1 gradient vector of the cost
function E(w) evaluated at the point w(n). The
matrix H(n) is the m-by-m Hessian matrix of
E(w) (second derivatives), H = ∇²E(w)
Practical Techniques for
Improving Backpropagation
 H = ∇²E(w) requires the cost function E(w) to
be twice continuously differentiable with
respect to the elements of w.
 Differentiating ΔE(w(n)) ≈ g^T(n)Δw(n) + ½Δw^T(n)H(n)Δw(n)
with respect to Δw, the change ΔE(n) is minimized when
 g(n) + H(n)Δw(n) = 0 -> Δw(n) = -H⁻¹(n)g(n)
 w(n+1) = w(n) + Δw(n)
 w(n+1) = w(n) - H⁻¹(n)g(n)
 where H⁻¹(n) is the inverse of the Hessian of
E(w).
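A sketch of one Newton update; grad and hessian are placeholder callables returning g(n) and H(n), and H is assumed positive definite so the linear system can be solved:

import numpy as np

def newton_step(w, grad, hessian):
    g = grad(w)
    H = hessian(w)
    # w(n+1) = w(n) - H^{-1}(n) g(n), solving H dw = g rather than inverting H
    return w - np.linalg.solve(H, g)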
Practical Techniques for
Improving Backpropagation
 Newton's method converges quickly
asymptotically and does not exhibit the
zigzagging behavior of steepest descent.
 Newton’s method requires that the
Hessian H(n) has to be a positive
definite matrix for all n!
Practical Techniques for
Improving Backpropagation
 3) Gauss-Newton method
 It is applicable to a cost function that is expressed
as the sum of error squares:
 E(w) = ½ Σ_{i=1}^{n} e²(i); note that all the error terms are
calculated on the basis of a weight vector w that is
fixed over the entire observation interval 1 ≤ i ≤ n.
 The error signal e(i) is a function of the adjustable
weight vector w. Given an operating point w(n), we
linearize the dependence of e(i) on w by writing
e'(i,w) = e(i) + [∂e(i)/∂w]^T_{w=w(n)} (w - w(n)), i = 1, 2, ..., n
Practical Techniques for
Improving Backpropagation
e'(n,w) = e(n) + J(n)(w - w(n)),
where e(n) is the error vector e(n) =
[e(1), e(2), ..., e(n)]^T and J(n) is the n-by-m
Jacobian matrix of e(n) (the Jacobian J(n) is
the transpose of the m-by-n gradient matrix
∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)]).
w(n+1) = arg min_w {½||e'(n,w)||²}, where
½||e'(n,w)||² = ½||e(n)||² + e^T(n)J(n)(w - w(n)) +
½(w - w(n))^T J^T(n)J(n)(w - w(n)).
Differentiating this expression with respect to w and
setting the result equal to zero gives
Practical Techniques for
Improving Backpropagation
J^T(n)e(n) + J^T(n)J(n)(w - w(n)) = 0
w(n+1) = w(n) - (J^T(n)J(n))⁻¹ J^T(n)e(n)
The Gauss-Newton method requires only the Jacobian
matrix of the error vector e(n).
For the Gauss-Newton iteration to be
computable, the matrix product J^T(n)J(n) must
be nonsingular. J^T(n)J(n) is always nonnegative
definite; to ensure that it is nonsingular, the
Jacobian J(n) must have row rank n. -> add
the diagonal matrix δI to the matrix J^T(n)J(n),
where the parameter δ is a small positive constant.
Practical Techniques for
Improving Backpropagation
 J^T(n)J(n) + δI is positive definite for all n.
 -> The Gauss-Newton method is then
implemented in the following form:
w(n+1) = w(n) - (J^T(n)J(n) + δI)⁻¹ J^T(n)e(n)
 This is the solution to the modified cost
function:
 E(w) = ½ { δ||w - w(0)||² + Σ_{i=1}^{n} e²(i) },
 where w(0) is the initial value of w.
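A sketch of this modified Gauss-Newton update, given the error vector e(n) and its Jacobian J(n); δ is the small positive constant above:

import numpy as np

def gauss_newton_step(w, e, J, delta=1e-3):
    # w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e(n)
    A = J.T @ J + delta * np.eye(J.shape[1])
    return w - np.linalg.solve(A, J.T @ e)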
Regularization, Complexity
Adjustment and Pruning
 If we have too many
weights and train too
long -> overfitting
 Wald Statistic: we can
estimate the
importance of a
parameter in a model
 Optimal Brain
Damage and Optimal
Brain Surgeon:
eliminate the
parameters having the
least importance
Summary
 Multilayer nonlinear neural networks trained
by gradient descent methods such as
backpropagation perform a maximum-
likelihood estimation of the weight values in
the model defined by the network topology.
 f(net) at the hidden units allows the
network to form an arbitrary decision
boundary, so long as there are sufficiently
many hidden units.
