Neural Network BSC


Artificial Neural Networks

Overview
1. Introduction
2. ANN representations
3. Perceptron Training
4. Gradient Descent and Delta Rule
5. Multilayer networks and Backpropagation algorithm
6. Remarks on the backpropagation algorithm
7. An illustrative example: face recognition
8. Advanced topics in artificial neural networks
Introduction

- Human brain: densely interconnected network
of ~10^11 neurons, each connected to ~10^4 others
(neuron switching time: approx. 10^-3 sec.)

- Properties of artificial neural nets (ANN’s):


• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
Appropriate problems for neural network learning

• Input is high-dimensional, discrete or real-valued
(e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Long training times are acceptable
• Fast evaluation of the learned function is required
• Not important for humans to understand the learned weights

Examples:
• Speech phoneme recognition
• Image classification
• Financial prediction
Appropriate problems for neural network learning

- ALVINN drives at 70 mph on highways


- The ALVINN system uses the backpropagation algorithm to learn to steer
an autonomous vehicle driving at speeds up to 70 miles per hour
Perceptron

• Input values → Linear weighted sum → Threshold


Activation Functions

1. Identity Function

    f(x) = x for all x

2. Binary Step function (with threshold θ)

    f(x) = 1 if x ≥ θ
           0 if x < θ

3. Bipolar Step function (with threshold θ)

    f(x) =  1 if x ≥ θ
           -1 if x < θ
Activation Functions

4. Sigmoidal Functions: continuous functions

(a) Binary Sigmoidal

    f(x) = 1 / (1 + e^(-x))

    f'(x) = f(x) [1 - f(x)]

(b) Bipolar Sigmoidal

    f(x) = 2 / (1 + e^(-x)) - 1 = (1 - e^(-x)) / (1 + e^(-x))

    f'(x) = (1/2) [1 + f(x)] [1 - f(x)]
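The functions and derivatives above can be sketched directly in code. This is a minimal illustration (function names and the θ default are my own choices, not from the slides):

```python
import math

# Common ANN activation functions, following the slides;
# theta is the threshold used by the step functions.

def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_sigmoid_deriv(x):
    f = binary_sigmoid(x)
    return f * (1 - f)          # f'(x) = f(x)[1 - f(x)]

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def bipolar_sigmoid_deriv(x):
    f = bipolar_sigmoid(x)
    return 0.5 * (1 + f) * (1 - f)   # f'(x) = (1/2)[1 + f(x)][1 - f(x)]
```

Note that both sigmoid derivatives can be computed from the function value alone, which is what makes them convenient for backpropagation.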
Activation Functions
Activation functions:

(A) Identity

(B) Binary step

(C) Bipolar step

(D) Binary sigmoidal

(E) Bipolar sigmoidal

(F) Ramp
Decision surface of a perceptron

• Representational power of perceptrons


- Linearly separable case like (a):
possible to classify with a hyperplane
- Linearly inseparable case like (b):
impossible for a single perceptron to classify
Perceptron training rule and delta rule

wi ← wi + Δwi
where Δwi = η (t – o) xi
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate
Can prove the rule will converge
• If the training data is linearly separable
• and η is sufficiently small
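The rule above can be sketched as a short program. Here it is trained on the logical AND function, a linearly separable problem (the AND task, learning rate, and epoch count are illustrative assumptions):

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i,
# with the bias treated as a weight on a constant input of 1.

def perceptron_output(weights, bias, x):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else 0

def train_perceptron(examples, eta=0.1, epochs=50):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(weights, bias, x)
            for i in range(len(weights)):
                weights[i] += eta * (t - o) * x[i]   # the training rule
            bias += eta * (t - o)
    return weights, bias

# Logical AND: linearly separable, so the rule is guaranteed to converge
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_examples)
```

On a linearly inseparable problem such as XOR, this same loop would never settle on weights that classify every example correctly.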
Gradient descent
Hypothesis Space

- Error of different hypotheses


- For a linear unit with two weights, the hypothesis space H is the w0, w1
plane.
- The error surface is parabolic with a single global minimum (we
seek the hypothesis with minimum error).
Derivation of gradient descent

• Gradient descent
- Error (over all training examples):

    E(w) = (1/2) Σd∈D (td – od)²

- The gradient of E (vector of partial derivatives):

    ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

- Direction: steepest increase in E.

- Thus, the training rule is:

    Δw = –η ∇E(w)

  (The negative sign: the direction that decreases E)

Derivation of gradient descent

where xid denotes the single input component xi for training example d:

    ∂E/∂wi = Σd∈D (td – od)(–xid)

- The weight update rule for gradient descent:

    Δwi = η Σd∈D (td – od) xid



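The batch update rule above can be sketched for a one-input linear unit o = w0 + w1·x. The toy data and hyperparameters below are assumptions for illustration only:

```python
# Batch gradient descent for a linear unit, minimizing
# E = (1/2) * sum_d (t_d - o_d)^2 via dw_i = eta * sum_d (t_d - o_d) * x_id.

def train_linear_unit(data, eta=0.01, epochs=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        dw0 = dw1 = 0.0
        for x, t in data:            # accumulate the gradient over all of D
            o = w0 + w1 * x
            dw0 += eta * (t - o)     # x_0 = 1 (bias input)
            dw1 += eta * (t - o) * x
        w0 += dw0                    # then take one step
        w1 += dw1
    return w0, w1

# Toy noiseless data generated from t = 2x + 1 (an assumption)
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3]]
w0, w1 = train_linear_unit(data)
```

Because this error surface is parabolic with a single global minimum, the learned weights approach w0 = 1, w1 = 2 for a sufficiently small η.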
Gradient descent and delta rule

Because the error surface contains only a single global minimum, this
algorithm will converge to a weight vector with minimum error, provided a
sufficiently small learning rate η is used.
Gradient descent
Gradient descent can be applied when
1. The hypothesis space contains continuously parameterized hypotheses

2. The error can be differentiated with respect to these parameters

The difficulties in applying gradient descent are:
1. Convergence to a minimum may be slow

2. If there are multiple local minima, there is no guarantee that the
procedure will reach the global minimum
Stochastic approximation to gradient descent

- Stochastic gradient descent (i.e. incremental mode) can sometimes avoid
falling into local minima because it follows the per-example gradient of
Ed rather than the overall gradient of E.
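The stochastic (incremental) variant differs from the batch version only in when the weights are updated: after each example rather than after a full pass over D. A sketch, reusing the same toy linear-unit setup assumed earlier:

```python
# Stochastic gradient descent for a linear unit o = w0 + w1*x:
# the delta-rule update is applied per example, not summed over D.

def train_linear_unit_sgd(data, eta=0.01, epochs=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, t in data:
            o = w0 + w1 * x
            w0 += eta * (t - o)        # update immediately, per example
            w1 += eta * (t - o) * x
    return w0, w1

# Toy noiseless data generated from t = 2x + 1 (an assumption)
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3]]
w0, w1 = train_linear_unit_sgd(data)
```

Each step follows the gradient of a single-example error Ed, so successive steps point in varying directions; this jitter is what can carry the search past shallow local minima on non-convex surfaces.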
Summary

• Perceptron training rule guaranteed to succeed if


– training examples are linearly separable
– Sufficiently small learning rate η

• Linear unit training rule using gradient descent


– Converge to min. error hypothesis
(Guaranteed to converge to hypothesis with minimum
squared error )
Multilayer networks and the backpropagation algorithm

• Speech recognition is an example task for multilayer networks learned
by the backpropagation algorithm
• Such networks can represent highly nonlinear decision surfaces
Sigmoid Threshold Unit
The Backpropagation algorithm
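A minimal sketch of the stochastic backpropagation algorithm for a small 2-2-1 network of sigmoid units. The network size, learning rate, weight-initialization range, and the logical-OR training task are illustrative assumptions, not details from the slides:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Net:
    def __init__(self, n_in=2, n_hidden=2):
        rnd = lambda: random.uniform(-0.5, 0.5)
        # one weight per input plus a bias weight for every unit
        self.wh = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.wo = [rnd() for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]                               # constant bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(ws, xb)))
                  for ws in self.wh]
        hb = self.h + [1.0]
        self.o = sigmoid(sum(w * v for w, v in zip(self.wo, hb)))
        return self.o

    def backward(self, x, t, eta=0.5):
        # error terms: delta_o for the output unit, then delta_h for each
        # hidden unit by propagating delta_o back through the old weights
        delta_o = self.o * (1 - self.o) * (t - self.o)
        delta_h = [hj * (1 - hj) * self.wo[j] * delta_o
                   for j, hj in enumerate(self.h)]
        for j, hj in enumerate(self.h + [1.0]):
            self.wo[j] += eta * delta_o * hj
        for j in range(len(self.wh)):
            for i, xi in enumerate(x + [1.0]):
                self.wh[j][i] += eta * delta_h[j] * xi

net = Net()
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
for _ in range(2000):
    for x, t in or_data:
        net.forward(x)
        net.backward(x, t)
```

Each hidden unit's error term is computed from the output unit's error term before the output weights are updated, which is the key bookkeeping detail of the algorithm.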
Adding Momentum

• Often include a weight momentum term α

- The nth iteration update depends on the (n–1)th iteration:

    Δwji(n) = η δj xji + α Δwji(n–1)

- α: a constant between 0 and 1 (the momentum)

• Roles of the momentum term
  - The effect of keeping the ball rolling through small local
    minima in the error surface
  - The effect of gradually increasing the step size of the
    search in flat regions (greatly improves the speed of
    learning)
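The step-size-growing effect can be seen in isolation with a tiny sketch (the function name and the constant-gradient scenario are my own illustration):

```python
# Momentum update: dw(n) = eta * grad_term + alpha * dw(n-1).
# In a flat region where grad_term stays constant, the effective step
# grows geometrically toward eta * grad_term / (1 - alpha).

def momentum_step(grad_term, prev_delta, eta=0.1, alpha=0.9):
    return eta * grad_term + alpha * prev_delta

delta = 0.0
for _ in range(200):
    delta = momentum_step(1.0, delta)   # constant gradient term of 1.0
```

With η = 0.1 and α = 0.9 the step settles near 0.1 / (1 − 0.9) = 1.0, i.e. ten times the plain gradient step, which is why momentum speeds learning across plateaus.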
Convergence and Local Minima

• Gradient descent converges to some local minimum

  – Perhaps not the global minimum...
  – Remedies: add momentum, or use stochastic gradient descent
Expressive Capabilities of ANNs
Hidden layer representations

- This 8x3x8 network was trained to learn the identity function.
- 8 training examples are used.
- After 5000 training iterations, the three hidden unit values encode
the eight distinct inputs using the encoding shown on the right.
Learning the 8x3x8 network
- Most of the interesting weight
changes occurred during the
first 2500 iterations.
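The reason three hidden units suffice for the 8x3x8 identity task can be checked with a one-liner: three roughly binary activations give 2³ = 8 distinct codes, one per one-hot input.

```python
from itertools import product

# All possible 3-bit hidden codes: exactly enough for 8 distinct inputs.
codes = list(product([0, 1], repeat=3))
```

The trained network discovers such a code on its own; the slides note it emerges in the hidden layer without being specified anywhere in the training data.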
Generalization, Overfitting, and Stopping Criterion

• Termination condition
– Until the error E falls below some predetermined threshold
• Techniques to address the overfitting problem
• Weight decay : Decrease each weight by some small factor
during each iteration.
• Cross-validation (k-fold cross-validation)
Neural Nets for Face Recognition
(http://www.cs.cmu.edu/tom/faces.html)

• Training images : 20 different persons with 32 images per person.


• After training on 260 images, the network achieves an accuracy of 90% over
the test set.
• Algorithm parameters : η=0.3, α=0.3
Alternative Error Functions

• Penalize large weights (weight decay): reduces the risk of
overfitting

• Train on target slopes as well as values:

• Minimizing the cross entropy: learning a probabilistic output
function (chapter 6)

    E = – Σd∈D [ td log od + (1 – td) log(1 – od) ]
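The cross-entropy sum above translates directly into code. A small sketch (the example targets and outputs are made up for illustration):

```python
import math

# Cross-entropy error over a set of (target, output) pairs:
# E = - sum_d [ t_d * log(o_d) + (1 - t_d) * log(1 - o_d) ]

def cross_entropy(targets, outputs):
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

# Confident correct predictions give low error; confident wrong ones, high:
low = cross_entropy([1, 0], [0.9, 0.1])
high = cross_entropy([1, 0], [0.1, 0.9])
```

Unlike squared error, this error grows without bound as an output approaches the wrong extreme, which matches its probabilistic interpretation.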
Recurrent Networks


(a) Feedforward network


(b) Recurrent network
(c) Recurrent network unfolded
in time
Dynamically Modifying Network Structure

• To improve generalization accuracy and training


efficiency
• Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
– Start with the simplest possible network (no hidden units) and
add complexity
• Lecun et al. 1990
  – Start with a complex network and prune it as we find that
    certain connections are inessential.
