Neural Network BSC


Artificial Neural Networks

Overview
1. Introduction
2. ANN representations
3. Perceptron Training
4. Gradient Descent and Delta Rule
5. Multilayer networks and Backpropagation algorithm
6. Remarks on the backpropagation algorithm
7. An illustrative example: face recognition
8. Advanced topics in artificial neural networks
Introduction

- Human brain: densely interconnected network
of ~10^11 neurons, each connected to ~10^4 others
(neuron switching time: approx. 10^-3 sec.)

- Properties of artificial neural nets (ANN’s):


• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
Appropriate problems for neural network learning

• Input is high-dimensional, discrete or real-valued
(e.g. raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Long training times are acceptable
• Fast evaluation of the learned function is required
• Not important for humans to understand the learned weights

Examples:
• Speech phoneme recognition
• Image classification
• Financial prediction
Appropriate problems for neural network learning

- ALVINN drives at 70 mph on highways


- The ALVINN system uses the backpropagation algorithm to learn to steer
an autonomous vehicle driving at speeds up to 70 miles per hour
Perceptron

• Input values → Linear weighted sum → Threshold


Activation Functions

1. Identity Function

    f(x) = x for all x

2. Binary Step function (with threshold θ)

    f(x) = 1 if x ≥ θ
           0 if x < θ

3. Bipolar Step function (with threshold θ)

    f(x) =  1 if x ≥ θ
           -1 if x < θ
Activation Functions

4. Sigmoidal Functions: continuous functions

(a) Binary Sigmoidal

    f(x) = 1 / (1 + e^(-x))

    f'(x) = f(x) [1 - f(x)]

(b) Bipolar Sigmoidal

    f(x) = 2 / (1 + e^(-x)) - 1 = (1 - e^(-x)) / (1 + e^(-x))

    f'(x) = (1/2) [1 + f(x)] [1 - f(x)]
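The functions and derivatives above can be sketched directly in code. This is a minimal illustration (function names and the θ default are my own choices, not from the slides):

```python
import math

# Common ANN activation functions, following the slides;
# theta is the threshold used by the step functions.

def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_sigmoid_deriv(x):
    f = binary_sigmoid(x)
    return f * (1 - f)          # f'(x) = f(x)[1 - f(x)]

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def bipolar_sigmoid_deriv(x):
    f = bipolar_sigmoid(x)
    return 0.5 * (1 + f) * (1 - f)   # f'(x) = (1/2)[1 + f(x)][1 - f(x)]
```

Note that both sigmoid derivatives can be computed from the function value alone, which is what makes them convenient for backpropagation.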
Activation Functions
Activation functions:

(A) Identity

(B) Binary step

(C) Bipolar step

(D) Binary sigmoidal

(E) Bipolar sigmoidal

(F) Ramp
Decision surface of a perceptron

• Representational power of perceptrons


- Linearly separable case like (a):
possible to classify with a hyperplane
- Linearly inseparable case like (b):
impossible for a single perceptron to classify
Perceptron training rule and delta rule

wi ← wi + Δwi
where Δwi = η (t – o) xi
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate
Can prove the rule will converge
• If the training data is linearly separable
• and η is sufficiently small
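The rule above can be sketched as a short program. Here it is trained on the logical AND function, a linearly separable problem (the AND task, learning rate, and epoch count are illustrative assumptions):

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i,
# with the bias treated as a weight on a constant input of 1.

def perceptron_output(weights, bias, x):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else 0

def train_perceptron(examples, eta=0.1, epochs=50):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(weights, bias, x)
            for i in range(len(weights)):
                weights[i] += eta * (t - o) * x[i]   # the training rule
            bias += eta * (t - o)
    return weights, bias

# Logical AND: linearly separable, so the rule is guaranteed to converge
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_examples)
```

On a linearly inseparable problem such as XOR, this same loop would never settle on weights that classify every example correctly.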
Gradient descent
Hypothesis Space

- Error of different hypotheses


- For a linear unit with two weights, the hypothesis space H is the w0, w1
plane.
- The error surface is parabolic with a single global minimum (we
seek the hypothesis with minimum error).
Derivation of gradient descent

• Gradient descent
- Error (over all training examples):

    E(w) = (1/2) Σd∈D (td – od)²

- The gradient of E (vector of partial derivatives):

    ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

- Direction: steepest increase in E.

- Thus, the training rule is:

    Δw = –η ∇E(w)

  (The negative sign: the direction that decreases E)

Derivation of gradient descent

where xid denotes the single input component xi for training example d:

    ∂E/∂wi = Σd∈D (td – od)(–xid)

- The weight update rule for gradient descent:

    Δwi = η Σd∈D (td – od) xid



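The batch update rule above can be sketched for a one-input linear unit o = w0 + w1·x. The toy data and hyperparameters below are assumptions for illustration only:

```python
# Batch gradient descent for a linear unit, minimizing
# E = (1/2) * sum_d (t_d - o_d)^2 via dw_i = eta * sum_d (t_d - o_d) * x_id.

def train_linear_unit(data, eta=0.01, epochs=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        dw0 = dw1 = 0.0
        for x, t in data:            # accumulate the gradient over all of D
            o = w0 + w1 * x
            dw0 += eta * (t - o)     # x_0 = 1 (bias input)
            dw1 += eta * (t - o) * x
        w0 += dw0                    # then take one step
        w1 += dw1
    return w0, w1

# Toy noiseless data generated from t = 2x + 1 (an assumption)
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3]]
w0, w1 = train_linear_unit(data)
```

Because this error surface is parabolic with a single global minimum, the learned weights approach w0 = 1, w1 = 2 for a sufficiently small η.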
Gradient descent and delta rule

Because the error surface contains only a single global minimum, this
algorithm will converge to a weight vector with minimum error, provided a
sufficiently small learning rate η is used.
Gradient descent
Gradient descent can be applied when
1. The hypothesis space contains continuously parameterized hypotheses

2. The error can be differentiated with respect to these parameters

The difficulties in applying gradient descent are:
1. Convergence to a minimum may be slow

2. If there are multiple local minima, there is no guarantee that the
procedure will reach the global minimum
Stochastic approximation to gradient descent

- Stochastic gradient descent (i.e. incremental mode) can sometimes avoid
falling into local minima because it follows the per-example gradient of
Ed rather than the overall gradient of E.
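The stochastic (incremental) variant differs from the batch version only in when the weights are updated: after each example rather than after a full pass over D. A sketch, reusing the same toy linear-unit setup assumed earlier:

```python
# Stochastic gradient descent for a linear unit o = w0 + w1*x:
# the delta-rule update is applied per example, not summed over D.

def train_linear_unit_sgd(data, eta=0.01, epochs=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, t in data:
            o = w0 + w1 * x
            w0 += eta * (t - o)        # update immediately, per example
            w1 += eta * (t - o) * x
    return w0, w1

# Toy noiseless data generated from t = 2x + 1 (an assumption)
data = [(x, 2 * x + 1) for x in [0, 1, 2, 3]]
w0, w1 = train_linear_unit_sgd(data)
```

Each step follows the gradient of a single-example error Ed, so successive steps point in varying directions; this jitter is what can carry the search past shallow local minima on non-convex surfaces.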
Summary

• Perceptron training rule guaranteed to succeed if


– training examples are linearly separable
– Sufficiently small learning rate η

• Linear unit training rule using gradient descent


– Converge to min. error hypothesis
(Guaranteed to converge to hypothesis with minimum
squared error )
Multilayer networks and the backpropagation algorithm

• Speech recognition is an example task for multilayer networks learned
by the backpropagation algorithm
• Such networks can represent highly nonlinear decision surfaces
Sigmoid Threshold Unit
The Backpropagation algorithm
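A minimal sketch of the stochastic backpropagation algorithm for a small 2-2-1 network of sigmoid units. The network size, learning rate, weight-initialization range, and the logical-OR training task are illustrative assumptions, not details from the slides:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Net:
    def __init__(self, n_in=2, n_hidden=2):
        rnd = lambda: random.uniform(-0.5, 0.5)
        # one weight per input plus a bias weight for every unit
        self.wh = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.wo = [rnd() for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]                               # constant bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(ws, xb)))
                  for ws in self.wh]
        hb = self.h + [1.0]
        self.o = sigmoid(sum(w * v for w, v in zip(self.wo, hb)))
        return self.o

    def backward(self, x, t, eta=0.5):
        # error terms: delta_o for the output unit, then delta_h for each
        # hidden unit by propagating delta_o back through the old weights
        delta_o = self.o * (1 - self.o) * (t - self.o)
        delta_h = [hj * (1 - hj) * self.wo[j] * delta_o
                   for j, hj in enumerate(self.h)]
        for j, hj in enumerate(self.h + [1.0]):
            self.wo[j] += eta * delta_o * hj
        for j in range(len(self.wh)):
            for i, xi in enumerate(x + [1.0]):
                self.wh[j][i] += eta * delta_h[j] * xi

net = Net()
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
for _ in range(2000):
    for x, t in or_data:
        net.forward(x)
        net.backward(x, t)
```

Each hidden unit's error term is computed from the output unit's error term before the output weights are updated, which is the key bookkeeping detail of the algorithm.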
Adding Momentum

• Often include a weight momentum term α

- The nth iteration update depends on the (n–1)th iteration:

    Δwji(n) = η δj xji + α Δwji(n–1)

- α: a constant between 0 and 1 (the momentum)

• Roles of the momentum term
  - The effect of keeping the ball rolling through small local
    minima in the error surface
  - The effect of gradually increasing the step size of the
    search in flat regions (greatly improves the speed of
    learning)
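The step-size-growing effect can be seen in isolation with a tiny sketch (the function name and the constant-gradient scenario are my own illustration):

```python
# Momentum update: dw(n) = eta * grad_term + alpha * dw(n-1).
# In a flat region where grad_term stays constant, the effective step
# grows geometrically toward eta * grad_term / (1 - alpha).

def momentum_step(grad_term, prev_delta, eta=0.1, alpha=0.9):
    return eta * grad_term + alpha * prev_delta

delta = 0.0
for _ in range(200):
    delta = momentum_step(1.0, delta)   # constant gradient term of 1.0
```

With η = 0.1 and α = 0.9 the step settles near 0.1 / (1 − 0.9) = 1.0, i.e. ten times the plain gradient step, which is why momentum speeds learning across plateaus.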
Convergence and Local Minima

• Gradient descent converges to some local minimum

  – Perhaps not the global minimum...
  – Remedies: add momentum, or use stochastic gradient descent
Expressive Capabilities of ANNs
Hidden layer representations

- This 8x3x8 network was trained to learn the identity function.
- 8 training examples are used.
- After 5000 training iterations, the three hidden unit values encode
the eight distinct inputs using the encoding shown on the right.
Learning the 8x3x8 network
- Most of the interesting weight
changes occurred during the
first 2500 iterations.
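The reason three hidden units suffice for the 8x3x8 identity task can be checked with a one-liner: three roughly binary activations give 2³ = 8 distinct codes, one per one-hot input.

```python
from itertools import product

# All possible 3-bit hidden codes: exactly enough for 8 distinct inputs.
codes = list(product([0, 1], repeat=3))
```

The trained network discovers such a code on its own; the slides note it emerges in the hidden layer without being specified anywhere in the training data.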
Generalization, Overfitting, and Stopping Criterion

• Termination condition
– Until the error E falls below some predetermined threshold
• Techniques to address the overfitting problem
• Weight decay : Decrease each weight by some small factor
during each iteration.
• Cross-validation (k-fold cross-validation)
Neural Nets for Face Recognition
(http://www.cs.cmu.edu/tom/faces.html)

• Training images : 20 different persons with 32 images per person.


• After training on 260 images, the network achieves an accuracy of 90% over
the test set.
• Algorithm parameters : η=0.3, α=0.3
Alternative Error Functions

• Penalize large weights (weight decay): reduces the risk of
overfitting

• Train on target slopes as well as values:

• Minimizing the cross entropy: learning a probabilistic output
function (chapter 6)

    E = – Σd∈D [ td log od + (1 – td) log(1 – od) ]
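The cross-entropy sum above translates directly into code. A small sketch (the example targets and outputs are made up for illustration):

```python
import math

# Cross-entropy error over a set of (target, output) pairs:
# E = - sum_d [ t_d * log(o_d) + (1 - t_d) * log(1 - o_d) ]

def cross_entropy(targets, outputs):
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

# Confident correct predictions give low error; confident wrong ones, high:
low = cross_entropy([1, 0], [0.9, 0.1])
high = cross_entropy([1, 0], [0.1, 0.9])
```

Unlike squared error, this error grows without bound as an output approaches the wrong extreme, which matches its probabilistic interpretation.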
Recurrent Networks


(a) Feedforward network


(b) Recurrent network
(c) Recurrent network unfolded
in time
Dynamically Modifying Network Structure

• To improve generalization accuracy and training


efficiency
• Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
– Start with the simplest possible network (no hidden units) and
add complexity
• Lecun et al. 1990
  – Start with a complex network and prune it as we find that
    certain connections are inessential.
