Neural Networks: The ADALINE


Last Lecture Summary

- Introduction to Neural Networks
- Biological Neurons
- Artificial Neurons
- McCulloch and Pitts TLU
- Rosenblatt’s Perceptron

Alexandre Bernardino, [email protected] Machine Learning, 2009/2010

Neural Networks
The ADALINE

MACHINE LEARNING 09/10


Perceptron Limitations

- Perceptron’s learning rule is not guaranteed to converge if the data is not linearly separable.

- Widrow-Hoff (1960)
  - Minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e’).


ADALINE – Adaptive Linear Element

- The separating hyperplane is the same as for the perceptron:

$$ w_0 + x_1 w_1 + \dots + x_N w_N = 0 $$



ADALINE – Adaptive Linear Element

- The learning rule is different from the perceptron’s.

- Given the training set:

$$ T = \{ (\mathbf{x}^p, d^p) \}, \quad p = 1, \dots, P $$

- Minimize the cost function:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} (s^p - d^p)^2 $$


ADALINE - Simplification

- Let us consider that, for every pattern, $x_0^p = 1$.

- Thus we can write:

$$ s^p = \sum_{l=0}^{N} w_l x_l^p $$

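To make the bias trick concrete, here is a minimal NumPy sketch (the arrays X, d and the weight vector w are illustrative, not from the slides) that prepends the constant input x_0^p = 1 and evaluates the linear output s^p and the cost E(w):

```python
import numpy as np

# Illustrative training set: P patterns with N raw features each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # shape (P, N)
d = np.array([-1.0, -1.0, -1.0, 1.0])                           # desired outputs

# Absorb the bias: prepend x_0^p = 1 to every pattern.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])                   # shape (P, N+1)

w = np.zeros(Xb.shape[1])                                       # [w_0, w_1, ..., w_N]

def adaline_output(w, Xb):
    # s^p = sum_{l=0}^{N} w_l x_l^p, for all patterns at once
    return Xb @ w

def cost(w, Xb, d):
    # E(w) = (1/P) sum_p (s^p - d^p)^2
    e = adaline_output(w, Xb) - d
    return np.mean(e ** 2)

print(cost(w, Xb, d))  # cost of the all-zero weight vector
```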


ADALINE – Analytic Solution

- Optimize the cost function:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 \quad (1) $$

$$ \mathbf{w} = [w_0 \; w_1 \; \dots \; w_N]^T $$

$$ \frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N \quad (2) $$

- Given that:

$$ e^p = s^p - d^p \quad (3) $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p \quad (4) $$

ADALINE – Analytic Solution

- Compute the gradient of the cost function $E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2$:

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial e^p}{\partial w_k} $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial}{\partial w_k} (s^p - d^p) $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} $$



ADALINE – Analytic Solution

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p \;\Leftrightarrow\; \frac{\partial s^p}{\partial w_k} = x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} \;\Leftrightarrow\; \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p x_k^p $$


ADALINE – Analytic Solution

- Very important!
  - The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

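As a quick sanity check on this formula, the following self-contained sketch (data and weights are illustrative) compares the analytic gradient (2/P) Σ_p x_k^p e^p with a central finite-difference approximation of ∂E/∂w_k:

```python
import numpy as np

# Small illustrative data set (x_0 = 1 already prepended) and an arbitrary weight vector.
Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])
w = np.array([0.1, -0.2, 0.3])
P = Xb.shape[0]

def cost(w):
    e = Xb @ w - d
    return np.mean(e ** 2)

# Analytic gradient: dE/dw_k = (2/P) * sum_p x_k^p e^p
e = Xb @ w - d
grad_analytic = (2.0 / P) * (Xb.T @ e)

# Central finite differences for comparison.
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(len(w))[k]) - cost(w - eps * np.eye(len(w))[k])) / (2 * eps)
    for k in range(len(w))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # expected: True
```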


ADALINE – Analytic Solution

Given that $e^p = s^p - d^p$:

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 (s^p - d^p) x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 s^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$


ADALINE – Analytic Solution

$$ s^p = \sum_{l=0}^{N} w_l x_l^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 s^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} 2 w_l x_l^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$



ADALINE – Analytic Solution

$$ \frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} 2 w_l x_l^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$

$$ \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p $$


ADALINE – Analytic Solution

$$ \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0, \dots, N $$

- This is a linear system of N+1 equations in N+1 unknowns. How do we solve it?



ADALINE – Matrix Notation

$$ \mathbf{w} = [w_0 \; w_1 \; \dots \; w_N]^T $$

$$ \mathbf{x}^p = [x_0^p \; x_1^p \; \dots \; x_N^p]^T, \qquad x_0^p = 1 $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p = \mathbf{w}^T \mathbf{x}^p $$

$$ e^p = \mathbf{w}^T \mathbf{x}^p - d^p $$


ADALINE – Matrix Notation

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right)^2 $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right) \left( \mathbf{w}^T \mathbf{x}^p - d^p \right)^T $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right) \left( (\mathbf{x}^p)^T \mathbf{w} - d^p \right) $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \mathbf{w}^T \mathbf{x}^p d^p - d^p (\mathbf{x}^p)^T \mathbf{w} + (d^p)^2 \right) $$


ADALINE – Matrix Notation

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \mathbf{w}^T \mathbf{x}^p d^p - d^p (\mathbf{x}^p)^T \mathbf{w} + (d^p)^2 \right) $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \frac{2}{P} \sum_{p=1}^{P} \mathbf{w}^T \mathbf{x}^p d^p + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 $$

$$ E(\mathbf{w}) = \mathbf{w}^T \left( \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T \right) \mathbf{w} - 2 \mathbf{w}^T \left( \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p \right) + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 $$

ADALINE – Matrix Notation

- Let us introduce the average operator $\langle \cdot \rangle$:

$$ \langle \cdot \rangle = \frac{1}{P} \sum_{p=1}^{P} (\cdot) $$

- The cost function is then written as:

$$ E(\mathbf{w}) = \mathbf{w}^T \left\langle \mathbf{x}^p (\mathbf{x}^p)^T \right\rangle \mathbf{w} - 2 \mathbf{w}^T \left\langle \mathbf{x}^p d^p \right\rangle + \left\langle (d^p)^2 \right\rangle $$



ADALINE – Matrix Notation

- Defining:

$$ R_{xx} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T = \left\langle \mathbf{x}^p (\mathbf{x}^p)^T \right\rangle $$

$$ \mathbf{p} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p = \left\langle \mathbf{x}^p d^p \right\rangle $$

$$ \sigma_d^2 = \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 = \left\langle (d^p)^2 \right\rangle $$

- the cost function is:

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$
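A minimal NumPy sketch of these averages, assuming the patterns (with the bias column included) are the rows of an illustrative matrix Xb and the targets are in a vector d:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # rows are x^p
d = np.array([-1.0, -1.0, -1.0, 1.0])
P = Xb.shape[0]

Rxx = (Xb.T @ Xb) / P          # R_xx = <x^p (x^p)^T>, shape (N+1, N+1)
p_vec = (Xb.T @ d) / P         # p    = <x^p d^p>,     shape (N+1,)
sigma_d2 = np.mean(d ** 2)     # sigma_d^2 = <(d^p)^2>

def cost(w):
    # E(w) = w^T Rxx w - 2 w^T p + sigma_d^2
    return w @ Rxx @ w - 2 * w @ p_vec + sigma_d2

print(cost(np.zeros(Xb.shape[1])))  # with w = 0 the cost reduces to sigma_d^2
```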

ADALINE – Quadratic Cost

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$

- $R_{xx}$ is a covariance matrix: positive semi-definite.

- The error surface is therefore a paraboloid.



ADALINE – Gradient Vector

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$

$$ \nabla E(\mathbf{w}) = 2 R_{xx} \mathbf{w} - 2 \mathbf{p} $$


ADALINE – Closed Form Solution

$$ \nabla E(\mathbf{w}^*) = 0 \;\Leftrightarrow\; R_{xx} \mathbf{w}^* = \mathbf{p} $$

- If $R_{xx}$ is positive definite, the minimum is unique:

$$ \mathbf{w}^* = R_{xx}^{-1} \mathbf{p} $$

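Continuing the illustrative Rxx / p_vec sketch from the matrix-notation slides, the optimal weights can be computed in one line. Solving the linear system rather than forming the explicit inverse is the usual numerical choice:

```python
import numpy as np

# Rxx and p_vec as in the earlier sketch (illustrative data, bias column included).
Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])
P = Xb.shape[0]
Rxx = (Xb.T @ Xb) / P
p_vec = (Xb.T @ d) / P

# w* = Rxx^{-1} p, obtained by solving Rxx w = p instead of inverting Rxx.
w_star = np.linalg.solve(Rxx, p_vec)
print(w_star)
```

Using np.linalg.solve avoids computing the inverse explicitly, which is relevant to the remark on the next slide about inversion being problematic in high dimensions.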


ADALINE – Closed Form Solution

- The closed-form solution requires inverting the covariance matrix, which can be problematic in high dimensions.

- Gradient methods are simpler and have proven convergence properties for quadratic functions:

$$ w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k} $$


ADALINE – Gradient Based Solution

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 $$

- Remember, from a few slides back:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$



ADALINE – Gradient Based Solution

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

$$ w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k} $$

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$


ADALINE – Batch Algorithm

- Initialize the weights at arbitrary values.
- Define a learning rate η.
- Repeat:
  - For each pattern in the training set:
    - Apply x^p to the ADALINE input.
    - Observe the output s^p and compute the error e^p = s^p - d^p.
    - For each weight k, accumulate the product x_k^p e^p.
  - After processing all patterns, update each weight k by:

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$

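A compact NumPy sketch of this batch procedure; the data arrays, the initialization scale, and the fixed epoch count used as a stopping rule are illustrative assumptions, not part of the slides:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # x_0 = 1 prepended
d = np.array([-1.0, -1.0, -1.0, 1.0])
P, n_weights = Xb.shape

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=n_weights)  # arbitrary initial weights
eta = 0.1                                  # learning rate

for epoch in range(200):                   # fixed number of epochs as a simple stopping rule
    e = Xb @ w - d                         # e^p = s^p - d^p for all patterns
    grad = (2.0 / P) * (Xb.T @ e)          # accumulated x_k^p e^p over the whole training set
    w = w - eta * grad                     # one update per epoch, after seeing all patterns

print(w, np.mean((Xb @ w - d) ** 2))       # final weights and mean squared error
```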


ADALINE – Batch Algorithm

ADALINE’s batch algorithm properties:

- Guaranteed to converge to the weight set with minimum squared error:
  - given a sufficiently small learning rate η,
  - even when the training data contains noise,
  - even when the training data is not separable.


ADALINE – Batch Algorithm

- ADALINE’s batch algorithm requires all the training data to be available from the beginning.

- The weights are updated only after the whole training set has been presented.

- But humans learn continuously!

- In some applications we may want to update the weights immediately after each training pattern becomes available.



ADALINE – Incremental Algorithm

- Incremental algorithm: approximate the complete gradient by its per-pattern estimate.

Complete (exact) gradient:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

Stochastic (approximate) gradient:

$$ \frac{\partial \hat{E}}{\partial w_k} = x_k^p e^p $$


ADALINE – Incremental Algorithm

Incremental mode gradient descent:

$$ w_k^{(t+1)} = w_k^{(t)} - 2 \eta x_k^p e^p $$

Batch mode gradient descent:

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$

- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
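A sketch of the incremental update on the same illustrative data as before; the point to notice is that the weights change after every single pattern rather than once per pass:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=Xb.shape[1])
eta = 0.05

for epoch in range(200):
    order = rng.permutation(len(d))        # present the patterns in random order
    for p in order:
        e_p = Xb[p] @ w - d[p]             # error for this single pattern
        w = w - 2 * eta * e_p * Xb[p]      # incremental (per-pattern) update
print(w)
```
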
ADALINE – Incremental Algorithm

- Incremental gradient descent is also known as stochastic gradient descent.

- It is also called the LMS algorithm or the Delta Rule.

- It is based on an approximation of the gradient, so it never goes exactly to the minimum of the cost function.

- After reaching a vicinity of the minimum, it oscillates around it.

- The amplitude of the oscillations can be reduced by reducing η.


ADALINE - Comparison

- The plots show the value of one weight over time.

[Figure: two plots, "Batch" (horizontal axis: epochs) and "Incremental" (horizontal axis: patterns)]

- 1 epoch = P patterns (the full training set).



ADALINE vs Perceptron

- Both the ADALINE Delta Rule and the Perceptron weight update rule are instances of Error Correction Learning.

ADALINE Delta Rule:

$$ w_k^{(t+1)} = w_k^{(t)} - 2 \eta x_k^p (s^p - d^p) $$

Perceptron update rule:

$$ w_k^{(t+1)} = w_k^{(t)} - \eta x_k^p (y^p - d^p) $$

- The ADALINE allows arbitrary real output values, whereas the perceptron assumes binary outputs.

- The ADALINE always converges (given a small enough η) to the minimum squared error, while the perceptron only converges when the data is separable.

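The two rules differ only in which output enters the error term. In the illustrative sketch below, s is the raw linear output (ADALINE) and y = sign(s) is the thresholded output (perceptron); the function and variable names are mine, not from the slides:

```python
import numpy as np

def adaline_update(w, x, d, eta):
    # Delta rule: error measured at the linear output s = w.x
    s = w @ x
    return w - 2 * eta * (s - d) * x

def perceptron_update(w, x, d, eta):
    # Perceptron rule: error measured at the thresholded output y = sign(w.x)
    y = 1.0 if w @ x >= 0 else -1.0
    return w - eta * (y - d) * x

w = np.zeros(3)
x, d = np.array([1.0, 0.5, -1.0]), 1.0      # one (pattern, target) pair, with x_0 = 1
print(adaline_update(w, x, d, eta=0.1))     # weights move because s = 0 differs from d = 1
print(perceptron_update(w, x, d, eta=0.1))  # no change because y already equals d
```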

ADALINE – Statistical Interpretation

- The analytical solution for the weights was obtained by “averaging” quantities computed from the training set.

- It is possible to give a statistical interpretation of the process:
  - Inputs: observations of N random variables, X = [1, X1, ..., Xk, ..., XN]
  - Desired output: observations of one random variable, D
  - Output: observations of one random variable, Y = w^T X



ADALINE – Statistical Interpretation

- The error function can be interpreted as an approximation to the statistical expectation E[·]:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (y^p - d^p)^2 \approx E\left[ (Y - D)^2 \right] $$

Solution:

$$ \mathbf{w}^* = R_{xx}^{-1} \mathbf{p} $$

- The matrix R_xx and the vector p can be interpreted as approximations to the statistical auto-covariance and cross-covariance of the random variables:

$$ R_{xx} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T \approx E[X X^T], \qquad \mathbf{p} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p \approx E[D X] $$
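A small illustrative experiment (the generative model, noise level, and sample size are assumptions, not from the slides) showing that the sample averages R_xx and p approach their expectations, so the estimated weights approach the generating ones as P grows:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 100_000, 2

# Illustrative model: x = [1, X1, X2] with X1, X2 ~ N(0, 1), and D = w_true . x + noise.
X_raw = rng.normal(size=(P, N))
Xb = np.hstack([np.ones((P, 1)), X_raw])
w_true = np.array([0.5, -1.0, 2.0])
D = Xb @ w_true + 0.1 * rng.normal(size=P)

Rxx = (Xb.T @ Xb) / P          # sample estimate of E[X X^T]
p_vec = (Xb.T @ D) / P         # sample estimate of E[D X]

w_star = np.linalg.solve(Rxx, p_vec)
print(w_star)                  # close to w_true for large P
```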

ADALINE – Statistical Interpretation

- The LMS algorithm is based on an instantaneous estimate of the gradient.

- This estimate can be modeled as

$$ \hat{g}(n) = g(n) + e_g(n) $$

where $e_g(n)$ is a random noise vector.

- LMS = stochastic gradient descent.



ADALINE – Statistical Interpretation

- Under reasonable conditions, stochastic gradient methods may converge to the exact solution.

- Convergence conditions [Monro and Ljung]:
  - e_g(n) is zero mean.
  - The pattern sequence is random.
  - η(n) tends slowly toward zero:

$$ \sum_{n=0}^{\infty} \eta(n) = \infty, \qquad \sum_{n=0}^{\infty} \eta^2(n) < \infty $$


ADALINE – Statistical Interpretation

- Typical learning rate schedules:

$$ \eta(n) = \frac{c}{n} $$

$$ \eta(n) = \frac{\eta_0}{1 + n/\tau} $$

[Figure: both schedules plotted over n = 0 … 1000, decaying from about 1 toward 0]
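A short sketch of the two schedules (the constants c, eta0, and tau are illustrative choices), with a finite-horizon numerical look at the summation conditions above:

```python
import numpy as np

def eta_inverse(n, c=1.0):
    # eta(n) = c / n, defined for n >= 1
    return c / n

def eta_search_then_converge(n, eta0=1.0, tau=100.0):
    # eta(n) = eta0 / (1 + n / tau): nearly constant for n << tau, then decays like 1/n
    return eta0 / (1.0 + n / tau)

n = np.arange(1, 1001)
print(eta_inverse(n)[:5])                  # 1.0, 0.5, 0.333..., ...
print(eta_search_then_converge(n)[:5])     # decays much more slowly at first

# Finite-horizon look at the conditions: the plain sum keeps growing, the squared sum levels off.
print(eta_inverse(n).sum(), (eta_inverse(n) ** 2).sum())
```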
