Neural Networks: The ADALINE


Last Lecture Summary

- Introduction to Neural Networks
- Biological Neurons
- Artificial Neurons
- McCulloch and Pitts TLU
- Rosenblatt’s Perceptron

Alexandre Bernardino, [email protected] Machine Learning, 2009/2010

Neural Networks
The ADALINE

MACHINE LEARNING 09/10


Perceptron Limitations

- Perceptron’s learning rule is not guaranteed to converge if the data is not linearly separable.

- Widrow-Hoff (1960)
  - Minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e’).


ADALINE – Adaptive Linear Element

- The separating hyperplane is the same as for the perceptron:

$$ w_0 + x_1 w_1 + \dots + x_N w_N = 0 $$



ADALINE – Adaptive Linear Element

- The learning rule is different from the perceptron’s.

- Given the training set:

$$ T = \{ (\mathbf{x}^p, d^p) \}, \quad p = 1, \dots, P $$

- Minimize the cost function:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} (s^p - d^p)^2 $$


ADALINE - Simplification

- Let us consider that, for every pattern, $x_0^p = 1$.

- Thus we can write:

$$ s^p = \sum_{l=0}^{N} w_l x_l^p $$

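To make the bias trick concrete, here is a minimal NumPy sketch (the arrays X, d and the weight vector w are illustrative, not from the slides) that prepends the constant input x_0^p = 1 and evaluates the linear output s^p and the cost E(w):

```python
import numpy as np

# Illustrative training set: P patterns with N raw features each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # shape (P, N)
d = np.array([-1.0, -1.0, -1.0, 1.0])                           # desired outputs

# Absorb the bias: prepend x_0^p = 1 to every pattern.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])                   # shape (P, N+1)

w = np.zeros(Xb.shape[1])                                       # [w_0, w_1, ..., w_N]

def adaline_output(w, Xb):
    # s^p = sum_{l=0}^{N} w_l x_l^p, for all patterns at once
    return Xb @ w

def cost(w, Xb, d):
    # E(w) = (1/P) sum_p (s^p - d^p)^2
    e = adaline_output(w, Xb) - d
    return np.mean(e ** 2)

print(cost(w, Xb, d))  # cost of the all-zero weight vector
```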


ADALINE – Analytic Solution

- Optimize the cost function:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 \quad (1) $$

$$ \mathbf{w} = [w_0 \; w_1 \; \dots \; w_N]^T $$

$$ \frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N \quad (2) $$

- Given that:

$$ e^p = s^p - d^p \quad (3) $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p \quad (4) $$

ADALINE – Analytic Solution

- Compute the gradient of the cost function $E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2$:

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial e^p}{\partial w_k} $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial}{\partial w_k} (s^p - d^p) $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} $$



ADALINE – Analytic Solution

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p \;\Leftrightarrow\; \frac{\partial s^p}{\partial w_k} = x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} \;\Leftrightarrow\; \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p x_k^p $$


ADALINE – Analytic Solution

- Very important!
  - The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

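As a quick sanity check on this formula, the following self-contained sketch (data and weights are illustrative) compares the analytic gradient (2/P) Σ_p x_k^p e^p with a central finite-difference approximation of ∂E/∂w_k:

```python
import numpy as np

# Small illustrative data set (x_0 = 1 already prepended) and an arbitrary weight vector.
Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])
w = np.array([0.1, -0.2, 0.3])
P = Xb.shape[0]

def cost(w):
    e = Xb @ w - d
    return np.mean(e ** 2)

# Analytic gradient: dE/dw_k = (2/P) * sum_p x_k^p e^p
e = Xb @ w - d
grad_analytic = (2.0 / P) * (Xb.T @ e)

# Central finite differences for comparison.
eps = 1e-6
grad_numeric = np.array([
    (cost(w + eps * np.eye(len(w))[k]) - cost(w - eps * np.eye(len(w))[k])) / (2 * eps)
    for k in range(len(w))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # expected: True
```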


ADALINE – Analytic Solution

Given that $e^p = s^p - d^p$:

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 (s^p - d^p) x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 s^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$


ADALINE – Analytic Solution

$$ s^p = \sum_{l=0}^{N} w_l x_l^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 s^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} 2 w_l x_l^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$



ADALINE – Analytic Solution

$$ \frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N $$

$$ \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} 2 w_l x_l^p x_k^p - \frac{1}{P} \sum_{p=1}^{P} 2 d^p x_k^p $$

$$ \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p $$


ADALINE – Analytic Solution

$$ \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0, \dots, N $$

- This is a linear system of N+1 equations in N+1 unknowns. How do we solve it?



ADALINE – Matrix Notation

$$ \mathbf{w} = [w_0 \; w_1 \; \dots \; w_N]^T $$

$$ \mathbf{x}^p = [x_0^p \; x_1^p \; \dots \; x_N^p]^T, \qquad x_0^p = 1 $$

$$ s^p = \sum_{l=0}^{N} w_l x_l^p = \mathbf{w}^T \mathbf{x}^p $$

$$ e^p = \mathbf{w}^T \mathbf{x}^p - d^p $$


ADALINE – Matrix Notation

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right)^2 $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right) \left( \mathbf{w}^T \mathbf{x}^p - d^p \right)^T $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p - d^p \right) \left( (\mathbf{x}^p)^T \mathbf{w} - d^p \right) $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \mathbf{w}^T \mathbf{x}^p d^p - d^p (\mathbf{x}^p)^T \mathbf{w} + (d^p)^2 \right) $$


ADALINE – Matrix Notation

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \mathbf{w}^T \mathbf{x}^p d^p - d^p (\mathbf{x}^p)^T \mathbf{w} + (d^p)^2 \right) $$

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} \mathbf{w}^T \mathbf{x}^p (\mathbf{x}^p)^T \mathbf{w} - \frac{2}{P} \sum_{p=1}^{P} \mathbf{w}^T \mathbf{x}^p d^p + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 $$

$$ E(\mathbf{w}) = \mathbf{w}^T \left( \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T \right) \mathbf{w} - 2 \mathbf{w}^T \left( \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p \right) + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 $$

ADALINE – Matrix Notation

- Let us introduce the average operator $\langle \cdot \rangle$:

$$ \langle \cdot \rangle = \frac{1}{P} \sum_{p=1}^{P} (\cdot) $$

- The cost function is then written as:

$$ E(\mathbf{w}) = \mathbf{w}^T \left\langle \mathbf{x}^p (\mathbf{x}^p)^T \right\rangle \mathbf{w} - 2 \mathbf{w}^T \left\langle \mathbf{x}^p d^p \right\rangle + \left\langle (d^p)^2 \right\rangle $$



ADALINE – Matrix Notation

- Defining:

$$ R_{xx} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T = \left\langle \mathbf{x}^p (\mathbf{x}^p)^T \right\rangle $$

$$ \mathbf{p} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p = \left\langle \mathbf{x}^p d^p \right\rangle $$

$$ \sigma_d^2 = \frac{1}{P} \sum_{p=1}^{P} (d^p)^2 = \left\langle (d^p)^2 \right\rangle $$

- the cost function is:

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$
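A minimal NumPy sketch of these averages, assuming the patterns (with the bias column included) are the rows of an illustrative matrix Xb and the targets are in a vector d:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # rows are x^p
d = np.array([-1.0, -1.0, -1.0, 1.0])
P = Xb.shape[0]

Rxx = (Xb.T @ Xb) / P          # R_xx = <x^p (x^p)^T>, shape (N+1, N+1)
p_vec = (Xb.T @ d) / P         # p    = <x^p d^p>,     shape (N+1,)
sigma_d2 = np.mean(d ** 2)     # sigma_d^2 = <(d^p)^2>

def cost(w):
    # E(w) = w^T Rxx w - 2 w^T p + sigma_d^2
    return w @ Rxx @ w - 2 * w @ p_vec + sigma_d2

print(cost(np.zeros(Xb.shape[1])))  # with w = 0 the cost reduces to sigma_d^2
```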

ADALINE – Quadratic Cost

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$

- $R_{xx}$ is a covariance matrix: positive semi-definite.

- The error surface is therefore a paraboloid.



ADALINE – Gradient Vector

$$ E(\mathbf{w}) = \mathbf{w}^T R_{xx} \mathbf{w} - 2 \mathbf{w}^T \mathbf{p} + \sigma_d^2 $$

$$ \nabla E(\mathbf{w}) = 2 R_{xx} \mathbf{w} - 2 \mathbf{p} $$


ADALINE – Closed Form Solution

$$ \nabla E(\mathbf{w}^*) = 0 \;\Leftrightarrow\; R_{xx} \mathbf{w}^* = \mathbf{p} $$

- If $R_{xx}$ is positive definite, the minimum is unique:

$$ \mathbf{w}^* = R_{xx}^{-1} \mathbf{p} $$

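Continuing the illustrative Rxx / p_vec sketch from the matrix-notation slides, the optimal weights can be computed in one line. Solving the linear system rather than forming the explicit inverse is the usual numerical choice:

```python
import numpy as np

# Rxx and p_vec as in the earlier sketch (illustrative data, bias column included).
Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])
P = Xb.shape[0]
Rxx = (Xb.T @ Xb) / P
p_vec = (Xb.T @ d) / P

# w* = Rxx^{-1} p, obtained by solving Rxx w = p instead of inverting Rxx.
w_star = np.linalg.solve(Rxx, p_vec)
print(w_star)
```

Using np.linalg.solve avoids computing the inverse explicitly, which is relevant to the remark on the next slide about inversion being problematic in high dimensions.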


ADALINE – Closed Form Solution

- The closed-form solution requires inverting the covariance matrix, which can be problematic in high dimensions.

- Gradient methods are simpler and have proven convergence properties for quadratic functions:

$$ w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k} $$


ADALINE – Gradient Based Solution

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 $$

- Remember, from a few slides back:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$



ADALINE – Gradient Based Solution

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

$$ w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k} $$

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$


ADALINE – Batch Algorithm

- Initialize the weights at arbitrary values.
- Define a learning rate η.
- Repeat:
  - For each pattern in the training set:
    - Apply x^p to the ADALINE input.
    - Observe the output s^p and compute the error e^p = s^p - d^p.
    - For each weight k, accumulate the product x_k^p e^p.
  - After processing all patterns, update each weight k by:

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$

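A compact NumPy sketch of this batch procedure; the data arrays, the initialization scale, and the fixed epoch count used as a stopping rule are illustrative assumptions, not part of the slides:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])  # x_0 = 1 prepended
d = np.array([-1.0, -1.0, -1.0, 1.0])
P, n_weights = Xb.shape

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=n_weights)  # arbitrary initial weights
eta = 0.1                                  # learning rate

for epoch in range(200):                   # fixed number of epochs as a simple stopping rule
    e = Xb @ w - d                         # e^p = s^p - d^p for all patterns
    grad = (2.0 / P) * (Xb.T @ e)          # accumulated x_k^p e^p over the whole training set
    w = w - eta * grad                     # one update per epoch, after seeing all patterns

print(w, np.mean((Xb @ w - d) ** 2))       # final weights and mean squared error
```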


ADALINE – Batch Algorithm

ADALINE’s batch algorithm properties:

- Guaranteed to converge to the weight set with minimum squared error:
  - given a sufficiently small learning rate η,
  - even when the training data contains noise,
  - even when the training data is not separable.


ADALINE – Batch Algorithm

- ADALINE’s batch algorithm requires all the training data to be available from the beginning.

- The weights are updated only after the whole training set has been presented.

- But humans learn continuously!

- In some applications we may want to update the weights immediately after each training pattern becomes available.



ADALINE – Incremental Algorithm

- Incremental algorithm: approximate the complete gradient by its per-pattern estimate.

Complete (exact) gradient:

$$ \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p $$

Stochastic (approximate) gradient:

$$ \frac{\partial \hat{E}}{\partial w_k} = x_k^p e^p $$


ADALINE – Incremental Algorithm

Incremental mode gradient descent:

$$ w_k^{(t+1)} = w_k^{(t)} - 2 \eta x_k^p e^p $$

Batch mode gradient descent:

$$ w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P} \sum_{p=1}^{P} 2 \eta x_k^p e^p $$

- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
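A sketch of the incremental update on the same illustrative data as before; the point to notice is that the weights change after every single pattern rather than once per pass:

```python
import numpy as np

Xb = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
d = np.array([-1.0, -1.0, -1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=Xb.shape[1])
eta = 0.05

for epoch in range(200):
    order = rng.permutation(len(d))        # present the patterns in random order
    for p in order:
        e_p = Xb[p] @ w - d[p]             # error for this single pattern
        w = w - 2 * eta * e_p * Xb[p]      # incremental (per-pattern) update
print(w)
```
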
ADALINE – Incremental Algorithm

- Incremental gradient descent is also known as stochastic gradient descent.

- It is also called the LMS algorithm or the Delta Rule.

- It is based on an approximation of the gradient, so it never goes exactly to the minimum of the cost function.

- After reaching a vicinity of the minimum, it oscillates around it.

- The amplitude of the oscillations can be reduced by reducing η.


ADALINE - Comparison

- The plots show the value of one weight over time.

[Figure: two plots, "Batch" (horizontal axis: epochs) and "Incremental" (horizontal axis: patterns)]

- 1 epoch = P patterns (the full training set).



ADALINE vs Perceptron

- Both the ADALINE Delta Rule and the Perceptron weight update rule are instances of Error Correction Learning.

ADALINE Delta Rule:

$$ w_k^{(t+1)} = w_k^{(t)} - 2 \eta x_k^p (s^p - d^p) $$

Perceptron update rule:

$$ w_k^{(t+1)} = w_k^{(t)} - \eta x_k^p (y^p - d^p) $$

- The ADALINE allows arbitrary real output values, whereas the perceptron assumes binary outputs.

- The ADALINE always converges (given a small enough η) to the minimum squared error, while the perceptron only converges when the data is separable.

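The two rules differ only in which output enters the error term. In the illustrative sketch below, s is the raw linear output (ADALINE) and y = sign(s) is the thresholded output (perceptron); the function and variable names are mine, not from the slides:

```python
import numpy as np

def adaline_update(w, x, d, eta):
    # Delta rule: error measured at the linear output s = w.x
    s = w @ x
    return w - 2 * eta * (s - d) * x

def perceptron_update(w, x, d, eta):
    # Perceptron rule: error measured at the thresholded output y = sign(w.x)
    y = 1.0 if w @ x >= 0 else -1.0
    return w - eta * (y - d) * x

w = np.zeros(3)
x, d = np.array([1.0, 0.5, -1.0]), 1.0      # one (pattern, target) pair, with x_0 = 1
print(adaline_update(w, x, d, eta=0.1))     # weights move because s = 0 differs from d = 1
print(perceptron_update(w, x, d, eta=0.1))  # no change because y already equals d
```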

ADALINE – Statistical Interpretation

- The analytical solution for the weights was obtained by “averaging” quantities computed from the training set.

- It is possible to give a statistical interpretation of the process:
  - Inputs: observations of N random variables, X = [1, X1, ..., Xk, ..., XN]
  - Desired output: observations of one random variable, D
  - Output: observations of one random variable, Y = w^T X



ADALINE – Statistical Interpretation

- The error function can be interpreted as an approximation to the statistical expectation E[·]:

$$ E(\mathbf{w}) = \frac{1}{P} \sum_{p=1}^{P} (y^p - d^p)^2 \approx E\left[ (Y - D)^2 \right] $$

Solution:

$$ \mathbf{w}^* = R_{xx}^{-1} \mathbf{p} $$

- The matrix R_xx and the vector p can be interpreted as approximations to the statistical auto-covariance and cross-covariance of the random variables:

$$ R_{xx} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p (\mathbf{x}^p)^T \approx E[X X^T], \qquad \mathbf{p} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{x}^p d^p \approx E[D X] $$
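A small illustrative experiment (the generative model, noise level, and sample size are assumptions, not from the slides) showing that the sample averages R_xx and p approach their expectations, so the estimated weights approach the generating ones as P grows:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 100_000, 2

# Illustrative model: x = [1, X1, X2] with X1, X2 ~ N(0, 1), and D = w_true . x + noise.
X_raw = rng.normal(size=(P, N))
Xb = np.hstack([np.ones((P, 1)), X_raw])
w_true = np.array([0.5, -1.0, 2.0])
D = Xb @ w_true + 0.1 * rng.normal(size=P)

Rxx = (Xb.T @ Xb) / P          # sample estimate of E[X X^T]
p_vec = (Xb.T @ D) / P         # sample estimate of E[D X]

w_star = np.linalg.solve(Rxx, p_vec)
print(w_star)                  # close to w_true for large P
```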

ADALINE – Statistical Interpretation

- The LMS algorithm is based on an instantaneous estimate of the gradient.

- This estimate can be modeled as

$$ \hat{g}(n) = g(n) + e_g(n) $$

where $e_g(n)$ is a random noise vector.

- LMS = stochastic gradient descent.



ADALINE – Statistical Interpretation

- Under reasonable conditions, stochastic gradient methods may converge to the exact solution.

- Convergence conditions [Monro and Ljung]:
  - e_g(n) is zero mean.
  - The pattern sequence is random.
  - η(n) tends slowly toward zero:

$$ \sum_{n=0}^{\infty} \eta(n) = \infty, \qquad \sum_{n=0}^{\infty} \eta^2(n) < \infty $$


ADALINE – Statistical Interpretation

- Typical learning rate schedules:

$$ \eta(n) = \frac{c}{n} $$

$$ \eta(n) = \frac{\eta_0}{1 + n/\tau} $$

[Figure: both schedules plotted over n = 0 … 1000, decaying from about 1 toward 0]
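A short sketch of the two schedules (the constants c, eta0, and tau are illustrative choices), with a finite-horizon numerical look at the summation conditions above:

```python
import numpy as np

def eta_inverse(n, c=1.0):
    # eta(n) = c / n, defined for n >= 1
    return c / n

def eta_search_then_converge(n, eta0=1.0, tau=100.0):
    # eta(n) = eta0 / (1 + n / tau): nearly constant for n << tau, then decays like 1/n
    return eta0 / (1.0 + n / tau)

n = np.arange(1, 1001)
print(eta_inverse(n)[:5])                  # 1.0, 0.5, 0.333..., ...
print(eta_search_then_converge(n)[:5])     # decays much more slowly at first

# Finite-horizon look at the conditions: the plain sum keeps growing, the squared sum levels off.
print(eta_inverse(n).sum(), (eta_inverse(n) ** 2).sum())
```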
