Neural Networks - The ADALINE: Last Lecture Summary
Biological Neurons
Artificial Neurons
Rosenblatt’s Perceptron
Neural Networks
The ADALINE
The ADALINE
Widrow-Hoff (1960)

Minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e'). The decision boundary is

$$w_0 + x_1 w_1 + \dots + x_N w_N = 0$$

Minimize the cost function:

$$E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(e^p\right)^2 = \frac{1}{P}\sum_{p=1}^{P}\left(s^p - d^p\right)^2$$
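As a concrete illustration (not from the lecture), a minimal numpy sketch of this cost, assuming a data matrix `X` of shape `(P, N+1)` whose rows are the patterns (with the bias input $x_0^p = 1$ in the first column) and a target vector `d` of length `P`:

```python
import numpy as np

def adaline_cost(w, X, d):
    """E(w) = (1/P) * sum_p (s^p - d^p)^2, measured at the linear output."""
    s = X @ w            # linear outputs s^p = w^T x^p, one per pattern
    e = s - d            # errors e^p = s^p - d^p
    return np.mean(e ** 2)
```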
ADALINE - Simplification
The quantities involved:

$$E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(e^p\right)^2 \quad (1) \qquad\qquad \frac{\partial E}{\partial w_k} = 0,\ \forall k = 0,\dots,N \quad (2)$$

$$e^p = s^p - d^p \quad (3) \qquad\qquad s^p = \sum_{l=0}^{N} w_l x_l^p \quad (4)$$

with the weight vector $\vec{w} = [w_0\ w_1\ \dots\ w_N]^T$.
Compute the gradient of the cost function $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(e^p\right)^2$:

$$\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2 e^p \frac{\partial e^p}{\partial w_k}
= \frac{1}{P}\sum_{p=1}^{P} 2 e^p \frac{\partial}{\partial w_k}\left(s^p - d^p\right)
= \frac{1}{P}\sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k}$$

Since $s^p = \sum_{l=0}^{N} w_l x_l^p$, we have $\dfrac{\partial s^p}{\partial w_k} = x_k^p$, and therefore

$$\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2 e^p x_k^p$$
Very important! The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

$$\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$$
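In the same illustrative setting as before, this formula vectorizes to a single matrix product that yields the gradient for all weights at once (a sketch, with the same assumed `X` and `d`):

```python
import numpy as np

def adaline_gradient(w, X, d):
    """dE/dw_k = (2/P) * sum_p x_k^p * e^p, computed for all k at once."""
    e = X @ w - d                      # errors e^p for all P patterns
    return (2.0 / len(d)) * (X.T @ e)  # row k of X.T pairs x_k^p with e^p
```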
Given that $e^p = s^p - d^p$:

$$\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2 e^p x_k^p
= \frac{1}{P}\sum_{p=1}^{P} 2\left(s^p - d^p\right) x_k^p
= \frac{1}{P}\sum_{p=1}^{P} 2 s^p x_k^p - \frac{1}{P}\sum_{p=1}^{P} 2 d^p x_k^p$$

Substituting $s^p = \sum_{l=0}^{N} w_l x_l^p$:

$$\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P}\sum_{l=0}^{N} 2 w_l x_l^p x_k^p - \frac{1}{P}\sum_{p=1}^{P} 2 d^p x_k^p$$

Imposing $\dfrac{\partial E}{\partial w_k} = 0,\ \forall k = 0,\dots,N$ yields a linear system in the weights (the normal equations):

$$\sum_{p=1}^{P}\sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0,\dots,N$$
In vector notation:

$$\vec{w} = [w_0\ w_1\ \dots\ w_N]^T, \qquad \vec{x}^p = [x_0^p\ x_1^p\ \dots\ x_N^p]^T, \quad x_0^p = 1$$

$$s^p = \sum_{l=0}^{N} w_l x_l^p = \vec{w}^T\vec{x}^p, \qquad e^p = \vec{w}^T\vec{x}^p - d^p$$
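The convention $x_0^p = 1$ is typically realized by prepending a column of ones to the raw data matrix, so that $w_0$ plays the role of the bias; a small sketch (function name illustrative):

```python
import numpy as np

def augment_with_bias(X_raw):
    """Prepend x_0 = 1 to every pattern so w_0 acts as the bias weight."""
    ones = np.ones((X_raw.shape[0], 1))
    return np.hstack([ones, X_raw])
```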
The cost function then becomes

$$E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(e^p\right)^2 = \frac{1}{P}\sum_{p=1}^{P}\left(\vec{w}^T\vec{x}^p - d^p\right)^2$$

$$E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(\vec{w}^T\vec{x}^p - d^p\right)\left((\vec{x}^p)^T\vec{w} - d^p\right)$$

$$E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(\vec{w}^T\vec{x}^p(\vec{x}^p)^T\vec{w} - \vec{w}^T\vec{x}^p d^p - d^p(\vec{x}^p)^T\vec{w} + \left(d^p\right)^2\right)$$

Since the two middle terms are equal scalars:

$$E(\vec{w}) = \vec{w}^T\left(\frac{1}{P}\sum_{p=1}^{P}\vec{x}^p(\vec{x}^p)^T\right)\vec{w} - 2\vec{w}^T\frac{1}{P}\sum_{p=1}^{P}\vec{x}^p d^p + \frac{1}{P}\sum_{p=1}^{P}\left(d^p\right)^2$$

With the averaging notation $\overline{(\cdot)} = \frac{1}{P}\sum_{p=1}^{P}(\cdot)$:

$$E(\vec{w}) = \vec{w}^T\,\overline{\vec{x}^p(\vec{x}^p)^T}\,\vec{w} - 2\vec{w}^T\,\overline{\vec{x}^p d^p} + \overline{\left(d^p\right)^2}$$
Defining:

$$R_{xx} = \frac{1}{P}\sum_{p=1}^{P}\vec{x}^p(\vec{x}^p)^T = \overline{\vec{x}^p(\vec{x}^p)^T}, \qquad
\vec{p} = \frac{1}{P}\sum_{p=1}^{P}\vec{x}^p d^p = \overline{\vec{x}^p d^p}, \qquad
\sigma_d^2 = \frac{1}{P}\sum_{p=1}^{P}\left(d^p\right)^2 = \overline{\left(d^p\right)^2}$$

the cost function takes the quadratic form

$$E(\vec{w}) = \vec{w}^T R_{xx}\vec{w} - 2\vec{w}^T\vec{p} + \sigma_d^2$$

Its gradient is

$$\nabla E(\vec{w}) = 2R_{xx}\vec{w} - 2\vec{p}$$

and setting it to zero gives the optimal weight vector:

$$\nabla E(\vec{w}^*) = 0 \iff R_{xx}\vec{w}^* = \vec{p} \iff \vec{w}^* = R_{xx}^{-1}\vec{p}$$
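Under the same illustrative conventions (rows of `X` are the augmented patterns $\vec{x}^p$), the closed-form solution can be sketched as follows, assuming $R_{xx}$ is non-singular:

```python
import numpy as np

def adaline_closed_form(X, d):
    """Solve R_xx w* = p for the optimal ADALINE weights."""
    P = X.shape[0]
    Rxx = (X.T @ X) / P        # R_xx = (1/P) sum_p x^p (x^p)^T
    p = (X.T @ d) / P          # p = (1/P) sum_p x^p d^p
    return np.linalg.solve(Rxx, p)
```

Solving the linear system directly with `np.linalg.solve` is numerically preferable to forming $R_{xx}^{-1}$ explicitly.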
Alternatively, minimize the cost $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P}\left(e^p\right)^2$ by gradient descent:

$$w_k^{(t+1)} = w_k^{(t)} - \eta\frac{\partial E}{\partial w_k}$$

Remember ~10 slides back:

$$\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$$

which gives the update

$$w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P}\sum_{p=1}^{P} 2\eta\, x_k^p e^p$$
Complete (exact) gradient:

$$\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$$

Stochastic (approximate) gradient, estimated from a single pattern:

$$\frac{\partial \hat{E}}{\partial w_k} = 2 x_k^p e^p$$

These yield, respectively, the batch and stochastic updates:

$$w_k^{(t+1)} = w_k^{(t)} - \frac{1}{P}\sum_{p=1}^{P} 2\eta\, x_k^p e^p, \qquad
w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p e^p$$
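Both update modes can be sketched as simple training loops (learning rate, epoch count, and zero initialization are illustrative choices, not values from the lecture):

```python
import numpy as np

def train_batch(X, d, eta=0.01, epochs=100):
    """Batch LMS: one weight update per epoch, from the exact gradient."""
    P, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        e = X @ w - d                       # errors for all P patterns
        w -= (2.0 * eta / P) * (X.T @ e)    # w_k -= (2 eta / P) sum_p x_k^p e^p
    return w

def train_incremental(X, d, eta=0.01, epochs=100):
    """Incremental (stochastic) LMS: one weight update per pattern."""
    P, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for p in np.random.permutation(P):  # visit patterns in random order
            e_p = X[p] @ w - d[p]           # error on this single pattern
            w -= 2.0 * eta * e_p * X[p]     # w_k -= 2 eta x_k^p e^p
    return w
```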
ADALINE - Comparison
Batch learning performs one update per epoch; incremental learning performs one update per pattern. The incremental ADALINE update and the perceptron update are:

$$w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p\left(s^p - d^p\right) \qquad\text{(ADALINE)}$$

$$w_k^{(t+1)} = w_k^{(t)} - \eta\, x_k^p\left(y^p - d^p\right) \qquad\text{(Perceptron)}$$

The ADALINE allows arbitrary real values at the output, whereas the perceptron assumes binary outputs ($y^p$ is the thresholded output).
The stochastic gradient is a noisy estimate of the true gradient:

$$\hat{g}(n) = g(n) + e_g(n)$$

For convergence, the learning rate must decay over iterations $n$ such that

$$\sum_{n=0}^{\infty}\eta(n) = \infty, \qquad \sum_{n=0}^{\infty}\eta^2(n) < \infty$$
Two common learning-rate schedules satisfy these conditions:

$$\eta(n) = \frac{c}{n}$$

[Plot: $\eta(n) = c/n$, decaying from about 0.9 toward 0 over $n = 0,\dots,1000$]

$$\eta(n) = \frac{\eta_0}{1 + n/\tau}$$

[Plot: $\eta(n) = \eta_0/(1 + n/\tau)$, decaying from about 0.9 toward 0 over $n = 0,\dots,1000$]
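Both schedules are straightforward to compute; a minimal sketch with illustrative constants ($c = 0.9$, $\eta_0 = 0.9$, $\tau = 100$ are assumptions chosen to resemble the plots, not values from the lecture):

```python
import numpy as np

n = np.arange(1, 1001)                        # iteration index; start at 1 to avoid c/0
eta_harmonic = 0.9 / n                        # eta(n) = c / n, with assumed c = 0.9
eta_search_converge = 0.9 / (1 + n / 100.0)   # eta(n) = eta0 / (1 + n/tau), assumed eta0 = 0.9, tau = 100

# Both sequences decay slowly enough that sum(eta) diverges while
# sum(eta**2) converges, as required by the conditions above.
```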