Lec 7
Linear discriminants, neural networks.
Discriminant functions (1)
Bayes classification rule:
$P(\omega_1 \mid x) - P(\omega_2 \mid x) > 0 \ ?\ \omega_1 : \omega_2$
$f_{\omega_1,\omega_2}(x) > 0 \ ?\ \omega_1 : \omega_2$
$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$ – decision surface
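A minimal sketch of applying such a decision rule, assuming a linear discriminant $f(x) = w^T x + w_0$; the weights and test points below are made up for illustration:

import numpy as np

# assumed linear discriminant f(x) = w^T x + w0
w = np.array([1.0, -2.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias term

def classify(x):
    # decide class 1 if f(x) > 0, otherwise class 2
    return "class 1" if w @ x + w0 > 0 else "class 2"

print(classify(np.array([2.0, 0.0])))   # f = 2.5  -> class 1
print(classify(np.array([0.0, 1.0])))   # f = -1.5 -> class 2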
Discriminant functions (2)
[Figure: Class 1 and Class 2 sample distributions separated by a decision surface.]
For a linear discriminant $w^T x$, define for each training sample $x$:
$\delta_x = -1$, if $x \in \omega_1$ and $x$ is misclassified ($w^T x < 0$)
$\delta_x = +1$, if $x \in \omega_2$ and $x$ is misclassified ($w^T x \ge 0$)
$\delta_x = 0$, if $x$ is correctly classified
Linear discriminant – perceptron cost function
[Figure: Class 1 and Class 2 samples, a linear decision surface, and a misclassified sample $x$ at some distance from it.]

Perceptron cost function:
$J(w) = \sum_{x \in Y} \delta_x\, w^T x$, where $Y$ is the set of misclassified samples.

The value of $J(w)$ is proportional to the sum of distances of all misclassified samples to the decision surface.

At points where $J(w)$ is differentiable:
$\dfrac{\partial J(w)}{\partial w} = \sum_{x \in Y} \delta_x\, x$

$\hat{w} = \arg\min_w J(w)$
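A minimal sketch of perceptron training with these definitions; the data, learning rate, and epoch limit are assumptions for illustration:

import numpy as np

# hypothetical 2-D training set: labels +1 for class 1, -1 for class 2
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
labels = np.array([1, 1, -1, -1])

w = np.zeros(2)   # weight vector (bias omitted for simplicity)
rho = 0.1         # assumed learning rate

for epoch in range(100):
    # a sample is misclassified when label * w^T x <= 0
    mis = labels * (X @ w) <= 0
    if not mis.any():
        break                      # J(w) = 0: all samples correctly classified
    delta = -labels[mis]           # delta_x = -1 for class 1, +1 for class 2
    grad = (delta[:, None] * X[mis]).sum(axis=0)   # dJ/dw = sum of delta_x * x
    w = w - rho * grad             # gradient-descent step toward arg min J(w)

print(w)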
Sum of error squares estimation
Minimize the sum of error squares
$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$

by setting its gradient to zero:
$\dfrac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i (y_i - w^T x_i) = 0$

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

Thus
$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i$
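A minimal numerical sketch of this closed-form estimate; the data below are made up:

import numpy as np

# hypothetical training set: N samples x_i (rows of X) with targets y_i
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# w_hat = (sum_i x_i x_i^T)^{-1} sum_i x_i y_i, i.e. solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)

Solving the normal equations with np.linalg.solve avoids forming the explicit inverse of $\sum_i x_i x_i^T$, which is numerically preferable.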
Neurons
Artificial neuron.
[Figure: an artificial neuron with inputs $x_1, \dots, x_l$, weights $w_1, \dots, w_l$, bias weight $w_0$, a summation node, and activation function $f$.]

Typical activation functions:
step: $f(x) = 1$ for $x \ge 0$, $f(x) = 0$ for $x < 0$
sigmoid: $f(x) = \dfrac{1}{1 + e^{-ax}}$
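A minimal sketch of a single artificial neuron with both activation functions; the weights, bias, and slope $a$ are assumed values:

import numpy as np

def step(x):
    # hard threshold: 1 if x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x, a=1.0):
    # logistic function 1 / (1 + exp(-a*x))
    return 1.0 / (1.0 + np.exp(-a * x))

w = np.array([0.8, -0.4])   # hypothetical input weights w_1, w_2
w0 = 0.1                    # hypothetical bias weight

x = np.array([1.0, 2.0])
v = w @ x + w0              # weighted sum of inputs plus bias
print(step(v), sigmoid(v))  # outputs of the two activation choices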
Combining artificial neurons
[Figure: a multilayer perceptron with inputs $x_1, \dots, x_l$ feeding layers of neurons with activation $f$. Neuron $j$ of layer $r$ combines the previous-layer outputs $y_k^{r-1}$ through weights $w_{jk}^r$ into $v_j^r$ and outputs $y_j^r = f(v_j^r)$.]

Error on training sample $i$:
$E(i) = \dfrac{1}{2} \sum_{m=1}^{k_L} e_m^2(i)$

Total cost function: $J = \sum_{i=1}^{N} E(i)$

Goal of the training: find the values of $w_{jk}^r$ which minimize the cost function $J$.
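A minimal sketch of the forward pass and the cost $J$ for such a network; the architecture, random weights, and data are assumptions (sigmoid activation with $a = 1$):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# assumed network: 2 inputs -> 3 hidden neurons -> 1 output neuron
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # weights w_jk^1 and bias weights
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # weights w_jk^2 and bias weights

def forward(x):
    v1 = W1 @ x + b1        # v_j^1: weighted sums in layer 1
    y1 = sigmoid(v1)        # y_j^1 = f(v_j^1)
    v2 = W2 @ y1 + b2       # v_j^2: weighted sums in the output layer
    return sigmoid(v2)      # network outputs

# assumed training samples and desired outputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

# J = sum_i E(i), with E(i) = 1/2 * sum_m e_m^2(i)
J = sum(0.5 * np.sum((forward(x) - y) ** 2) for x, y in zip(X, Y))
print(J)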
Gradient descent
Denote: $w_j^r = [w_{j0}^r, w_{j1}^r, \dots, w_{j k_{r-1}}^r]^T$

Gradient descent: $w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial J}{\partial w_j^r}$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial E(i)}{\partial w_j^r}$
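A minimal sketch contrasting the two update schemes; the per-sample gradients and the step size $\mu$ are placeholders, not computed from a real network:

import numpy as np

mu = 0.1   # assumed learning parameter

# placeholder gradients dE(i)/dw for N = 3 samples and 2 weights
grads = [np.array([0.2, -0.1]), np.array([-0.3, 0.4]), np.array([0.1, 0.1])]

# batch mode: one step with the gradient of J = sum_i E(i)
w_batch = np.zeros(2) - mu * np.sum(grads, axis=0)

# pattern mode: update after each training sample separately
w_pattern = np.zeros(2)
for g in grads:
    w_pattern = w_pattern - mu * g

print(w_batch, w_pattern)   # equal here only because the gradients were held fixed

In real training the two schemes differ, because the per-sample gradients are recomputed at the already updated weights.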
Gradient descent
Chain rule for differentiating composite functions:
$\dfrac{\partial E(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)} \cdot \dfrac{\partial v_j^r(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)}\, y^{r-1}(i)$

Denote: $\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$
Backpropagation
If $r = L$, then
$\delta_j^L(i) = \dfrac{\partial E(i)}{\partial v_j^L(i)} = \dfrac{\partial}{\partial v_j^L(i)} \left[ \dfrac{1}{2} \sum_{m=1}^{k_L} \left( f(v_m^L(i)) - \hat{y}_m(i) \right)^2 \right]$
$= \left( f(v_j^L(i)) - \hat{y}_j(i) \right) f'(v_j^L(i)) = e_j(i)\, f'(v_j^L(i))$

If $r < L$, then
$\delta_j^{r-1}(i) = \dfrac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \dfrac{\partial E(i)}{\partial v_k^r(i)} \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i)\, \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
$= \left[ \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r \right] f'(v_j^{r-1}(i))$
Backpropagation algorithm
For each training sample $i$, the $\delta_j^r(i)$ are computed backwards from the output layer and each weight vector is updated:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\, \delta_j^r(i)\, y^{r-1}(i)$
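A minimal sketch of backpropagation for a 2-layer sigmoid network built from these rules; the architecture, data, targets, and learning parameter $\mu$ are assumptions for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
# assumed architecture: 2 inputs -> 3 hidden -> 1 output; column 0 of each matrix is the bias weight
W1 = rng.normal(scale=0.5, size=(3, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))
mu = 0.5   # assumed learning parameter

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])   # desired outputs y_hat

for epoch in range(5000):
    for x, t in zip(X, T):
        # forward pass: prepend 1 to each layer's input for the bias weight
        y0 = np.concatenate(([1.0], x))
        v1 = W1 @ y0
        y1 = np.concatenate(([1.0], sigmoid(v1)))
        v2 = W2 @ y1
        y2 = sigmoid(v2)

        # output layer: delta_j^L = e_j * f'(v_j^L), with f'(v) = f(v)(1 - f(v)) for the sigmoid
        d2 = (y2 - t) * y2 * (1.0 - y2)
        # hidden layer: delta_j^{r-1} = [sum_k delta_k^r w_kj^r] * f'(v_j^{r-1}), bias column excluded
        d1 = (W2[:, 1:].T @ d2) * sigmoid(v1) * (1.0 - sigmoid(v1))

        # updates: w_j^r(new) = w_j^r(old) - mu * delta_j^r(i) * y^{r-1}(i)
        W2 -= mu * np.outer(d2, y1)
        W1 -= mu * np.outer(d1, y0)

def predict(x):
    y1 = np.concatenate(([1.0], sigmoid(W1 @ np.concatenate(([1.0], x)))))
    return sigmoid(W2 @ y1)

print(np.array([predict(x)[0] for x in X]))   # outputs after training, to compare with T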
MLP issues
• What is the best network configuration?
• How to choose a proper learning parameter $\mu$?
• When should training be stopped?
• Should another threshold function $f$ or cost function $J$ be chosen?