Lec 7


Non-Bayes classifiers.

Linear discriminants,
neural networks.
Discriminant functions (1)
Bayes classification rule:

$P(\omega_1 \mid x) - P(\omega_2 \mid x) > 0 \;?\; \omega_1 : \omega_2$

Instead we might try to find a function:

$f_{\omega_1,\omega_2}(x) > 0 \;?\; \omega_1 : \omega_2$

$f_{\omega_1,\omega_2}(x)$ is called a discriminant function.

$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$ is the decision surface.
Discriminant functions (2)
(Figure: two example datasets with Class 1 and Class 2 samples separated by decision surfaces.)

Linear discriminant function:

$f_{\omega_1,\omega_2}(x) = w^T x + w_0$

Decision surface is a hyperplane $w^T x + w_0 = 0$.
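A minimal Python sketch of this decision rule; the weight vector and bias below are made-up values for illustration:

```python
import numpy as np

# Linear discriminant decision rule: class 1 if w^T x + w0 > 0, else class 2.
w = np.array([2.0, -1.0])   # weight vector (illustrative values)
w0 = 0.5                    # bias term (illustrative value)

def classify(x):
    """Assign class 1 if f(x) = w^T x + w0 > 0, otherwise class 2."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([1.0, 0.0])))   # f(x) = 2.5 > 0  -> class 1
print(classify(np.array([-1.0, 1.0])))  # f(x) = -2.5 < 0 -> class 2
```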
Linear discriminant – perceptron cost function
Replace $w \leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$.

Thus the decision function is now $f_{\omega_1,\omega_2}(x) = w^T x$
and the decision surface is $w^T x = 0$.

Perceptron cost function:

$J(w) = \sum_x \delta_x\, w^T x$

where $\delta_x = \begin{cases} -1, & \text{if } x \in \omega_1 \text{ and } x \text{ is misclassified } (w^T x < 0) \\ +1, & \text{if } x \in \omega_2 \text{ and } x \text{ is misclassified } (w^T x > 0) \\ 0, & \text{if } x \text{ is correctly classified} \end{cases}$
Linear discriminant – perceptron cost function
(Figure: Class 1 and Class 2 samples with a linear decision surface; misclassified samples lie on the wrong side.)

Perceptron cost function:

$J(w) = \sum_x \delta_x\, w^T x$

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$.

Otherwise, $J(w) > 0$ and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use a gradient descent algorithm.
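Before turning to the algorithm, a small Python sketch of evaluating this cost; it assumes the samples are already in extended form (a 1 appended to each x, $w_0$ appended to w), and the function name and inputs are illustrative:

```python
import numpy as np

# Perceptron cost J(w) = sum over misclassified x of delta_x * w^T x.
def perceptron_cost(w, X1, X2):
    """X1, X2: 2-D arrays of extended samples from class 1 and class 2."""
    cost = 0.0
    for x in X1:                     # class 1: misclassified if w^T x < 0
        if w @ x < 0:
            cost += -1.0 * (w @ x)   # delta_x = -1, contribution is positive
    for x in X2:                     # class 2: misclassified if w^T x > 0
        if w @ x > 0:
            cost += +1.0 * (w @ x)   # delta_x = +1
    return cost                      # zero iff every sample is on the right side
```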
Linear discriminant – Perceptron algorithm
Gradient descent:

$w(t+1) = w(t) - \rho_t \left.\dfrac{\partial J(w)}{\partial w}\right|_{w = w(t)}$

At points where $J(w)$ is differentiable, $\dfrac{\partial J(w)}{\partial w} = \sum_{\text{misclassified } x} \delta_x x$

Thus $w(t+1) = w(t) - \rho_t \sum_{\text{misclassified } x} \delta_x x$

The perceptron algorithm converges when the classes are linearly separable, with some conditions on $\rho_t$.
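A sketch of the resulting training loop in Python, assuming a constant step size $\rho$ and samples already in extended form; the names and defaults are illustrative:

```python
import numpy as np

# Perceptron training: gradient descent on the perceptron cost.
# w(t+1) = w(t) - rho * sum over misclassified x of delta_x * x
def train_perceptron(X1, X2, rho=0.1, max_iter=1000):
    dim = X1.shape[1]
    w = np.zeros(dim)
    for _ in range(max_iter):
        grad = np.zeros(dim)
        misclassified = 0
        for x in X1:                 # class 1 misclassified when w^T x <= 0
            if w @ x <= 0:
                grad += -x           # delta_x = -1
                misclassified += 1
        for x in X2:                 # class 2 misclassified when w^T x >= 0
            if w @ x >= 0:
                grad += x            # delta_x = +1
                misclassified += 1
        if misclassified == 0:       # classes separated: algorithm has converged
            break
        w = w - rho * grad
    return w
```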
Sum of error squares estimation
Let $y(x) = \pm 1$ denote the desired output function: $+1$ for one class and $-1$ for the other.

We want to find a discriminant function $f_{\omega_1,\omega_2}(x) = w^T x$ whose output is similar to $y(x)$.

Use the sum of error squares as the similarity criterion:

$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$

$\hat{w} = \arg\min_w J(w)$
Sum of error squares estimation
Minimize the mean square error:

$\dfrac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i (y_i - w^T x_i) = 0$

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

Thus

$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \left( \sum_{i=1}^{N} x_i y_i \right)$
Neurons
Artificial neuron.
(Figure: inputs $x_1, x_2, \ldots, x_l$ with weights $w_1, w_2, \ldots, w_l$ and bias $w_0$ feeding a summation followed by a threshold function $f$.)

The figure above represents an artificial neuron calculating:

$y = f\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$
Artificial neuron.
Threshold functions f:

Step function:
$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$

Logistic function:
$f(x) = \dfrac{1}{1 + e^{-ax}}$
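A minimal Python sketch of a single neuron with these two threshold functions; the weights and inputs are arbitrary example values:

```python
import numpy as np

# One artificial neuron: y = f(sum_i w_i x_i + w0).
def step(v):
    return 1.0 if v >= 0 else 0.0

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron(x, w, w0, f):
    return f(w @ x + w0)

x = np.array([0.5, -1.0, 2.0])       # example inputs
w = np.array([1.0, 0.2, -0.3])       # example weights
print(neuron(x, w, 0.1, step))       # hard threshold output (0 or 1)
print(neuron(x, w, 0.1, logistic))   # smooth output in (0, 1)
```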
Combining artificial neurons
(Figure: inputs $x_1, x_2, \ldots, x_l$ feed a network of artificial neurons arranged in layers.)

Multilayer perceptron with 3 layers.
Discriminating ability of multilayer perceptron
Since a 3-layer perceptron can approximate any smooth function, it can approximate
$F(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$,
the optimal discriminant function of the two classes.
Training of multilayer perceptron
(Figure: outputs $y_k^{r-1}$ of the neurons in layer r−1 feed neuron $j$ of layer $r$ through weights $w_{jk}^r$; neuron $j$ computes the activation $v_j^r$ and the output $y_j^r = f(v_j^r)$.)
Training and cost function
Desired network output: $x(i) \rightarrow y(i)$
Trained network output: $x(i) \rightarrow \hat{y}(i)$

Cost function for one training sample:

$E(i) = \dfrac{1}{2} \sum_{m=1}^{k_L} (y_m(i) - \hat{y}_m(i))^2$

Total cost function: $J = \sum_{i=1}^{N} E(i)$

Goal of the training: find the values of $w_{jk}^r$ which minimize the cost function $J$.
Gradient descent
Denote: $w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{j k_{r-1}}^r]^T$

Gradient descent: $w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial J}{\partial w_j^r}$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial E(i)}{\partial w_j^r}$
Gradient descent
Chain rule for differentiating composite functions:

$\dfrac{\partial E(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)} \dfrac{\partial v_j^r(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)}\, y^{r-1}(i)$

Denote: $\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$
Backpropagation
If r = L, then

$\delta_j^L(i) = \dfrac{\partial E(i)}{\partial v_j^L(i)} = \dfrac{\partial}{\partial v_j^L(i)} \left[ \dfrac{1}{2} \sum_{m=1}^{k_L} \big(f(v_m^L(i)) - y_m(i)\big)^2 \right]$
$= \big(f(v_j^L(i)) - y_j(i)\big)\, f'(v_j^L(i)) = e_j(i)\, f'(v_j^L(i))$

If r < L, then

$\delta_j^{r-1}(i) = \dfrac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \dfrac{\partial E(i)}{\partial v_k^r(i)} \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
$= \sum_{k=1}^{k_r} \delta_k^r(i)\, \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r\, f'(v_j^{r-1}(i))$
Backpropagation algorithm

• Initialization: initialize all weights with random values.

• Forward computations: for each training vector x(i) compute all $v_j^r(i)$, $y_j^r(i)$.

• Backward computations: for each i, j and r = L, L−1, …, 2 compute $\delta_j^{r-1}(i)$.

• Update weights (see the sketch below):

$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial E(i)}{\partial w_j^r} = w_j^r(\text{old}) - \rho\, \delta_j^r(i)\, y^{r-1}(i)$
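A compact Python sketch of the algorithm for a network with one hidden layer and logistic activations; the layer sizes, learning rate, and toy data are illustrative choices, not from the lecture:

```python
import numpy as np

# Per-sample backpropagation, following the slides:
#   output-layer delta:  (f(v) - y) * f'(v)
#   hidden-layer delta:  (W^T delta_next) * f'(v)
#   weight update:       W <- W - rho * outer(delta, y_prev)
def f(v):                            # logistic threshold function
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def train(samples, targets, n_hidden=4, rho=0.5, epochs=1000):
    n_in, n_out = samples.shape[1], targets.shape[1]
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.5, (n_hidden, n_in + 1))    # +1 column for the bias
    W2 = rng.normal(0, 0.5, (n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            # forward computations
            x_ext = np.append(x, 1.0)
            v1 = W1 @ x_ext
            y1 = np.append(f(v1), 1.0)
            v2 = W2 @ y1
            out = f(v2)
            # backward computations
            d2 = (out - y) * f_prime(v2)              # output-layer delta
            d1 = (W2[:, :-1].T @ d2) * f_prime(v1)    # hidden-layer delta
            # weight updates
            W2 -= rho * np.outer(d2, y1)
            W1 -= rho * np.outer(d1, x_ext)
    return W1, W2

# XOR-style toy data just to exercise the code.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train(X, Y)
```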
MLP issues
• What is the best network configuration?
• How to choose a proper learning parameter $\rho$?
• When should training be stopped?
• Should we choose another threshold function f or cost function J?
