Lec 7


Non-Bayes classifiers.

Linear discriminants,
neural networks.
Discriminant functions (1)
Bayes classification rule:

$P(\omega_1 \mid x) - P(\omega_2 \mid x) > 0 \;?\; \omega_1 : \omega_2$

Instead we might try to find a function:

$f_{\omega_1,\omega_2}(x) > 0 \;?\; \omega_1 : \omega_2$

$f_{\omega_1,\omega_2}(x)$ is called a discriminant function.

$\{x \mid f_{\omega_1,\omega_2}(x) = 0\}$ is the decision surface.
Discriminant functions (2)
(Figure: two example datasets with Class 1 and Class 2 samples separated by decision surfaces.)

Linear discriminant function:

$f_{\omega_1,\omega_2}(x) = w^T x + w_0$

Decision surface is a hyperplane $w^T x + w_0 = 0$.
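A minimal Python sketch of this decision rule; the weight vector and bias below are made-up values for illustration:

```python
import numpy as np

# Linear discriminant decision rule: class 1 if w^T x + w0 > 0, else class 2.
w = np.array([2.0, -1.0])   # weight vector (illustrative values)
w0 = 0.5                    # bias term (illustrative value)

def classify(x):
    """Assign class 1 if f(x) = w^T x + w0 > 0, otherwise class 2."""
    return 1 if w @ x + w0 > 0 else 2

print(classify(np.array([1.0, 0.0])))   # f(x) = 2.5 > 0  -> class 1
print(classify(np.array([-1.0, 1.0])))  # f(x) = -2.5 < 0 -> class 2
```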
Linear discriminant – perceptron cost function
Replace $w \leftarrow \begin{bmatrix} w \\ w_0 \end{bmatrix}$ and $x \leftarrow \begin{bmatrix} x \\ 1 \end{bmatrix}$.

Thus the decision function is now $f_{\omega_1,\omega_2}(x) = w^T x$
and the decision surface is $w^T x = 0$.

Perceptron cost function:

$J(w) = \sum_x \delta_x\, w^T x$

where $\delta_x = \begin{cases} -1, & \text{if } x \in \omega_1 \text{ and } x \text{ is misclassified } (w^T x < 0) \\ +1, & \text{if } x \in \omega_2 \text{ and } x \text{ is misclassified } (w^T x > 0) \\ 0, & \text{if } x \text{ is correctly classified} \end{cases}$
Linear discriminant – perceptron cost function
(Figure: Class 1 and Class 2 samples with a linear decision surface; misclassified samples lie on the wrong side.)

Perceptron cost function:

$J(w) = \sum_x \delta_x\, w^T x$

The value of $J(w)$ is proportional to the sum of the distances of all misclassified samples to the decision surface.

If the discriminant function separates the classes perfectly, then $J(w) = 0$.

Otherwise, $J(w) > 0$ and we want to minimize it.

$J(w)$ is continuous and piecewise linear, so we might try to use a gradient descent algorithm.
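Before turning to the algorithm, a small Python sketch of evaluating this cost; it assumes the samples are already in extended form (a 1 appended to each x, $w_0$ appended to w), and the function name and inputs are illustrative:

```python
import numpy as np

# Perceptron cost J(w) = sum over misclassified x of delta_x * w^T x.
def perceptron_cost(w, X1, X2):
    """X1, X2: 2-D arrays of extended samples from class 1 and class 2."""
    cost = 0.0
    for x in X1:                     # class 1: misclassified if w^T x < 0
        if w @ x < 0:
            cost += -1.0 * (w @ x)   # delta_x = -1, contribution is positive
    for x in X2:                     # class 2: misclassified if w^T x > 0
        if w @ x > 0:
            cost += +1.0 * (w @ x)   # delta_x = +1
    return cost                      # zero iff every sample is on the right side
```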
Linear discriminant – Perceptron algorithm
Gradient descent:

$w(t+1) = w(t) - \rho_t \left.\dfrac{\partial J(w)}{\partial w}\right|_{w = w(t)}$

At points where $J(w)$ is differentiable, $\dfrac{\partial J(w)}{\partial w} = \sum_{\text{misclassified } x} \delta_x x$

Thus $w(t+1) = w(t) - \rho_t \sum_{\text{misclassified } x} \delta_x x$

The perceptron algorithm converges when the classes are linearly separable, with some conditions on $\rho_t$.
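A sketch of the resulting training loop in Python, assuming a constant step size $\rho$ and samples already in extended form; the names and defaults are illustrative:

```python
import numpy as np

# Perceptron training: gradient descent on the perceptron cost.
# w(t+1) = w(t) - rho * sum over misclassified x of delta_x * x
def train_perceptron(X1, X2, rho=0.1, max_iter=1000):
    dim = X1.shape[1]
    w = np.zeros(dim)
    for _ in range(max_iter):
        grad = np.zeros(dim)
        misclassified = 0
        for x in X1:                 # class 1 misclassified when w^T x <= 0
            if w @ x <= 0:
                grad += -x           # delta_x = -1
                misclassified += 1
        for x in X2:                 # class 2 misclassified when w^T x >= 0
            if w @ x >= 0:
                grad += x            # delta_x = +1
                misclassified += 1
        if misclassified == 0:       # classes separated: algorithm has converged
            break
        w = w - rho * grad
    return w
```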
Sum of error squares estimation
Let $y(x) = \pm 1$ denote the desired output function: $+1$ for one class and $-1$ for the other.

We want to find a discriminant function $f_{\omega_1,\omega_2}(x) = w^T x$ whose output is similar to $y(x)$.

Use the sum of error squares as the similarity criterion:

$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$

$\hat{w} = \arg\min_w J(w)$
Sum of error squares estimation
Minimize the mean square error:

$\dfrac{\partial J(w)}{\partial w} = -2 \sum_{i=1}^{N} x_i (y_i - w^T x_i) = 0$

$\left( \sum_{i=1}^{N} x_i x_i^T \right) \hat{w} = \sum_{i=1}^{N} x_i y_i$

Thus

$\hat{w} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \left( \sum_{i=1}^{N} x_i y_i \right)$
Neurons
Artificial neuron.
(Figure: inputs $x_1, x_2, \ldots, x_l$ with weights $w_1, w_2, \ldots, w_l$ and bias $w_0$ feeding a summation followed by a threshold function $f$.)

The figure above represents an artificial neuron calculating:

$y = f\left( \sum_{i=1}^{l} w_i x_i + w_0 \right)$
Artificial neuron.
Threshold functions f:

Step function:
$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$

Logistic function:
$f(x) = \dfrac{1}{1 + e^{-ax}}$
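A minimal Python sketch of a single neuron with these two threshold functions; the weights and inputs are arbitrary example values:

```python
import numpy as np

# One artificial neuron: y = f(sum_i w_i x_i + w0).
def step(v):
    return 1.0 if v >= 0 else 0.0

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def neuron(x, w, w0, f):
    return f(w @ x + w0)

x = np.array([0.5, -1.0, 2.0])       # example inputs
w = np.array([1.0, 0.2, -0.3])       # example weights
print(neuron(x, w, 0.1, step))       # hard threshold output (0 or 1)
print(neuron(x, w, 0.1, logistic))   # smooth output in (0, 1)
```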
Combining artificial neurons
(Figure: inputs $x_1, x_2, \ldots, x_l$ feed a network of artificial neurons arranged in layers.)

Multilayer perceptron with 3 layers.
Discriminating ability of multilayer perceptron
Since a 3-layer perceptron can approximate any smooth function, it can approximate
$F(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$,
the optimal discriminant function of the two classes.
Training of multilayer perceptron
(Figure: outputs $y_k^{r-1}$ of the neurons in layer r−1 feed neuron $j$ of layer $r$ through weights $w_{jk}^r$; neuron $j$ computes the activation $v_j^r$ and the output $y_j^r = f(v_j^r)$.)
Training and cost function
Desired network output: $x(i) \rightarrow y(i)$
Trained network output: $x(i) \rightarrow \hat{y}(i)$

Cost function for one training sample:

$E(i) = \dfrac{1}{2} \sum_{m=1}^{k_L} (y_m(i) - \hat{y}_m(i))^2$

Total cost function: $J = \sum_{i=1}^{N} E(i)$

Goal of the training: find the values of $w_{jk}^r$ which minimize the cost function $J$.
Gradient descent
Denote: $w_j^r = [w_{j0}^r, w_{j1}^r, \ldots, w_{j k_{r-1}}^r]^T$

Gradient descent: $w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial J}{\partial w_j^r}$

Since $J = \sum_{i=1}^{N} E(i)$, we might want to update the weights after processing each training sample separately:

$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial E(i)}{\partial w_j^r}$
Gradient descent
Chain rule for differentiating composite functions:

$\dfrac{\partial E(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)} \dfrac{\partial v_j^r(i)}{\partial w_j^r} = \dfrac{\partial E(i)}{\partial v_j^r(i)}\, y^{r-1}(i)$

Denote: $\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$
Backpropagation
If r = L, then

$\delta_j^L(i) = \dfrac{\partial E(i)}{\partial v_j^L(i)} = \dfrac{\partial}{\partial v_j^L(i)} \left[ \dfrac{1}{2} \sum_{m=1}^{k_L} \big(f(v_m^L(i)) - y_m(i)\big)^2 \right]$
$= \big(f(v_j^L(i)) - y_j(i)\big)\, f'(v_j^L(i)) = e_j(i)\, f'(v_j^L(i))$

If r < L, then

$\delta_j^{r-1}(i) = \dfrac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \dfrac{\partial E(i)}{\partial v_k^r(i)} \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
$= \sum_{k=1}^{k_r} \delta_k^r(i)\, \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r\, f'(v_j^{r-1}(i))$
Backpropagation algorithm

• Initialization: initialize all weights with random values.

• Forward computations: for each training vector x(i) compute all $v_j^r(i)$, $y_j^r(i)$.

• Backward computations: for each i, j and r = L, L−1, …, 2 compute $\delta_j^{r-1}(i)$.

• Update weights (see the sketch below):

$w_j^r(\text{new}) = w_j^r(\text{old}) - \rho \dfrac{\partial E(i)}{\partial w_j^r} = w_j^r(\text{old}) - \rho\, \delta_j^r(i)\, y^{r-1}(i)$
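A compact Python sketch of the algorithm for a network with one hidden layer and logistic activations; the layer sizes, learning rate, and toy data are illustrative choices, not from the lecture:

```python
import numpy as np

# Per-sample backpropagation, following the slides:
#   output-layer delta:  (f(v) - y) * f'(v)
#   hidden-layer delta:  (W^T delta_next) * f'(v)
#   weight update:       W <- W - rho * outer(delta, y_prev)
def f(v):                            # logistic threshold function
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def train(samples, targets, n_hidden=4, rho=0.5, epochs=1000):
    n_in, n_out = samples.shape[1], targets.shape[1]
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.5, (n_hidden, n_in + 1))    # +1 column for the bias
    W2 = rng.normal(0, 0.5, (n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            # forward computations
            x_ext = np.append(x, 1.0)
            v1 = W1 @ x_ext
            y1 = np.append(f(v1), 1.0)
            v2 = W2 @ y1
            out = f(v2)
            # backward computations
            d2 = (out - y) * f_prime(v2)              # output-layer delta
            d1 = (W2[:, :-1].T @ d2) * f_prime(v1)    # hidden-layer delta
            # weight updates
            W2 -= rho * np.outer(d2, y1)
            W1 -= rho * np.outer(d1, x_ext)
    return W1, W2

# XOR-style toy data just to exercise the code.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train(X, Y)
```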
MLP issues
• What is the best network configuration?
• How to choose a proper learning parameter $\rho$?
• When should training be stopped?
• Should we choose another threshold function f or cost function J?
