Linear Discriminant Functions: CS479/679 Pattern Recognition, Dr. George Bebis
g(x) = w^t x + w_0 = \sum_{i=1}^{d} w_i x_i + w_0

where w is the weight vector and w_0 is the bias (threshold weight).
J(w, w_0) = \frac{1}{n} \sum_{k=1}^{n} [z_k - g(x_k)]^2

class labels: z_k = +1 if x_k \in \omega_1, z_k = -1 if x_k \in \omega_2
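To make the criterion concrete, here is a minimal sketch (not from the original slides) that fits (w, w_0) by minimizing this squared-error criterion with a least-squares solve on synthetic 2-D data; all variable names and data are illustrative.

```python
import numpy as np

# Synthetic 2-D data: class w1 labeled +1, class w2 labeled -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)),     # class w1
               rng.normal([-2, -2], 1.0, (50, 2))])  # class w2
z = np.hstack([np.ones(50), -np.ones(50)])           # class labels z_k

# Augment with a constant 1 so w_0 is learned jointly with w
Xa = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of J(w, w0) = (1/n) sum_k [z_k - g(x_k)]^2
w_aug, *_ = np.linalg.lstsq(Xa, z, rcond=None)
w0, w = w_aug[0], w_aug[1:]

# Linear discriminant g(x) = w^t x + w0; sign(g) gives the class
g = X @ w + w0
print("training accuracy:", np.mean(np.sign(g) == z))
```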
Geometric Interpretation of g(x)
Any x can be written as:

x = x_p + r \frac{w}{||w||}

where x_p is the projection of x onto the hyperplane, r is the signed distance from x to the hyperplane, and \frac{w}{||w||} is the unit vector in the direction of w.
Geometric Interpretation of g(x) (cont'd)
Substituting x = x_p + r \frac{w}{||w||} into g(x):

g(x) = w^t x + w_0 = w^t ( x_p + r \frac{w}{||w||} ) + w_0 = w^t x_p + w_0 + r \frac{w^t w}{||w||} = r ||w||

since w^t x_p + w_0 = g(x_p) = 0 (x_p lies on the hyperplane) and w^t w = ||w||^2.

Therefore, the signed distance from x to the hyperplane is:

r = \frac{g(x)}{||w||}

Setting x = 0 gives the distance from the origin to the hyperplane:

r = \frac{w_0}{||w||}
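A small sketch (illustrative names and values, not from the slides) that verifies this geometry numerically: it computes r = g(x)/||w|| and checks that x_p = x - r w/||w|| lies on the hyperplane.

```python
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (example values)
w0 = 3.0                    # bias
x = np.array([1.5, 4.0])    # an arbitrary point

g = w @ x + w0              # g(x) = w^t x + w0
r = g / np.linalg.norm(w)   # signed distance from x to the hyperplane
x_p = x - r * w / np.linalg.norm(w)   # projection of x onto the hyperplane

print("r =", r)
print("g(x_p) =", w @ x_p + w0)       # ~0: x_p lies on g(x) = 0
```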
Linear Discriminant Functions: multi-category case
There are several ways to devise multi-category classifiers using linear discriminant functions:

(1) One against the rest (i.e., c two-class problems)
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont'd)
(2) One against another (i.e., c(c-1)/2 pairs of classes)
problem: ambiguous regions
Linear Discriminant Functions: multi-category case (cont'd)
To avoid the problem of ambiguous regions:
Define c linear discriminant functions g_i(x) = w_i^t x + w_{i0}, i = 1, ..., c.
Assign x to \omega_i if g_i(x) > g_j(x) for all j \neq i.
The resulting classifier is called a linear machine (see Chapter 2).
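A minimal sketch of a linear machine (the names W, b and the random parameters are illustrative): each class has its own linear discriminant, and x is assigned to the class whose discriminant is largest.

```python
import numpy as np

c, d = 3, 2                      # number of classes, feature dimension
rng = np.random.default_rng(1)
W = rng.normal(size=(c, d))      # weight vector w_i for each class (row i)
b = rng.normal(size=c)           # bias w_i0 for each class

def linear_machine(x):
    """Assign x to the class i maximizing g_i(x) = w_i^t x + w_i0."""
    g = W @ x + b                # all c discriminants at once
    return int(np.argmax(g))     # index of the largest g_i(x)

x = np.array([0.5, -1.0])
print("assigned class:", linear_machine(x))
```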
Linear Discriminant Functions: multi-category case (cont'd)
A linear machine divides the feature space into c convex decision regions.
If x is in region R_i, then g_i(x) is the largest of the c discriminants.
The boundary between two adjacent regions R_i and R_j is a portion of the hyperplane H_ij defined by g_i(x) = g_j(x), and the distance from x to H_ij is (g_i(x) - g_j(x)) / ||w_i - w_j||.
Higher Order Discriminant Functions
e.g., the quadratic discriminant function:

g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j
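As an illustration (an assumed instance matching the quadratic form above; all coefficient values are made up), a sketch that evaluates g(x) with both linear and quadratic terms:

```python
import numpy as np

d = 2
rng = np.random.default_rng(2)
w0 = 0.5
w = rng.normal(size=d)        # linear coefficients w_i
W2 = rng.normal(size=(d, d))  # quadratic coefficients w_ij

def g_quadratic(x):
    """g(x) = w0 + sum_i w_i x_i + sum_ij w_ij x_i x_j."""
    return w0 + w @ x + x @ W2 @ x

x = np.array([1.0, -2.0])
print("g(x) =", g_quadratic(x))
```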
Linear Discriminants Revisited
Augmented feature/parameter space:
g(x) = w^t x + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \alpha^t y   (with x_0 = 1)

where the augmented parameter and feature vectors are:

\alpha = [w_0, w_1, ..., w_d]^t,   y = [x_0, x_1, ..., x_d]^t = [1, x_1, ..., x_d]^t
Linear Discriminants Revisited (cont'd)
Discriminant: g(x) = \alpha^t y

Classification rule:
If \alpha^t y_i > 0, assign y_i to \omega_1
else if \alpha^t y_i < 0, assign y_i to \omega_2
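A small sketch (illustrative data and parameters) of the augmented representation: prepend x_0 = 1 to each sample and classify by the sign of \alpha^t y.

```python
import numpy as np

alpha = np.array([-1.0, 2.0, 0.5])      # [w0, w1, w2]: augmented parameter vector
X = np.array([[1.0, 3.0],
              [-2.0, 0.5]])             # original d-dimensional samples

Y = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented: y = [1, x1, ..., xd]

g = Y @ alpha                            # g(x) = alpha^t y for every sample
labels = np.where(g > 0, "w1", "w2")     # sign decides the class
print(g, labels)
```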
Generalized Discriminants
First, map the data to a space of higher dimensionality.
non-linearly separable → linearly separable
In the new space the discriminant is again linear: g(x) = \alpha^t y
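A sketch of this idea under an assumed quadratic mapping (the choice φ(x) = [1, x1, x2, x1·x2] is illustrative, not from the slides): XOR-like data that no line separates in 2-D becomes linearly separable after the mapping.

```python
import numpy as np

# XOR-like data: not linearly separable in the original 2-D space
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
z = np.array([1, 1, -1, -1])             # class labels (+1 = w1, -1 = w2)

def phi(x):
    """Assumed mapping to a higher-dimensional space: y = [1, x1, x2, x1*x2]."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Y = np.array([phi(x) for x in X])

# In the mapped space, alpha = [0, 0, 0, 1] separates the classes:
alpha = np.array([0.0, 0.0, 0.0, 1.0])
print(np.sign(Y @ alpha) == z)           # all True: g(x) = alpha^t y separates them
```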
Gradient Descent
Minimize J(\alpha) iteratively:

\alpha(k+1) = \alpha(k) + \eta(k) p_k

where \alpha(k) is the estimate at the k-th iteration, \alpha(k+1) is the estimate at the (k+1)-th iteration, p_k = -\nabla J(\alpha(k)) is the search direction, and \eta(k) is the learning rate (search step).
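A minimal gradient-descent sketch (the quadratic criterion and all constants A, b, eta are illustrative, not from the slides):

```python
import numpy as np

# Illustrative quadratic criterion J(alpha) = 0.5 * alpha^t A alpha - b^t alpha
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_J(alpha):
    return A @ alpha - b          # gradient of the quadratic criterion

alpha = np.zeros(2)               # alpha(0): initial estimate
eta = 0.1                         # learning rate (search step)
for k in range(200):
    p = -grad_J(alpha)            # search direction p_k = -grad J(alpha(k))
    alpha = alpha + eta * p       # alpha(k+1) = alpha(k) + eta(k) * p_k

print("alpha* =", alpha)
print("closed-form minimum:", np.linalg.solve(A, b))   # should match
```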
Gradient Descent (cont'd)
[Figure: contours of J(\alpha) over the search space, showing the trajectory of the estimates from \alpha(0) toward \alpha(k).]
Gradient Descent (cont'd)
What is the effect of the learning rate?
[Figure: gradient-descent trajectories on the contours of J(\alpha) for learning rates \eta = 0.37 and \eta = 0.39.]
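To illustrate the point numerically (a sketch; the 1-D criterion J(\alpha) = \alpha^2 and the two rates are illustrative, not the values in the figure): a small learning rate shrinks the error every step, while a rate above the stability threshold makes the iterates diverge.

```python
# Effect of the learning rate on gradient descent for J(alpha) = alpha^2
# (gradient is 2*alpha; updates are stable here only for eta < 1.0)
def descend(eta, steps=20, alpha=1.0):
    for _ in range(steps):
        alpha = alpha - eta * 2 * alpha   # alpha(k+1) = alpha(k) - eta * dJ/dalpha
    return alpha

print("small eta (0.1):     ", descend(0.1))   # converges toward 0
print("too-large eta (1.1): ", descend(1.1))   # diverges (|alpha| grows)
```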
Newton's Method
Use second-order information about J(\alpha):

\alpha(k+1) = \alpha(k) - H^{-1} \nabla J(\alpha(k))

where H is the Hessian matrix of J(\alpha); each iteration requires inverting H.

Newton's Method (cont'd)
If J(\alpha) is quadratic, Newton's method converges in one step!
[Figure: contours of a quadratic J(\alpha), showing the single Newton step from \alpha(0) to the minimum.]
Gradient Descent vs. Newton's Method
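As a rough numerical comparison (the quadratic criterion and all constants are illustrative): on a quadratic J, Newton's method with H = A reaches the minimum in a single step, while gradient descent needs many iterations; the price is computing and inverting the Hessian.

```python
import numpy as np

# Quadratic criterion J(alpha) = 0.5 * alpha^t A alpha - b^t alpha
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # Hessian H of this J
b = np.array([1.0, -2.0])
grad = lambda a: A @ a - b

# Newton's method: a single step from alpha(0) = 0
alpha_newton = np.zeros(2) - np.linalg.inv(A) @ grad(np.zeros(2))

# Gradient descent: many small steps
alpha_gd = np.zeros(2)
for _ in range(50):
    alpha_gd = alpha_gd - 0.1 * grad(alpha_gd)

print("Newton (1 step):      ", alpha_newton)
print("Gradient descent (50):", alpha_gd)
print("true minimum:         ", np.linalg.solve(A, b))
```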
Dual Problem
Classification rule:
If \alpha^t y_i > 0, assign y_i to \omega_1
else if \alpha^t y_i < 0, assign y_i to \omega_2

If y_i \in \omega_2, replace y_i by -y_i.
The problem then becomes: find \alpha such that \alpha^t y_i > 0 for all i.
Perceptron rule
Apply gradient descent to the perceptron criterion J_p(\alpha) = \sum_{y \in Y(\alpha)} (-\alpha^t y), where Y(\alpha) is the set of examples misclassified by \alpha:

\alpha(k+1) = \alpha(k) - \eta(k) \nabla J_p(\alpha(k))

or

\alpha(k+1) = \alpha(k) + \eta(k) \sum_{y \in Y(\alpha(k))} y
Perceptron rule (cont'd)
(Note: replace a with \alpha and Y_k with Y(\alpha), the set of misclassified examples.)
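A minimal sketch of the batch perceptron rule under these conventions (the synthetic data and \eta = 1 are illustrative): samples are augmented, \omega_2 samples are negated, and \alpha is updated with the sum of the currently misclassified examples.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal([2, 2], 0.5, (20, 2))     # class w1
X2 = rng.normal([-2, -2], 0.5, (20, 2))   # class w2

# Augment with x0 = 1 and negate the w2 samples (the "dual problem" trick)
Y = np.vstack([np.hstack([np.ones((20, 1)), X1]),
               -np.hstack([np.ones((20, 1)), X2])])

alpha = np.zeros(3)
eta = 1.0
for k in range(100):
    misclassified = Y[Y @ alpha <= 0]      # Y(alpha): examples with alpha^t y <= 0
    if len(misclassified) == 0:
        break                              # all alpha^t y_i > 0: done
    alpha = alpha + eta * misclassified.sum(axis=0)   # batch perceptron update

print("alpha =", alpha, "after", k + 1, "iterations")
```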
Perceptron rule (cont'd)
Example:
[Figure: perceptron updates shown in the (a1, a2) parameter space.]
Perceptron rule (cont'd)
Single-sample version: consider the misclassified examples one at a time, using \eta(k) = 1:

\alpha(k+1) = \alpha(k) + y^k

order of examples: y_2, y_3, y_1, y_3
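A sketch of the single-sample rule (the hand-picked augmented, sign-normalized samples in Y are illustrative): whenever the current example is misclassified, add it to \alpha.

```python
import numpy as np

# Y: augmented samples with w2 examples already negated (illustrative values)
Y = np.array([[1.0, 2.0, 1.5],
              [1.0, 1.0, 2.5],
              [-1.0, 1.5, 1.0],
              [-1.0, 2.5, 0.5]])

alpha = np.zeros(3)
changed = True
while changed:                      # repeat passes until nothing is misclassified
    changed = False
    for y in Y:                     # one example at a time, eta(k) = 1
        if alpha @ y <= 0:          # misclassified: alpha^t y <= 0
            alpha = alpha + y       # alpha(k+1) = alpha(k) + y^k
            changed = True

print("alpha =", alpha)
```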
The batch algorithm leads to a smoother trajectory in solution space.