
Linear Discriminant Functions

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Generative vs Discriminant Approach

Generative approaches estimate the discriminant function by first estimating the probability distribution of the patterns belonging to each class.

Discriminant approaches estimate the discriminant function explicitly, without assuming a probability distribution.
Generative Approach
(two categories)
More common to use a single discriminant function (dichotomizer) instead of two. Examples:

g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})

g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}

If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
Discriminant Approach
(two categories)
Specify the parametric form of the discriminant function, for example, a linear discriminant:

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0

Decide ω₁ if g(x) > 0 and ω₂ if g(x) < 0.

If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
Discriminant Approach (contd)
(two categories)
Find the best decision boundary (i.e., estimate w
and w0) using a set of training examples xk.
Discriminant Approach (contd)
(two categories)
The solution can be found by minimizing an error function (e.g., the training error or empirical risk):

J(\mathbf{w}, w_0) = \frac{1}{n} \sum_{k=1}^{n} \left[ z_k - g(\mathbf{x}_k) \right]^2

where z_k is the correct class label and g(x_k) is the predicted one, with class labels defined as:

z_k = +1 if x_k ∈ ω₁,  z_k = -1 if x_k ∈ ω₂

Learning algorithms can be applied to find the solution.
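
For illustration (not part of the original slides), this squared-error criterion can be minimized in closed form with the pseudoinverse of the augmented data matrix. A minimal NumPy sketch, with made-up function names and toy data:

import numpy as np

def fit_linear_discriminant_lsq(X, z):
    """Minimize J(w, w0) = (1/n) * sum_k (z_k - g(x_k))^2 in closed form.

    X: (n, d) training patterns, z: (n,) labels in {+1, -1}.
    Returns (w, w0) such that g(x) = w @ x + w0.
    """
    n = X.shape[0]
    Xa = np.hstack([np.ones((n, 1)), X])   # prepend x0 = 1 (augmented patterns)
    a = np.linalg.pinv(Xa) @ z             # least-squares solution for (w0, w)
    return a[1:], a[0]

# toy usage
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = fit_linear_discriminant_lsq(X, z)
print(np.sign(X @ w + w0))                 # predicted classes for the training patterns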
Linear Discriminants
(two categories)
A linear discriminant has the following form:

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0

The decision boundary (g(x) = 0) is a hyperplane, where the orientation of the hyperplane is determined by w and its location by w₀.

w is the normal to the hyperplane.
If w₀ = 0, the hyperplane passes through the origin.
Geometric Interpretation of g(x)

g(x) provides an algebraic measure of the distance of x from the hyperplane.

x can be expressed as follows:

\mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|}

where x_p is the projection of x onto the hyperplane and w/||w|| is the unit vector in the direction of r.
Geometric Interpretation of g(x) (contd)
Substitute x in g(x):

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0
= \mathbf{w}^t \mathbf{x}_p + w_0 + r \frac{\mathbf{w}^t \mathbf{w}}{\|\mathbf{w}\|} = r \|\mathbf{w}\|

since \mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2 and \mathbf{w}^t \mathbf{x}_p + w_0 = 0.
Geometric Interpretation of g(x) (contd)

The distance of x from the hyperplane is given by:

r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}

Setting x = 0 gives the distance of the origin from the hyperplane:

r = \frac{w_0}{\|\mathbf{w}\|}
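
A minimal sketch of this distance computation (toy w and w₀ chosen so that ||w|| = 5; names are my own):

import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| of x from the hyperplane g(x) = w @ x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # origin: r = w0/||w|| = -1.0
print(signed_distance(np.array([1.0, 0.5]), w, w0))   # 0.0 -> the point lies on the boundary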
Linear Discriminant Functions:
multi-category case
There are several ways to devise multi-category
classifiers using linear discriminant functions:

(1) One against the rest. Problem: ambiguous regions.
Linear Discriminant Functions:
multi-category case (contd)
(2) One against another (i.e., c(c-1)/2 pairs of classes). Problem: ambiguous regions.
Linear Discriminant Functions:
multi-category case (contd)
To avoid the problem of ambiguous regions:
Define c linear discriminant functions
Assign x to ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
The resulting classifier is called a linear machine

(see Chapter 2)
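
A minimal sketch of the linear machine decision rule (the weight matrix W with one row per class, the bias vector w0, and the toy numbers are made up for the example):

import numpy as np

def linear_machine_predict(x, W, w0):
    """Assign x to the class i with the largest g_i(x) = W[i] @ x + w0[i].

    W: (c, d) weight vectors, w0: (c,) biases.  Returns the winning class index.
    """
    g = W @ x + w0            # all c discriminant values
    return int(np.argmax(g))  # ties go to the lowest index

# toy usage with c = 3 classes in d = 2 dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine_predict(np.array([2.0, 0.3]), W, w0))   # -> 0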
Linear Discriminant Functions:
multi-category case (contd)
A linear machine divides the feature space into c convex decision regions.
If x is in region Rᵢ, then gᵢ(x) is the largest discriminant value.

Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.
Linear Discriminant Functions:
multi-category case (contd)
The decision boundary between adjacent
regions Ri and Rj is a portion of the hyperplane
Hij given by:
g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad \text{or} \quad g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0
\quad \text{or} \quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0

(wᵢ - wⱼ) is normal to Hᵢⱼ, and the signed distance from x to Hᵢⱼ is:

r = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}
Higher Order Discriminant Functions

Can produce more complicated decision boundaries than linear discriminant functions.
Linear Discriminants Revisited
Augmented feature/parameter space (x₀ = 1):

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \boldsymbol{\alpha}^t \mathbf{y}

\boldsymbol{\alpha} = (w_0, w_1, \ldots, w_d)^t, \qquad \mathbf{y} = (x_0, x_1, \ldots, x_d)^t = (1, x_1, \ldots, x_d)^t
Linear Discriminants Revisited (contd)

Discriminant:

g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

Classification rule:

If αᵗyᵢ > 0, assign yᵢ to ω₁;
else if αᵗyᵢ < 0, assign yᵢ to ω₂.
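
A minimal sketch of the augmented representation and the classification rule above (function names and toy numbers are made up):

import numpy as np

def augment(x):
    """Map x = (x1, ..., xd) to y = (1, x1, ..., xd) so that g(x) = alpha @ y."""
    return np.concatenate(([1.0], x))

def classify(alpha, x):
    """Return 1 (omega_1) if alpha @ y > 0, 2 (omega_2) if alpha @ y < 0, 0 on the boundary."""
    g = alpha @ augment(x)
    return 1 if g > 0 else (2 if g < 0 else 0)

alpha = np.array([-1.0, 2.0, 0.5])             # alpha = (w0, w1, w2)
print(classify(alpha, np.array([1.0, 1.0])))   # g = -1 + 2 + 0.5 = 1.5 -> class 1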
Generalized Discriminants
First, map the data to a space of higher dimensionality:

Non-linearly separable → linearly separable

This can be accomplished using special transformation functions (φ functions), which map a point x = (x₁, ..., x_d)ᵗ from the d-dimensional space to a point y = (y₁(x), ..., y_d̂(x))ᵗ in a d̂-dimensional space (where d̂ >> d).
Generalized Discriminants (contd)
A generalized discriminant is a linear discriminant in the d̂-dimensional space:

g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

It separates points in the d̂-space by a hyperplane passing through the origin.
Example
Suppose we want g(x) > 0 if x < -1 or x > 0.5.
The corresponding decision regions R₁, R₂ in the d-space are not simply connected (not linearly separable).

Consider the following mapping (φ) functions:

y_1(x) = 1, \quad y_2(x) = x, \quad y_3(x) = x^2 \quad \Rightarrow \quad \mathbf{y} = (1, x, x^2)^t

Discriminant:

g(x) = \boldsymbol{\alpha}^t \mathbf{y} = -1 + x + 2x^2, \quad \text{i.e.,} \quad \boldsymbol{\alpha} = (-1, 1, 2)^t

Here d = 1 and d̂ = 3.
Example (contd)

The mapping y = (1, x, x²)ᵗ maps the line (d-space) onto a parabola in the d̂-space.

The problem has now become linearly separable!

The plane αᵗy = 0 divides the d̂-space into two decision regions, which correspond to the regions R₁, R₂ in the original d-space.
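
A minimal sketch of this example, using the mapping y = (1, x, x²)ᵗ and α = (-1, 1, 2)ᵗ read off from g(x) = -1 + x + 2x²:

import numpy as np

def phi(x):
    """Map the scalar x to y = (y1, y2, y3) = (1, x, x**2)."""
    return np.array([1.0, x, x * x])

alpha = np.array([-1.0, 1.0, 2.0])     # so that alpha @ phi(x) = -1 + x + 2x^2 = g(x)

for x in [-2.0, -1.0, 0.0, 0.5, 1.0]:
    g = alpha @ phi(x)
    label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "boundary")
    print(f"x = {x:+.1f}  g(x) = {g:+.2f}  -> {label}")   # boundaries at x = -1 and x = 0.5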
Learning: linearly separable case
(two categories)

Given a linear discriminant function

g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

the goal is to learn the parameters (weights) α from a set of n labeled samples yᵢ, where each yᵢ has a class label ω₁ or ω₂.
Learning: effect of training examples

Every training sample yᵢ places a constraint on the weight vector α; let's see how.

Case 1: visualize the solution in the parameter space (α₁, α₂):

αᵗyᵢ = 0 defines a hyperplane in the parameter space with yᵢ being the normal vector.

Given n examples, the solution must lie in the intersection of n half-spaces.

[Figure: parameter space (a₁, a₂) with the hyperplanes defined by a sample yᵢ ∈ ω₁ and a sample yⱼ ∈ ω₂.]
Learning: effect of training examples

Case 2: visualize the solution in the feature space:

αᵗy = 0 defines a hyperplane in the feature space with α being the normal vector.

Given n examples, the solution must lie within a certain region.

[Figure: feature space with samples from ω₁ and ω₂ and the corresponding solution region.]
Uniqueness of Solution

The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, for example:

Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.
Learning Using Iterative
Optimization
Minimize an error function J(α) (e.g., the classification error) with respect to α.

Minimize J(α) iteratively:

\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k

where α(k) is the estimate at the k-th iteration, α(k+1) the estimate at the (k+1)-th iteration, p_k the search direction, and η(k) the learning rate (search step).

How should we choose p_k?
Choosing pk using Gradient Descent
\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))

which gives the update \boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \eta(k)\, \nabla J(\boldsymbol{\alpha}(k)), where η(k) is the learning rate.

(α is denoted by a in the figures)
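
Illustrative sketch of the gradient-descent iteration α(k+1) = α(k) - η(k)∇J(α(k)); the toy quadratic objective, the constant learning rate, and the stopping tolerance are my own choices:

import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, n_iters=100, tol=1e-6):
    """Minimize J(alpha) iteratively: alpha(k+1) = alpha(k) - eta * grad_J(alpha(k))."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_iters):
        g = grad_J(a)
        if np.linalg.norm(g) < tol:          # stop when the gradient (nearly) vanishes
            break
        a = a - eta * g
    return a

# toy quadratic J(alpha) = 0.5 * alpha.T @ H @ alpha, whose gradient is H @ alpha
H = np.array([[2.0, 0.0], [0.0, 10.0]])
print(gradient_descent(lambda a: H @ a, a0=[2.0, 1.0], eta=0.1, n_iters=500))  # -> near [0, 0]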
Gradient Descent (contd)
[Figure: contour plot of J(α) over the search space, showing the trajectory of the estimates starting from α(0).]
Gradient Descent (contd)
What is the effect of the learning rate?

[Figure: contour plots of J(α) for two learning rates. Left, η = 0.37: slow, but converges to the solution. Right, η = 0.39: fast, but overshoots the solution.]
Gradient Descent (contd)
How should we choose the learning rate η(k)?

Approximate J(α) by a second-order Taylor series expansion around α(k) (a = α):

J(\boldsymbol{\alpha}) \approx J(\boldsymbol{\alpha}(k)) + \nabla J^t (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k)) + \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))^t \mathbf{H} (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))

where H is the Hessian (matrix of second derivatives).

Setting α = α(k+1) = α(k) - η(k)∇J and minimizing with respect to η(k) gives the optimum learning rate:

\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H} \nabla J}
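A quick numeric check of the optimal-step formula above (as reconstructed), assuming J is the toy quadratic from the previous sketch:

import numpy as np

def optimal_eta(grad, H):
    """eta(k) = ||grad J||^2 / (grad J.T @ H @ grad J) from the second-order expansion."""
    return float(grad @ grad) / float(grad @ H @ grad)

H = np.array([[2.0, 0.0], [0.0, 10.0]])
a = np.array([2.0, 1.0])
g = H @ a                    # gradient of J(alpha) = 0.5 * alpha.T @ H @ alpha
print(optimal_eta(g, H))     # step size adapted to the local curvature
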
Choosing p_k using Newton's Method

\mathbf{p}_k = -\mathbf{H}^{-1} \nabla J(\boldsymbol{\alpha}(k))

(requires inverting H; a = α in the figures)
Newton's method (contd)

If J(α) is quadratic, Newton's method converges in one step!

[Figure: contours of a quadratic J(α); a single Newton step reaches the minimum.]
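
Illustrative sketch of the Newton step p_k = -H⁻¹∇J, solving the linear system rather than inverting H explicitly; the quadratic J and its Hessian are made up for the demo:

import numpy as np

def newton_step(a, grad_J, hess_J):
    """alpha(k+1) = alpha(k) - H^{-1} grad_J(alpha(k)); uses solve() instead of an explicit inverse."""
    return a - np.linalg.solve(hess_J(a), grad_J(a))

# quadratic J(alpha) = 0.5 * alpha.T @ H @ alpha: one Newton step lands on the minimum
H = np.array([[2.0, 1.0], [1.0, 4.0]])
a = np.array([3.0, -2.0])
print(newton_step(a, lambda a: H @ a, lambda a: H))   # -> [0. 0.] in a single step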
Gradient descent vs. Newton's method
Dual Problem
Classification rule:
If αᵗyᵢ > 0, assign yᵢ to ω₁;
else if αᵗyᵢ < 0, assign yᵢ to ω₂.

Dual formulation: if yᵢ ∈ ω₂, replace yᵢ by -yᵢ; then find α such that αᵗyᵢ > 0 for all i.

Seeking a hyperplane that separates patterns from different categories becomes seeking a hyperplane that puts the normalized patterns on the same (positive) side.
Perceptron rule
Use gradient descent, assuming that the error function to be minimized is:

J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})

where Y(α) is the set of samples misclassified by α.

If Y(α) is empty, J_p(α) = 0; otherwise, J_p(α) > 0.

Goal: find α such that αᵗyᵢ > 0.
Perceptron rule (contd)
The gradient of J_p(α) is:

\nabla J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})

The perceptron update rule is obtained using gradient descent (a = α):

\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k) \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} \mathbf{y}
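
The update just derived can be implemented directly. A minimal NumPy sketch of the batch perceptron, assuming the samples have already been augmented and normalized as in the dual formulation (names and toy data are my own):

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iters=1000):
    """Batch perceptron on normalized augmented samples Y (rows y_i; goal: a @ y_i > 0 for all i).

    Update: a(k+1) = a(k) + eta * (sum of the currently misclassified samples).
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        misclassified = Y[Y @ a <= 0]        # Y(a): samples with a @ y <= 0
        if len(misclassified) == 0:          # J_p(a) = 0 -> solution found
            return a
        a = a + eta * misclassified.sum(axis=0)
    return a

# toy normalized data: the omega_2 samples have already been multiplied by -1
Y = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [-1.0, 1.0, 0.5], [-1.0, 2.0, 1.5]])
a = batch_perceptron(Y)
print(Y @ a > 0)                             # all True if a separating vector was found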
Perceptron rule (contd)

(note: replace a
with and
Yk with Y()) missclassified
examples
Perceptron rule (contd)

Move the hyperplane so that training samples are


on its positive side.

a2

Example:

a1
Perceptron rule (contd)

(k)=1
yk
one example
at a time

Perceptron Convergence Theorem: If training


samples are linearly separable, then the perceptron
algorithm will terminate at a solution vector in a finite
number of steps.
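
A corresponding sketch of the single-sample, fixed-increment variant (η(k) = 1), again assuming normalized augmented samples; under linear separability it stops after a finite number of updates, as the theorem states:

import numpy as np

def fixed_increment_perceptron(Y, max_epochs=100):
    """Single-sample fixed-increment rule: a <- a + y_k whenever a @ y_k <= 0.

    Y holds normalized augmented samples; terminates once a full pass makes no mistakes.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                      # present one example at a time
            if a @ y <= 0:               # misclassified (or on the boundary)
                a = a + y
                errors += 1
        if errors == 0:                  # separable data -> finite number of steps
            return a
    return a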
Perceptron rule (contd)

[Figure: single-sample perceptron updates; order of examples presented: y₂, y₃, y₁, y₃.]

The batch algorithm leads to a smoother trajectory in the solution space.