
Linear Discriminant Functions

Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition


Dr. George Bebis
Generative vs Discriminant Approach

Generative approaches estimate the discriminant function by first estimating the probability distribution of the patterns belonging to each class.

Discriminant approaches estimate the discriminant function explicitly, without assuming a probability distribution.
Generative Approach
(two categories)
More common to use a single discriminant function (dichotomizer) instead of two. Examples:

g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})

g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}

If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
Discriminant Approach
(two categories)
Specify the parametric form of the discriminant function, for example, a linear discriminant:

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0

Decide ω₁ if g(x) > 0 and ω₂ if g(x) < 0.

If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
Discriminant Approach (contd)
(two categories)
Find the best decision boundary (i.e., estimate w
and w0) using a set of training examples xk.
Discriminant Approach (contd)
(two categories)
The solution can be found by minimizing an error function (e.g., the training error or empirical risk):

J(\mathbf{w}, w_0) = \frac{1}{n} \sum_{k=1}^{n} \left[ z_k - g(\mathbf{x}_k) \right]^2

where z_k is the correct class label and g(x_k) is the predicted one, with class labels defined as:

z_k = +1 if x_k ∈ ω₁,  z_k = -1 if x_k ∈ ω₂

Learning algorithms can be applied to find the solution.
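
For illustration (not part of the original slides), this squared-error criterion can be minimized in closed form with the pseudoinverse of the augmented data matrix. A minimal NumPy sketch, with made-up function names and toy data:

import numpy as np

def fit_linear_discriminant_lsq(X, z):
    """Minimize J(w, w0) = (1/n) * sum_k (z_k - g(x_k))^2 in closed form.

    X: (n, d) training patterns, z: (n,) labels in {+1, -1}.
    Returns (w, w0) such that g(x) = w @ x + w0.
    """
    n = X.shape[0]
    Xa = np.hstack([np.ones((n, 1)), X])   # prepend x0 = 1 (augmented patterns)
    a = np.linalg.pinv(Xa) @ z             # least-squares solution for (w0, w)
    return a[1:], a[0]

# toy usage
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
z = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = fit_linear_discriminant_lsq(X, z)
print(np.sign(X @ w + w0))                 # predicted classes for the training patterns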
Linear Discriminants
(two categories)
A linear discriminant has the following form:

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0

The decision boundary (g(x) = 0) is a hyperplane, where the orientation of the hyperplane is determined by w and its location by w₀.

w is the normal to the hyperplane.
If w₀ = 0, the hyperplane passes through the origin.
Geometric Interpretation of g(x)

g(x) provides an algebraic measure of the distance of x from the hyperplane.

x can be expressed as follows:

\mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|}

where x_p is the projection of x onto the hyperplane and w/||w|| is the unit vector in the direction of r.
Geometric Interpretation of g(x) (contd)
Substitute x in g(x):

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \mathbf{w}^t \left( \mathbf{x}_p + r \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + w_0
= \mathbf{w}^t \mathbf{x}_p + w_0 + r \frac{\mathbf{w}^t \mathbf{w}}{\|\mathbf{w}\|} = r \|\mathbf{w}\|

since \mathbf{w}^t \mathbf{w} = \|\mathbf{w}\|^2 and \mathbf{w}^t \mathbf{x}_p + w_0 = 0.
Geometric Interpretation of g(x) (contd)

The distance of x from the hyperplane is given by:

r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|}

Setting x = 0 gives the distance of the origin from the hyperplane:

r = \frac{w_0}{\|\mathbf{w}\|}
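
A minimal sketch of this distance computation (toy w and w₀ chosen so that ||w|| = 5; names are my own):

import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| of x from the hyperplane g(x) = w @ x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # origin: r = w0/||w|| = -1.0
print(signed_distance(np.array([1.0, 0.5]), w, w0))   # 0.0 -> the point lies on the boundary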
Linear Discriminant Functions:
multi-category case
There are several ways to devise multi-category
classifiers using linear discriminant functions:

(1) One against the rest. Problem: ambiguous regions.
Linear Discriminant Functions:
multi-category case (contd)
(2) One against another (i.e., c(c-1)/2 pairs of classes). Problem: ambiguous regions.
Linear Discriminant Functions:
multi-category case (contd)
To avoid the problem of ambiguous regions:
Define c linear discriminant functions
Assign x to ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
The resulting classifier is called a linear machine

(see Chapter 2)
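
A minimal sketch of the linear machine decision rule (the weight matrix W with one row per class, the bias vector w0, and the toy numbers are made up for the example):

import numpy as np

def linear_machine_predict(x, W, w0):
    """Assign x to the class i with the largest g_i(x) = W[i] @ x + w0[i].

    W: (c, d) weight vectors, w0: (c,) biases.  Returns the winning class index.
    """
    g = W @ x + w0            # all c discriminant values
    return int(np.argmax(g))  # ties go to the lowest index

# toy usage with c = 3 classes in d = 2 dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine_predict(np.array([2.0, 0.3]), W, w0))   # -> 0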
Linear Discriminant Functions:
multi-category case (contd)
A linear machine divides the feature space into c convex decision regions.
If x is in region Rᵢ, then gᵢ(x) is the largest discriminant value.

Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.
Linear Discriminant Functions:
multi-category case (contd)
The decision boundary between adjacent
regions Ri and Rj is a portion of the hyperplane
Hij given by:
g_i(\mathbf{x}) = g_j(\mathbf{x}) \quad \text{or} \quad g_i(\mathbf{x}) - g_j(\mathbf{x}) = 0
\quad \text{or} \quad (\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0

(wᵢ - wⱼ) is normal to Hᵢⱼ, and the signed distance from x to Hᵢⱼ is:

r = \frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}
Higher Order Discriminant Functions

Can produce more complicated decision boundaries than linear discriminant functions.
Linear Discriminants Revisited
Augmented feature/parameter space (x₀ = 1):

g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=0}^{d} w_i x_i = \boldsymbol{\alpha}^t \mathbf{y}

\boldsymbol{\alpha} = (w_0, w_1, \ldots, w_d)^t, \qquad \mathbf{y} = (x_0, x_1, \ldots, x_d)^t = (1, x_1, \ldots, x_d)^t
Linear Discriminants Revisited (contd)

Discriminant:

g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

Classification rule:

If αᵗyᵢ > 0, assign yᵢ to ω₁;
else if αᵗyᵢ < 0, assign yᵢ to ω₂.
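
A minimal sketch of the augmented representation and the classification rule above (function names and toy numbers are made up):

import numpy as np

def augment(x):
    """Map x = (x1, ..., xd) to y = (1, x1, ..., xd) so that g(x) = alpha @ y."""
    return np.concatenate(([1.0], x))

def classify(alpha, x):
    """Return 1 (omega_1) if alpha @ y > 0, 2 (omega_2) if alpha @ y < 0, 0 on the boundary."""
    g = alpha @ augment(x)
    return 1 if g > 0 else (2 if g < 0 else 0)

alpha = np.array([-1.0, 2.0, 0.5])             # alpha = (w0, w1, w2)
print(classify(alpha, np.array([1.0, 1.0])))   # g = -1 + 2 + 0.5 = 1.5 -> class 1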
Generalized Discriminants
First, map the data to a space of higher dimensionality:

Non-linearly separable → linearly separable

This can be accomplished using special transformation functions (φ functions), which map a point x = (x₁, ..., x_d)ᵗ from the d-dimensional space to a point y = (y₁(x), ..., y_d̂(x))ᵗ in a d̂-dimensional space (where d̂ >> d).
Generalized Discriminants (contd)
A generalized discriminant is a linear discriminant in the d̂-dimensional space:

g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

It separates points in the d̂-space by a hyperplane passing through the origin.
Example
Suppose we want g(x) > 0 if x < -1 or x > 0.5.
The corresponding decision regions R₁, R₂ in the d-space are not simply connected (not linearly separable).

Consider the following mapping (φ) functions:

y_1(x) = 1, \quad y_2(x) = x, \quad y_3(x) = x^2 \quad \Rightarrow \quad \mathbf{y} = (1, x, x^2)^t

Discriminant:

g(x) = \boldsymbol{\alpha}^t \mathbf{y} = -1 + x + 2x^2, \quad \text{i.e.,} \quad \boldsymbol{\alpha} = (-1, 1, 2)^t

Here d = 1 and d̂ = 3.
Example (contd)

The mapping y = (1, x, x²)ᵗ maps the line (d-space) onto a parabola in the d̂-space.

The problem has now become linearly separable!

The plane αᵗy = 0 divides the d̂-space into two decision regions, which correspond to the regions R₁, R₂ in the original d-space.
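
A minimal sketch of this example, using the mapping y = (1, x, x²)ᵗ and α = (-1, 1, 2)ᵗ read off from g(x) = -1 + x + 2x²:

import numpy as np

def phi(x):
    """Map the scalar x to y = (y1, y2, y3) = (1, x, x**2)."""
    return np.array([1.0, x, x * x])

alpha = np.array([-1.0, 1.0, 2.0])     # so that alpha @ phi(x) = -1 + x + 2x^2 = g(x)

for x in [-2.0, -1.0, 0.0, 0.5, 1.0]:
    g = alpha @ phi(x)
    label = "omega_1" if g > 0 else ("omega_2" if g < 0 else "boundary")
    print(f"x = {x:+.1f}  g(x) = {g:+.2f}  -> {label}")   # boundaries at x = -1 and x = 0.5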
Learning: linearly separable case
(two categories)

Given a linear discriminant function

g(\mathbf{x}) = \boldsymbol{\alpha}^t \mathbf{y}

the goal is to learn the parameters (weights) α from a set of n labeled samples yᵢ, where each yᵢ has a class label ω₁ or ω₂.
Learning: effect of training examples

Every training sample yᵢ places a constraint on the weight vector α; let's see how.

Case 1: visualize the solution in the parameter space (α₁, α₂):

αᵗyᵢ = 0 defines a hyperplane in the parameter space with yᵢ being the normal vector.

Given n examples, the solution must lie in the intersection of n half-spaces.

[Figure: parameter space (a₁, a₂) with the hyperplanes defined by a sample yᵢ ∈ ω₁ and a sample yⱼ ∈ ω₂.]
Learning: effect of training examples

Case 2: visualize the solution in the feature space:

αᵗy = 0 defines a hyperplane in the feature space with α being the normal vector.

Given n examples, the solution must lie within a certain region.

[Figure: feature space with samples from ω₁ and ω₂ and the corresponding solution region.]
Uniqueness of Solution

The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, for example:

Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.
Learning Using Iterative
Optimization
Minimize an error function J(α) (e.g., the classification error) with respect to α.

Minimize J(α) iteratively:

\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k)\, \mathbf{p}_k

where α(k) is the estimate at the k-th iteration, α(k+1) the estimate at the (k+1)-th iteration, p_k the search direction, and η(k) the learning rate (search step).

How should we choose p_k?
Choosing pk using Gradient Descent
\mathbf{p}_k = -\nabla J(\boldsymbol{\alpha}(k))

which gives the update \boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) - \eta(k)\, \nabla J(\boldsymbol{\alpha}(k)), where η(k) is the learning rate.

(α is denoted by a in the figures)
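
Illustrative sketch of the gradient-descent iteration α(k+1) = α(k) - η(k)∇J(α(k)); the toy quadratic objective, the constant learning rate, and the stopping tolerance are my own choices:

import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, n_iters=100, tol=1e-6):
    """Minimize J(alpha) iteratively: alpha(k+1) = alpha(k) - eta * grad_J(alpha(k))."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_iters):
        g = grad_J(a)
        if np.linalg.norm(g) < tol:          # stop when the gradient (nearly) vanishes
            break
        a = a - eta * g
    return a

# toy quadratic J(alpha) = 0.5 * alpha.T @ H @ alpha, whose gradient is H @ alpha
H = np.array([[2.0, 0.0], [0.0, 10.0]])
print(gradient_descent(lambda a: H @ a, a0=[2.0, 1.0], eta=0.1, n_iters=500))  # -> near [0, 0]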
Gradient Descent (contd)
[Figure: contour plot of J(α) over the search space, showing the trajectory of the estimates starting from α(0).]
Gradient Descent (contd)
What is the effect of the learning rate?

[Figure: contour plots of J(α) for two learning rates. Left, η = 0.37: slow, but converges to the solution. Right, η = 0.39: fast, but overshoots the solution.]
Gradient Descent (contd)
How should we choose the learning rate η(k)?

Approximate J(α) by a second-order Taylor series expansion around α(k) (a = α):

J(\boldsymbol{\alpha}) \approx J(\boldsymbol{\alpha}(k)) + \nabla J^t (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k)) + \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))^t \mathbf{H} (\boldsymbol{\alpha} - \boldsymbol{\alpha}(k))

where H is the Hessian (matrix of second derivatives).

Setting α = α(k+1) = α(k) - η(k)∇J and minimizing with respect to η(k) gives the optimum learning rate:

\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^t \mathbf{H} \nabla J}
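A quick numeric check of the optimal-step formula above (as reconstructed), assuming J is the toy quadratic from the previous sketch:

import numpy as np

def optimal_eta(grad, H):
    """eta(k) = ||grad J||^2 / (grad J.T @ H @ grad J) from the second-order expansion."""
    return float(grad @ grad) / float(grad @ H @ grad)

H = np.array([[2.0, 0.0], [0.0, 10.0]])
a = np.array([2.0, 1.0])
g = H @ a                    # gradient of J(alpha) = 0.5 * alpha.T @ H @ alpha
print(optimal_eta(g, H))     # step size adapted to the local curvature
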
Choosing p_k using Newton's Method

\mathbf{p}_k = -\mathbf{H}^{-1} \nabla J(\boldsymbol{\alpha}(k))

(requires inverting H; a = α in the figures)
Newton's method (contd)

If J(α) is quadratic, Newton's method converges in one step!

[Figure: contours of a quadratic J(α); a single Newton step reaches the minimum.]
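
Illustrative sketch of the Newton step p_k = -H⁻¹∇J, solving the linear system rather than inverting H explicitly; the quadratic J and its Hessian are made up for the demo:

import numpy as np

def newton_step(a, grad_J, hess_J):
    """alpha(k+1) = alpha(k) - H^{-1} grad_J(alpha(k)); uses solve() instead of an explicit inverse."""
    return a - np.linalg.solve(hess_J(a), grad_J(a))

# quadratic J(alpha) = 0.5 * alpha.T @ H @ alpha: one Newton step lands on the minimum
H = np.array([[2.0, 1.0], [1.0, 4.0]])
a = np.array([3.0, -2.0])
print(newton_step(a, lambda a: H @ a, lambda a: H))   # -> [0. 0.] in a single step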
Gradient descent vs. Newton's method
Dual Problem
Classification rule:
If αᵗyᵢ > 0, assign yᵢ to ω₁;
else if αᵗyᵢ < 0, assign yᵢ to ω₂.

Dual formulation: if yᵢ ∈ ω₂, replace yᵢ by -yᵢ; then find α such that αᵗyᵢ > 0 for all i.

Seeking a hyperplane that separates patterns from different categories becomes seeking a hyperplane that puts the normalized patterns on the same (positive) side.
Perceptron rule
Use gradient descent, assuming that the error function to be minimized is:

J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\boldsymbol{\alpha}^t \mathbf{y})

where Y(α) is the set of samples misclassified by α.

If Y(α) is empty, J_p(α) = 0; otherwise, J_p(α) > 0.

Goal: find α such that αᵗyᵢ > 0.
Perceptron rule (contd)
The gradient of J_p(α) is:

\nabla J_p(\boldsymbol{\alpha}) = \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} (-\mathbf{y})

The perceptron update rule is obtained using gradient descent (a = α):

\boldsymbol{\alpha}(k+1) = \boldsymbol{\alpha}(k) + \eta(k) \sum_{\mathbf{y} \in Y(\boldsymbol{\alpha})} \mathbf{y}
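
The update just derived can be implemented directly. A minimal NumPy sketch of the batch perceptron, assuming the samples have already been augmented and normalized as in the dual formulation (names and toy data are my own):

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iters=1000):
    """Batch perceptron on normalized augmented samples Y (rows y_i; goal: a @ y_i > 0 for all i).

    Update: a(k+1) = a(k) + eta * (sum of the currently misclassified samples).
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_iters):
        misclassified = Y[Y @ a <= 0]        # Y(a): samples with a @ y <= 0
        if len(misclassified) == 0:          # J_p(a) = 0 -> solution found
            return a
        a = a + eta * misclassified.sum(axis=0)
    return a

# toy normalized data: the omega_2 samples have already been multiplied by -1
Y = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [-1.0, 1.0, 0.5], [-1.0, 2.0, 1.5]])
a = batch_perceptron(Y)
print(Y @ a > 0)                             # all True if a separating vector was found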
Perceptron rule (contd)

(note: replace a
with and
Yk with Y()) missclassified
examples
Perceptron rule (contd)

Move the hyperplane so that training samples are


on its positive side.

a2

Example:

a1
Perceptron rule (contd)

(k)=1
yk
one example
at a time

Perceptron Convergence Theorem: If training


samples are linearly separable, then the perceptron
algorithm will terminate at a solution vector in a finite
number of steps.
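
A corresponding sketch of the single-sample, fixed-increment variant (η(k) = 1), again assuming normalized augmented samples; under linear separability it stops after a finite number of updates, as the theorem states:

import numpy as np

def fixed_increment_perceptron(Y, max_epochs=100):
    """Single-sample fixed-increment rule: a <- a + y_k whenever a @ y_k <= 0.

    Y holds normalized augmented samples; terminates once a full pass makes no mistakes.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                      # present one example at a time
            if a @ y <= 0:               # misclassified (or on the boundary)
                a = a + y
                errors += 1
        if errors == 0:                  # separable data -> finite number of steps
            return a
    return a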
Perceptron rule (contd)

[Figure: single-sample perceptron updates; order of examples presented: y₂, y₃, y₁, y₃.]

The batch algorithm leads to a smoother trajectory in the solution space.