
Pattern Recognition

P.S.Sastry
[email protected]

Dept. Electrical Engineering


Indian Institute of Science, Bangalore


Reference Books

R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.

C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Indian Edition, 2003.

R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

Pattern Recognition
A basic attribute of people: categorisation of sensory input.

Pattern → PR System → Class label
Examples of Pattern Recognition tasks

Reading facial expressions

Recognising Speech

Reading a Document

Identifying a person by fingerprints

Diagnosis from medical images


Machine Recognition of Patterns


pattern → feature extractor → X → classifier → class label

Feature extractor makes some measurements on the input pattern.

X is called the feature vector. Often, X ∈ ℝⁿ.

Classifier maps each feature vector to a class label.

Features to be used are problem-specific.


Some Examples of PR Tasks

Character Recognition
Pattern: image
Class: identity of the character
Features: binary image, projections (e.g., row and column sums), moments etc.
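As a small illustration (mine, not from the slides) of the projection features mentioned above, row and column sums of a made-up binary image can be computed as follows:

```python
import numpy as np

# A made-up 5x4 binary image of a character (1 = ink, 0 = background).
img = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 1, 1, 1],
                [1, 0, 0, 1],
                [1, 0, 0, 1]])

row_proj = img.sum(axis=1)              # one value per row
col_proj = img.sum(axis=0)              # one value per column

# A simple feature vector: concatenate the two projections.
x = np.concatenate([row_proj, col_proj])
print(x)                                # [2 2 4 2 2 4 2 2 4]
```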


Some Examples of PR Tasks

Speech Recognition
Pattern: 1-D signal (or its sampled version)
Class: identity of speech units
Features: LPC model of chunks of speech, spectral info, cepstrum etc.
Pattern can become a sequence of feature vectors.


Examples contd...

Fingerprint based identity verification
Pattern: image plus an identity claim
Class: Yes / No
Features: position of minutiae, orientation field of ridge lines etc.


Examples contd...

Video-based Surveillance
Pattern: video sequence
Class: e.g., level of alertness
Features: motion trajectories, parameters of a prefixed model etc.


Examples contd...

Credit Screening
Pattern: details of an applicant (e.g., for a credit card)
Class: Yes / No
Features: income, job history, level of credit, credit history etc.


Examples contd...

Imposter detection (of, e.g., a credit card)
Pattern: a sequence of transactions
Class: Yes / No
Features: amount of money, locations of transactions, times between transactions etc.


Examples contd...

Document Classification
Pattern: a document and a query
Class: relevant or not (in general, a rank)
Features: word occurrence counts, word context etc.
Spam filtering, diagnostics of machinery etc. are similar tasks.


Design of Pattern Recognition Systems

Features depend on the problem. Measure relevant quantities.

Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA); a small sketch follows below.

After feature extraction, each pattern is a vector.

The classifier is a function that maps such vectors to class labels.

Many general techniques of classifier design are available.

Need to test and validate the final system.
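A minimal sketch (my illustration with made-up data) of using PCA to extract a smaller set of relevant quantities from the initial measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 made-up patterns, 5 raw measurements each

Xc = X - X.mean(axis=0)                  # centre the measurements
cov = (Xc.T @ Xc) / (len(Xc) - 1)        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors in ascending eigenvalue order

W = eigvecs[:, -2:]                      # keep the 2 directions of largest variance
Z = Xc @ W                               # reduced 2-dimensional feature vectors
print(Z.shape)                           # (100, 2)
```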


Some notation

Feature Space, 𝒳: the set of all possible feature vectors.

Classifier: a decision rule or a function, h : 𝒳 → {1, …, M}.

Often, 𝒳 = ℝⁿ. Convenient to take M = 2. Then we take the labels as {0, 1} or {−1, +1}.

Then, any binary valued function on 𝒳 is a classifier.

What h to choose? We want a correct or optimal classifier.
How to design classifiers?
How to judge performance?
How to provide performance guarantees?

We first consider the 2-class problem.

Can handle the M > 2 case if we know how to handle the 2-class problem.

Simplest alternative: design M 2-class classifiers (One vs Rest; a sketch follows below).

There are other possibilities: e.g., tree-structured classifiers.

The 2-class problem is the basic problem.

We will also look at M-class classifiers.
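A minimal sketch (my own illustration) of the One-vs-Rest idea: train M 2-class classifiers, one per class, and assign a new pattern to the class whose classifier gives the largest score. LinearBinary is a hypothetical stand-in for any 2-class learner.

```python
import numpy as np

class LinearBinary:
    """Hypothetical 2-class learner: least-squares fit of a linear score."""
    def fit(self, X, y):                                   # y in {0, 1}
        Xb = np.hstack([X, np.ones((len(X), 1))])          # add a bias column
        self.w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
        return self

    def score(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return Xb @ self.w                                 # larger = more like class 1

def one_vs_rest_fit(X, labels, M):
    # One binary problem per class: class m versus all the rest.
    return [LinearBinary().fit(X, (labels == m).astype(float)) for m in range(M)]

def one_vs_rest_predict(classifiers, X):
    scores = np.column_stack([c.score(X) for c in classifiers])
    return scores.argmax(axis=1)                           # most confident class wins
```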


A simple PR problem

Problem: Spot the Right Candidate

Features:
x1: marks based on academic record
x2: marks in the interview

A classifier: ax1 + bx2 > c ⇒ Good (a small sketch of this rule follows below)

Another classifier: x1 x2 > c ⇒ Good (or (x1 + a)(x2 + b) > c).

Design of classifier:
We have to choose a specific form for the classifier.
What values to use for parameters such as a, b, c?
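A minimal sketch of the first classifier above as a decision rule (the parameter values are made up; choosing them well is exactly the design question):

```python
def classify(x1, x2, a=0.6, b=0.4, c=50.0):
    """Label a candidate 'Good' when a*x1 + b*x2 exceeds the threshold c."""
    return "Good" if a * x1 + b * x2 > c else "Not good"

print(classify(70, 40))   # 0.6*70 + 0.4*40 = 58 > 50  -> 'Good'
print(classify(40, 30))   # 0.6*40 + 0.4*30 = 36 <= 50 -> 'Not good'
```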

Designing Classifiers

Need to decide how feature vector values determine the class.
(How different marks reflect the goodness of a candidate.)

In most applications, it is not possible to design the classifier from the physics of the problem.

The difficulties are:
Lot of variability in patterns of a single class
Variability in feature vector values
Feature vectors of patterns from different classes can be arbitrarily close
Noise in measurements

Designing Classifiers contd...

Often the only information available for the design is a training set of example patterns.

Training set: {(Xi, yi), i = 1, …, ℓ}. Here Xi is an example feature vector of class yi.

Generation of training set: take representative patterns of known category (data collection) and obtain the feature vectors (choice of feature measurements).

Now learn an appropriate function h as classifier. (Model choice)

Test and validate the classifier on more data.
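A minimal sketch (my illustration; train_classifier is a hypothetical stand-in for any learning algorithm) of the workflow above: learn h from a training set and validate it on held-out data.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the labelled data and keep a fraction aside for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

# Hypothetical usage with any classifier h learned from the training part:
#   X_tr, y_tr, X_te, y_te = holdout_split(X, y)
#   h = train_classifier(X_tr, y_tr)          # model choice + learning
#   test_error = np.mean(h(X_te) != y_te)     # validate on data not used for training
```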



A simple PR problem

Problem: Spot the Right Candidate

Features:
x1: marks based on academic record
x2: marks in the interview

A classifier: ax1 + bx2 > c ⇒ Good

We have chosen a specific form for the classifier.

Design of classifier: what values to use for a, b, c?

Information available: experience (the history of past candidates).


Training Set (figure)

Another example problem

Problem: recognize persons of medium build

Features: Height and Weight

The classifier is nonlinear here



Learning from Examples

Designing a classifier is a typical problem of learning from examples. (Also called learning with a teacher.)

The nature of feedback from the teacher determines the difficulty of the learning problem.

In the context of Pattern Recognition:

feature vector → Classifier → class
(the Teacher provides feedback to the Classifier)

Supervised learning: the teacher gives the true class label for each feature vector
Reinforcement learning: noisy assessment of performance, e.g., correct/incorrect
Unsupervised learning: no teacher input (clustering problems)

When the class labels of training patterns as given by the teacher are noisy, we consider it as supervised learning with label noise or classification noise.

Many classifier design algorithms do supervised learning.

Function Learning

A closely related problem. The output is continuous-valued rather than discrete as in classifiers.
Here the training set examples could be {(Xi, yi), i = 1, …, ℓ}, Xi ∈ 𝒳, yi ∈ ℝ.

The prediction variable, y, is continuous rather than taking finitely many values. (There can be noise in the examples.)

Similar learning techniques are needed to infer the underlying functional relationship between X and y. (The regression function of y on X.)

Examples of Function Learning

Time series prediction: given a series x1, x2, …, find a function to predict xn.

Based on past values: find a best function xn = h(x_{n-1}, x_{n-2}, …, x_{n-p}).

Predict stock prices, exchange rates etc.
Linear prediction model used in speech analysis.

More general predictors can use other variables also.
Predict rainfall based on measurements and (possibly) previous years' data.
In general, system identification. (An application: smart sensors)

Examples contd...: Equaliser

Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

We want y(k) = x(k). Design (or adapt) the filter to achieve this.

We can choose a filter of the form y(k) = Σ_{i=1}^{T} a_i Z(k-i).

Finding the best a_i is a function learning problem.

Training set: {(x(k), Z(k)), k = 1, 2, …, N}
How do we know x(k) at the receiver end? Prior agreements (protocols).
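A minimal sketch (my illustration with a made-up channel; for the example I also let the filter use the current received sample Z(k), i.e., the sum starts at i = 0) of finding the best taps a_i by least squares from such a training set:

```python
import numpy as np

def fit_equaliser(x, Z, T):
    """Least-squares fit of taps a_0..a_T so that sum_i a_i * Z(k-i) ~ x(k)."""
    rows, targets = [], []
    for k in range(T, len(x)):
        rows.append(Z[k - T:k + 1][::-1])    # [Z(k), Z(k-1), ..., Z(k-T)]
        targets.append(x[k])
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return a

# Made-up example: the channel scales each symbol and adds a little interference.
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=500)                 # transmitted training symbols x(k)
Z = 0.5 * x + 0.1 * np.concatenate([[0.0], x[:-1]])   # received signal Z(k)
a = fit_equaliser(x, Z, T=2)
print(a)    # taps approximating the channel inverse (a_0 close to 2)
```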


Learning from examples

Learning from examples is basic to both classification and regression.

Training set: {(X1, y1), …, (X_ℓ, y_ℓ)}.

We essentially want to fit a best function: y = f(X).

Suppose X ∈ ℝ. Then this is a familiar curve-fitting problem.

Model choice (choice of form for f) is very important here.
If we choose f to be a polynomial, the higher the degree, the better the performance on the training data.

But that is not the real performance: a polynomial of degree ℓ - 1 would give zero error!!
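A minimal sketch (made-up data) of this effect: as the polynomial degree grows, the error on the training points shrinks towards zero, whatever the true underlying function is.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 8)                             # 8 training inputs
y = np.sin(2 * np.pi * X) + 0.2 * rng.normal(size=8)     # noisy training targets

for degree in (1, 3, 7):                                 # degree 7 = (number of points) - 1
    coeffs = np.polyfit(X, y, degree)
    train_err = np.mean((np.polyval(coeffs, X) - y) ** 2)
    print(degree, train_err)                             # shrinks to ~0 at degree 7
```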

Learning from examples: Generalization

To obtain a classifier (or a regression function) we use the training set.

We know the class labels of patterns (or the values of the prediction variable) in the training set.

Errors on the training set do not necessarily tell how good the classifier is.

Any classifier that amounts to only storing the training set is useless.

We are interested in the generalization abilities: how does our classifier perform on unseen or new patterns?

Design of Classifiers

The classifier should perform well in spite of the inherent variability of patterns and noise in feature extraction and/or in the class labels as given in the training set.

Statistical Pattern Recognition: an approach where the variabilities are captured through probabilistic models.

There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods etc.

In this course we consider classification and regression (function learning) problems in the statistical framework.

Statistical Pattern Recognition

𝒳 is the feature space. (We take 𝒳 = ℝⁿ.) We consider a 2-class problem for simplicity of notation.

A given feature vector can come from different classes with different probabilities.

Let fi be the probability density function of the feature vectors from class-i, i = 0, 1.
The fi are called class conditional densities.

Let X = (X1, …, Xn) ∈ ℝⁿ represent the feature vector.
Then fi is the (conditional) joint density of the random variables X1, …, Xn given that X is from class-i.

Class conditional densities model the variability in the feature values.
For example, the two classes can be uniformly distributed in the two regions as shown. (The two classes are separable here.)

(Figure: two disjoint regions, labelled Class 1 and Class 2.)


When class regions are separable, an important special case is linear separability.

(Figure: two panels, each showing regions labelled Class 1 and Class 2.)

The classes in the left panel above are linearly separable (can be separated by a line) while those in the right panel are not linearly separable (though separable).

In general, the two class conditional densities can overlap. (The same value of the feature vector can come from different classes with different probabilities.)

(Figure: overlapping densities labelled Class 1 and Class 2.)


The statistical viewpoint gives us one way of looking for an optimal classifier.

We can say we want a classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).


Let qi(X) = Prob[class = i | X], i = 0, 1.

qi is called the posterior probability (function) for class-i.

Consider the classifier

h(X) = 0 if q0(X) > q1(X)
     = 1 otherwise

q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.
Hence, intuitively, such a classifier should minimize the probability of error in classification.

Statistical Pattern Recognition

𝒳 (= ℝⁿ) is the feature space. 𝒴 = {0, 1} is the set of class labels.

ℋ = {h | h : 𝒳 → 𝒴} is the set of classifiers of interest.

For a feature vector X, let y(X) denote the class label of X. In general, y(X) would be random.

Now we want to assign a figure of merit to each possible classifier in ℋ.

For example, we can rate different classifiers by

F(h) = Prob[h(X) ≠ y(X)]

F(h) is the probability that h misclassifies a random X.

The optimal classifier would be one with the lowest value of F.

Given an h, we can calculate F(h) only if we know the probability distributions of the classes.

Minimizing F is not a straightforward optimization problem.

Here we are treating all errors as the same. We will consider unequal costs of errors (risk) later.
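A minimal sketch (my illustration; the class distributions and the classifier are made up) of estimating F(h) by drawing labelled samples and counting misclassifications:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n labelled patterns: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=2.0 * y, scale=1.0)
    return X, y

def h(X):
    """A candidate classifier: threshold the (1-D) feature at 1.0."""
    return (X > 1.0).astype(int)

X, y = sample(100_000)
F_hat = np.mean(h(X) != y)    # Monte Carlo estimate of Prob[h(X) != y(X)]
print(F_hat)                  # about 0.16 for these made-up densities
```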


Statistical PR contd.

Recall that fi(X) denotes the probability density function of feature vectors of class-i (the class conditional densities).

Let pi = Prob[y(X) = i]. These are called prior probabilities.

Recall the posterior probabilities, qi(X) = Prob[y(X) = i | X]. Now, by Bayes theorem,

qi(X) = fi(X) pi / Z, where Z = f0(X) p0 + f1(X) p1 is the normalising constant.
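As a small worked illustration with made-up numbers: if p0 = 0.6, p1 = 0.4 and, at some X, f0(X) = 0.5 and f1(X) = 2.0, then Z = 0.6 × 0.5 + 0.4 × 2.0 = 1.1, so q0(X) = 0.30/1.1 ≈ 0.27 and q1(X) = 0.80/1.1 ≈ 0.73.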

Bayes Classifier

Consider the classifier given by

h(X) = 0 if q0(X)/q1(X) > 1
     = 1 otherwise

This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

We will prove the optimality of the Bayes classifier later.
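A minimal sketch (my illustration; the 1-D Gaussian class conditional densities and the priors are made up) of implementing the Bayes classifier through the equivalent test p0 f0(X) > p1 f1(X):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Made-up class conditional densities and prior probabilities.
p0, p1 = 0.5, 0.5
f0 = lambda x: gaussian_pdf(x, 0.0, 1.0)    # class-0 density
f1 = lambda x: gaussian_pdf(x, 2.0, 1.0)    # class-1 density

def bayes_classify(X):
    """Assign class 0 where p0*f0(X) > p1*f1(X), class 1 otherwise."""
    return np.where(p0 * f0(X) > p1 * f1(X), 0, 1)

print(bayes_classify(np.array([-1.0, 0.5, 1.5, 3.0])))   # -> [0 0 1 1]
```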


story so far

We consider PR as a two-step process: feature measurement/extraction and classification.

A classifier maps feature vectors to class labels.

The main difficulty in designing classifiers is due to the large variability in feature vector values.

The main information we have for the design is a training set of examples.

Function learning is a closely related problem.

In both cases we need to learn from (training) examples.

story so far

The statistical view is: model the variation in feature vectors from a given class through probability models.

One objective can be: find a classifier that has the least probability of error.

For example, the Bayes classifier is the optimal one if we know the class conditional densities.

Organization of the course

Overview of classifier learning strategies

Nearest Neighbour classification rule

Bayes classifier for minimizing risk

Some variations on this theme (e.g., the Neyman-Pearson classifier)

Techniques for estimating class conditional densities to implement the Bayes classifier

Organization of the course

Linear discriminant functions

Learning linear discriminant functions (Perceptron, LMS, logistic regression)

Linear least-squares regression

Simple overview of statistical learning theory

Empirical risk minimization and VC theory

Organization of the course

Learning nonlinear classifiers

Neural Networks (Backpropagation, RBF networks)

Support Vector Machines

Kernel-based methods
Feature extraction and dimensionality reduction (PCA)

Boosting and ensemble classifiers (AdaBoost)


Thank You!
