Introduction. Binary Classification and Bayes Optimal Classifier
Introduction. Binary Classification and Bayes Optimal Classifier
Introduction. Binary Classification and Bayes Optimal Classifier
Outline
• Introduction and course overview
Figure 1 shows some examples of problems where we might use machine learning techniques (indeed, machine
learning techniques are already being used with success to solve these problems in practice). As can be seen,
the problems are varied and come from a variety of different application domains.
The reasons that one might be interested in studying machine learning can be equally varied (see Figure 2).
In this course, our reasons for studying machine learning are largely practical. Predictive models are needed
in many areas of science, engineering, and business. With the explosion in data everywhere, ranging from
astronomy, biology, and drug discovery to climate modeling, finance, and the Web, there is an increasing
need for algorithms that can automatically “learn” such predictive models from data. In this course, we
would like to understand how to design and analyze such algorithms.
That said, the foundations of machine learning are built on elegant mathematics from the fields of probability,
statistics, computer science, and optimization, and it is through the right appreciation and understanding
1
2 Introduction. Binary Classification and Bayes Optimal Classifier.
of these mathematical foundations that one can make lasting contributions to the development and analysis
of new machine learning algorithms. To this end, while our focus will be on understanding the main tools
and techniques in machine learning, we will emphasize a clear understanding of the underlying mathematical
principles, with the hope that several of you may later go on to contribute new ideas to this area which also
continues to be an active and exciting area of research with much potential for real impact.
Figure 1: Examples of problems that can be solved using machine learning techniques.
Introduction. Binary Classification and Bayes Optimal Classifier. 3
The input to the problem consists of training data; the output consists of a predictive (or explanatory
or decision-making) model. In order to specify a learning problem, one needs to specify three critical
components:
A candidate solution to the problem is then any learning algorithm that takes the specified form of data
as input and produces the specified form of model as output. Algorithms that have stronger guarantees on
the performance of the learned models in terms of the specified performance measure generally constitute
better solutions than algorithms with weaker performance guarantees. In addition, there may be other
considerations that need to be taken into account in designing a learning algorithm, e.g. time and space
complexity of the algorithm, “interpretability” of the models produced, etc.
A prominent class of learning problems, and one that will be the focus of the first part of the course, is that of
supervised learning problems. Here one is given examples of some objects together with associated labels,
and the goal is to learn a model that can predict the labels of new objects; this includes for example the
email spam filter, handwritten digit recognition, weather forecasting, and natural language parsing examples
in Figure 1. In a typical supervised learning problem, there is an instance space X containing (descriptions
1 As we’ll see, not all machine learning problems follow this exact structure, but this is a good place to start.
4 Introduction. Binary Classification and Bayes Optimal Classifier.
of) instances or objects for which predictions are to be made, a label space Y from which labels are drawn,
and a prediction space Yb in which predictions are to be made (often Yb = Y, but this is not always the case).
The training data consists of a finite number of labeled examples S = ((x1 , y1 ), . . . , (xm , ym )) ∈ (X × Y)m ,
and the goal is to learn from these examples a model hS : X →Yb that given a new instance x ∈ X , predicts
yb = hS (x) ∈ Y:
b
There are many types of supervised learning problems. For example, in a binary classification problem,
there are two classes or labels, denoted without loss of generality by Y = {±1} = {−1, +1}, and the goal is
to learn a model that can predict accurately the class (label) of new instances, Yb = Y = {±1}. The email
spam filter problem in Figure 1 is an example of such a problem: here the two classes (labels) are spam and
non-spam. The instance space X depends on the representation of the objects (in this case email messages):
for example, if the email messages are represented as binary feature vectors denoting presence/absence of
say some d words in a vocabulary, then the instance space would be X = {0, 1}d ; if the messages are
represented as feature vectors containing counts of the d words, then the instance space would be X = Zd+ .
In a multiclass classification problem, there are K > 2 classes (labels), say Y = {1, . . . , K}, and the
goal again is to learn a model that predicts accurately labels of new instances, Yb = Y = {1, . . . , K}; the
handwritten digit recognition problem in Figure 1 is an example of such a problem with K = 10 classes. In
a regression problem, one has real-valued labels, Y ⊆ R, and the goal is to learn a model that predicts
labels of new instances, Yb = Y ⊆ R; the weather forecasting problem in Figure 1 is an example of such a
problem. In a structured prediction problem, the label space Y contains a potentially very large number
of complex structures such as sequences, trees, or graphs, and the goal is generally to learn a model that
predicts structures in the same space, Yb = Y; the natural language parsing problem in Figure 1 is an example
of such a problem. Performance in supervised learning problems is often (but not always) measured via a
loss function of the form ` : Y × Y→R
b + , where `(y, y
b) denotes the ‘loss’ incurred on predicting yb when the
true label is y; we will see examples later.2 Supervised learning techniques can also be extended to problems
such as collaborative filtering, of which the movie recommendation problem in Figure 1 is an example.
In an unsupervised learning problem, there are no labels; one is simply given instances from some instance
space X , and the goal is to learn or discover some patterns or structure in the data; typically, one assumes the
instances in the given training data S = (x1 , . . . , xm ) ∈ X m are drawn i.i.d. from some unknown probability
distribution on X , and one wishes to estimate some property of this distribution:
Unsupervised learning problems include for example density estimation problems, where one wants to
estimate the probability distribution assumed to be generating the data S, and clustering problems, where
one is interested in identifying natural ‘groups’ or ‘clusters’ in the distribution generating S. The gene
expression analysis problem in Figure 1 falls in this category.
In a reinforcement learning problem, there is a set of states that an agent (learner) can be in, a set of
actions that can be taken in each state, each of which leads to another state, and some form of ‘reward’
that the agent receives when it moves to a state; the goal is to learn a model, called a policy, that maps
states to actions and determines which action the agent should take in each state, such that as an agent
takes actions according to the model and moves from one state to another, the overall cumulative reward
received is maximized. The difference from classical supervised or unsupervised learning problems is that
the agent does not receive training data upfront, but rather learns the model while performing actions and
observing their effects. Reinforcement learning problems arise for example in robotics, and in designing
computer programs that can automatically learn to play games such as chess or backgammon.
There are also several useful variants of the above types of learning problems, such as online learning
problems, where instances are received in a dynamic manner, the algorithm must predict a label for each
instance as it arrives, and after each prediction, the true label of the instance is revealed and the algorithm
can update its model; semi-supervised learning problems, where the goal is to learn a prediction model
from a mix of labeled and unlabeled data; and active learning problems, where the algorithm is allowed
to query labels of certain instances. We will see examples of all these later in the course.
As noted above, we will begin our study with supervised learning. We start with binary classification.
Let X denote the instance space, and let the label and prediction spaces be Y = Yb = {±1}. We are given a
training sample S = ((x1 , y1 ), . . . , (xm , ym )) ∈ (X × {±1})m , where each xi ∈ X is an instance in X (such
as a feature vector representing an email message in a spam classification problem, or a gene expression
vector extracted from a patient’s tissue sample in a cancer diagnosis problem), and yi ∈ {±1} is a binary
label representing which of two classes the instance xi belongs to (such as spam/non-spam or cancer/non-
cancer). Each pair (xi , yi ) is referred to as a (labeled) training example. The goal is to learn from the
training sample S a classification model or classifier hS : X →{±1}, which given a new instance (email
message or tissue sample) x ∈ X , can predict accurately its class label via yb = hS (x).
What should count as a good classifier? In other words, how should we evaluate the quality of a proposed
classifier h : X →{±1}? We could count the fraction of instances in S whose labels are correctly predicted
by h. But this would be a bad idea, since what we care about is not how well the classifier will perform on
the given training examples, but how well it will perform on new examples in the future. In other words,
what we care about is how well the classifier will generalize to new examples. How should we measure this?
A common approach is to assume that all examples (both those seen in training and those that will be
seen in the future) are drawn i.i.d. from some (unknown) joint probability distribution D on X × Y. Then,
given the finite sample S drawn according to Dm , the goal is to learn a classifier that performs well on new
examples drawn from D. In particular, one can define the generalization accuracy of a classifier h w.r.t.
D as the probability that a new example drawn randomly from D is classified correctly by h:
accD [h] = P(X,Y )∼D h(X) = Y .
Here the notation P(X,Y )∼D (A) denotes the probability of an event A when the random quantity (X, Y ) is
drawn from the distribution D; when the distribution and/or random quantities are clear from context, we
will write this as simply P(X,Y ) (A) or P(A).
6 Introduction. Binary Classification and Bayes Optimal Classifier.
Equivalently, the generalization error (also called risk) of a classifier h w.r.t. D is the probability that a
new example drawn from D is misclassified by h:
erD [h] = P(X,Y )∼D h(X) 6= Y .
Another view of the generalization error that will prove to be very useful is in terms of a loss function. In
particular, let us define the 0-1 loss function `0-1 : {±1} × {±1}→R+ as follows:
y 6= y) ,
`0-1 (y, yb) = 1(b
where 1(·) is the indicator function whose value is 1 if its argument is true and 0 otherwise. Then the
generalization error of h is simply its expected 0-1 loss on a new example:
erD [h] = P(X,Y )∼D h(X) 6= Y
= E(X,Y )∼D 1(h(X) 6= Y )
= E(X,Y )∼D `0-1 (Y, h(X)) .
We will sometimes make this connection implicit by referring to erD [h] as the 0-1 generalization error of
h w.r.t. D, and denoting it as er0-1
D [h]. Clearly, a smaller generalization error means a better classifier.
It is worth pausing for a moment here to think about what it means to have a joint probability distribution D
generating labeled examples in X × {±1}. If for every instance x ∈ X , there is a single, deterministic, ‘true’
label t(x) ∈ {±1}, then we don’t really need a joint distribution; it’s sufficient to consider a distribution µ
on X alone, where instances x are generated randomly from µ and labels y are then assigned according to
y = t(x). However, in general, it is possible that there is no ‘true’ label function t, in the sense that the
same instance can sometimes appear with a +1 label and sometimes with a −1 label. This can happen for
example if there is inherent noise or uncertainty in the underlying process (e.g. if the same gene expression
vector can sometimes correspond to patients with a disease and sometimes to patients without the disease),
or if the instances x do not contain all the information necessary to predict the outcome (e.g. if in addition
to the gene expression measurements, one also needs some other information to make reliable predictions,
in the absence of which both outcomes could potentially be seen). The joint distribution D allows for this
possibility. In particular, we will denote by η(x) the conditional probability of label +1 given x under D:
η(x) = P Y = +1 | X = x .
If labels are deterministic, then we have η(x) ∈ {0, 1} for all x; but in general, if instances can appear with
both +1 and −1 labels, then η(x) can take any value in [0, 1].
Now, given a training sample S, a good learning algorithm should produce a classifier hS with small gen-
eralization error w.r.t. the underlying probability distribution D. How small can this error be? Clearly,
erD [hS ] ∈ [0, 1]. Can we always aim for a classifier with zero error? If labels are deterministic functions of
instances, i.e. if there is a ‘true’ label function t : X →{±1} such that every instance x appears only with
label y = t(x) (which, as noted above, means that for each x, η(x) is either 0 or 1)3 , then clearly for hS = t
we get zero error. In general, however, if the same instance x can appear with both +1 and −1 labels, then
no classifier can achieve zero error. In particular, we have
erD [h] = E(X,Y )∼D 1(h(X) 6= Y )
= EX EY |X 1(h(X) 6= Y )
= EX P(Y = +1|X) · 1(h(X) 6= +1) + P(Y = −1|X) · 1(h(X) 6= −1)
= EX η(X) · 1(h(X) = −1) + (1 − η(X)) · 1(h(X) = +1) .
3 Technically, we require this to hold with probability 1 over X ∼ µ.
Introduction. Binary Classification and Bayes Optimal Classifier. 7
For any x, a prediction h(x) = −1 contributes η(x) to the above error, and a prediction h(x) = 1 contributes
(1 − η(x)). The minimum achievable error, called the Bayes error associated with D, is therefore given
by
er∗D =
inf erD [h] = EX min η(X), (1 − η(X)) .
h:X →{±1}
Any classifier achieving the above error is called a Bayes (optimal) classifier for D; in particular, it
follows from the above discussion that the classifier h∗ : X →{±1} defined as
(
∗ 1
+1 if η(x) > 12
h (x) = sign η(x) − 2 =
−1 otherwise
• Assume a parametric form for D, such that given the parameters, one can compute a Bayes optimal
classifier; the parameters are then estimated using the training sample S, which constitutes an i.i.d.
sample from D, and a Bayes optimal classifier w.r.t. the estimated distribution is used;
• Assume a parametric form for the classification model, i.e. some parametric function class H ⊆ {±1}X ,
and use S to find a good classifier hS within H;4
• Use ‘non-parametric’ methods to learn a classifier hS from S.
In the next few lectures, we will discuss a variety of learning algorithms in each of the above categories. At
the end of the course, we will see that even without knowledge of D, many of these algorithms can be made
(universally) statistically consistent, in the sense that whatever the distribution D might be, as the
number m of training examples in S (drawn i.i.d. from D) goes to infinity, the 0-1 generalization error of
the learned classifier is guaranteed to converge in probability to the Bayes error for D.