
CSE176 Introduction to Machine Learning — Lecture notes

Miguel Á. Carreira-Perpiñán


EECS, University of California, Merced
September 2, 2019

These are notes for a one-semester undergraduate course on machine learning given by Prof.
Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based on
the book “Introduction to machine learning” by Ethem Alpaydın (MIT Press, 3rd ed., 2014), with
some additions.
These notes may be used for educational, non-commercial purposes.
© 2015–2016 Miguel Á. Carreira-Perpiñán
1 Introduction
1.1 What is machine learning (ML)?
• Data is being produced and stored continuously (“big data”):

– science: genomics, astronomy, materials science, particle accelerators. . .


– sensor networks: weather measurements, traffic. . .
– people: social networks, blogs, mobile phones, purchases, bank transactions. . .
– etc.

• Data is not random; it contains structure that can be used to predict outcomes, or gain knowl-
edge in some way.
Ex: patterns of Amazon purchases can be used to recommend items.

• It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labelled as spam/not spam.

• Data mining: the application of ML methods to large databases.

• Ex of ML applications: fraud detection, medical diagnosis, speech or face recognition. . .

• ML is programming computers using data (past experience) to optimize a performance criterion.

• ML relies on:

– Statistics: making inferences from sample data.


– Numerical algorithms (linear algebra, optimization): optimize criteria, manipulate models.
– Computer science: data structures and programs that solve a ML problem efficiently.

• A model:

– is a compressed version of a database;


– extracts knowledge from it;
– does not have perfect performance but is a useful approximation to the data.

1.2 Examples of ML problems


• Supervised learning: labels provided.

– Classification (pattern recognition):


∗ Face recognition. Difficult because of the complex variability in the data: pose and
illumination in a face image, occlusions, glasses/beard/make-up/etc.

(Figure: training examples and test images of faces.)

∗ Optical character recognition: different styles, slant. . .


∗ Medical diagnosis: often, variables are missing (tests are costly).

∗ Speech recognition, machine translation, biometrics. . .
∗ Credit scoring: classify customers into high- and low-risk, based on their income and
savings, using data about past loans (whether they were paid or not).
– Regression: the labels to be predicted are continuous:
∗ Predict the price of a car from its mileage.
∗ Navigating a car: angle of the steering.
∗ Kinematics of a robot arm: predict workspace location from angles.

(Figure: credit scoring, with the decision rule "if income > θ1 and savings > θ2 then low-risk else high-risk" shown as a rectangle in the (income, savings) plane; and regression, fitting a line y = wx + w0 to price vs. mileage data.)

• Unsupervised learning: no labels provided, only input data.

– Learning associations:
∗ Basket analysis: let p(Y |X) = “probability that a customer who buys product X
also buys product Y ”, estimated from past purchases. If p(Y |X) is large (say 0.7),
associate “X → Y ”. When someone buys X, recommend them Y .
– Clustering: group similar data points.
– Density estimation: where are data points likely to lie?
– Dimensionality reduction: data lies in a low-dimensional manifold.
– Feature selection: keep only useful features.
– Outlier/novelty detection.

• Semisupervised learning: labels provided for some points only.

• Reinforcement learning: find a sequence of actions (policy) that reaches a goal. No supervised
output but delayed reward.
Ex: playing chess or a computer game, robot in a maze.

2 Supervised learning
2.1 Learning a class from examples: two-class problems
• We are given a training set of labeled examples (positive and negative) and want to learn a
classifier that we can use to predict unseen examples, or to understand the data.
• Input representation: we need to decide what attributes (features) to use to describe the input
patterns (examples, instances). This implies ignoring other attributes as irrelevant.

Hypothesis class of rectangles for a "family car": (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2), where p1, p2, e1, e2 ∈ R.
(Figure: the training set and the true class region C in the (x1 = price, x2 = engine power) plane.)

• Training set: X = {(xn, yn), n = 1, . . . , N} where xn ∈ R^D is the nth input vector and yn ∈ {0, 1} its
class label.
• Hypothesis (model) class H: the set of classifier functions we will use. Ideally, the true class
distribution C can be represented by a function in H (exactly, or with a small error).
• Having selected H, learning the class reduces to finding an optimal h ∈ H. We don’t know the
true class regions C, but we can approximate them by the empirical error :
E(h; X) = ∑_{n=1}^N I(h(xn) ≠ yn) = number of misclassified instances
There may be more than one optimal h ∈ H. In that case, we achieve better generalization by maximizing the margin (the distance
between the boundary of h and the instances closest to it).

(Figure: left, the hypothesis with the largest margin; right, noise and a more complex hypothesis h2 compared with the simpler h1, in the (x1, x2) plane.)
2.4 Noise
• Noise is any unwanted anomaly in the data. It can be due to:
– Imprecision in recording the input attributes: xn .
– Errors in labeling the input vectors: yn .
– Attributes not considered that affect the label (hidden or latent attributes, may be unob-
servable).
• Noise makes learning harder.
• Should we keep the hypothesis class simple rather than complex?
– Easier to use and to train (fewer parameters, faster).
– Easier to explain or interpret.
– Less variance in the learned model than for a complex model (less affected by single
instances), but also higher bias.
Given comparable empirical error, a simple model will generalize better than a complex one.
(Occam’s razor : simpler explanations are more plausible; eliminate unnecessary complexity.)

2.5 Learning multiple classes


• With K classes, we can code the label as an integer y = k ∈ {1, . . . , K}, or as a one-of-K
binary vector y = (y1 , . . . , yK )T ∈ {0, 1}K (containing a single 1 in position k).
• One approach for K-class classification: consider it as K two-class classification problems, and
minimize the total empirical error:
E({hk}, k = 1, . . . , K; X) = ∑_{n=1}^N ∑_{k=1}^K I(hk(xn) ≠ ynk)

where yn is coded as one-of-K and hk is the two-class classifier for problem k, i.e., hk (x) ∈ {0, 1}.
• Ideally, for a given pattern x only one hk (x) is one. When no, or more than one, hk (x) is one
then the classifier is in doubt and may reject the pattern.
(Figures: regression with polynomials of order 1, 2, 6 fitted to price vs. mileage data (✐ solve for order 1: optimal w0, w1); a 3-class example in the (price, engine power) plane with classes sports car, luxury sedan, family car and a region in doubt ("?").)
2.6 Regression
• Training set X = {(xn, yn), n = 1, . . . , N} where the label for a pattern xn ∈ R^D is a real value yn ∈ R.
In multivariate regression, yn ∈ Rd is a real vector.

• We assume the labels were produced by applying an unknown function f to the instances, and
we want to learn (or estimate) that function using functions h from a hypothesis class H.
Ex: H= class of linear functions: h(x) = w0 + w1 x1 + · · · + wD xD = wT x + w0 .

• Interpolation: we learn a function h(x) that passes through each training pair (xn , yn ) (no
noise): yn = h(xn ), n = 1, . . . , N.
Ex: polynomial interpolation (requires a polynomial of degree N − 1 with N points in general position).

• Regression: we assume random additive noise ǫ: y = h(x) + ǫ.


• Empirical error: E(h; X) = (1/N) ∑_{n=1}^N (yn − h(xn))² = sum of squared errors at each instance.
Other definitions of error possible, e.g. absolute value instead of square.
• Occam’s razor applies again. A line with low enough error may be preferable to a parabola with a slightly lower error.

2.7 Model selection and generalization


• Machine learning problems (classification, regression and others) are typically ill-posed : the
observed data is finite and does not uniquely determine the classification or regression function.
• In order to find a unique solution, and learn something useful, we must make assumptions (=
inductive bias of the learning algorithm).
Ex: the use of a hypothesis class H; the use of the largest margin; the use of the least-squares objective function.

• We can always enlarge the class of functions that can be learned by using a larger hypothesis
class H (higher capacity, or complexity of H).
Ex: a union of rectangles; a polynomial of order N .

• How to choose the right inductive bias, in particular the right hypothesis class H? This is the
model selection problem.
• The goal of ML is not to replicate the training data, but to predict unseen data well, i.e., to
generalize well.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data:
– If H is less complex: underfitting. Ex: fitting a line to data generated from a cubic polynomial.

– If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a line.

• In summary, in ML algorithms there is a tradeoff between 3 factors:


– the complexity c(H) of the hypothesis class
– the amount of training data N
– the generalization error E
so that
– as N ↑, E ↓
– as c(H) ↑, first E ↓ and then E ↑
Cross-validation Often used in practice to select among several hypothesis classes H1 , H2 , . . .
Divide the available dataset into three disjoint parts (say, 1/3 each):
• Training set:
– Used to train, i.e., to fit a hypothesis h ∈ Hi .
– Optimize parameters of h given the model structure and hyperparameters.
– Usually done with an optimization algorithm (the learning algorithm).
Ex: learn the weights of a neural net (with backpropagation); construct a k-nearest-neighbor classifier (by storing the training set).

• Validation set:
– Used to minimize the generalization error.
– Optimize hyperparameters or model structure.
– Usually done with a “grid search”. Ex: try all values of H ∈ {10, 50, 100} and λ ∈ {10−5 , 10−3 , 10−1 }.
Ex: select the number of hidden units H, the architecture, the regularization parameter λ, or how long to train for a neural net;
select k for a k-nearest-neighbor classifier.

• Test set:
– Used to report the generalization error.
– We optimize nothing on it, we just evaluate the final model on it.
Then:
1. For each class Hi , fit its optimal hypothesis hi using the training set.
2. Of all the optimal hypotheses, pick the one that is most accurate in the validation set.
We can then train the selected one on the training and validation set together (useful if we have little data altogether).

3. Report its error in the test set.
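A minimal sketch of this procedure in Python/NumPy, using polynomial regression as the hypothesis classes Hi (the polynomial order plays the role of model complexity); the data and the candidate orders are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)          # unknown true function + noise

# split into training (1/3), validation (1/3), test (1/3)
idx = rng.permutation(300)
tr, va, te = idx[:100], idx[100:200], idx[200:]

def fit(deg, xs, ys):                             # training: least-squares fit
    return np.polyfit(xs, ys, deg)

def err(w, xs, ys):                               # mean squared error
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# 1. fit the optimal hypothesis of each class on the training set
models = {deg: fit(deg, x[tr], y[tr]) for deg in (1, 2, 3, 5, 8)}
# 2. pick the class with lowest error on the validation set
best = min(models, key=lambda deg: err(models[deg], x[va], y[va]))
# 3. report the error of the selected model on the test set
print("chosen degree:", best, " test MSE:", err(models[best], x[te], y[te]))
```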

5.3 Estimation of missing values


• For any given example xn , the values of some features x1n , . . . , xDn may be missing.
Ex: survey nonresponse.

• Strategies to deal with missing values in the training set:


– Discard examples having any missing values. Easy but throws away data.
– Fill in the missing values by estimating them (imputation):
∗ Mean imputation: impute feature d as its average value over the examples where it is
present (for discrete features: mode imputation). Independent of the present features in xn .
∗ Imputation by regression: predict the value of xdn from the values of the present
features in xn . The regression model is learned using present values in the training set.
∗ Multiple imputation: create M complete training sets, each a sample from a model of
the missing data, and combine the M resulting trained models.
• Whether a feature is missing for xn may depend on the values of the other features at xn .
Ex: in a census survey, rich people may wish not to give their salary.

Then, it is better to introduce a binary random variable zdn that models whether xdn is missing.
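A minimal illustration of mean imputation (assumed convention: missing entries are coded as NaN in a toy data matrix):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])  # toy data, NaN = missing

# mean imputation: replace each missing value by that feature's average over the present values
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
print(X_imputed)
```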

8.7 Outlier (novelty, anomaly) detection
• An outlier is an instance that is very different from the other instances in the sample. Reasons:

– Abnormal behaviour. Fraud in credit card transactions, intruder in network traffic, etc.

– Recording error. Faulty sensors, etc.

• Not usually cast as a two-class classification problem because there are typically few outliers
and they don’t fit a consistent pattern that can be easily learned.

• Instead, “one-class classification”: fit a density p(x) to non-outliers, then consider x as an


outlier if p(x) < θ for some threshold θ > 0 (low-probability instance).

• We can also identify outliers as points that are far away from other samples. p. 200

2.8 Dimensions of a supervised ML algorithm


• We have a sample X = {(xn, yn), n = 1, . . . , N} (usually independent and identically distributed, "iid")
drawn from an unknown distribution.

• We want to learn a useful approximation to the underlying function that generated the data.

• We must choose:

– A model h(x; Θ) (hypothesis class) with parameters Θ. A particular value of Θ determines


a particular hypothesis in the class.
Ex: for linear models, Θ = slope w1 and intercept w0 .

– A loss function L(·, ·) to compute the difference between the desired output (label) yn and
our prediction for it, h(xn ; Θ). Approximation error (loss):
E(Θ; X) = ∑_{n=1}^N L(yn, h(xn; Θ)) = sum of errors over instances

Ex: 0/1 loss for classification, squared error for regression.

– An optimization procedure (learning algorithm) to find parameters Θ* that minimize the
error:
Θ* = arg min_Θ E(Θ; X)

Different ML algorithms differ in any of these choices.

• The model, loss and learning algorithm are chosen by the ML system designer so that:

– The model class is large enough to contain a good approximation to the underlying function
that generated the data in X in a noisy form.
– The learning algorithm is efficient and accurate.
– We must have sufficient training data to pinpoint the right model.

3 Bayesian decision theory
Probability review (appendix A):
• Joint distribution: p(X = x, Y = y).
• Conditioning (product rule): p(Y = y | X = x) = p(X = x, Y = y) / p(X = x).
• Marginalizing (sum rule): p(X = x) = ∑_y p(X = x, Y = y).
• Bayes' theorem (inverse probability): p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y).

Probability theory (and Bayes’ rule) is the framework for making decisions under uncertainty. It can
also be used to make rational decisions among multiple actions to minimize expected risk.

3.2 Classification
• Binary classification with random variables x ∈ RD (example) and C ∈ {0, 1} (label).

– Joint probability p(x, C): how likely it is to observe an example x and a class label C.
– Prior probability p(C): how likely it is to observe a class label C, regardless of x.
p(C) ≥ 0 and p(C = 1) + p(C = 0) = 1.
– Class likelihood p(x|C): how likely it is that, having observed an example with class label
C, the example is at x.
This represents how the examples are distributed for each class. We need a model of x for each class.

– Posterior probability p(C|x): how likely it is that, having observed an example x, its class
label is C.
p(C = 1|x) + p(C = 0|x) = 1.
This is what we need to classify x. We infer it from p(C) and p(x|C) using Bayes’ rule.

– Evidence p(x): probability of observing an example x at all (regardless of its class).


Marginal distribution of x: p(x) = p(x|C = 0)p(C = 0) + p(x|C = 1)p(C = 1).
– Bayes' rule: posterior = (prior × likelihood) / evidence:
p(C|x) = p(x|C)p(C) / p(x) = p(x|C)p(C) / [p(x|C = 0)p(C = 0) + p(x|C = 1)p(C = 1)]

• Making a decision on a new example: given x, classify it as class 1 iff p(C = 1|x) > p(C = 0|x).

• Examples:
– Gaussian classes in 1D: p(x|Ck) = (1/(√(2π) σk)) exp(−½ ((x − µk)/σk)²).
(Figure: likelihoods p(x|C1) and p(x|C2) for two Gaussian classes in 1D.)
– Exe 3.1: disease diagnosis based on a test.
Approx. as p(d = 1|t = 1) ≈ p(d = 1) p(t = 1|d = 1)/p(t = 1|d = 0) in terms of the likelihood ratio. How can we increase p(d = 1|t = 1)?

• K-class case: C ∈ {1, . . . , K}.

– Prior probability: p(Ck) ≥ 0, k = 1, . . . , K, and ∑_{k=1}^K p(Ck) = 1.
– Class likelihood: p(x|Ck).
– Bayes' rule: p(Ck|x) = p(x|Ck)p(Ck) / p(x) = p(x|Ck)p(Ck) / ∑_{i=1}^K p(x|Ci)p(Ci).

– Choose as class arg maxk=1,...,K p(Ck |x).

• We learn the distributions p(C) and p(x|C) for each class from data using an algorithm.

• Naive Bayes classifier : a very simple classifier where we assume p(x|Ck) = p(x1|Ck) · · · p(xD|Ck)
for each class k = 1, . . . , K, i.e., within each class the features are independent from each other.
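As a small numerical illustration of Bayes' rule for classification (all priors, means and standard deviations below are made-up values), the posteriors p(Ck|x) can be computed from 1D Gaussian class likelihoods and the most probable class chosen:

```python
import numpy as np

priors = np.array([0.3, 0.5, 0.2])          # p(Ck), must sum to 1
mus    = np.array([0.0, 2.0, 5.0])          # class means (illustrative values)
sigmas = np.array([1.0, 1.0, 2.0])          # class standard deviations

def posterior(x):
    # class likelihoods p(x|Ck) for a 1D Gaussian
    lik = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    joint = lik * priors                     # p(x|Ck) p(Ck)
    return joint / joint.sum()               # Bayes' rule: divide by the evidence p(x)

x = 1.2
post = posterior(x)
print("p(Ck|x) =", post, " -> choose class", np.argmax(post) + 1)
```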

3.3 Losses and risks
• Sometimes the decision of what class to choose for x based on p(Ck |x) has a cost that depends
on the class. In that case, we should choose the class with lowest risk.
Ex: medical diagnosis, earthquake prediction.

• Define:
– λik : loss (cost) incurred for choosing class i when the input actually belongs to class k.
– Expected risk for choosing class i: Ri(x) = ∑_{k=1}^K λik p(Ck|x).

• Decision: we choose the class with minimum risk: arg mink=1,...,K Rk (x).
Ex: C1 = no cancer, C2 = cancer; (λik) = (0 1000; 1 0); p(C1|x) = 0.2.

• Particular case: 0/1 loss.


– Correct decisions have no loss and all incorrect decisions are equally costly:
λik = 0 if i = k, 1 if i ≠ k.
– Then? Ri (x) = 1 − p(Ci |x), hence to minimize the risk we pick the most probable class.
• Up to now we have considered K possible decisions given an input x (each of the K classes).
When incorrect decisions (misclassifications) are costly, it may be useful to define an additional
decision of reject or doubt (and then defer the decision to a human). For the 0/1 loss:
– Define a cost λreject,k = λ ∈ (0, 1) for rejecting an input that actually belongs to class k.
– Risk of choosing class i: Ri(x) = ∑_{k=1}^K λik p(Ck|x) = 1 − p(Ci|x).
Risk of rejecting: Rreject(x) = ∑_{k=1}^K λreject,k p(Ck|x) = λ.
⇒ optimal decision rule (given by the minimal risk)?: arg max{p(C1|x), . . . , p(CK|x), 1 − λ}.
– Extreme cases of λ:
∗ λ = 0: always reject (rejecting is less costly than a correct classification).
∗ λ = 1: never reject (rejecting is costlier than any misclassification).
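A sketch of the minimum-risk decision with a reject option, taking the posteriors and the loss values as given (the numbers reuse the cancer example above, with an illustrative reject cost λ):

```python
import numpy as np

post = np.array([0.2, 0.8])                 # p(C1|x), p(C2|x), e.g. the cancer example
lam  = np.array([[0, 1000],                 # lambda_ik: cost of choosing class i
                 [1, 0]])                   # when the true class is k
lam_reject = 0.5                            # reject cost (same for every true class)

risks = lam @ post                          # R_i(x) = sum_k lambda_ik p(Ck|x)
actions = ["choose C1", "choose C2", "reject"]
all_risks = np.append(risks, lam_reject)    # rejecting has constant risk lambda
print(dict(zip(actions, all_risks)), "->", actions[np.argmin(all_risks)])
```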
3.4 Discriminant functions
• Classification rule: choose arg max_{k=1,...,K} gk(x), where we have a set of K discriminant functions g1, . . . , gK (not necessarily probability-based).
(Figure: decision regions for classes C1, C2, C3 and a reject region in the (x1, x2) plane.)
• Examples:
– Risk based on Bayes' classifier: gk = −Rk.
– With the 0/1 loss: gk(x) = p(Ck|x), or? gk(x) = p(x|Ck)p(Ck), or? gk(x) = log p(x|Ck) + log p(Ck).
– Many others: SVMs, neural nets, etc.
• They divide the feature space into K decision regions Rk = {x: k = arg maxi gi (x)}, k =
1, . . . , K. The regions are separated by decision boundaries, where ties occur.
• With two classes we can define a single discriminant? g(x) = g1 (x) − g2 (x) and choose class 1
iff g(x) > 0.

3.5 Association rules
• Association rule: implication of the form X → Y (X: antecedent, Y : consequent).
• Application: basket analysis (recommend product Y when a customer buys product X).
• Define the following measures of the association rule X → Y :
– Support = p(X, Y) = (# purchases of X and Y) / (# purchases).
For the rule to be significant, the support should be large. High support means items X
and Y are frequently bought together.
– Confidence = p(Y|X) = p(X, Y)/p(X) = (# purchases of X and Y) / (# purchases of X).
For the rule to hold with enough confidence, it should be ≫ p(Y) and close to 1.
• Generalization to more than 2 items: X, Z → Y has confidence p(Y |X, Z), etc.
• Exe 3.7: example database of purchases.
• Given a database of purchases, we want to find all possible rules having enough support and
confidence.
• Algorithm Apriori :
1. Find itemsets with enough support (by finding frequent itemsets).
We don’t need to enumerate all possible subsets of items: for (X, Y, Z) to have enough
support, all its subsets (X, Y ), (Y, Z), (X, Z) must have enough support? . Hence: start
by finding frequent one-item sets (by doing a pass over the database). Then, inductively,
from frequent k-item sets generate k + 1-item sets and check whether they have enough
support (by doing a pass over the database).
2. Convert the found itemsets into rules with enough confidence (by splitting the itemset
into antecedent and consequent).
Again, we don’t need to enumerate all possible subsets: for X → Y, Z to have enough
confidence, X, Y → Z must have enough confidence? . Hence: start by considering a single
consequent and test the confidence for all possible single consequents (adding as a valid
rule if it has enough confidence). Then, inductively, consider two consequents for the
added rules; etc.
• A more general way to establish rules (which may include hidden variables) is graphical
models.
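A small sketch of the support and confidence computations on a toy transaction database (itemsets as Python sets; the Apriori pruning of candidate itemsets is not shown):

```python
# toy database of purchases: each transaction is a set of items
db = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"},
      {"milk", "butter"}, {"milk", "bread", "butter"}]

def support(itemset):
    return sum(itemset <= t for t in db) / len(db)                  # p(itemset)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)   # p(Y|X)

print("support({milk, bread}) =", support({"milk", "bread"}))
print("confidence(milk -> bread) =", confidence({"milk"}, {"bread"}))
```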

19.7 Measuring classifier performance
Binary classification problems
• The most basic measure is the classification error (or accuracy) in %: (# misclassified patterns) / (# patterns).
• It is often of interest to distinguish the different types of errors: confusing the positive class with the negative one, or vice versa.
Ex: voice authentication to log into a user account. A false positive (allowing an impostor) is much worse than a false negative (refusing a valid user).

                   Predicted class
True class      Positive               Negative               Total
Positive        tp: true positive      fn: false negative     p
Negative        fp: false positive     tn: true negative      n
Total           p′                     n′                     N

• Having trained a classifier, we can control fn vs fp through a threshold θ ∈ [0, 1]. Let C1 be
the positive class and C2 the negative class. If the classifier returns p(C1 |x), let us choose the
positive class if p(C1 |x) > θ (so θ = 1/2 gives the usual classifier):
– θ ≈ 1: almost always choose C2 , nearly no false positives but nearly no true positives.
– Decreasing θ increases the number of true positives but risks introducing false positives.
• ROC curve ("receiver operating curve"): the pair values (fp-rate, tp-rate) as a function of θ ∈ [0, 1]. Properties:
– It is always increasing.
– An ideal classifier is at (0, 1) (top left corner).
– Diagonal (fp-rate = tp-rate): random classifier?. This is the worst we can do. Any classifier that is below the diagonal can be improved by flipping its decision?.
• Area under the curve (AUC): reduces the ROC curve to a number. Ideal classifier: AUC = 1.

Measure                       Formula
error                         (fp + fn)/N
accuracy = 1 − error          (tp + tn)/N
tp-rate (hit rate)            tp/p
fp-rate (false alarm rate)    fp/n
precision                     tp/p′
recall = tp-rate              tp/p
F-score                       (precision × recall) / ((precision + recall)/2)
sensitivity = tp-rate         tp/p
specificity = 1 − fp-rate     tn/n
• ROC and AUC allow us to compare classifiers over different loss conditions, and choose a value
of θ accordingly. Often, there is not a dominant classifier.
• Information retrieval : we have a database of records (e.g. documents, images) and a query, for which
some of the records are relevant (“positive”) and the rest are not (“negative”). A retrieval sys-
tem returns K records for the query. Its performance is measured using a precision/recall curve:
– Precision: #(relevant retrieved records) / #(retrieved records).
– Recall : #(relevant retrieved records) / #(relevant records). We can always achieve perfect
recall by returning the entire database (which will contain many irrelevant records).
(Figures: (a) a ROC curve; (b) ROC curves for 3 classifiers. Precision and recall using Venn diagrams: precision = a/(a + b), recall = a/(a + c).)
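A sketch of how the ROC points and the AUC can be computed by sweeping the threshold θ over classifier scores p(C1|x) (the labels and scores below are made up; in practice a library routine would be used):

```python
import numpy as np

y      = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])                        # true labels (1 = positive)
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05])   # p(C1|x)

thresholds = np.sort(np.unique(np.concatenate(([0.0, 1.0], scores))))[::-1]
tpr, fpr = [], []
for th in thresholds:
    pred = scores > th                           # choose positive if p(C1|x) > theta
    tp = np.sum(pred & (y == 1)); fp = np.sum(pred & (y == 0))
    tpr.append(tp / np.sum(y == 1))              # tp-rate = tp/p
    fpr.append(fp / np.sum(y == 0))              # fp-rate = fp/n
fpr, tpr = np.array(fpr), np.array(tpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal rule
print("AUC =", auc)
```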

K > 2 classes
• Again, the most basic measure is the classification error.
• Confusion matrix : K × K matrix where entry (i, j) contains the number of instances of class Ci that are classified as Cj.
• It allows us to identify which types of misclassification errors tend to occur, e.g. if there are two classes that are frequently confused. Ideal classifier: the confusion matrix is diagonal.
(Figure: confusion matrix for the MNIST handwritten digits, true class vs. predicted class 0–9.)

4 (Univariate) parametric methods
• How to learn probability distributions from data (in order to use them to make decisions).
• We assume such distributions follow a particular parametric form (e.g. Gaussian), so we need to estimate its parameters (µ, σ).
• Several ways to learn them:
– by optimizing an objective function (e.g. maximum likelihood)
– by Bayesian estimation.
• This chapter: univariate case; next chapter: multivariate case.

4.2 Maximum likelihood estimation: parametric density estimation


• Problem: estimating a density p(x). Assume an iid sample X = {xn, n = 1, . . . , N} drawn from a known
probability density family p(x; Θ) with parameters Θ. We want to estimate Θ from X .
• Log-likelihood of Θ given X : L(Θ; X) = log p(X; Θ) = log ∏_{n=1}^N p(xn; Θ) = ∑_{n=1}^N log p(xn; Θ).
• Maximum likelihood estimate (MLE): Θ_MLE = arg max_Θ L(Θ; X).
• Examples:
– Bernoulli: Θ = {θ}, p(x; θ) = θ^x (1 − θ)^{1−x} (= θ if x = 1, 1 − θ if x = 0), x ∈ {0, 1}, θ ∈ [0, 1].
MLE?: θ̂ = (1/N) ∑_{n=1}^N xn (sample average).
– Gaussian: Θ = {µ, σ²}, p(x; µ, σ²) = (1/(√(2π) σ)) exp(−½ ((x − µ)/σ)²), x ∈ R, µ ∈ R, σ ∈ R+.
MLE?: µ̂ = (1/N) ∑_{n=1}^N xn (sample average), σ̂² = (1/N) ∑_{n=1}^N (xn − µ̂)² (sample variance).

For more complicated distributions, we usually need an algorithm to find the MLE.
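The two MLEs above in code, on synthetic samples (the estimates are just the sample average and the biased sample variance):

```python
import numpy as np
rng = np.random.default_rng(0)

# Bernoulli sample: MLE of theta is the sample average
xb = rng.binomial(1, 0.3, size=1000)
theta_hat = xb.mean()

# Gaussian sample: MLE of mu and sigma^2 are the sample mean and (biased) sample variance
xg = rng.normal(5.0, 2.0, size=1000)
mu_hat, var_hat = xg.mean(), ((xg - xg.mean()) ** 2).mean()
print(theta_hat, mu_hat, var_hat)
```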

4.4 The Bayes’ estimator: parametric density estimation


• Consider the parameters Θ as random variables themselves (not as unknown numbers), and
assume a prior distribution p(Θ) over them (based on domain information): how likely it is for
the parameters to take a value before having observed any data.
• Posterior distribution: p(Θ|X) = p(X|Θ)p(Θ) / p(X) = p(X|Θ)p(Θ) / ∫ p(X|Θ′) p(Θ′) dΘ′: how likely it is for the parameters to take a value after having observed a sample X .
• Resulting estimate for the probability at a new point x?: p(x|X) = ∫ p(x|Θ) p(Θ|X) dΘ. Hence,
rather than using the prediction of a single Θ value (“frequentist statistics”), we average the
prediction of every parameter value Θ using its posterior distribution (“Bayesian statistics”).
• Approximations: reduce p(Θ|X ) to a single point Θ.
– Maximum a posteriori (MAP) estimate: ΘMAP = arg maxΘ p(Θ|X ).
Particular case: if p(Θ) = constant, then p(Θ|X ) ∝ p(X |Θ) and MAP estimate = MLE.
– Bayes' estimator : Θ_Bayes = E{Θ|X} = ∫ Θ p(Θ|X) dΘ.
Works well if p(Θ|X ) is peaked around a single value.
• Example: suppose xn ∼ N(θ, σ²) and θ ∼ N(µ0, σ0²) where µ0, σ², σ0² are known. Then?:
E{θ|X} = (N/σ²)/(N/σ² + 1/σ0²) µ̂ + (1/σ0²)/(N/σ² + 1/σ0²) µ0.

• Advantages and disadvantages of Bayesian learning:


✓ works well when the sample size N is small (if the prior is helpful).
✗ computationally harder (needs to compute, usually approximately, integrals or summa-
tions); needs to define a prior.
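A small numerical check of the Gaussian-prior example above (µ0, σ², σ0² are assumed known; as N grows, the posterior mean moves from the prior mean towards the MLE):

```python
import numpy as np
rng = np.random.default_rng(1)

sigma2, mu0, sigma02 = 1.0, 0.0, 4.0         # known noise variance and prior N(mu0, sigma0^2)
theta_true = 2.0
for N in (1, 10, 1000):
    x = rng.normal(theta_true, np.sqrt(sigma2), N)
    mu_hat = x.mean()                        # MLE
    w = (N / sigma2) / (N / sigma2 + 1 / sigma02)
    post_mean = w * mu_hat + (1 - w) * mu0   # E{theta | X}
    print(N, round(post_mean, 3))
```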

4.5 Maximum likelihood estimation: parametric classification


• From ch. 3, for classes k = 1, . . . , K, we use:
p(Ck|x) = p(x|Ck)p(Ck) / p(x) = p(x|Ck)p(Ck) / ∑_{i=1}^K p(x|Ci)p(Ci),   discriminant function gk(x) = p(x|Ck)p(Ck).
• Ex: assume x|Ck ∼ N(x; µk, σk²). Estimate the parameters from a data sample {(xn, yn), n = 1, . . . , N}:

– p̂(Ck ) = proportion of yn that are class k.


– µ̂k , σ̂k2 = as the MLE above, separately for each class k.

(Figure: (a) likelihoods p(x|Ci) for two Gaussian classes C1, C2; (b) posteriors p(Ci|x) with equal priors; (c) expected risks R(αi|x), with a reject region between the classes.)

4.6 Maximum likelihood estimation: parametric regression
• Assume there exists an unknown function f that maps inputs x to outputs y = f (x), but that
what we observe as output is a noisy version y = f (x) + ǫ, where ǫ is an random error. We
want to estimate f by a parametric function h(x; Θ). In ch. 2 we saw the least-squares error
was a good loss function to use for that purpose. We will now show that maximum likelihood
estimation under Gaussian noise is equivalent to that.
• Log-likelihood of Θ given a sample {(xn, yn), n = 1, . . . , N} drawn iid from p(x, y):
L(Θ; X) = log ∏_{n=1}^N p(xn, yn) =? ∑_{n=1}^N log p(yn|xn; Θ) + constant.
• Assume an error ǫ ∼ N(0, σ²), so p(y|x) ∼ N(h(x; Θ), σ²). Then maximizing the log-likelihood
is equivalent? to minimizing E(Θ; X) = ∑_{n=1}^N (yn − h(xn; Θ))², i.e., the least-squares error. p. 78

• Examples:
– Linear regression: h(x; w0, w1) = w1 x + w0. LSQ estimate?:
w = A⁻¹ y,   A = (1, (1/N)∑_{n=1}^N xn; (1/N)∑_{n=1}^N xn, (1/N)∑_{n=1}^N xn²),   w = (w0, w1)ᵀ,   y = ((1/N)∑_{n=1}^N yn, (1/N)∑_{n=1}^N yn xn)ᵀ.

– Polynomial regression: h(x; w0, . . . , wk) = wk x^k + · · · + w1 x + w0. The model is still linear
on the parameters. LSQ estimate also of the form w = A⁻¹ y. p. 79
(Figure: fitted curve of price vs. mileage.)
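The LSQ estimate for the linear case, computed directly from the 2 × 2 system above (toy data; np.polyfit would give the same answer):

```python
import numpy as np
rng = np.random.default_rng(0)

x = rng.uniform(0, 10, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 1.0, 50)          # data from a noisy line

A = np.array([[1.0,      x.mean()],
              [x.mean(), (x ** 2).mean()]])
b = np.array([y.mean(), (y * x).mean()])
w0, w1 = np.linalg.solve(A, b)                      # w = A^{-1} y in the notation above
print(w0, w1)
```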

4.3 Evaluating an estimator: bias and variance


• Statistic d(X ): any value that is calculated from a sample X (e.g. average, maximum. . . ). It is a
r.v. with an expectation (over samples) EX {d(X )} and a variance EX {(d(X ) − EX {d(X )})2 }.
• X = (x1 , . . . , xN ) iid sample from p(x; θ). Let d(X ) be an estimator for θ. How good is it?

mean square error of the estimator d: error(d, θ) = EX (d(X ) − θ)2 .
• Bias of the estimator: bθ (d) = EX {d(X )} − θ. How much the expected value of the estimator
over samples differs from the true parameter value.
If bθ (d) = 0 for all θ values: unbiased estimator.
Ex: the sample average (1/N) ∑_{n=1}^N xn is an unbiased estimator of the true mean µ?.

• Variance of the estimator: var{d} = EX{(d(X) − EX{d(X)})²}. How much the estimator
varies around its expected value from one sample to another.
If var{d} → 0 as N → ∞: consistent estimator.
Ex: the sample average is a consistent estimator of the true mean?.
(Figure: estimates di from different samples scattered around their expectation E[d]; bias = distance from E[d] to θ, variance = spread of the di.)
• Bias-variance decomposition: p. 70
error(d, θ) = EX{(d − θ)²} = EX{(d − EX{d})²} + (EX{d} − θ)² = variance + bias² = var{d} + bθ²(d).

(Figure: four combinations of an estimator's bias and variance, illustrated as dart throws: bias↓ var↓; bias↓ var↑; bias↑ var↓; bias↑ var↑.)

4.7 Tuning model complexity: bias-variance dilemma


• Consider the regression setting as before, i.e., y = f(x) + ǫ. Given a sample X = {(xn, yn), n = 1, . . . , N}
drawn iid from p(x, y), we construct a regression estimate h(x). Then, for any h:
Expected square error at x: E{(y − h(x))²|x} =? E{(y − E{y|x})²|x} + (E{y|x} − h(x))², i.e., noise + squared error.
The expectations are over the joint density p(x, y) for fixed x, or, equivalently? , over p(y|x).

– Noise term: variance of y given x. Equal to the variance of the noise ǫ added, independent
of h or X . Can never be removed no matter which estimator we use.
– Squared error term: how much the estimate h(x) deviates (at each x) from the regression
function E {y|x}. Depends on h and X . It is zero if h(x) = E {y|x} for each x.
The ideal? , optimal regression function is h(x) = f (x) = E {y|x} (conditional mean of p(y|x)).
• Consider h(x) as an estimate (for each x) of the true f(x). Bias-variance decomposition at x:
EX{(E{y|x} − h(x))²|x} = (E{y|x} − EX{h(x)})² + EX{(h(x) − EX{h(x)})²} = bias² + variance.
The expectation EX {·} is over samples X of size N drawn from p(x, y). The other expectations are over p(y|x).

– Bias: how much h(x) is wrong on average over different samples.


– Variance: how much h(x) fluctuates around its expected value as the sample varies.
We want both to be small. Ex: let hm be the estimator using a sample Xm :
– hm (x) = 2 ∀x (constant fit): zero var, high bias (unless f (x) ≈ 2 ∀x), high total error.
– hm(x) = (1/N) ∑_{n=1}^N yn^(m) (average of sample Xm): higher var, lower bias, lower total error.
– hm (x) = polynomial of degree k: if k ↑ then bias↓, var↑.
• Low bias requires sufficiently flexible models, but flexible models have high variance. As we
increase complexity, bias decreases (better fit to data) and variance increases (fit varies more
with data). The optimal model has the best trade-off between bias and variance.
– Underfitting: model class doesn’t contain the solution (it is not flexible enough) ⇒ bias.
– Overfitting: model class too general and also learns the noise ⇒ variance.
Also, the variance due to the sample decreases as the sample size increases.

(Figures: left, (a) the true function f and data, and polynomial fits of order 1, 3, 5 for several samples Xm; bias and variance estimated from the samples Xm using the true f, plotted against the polynomial order. Right, cross-validation (f unknown): (a) data and fitted polynomials, (b) training and validation error vs. polynomial order.)

4.8 Model selection procedures


• Methods to find the model complexity that is best suited to the data.
• Cross-validation: the method most used in practice. We cannot calculate the bias and variance
(since we don’t know the true function f ), but we can estimate the total error.
– Given a dataset X , divide it into two parts as training and validation sets, T and V.
– Train candidate models of different complexities on T .
– Evaluate their errors on V.
– Pick the model that gives lowest error on V.
This works because, as the model complexity increases:
– the training error keeps decreasing;
– the validation error first decreases then increases (or stays about constant).
• Regularization: instead of minimizing just the error on the data, minimize
min_h ∑_{n=1}^N (yn − h(xn))² + λ C(h)

where C(h) ≥ 0 measures the model complexity and λ ≥ 0. Ex:


– Model selection criteria (AIC, BIC. . . ): C(h) ∝ number of parameters in h.
– Smoothness penalty: e.g. C(h) = ∑_{i=1}^k wi² for polynomials of degree k.
λ is set by cross-validation.
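A sketch of regularized (ridge-style) polynomial regression: the smoothness penalty λ ∑ wi² is added to the least-squares error and the problem is still solved in closed form (synthetic data; λ would be chosen by cross-validation):

```python
import numpy as np
rng = np.random.default_rng(0)

x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

k, lam = 8, 1e-3                                    # polynomial degree and penalty weight
Phi = np.vander(x, k + 1, increasing=True)          # columns 1, x, x^2, ..., x^k
# minimize sum (y - Phi w)^2 + lam * ||w||^2  ->  (Phi^T Phi + lam I) w = Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(k + 1), Phi.T @ y)
print(w)
```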

5 Multivariate parametric methods
• Sample X = {xn, n = 1, . . . , N} where each example xn contains D features (continuous or discrete).

• We often write the sample (or dataset) as a matrix X = (x1 , . . . , xN ) (examples = columns).
Sometimes as its transpose (examples = rows).

Review of important moments


For a sample:
• Mean vector : µ = (1/N) ∑_{n=1}^N xn.
• Covariance matrix : Σ = (1/N) ∑_{n=1}^N (xn − µ)(xn − µ)ᵀ =? (1/N) ∑_{n=1}^N xn xnᵀ − µµᵀ.
Symmetric, positive definite, diagonal = variances σ11 = σ1², . . . , σDD = σD².
• Correlation matrix : corr(Xi, Xj) = σij/(σi σj) ∈ [−1, 1].
If corr(Xi , Xj ) = 0 (equivalently σij = 0) then Xi and Xj are uncorrelated (but not necessarily independent).
If corr(Xi , Xj ) = ±1 then Xi and Xj are linearly related.

For a continuous distribution (for a discrete distribution, replace ∫ → ∑):
• Mean vector : E{x} = ∫ x p(x) dx.
• Covariance matrix : cov{x} = E{(x − E{x})(x − E{x})ᵀ} =? E{xxᵀ} − E{x}E{x}ᵀ.

5.4 Review of the multivariate normal (Gaussian) distribution


• Density at x ∈ R^D: p(x) = |2πΣ|^{−1/2} exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)). E{x} = µ and cov{x} = Σ.
• Mahalanobis distance: dΣ (x, x′ ) = (x − x′ )T Σ−1 (x − x′ ). Σ = I ⇒ dΣ (x, x′ ) = kx − x′ k2 (Euclidean distance).

• Properties:
– Σ diagonal ⇒ the D features are uncorrelated and independent: p(x) = p(x1 ) · · · p(xD ).
– x is Gaussian ⇒ each marginal p(xi ) and conditional distribution p(xi |xj ) is also Gaussian.
– x ∼ ND (µ, Σ), WD×K ⇒ WT x ∼ NK (WT µ, WT ΣW). A linear projection of a Gaussian is Gaussian.

• Widely used in practice because:


– Its special mathematical properties simplify the calculations.
– Many natural phenomena are approximately Gaussian.
– It models well blobby distributions centered around a prototype vector µ with a noise characterized by
Σ.
It does make strong assumptions: symmetry, unimodality, non-heavy tails.

(Figure: 2D Gaussian densities and their contours for the cases Cov(x1,x2) = 0 with Var(x1) = Var(x2); Cov(x1,x2) = 0 with Var(x1) > Var(x2); Cov(x1,x2) > 0; Cov(x1,x2) < 0.)

5.5 Multivariate classification with Gaussian classes
Assume the class-conditional densities are Gaussian: x|Ck ∼ N (µk , Σk ). The maths carry over from
the 1D case in a straightforward way: p. 100
• MLE? : class proportions for p(Ck ); sample mean and sample covariance for µk and Σk , resp.
• Discriminant function gk(x) = log p(x|Ck) + log p(Ck) =? quadratic form on x.
• Number of parameters per class k = 1, . . . , K: p(Ck): 1; µk: D; Σk: D(D+1)/2?.
A lot of parameters for the covariances! Estimating them may need a large dataset.

Special cases, having fewer parameters:


• Equal (“shared”) covariances: Σk = Σ ∀k. Total D(D+1)/2 parameters (instead of K D(D+1)/2).
– ⇒ MLE✐ for Σ = ∑_{k=1}^K p(Ck) Σk, where Σk = covariance matrix of class k.
– ⇒ the discriminant gk(x) becomes linear on x?. p. 103
– If, in addition, equal priors p(Ck) = 1/K ∀k: Mahalanobis distance classifier
⇒ assign x to the closest centroid µk in Mahalanobis distance (x − µk)ᵀ Σ⁻¹ (x − µk)?.
• Equal, isotropic covariances: Σk = σ²I ∀k. Total 1 parameter for all K covariances.
– ⇒ MLE✐ for σ² = (1/D) ∑_{d=1}^D ∑_{k=1}^K p(Ck) σkd², where σkd² = dth diagonal element of Σk.
– If, in addition, equal priors: Euclidean distance classifier (“template matching”)
⇒ gk(x) = −kx − µk k²? ⇔ gk(x) = µkᵀ x − ½ kµk k²?.
If kµk k² = 1 ⇒ gk(x) = µkᵀ x: dot product classifier. p. 105
• Diagonal covariances (not shared): Σk = diag(σk1², . . . , σkD²). Total KD parameters.
This assumes the features are independent within each class, and gives a naive Bayes classifier.
– ⇒ MLE✐ for σkd² = dth diagonal element of Σk = variance of class k for feature d.

(Figure: class-conditional densities p(x|C1) and posteriors p(C1|x) in 2D for the cases Σ1 ≠ Σ2 full; Σ1 = Σ2 full; Σ1 = Σ2 diagonal; Σ1 = Σ2 = σ²I.)
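A sketch of the Gaussian-classes classifier with full, class-specific covariances: MLE of the per-class parameters, then the quadratic discriminant gk(x) = log p(x|Ck) + log p(Ck) (synthetic 2D data):

```python
import numpy as np
rng = np.random.default_rng(0)

X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 100)
X1 = rng.multivariate_normal([3, 3], [[1, -0.3], [-0.3, 2]], 150)
X, y = np.vstack([X0, X1]), np.array([0] * 100 + [1] * 150)

def fit_class(Xk):
    mu = Xk.mean(axis=0)
    Sigma = (Xk - mu).T @ (Xk - mu) / len(Xk)       # MLE covariance
    return mu, Sigma

params = [fit_class(X[y == k]) for k in (0, 1)]
priors = [np.mean(y == k) for k in (0, 1)]

def g(x, mu, Sigma, prior):                         # log p(x|Ck) + log p(Ck)
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma, d) + np.log(prior)

x_new = np.array([1.5, 1.5])
scores = [g(x_new, mu, S, p) for (mu, S), p in zip(params, priors)]
print("discriminants:", scores, "-> class", int(np.argmax(scores)))
```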

5.6 Tuning complexity
• Again a bias-variance dilemma:
– Too few parameters (e.g. Σ1 = Σ2 = σ²I): high bias.
– Too many parameters (e.g. Σ1 ≠ Σ2, full): high variance.
• We can use cross-validation, regularization, Bayesian priors on the covariances, etc.

5.7 Discrete features with Bernoulli distributions


• Assume that, in each class k, the features are independent and each feature d = 1, . . . , D is
Bernoulli with parameter θkd :
p(x|Ck) = ∏_{d=1}^D θkd^{xd} (1 − θkd)^{1−xd},   x ∈ {0, 1}^D.

– This is a naive Bayes classifier. Its discriminant gk (x) is linear? . p. 108


– MLE: class proportions for p(Ck ); sample mean for θkd .
• Works well with document categorization (e.g. classifying news reports into politics, sports and
fashion) and spam filtering (classifying email messages into spam or non-spam). We typically
represent a document as a bag of words vector x: given a predetermined dictionary of D words,
x is a binary vector of dimension D where xd = 1 iff word d is in the document.
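A sketch of this Bernoulli naive Bayes classifier on binary bag-of-words vectors (tiny made-up documents; a simple add-one smoothing, not in the notes, is used so that no θkd is exactly 0 or 1):

```python
import numpy as np

# toy bag-of-words data: 6 documents, D = 4 dictionary words, 2 classes
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0],
              [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

priors = np.array([np.mean(y == k) for k in (0, 1)])
# MLE of theta_kd = sample mean of feature d within class k (with add-one smoothing)
theta = np.array([(X[y == k].sum(axis=0) + 1) / (np.sum(y == k) + 2) for k in (0, 1)])

def predict(x):
    # log p(x|Ck) + log p(Ck), with p(x|Ck) = prod_d theta_kd^xd (1-theta_kd)^(1-xd)
    loglik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return np.argmax(loglik + np.log(priors))

print(predict(np.array([1, 1, 0, 0])), predict(np.array([0, 0, 1, 1])))
```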

5.8 Multivariate regression


• Linear regression where x ∈ RD : y = f (x) + ǫ with f (x) = wD xD + · · · + w1 x1 + w0 = wT x
(where we define x0 = 1).
• The maths carry over from the 1D case in a straightforward way: p. 110

– Least-squares error (or equivalently maximum likelihood where ǫ is Gaussian with zero
mean and constant variance): E(w) = ∑_{n=1}^N (yn − wᵀxn)².
– The error E is quadratic on w. Equating the derivatives of E wrt w to 0 we obtain the
normal equations (a linear system for w):
∂E/∂w = −2 ∑_{n=1}^N (yn − wᵀxn) xn = 0 ⇒ (∑_{n=1}^N xn xnᵀ) w = ∑_{n=1}^N yn xn ⇒ (XXᵀ)w = Xy

where X (D × N) = (x1, . . . , xN), w = (w0, . . . , wD)ᵀ, y = (y1, . . . , yN)ᵀ.


• The resulting values of the model parameters can give insights into the data:
– the sign of wd tells us whether xd has a positive or negative effect on y;
– the magnitude of wd tells us how influential xd is (if x1 , . . . , xD all have the same range).
• With multiple outputs y = (y1, . . . , yD′)ᵀ, the linear regression y = f(x) + ǫ is equivalently
defined as D′ independent single-output regressions:
E(W) = ∑_{n=1}^N kyn − Wxn k². Taking ∂E/∂W = 0 gives (XXᵀ)Wᵀ = XYᵀ with Y (D′ × N) = (y1, . . . , yN) and W (D′ × D).

• We can fit a nonlinear function h(x) similarly to a linear regression by defining additional,
nonlinear features such as x2 = x2 , x3 = ex , etc. (just as happens with polynomial regression).
Then, using a linear model in the augmented space x = (x, x2 , ex )T will correspond to a
nonlinear model in the original space x. Radial basis function networks, kernel SVMs. . .
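The normal equations in matrix form, solved on synthetic data (examples are the columns of X, with a leading 1 appended for the bias w0):

```python
import numpy as np
rng = np.random.default_rng(0)

N, D = 200, 3
Xf = rng.normal(size=(D, N))                        # raw features, one column per example
y = np.array([0.5, -2.0, 1.0]) @ Xf + 3.0 + rng.normal(0, 0.1, N)

X = np.vstack([np.ones(N), Xf])                     # prepend x0 = 1 (bias)
w = np.linalg.solve(X @ X.T, X @ y)                 # normal equations (X X^T) w = X y
print(w)                                            # approx (3.0, 0.5, -2.0, 1.0)
```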
6 Dimensionality reduction
• If we want to train a classifier (or regressor) on a sample {(xn, yn), n = 1, . . . , N} where the number of
features D in x (the dimension of x) is large:

– Training will be slow.


– Learning a good classifier will require a large sample.

• It is then convenient to transform each example xn ∈ RD into a new example zn = F(xn ) ∈ RL


having lower dimension L < D (as long as we don’t lose much information).

• Two basic ways to do this (ex: x = (x1, x2, x3, x4, x5)ᵀ, reducing to L = 2):
– Feature selection: F(x) = a subset of x1, . . . , xD. Ex: F(x) = (x2, x5)ᵀ.
It doesn't modify the features, it simply selects L and discards the rest.
Ex: best-subset/forward/backward selection.
– Dimensionality reduction (DR): F(x) = a linear combination or some other function of all the x1, . . . , xD. Ex: F(x) = (1x1 + 3x2 − 5x3 + 5x4 − 4x5, 2x1 + 3x2 − 1x3 + 0x4 + 2x5)ᵀ.
It constructs L new features and discards the original D features.
Ex: PCA, LDA. . .

• If reducing to L ≤ 3 dimensions, can visualize the dataset and look for patterns (clusters, etc.).

Review of eigenvalues and eigenvectors


For a real symmetric matrix A of D × D:
• Eigenvalues and eigenvectors of A: Au = λu, with λ ∈ R an eigenvalue and u ∈ R^D an eigenvector.
• A has D eigenvalues λ1 ≥ · · · ≥ λD and D corresponding eigenvectors u1, . . . , uD.
• Eigenvectors of different eigenvalues are orthogonal: uiᵀuj = 0 if i ≠ j.
• A is nonsingular: all λ ≠ 0; positive definite: all λ > 0 (⇔ xᵀAx > 0 ∀x ≠ 0); positive semidefinite: all λ ≥ 0 (⇔ xᵀAx ≥ 0 ∀x ≠ 0).
• Spectral theorem: A symmetric, real with normalized eigenvectors u1, . . . , uD ∈ R^D associated with eigenvalues λ1 ≥ · · · ≥ λD ∈ R ⇒ A = UΛUᵀ = ∑_{i=1}^D λi ui uiᵀ where U = (u1 . . . uD) is orthogonal and Λ = diag(λ1, . . . , λD). In other words, a symmetric real matrix can be diagonalized in terms of its eigenvalues and eigenvectors.
• λ1 = max_{x≠0} xᵀAx/xᵀx ⇔ max xᵀAx s.t. kxk = 1, achieved at x = u1.
λ2 = max_{x≠0} xᵀAx/xᵀx s.t. xᵀu1 = 0, achieved at x = u2. Etc.
• Covariance matrix Σ = (1/N) ∑_{n=1}^N (xn − µ)(xn − µ)ᵀ is positive definite (unless zero variance along some dim.).
Mahalanobis distance (x − µ)ᵀΣ⁻¹(x − µ) = 1 ⇒ ellipsoid with axes = √λ1, . . . , √λD (stdev along PCs).
• w ∈ R^D: var{wᵀx1, . . . , wᵀxN} = wᵀΣw. In general for W (D × L): cov{Wᵀx1, . . . , WᵀxN} = WᵀΣW.

6.2 Feature selection: forward selection
• Problem: given a sample {(xn, yn), n = 1, . . . , N} with xn ∈ R^D, determine the best subset of the D
features such that the number of selected features L is as small as possible and the classification
accuracy (using a given classifier, e.g. a linear SVM) is as high as possible.
Using all D features will always give the highest accuracy on the training set but not necessarily on the validation set.

• Useful when some features are unnecessary (e.g. irrelevant for classification or pure noise) or
redundant (so we don’t need them all).
Useful with e.g. microarray data. Not useful with e.g. image pixels.

• Best-subset selection: for each subset of features, train a classifier and evaluate it on a validation
set. Pick the subset having highest accuracy and up to L features.
Combinatorial optimization: 2^D possible subsets of D features?. Ex: D = 3: {∅, {x1}, {x2}, {x3}, {x1, x2}, {x1, x3}, {x2, x3}, {x1, x2, x3}}.
Brute-force search only possible for small D ⇒ approximate search.

• Forward selection: starting with an empty subset F , sequentially add one new feature at a
time. We add the feature d ∈ {1, . . . , D} such that the classifier trained on F ∪ {d} has highest
classification accuracy in the validation set. Stop when the accuracy doesn’t improve much, or
when we reach L features.
• Backward selection: same thing but start with F = {1, . . . , D} and remove one feature at a time.
• Both are greedy algorithms and are not guaranteed to find an optimal subset.
• Forward selection is preferable when we expect the optimal subset to contain few features (faster
than backward selection). It trains O(D²) classifiers if we try up to D features.
• These feature selection algorithms are supervised : they use the labels yn when training the
classifier. The features selected depend on the classifier we use. There are also unsupervised algorithms.
Ex: forward selection on the Iris dataset (D = 4 features). p. 119
(Figure: forward selection on the Iris dataset: pairwise scatterplots of the four features F1–F4.)
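A sketch of forward selection wrapped around a classifier with a train/validation split; here a simple nearest-centroid rule stands in for the classifier, and the data are synthetic with only two informative features:

```python
import numpy as np
rng = np.random.default_rng(0)

N, D = 300, 6
X = rng.normal(size=(N, D))
y = (X[:, 0] + X[:, 2] > 0).astype(int)              # only features 0 and 2 are informative
tr, va = np.arange(0, 200), np.arange(200, 300)

def accuracy(features):
    Xtr, Xva = X[np.ix_(tr, features)], X[np.ix_(va, features)]
    mus = np.array([Xtr[y[tr] == k].mean(axis=0) for k in (0, 1)])   # nearest-centroid classifier
    pred = np.argmin(((Xva[:, None, :] - mus) ** 2).sum(axis=2), axis=1)
    return np.mean(pred == y[va])

F, best = [], 0.0
while len(F) < D:
    cand = [(accuracy(F + [d]), d) for d in range(D) if d not in F]
    acc, d = max(cand)
    if acc <= best: break                            # stop when validation accuracy stops improving
    F.append(d); best = acc
print("selected features:", F, "validation accuracy:", best)
```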

6.3 Dimensionality reduction: principal component analysis (PCA)
(Figure: (a) scree graph (eigenvalues of the covariance matrix) for Optdigits; (b) proportion of variance explained as a function of the number of eigenvectors; scatterplot of Optdigits projected onto the first two principal components.)

• Aims at preserving most of the signal information.


Find a low-dimensional space such that when x is projected there, information loss is minimized.

• Which direction w ∈ R^D shows most variation? max_w wᵀΣw s.t. kwk = 1 ⇒ w = u1?. p. 120

• Unsupervised linear DR method: given {xn, n = 1, . . . , N} ⊂ R^D (with mean zero and covariance matrix
Σ of D × D), when reducing dimension to L < D, PCA finds:
– a linear projection mapping F: x ∈ R^D → Wᵀx ∈ R^L, and
– a linear reconstruction mapping f: z ∈ R^L → Wz ∈ R^D,
where W (D × L) has orthonormal columns (WᵀW = I), that are optimal in two equivalent senses:
– Maximum projected variance: max_W tr(cov{Wᵀx1, . . . , WᵀxN}) =? tr(WᵀΣW).
– Minimum reconstruction error : min_W (1/N) ∑_{n=1}^N kxn − Wzn k² =? −tr(WᵀΣW) + constant.

• Covariance in the projected space cov {Z} = WT ΣW is diagonal ⇒ uncorrelated projections.


• If the mean of the sample is µ = (1/N) ∑_{n=1}^N xn ⇒ F(x) = Wᵀ(x − µ) and f(z) = Wz + µ.

• How to compute W, given Σ? Eigenproblem maxW tr WT ΣW s.t. WT W = I whose solution
is given by the spectral theorem. Decompose Σ = UΛUT with eigenvectors U = (u1 . . . uD )
and eigenvalues Λ = diag (λ1 , . . . , λD ), sorted decreasingly. Then W = U1:L = (u1 , . . . , uL ),
i.e., the eigenvectors associated with the largest L eigenvalues of the covariance matrix Σ.
• Total variance of the data: λ1 + · · · + λD = tr(Σ) = σ1² + · · · + σD².
Variance “explained” by the latent space: λ1 + · · · + λL = tr(WᵀΣW).
We can use the proportion (λ1 + · · · + λL)/(λ1 + · · · + λD) ∈ [0, 1] of explained variance to determine a good value
for L (e.g. 90% of the variance, which usually will be achieved with L ≪ D).
• In practice with high-dimensional data (e.g. images), a few principal components explain most
of the variance if there are correlations among the features.
• Useful as a preprocessing step for classification/regression: reduce the number of features; partly remove noise.
• Basic disadvantage: it fails with nonlinear manifolds.
• Related linear DR methods:
– Factor analysis: essentially, a probabilistic version of PCA.
– Canonical correlation analysis (CCA): projects two sets of features x, y onto a common
latent space z.
• Related nonlinear DR methods: autoencoders (based on neural nets), etc.
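A sketch of PCA exactly as described above: center the data, eigendecompose the covariance matrix, keep the leading L eigenvectors, and report the proportion of variance explained (synthetic correlated data):

```python
import numpy as np
rng = np.random.default_rng(0)

# synthetic data with correlated features, examples as rows (N x D)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0, 0.5], [0.0, 0.5, 2.0]])
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)              # sample covariance (D x D)

lam, U = np.linalg.eigh(Sigma)                      # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]                      # sort decreasingly
L = 2
W = U[:, :L]                                        # leading L eigenvectors
Z = (X - mu) @ W                                    # projections F(x) = W^T (x - mu)
X_rec = Z @ W.T + mu                                # reconstructions f(z) = W z + mu
print("proportion of variance explained:", lam[:L].sum() / lam.sum())
```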
6.8 Dimensionality reduction: linear discriminant analysis (LDA)
(Figure: Optdigits projected onto the first two LDA directions; illustration of two classes with means m1, m2 and scatters s1², s2² along a projection direction w; comparison of the PCA and LDA projections of a 2-class dataset.)

• Aims at preserving most of the signal information that is useful to discriminate among the classes.
Find a low-dimensional space such that when x is projected there, classes are well separated.

• Supervised linear DR method: given {(xn, yn), n = 1, . . . , N} where xn ∈ R^D is a high-dimensional feature
vector and yn ∈ {1, . . . , K} a class label, when reducing dimension to L < D, LDA finds a
linear projection mapping F: x ∈ RD → WT x ∈ RL with WD×L that is optimal in maximally
separating the classes from each other while maximally compressing each class.
Unlike PCA, LDA does not find a reconstruction mapping f : z ∈ RL → Wz ∈ RD . It only finds the projection mapping F.

• Define:
– Number of points in class k: Nk. Mean of class k: µk = (1/Nk) ∑_{yn=k} xn.
– Within-class scatter matrix for class k: Sk = ∑_{yn=k} (xn − µk)(xn − µk)ᵀ.
– Total within-class scatter matrix: SW = ∑_{k=1}^K Sk.
– Between-class scatter matrix: SB = ∑_{k=1}^K Nk (µk − µ)(µk − µ)ᵀ where µ = (1/K) ∑_{k=1}^K µk.

• In the latent space, the between-class and within-class scatter matrices are WᵀSBW and
WᵀSWW (of L × L).
• Fisher discriminant: max_W J(W) = |WᵀSBW| / |WᵀSWW| = between-class scatter / within-class scatter.
• This is an eigenproblem whose solution is W = (u1, . . . , uL) = eigenvectors associated with the
largest L eigenvalues of SW⁻¹SB.
rank(SB) ≤ K − 1? (so we can only use values of L that satisfy 1 ≤ L ≤ K − 1).
SW must be invertible (if it is not, apply PCA to the data and eliminate directions with zero variance).
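A sketch of LDA as the eigenproblem above on SW⁻¹SB (synthetic 3-class data, so L ≤ K − 1 = 2; the scatter matrices follow the definitions above):

```python
import numpy as np
rng = np.random.default_rng(0)

# synthetic 3-class data in D = 4 dimensions
means = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(80, 4)) for m in means])
y = np.repeat([0, 1, 2], 80)

K, L = 3, 2
mus = np.array([X[y == k].mean(axis=0) for k in range(K)])
mu = mus.mean(axis=0)
SW = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(K))
SB = sum(np.sum(y == k) * np.outer(mus[k] - mu, mus[k] - mu) for k in range(K))

# eigenvectors of SW^{-1} SB with the largest L eigenvalues
lam, U = np.linalg.eig(np.linalg.solve(SW, SB))
order = np.argsort(lam.real)[::-1]
W = U[:, order[:L]].real
Z = X @ W                                           # L-dimensional projections
print(W)
```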

6.7 Dimensionality reduction: multidimensional scaling (MDS)
(Figure: true distances along the earth's surface between European cities, and their estimated positions on a 2D map found by MDS.)

• Aims at preserving distances or similarities.


Place N points in a low-dimensional map (of dimension L) such that their distances are well preserved.

• Unsupervised DR method: given the matrix of squared Euclidean distances d2nm = kxn − xm k2
between N data points, MDS finds points z1 , . . . , zN ∈ RL that approximate those distances:
min_Z ∑_{n,m=1}^N (d²nm − kzn − zm k²)².

• MDS does not use as training data the actual feature vectors xn ∈ RD , only the pairwise
distances dnm . Hence, it is applicable even when the “distances” are computed between objects
that are not represented by features. Ex: perceptual distance between two different colors according to a subject.
• If d²nm = kxn − xm k² where xn ∈ R^D and D ≥ L, then MDS is equivalent to PCA on {xn, n = 1, . . . , N}. p. 137

• MDS does not produce a projection or reconstruction mapping, only the actual L-dimensional
projections z1 , . . . , zN ∈ RL for the N training points x1 , . . . , xN ∈ RD .
• How to learn a projection mapping F: x ∈ R^D → z ∈ R^L with parameters Θ?
– Direct fit: find the projections z1, . . . , zN by MDS and then solve a nonlinear regression: min_Θ ∑_{n=1}^N (zn − F(xn; Θ))².
– Parametric embedding (requires a nonlinear optimization): min_Θ ∑_{n,m=1}^N (d²nm − kF(xn; Θ) − F(xm; Θ)k²)².

• Generalizations of MDS:
– Spectral methods: Isomap, Locally Linear Embedding, Laplacian eigenmaps. . .
Require solving an eigenproblem.
Isomap: define dnm = geodesic distances (approximated by shortest paths in a nearest-neighbor graph of the sample).

– Nonlinear embeddings: elastic embedding, t-SNE. . .


Require solving a nonlinear optimization.
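A sketch of "classical" MDS from a matrix of squared Euclidean distances: double-center the matrix and eigendecompose, which recovers the PCA solution mentioned above up to rotation (this is one standard way of solving the MDS problem, not the only one):

```python
import numpy as np
rng = np.random.default_rng(0)

X = rng.normal(size=(50, 5))                        # original points (only used to build distances)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) # squared Euclidean distances d^2_nm

N, L = len(D2), 2
J = np.eye(N) - np.ones((N, N)) / N                 # centering matrix
B = -0.5 * J @ D2 @ J                               # Gram matrix of the centered points
lam, U = np.linalg.eigh(B)
lam, U = lam[::-1], U[:, ::-1]
Z = U[:, :L] * np.sqrt(np.maximum(lam[:L], 0))      # L-dimensional map points z_1..z_N
print(Z[:3])
```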
7 Clustering
• Unsupervised problem: given a sample {xn, n = 1, . . . , N} ⊂ R^D, partition it into groups such that points
within each group are similar and groups are dissimilar (“hard partition”).
Or, determine soft assignments znk of point xn to cluster k for n = 1, . . . , N and k = 1, . . . , K,
where znk ∈ [0, 1] and ∑_{k=1}^K znk = 1 (“soft partition”).
• Useful to understand structure in the data, or as preprocessing for supervised learning (e.g.
train a classifier for each cluster). Clustering does not need labels, which are usually costly to obtain.
• Examples:
– Customer segmentation: group customers according to demographic and transaction at-
tributes, then provide different strategies for each segment.
– Image segmentation: partition pixels into meaningful objects in the image.
Represent each pixel xn by a feature vector, typically position & intensity (i, j, I) or color (i, j, L∗ , u∗ , v∗ ).

• Basic types:
– Centroid-based : find a prototype (“centroid”) for each cluster.
k-means, k-medoids, k-modes, etc.

– Probabilistic models: mixture densities, where each component models one cluster.
Gaussian mixtures. Usually trained with the EM algorithm.

– Density-based : find regions of high density of data.


Mean-shift clustering, level-set clustering, etc.

– Graph-based (or distance- or similarity-based ): construct a graph with the data points as
vertices and edges weighted by distance values, and partition it.
Hierarchical clustering, connected-components clustering, spectral clustering, etc.

• Ill-defined problem: what does “similar” mean, and how similar should points be to be in the
same cluster? Hence, many different definitions of clustering and many different algorithms.
This is typical of exploratory tasks, where we want to understand the structure of the data before
committing to specific analyses. In practice, one should try different clustering algorithms.

7.2 Centroid-based clustering: k-means clustering


• Two uses: clustering and vector quantization.
• User parameter: number of clusters K. Output: clusters and a centroid µk ∈ RD per cluster.
• Objective function of centroids µ1 , . . . , µK and cluster assignments ZN ×K :
E({µk}, Z) = ∑_{n=1}^N ∑_{k=1}^K znk kxn − µk k²   s.t. Z ∈ {0, 1}^{N×K}, Z1 = 1.

• No closed-form solution. This problem is NP-complete (runtime ≥ exponential on problem


size). Instead, we do a local search with an iterative algorithm (alternating optimization):
– Assignment step (over Z given {µk}):
for each n = 1, . . . , N, assign xn to the cluster with the closest centroid to xn.
– Centroid step (over {µk} given Z):
for each k = 1, . . . , K, µk = (∑_{n=1}^N znk xn) / (∑_{n=1}^N znk) = mean of the points currently assigned to cluster k.
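A sketch of the alternating optimization above: random initial assignments, then centroid and assignment steps until the assignments stop changing (empty clusters are not handled in this toy version):

```python
import numpy as np
rng = np.random.default_rng(0)

X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 3])])
K = 3
z = rng.integers(K, size=len(X))                    # random initial hard assignments

while True:
    mus = np.array([X[z == k].mean(axis=0) for k in range(K)])       # centroid step
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
    z_new = d2.argmin(axis=1)                                        # assignment step
    if np.array_equal(z, z_new): break                               # no changes: stop
    z = z_new
print("centroids:\n", mus)
```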

(Figure: k-means on a 2D dataset: initial centroids, and the clusters and centroids after 1, 2 and 3 iterations.)

• Each iteration (centroid + assignment step) reduces E or leaves it unchanged. After a finite
number of iterations, no more changes occur and we stop.
• The result depends on the initialization. We usually initialize Z by a random assignment of
points to clusters, run k-means from several such random Z, and pick the best result.
• Vector quantization: compress a continuous space of dimension D, represented by a sample
X = {xn, n = 1, . . . , N}, into a finite set of reference vectors (codebook), with minimal distortion.

– Ex: color quantization. Represent 24-bit color images using only 256 colors with minimal distortion.
– It can be done with k-means clustering using large K. Better than uniform quantization, because it
adapts to density variations in the data and does not require a codebook of size exponential on D.

– A new point x ∈ RD is quantized as µk∗ with k ∗ = arg mink=1,...,K kx − µk k.


This partitions the space into a Voronoi tessellation.
– Lossy compression: encode x ∈ RD (D floats) as an integer k ∈ {1, . . . , K} (⌈log2 K⌉ bits).
Also needs to store the codebook.

(Figure: vector quantization as an encoder/decoder pair: the encoder finds the codebook vector mi closest to x and sends its index i over the communication line; the decoder outputs x′ = mi.)
7.1 Mixture densities and the EM algorithm
• Mixture density with K components: p(x; Θ) = ∑_{k=1}^K p(x|k)p(k), where the p(x|k) are the component densities and p(k) = πk the mixture proportions.

• Ex: Gaussian mixture: x|k ∼ N(x; µk, Σk). Mixture parameters: Θ = {πk, µk, Σk}, k = 1, . . . , K.

• Maximum likelihood estimation of Gaussian mixture parameters: given a sample X = {xn, n = 1, . . . , N}:
max_Θ L(Θ; X) = ∑_{n=1}^N log p(xn; Θ) = ∑_{n=1}^N log (∑_{k=1}^K p(xn|k)p(k)).

• L cannot be maximized in closed form over Θ; it needs an iterative optimization algorithm.


Many such algorithms exist (such as gradient descent), but there is a specially convenient one
for mixture models (and more generally, for maximum likelihood with missing data).
• Expectation-Maximization (EM) algorithm: for Gaussian mixtures:
  – E step: given the current parameter values Θ, compute the posterior probability of
    component k given data point x_n (for each k = 1, . . . , K and n = 1, . . . , N):

    z_nk = p(k|x_n; Θ) = p(x_n|k) p(k) / p(x_n; Θ)
         = π_k |2πΣ_k|^{-1/2} exp(−½ (x_n − µ_k)^T Σ_k^{-1} (x_n − µ_k))
           / Σ_{k'=1}^K π_{k'} |2πΣ_{k'}|^{-1/2} exp(−½ (x_n − µ_{k'})^T Σ_{k'}^{-1} (x_n − µ_{k'}))   ∈ (0, 1).

  – M step: given the posterior probabilities, estimate the parameters Θ: for k = 1, . . . , K:

    π_k = (1/N) Σ_{n=1}^N z_nk        µ_k = Σ_{n=1}^N z_nk x_n / Σ_{n=1}^N z_nk
    Σ_k = Σ_{n=1}^N z_nk (x_n − µ_k)(x_n − µ_k)^T / Σ_{n=1}^N z_nk.

Similar to k-means, where the assignment and centroid steps correspond to the E and M steps.
But in EM the assignments are soft (znk ∈ [0, 1]), while in k-means they are hard (znk ∈ {0, 1}).
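  A sketch of one EM iteration for a Gaussian mixture in Python/numpy (the small ridge added to each Σ_k for numerical stability is an implementation choice, not part of the derivation):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    # X: N x D data; pi: K mixture proportions; mu: K x D means; Sigma: K x D x D covariances.
    N, D = X.shape
    K = len(pi)
    # E step: responsibilities z_nk = p(k | x_n; Theta).
    Z = np.zeros((N, K))
    for k in range(K):
        Z[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    Z /= Z.sum(axis=1, keepdims=True)
    # M step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities.
    Nk = Z.sum(axis=0)                          # "soft" number of points per component
    pi = Nk / N
    mu = (Z.T @ X) / Nk[:, None]
    Sigma = np.zeros((K, D, D))
    for k in range(K):
        Xc = X - mu[k]
        Sigma[k] = (Z[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)   # ridge: assumption
    return pi, mu, Sigma, Z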
• If we knew which component xn came from for each n = 1, . . . , N, we’d not need the E step:
a single M step that estimates each component’s parameters on its set of points would suffice.
This was the case in classification (where x|Ck is Gaussian): we are given (xn , yn ).
• Each EM step increases L or leaves it unchanged, but it takes an infinite number of iterations
to converge. In practice, we stop when the parameters don’t change much, or when the number
of iterations reaches a limit.
• EM converges to a local optimum that depends on the initial value of Θ; Θ is usually initialized from the result of k-means.

• User parameter: number of clusters K. Output: posterior probabilities {p(k|x_n)} and {π_k, µ_k, Σ_k}_{k=1}^K.

• Parametric clustering: K clusters, assumed Gaussian.


• The fundamental advantage of Gaussian mixtures over k-means for clustering is that we can
model the uncertainty in the assignments (particularly useful for points near cluster bound-
aries), and the clusters can be elliptical and have different proportions.

                               k-means                      EM for Gaussian mixtures
  assignments z_nk             hard                         soft, p(k|x_n)
  probability model?           no                           yes
  number of iterations         finite                       infinite
  parameters                   centroids {µ_k}_{k=1}^K      {π_k, µ_k, Σ_k}_{k=1}^K

Density-based clustering: mean-shift clustering

  [Figure: a 1D density p(x) with three modes, defining clusters C_1, C_2, C_3.]

• Define a function p(x) that represents the density of the dataset {x_n}_{n=1}^N ⊂ R^D, then declare
  each mode (maximum) of p as a cluster representative and assign each x_n to a mode via the
  mean-shift algorithm. Most useful with low-dimensional data, e.g. image segmentation (where x_n = features of the nth pixel).
• Kernel density estimate with bandwidth σ: a mixture having one component for each data point:

  p(x) = Σ_{n=1}^N p(x|n) p(n) = (1/(N σ^D)) Σ_{n=1}^N K((x − x_n)/σ)    x ∈ R^D.

  Usually the kernel K is Gaussian: K((x − x_n)/σ) ∝ exp(−½ ‖(x − x_n)/σ‖²).

• Mean-shift algorithm: starting from an initial value of x, it iterates the following expression:

  x ← Σ_{n=1}^N p(n|x) x_n    where    p(n|x) = p(x|n) p(n) / p(x) = exp(−½ ‖(x − x_n)/σ‖²) / Σ_{n'=1}^N exp(−½ ‖(x − x_{n'})/σ‖²).

  Σ_{n=1}^N p(n|x) x_n can be understood as the weighted average of the N data points using as
  weights the posterior probabilities p(n|x). The mean-shift algorithm converges to a mode of
  p(x). Which one it converges to depends on the initialization. By running mean-shift starting
  at a data point x_n, we effectively assign x_n to a mode. We repeat for all points x_1, . . . , x_N.
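  A sketch of the mean-shift iteration for one starting point, assuming a Gaussian kernel (the tolerance and iteration cap are implementation choices):

import numpy as np

def mean_shift_mode(x, X, sigma, iters=500, tol=1e-5):
    # Iterate x <- sum_n p(n|x) x_n; converges to a mode of the kernel density estimate.
    for _ in range(iters):
        w = np.exp(-0.5 * np.sum(((x - X) / sigma) ** 2, axis=1))   # unnormalized p(n|x)
        x_new = w @ X / w.sum()                                     # weighted average of the data
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Clustering: run mean_shift_mode starting from every data point x_n; points whose modes
# (nearly) coincide are put in the same cluster.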
• User parameter: σ, which determines the number of clusters (σ ↓: N clusters, σ ↑: 1 cluster).
Output: modes and clusters.
• Nonparametric clustering: no assumption on the shape of the clusters or their number.
Graph-based clustering: connected-components clustering

• Define a distance d(x, y) for pairs of instances x, y:
  – Minkowski distance d(x, y) = (Σ_{d=1}^D |x_d − y_d|^p)^{1/p}, if x, y ∈ R^D.
    p = 1: city-block distance; p = 2: Euclidean distance.
  – We may use distances without x and y being defined by explicit feature vectors.
    Ex: number of words that two documents have in common; “edit” distance between strings (Exe. 6).

  [Figures: ǫ-ball graph and k-nearest-neighbor graph (k-NNG) of a 2D dataset.]

• Define a neighborhood graph with vertices = data points and edges xn ∼ xm if:
– d(xn , xm ) ≤ ǫ (ǫ-ball graph)
– xn and xm are among the k-nearest-neighbors of each other (k-nearest-neighbor graph).
• Clusters = connected-components of the graph (which can be found by depth-first search).
• User parameter: ǫ > 0 or k ∈ N. The number of clusters depends on that (ǫ ↓: N, ǫ ↑: 1).
• It can handle complex cluster shapes, but works only with non-overlapping clusters.
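  A sketch of ǫ-ball connected-components clustering, using scipy's graph routines (the distance computation is brute force for clarity):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def epsilon_ball_clusters(X, eps):
    # Adjacency of the eps-ball graph: an edge x_n ~ x_m whenever d(x_n, x_m) <= eps.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)     # squared Euclidean distances
    A = csr_matrix(D2 <= eps ** 2)
    # Clusters = connected components of the graph (scipy uses a graph search internally).
    n_clusters, labels = connected_components(A, directed=False)
    return n_clusters, labels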
7.8 Graph-based clustering: hierarchical clustering
• Generates a nested sequence of clusterings.
• Agglomerative clustering: start with N clusters (= singleton points) and repeatedly merge pairs
of clusters until there is only one cluster left containing all points.
Divisive clustering: start with one cluster containing all points and repeatedly divide it until we have N clusters (= singleton points).

• We merge the two closest clusters Ci , Cj according to a distance δ(Ci , Cj ) between clusters:
  – Single-link clustering: δ(C_i, C_j) = min_{x_n ∈ C_i, x_m ∈ C_j} d(x_n, x_m).
    Tends to produce elongated clusters (“chaining” effect).
    Equivalent to Kruskal’s algorithm to find a minimum spanning tree of a weighted graph G = (V, E, w) where V = {x_n}_{n=1}^N,
    E = V × V and w_nm = d(x_n, x_m).

  – Complete-link clustering: δ(C_i, C_j) = max_{x_n ∈ C_i, x_m ∈ C_j} d(x_n, x_m).
    Tends to produce compact clusters.
• Result: dendrogram, a binary tree where leaves = instances xn and internal nodes = merges.
We can obtain a specific clustering from the tree by allowing only distances d(xn , xm ) ≤ ǫ
(equivalent to connected-components for single-link clustering).
• User parameter: when to stop merging. Output: dendrogram and clusters.
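  Agglomerative clustering is available off the shelf, e.g. in scipy (a sketch; method='single' or 'complete' selects the cluster distance δ, and the cut at a distance ǫ extracts a specific clustering from the dendrogram):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(100, 2)                        # toy 2D data (illustrative)
Z = linkage(X, method='single')                    # Z encodes the dendrogram (merges and their distances)
labels = fcluster(Z, t=0.5, criterion='distance')  # keep only merges at distance <= 0.5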

  [Figure: example dendrogram over instances a, b, c, d, e, f, with merge heights 1 to 3 on the vertical axis.]

7.10 How to choose the number of clusters or other user parameters


• Typical user parameters:
– K: number of clusters (k-means, mixtures).
– σ, ǫ: “scale” in feature space (mean-shift, connected-components, hierarchical clustering).
– k: number of neighbors (connected-components).
• How to set K, σ, etc.? Don’t trust “automatic” algorithms that select all user parameters for you!
– Try several values (and several clustering algorithms) and explore the results:
∗ Plot the objective function (error, log-likelihood, etc.) vs K and look for an “elbow”.
∗ Project to 2D with PCA and inspect the result.
– In some applications, K may be fixed or known. Color quantization, medical image segmentation.

8 Nonparametric methods
Parametric methods

• Examples:

  – Linear function f(x; Θ) = w_1 x + w_0 (Θ = {w_0, w_1}).

  – Gaussian distribution p(x; Θ) = (1/(√(2π) σ)) exp(−½ ((x − µ)/σ)²) (Θ = {µ, σ²}).

  – Gaussian mixture p(x; Θ) = Σ_{k=1}^K p(x|k) p(k) (Θ = {π_k, µ_k, Σ_k}_{k=1}^K).

  – etc.

• Training is the process of choosing the best parameter values Θ from a given dataset.

• The number of parameters is fixed and independent of (typically much smaller than) the size N of
  the training set. The model “compresses” the dataset into the parameters Θ, and the complexity of the model doesn’t
  grow with N. Ex: a linear function f(x) = w_1 x + w_0 has 2 parameters (w_0, w_1) regardless of the size N of the training set.

• After training, we discard the dataset, and use only the parameters to apply the model to
  future data. Ex: keep the 2 parameters (w_0, w_1) of the learned linear function and discard the training set {(x_n, y_n)}_{n=1}^N.

• There are user parameters: number of components K, regularization parameter λ, etc.

• The hypothesis space is restricted: only models of a certain parametric form (linear, etc.).

• Small complexity at test time as a function of N: both memory and time are O(1).

Nonparametric methods

• There is no training.

• The size of the model is given by the size of the dataset N, hence the complexity of the model
can grow with N.

• We keep the training set, which we need to apply the model to future data.

• There are user parameters: number of neighbors k or kernel bandwidth h.

• The hypothesis space is less restricted (fewer assumptions). Often based on a smoothness
assumption: similar instances have similar outputs, and the output for an instance is a local
function of its neighboring instances. Essentially, the model behaves like a lookup table that we
use to interpolate future instances.

• Large complexity at test time as a function of N: both memory and time are O(N), because
we have to store all training instances and look through them to make predictions.

• They are particularly preferable to parametric methods with small training sets, where they
give better models and are relatively efficient computationally.

• Also called instance-based or memory-based learning algorithms.

• Examples: kernel density estimate, nearest-neighbor classifier, running mean smoother, etc.

8.2 Nonparametric density estimation
• Given a sample X = {x_n}_{n=1}^N drawn iid from an unknown density, we want to construct an
  estimator p(x) of the density.
• Histogram (consider first x ∈ R): split the real line into bins [x_0 + mh, x_0 + (m + 1)h] of width
  h for m ∈ Z, and count points in each bin:

  p(x) = (1/(Nh)) × (number of x_n in the same bin as x)    x ∈ R.
– We need to select the bin width h and the origin x0 .
– x0 has a small but annoying effect on the histogram (near bin boundaries).
– h controls the histogram smoothness: spiky if h ↓ and smooth if h ↑.
– p(x) is discontinuous at bin boundaries.
– We don’t have to retain the training set once we have computed the counts.
  – They generalize to D dimensions, but are practically useful only for D ≲ 2.
    In D dimensions, a histogram requires an exponential number of bins, most of which are empty.

• Kernel density estimate (Parzen windows): generalization of histograms to define smooth,
  multivariate density estimates. Place a kernel K(·) on each data point and sum them:

  p(x) = (1/(N h^D)) Σ_{n=1}^N K((x − x_n)/h)    x ∈ R^D    “sum of bumps”.

  – K must satisfy K(t) ≥ 0 ∀t and (1/h^D) ∫ K(t/h) dt = 1 (so that p integrates to 1); typically K is Gaussian or uniform.
    Gaussian: K((x − x_n)/h) ∝ exp(−½ ‖(x − x_n)/h‖²). The uniform kernel gives a histogram without an origin x_0.

– Only parameter: the bandwidth h > 0. The KDE is spiky if h ↓, smooth if h ↑.


The KDE is not very sensitive to the choice of K.

– p(x) is continuous and differentiable if K is continuous and differentiable.


  – In practice, we can take K(‖(x − x_n)/h‖) = 0 if ‖x − x_n‖ > 3h to simplify the calculation.
    We still need to find the samples x_n that satisfy ‖x − x_n‖ ≤ 3h (neighbors at distance ≤ 3h).

– Also possible to define a different bandwidth hn for each data point xn (adaptive KDE ).
– The KDE quality degrades as the dimension D increases (no matter how h is chosen).
Could be improved by using a full covariance Σn per point, but it is preferable to use a mixture with K < N components.

• k-nearest-neighbor density estimate: p(x) = k / (2N d_k(x)) for x ∈ R^D, where d_k(x) = (Euclidean)
  distance of x to its kth nearest sample in X.
  – Like using a KDE with an adaptive bandwidth h = 2 d_k(x).
    Instead of fixing h and counting how many samples fall in the bin, we fix k and compute the bin size containing k samples.
  – Only parameter: the number of nearest neighbors k ≥ 1.
  – p(x) has a discontinuous derivative. It does not integrate to 1, so it is not a pdf.
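  Minimal sketches of the Gaussian KDE and the k-nn density estimate at a single test point x (the Gaussian normalization constant is included so the KDE integrates to 1):

import numpy as np

def kde(x, X, h):
    # p(x) = 1/(N h^D) sum_n K((x - x_n)/h) with a Gaussian kernel.
    N, D = X.shape
    K = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1)) / (2 * np.pi) ** (D / 2)
    return K.sum() / (N * h ** D)

def knn_density(x, X, k):
    # p(x) = k / (2 N d_k(x)), with d_k(x) = distance from x to its kth nearest sample.
    d = np.sort(np.linalg.norm(X - x, axis=1))
    return k / (2 * len(X) * d[k - 1])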

  [Figure: density estimates of a 1D sample: histogram (h = 2, 1, 0.5), naive estimator (h = 2, 1, 0.5), kernel estimator (h = 1, 0.5, 0.25) and k-nn estimator (k = 5, 3, 1).]
8.4 Nonparametric classification
• Given {(x_n, y_n)}_{n=1}^N where x_n ∈ R^D is a feature vector and y_n ∈ {1, . . . , K} a class label.

• Estimate the class-conditional densities p(x|C_k) with a KDE each. The resulting discriminant
  function can be written (ignoring constant factors) as?:

  g_k(x) = Σ_{n: y_n = k} K((x − x_n)/h)    k = 1, . . . , K

  so each data point x_n votes only for its class, with a weight given by K(·) (instances closer to
  x have bigger weight).
• k-nearest-neighbor classifier: assigns x to the class k having most
  instances among the k nearest neighbors of x.
  – Most common case: k = 1, nearest-neighbor classifier.
    It defines a Voronoi tessellation in R^D.
    It works with as few as one data point per class!
  – k is chosen to be an odd number to minimize ties.
  – Simple; good classification accuracy; but slow at test time.
    Condensed nearest-neighbor classifier: heuristic way to approximate the nearest-neighbor classifier using a subset of the N data points (to reduce its space and time complexity).

  [Figure: nearest-neighbor classification of two test points (*) in a 2D dataset (axes x_1, x_2).]
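  A sketch of the k-nearest-neighbor classifier for a single test point (brute-force distance computation):

import numpy as np

def knn_classify(x, X, y, k=1):
    # Assign x to the class with most instances among its k nearest neighbors in the training set.
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    classes, votes = np.unique(y[nearest], return_counts=True)
    return classes[votes.argmax()]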

8.8 Nonparametric regression


• Consider a sample {(x_n, y_n)}_{n=1}^N with x_n ∈ R^D and y_n ∈ R^E drawn iid from an unknown
  function plus noise, y = f(x) + ǫ.
  We construct a smoother (nonparametric regression estimate) g(x) of f(x).

• Kernel smoother: weighted average of the labels of instances near x (local average):

  g(x) = Σ_{n=1}^N [ K(‖(x − x_n)/h‖) / Σ_{n'=1}^N K(‖(x − x_{n'})/h‖) ] y_n.

  Equivalently, g(x) = Σ_{n=1}^N w_n(x) y_n is a weighted average of {y_n}_{n=1}^N with weights satisfying
  w_1(x), . . . , w_N(x) ≥ 0, Σ_{n=1}^N w_n(x) = 1, and w_n(x) large if x_n is near x and small otherwise.

  g(x) can be seen as the conditional mean E{y|x} of a KDE p(x, y) constructed on the sample?.
Also, as with KDEs:

– Regressogram: with an origin x0 and bin width h, and discontinuous at boundaries. eq. (8.24)

– Running mean smoother : K = uniform kernel. No origin x0 , only bin width h, but still discontinuous at boundaries. eq. (8.25)

– Running median smoother : like the running mean but using the median instead; more robust to outliers and noise.
– k-nearest-neighbor smoother : fixes k instead of h, adapting to the density around x.

• Running-line smoother : we use the data points around x (as defined by h or k) to estimate a
line rather than a constant, hence obtaining a local regression line.
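  A sketch of the Gaussian kernel smoother (Nadaraya-Watson estimate) at a single test point x:

import numpy as np

def kernel_smoother(x, X, Y, h):
    # g(x) = sum_n w_n(x) y_n: weighted average of the labels, weights from a Gaussian kernel.
    w = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))   # K(||(x - x_n)/h||), unnormalized
    return (w / w.sum()) @ Y                                # weights normalized to sum to 1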

  [Figure: smoothers on a 1D sample: regressogram (h = 6, 3, 1), running mean smoother (h = 6, 3, 1), kernel smoother (h = 1, 0.5, 0.25) and running-line smoother (h = 6, 3, 1).]
8.9 How to choose the smoothing parameter
• Smoothing parameter : number of neighbors k or kernel bandwidth h.

• It controls the complexity of the model:

– Too small: low bias, high variance.


The model is too rough (single instances have a large effect on the predictions).
– Too large: high bias, low variance.
The model is too smooth (single instances have a small effect on the predictions).

• How to choose it?

– Regression or classification: cross-validation.


– Density estimation: by trial-and-error.
Some heuristic rules exist that can be used as ballpark estimates.

9 Decision trees
• Nonparametric methods applicable to classification and regression given a sample {(x_n, y_n)}_{n=1}^N.

• It can use continuous and discrete features.

• Efficient at test time. May not even use all the features in the training set.

• The tree can be transformed into a set of IF-THEN rules that are interpretable for people.

• Widely used in practice, sometimes preferred over more accurate but less interpretable models.

• They define class regions as a sequence of recursive splits.

• The tree consists of:

– Internal decision nodes, each having ≥ 2 children. Decision node m selects one of its
children based on a test (a split) applied to the input x.
∗ Continuous feature xd : “xd > sm ” for some sm ∈ R.
∗ Discrete feature xd : n-way split for the n possible values of xd .
– Leaves, each containing a value (class label or output value y). Instances xn falling in the
same leaf should have identical (or similar) output values yn .
∗ Classification: class label y ∈ {1, . . . , K} (or proportion of each class p1 , . . . , pK ).
∗ Regression: numeric value y ∈ R (average of the outputs for the leaf’s instances).

The predicted output for an instance x is obtained by following a path from the root to a leaf.
In the best case (balanced tree) for binary trees, the path has length log2 L if there are L leaves.

• Decision trees vs nonparametric methods:

– Each leaf defines a “bin” (box) in input space. Different bins may have different sizes and
contain different numbers of instances.
Unlike in nonparametric methods, which use a fixed bandwidth h or a fixed number of neighbors k.

– The bin for an instance x can be found very quickly (a few feature comparisons).
No need to scan the whole training set.

– We can discard the training set once the tree is constructed, and keep only its nodes (and
associated thresholds and output values).
Then, why do we call it nonparametric?

– The size of the tree can still be O(N) if each leaf represents one instance. We stop the
tree from growing before this happens, so each leaf represents several instances.
We can also grow the full tree and then prune it back to a desired size.

9.2 Univariate trees

  [Figure: a univariate decision tree (tests x_1 > w_10, then x_2 > w_20) and the corresponding axis-aligned partition of the (x_1, x_2) plane into regions labeled C_1 and C_2.]

• The test at node m compares one feature with a threshold: “x_d > s_m” for some d ∈ {1, . . . , D}
  and s_m ∈ R. This defines a binary split into two regions: {x ∈ R^D: x_d ≤ s_m} and
  {x ∈ R^D: x_d > s_m}. The overall tree defines box-shaped, axis-aligned regions in input space.
  This is the simplest and most often used test. More complex tests exist, e.g. “w_m^T x > s_m”, which define oblique regions (multivariate trees).

• With discrete features, the number of children equals the number nd of values the feature can
take, and the test selects the child corresponding to the value of xd (n-way split).
Ex: xd ∈ {red, green, blue} ⇒ nd = 3 children.

• Tree induction is learning the tree from a training set {(x_n, y_n)}_{n=1}^N, i.e., determining its nodes:

– For each internal node, its test (feature d ∈ {1, . . . , D} and threshold sm ).
– For each leaf, its output value y.

• For a given sample, many (sufficiently big) trees exist that code it with zero error. We want to
find the smallest such tree. This is NP-complete so an approximate, greedy algorithm is used: Fig. 9.3
pseudocode

– Starting at the root with the entire training set, select a best split according to a purity
criterion, and associate each child with the subset of instances that fall in it.
– Continue splitting each child (with its subset of the training set) recursively until each
child is pure (hence a leaf) and no more splits are necessary.
Ex. algorithms: CART, ID3, C4.5.

Classification trees

• Purity criterion: a split is pure if each child it produces contains instances of the same class.
  Consider a node and all the training set instances that reach it, and call p_k the proportion of
  instances of class k, for k = 1, . . . , K (so p_k ≥ 0 and Σ_{k=1}^K p_k = 1). We can measure impurity
  as the entropy of p = (p_1, . . . , p_K): φ(p) = −Σ_{k=1}^K p_k log_2 p_k (where 0 log_2 0 ≡ 0?). This is
  maximum if p_1 = · · · = p_K = 1/K, and minimum (“pure”) if one p_k = 1 and the rest are 0.
  Other measures possible: Gini index φ(p) = Σ_{i≠j} p_i p_j, misclassification error φ(p) = 1 − max(p_1, . . . , p_K).

  [Figure: the binary entropy −p log_2 p − (1 − p) log_2 (1 − p) as a function of p ∈ [0, 1].]

• If a node is pure, i.e., all its instances are of the same class k, there is no need to split it. It
becomes a leaf with output value k.
We can also store the proportions p = (p1 , . . . , pK ) in the node (e.g. to compute risks).

• If a node m is not pure, we split it. We evaluate all possible features d = 1, . . . , D and all
  possible split thresholds s_m for each feature and pick the one with minimum impurity φ.
  – If the number of instances that reach node m is N_m, there are N_m − 1 possible thresholds
    (halfway between consecutive sorted values of x_d), and the best threshold
    must be between values of different classes.
  – For discrete features, there is no threshold but an n-way split.
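  A sketch of the split search for a classification tree node, using the entropy impurity (weighting each child's impurity by its fraction of instances, a common convention in CART-style algorithms):

import numpy as np

def entropy(y):
    # phi(p) = -sum_k p_k log2 p_k of the class proportions at a node.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    # Evaluate every feature d and every midpoint threshold; return the purest split.
    N, D = X.shape
    best_imp, best_d, best_s = np.inf, None, None
    for d in range(D):
        values = np.sort(np.unique(X[:, d]))
        for s in (values[:-1] + values[1:]) / 2:            # halfway between consecutive values
            left, right = y[X[:, d] <= s], y[X[:, d] > s]
            imp = (len(left) * entropy(left) + len(right) * entropy(right)) / N
            if imp < best_imp:
                best_imp, best_d, best_s = imp, d, s
    return best_d, best_s, best_imp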

Regression trees
• Purity criterion: the squared error E(g) = Σ_{n∈node} (y_n − g)², where g ∈ R, which is minimal
  when g is the mean of the y_n values?; then E(g) is (proportional to) the variance of the y_n values.
  If there is much noise or outliers, it is preferable to set g to the median of the y_n values.

• We consider a node to be pure if E ≤ θ for a user threshold θ > 0. In that case, we do not
split it. It becomes a leaf with output value g.
• If a node m is not pure, we split it. We evaluate all possible features d = 1, . . . , D and all
possible split thresholds sm for each feature and pick the one with minimum impurity (= sum
of the variances E of each of the children), as in the classification case.
• Rather than assigning a constant output value to a leaf, we can assign it a regression function
(e.g. linear), as in a running-mean smoother.

Early stopping and pruning


• Growing the tree until each leaf is pure will produce a large tree with high variance (sensitive
to the training sample) that will overfit when there is noise.
How to learn smaller trees that generalize better to unseen data?
• Early stopping: we stop splitting if the impurity is below a user threshold θ > 0.
θ ↓ low bias, high variance, large tree; θ ↑ high bias, low variance, small tree.
• Pruning: we grow the tree in full until all leaves are pure and the training error is zero. Then,
we find subtrees that cause overfitting and prune them.
Keep aside a subset from the training set (“pruning set”). For each possible subtree, try replacing it with a leaf node labeled with
the training instances covered by the subtree. If the leaf node performs no worse than the subtree on the pruning set, we prune the
subtree and keep the leaf node because the additional complexity of the subtree is not justified; otherwise, we keep the subtree.

• Pruning is slower than early stopping but it usually leads to trees that generalize better.
An intuitive reason why is as follows. In the greedy algorithm used to grow the tree, once we make a decision at a given node (to
select a split) we never backtrack and try a different, maybe better, possibility.

  [Figure: regression trees learned on a 1D sample for purity thresholds θ_r = 0.5, 0.2, 0.05, together with the resulting piecewise-constant fits; smaller θ_r gives deeper trees with more leaves.]

9.4 Rule extraction from trees


• A decision tree does its own feature extraction: the final tree may not use all the D features.
• Features closer to the root are more important globally.
• Path root → leaf = conjunction of tests. This and the leaf’s output value give a rule.
• The set of extracted rules allows us to extract knowledge from the dataset.
  Ex (features x_1: Age, x_2: Years in job, x_3: Gender, x_4: Job type; the tree tests x_1 > 38.5 at the root, then x_2 > 2.5 or the value of x_4):
  R1: IF (age > 38.5) AND (years-in-job > 2.5) THEN y = 0.8
  R2: IF (age > 38.5) AND (years-in-job ≤ 2.5) THEN y = 0.6
  R3: IF (age ≤ 38.5) AND (job-type = ‘A’) THEN y = 0.4
  R4: IF (age ≤ 38.5) AND (job-type = ‘B’) THEN y = 0.3
  R5: IF (age ≤ 38.5) AND (job-type = ‘C’) THEN y = 0.2

10 Linear discrimination
• Consider learning a classifier given a sample {(x_n, y_n)}_{n=1}^N where x_n ∈ R^D and y_n ∈ {1, . . . , K}.

• Classification works as follows:


– Training time: learn a set of discriminant functions {gk (x)}K
k=1 .
– Test time: given a new instance x, choose Ck if k = arg maxi=1,...,K {gi (x)}.
• Two approaches to learning discriminant functions:
– Generative approach: we learn p(x|Ck ) and p(Ck ) for each class from the training data,
and then use gk (x) = p(Ck |x) ∝ p(x|Ck )p(Ck ) (from Bayes’ rule) to predict a class. Hence,
besides learning the class boundaries (where p(Ci |x) = p(Cj |x) for i 6= j), we model also
the density of each class.
Previous chapters, using parametric and nonparametric methods for p(x|Ck ).

– Discriminative approach: we learn only the class boundaries, through discriminant func-
tions gk (x), k = 1, . . . , K. We don’t learn the class densities, and gk (x) need not be
modeled using probabilities. Hence, it requires assumptions only about the class bound-
aries but not about their densities. It solves a simpler problem.
This and future chapters, using linear and nonlinear discriminant functions.

• We define a parametric model g_k(x; Θ_k) for each class discriminant. In linear discrimination,
  this model is linear: g_k(x; w_k) = Σ_{d=1}^D w_kd x_d + w_k0 = w_k^T x (assuming an extra constant feature
  x_0 = 1). We assume that, in the training set, instances from any one class are linearly separable
  from instances of the other classes.
• Learning means finding parameter values {Θ_k}_{k=1}^K that optimize the quality of the separation.
In the parametric generative approach, learning means finding parameter values Θk for each class p(x|Ck ; Θk ) that maximize the
likelihood, separately for each class.

• Linear discriminants are simpler than nonlinear ones:


– Faster to train.
  – Low space and time complexity at test time: O(D) per class (to store w_k and compute w_k^T x).
– Simple to interpret: the output is a weighted sum of the features xd .
∗ Magnitude of wkd : importance of xd in the decision.
∗ Sign of wkd: whether the effect of xd is positive or negative.
– Accurate enough in many applications.
• In practice, try linear discrimination first, before trying nonlinear discrimination.

10.2 Generalizing the linear model


• When a linear model (linear in terms of x) is not flexible enough, we can:
– Add higher-order terms.
    Ex: if x = (x_1, x_2), we can define new variables z_1 = 1, z_2 = x_1, z_3 = x_2, z_4 = x_1², z_5 = x_2², z_6 = x_1 x_2 and take
    z = (z_1, z_2, z_3, z_4, z_5, z_6) as input. A linear function in the 6D space of z, w^T z, corresponds to a nonlinear function in the
    2D space of x.
  – Write the discriminant as g_i(x) = Σ_{k=1}^K w_k φ_ik(x) where the φ_ik(x) are basis functions.
    Ex. of φ(x): x_1^{α_1} · · · x_D^{α_D}, exp(−((x − µ)/σ)²), sin(u^T x), etc., for suitable, fixed values of α, µ, σ, u, etc.

Radial basis function (RBF) networks and kernel support vector machines (SVMs) do this. ch. 12–13

10.3 Geometry of the linear discriminant
  [Figures: geometry of a linear discriminant in 2D: the boundary g(x) = w_1 x_1 + w_2 x_2 + w_0 = 0, the half-spaces g(x) > 0 and g(x) < 0, and the distances |w_0|/‖w‖ and |g(x)|/‖w‖; a linearly separable 3-class problem (hyperplanes H_1, H_2, H_3); a pairwise linearly separable problem (hyperplanes H_12, H_23, H_31).]

Two classes

• One discriminant function is sufficient?: g_1(x) − g_2(x) = w^T x + w_0 = g(x).
  Testing: choose C_1 if g(x) > 0 and C_2 if g(x) < 0.
• This defines a hyperplane where w is the weight vector and w0 the threshold (or bias). It divides
the input space RD into two half-spaces, the decision regions R1 for C1 (positive side) and R2
for C2 (negative side). The hyperplane itself is the boundary or decision surface.

• The origin x = 0 is on the positive side if w_0 > 0, on the boundary if w_0 = 0, and on the negative side if w_0 < 0.
• w is orthogonal to the hyperplane. Pf. Pick x, y on the hyperplane.

• The signed distance from x ∈ R^D to the hyperplane is r = g(x)/‖w‖.
  Pf. Write x = x_p + r w/‖w‖ where x_p = orthogonal projection of x on the hyperplane, and compute g(x).
  The signed distance of the origin to the hyperplane is r_0 = w_0/‖w‖.

• So w determines the orientation of the hyperplane and w0 its location wrt the origin.

K > 2 classes There are two common ways to define the decision for a point x ∈ RD :
• One-vs-all (linear separability): we use K discriminant functions g_k(x) = w_k^T x + w_k0.
– Training: discriminant function gk is trained to classify the training points of class Ck vs
the training points of all other classes.
– Testing: choose Ck if k = arg maxi=1,...,K gi (x), i.e., pick the class having the larger dis-
criminant. Ideal case: gk (x) > 0 and gi (x) < 0 ∀i 6= k.
• One-vs-one (pairwise linear separability): we use K(K − 1)/2 discriminants g_ij(x) = w_ij^T x + w_ij0.
  – Training: discriminant function g_ij (for i ≠ j) is trained to classify the training points of
    class C_i vs the training points of class C_j (training points from other classes are not used).
  – Testing: choose C_k if k = arg max_{i=1,...,K} Σ_{j≠i} g_ij(x), i.e., pick the class having the larger
    summed discriminant. Ideal case: g_kj(x) > 0 ∀j and g_ij(x) < 0 ∀i ≠ k, ∀j.
    Also possible: for each class k, count the number of times g_kj(x) > 0 for j ≠ k, and pick
    the class with most votes.
• Both divide the input space RD into K convex decision regions R1 , . . . , RK .

10.6 Minimizing functions by (stochastic) gradient descent
• When the minimization problem minw∈RD E(w) cannot be solved in closed form (solution
w∗ = some formula), we use iterative methods (w(0) → w(1) → · · · → w(∞) = w∗ ). Many
such methods exist (Newton’s method, conjugate gradients. . . ). One of the simplest ones,
reasonably effective in many cases (although sometimes very slow), is gradient descent.

• Gradient descent (GD): repeatedly iterate w ← w + ∆w with an update ∆w = −η ∇E(w).
  Elementwise for d = 1, . . . , D: w_d ← w_d + ∆w_d where ∆w_d = −η ∂E/∂w_d.
  – Gradient (vector of partial derivatives): ∇E(w) = (∂E/∂w_1, . . . , ∂E/∂w_D)^T ∈ R^D.
  – Step size or learning rate: η > 0. Neither too small (slow convergence) nor too large
    (oscillations or divergence).
  – Initial weight vector w^(0): usually small random numbers. Ex: w_d ∼ uniform[−0.01, 0.01].

– Stop iterating:
∗ When ∇E(w) ≈ 0 (since ∇E(w∗ ) = 0 at a minimizer w∗ of E). But this overfits.
∗ When the error on a validation set starts increasing (early stopping), which will
happen before the training error is minimized. The model w generalizes better to unseen data.
– It will find a local minimizer (not necessarily global ).
• Stochastic gradient descent (SGD): applicable when E(w) = Σ_{n=1}^N e(w; x_n), i.e., the total
  error is the sum of the error at each data point. We update w as soon as we process x_n:
  repeatedly iterate w ← w + ∆w with an update ∆w = −η ∇e(w; x_n).

  – Epoch = one pass over the whole dataset {x_1, . . . , x_N}. It corresponds to one “noisy”
    iteration of gradient descent: it need not always decrease E(w) because it doesn’t use
    the correct gradient ∇E(w) = Σ_{n=1}^N ∇e(w; x_n).

– Much faster than (batch) gradient descent to get an approximate solution for w if N
is large (so there is redundancy in the dataset), or if data points come one at a time
and we don’t store them (online learning). However, very slow convergence thereafter.
– The step size has to decrease slowly over epochs for convergence to occur.
    One typically takes η(t) = α/(β + t) at epoch t, for suitable α, β > 0.

– Shuffling: in each epoch, process the N points in a random order. Better than a fixed order.
  – Minibatches: computing the update ∆w = −η Σ_{n∈B} ∇e(w; x_n) based on subsets B of
    1 < |B| < N points works better than with |B| = 1 (pure online learning) or |B| = N
    (pure batch learning). In practice |B| = 10 to 1 000 points typically.
• Ex: E(w) = ½ Σ_{n=1}^N (y_n − w^T x_n)² (least-squares error for linear regression). The updates:

  GD: ∆w = η Σ_{n=1}^N (y_n − w^T x_n) x_n        SGD: ∆w = η (y_n − w^T x_n) x_n

  have the form (ignoring η) “Error × Input”, where Error = DesiredOutput − ActualOutput.
  This has the following effect:
  – If Error = 0, don’t update w; otherwise update w proportionally to Error.
  – The update increases w if Error and Input have the same sign, else it decreases w.
  This results in correcting w so the error becomes smaller.
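  A sketch of SGD for this least-squares linear regression (assuming X already contains a constant feature x_0 = 1 so that w includes the bias; the step size and number of epochs are illustrative):

import numpy as np

def sgd_linear_regression(X, y, eta=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = rng.uniform(-0.01, 0.01, size=D)        # small random initial weights
    for epoch in range(epochs):
        for n in rng.permutation(N):            # shuffling: process points in random order
            err = y[n] - w @ X[n]               # Error = DesiredOutput - ActualOutput
            w += eta * err * X[n]               # update: eta * Error * Input
    return w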

10.7 Logistic regression (a classification method, not a regression method!)

Two classes

• The logistic regression or logistic classifier: given x ∈ R^D, calculate θ = σ(w^T x + w_0) ∈ (0, 1)
  and choose C_1 if θ > ½; or, equivalently, calculate w^T x + w_0 ∈ (−∞, ∞) and choose C_1 if it is positive.

  It transforms a discriminant value g(x) = w^T x + w_0 into a posterior
  probability p(C_1|x) = θ = σ(w^T x + w_0) = 1/(1 + exp(−(w^T x + w_0))).

  [Figure: the logistic (sigmoid) function σ(t) = 1/(1 + e^{−t}).]

• It defines a parametric estimator for p(C_1|x) directly, without having a model for p(x|C_1), p(C_1)
  and p(x|C_2), p(C_2). We learn its parameters (w, w_0) from a training set {(x_n, y_n)}_{n=1}^N.

• Another way to derive the logistic classifier: let us model the log-ratio of the class-conditional
  densities as a linear function: log (p(x|C_1)/p(x|C_2)) = w^T x + w_00. This implies? p(C_1|x) = σ(w^T x + w_0)
  (where w_0 = w_00 + log (p(C_1)/p(C_2))). Note that, unlike in generative models, we don’t have a model of the
  class-conditional densities p(x|C_k) themselves.
  A linear log-ratio does hold for Gaussian classes with shared covariance (x|C_k ∼ N(µ_k, Σ)) ✐, but can also hold in other cases.

• When the data are Gaussian distributed, logistic regression has a comparable error rate to
a parametric Gaussian linear discriminant. Logistic regression can still be applied when the
class-conditional densities are non-Gaussian, as long as they are linearly separable.
• Geometry of the logistic classifier: like the geometry of the linear discriminant (w determines
the orientation of the decision boundary and w0 its location wrt the origin), and the magnitude
of w and w0 determines how steep the logistic function becomes.
• Logit function or log odds of θ: logit(θ) = log (θ/(1 − θ)) ∈ (−∞, ∞) for θ ∈ (0, 1). It is the inverse
  of the logistic function.

K > 2 classes

• We use a softmax function to model p(C_k|x) = exp(w_k^T x + w_k0) / Σ_{j=1}^K exp(w_j^T x + w_j0) for k = 1, . . . , K.

  – It satisfies p(C_k|x) ∈ (0, 1) and Σ_{k=1}^K p(C_k|x) = 1. For K = 2, softmax = logistic function.
  – “Soft” max(·): if w_k^T x + w_k0 ≫ w_j^T x + w_j0 ∀j ≠ k then p(C_k|x) ≈ 1, p(C_j|x) ≈ 0 ∀j ≠ k.
    It is differentiable (unlike the max(·) function).
    In this sense, the logistic function is a soft step function.

• Testing: given x ∈ R^D, calculate (θ_1, . . . , θ_K) = softmax(w_1^T x + w_10, . . . , w_K^T x + w_K0) and
  choose C_k if k = arg max_{i=1,...,K} θ_i.

  [Figure: for K = 3 classes in 2D: the decision boundaries, the linear discriminants w_k^T x + w_k0, and the posterior probabilities P(C_i|x_1, x_2) obtained by taking the softmax of the linear discriminants.]
10.8 Learning logistic regression
Given a sample {(x_n, y_n)}_{n=1}^N, we want to learn an estimator:

• K = 2: p(C_1|x) = σ(w^T x + w_0) = 1/(1 + exp(−(w^T x + w_0))) (logistic) with y_n ∈ {0, 1}.

• K > 2: p(C_k|x) = exp(w_k^T x + w_k0) / Σ_{j=1}^K exp(w_j^T x + w_j0) (softmax) for each class, with y_n ∈ {1, . . . , K}.

We can learn {w, w_0} or {w_k, w_k0}_{k=1}^K by maximum likelihood, or by least-squares regression.

10.8.1 By maximum likelihood

Two classes: y_n ∈ {0, 1}

• We model y_n|x_n as a Bernoulli distribution with parameter θ_n = p(C_1|x_n) = σ(w^T x_n + w_0):

  max_{w,w_0} L(w, w_0; {(x_n, y_n)}_{n=1}^N) = Σ_{n=1}^N log p(y_n|x_n; w, w_0) = Σ_{n=1}^N log (θ_n^{y_n} (1 − θ_n)^{1−y_n})   ⇔ (change of sign)

  min_{w,w_0} E(w, w_0; {(x_n, y_n)}_{n=1}^N) = −Σ_{n=1}^N (y_n log θ_n + (1 − y_n) log (1 − θ_n)) = cross-entropy
                                              = −Σ_{n∈C_1} log θ_n − Σ_{n∈C_2} log (1 − θ_n).

• No closed-form solution. We apply gradient descent and obtain:

  ∆w = −η ∂E/∂w = η Σ_{n=1}^N (y_n − θ_n) x_n,        ∆w_0 = −η ∂E/∂w_0 = η Σ_{n=1}^N (y_n − θ_n)

  where θ_n = σ(w^T x_n + w_0). ✐ Pf. Use the chain rule and dσ(t)/dt = σ(t)(1 − σ(t)).
  If the classes are linearly separable, ‖w‖ → ∞ and θ_n → y_n ∈ {0, 1} as training proceeds.
  Instead, we stop early, when the number of misclassifications is zero. (Fig. 10.6: pseudocode.)
  Newton’s method (suitably modified) is much more effective than gradient descent for this problem.

  [Figure: the estimated posterior p(C_1|x) for a 1D two-class problem after 10, 100 and 1000 training epochs.]
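  A sketch of batch gradient descent for the two-class logistic classifier (assuming X contains a constant feature for the bias w_0; step size and number of epochs are illustrative):

import numpy as np

def train_logistic(X, y, eta=0.1, epochs=1000, seed=0):
    # Minimize the cross-entropy by gradient descent; y has entries in {0, 1}.
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.01, 0.01, size=X.shape[1])
    for _ in range(epochs):
        theta = 1 / (1 + np.exp(-(X @ w)))      # theta_n = sigma(w^T x_n)
        w += eta * X.T @ (y - theta)            # Delta w = eta * sum_n (y_n - theta_n) x_n
    return w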

K > 2 classes: y_n ∈ {1, . . . , K}

• As before, but we model y_n|x_n as a multinomial distribution with parameter θ_nk = p(C_k|x_n) =
  exp(w_k^T x_n + w_k0) / Σ_{j=1}^K exp(w_j^T x_n + w_j0). The error function is again the cross-entropy (maximum likelihood with a
  change of sign), where y_n ∈ {0, 1}^K uses a 1-of-K encoding: (p. 254)

  min_{{w_k, w_k0}_{k=1}^K} E({w_k, w_k0}_{k=1}^K; {(x_n, y_n)}_{n=1}^N) = −Σ_{n=1}^N Σ_{k=1}^K y_nk log θ_nk.

Multinomial distribution: (p. 600)

• A die with K faces, each with probability θ_k ∈ [0, 1] for k = 1, . . . , K, with Σ_{k=1}^K θ_k = 1.

• p(y; θ) = θ_1^{y_1} · · · θ_K^{y_K} = θ_k if y_k = 1, where y ∈ {0, 1}^K has exactly one y_k = 1 and the rest are 0.

• For K = 2 it becomes the Bernoulli distribution.

10.8.2 By least-squares regression

Two classes: y_n ∈ {0, 1}

• We define a least-squares regression problem  min_Θ E(Θ) = ½ Σ_{n=1}^N (y_n − f(x_n; Θ))²  where:
  – The labels {y_n} are considered as real values (which happen to be either 0 or 1), and can
    be seen as the desired posterior probabilities for the points {x_n}.
  – The parametric function to be learned is f(x; w, w_0) = σ(w^T x + w_0) = 1/(1 + exp(−(w^T x + w_0))).
  – Hence, learning the classifier by regression means trying to estimate the desired posterior
    probabilities with the function f (whose outputs are in [0, 1]).

• No closed-form solution. We apply gradient descent and obtain?:

  ∆w = −η ∂E/∂w = η Σ_{n=1}^N (y_n − θ_n) θ_n (1 − θ_n) x_n,        ∆w_0 = −η ∂E/∂w_0 = η Σ_{n=1}^N (y_n − θ_n) θ_n (1 − θ_n)

  where θ_n = σ(w^T x_n + w_0).

• As before, if the classes are linearly separable, this would drive ‖w‖ → ∞ and θ_n → y_n ∈ {0, 1}.
  Instead, we stop iterating when the right output dimension is larger than the others for each
  instance.

K > 2 classes: y_n ∈ {1, . . . , K}

• Same thing but: (p. 259)

  – We represent the labels y_n as a one-of-K binary vector y_n ∈ {0, 1}^K (containing a single
    1 in position k if y_n corresponds to class k), and consider the y_kn values as real values
    (which happen to be either 0 or 1).
  – We learn a separate function f_k(x; w_k, w_k0) = σ(w_k^T x + w_k0) = 1/(1 + exp(−(w_k^T x + w_k0))) per
    output dimension k = 1, . . . , K.

• Hence, we solve a least-squares regression problem

  min_Θ E(Θ) = ½ Σ_{n=1}^N ‖y_n − f(x_n; Θ)‖² = ½ Σ_{k=1}^K Σ_{n=1}^N (y_kn − f_k(x_n; w_k, w_k0))²

  which is equivalent to K independent regression problems, each on an f_k.

• This formulation ignores the fact that the components of f(x) should sum to 1. A correct
  formulation, but more complicated to optimize, would use the softmax function for f.

11 Multilayer perceptrons (artificial neural nets)
• Parametric nonlinear function approximators f(x; Θ) for classification or regression.
• Originally inspired by neuronal circuits in the brain (McCullough-Pitts neurons, Rosenblatt’s
perceptron, etc.), and by the brain’s ability to solve intelligent problems (visual and speech
perception, navigation, planning, etc.).
Synapses between neurons = MLP weight parameters (which are modified during learning). Neuron firing = MLP unit nonlinearity.

• They are usually implemented in software in serial computers and more recently in parallel or
distributed computers. There also exist VLSI implementations.
  [Figures: a biological neuron; the organization of neurons in the retina.]

11.2 The perceptron

• Linear perceptron: computes a linear function y = Σ_{d=1}^D w_d x_d + w_0 ∈ R of an input vector
  x ∈ R^D with weight vector w ∈ R^D (or y = w^T x with w ∈ R^{D+1}, if we augment x with a
  0th component of value 1). To have K outputs: y = Wx with W ∈ R^{K×(D+1)}.

• For classification, we can use:
  – Two classes: y = σ(w^T x) = 1/(1 + exp(−w^T x)) ∈ (0, 1) (logistic).
  – K > 2 classes: y_k = exp(w_k^T x) / Σ_{j=1}^K exp(w_j^T x) ∈ (0, 1), k = 1, . . . , K, with Σ_{k=1}^K y_k = 1 (softmax).

11.3 Training a perceptron

• Apply stochastic gradient descent: to minimize the error E(w) = Σ_{n=1}^N e(w; x_n, y_n), repeatedly
  update the weights w ← w + ∆w with ∆w = −η ∇e(w; x_n, y_n) and n = 1, . . . , N (one epoch).
  – Regression by least-squares error: e(w; x_n, y_n) = ½ (y_n − w^T x_n)² ⇒ ∆w = η (y_n − w^T x_n) x_n.
  – Classification by maximum likelihood, or equivalently cross-entropy:
    ∗ For two classes: y_n ∈ {0, 1} and e(w; x_n, y_n) = −y_n log θ_n − (1 − y_n) log (1 − θ_n) where
      θ_n = σ(w^T x_n) ⇒ ∆w = η (y_n − θ_n) x_n.
    ∗ For K > 2 classes: y_n ∈ {0, 1}^K coded as 1-of-K and e(w; x_n, y_n) = −Σ_{k=1}^K y_kn log θ_kn
      where θ_kn = exp(w_k^T x_n) / Σ_{j=1}^K exp(w_j^T x_n) ⇒ ∆w_kd = η (y_kn − θ_kn) x_dn for d = 0, . . . , D and k = 1, . . . , K.
      (Fig. 11.3: pseudocode.)
  The original perceptron algorithm was a variation of stochastic gradient descent. For linearly separable problems, it converges in a
  finite (possibly large) number of iterations. For problems that are not linearly separable, it never converges.

11.4 Learning Boolean functions

• Boolean function: {0, 1}^D → {0, 1}. Maps a vector of D bits to a single bit (truth value).

• Can be seen as a binary classification problem where the input instances are binary vectors.

• Since a perceptron can learn linearly separable problems, it can learn AND, OR but not XOR.

  AND: (x_1, x_2) → y: (0,0) → 0, (0,1) → 0, (1,0) → 0, (1,1) → 1. Linearly separable; implemented by the perceptron y = σ(x_1 + x_2 − 3/2).
  XOR: (x_1, x_2) → y: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0. Not linearly separable, so no single perceptron can implement it.

11.5 Multilayer perceptrons (MLPs)


• Multilayer perceptron (or feedforward neural net): nested sequence of perceptrons, with an input
perceptrons perceptrons perceptrons
layer, an output layer and zero or more hidden layers: x −−−−−−→ · −−−−−−→ . . . −−−−−−→ y.
• It can represent nonlinear discriminants (for classification) or functions (for regression).
• Architecture of the MLP: each layer has several units. Each unit h takes as input the output z
of the previous layer’s units and applies to it a linear function whT z (using a weight vector wh ,
including a bias term) followed by a nonlinearity s(t): output of unit h = s(whT z).
• Typical nonlinearity functions s(t) used:
  – Logistic function: s(t) = 1/(1 + e^{−t}) (or softmax).
  – Hyperbolic tangent: s(t) = tanh t = (e^t − e^{−t})/(e^t + e^{−t}).
  – Rectified linear unit (ReLU): s(t) = max(0, t).
  – Step function: s(t) = 0 if t < 0, else 1.
  – Identity function: s(t) = t (no nonlinearity).

  [Figure: the sigmoid, tanh, step, identity and ReLU nonlinearities plotted over t ∈ [−4, 4].]
The output layer uses as nonlinearity:
– For regression: the identity (so the outputs can take any real value).
– For classification: the sigmoid and a single unit (K = 2 classes), or the softmax and K
units (K > 2 classes).
All nonlinearities (sigmoid, etc.) have a similar universal approximation ability, but some (e.g.
ReLU) are easier to optimize than others.
• If all the layers are linear (s(t) = t for all units in each layer) the MLP is overall a linear
function, which is not useful, so we must have some nonlinear layers.

• Ex: an MLP with one hidden layer having H units, with inputs x ∈ R^D and outputs y ∈ R^{D'},
  where the hidden units are sigmoidal and the output units are linear:

  z_h(x) = σ(w_h^T x), h = 1, . . . , H
  f_i(x) = v_i^T z(x), i = 1, . . . , D'        ⇒   f_i(x) = Σ_{h=1}^H v_ih σ( Σ_{d=1}^D w_hd x_d + w_h0 ) + v_i0.

  [Figure: an MLP with a single nonlinear hidden layer (two hidden units z_1, z_2) solves the XOR problem; each hidden unit implements one conjunction and the output unit combines them.]

11.6 MLP as a universal approximator


• A Boolean function can always be written as a disjunction of conjunctions, and this can be
implemented by an MLP with one hidden layer. Each conjunction (AND) is implemented by
one hidden unit. Ex: x1 XOR x2 = (x1 AND x2 ) OR (x1 AND x2 ). The disjunction (OR) is
implemented by the output unit.
  This existence proof generates very large MLPs (up to 2^D hidden units with D inputs). Practically, MLPs are much smaller.

• Universal approximation: any continuous function (satisfying mild assumptions) from R^D to
  R^{D'} can be approximated by an MLP with a single hidden layer with an error as small as
  desired (by using sufficiently many hidden units).
Simple proof for the case of two hidden layers: we can enclose every input instance (or region) with a set of hyperplanes using
hidden units in the first layer. A hidden unit in the second layer ANDs them together to bound the region. We then set the weight
of the connection from that hidden unit to the output unit equal to the desired function value. This gives a piecewise constant
approximation to the function. Its accuracy may be increased by using more hidden units that define finer regions in input space.

11.7 Learning MLPs: the backpropagation algorithm

• The backpropagation algorithm is (stochastic) gradient descent applied to an MLP using a
  least-squares or cross-entropy error function. Since the MLP consists of a nested sequence of
  functions (layers) x → z_1 → · · · → y = f(x; Θ), with weight matrices W_1, . . . , W_K at the layers
  and Θ = {W_1, . . . , W_K}, each layer being a nonlinear perceptron, the gradient is computed with
  the chain rule, and can be interpreted as backpropagating the error at the output layer through
  the hidden layers back to the input.

• Ex: an MLP with one hidden layer having H units, with inputs x ∈ R^D and outputs y ∈ R^{D'},
  where the hidden units are sigmoidal and the output units are linear. The gradient is as follows:

  – wrt a second-layer weight v_ih:  ∂E/∂v_ih = (∂E/∂f_i)(∂f_i/∂v_ih) = δ_i ∂f_i/∂v_ih
    (as with a perceptron given fixed inputs z_h),
  – wrt a first-layer weight w_hd:  ∂E/∂w_hd = Σ_{i=1}^{D'} (∂E/∂f_i)(∂f_i/∂z_h)(∂z_h/∂w_hd) = Σ_{i=1}^{D'} δ_i (∂f_i/∂z_h)(∂z_h/∂w_hd),

  with “error” δ_i = ∂E/∂f_i.

• With more hidden layers, the gradient wrt each layer’s weights is computed recursively.

• The updates using stochastic gradient descent with a minibatch B to minimize an error function
  E(Θ; {(x_n, y_n)}_{n=1}^N) = Σ_{n=1}^N e(Θ; x_n, y_n) are of the form Θ ← Θ + ∆Θ with
  ∆Θ = −η Σ_{n∈B} ∇_Θ e(Θ; x_n, y_n), where the gradients are given below. B = {1, . . . , N} gives batch gradient descent.

Regression  We assume outputs y of dimension D'. Θ = {W, V}. Least-squares error: (Fig. 11.11: pseudocode.)

  e(W, V; x_n, y_n) = e_n(W, V) = ½ ‖y_n − f(x_n; W, V)‖² = ½ Σ_{i=1}^{D'} (y_in − f_i(x_n; W, V))²

  where f_i = Σ_{h=0}^H v_ih z_h (output of output unit i), z_h = σ( Σ_{d=0}^D w_hd x_d ) (output of hidden unit h),
  and δ_i = error of output unit i. The gradients are:

  ∂e_n/∂v_ih = −(y_in − f_i(x_n)) z_hn        ∂e_n/∂w_hd = −Σ_{i=1}^{D'} (y_in − f_i(x_n)) v_ih z_hn (1 − z_hn) x_dn.

Classification, two classes  We need a single output unit f(x_n). Θ = {W, v}. Cross-entropy error:

  e(W, v; x_n, y_n) = e_n(W, v) = −y_n log f(x_n; W, v) − (1 − y_n) log (1 − f(x_n; W, v))

  ∂e_n/∂v_h = −(y_n − f(x_n)) z_hn        ∂e_n/∂w_hd = −(y_n − f(x_n)) v_h z_hn (1 − z_hn) x_dn.

Classification, K > 2 classes  We need one output unit f_i(x_n) per class i = 1, . . . , K, given by
the softmax function. Θ = {W, V}. Cross-entropy error:

  e(W, V; x_n, y_n) = e_n(W, V) = −Σ_{i=1}^K y_in log f_i(x_n; W, V),    f_i(x_n) = exp(o_in) / Σ_{k=1}^K exp(o_kn),    o_in = Σ_{h=0}^H v_ih z_hn

  ∂e_n/∂v_ih = −(y_in − f_i(x_n)) z_hn        ∂e_n/∂w_hd = −Σ_{i=1}^K (y_in − f_i(x_n)) v_ih z_hn (1 − z_hn) x_dn.
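  A sketch of one stochastic gradient step for the regression case (sigmoidal hidden layer, linear outputs, least-squares error); x is assumed to include the constant input x_0 = 1, and a constant z_0 = 1 is appended for the output-layer bias:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def backprop_step(x, y, W, V, eta):
    # Shapes: x: D+1, y: D', W: H x (D+1) first-layer weights, V: D' x (H+1) second-layer weights.
    z = np.concatenate(([1.0], sigmoid(W @ x)))          # hidden outputs z_h, with z_0 = 1
    f = V @ z                                            # linear outputs f_i = v_i^T z
    err = y - f                                          # (y_in - f_i(x_n))
    dV = np.outer(err, z)                                # -de_n/dv_ih = (y_i - f_i) z_h
    dz = (V[:, 1:].T @ err) * z[1:] * (1 - z[1:])        # error backpropagated to the hidden units
    dW = np.outer(dz, x)                                 # -de_n/dw_hd = [sum_i (y_i - f_i) v_ih] z_h (1 - z_h) x_d
    return W + eta * dW, V + eta * dV                    # gradient-descent update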

  [Figure: an MLP with H = 2 hidden units trained on a 2D problem: mean squared error on the training and validation sets over 300 epochs and, after 100, 200 and 300 epochs, (a) the hyperplanes of the hidden units in layer 1, (b) the hidden unit outputs, (c) the hidden unit outputs × the weights in layer 2.]

11.8 Training procedures


Training neural nets in practice is tricky and requires some expertise and trial-and-error.

Improving convergence Training a neural net can be computationally very costly:

• The dataset {(xn , yn )}N


n=1 and the number of weights can both be very large in practical
problems.

• Gradient descent converges very slowly in optimization problems with many parameters.

• Vanishing gradients problem: there is one product σ ′ (z) = σ(z) (1 − σ(z)) per layer. If the
value of z is large, which will happen if the weights or inputs at that layer are large, then the
sigmoid saturates and σ ′ (z) ≈ 0, so the gradient becomes tiny, and it takes many iterations to
make progress. This is particularly problematic with deep networks (having several layers of
hidden units). Other nonlinearities have less of a problem, e.g. ReLU.

The following techniques are typically used to speed up the convergence:

• Initial weights: small random values (e.g. uniform in [−0.01, 0.01]) so as not to saturate the
sigmoids.

• Normalizing the inputs so they have zero mean and unit variance speeds up the optimization
(since we use a single step size η for all parameters).

• Momentum: we use an update (for any given weight w) ∆w^new = −η ∂E/∂w + α ∆w^old, where α
  is generally taken between 0.5 and 1. This tends to smooth the trajectory of the iterates and
  reduce oscillations. It is a limited form of conjugate gradients.

• Rescaling the learning rate ηw for each weight w by using a diagonal approximation to the
Hessian of the objective function E(w). This helps to correct for the fact that weights in
different layers, or subject to weight sharing, may have different effects on the output.

• It is also possible to use other optimization methods (modified Newton’s method, conjugate
gradients, L-BFGS, etc.) but, for problems with large N and redundant samples, they don’t
seem to improve significantly over SGD (with properly tuned learning rate, minibatch size and
momentum term).

Overtraining
• Model selection for the number of hidden units (hence weights): the bias-variance tradeoff
applies as usual:
– Neural nets with many hidden units can achieve a very low training error but memorize
the noise as well as the signal, hence overfit.
– Neural nets with few hidden units can have a large bias, hence underfit.
The number of hidden units can be estimated by cross-validation.
• More generally, one needs to select the overall architecture of the neural net (number of layers
and of hidden units in each layer, connectivity between them, choice of nonlinearity, etc.). This
is done by trial and error and can be very time-consuming.
• Early stopping: during the (stochastic) gradient descent optimization (with a given number of
hidden units), we monitor the error on a validation set and stop training when the validation
error doesn’t keep decreasing. This helps to prevent overfitting as well.
• Weight decay: we discourage weights from taking large values by adding to the objective
  function, or equivalently to the update, a penalty term:

  E'(w) = E(w) + (λ/2) Σ_i w_i²   ⇐⇒   ∆w_i = −η (∂E/∂w_i + λ w_i).

  This has the effect of choosing networks with small weights as long as they have a low training
  error (depending on λ). Such networks are smoother and generalize better to unseen data. The
  value of λ is set by cross-validation.
  [Figure: mean squared error on the training and validation sets, as a function of the number of hidden units (left) and of the number of training epochs (right).]

Structuring the network


• Depending on the application and data, some types of connectivity may be more effective than
having all layers be fully-connected.
• Convolutional neural nets: useful with images or time series that have local structure (around
a location in space or time, respectively). Each hidden unit receives input from a subset of the
input layer’s units, corresponding to a spatial or temporal window in the input image or signal.
Ex: in handwritten digit images, nearby pixels are correlated and give rise to local features such as edges, corners or T-junctions.
A stroke or a digit can be seen as a combination of such primitive features.

• Weight sharing: the values of certain weights are constrained to be equal. For example, with
convolutional neural nets, the weights in the window are the same for every hidden unit. This
corresponds to using a filter that is homogenous in space or time, and helps to detect features
regardless of their location in the input. Ex: edges at different locations and of different orientations.

• Convolutional neural nets with weight sharing have a much smaller number of weights, can be
trained faster, and usually give better results than fully-connected networks.

• This process may be repeated in successive layers and make it possible for the network to learn
a hierarchical representation of the data with each layer learning progressively more complex
and abstract concepts.
Ex: pixels → edges in different orientations → corners, T-junctions, contour segments → parts → objects → classes of objects.

• With temporal data, as in speech recognition or machine translation, one may use time-delay
neural nets or recurrent neural nets.

• Deep learning refers to neural networks having multiple layers that can learn hierarchical rep-
resentations of complex data such as images. Properly trained on large, labeled datasets (using
GPUs), they have achieved impressive results in recent years in difficult tasks such as object
recognition, speech recognition, machine translation or game learning. This success is due to
the ability of the network to learn nonlinear mappings in high-dimensional spaces, and to the
fact that the network is learned end-to-end, i.e., all its weights are optimized with little human
contribution (rather than e.g. having a preprocessing layer that is fixed by hand by an expert).
Ex: instead of extracting preselected features from an image (e.g. SIFT features) or from the speech waveform (e.g. MFCCs), the
first layer(s) of the net learn those features optimally.

Hints

• Hints are properties of the target function that are known to us independent of the training
data. They should be built into the neural net if possible, because they help to learn a good
network, particularly when the training data is limited.

• Ex: invariance. The output of a classifier is often invariant to certain known transformations
of the input:

– Object recognition in images: translation, rotation, scaling.


– Speech recognition: loudness, reverberation.

• Hints may be incorporated in different ways, such as:

– Virtual examples: generate multiple copies of every object to be recognized at different


locations, rotations and scales, and add them as training examples with the same la-
bel. This doesn’t change the optimization algorithm but it increases the training set size
considerably.
– Preprocessing stage: the image could be centered, aligned and rescaled to a standard
position/orientation/scale before being fed to the network.
– Putting hints into the network structure, as with convolutional nets and weight sharing.

11.11 Dimensionality reduction: autoencoders
• Autoencoder: an MLP that, given a training set {x_n}_{n=1}^N ⊂ R^D without labels, tries to recon-
  struct its input (“labels” y_n = x_n), but has a bottleneck layer, whose number of hidden units
  H is smaller than the dimension D of the input x. Its output layer is linear and it is trained
  by least-squares error:

  min_{W_1, W_2} E(W_1, W_2) = ½ Σ_{n=1}^N ‖y_n − f(F(x_n; W_1); W_2)‖²

  where F: R^D → R^H is the encoder network and f: R^H → R^D is the decoder network.

• This forces the overall network f(F(x)) to learn an optimal low-dimensional representation
zn = F(xn ) ∈ RH of each instance xn ∈ RD in the bottleneck (or “code”) layer. The codes (=
outputs of the bottleneck hidden layer) are optimal for reconstructing the input instances.

• If the hidden layers are all linear then the network becomes equivalent to PCA? .
The hidden unit weights need not be the H leading eigenvectors of the data covariance matrix, but they span the same subspace.

If they are nonlinear (e.g. sigmoidal) then it learns a nonlinear dimensionality reduction.

  [Figure: autoencoder architectures (linear and nonlinear encoder/decoder with a bottleneck layer z_1, . . . , z_H), and the 2D hidden representation learned for a dataset of handwritten digits, where instances of the same digit cluster together.]

• If a neural network is trained for classification and has a bottleneck (a hidden layer with
dimension smaller than the input), then the network will learn a low-dimensional representation
in that hidden layer that is optimal for classification (similar to LDA if using linear layers).

12 Local models: radial basis function networks
12.3 Radial basis function networks

• Assume a dataset {(x_n, y_n)}_{n=1}^N with x_n ∈ R^D, and y_n ∈ R^{D'} for regression or y_n ∈ {1, . . . , K}
  for classification.

• Radial basis function (RBF) network:

  f(x) = Σ_{h=1}^H w_h φ_h(x)        φ_h(x) = exp(−½ ‖x − µ_h‖²/σ²)

  where {φ_h(·)}_{h=1}^H are radial basis functions, typically (proportional to) Gaussians with centroids
  {µ_h}_{h=1}^H ⊂ R^D and common width σ, and {w_h}_{h=1}^H ⊂ R^{D'} are weights. We take φ_1(x) ≡ 1 if
  we want to use a bias.
  It is also possible to use a separate width σ_h per RBF.

• They can be seen as a feedforward neural net with a single hidden layer of Gaussian units and
an output layer of linear units. They are not usually constructed with more hidden layers, unlike MLPs.

• Two types of representation of information in neural nets: assume for simplicity that units
output either 0 (not activated) or 1 (activated):

– Distributed representation, as in sigmoidal MLPs: an input x ∈ RD is encoded by the


simultaneous activation of many hidden units.
Every hidden unit splits the space into two half-spaces, so x will activate half of the units on average.

– Local representation, as in RBF nets: an input x ∈ RD is encoded by the simultaneous


activation of few hidden units.
Every unit splits the space into inside/outside a hypersphere, so x activates only a few units, unless the width σ is large.

Each unit is locally tuned to a small area of the input space called its receptive field, and
the input space is paved with such units.
Neurons in the primary visual cortex respond only to highly localized stimuli in retinal position and angle of visual orientation.

[Figure: an RBF network (inputs x_1, . . . , x_d, Gaussian units with centroids µ_h and widths σ_h, linear output weights w_ih), and an example of local vs. distributed representations of three points x_a, x_b, x_c: in the local space of (p_1, p_2, p_3), x_a: (1.0, 0.0, 0.0), x_b: (0.0, 0.0, 1.0), x_c: (1.0, 1.0, 0.0); in the distributed space of (h_1, h_2), x_a: (1.0, 1.0), x_b: (0.0, 1.0), x_c: (1.0, 0.0).]

• Training for regression: we minimize the least-squares error

  E\left(\{w_h, \mu_h\}_{h=1}^{H}, \sigma\right) = \frac{1}{2}\sum_{n=1}^{N} \|y_n - f(x_n)\|^2 + \lambda \sum_{h=1}^{H} \|w_h\|^2
where λ ≥ 0 is a regularization user parameter which controls the smoothness of f. We can use
gradient descent, where the gradient is computed using the chain rule and is similar to that of an MLP (p. 330). However, RBF nets can be trained in an approximate but much simpler and faster
way as follows:
1. Set the centroids {µ_h}_{h=1}^H in an unsupervised way using only the input points {x_n}_{n=1}^N, usually by running K-means with H centroids, or by selecting H points at random.
2. Set the width σ by cross-validation (see below).
It is also possible to set σ to a rough value as follows. Having run k-means, we compute the distance from each centroid µh
to the farthest point xn in its cluster. We then set σ to the average such distance over the H centroids.

3. Given the centroids and width, the values φ_h(x_n) are fixed, and the weights {w_h}_{h=1}^H are determined by optimizing E, which reduces to a simple linear regression. We solve the linear system✐ (recall the normal equations of chapter 5):

  \left(\Phi\Phi^T + \lambda I\right) W = \Phi Y^T, \qquad \Phi_{H\times N} = (\phi_h(x_n))_{hn}, \quad W_{H\times D'} = (w_1, \dots, w_H)^T, \quad Y_{D'\times N} = (y_1, \dots, y_N).
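A small sketch of this approximate training procedure (my own variable names; it uses step 1’s “random points” option instead of K-means, and a rough width heuristic, so the details differ from a tuned implementation):

# RBF network for regression: random centroids, rough width, and weights from
# the regularized linear system (Phi Phi^T + lambda I) W = Phi Y^T.
import numpy as np

def rbf_design(X, centroids, sigma):
    # X: N x D, centroids: H x D  ->  Phi: H x N with Phi[h, n] = phi_h(x_n)
    d2 = ((centroids[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / sigma ** 2)

def train_rbf(X, Y, H=20, lam=1e-3, rng=np.random.default_rng(0)):
    centroids = X[rng.choice(len(X), H, replace=False)]          # step 1: random input points
    sigma = np.median(np.linalg.norm(centroids[:, None] - centroids[None], axis=-1))  # step 2: rough width
    Phi = rbf_design(X, centroids, sigma)                        # H x N, now fixed
    W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(H), Phi @ Y)  # step 3: linear system (here Y is N x D')
    return centroids, sigma, W

def predict_rbf(Xnew, centroids, sigma, W):
    return rbf_design(Xnew, centroids, sigma).T @ W

# Ex.: X = np.linspace(0, 1, 200)[:, None]; Y = np.sin(2*np.pi*X) + 0.1*np.random.randn(200, 1)
#      c, s, W = train_rbf(X, Y);  Yhat = predict_rbf(X, c, s, W)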
• Training for classification: we minimize the cross-entropy error

  E\left(\{w_h, \mu_h\}_{h=1}^{H}, \sigma\right) = -\sum_{n=1}^{N}\sum_{i=1}^{K} y_{in} \log\theta_{in} + \lambda \sum_{h=1}^{H} \|w_h\|^2

  \theta_{in} = \frac{\exp(f_i(x_n))}{\sum_{k=1}^{K} \exp(f_k(x_n))}, \qquad f_i(x) = \sum_{h=1}^{H} w_{ih}\, \phi_h(x).
That is, we use an RBF net with H BFs and K outputs, which we pass through a softmax.
This can either be optimized by gradient descent, or trained approximately as above (but the
optimization over the weights cannot be solved by a linear system and requires gradient descent
itself).
• Model selection: the complexity of a RBF net is controlled by:
– The number H of basis functions: H↓ high bias, low variance; H↑ low bias, high variance.
– The regularization parameter λ ≥ 0: the larger, the smoother the RBF net.
Numerically, we typically need λ ≳ 10⁻⁷ to make the linear system sufficiently well conditioned.

– The width σ (if not optimized over or set by hand): the larger, the smoother the RBF net.
They are all set by cross-validation. We usually search over a range of values of H, log λ and log σ.

• Because they are localized, RBF nets may need many basis functions to achieve a desired
accuracy.
• Universal approximation: any continuous function (satisfying mild assumptions) from R^D to R^{D'} can be approximated by an RBF net with an error as small as desired (by using sufficiently
many basis functions).
Intuitive proof: we can enclose every input instance (or region) with a basis function having as centroid that instance and a small
enough width so it has little overlap with other BFs. This way we “grid” the input space RD with sufficiently many BFs, so that
only one BF is active for a given instance xn . Finally, we set the weight of each BF to the desired function value. This gives
an approximately piecewise constant approximation to the function (essentially equal to a nonparametric kernel smoother). Its
accuracy may be increased by using more hidden units and placing a finer grid in the input.

12.5 Normalized basis functions
• With Gaussian BFs, it is possible that φh (x) = 0 for all h = 1, . . . , H for some x ∈ RD . This
happens if x is far enough (wrt σ) from each centroid.

• We can define the basis functions as normalized Gaussians:

  \phi_h(x) = \frac{\exp\left(-\tfrac{1}{2}\|x - \mu_h\|^2/\sigma^2\right)}{\sum_{j=1}^{H} \exp\left(-\tfrac{1}{2}\|x - \mu_j\|^2/\sigma^2\right)} \;\Longrightarrow\; \phi_h(x) \in (0, 1) \;\text{ and }\; \sum_{h=1}^{H} \phi_h(x) = 1.

Now, even if x is far (wrt σ) from each centroid, at least one BF will have a nonzero value.

• φh (x) can be seen as the posterior probability p(h|x) of BF h given input x assuming a Gaussian
mixture model p(x), where component h has mean µ_h, and all components have the same proportion 1/H and the same isotropic covariance matrix Σ_h = σ²I.

• The nonparametric kernel smoother of chapter 8 using a Gaussian kernel of bandwidth h > 0

  g(x) = \sum_{n=1}^{N} \frac{K\left(\|x - x_n\|/h\right)}{\sum_{n'=1}^{N} K\left(\|x - x_{n'}\|/h\right)}\, y_n

is identical to a RBF network with the following special choices:

– it uses normalized BFs;


– it has H = N BFs, one per input data point;
– the centroids are the input points: µn = xn ;
– the weights equal the output vectors: wn = yn .
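To see this equivalence concretely, here is a tiny sketch (the function name and bandwidth are mine) of the kernel smoother written as a normalized-RBF prediction with H = N, µ_n = x_n and w_n = y_n:

# Nadaraya-Watson kernel smoother = normalized RBF net with one BF per data point.
import numpy as np

def kernel_smoother(Xtrain, Ytrain, Xtest, h):
    d = np.linalg.norm(Xtest[:, None, :] - Xtrain[None, :, :], axis=-1)   # N' x N distances
    K = np.exp(-0.5 * (d / h) ** 2)                  # Gaussian kernel values
    Phi = K / K.sum(axis=1, keepdims=True)           # normalized BFs: each row sums to 1
    return Phi @ Ytrain                              # the weights are just the outputs y_n

# Ex.: Yhat = kernel_smoother(X, Y, Xnew, h=0.1)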

13 Kernel machines (support vector machines, SVMs)
• Discriminant-based method: it models the classification boundary, not the classes themselves.
• It can produce linear classifiers (linear SVMs) or nonlinear classifiers (kernel SVMs).
• Very effective in practice with certain problems:
– It gives good generalization to test cases.
– The optimization problem is convex and has a unique solution (no local optima).
– It can be trained on reasonably large datasets.
– Special kernels can be defined for many applications.
• It can be extended beyond classification to solve regression, dimensionality reduction, outlier
detection and other problems.
The basic approach is the same in all cases: maximize the margin and penalize deviations, resulting in a convex quadratic program.

• We have seen several linear classifiers: logistic regression, Gaussian classes with common co-
variance, now linear SVMs. . . They give different results because they have different inductive
bias, i.e., make different assumptions (the objective function, etc.).

13.2 Binary classification, linearly separable case:


optimal separating hyperplane
• Consider a dataset {x_n, y_n}_{n=1}^N where x_n ∈ R^D and y_n ∈ {−1, +1}, and a linear discriminant function g(x) = w^T x + w_0, i.e., a hyperplane, where w ∈ R^D and w_0 ∈ R. The classification rule induced by the discriminant is given by sgn(w^T x + w_0) ∈ {−1, +1}.


• Assume the two classes are linearly separable, so there exist an infinite number of separating
hyperplanes. Although they are all equally good on the training set, they differ with test data.
Which one is best?
• We define the margin of a separating hyperplane as the distance to the hyperplane of the closest
instance to it. We want to find the hyperplane having the largest margin.
• This has two advantages:
– It provides a unique solution to the separating hyperplane problem.
– It gives better classification performance on test data.
If noise perturbs slightly an instance xn , it will still be correctly classified if it is not too close to the boundary.

• The signed distance of a point xn to the hyperplane is (wT xn + w0 )/kwk ∈ R, hence its
absolute value is yn (wT xn + w0 )/kwk ≥ 0 (since yn ∈ {−1, +1}). Let xn∗ be the closest
instance to the hyperplane and ρ = yn∗ (wT xn∗ + w0 )/kwk > 0 its distance (the margin), i.e.,
yn (wT xn + w0 )/kwk ≥ ρ ∀n = 1, . . . , N. To make this distance as large as possible, we can
maximize ρ subject to yn (wT xn + w0 )/kwk ≥ ρ ∀n = 1, . . . , N. But yn (wT xn + w0 )/kwk
is invariant to rescaling both w and w0 by a constant factor, so we arbitrarily fix this factor
by requiring kwk = 1/ρ (or, equivalently, by requiring yn (wT xn + w0 ) = 1 for the instance xn
closest to the hyperplane). Then, since maximizing ρ is the same as minimizing 1/(2ρ²) = ½‖w‖², we finally reach the following maximum margin optimization problem:

  \text{Primal QP:} \quad \min_{w \in \mathbb{R}^D,\, w_0 \in \mathbb{R}} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_n(w^T x_n + w_0) \ge 1, \; n = 1, \dots, N.

• This is a constrained optimization problem, specifically a convex quadratic program (QP), de-
fined on D +1 variables (w, w0) with N linear constraints and a quadratic objective function. It
has a unique minimizer (w, w0), which can be found with different QP algorithms. The closest
instances to each side of the hyperplane are at a distance? 1/kwk from the hyperplane.
• Solving the primal problem, a QP on D + 1 variables, is equivalent to solving the following
dual problem, another convex QP but on N variables α = (α1 , . . . , αN )T and N + 1 constraints
(α1 , . . . , αN are the Lagrange multipliers of the N constraints in the primal QP):
  \text{Dual QP:} \quad \min_{\alpha \in \mathbb{R}^N} \; \frac{1}{2}\sum_{n,m} y_n y_m (x_n^T x_m)\, \alpha_n \alpha_m - \sum_{n=1}^{N} \alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n \alpha_n = 0, \;\; \alpha_n \ge 0, \; n = 1, \dots, N.

This can be solved in O(N³ + N²D) time and O(N²) space (because we need the dot products x_n^T x_m for n, m = 1, . . . , N). The optimal α ∈ R^N has the property that α_n > 0 only for those
instances that lie on the margin (the closest ones to the hyperplane), i.e., yn (wT xn + w0 ) = 1,
called the support vectors (SVs). For all other instances, which lie beyond the margin, αn = 0.
• From the optimal α ∈ RN , we recover the solution (w, w0) as follows:
  w = \sum_{n=1}^{N} \alpha_n y_n x_n = \sum_{n \in \text{SVs}} \alpha_n y_n x_n \qquad\qquad w_0 = y_n - w^T x_n \;\text{ for any support vector } x_n ?

so the optimal weight vector can be written as a l.c. of the SVs. This allows kernelization later.

• The resulting, linear discriminant (the linear SVM ) is


  g(x) = w^T x + w_0 = \sum_{n=1}^{N} \alpha_n y_n\, x_n^T x + w_0 = \sum_{n \in \text{SVs}} \alpha_n y_n\, x_n^T x + w_0.
• Properties of the support vectors:
– They are the instances that are closest to the boundary and thus the more difficult ones
to classify.
– The number of SVs is usually much smaller than N, though this depends on the dimension
D and the problem.
– Solving the problem using as training set just the SVs gives the same result.
In practice, we don’t know which instances are SVs, so we have to solve the QP using the entire dataset.
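As an illustration (not part of the notes), scikit-learn’s SVC solves this dual QP; with a linear kernel we can inspect the support vectors and check that w is a linear combination of them (SVC stores the products α_n y_n in dual_coef_):

# Maximum-margin linear SVM on a linearly separable toy problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),    # class -1
               rng.normal([+2, +2], 0.5, (50, 2))])   # class +1
y = np.hstack([-np.ones(50), np.ones(50)])

svm = SVC(kernel="linear", C=1e6).fit(X, y)           # very large C ~ hard margin
print(len(svm.support_vectors_))                      # only a few instances are SVs
w = svm.dual_coef_ @ svm.support_vectors_             # w = sum_n alpha_n y_n x_n over the SVs
print(np.allclose(w, svm.coef_), 2 / np.linalg.norm(w))  # w matches coef_; margin width 2/||w||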
[Figures: the maximum-margin hyperplane in the linearly separable and not linearly separable cases, with instances (a)–(d) at different positions relative to the margin; and the loss functions (0/1 loss, hinge loss, cross-entropy, squared error) as a function of y·g(x).]
In terms of ξn : (a) ξn = 0, (b) ξn = 0, (c) 0 < ξn < 1, (d) ξn > 1;
In terms of αn : (a) αn = 0, (b) 0 < αn < C, (c)–(d) αn = C;
(b)–(d) are SVs (circled instances ⊕ ⊙, on or beyond the margin)
13.3 Binary classification, not linearly separable case:
soft margin hyperplane
• If the dataset is not linearly separable, we look for the hyperplane that incurs least classification
error while having the largest margin. For each data point xn , we allow a deviation ξn ≥ 0 (a
slack variable) from the margin, but penalize such deviations proportionally to C > 0:
  \text{Primal QP:} \quad \min_{w \in \mathbb{R}^D,\, w_0 \in \mathbb{R},\, \xi \in \mathbb{R}^N} \; \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad y_n(w^T x_n + w_0) \ge 1 - \xi_n, \;\; \xi_n \ge 0, \; n = 1, \dots, N.

C > 0 is a user parameter that controls the tradeoff between maximizing the margin (2/‖w‖) and minimizing the deviations or “soft error” (\sum_{n=1}^{N} \xi_n):
– C ↑: tries to avoid every single error but reduces the margin so it can overfit;
– C ↓: enlarges the margin but ignores most errors so it can underfit.
C is set by cross-validation. Typically, one chooses C on a log scale, e.g. C ∈ {10−6 , 10−5 , . . . , 10+5 , 10+6 }.

• This is again a convex QP on D + 1 + N variables and its unique minimizer (w, w0, ξ) can be
found with different QP algorithms. It is equivalent to solving the dual QP on N variables:
  \text{Dual QP:} \quad \min_{\alpha \in \mathbb{R}^N} \; \frac{1}{2}\sum_{n,m} y_n y_m (x_n^T x_m)\, \alpha_n \alpha_m - \sum_{n=1}^{N} \alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n \alpha_n = 0, \;\; C \ge \alpha_n \ge 0, \; n = 1, \dots, N

which differs from the dual QP of the linearly separable case in the extra “αn ≤ C” constraints.
The optimal α ∈ RN has the property that αn > 0 only for those instances that lie on or within
the margin or are misclassified, i.e., yn (wT xn + w0 ) ≤ 1, called the support vectors (SVs). For
all other instances, which lie beyond the margin, αn = 0. As a result, for each training point
n = 1, . . . , N we can have:
– ξn = 0: xn is correctly classified and beyond the margin (here, αn = 0).
– 0 < ξn < 1: xn is correctly classified but within the margin (here, αn = C).
– ξn ≥ 1: xn is misclassified (here, αn = C).
• From the optimal α ∈ RN , we recover the solution (w, w0) as before:
  w = \sum_{n=1}^{N} \alpha_n y_n x_n = \sum_{n \in \text{SVs}} \alpha_n y_n x_n \qquad\qquad w_0 = y_n - w^T x_n \;\text{ for any } x_n \text{ with } 0 < \alpha_n < C.

• The resulting, linear discriminant (the linear SVM ) is


  g(x) = w^T x + w_0 = \sum_{n=1}^{N} \alpha_n y_n\, x_n^T x + w_0 = \sum_{n \in \text{SVs}} \alpha_n y_n\, x_n^T x + w_0.

• The primal problem can be written equivalently but without constraints by eliminating ξn as:
  \min_{w \in \mathbb{R}^D,\, w_0 \in \mathbb{R}} \; \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} \left[1 - y_n g(x_n)\right]_+ \qquad \text{where } g(x) = w^T x + w_0.
This defines the hinge loss, which penalizes errors linearly:
  \left[1 - y_n g(x_n)\right]_+ = \max\left(0,\, 1 - y_n g(x_n)\right) = \begin{cases} 0 & \text{if } y_n g(x_n) \ge 1 \\ 1 - y_n g(x_n) & \text{otherwise} \end{cases}
and is more robust against outliers compared to the quadratic loss (1 − yn g(xn ))2 .
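The unconstrained form can be minimized directly by (sub)gradient descent; below is a rough sketch (the step size and iteration count are arbitrary, not a tuned solver):

# Linear soft-margin SVM by subgradient descent on
#   0.5*||w||^2 + C * sum_n max(0, 1 - y_n (w^T x_n + w_0)),   with y_n in {-1, +1}.
import numpy as np

def svm_hinge_descent(X, y, C=1.0, lr=1e-3, iters=5000):
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(iters):
        viol = y * (X @ w + w0) < 1                    # points with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# w, w0 = svm_hinge_descent(X, y);  predictions: np.sign(X @ w + w0)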
13.5 Kernelization: kernel SVMs for nonlinear classification
• We map the data points x to a higher-dimensional space. Learning a linear model in the new
space corresponds to learning a nonlinear model in the original space.
• Assume we map x ∈ R^D → z = φ(x) = (φ_1, . . . , φ_K)^T ∈ R^K where K > D (usually, K is much larger or even infinite), and z_1 = φ_1(x) ≡ 1 (so we don’t have to write a bias term explicitly). The discriminant is now g(x) = w^T φ(x) = \sum_{i=1}^{K} w_i \phi_i(x).

• The primal and dual soft-margin QPs are as before but replacing xn with φ(xn ):
  \text{Primal QP:} \quad \min_{w,\; \xi \in \mathbb{R}^N} \; \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad y_n(w^T \phi(x_n)) \ge 1 - \xi_n, \;\; \xi_n \ge 0, \; n = 1, \dots, N

  \text{Dual QP:} \quad \min_{\alpha \in \mathbb{R}^N} \; \frac{1}{2}\sum_{n,m} y_n y_m \underbrace{(\phi(x_n)^T \phi(x_m))}_{K(x_n, x_m)}\, \alpha_n \alpha_m - \sum_{n=1}^{N} \alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n \alpha_n = 0, \;\; C \ge \alpha_n \ge 0, \; n = 1, \dots, N

  \text{Optimal solution:} \quad w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n) = \sum_{n \in \text{SVs}} \alpha_n y_n \phi(x_n)

  \text{Discriminant:} \quad g(x) = w^T \phi(x) = \sum_{n=1}^{N} \alpha_n y_n \underbrace{(\phi(x_n)^T \phi(x))}_{K(x_n, x)}.
• Kernelization: we replace the inner product φ(x_n)^T φ(x_m) between basis functions with a kernel
function K(xn , xm ) that operates on a pair of instances in the original input space. This saves
the work of having to construct φ(·), whose dimension can be very high or infinite, and taking
the dot product. All we need in practice is the kernel K(·, ·), not the basis functions φ(·).
• The resulting, nonlinear discriminant (the kernel SVM ) is
  g(x) = w^T \phi(x) = \sum_{n=1}^{N} \alpha_n y_n\, \phi(x_n)^T \phi(x) = \sum_{n=1}^{N} \alpha_n y_n K(x_n, x) = \sum_{n \in \text{SVs}} \alpha_n y_n K(x_n, x).

Again, we can use the kernel K(·, ·) directly and don’t need the basis functions.
• Many other algorithms that depend on dot products of input instances have been kernelized
(hence made nonlinear) in the same way: PCA, LDA, etc.
• The fundamental object in a kernel SVM, which determines the resulting discriminant, is the
kernel, and the N × N Gram matrix of dot products it defines on a training set of N points:
K = (K(xn , xm ))nm = (φ(xn )T φ(xm ))nm .
For this matrix and the dual QP to be well defined, the kernel K(·, ·) must be a symmetric positive definite function, i.e., it must
satisfy u^T K u ≥ 0 ∀u ∈ R^N, u ≠ 0.

• A kernel SVM can be seen as a basis function expansion (as in RBF networks), but the number
of BFs is not set by the user (possibly by cross-validation); it is determined automatically
during training, and each BF corresponds to one input instance that is a support vector.
• At test time, a kernel SVM works like a template classifier, by “comparing” the test pattern x
with the SVs (templates) by means of the kernel, and combining these comparisons into g(x)
to compute the final discriminant. Ex: for MNIST handwritten digits, the SVs are training images, and

  g(x) = \sum_{n \in \text{SVs}} \alpha_n y_n K(x_n, x) = \alpha_1 y_1 K(\text{SV}_1, x) + \alpha_2 y_2 K(\text{SV}_2, x) + \dots

where each term K(x_n, x) measures how similar x is to that SV.
[Figure: decision boundaries of a kernel SVM on a 2D dataset — (left) polynomial kernel of degree 2; (right) Gaussian kernels of widths s² = 2, 0.5, 0.25, 0.1, where smaller widths give wigglier boundaries. Circled instances ⊕, ⊙ are SVs; +, · are non-SVs.]

13.6 Vectorial kernels


• Intuitively, the kernel K(x, y) measures the similarity between two input points x, y. Different
kernels may be useful in different applications.
• The most practically used kernels (and their hyperparameters) are, for x, y ∈ RD :
– Polynomial kernel of degree q: K(x, y) = (xT y + 1)q .
Ex. for D = 2 and q = 2: K(x, y) = (x_1y_1 + x_2y_2 + 1)² = 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1²y_1² + x_2²y_2², which corresponds to the inner product of the basis function φ(x) = (1, √2 x_1, √2 x_2, √2 x_1x_2, x_1², x_2²)^T ∈ R⁶ ✐.

For q = 1 we recover the linear kernel of the linear SVM.



– Gaussian or RBF kernel of width σ: K(x, y) = exp(−½‖x − y‖²/σ²).
The Gaussian kernel implicitly defines a certain infinite-dimensional basis function φ(x).

– Sigmoidal (neural net) kernel of hyperparameters a, b ∈ R: K(x, y) = tanh (axT y + b).


• In addition to the kernel and its hyperparameters (q, σ, etc.), the user also has to select the
C hyperparameter from the SVM. With kernel SVMs (nonlinear discriminants) this has the
following effect:
– C ↑ tries to avoid any error by discouraging any positive ξn , which leads to a wiggly
boundary in the original input space with a small margin, so it can overfit.
– C ↓ encourages a small kwk and hence a large margin, which leads to a smoother boundary
in the original input space, but ignores most errors, so it can underfit.
• Likewise, a large q or small σ result in a more flexible boundary.
• The best values of the hyperparameters are found by cross-validation. Typically, we do a grid
search, i.e., we select a set of values for each hyperparameter (say, C and σ), train an SVM for
each combination of hyperparameter values (C, σ), and pick the SVM with the lowest error on
the validation set.
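With scikit-learn this grid search can be written as below (a sketch: the grids are arbitrary, X and y are assumed given, and note that sklearn parametrizes the Gaussian kernel as exp(−γ‖x − y‖²), i.e., γ = 1/(2σ²)):

# Cross-validated grid search over (C, gamma) for a Gaussian-kernel SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {"C": np.logspace(-2, 4, 7),          # C on a log scale
              "gamma": np.logspace(-3, 2, 6)}      # gamma = 1/(2 sigma^2), also on a log scale
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)                                   # X: N x D inputs, y: N class labels
print(search.best_params_, search.best_score_)
clf = search.best_estimator_                       # the SVM refit on the whole training set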
• It is possible to define kernels for data where the instances are not represented in vector form.
Ex: if x, y are documents using an unknown dictionary, we could define K(x, y) as the number of words they have in common.

13.9 Multiclass kernel machines
With K > 2 classes, there are two common ways to define the decision for a point x ∈ RD :
• One-vs-all : we use K SVMs gk (x), k = 1, . . . , K.
– Training: SVM gk is trained to classify the training points of class Ck (label +1) vs the
training points of all other classes (label −1).
– Testing: choose Ck if k = arg maxi=1,...,K gi (x).
• One-vs-one: we use K(K − 1)/2 SVMs g_ij(x), i, j = 1, . . . , K, i ≠ j.
– Training: SVM gij is trained to classify the training points of class Ci (label +1) vs the
training points of class Cj (label −1); training points from other classes are not used.
– Testing: choose C_k if k = arg max_{i=1,...,K} \sum_{j \ne i} g_{ij}(x). Also possible: for each class k, count the number of times g_kj(x) > 0 for j ≠ k, and pick the class with most votes.

13.10 Kernel machines for regression: support vector regression


• We are given a sample {(x_n, y_n)}_{n=1}^N with x_n ∈ R^D and y_n ∈ R.
The generalization to the case where y_n has dimension D' > 1 is straightforward.

• We consider first linear regression: f (x) = wT x + w0 .


• Instead of the least-squares error (y_n − f(x_n))², in support vector regression we use the ǫ-sensitive loss function:

  \left[|y_n - f(x_n)| - \epsilon\right]_+ = \max\left(0,\, |y_n - f(x_n)| - \epsilon\right) = \begin{cases} 0 & \text{if } |y_n - f(x_n)| < \epsilon \\ |y_n - f(x_n)| - \epsilon & \text{otherwise} \end{cases}
which means we tolerate errors up to ǫ > 0 and that errors beyond ǫ have a linear penalty and
not a quadratic one. This error function is more robust to noise and outliers; the estimated f
will be less affected by them.
• As with the soft-margin hyperplane, we introduce slack variables ξ_n⁺, ξ_n⁻ to account for deviations (positive and negative) out of the ǫ-zone. We get the following primal QP:

  \min_{w \in \mathbb{R}^D,\, w_0 \in \mathbb{R},\, \xi^+, \xi^- \in \mathbb{R}^N} \; \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} (\xi_n^+ + \xi_n^-) \quad \text{s.t.} \quad -\epsilon - \xi_n^- \le y_n - (w^T x_n + w_0) \le \epsilon + \xi_n^+, \qquad \xi_n^+, \xi_n^- \ge 0, \; n = 1, \dots, N.

• This is a convex QP on D + 1 + 2N variables and its unique minimizer (w, w0, ξ + , ξ− ) can be
found with different QP algorithms. It is equivalent to solving the dual QP on 2N variables:
  \min_{\alpha^+, \alpha^- \in \mathbb{R}^N} \; \frac{1}{2}\sum_{n,m} (\alpha_n^+ - \alpha_n^-)(\alpha_m^+ - \alpha_m^-)(x_n^T x_m) + \epsilon \sum_{n=1}^{N} (\alpha_n^+ + \alpha_n^-) - \sum_{n=1}^{N} y_n (\alpha_n^+ - \alpha_n^-)

  \text{s.t.} \quad \sum_{n=1}^{N} (\alpha_n^+ - \alpha_n^-) = 0, \qquad C \ge \alpha_n^+, \alpha_n^- \ge 0, \; n = 1, \dots, N.

The optimal α⁺, α⁻ ∈ R^N have the property that α_n⁺ = α_n⁻ = 0 only for those instances that lie within the ǫ-tube; these are the instances that are fitted with enough precision. The support
vectors satisfy either αn+ > 0 or αn− > 0. As a result, for each training point n = 1, . . . , N we
can have:

– αn+ = αn− = 0: xn is fitted within the ǫ-tube.
– Either αn+ or αn− is in (0, C): xn is fitted on the boundary of the ǫ-tube.
We use these instances to calculate w0 , since they satisfy yn = wT xn + w0 ± ǫ if αn∓ > 0.
– αn+ = C or αn− = C: xn is fitted outside the ǫ-tube.

• From the optimal α+ , α− ∈ RN , we recover the solution (w, w0 ):


  w = \sum_{n=1}^{N} (\alpha_n^+ - \alpha_n^-)\, x_n \qquad\qquad w_0 = y_n \mp \epsilon - w^T x_n \;\text{ for any } n \text{ with } \alpha_n^{\pm} > 0.

• Hence, the fitted line can be written as a weighted sum of the support vectors:
  f(x) = w^T x + w_0 = \sum_{n=1}^{N} (\alpha_n^+ - \alpha_n^-)(x_n^T x) + w_0.

• Kernelization: we replace the dot products xTn xm (in the dual QP) or xTn x (in the regression
line) with K(xn , xm ) or K(xn , x), respectively, where K(·, ·) is a kernel (polynomial, Gaussian,
etc.). This results in a nonlinear regression function.
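A short usage sketch (the values of C, ǫ and the kernel width are arbitrary; in practice they are set by cross-validation as in classification):

# epsilon-insensitive support vector regression with a Gaussian kernel.
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 8, 200)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
y_hat = svr.predict(X)
print(len(svr.support_))     # only points on or outside the eps-tube become support vectors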

[Figure: support vector regression fits with a linear kernel, a quadratic kernel and Gaussian kernels of widths s² = 5 and s² = 0.1, showing the ǫ-tube around the fitted function; instances (a) α_n⁺ = α_n⁻ = 0 (inside the tube), (b) 0 < α_n⁺ < C (on the tube boundary), (c) α_n⁺ = C (outside the tube); non-SVs ×, SVs on the margin ⊗, SVs beyond the margin ⊠. Also, the ǫ-sensitive loss compared with the least-squares loss.]

14 Graphical models

Basic rules of probability:

  Joint distribution: p(X = x, Y = y).

  Conditioning (product rule): p(Y = y \,|\, X = x) = \frac{p(X = x, Y = y)}{p(X = x)}.

  Marginalizing (sum rule): p(X = x) = \sum_y p(X = x, Y = y).

  Bayes’ theorem (inverse probability): p(X = x \,|\, Y = y) = \frac{p(Y = y \,|\, X = x)\, p(X = x)}{p(Y = y)}.

Directed graphical models

• Also called Bayesian networks, belief networks, probabilistic networks.


• They represent the interaction between random variables visually, through a directed acyclic
graph (DAG):
– Each node represents one random variable X.
– A directed arc Y → X indicates that Y has a direct influence on X (possibly but not
necessarily indicating causality).
– At node X, its probability distribution is conditional only on the nodes that point to
it, and has tunable parameters (e.g. a table of probabilities for discrete variables, a
mean/covariance for Gaussian variables).
Ex: if X1 → X3 and X2 → X3 over (X1 , . . . , Xd ), then we write p(X3 |all variables) = p(X3 |X1 , . . . , Xd ) as p(X3 |X1 , X2 ).

• Types of independence of random variables:


– X and Y are independent iff p(X, Y ) = p(X) p(Y ).
Equivalently? p(X|Y ) = p(X) and p(Y |X) = p(Y ). Knowing Y tells me nothing about X and vice versa.

– X and Y are conditionally independent given Z iff p(X, Y |Z) = p(X|Z) p(Y |Z).
Equivalently? p(X|Y, Z) = p(X|Z) and p(Y |X, Z) = p(Y |Z).

• A graphical model represents a joint distribution by making conditional independence assump-


tions. In a graphical model, not all nodes are connected; typically, each node is connected to
a few others only. The subgraphs containing these connections imply assumptions that the
model makes about conditional independence among groups of variables. This allows inference
over a set of variables to be decomposed into smaller sets of variables, each requiring local
calculations.
• In this chapter, we focus mostly on discrete, binary random variables, and illustrate how to do
inference in simple cases of graphical models.

Undirected graphical models


• Also called Markov random fields (MRFs).
• Convenient to represent symmetric interactions between random variables.

Example Notation: p(R) means p(R = 1), p(W̄|R) means p(W = 0|R = 1), etc.

Ex: “Modeling wet grass and rain”. A graphical model with completely specified conditional distributions at each node (in the example, tables giving p(R) and p(W|R) for all combinations of values of R and W) specifies the joint distribution over all variables (in the example, p(R, W) = p(R) p(W|R)), from which we can compute any specific distribution over groups of variables:

• p(R) = ✐, p(W |R) = ✐, p(W |R) = ✐


• Marginal: p(W) = \sum_R p(R, W) = p(W|R)\, p(R) + p(W|R̄)\, p(R̄) = ✐

• Bayes’ rule: p(R|W) = \frac{p(W|R)\, p(R)}{p(W)} = ✐

p(R̄), p(W̄|R), p(W̄|R̄) can be derived from the values above, so they are not stored explicitly.
Inverts the dependency to give a diagnosis.
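To make the arithmetic concrete, here is a tiny sketch with made-up table values (the numbers are hypothetical, not from the notes), computing the marginal p(W) by the sum rule and the diagnostic p(R|W) by Bayes’ rule:

# Two-node network R -> W with hypothetical tables p(R = 1) and p(W = 1 | R).
p_R = 0.4                               # p(R = 1), assumed value
p_W_given_R = {1: 0.9, 0: 0.2}          # p(W = 1 | R = r), assumed values

# Sum rule: p(W = 1) = sum_r p(W = 1 | R = r) p(R = r)
p_W = p_W_given_R[1] * p_R + p_W_given_R[0] * (1 - p_R)

# Bayes' rule (diagnosis): p(R = 1 | W = 1) = p(W = 1 | R = 1) p(R = 1) / p(W = 1)
p_R_given_W = p_W_given_R[1] * p_R / p_W
print(p_W, p_R_given_W)                 # 0.48 and 0.75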
14.2 Canonical cases for conditional independence
• By repeated application of the rule of conditional probability, we can write any joint distribution over d random variables without assumptions as follows (✐ what graphical model corresponds to this?):

  p(X_1, X_2, \dots, X_d) = p(X_1)\, p(X_2|X_1)\, p(X_3|X_1, X_2) \cdots p(X_d|X_1, X_2, \dots, X_{d-1}).
• In particular, consider 3 random variables X, Y and Z. Their joint distribution can always be
written, without assumption, as p(X, Y, Z) = p(X) p(Y |X) p(Z|X, Y ). This can be simplified
with assumptions given by conditional independencies, i.e., by removing arrows from the full
graphical model X Y Z . There are 3 canonical cases.

Case 1: head-to-tail connection


• X → Y → Z means p(X, Y, Z) = p(X) p(Y|X) p(Z|Y).
Note p(Z|Y), not p(Z|X, Y); we removed X → Z.
Ex: “Cloudy weather causes rain, which in turn causes wet grass”.

• Knowing Y tells Z everything, knowing also X adds nothing,


so X and Z are independent given Y : p(Z|X, Y ) = p(Z|Y ) ✐

• Typically, X is the cause of Y and Y is the cause of Z.

• p(R) = ✐, p(W ) = ✐, p(W |C) = ✐, p(C|W ) = ✐

Case 2: tail-to-tail connection


• Z ← X → Y means p(X, Y, Z) = p(X) p(Y|X) p(Z|X).
Note p(Z|X), not p(Z|X, Y); we removed Y → Z.
Ex: “Cloudy weather causes both rain and makes us less likely to turn the sprinkler on”.

• If we don’t know X: Y and Z are dependent.


If we know X: Y and Z become independent,
p(Y, Z|X) = p(Y |X) p(Z|X) ✐

• Typically, X is the cause of Y and Z.

• p(C|R) = ✐, p(R|S) = ✐, p(R|S) = ✐

Case 3: head-to-head connection


• X → Z ← Y means p(X, Y, Z) = p(X) p(Y) p(Z|X, Y).
Note p(Y), not p(Y|X); we removed X → Y.
Ex: “Wet grass is caused by rain and by turning the sprinkler on”.

• If we don’t know Z: X and Y are independent,


p(X, Y ) = p(X) p(Y ) ✐
If we know Z: X and Y become dependent.

• Typically, Z has two independent causes X and Y .

• p(W ) = ✐, p(W |S) = ✐, p(S|W ) = ✐,


p(S|R, W ) < p(S|W ) ✐ (explaining away),
p(S|R̄, W) > p(S|W) ✐
A more general example
• Joint distribution without assumptions (requires 15 parameters):
p(C, S, R, W ) = p(C) p(S|C) p(R|C, S) p(W |C, S, R).

• Joint distribution with assumptions (requires 9 parameters):


p(C, S, R, W ) = p(C) p(S|C) p(R|C) p(W |S, R).

• p(W |C) = ✐, p(C|W ) = ✐


A real-world example: QMR-DT, a Bayesian network for medical diagnosis, with binary variables
for diseases (flu, asthma. . . ) and symptoms (fatigue, dyspnea, wheezing, fever, chest-pain. . . ),
and directed arcs from each disease to the symptoms it causes.

Summary For d variables X1 , X2 , . . . , Xd (binary, but this generalizes to discrete or continuous variables):

• The general expression for a joint distribution without assumptions (requires 2^d − 1 parameters):

  p(X_1, X_2, \dots, X_d) = p(X_1)\, p(X_2|X_1)\, p(X_3|X_1, X_2) \cdots p(X_d|X_1, X_2, \dots, X_{d-1})

simplifies under the conditional independence assumptions of the graphical model to:

  p(X_1, X_2, \dots, X_d) = \prod_{i=1}^{d} p(X_i \,|\, \text{parents}(X_i))

which requires 2^{|parents(X_i)|} parameters Θ_i at node X_i: p(X_i | parents(X_i)) = p(X_i | parents(X_i); Θ_i).


• In turn, this simplifies the computations required for:
– Inference (testing): answering questions of the form “p(X1 , X7 |X3 , X5 ) = ?”.
– Learning (training): learning the parameters at each node (tables of probability values).
• In graphical models, we need not specify explicitly some variables as “inputs” and others as “outputs” as in a supervised problem (e.g. classification). Having the trained graphical model (i.e., with known values for the conditional distribution table at each node), we can set the values of any subset of the random variables S_inputs ⊂ {X_1, . . . , X_d} (e.g. based on observed data) and do inference over any other subset S_outputs ⊂ {X_1, . . . , X_d} (unobserved variables), i.e., compute p(S_outputs | S_inputs), where S_outputs ∩ S_inputs = ∅. The graphical model is a “probabilistic database”, a machine that can answer queries regarding the values of random variables.
• We can also have hidden variables, which are never observed, but which help in defining the
dependency structure.
Ex: basket analysis. We want to find dependencies between items sold for “baby food”, “diapers” and “milk” (a customer buying
one of these is likely to buy the other two). Instead of putting arcs among these, we may designate a hidden node “baby at home”
as the hidden cause of the consumption of these 3 items.

• Training, i.e., estimating the parameters of the graphical model given a dataset, is typically
done by maximum likelihood.
• The computational complexity of training and inference depends on properties of the graphical
model:
– If the graph is a tree: exact inference takes polynomial time (junction-tree algorithm).
– Otherwise, exact inference is NP-complete and becomes intractable for large models. In
that case one uses approximate algorithms (variational methods, Markov chain Monte
Carlo, etc.).

14.3 Generative models
• Generative model : a graphical model that represents the process we believe created the data.
• Convenient to design specific graphical models for supervised and unsupervised problems in
machine learning (classification, regression, clustering, etc.).
• Ex: classification with inputs x = (x1 , . . . , xD ) and class labels C:
– Generative model: C → x means p(C, x) = p(C) p(x|C), or “first pick a class C by
sampling from p(C), then (given C) pick an instance x by sampling from p(x|C)”. This
allows us to sample pairs (C, x) if we know the model for p(x|C) (the class-conditional
probability), and the model parameters (p(C) and p(x|C)).
Ex: p(C) = πC ∈ (0, 1) and p(x|C) = N (µC , ΣC ), for continuous variables.
– Bayes’ rule inverts the generative process to classify an instance x: p(C|x) = \frac{p(x|C)\, p(C)}{p(x)}.
– If, in addition, we assume that x = (x1 , . . . , xD ) are conditionally independent given C
(i.e., we ignore possible correlations between features for instances within the same class),
then we achieve a simpler model, the naive Bayes classifier : p(x|C) = p(x1 |C) · · · p(xD |C).

14.6 Undirected graphical models (Markov random fields)


• A different way to express graphically a probability distribution over d random variables, more
convenient than directed graphical models if the influences between variables are symmetric.
Ex: pixels on an image. Two neighboring pixels tend to have the same color, and the correlation goes both ways.

• The joint distribution is not defined using parents (which imply a directed graph), but using:
– Cliques (clique = set of nodes such that there is a link between every pair of nodes in the clique).
– Potential functions ψC (XC ), where XC is the set of variables in clique C.
  \text{For } X = (X_1, \dots, X_d): \quad p(X) = \frac{1}{Z} \prod_{\text{all cliques } C} \psi_C(X_C), \qquad Z = \sum_X \prod_{\text{all cliques } C} \psi_C(X_C).

The partition function Z is a normalization constant that ensures that \sum_X p(X) = 1.
• Ex: consider an image X with d black or white pixels (Xi ∈ {0, 1}, so X = (X1 , . . . , Xd ) can
take 2d different values). Connecting each pixel to its 4 neighboring pixels defines two types of
cliques: single pixels Xi , and pairs of neighboring pixels (Xi , Xj ). Then:
  p(X) = \frac{1}{Z} \left( \prod_{i=1}^{d} \psi_1(X_i) \right) \left( \prod_{i \sim j} \psi_2(X_i, X_j) \right), \qquad \psi_1(X_i) = e^{-|Y_i - X_i|}, \quad \psi_2(X_i, X_j) = e^{-\alpha|X_i - X_j|}

where Y1 , . . . , Yd are the pixel values for an observed, noisy image Y and α > 0. Then, p(X)
is high if X is close to Y (single-pixel cliques), and if X is smooth (neighboring pixels tend to
have the same value), as controlled by α. We can compute a denoised image as arg maxX p(X).
In statistical mechanics, the Ising model for magnetic materials is similar to an MRF: pixel = atom with spin ∈ {↑, ↓}.
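A rough sketch of computing that arg max (the coordinate-wise “iterated conditional modes” strategy and the value of α are my choices; the notes only state the objective). Maximizing p(X) is the same as minimizing \sum_i |Y_i − X_i| + α \sum_{i∼j} |X_i − X_j|:

# Binary-image MRF denoising by iterated conditional modes (ICM): repeatedly
# set each pixel to the value in {0, 1} with lowest local cost.
import numpy as np

def icm_denoise(Y, alpha=1.0, sweeps=5):
    X = Y.copy()                                       # start from the noisy image
    H, W = X.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                nbrs = [X[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= a < H and 0 <= b < W]
                costs = [abs(int(Y[i, j]) - v) + alpha * sum(abs(v - int(u)) for u in nbrs)
                         for v in (0, 1)]
                X[i, j] = int(np.argmin(costs))        # best value for this pixel given its neighbors
    return X

# X_denoised = icm_denoise(Y_noisy, alpha=1.5)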

15 Discrete Markov models and hidden Markov models
• Up to now, we have regarded data points x_1, . . . , x_N as independent and identically distributed (iid): each x_n is an independent sample from a (possibly unknown) distribution p(x; Θ) with parameters Θ. Hence, the log-likelihood is simply a sum of the individual log-likelihoods of the data points: \log p(x_1, \dots, x_N) = \sum_{n=1}^{N} \log p(x_n; \Theta). As a graphical model: x_1  x_2  · · ·  x_N.

• Now, we relax that assumption: x1 , . . . , xN depend on each other.


• Ex: temporal or spatial series (discrete or continuous):
– Temporal, continuous: tides, trajectory of a moving body, etc.
– Temporal, discrete:
∗ Letter or word sequences in English: t → ‘h’ more likely than ‘x’. Because of grammar or
phonological rules, only certain sequences are allowed, and some are more likely than others.

∗ Speech: sequence of sounds, each corresponding to a phoneme. Physical constraints of the


vocal tract articulators (tongue, lips, etc.) mean their motion is smooth, and so is the acoustic speech they produce.

– Spatial, discrete: base pairs in a DNA sequence.

15.2 Discrete Markov processes (Markov chains)


• Consider a system that at any time is in one of a set of N distinct states S1 , S2 , . . . , SN .
Write the state at time t = 1, 2, . . . as a discrete random variable qt ∈ {S1 , S2 , . . . , SN }.
• In general, we can always write the probability of a sequence of states of length T as:
p(q1 , q2 , . . . , qT ) = p(q1 ) p(q2 |q1 ) p(q3 |q1 , q2 ) · · · p(qT |q1 , q2 , . . . , qT −1 ),
that is, the state at time t + 1 depends on all the previous states since t = 1:
p(q_{t+1} = S_j \,|\, q_t = S_i, q_{t-1} = S_k, \dots) ← requires a table of N^T parameters.

• First-order Markov model : assumes the state at time t + 1 depends only on the state at time t
(it is conditionally independent of all other times given time t, or “the future is independent of the past given the present”):

p(qt+1 = Sj | qt = Si , qt−1 = Sk , . . . ) = p(qt+1 = Sj | qt = Si ).


Markov model of order k: the state at time t + 1 depends on the state at the previous k times t, t − 1, . . . , t − k + 1.

• Homogeneous Markov model : assumes the transition probability from Si to Sj (for any pair of
states Si , Sj ) is independent of the time (so going from Si to Sj has the same probability no matter when it happens):
  a_{ij} \equiv p(q_{t+1} = S_j \,|\, q_t = S_i), \qquad \sum_{j=1}^{N} a_{ij} = 1, \; i = 1, \dots, N, \qquad a_{ij} \in [0, 1], \; i, j = 1, \dots, N.

• Transition probability matrix A = (aij ) of N × N: contains nonnegative elements, rows sum 1.


• The first state in a sequence has its own initial distribution π (a vector of N × 1):

  \pi_i \equiv p(q_1 = S_i), \qquad \sum_{i=1}^{N} \pi_i = 1, \qquad \pi_i \in [0, 1], \; i = 1, \dots, N.

• The probability of a given sequence q1 , q2 , . . . , qT of length T is:


  p(q_1, \dots, q_T; A, \pi) = p(q_1)\, p(q_2|q_1) \cdots p(q_T|q_{T-1}) = \pi_{q_1} a_{q_1 q_2} \cdots a_{q_{T-1} q_T} = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}.
• A discrete Markov process can be seen as a stochastic automaton with states S_1, . . . , S_N, transition matrix A = (a_ij) and initial distribution π = (π_i). [Figure: a 3-state automaton with transition probabilities a_ij and initial probabilities π_1, π_2, π_3.]

• The sequence of states (each a random variable) can be seen as a graphical model: q_1 → q_2 → · · · → q_T.

• Sampling: given A and π, we can generate a sequence of states of length T as follows: sample
q1 from π; sample q2 from Aq1 • ; . . . ; sample qT from AqT −1 • .
• Training: given K observed sequences of length T , we want to estimate the parameters (A, π).
This can be done by maximum likelihood:
  \max_{\pi, A} \; \sum_{k=1}^{K} \log P(\text{sequence } k) \qquad \text{where} \quad P(\text{sequence } k) = \pi_{q_{k1}} \prod_{t=2}^{T} a_{q_{k,t-1}\, q_{kt}}.

The solution is:

  \pi_i = \frac{\#\text{ sequences starting with } S_i}{\#\text{ sequences}} \qquad\qquad a_{ij} = \frac{\#\text{ transitions from } S_i \text{ to } S_j}{\#\text{ transitions from } S_i}.
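A small sketch (the values of π and A are toy choices, not from the notes) of sampling sequences and then recovering the parameters with exactly these counts:

# Sample sequences from a 3-state Markov chain, then re-estimate (A, pi) by counting.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])

def sample_sequence(T):
    q = [rng.choice(3, p=pi)]                      # first state from pi
    for _ in range(T - 1):
        q.append(rng.choice(3, p=A[q[-1]]))        # next state from row q_{t-1} of A
    return q

seqs = [sample_sequence(50) for _ in range(200)]

pi_hat = np.bincount([q[0] for q in seqs], minlength=3) / len(seqs)   # fraction of sequences starting in S_i
counts = np.zeros((3, 3))
for q in seqs:
    for a, b in zip(q[:-1], q[1:]):
        counts[a, b] += 1                                             # transition counts
A_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(pi_hat, 2), np.round(A_hat, 2))                        # close to pi and A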

15.3 Hidden Markov models (HMMs)


• At each time t = 1, 2, . . . there are two random variables:
– The state qt , which is unobserved (a hidden variable). It depends only on the previous
time state, qt−1 : transition probability p(qt |qt−1 ).
– The observation Ot , which we do observe. It depends only on the current state, qt : obser-
vation (or emission) probability p(Ot |qt ). The observation Ot can be:
∗ Discrete, i.e., symbols from an alphabet (= a set of symbols). Then, p(Ot |qt ) is a
matrix B of probabilities for each pair (state,symbol).
∗ Continuous and possibly multidimensional. Then, p(Ot |qt ) is a continuous distribu-
tion, e.g. a Gaussian N (µi , Σi ) for state Si .
Hence, an observed sequence O1 , . . . , OT results from two sources of randomness: randomly
moving from state to state, and randomly emitting an observation at each state.
• Again, we assume these probabilities don’t change over time (homogeneous HMM).
• The first state in a sequence has its own initial distribution p(q1 ).
• Altogether, the HMM is characterized by 3 groups of parameters: the transition matrix A, the
observation matrix B (assuming discrete observations), and the initial distribution π.
• An HMM for an observation sequence O_1, . . . , O_T (and unobserved state sequence q_1, . . . , q_T) is a graphical model with one node per random variable (O_t or q_t). Each node contains the parameters needed to generate its random variable: node q_t for t > 1 contains A (necessary to compute p(q_t|q_{t−1})), node O_t contains B (necessary to compute p(O_t|q_t)), and node q_1 contains π (necessary to compute p(q_1)).

What can we do with an HMM?
• Sampling: given (A, B, π) we can generate a sequence of observations as follows: sample q1
from π and O1 from Bq1 • ; sample q2 from Aq1 • and O2 from Bq2 • ; . . . ; sample qT from AqT −1 •
and OT from BqT • .
• Evaluating the probability p(O; A, B, π) of a given observation sequence O = (O1 , . . . , OT ). The
joint probability of a given sequence of both observations and states (if they were observed) is:
  p(O, Q; A, B, \pi) = p(q_1) \left( \prod_{t=2}^{T} p(q_t|q_{t-1}) \right) \left( \prod_{t=1}^{T} p(O_t|q_t) \right) = \pi_{q_1} b_{q_1 O_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T}\, b_{q_T O_T}.

We can compute p(O|A, B, π) by marginalizing this joint distribution over all possible state sequences Q: p(O|A, B, \pi) = \sum_Q p(O, Q; A, B, \pi). Although there are N^T such sequences, this marginalization can be computed exactly in O(N²T) using the forward-backward algorithm (a dynamic programming algorithm). [Figure: trellis of the N states S_1, . . . , S_N over times 1, 2, . . . , T, with initial probabilities π_i, transition probabilities a_ij and observations O_1, . . . , O_T.]
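A minimal sketch of the forward pass of that algorithm for discrete observations (variable names are mine; the usual rescaling of α to avoid underflow is omitted). It computes α_t(i) = p(O_1, . . . , O_t, q_t = S_i) recursively:

# Forward algorithm: p(O) for a discrete-observation HMM in O(N^2 T) time.
import numpy as np

def hmm_likelihood(O, A, B, pi):
    # O: observation indices (length T); A: N x N transitions; B: N x M emissions; pi: N initial probs.
    alpha = pi * B[:, O[0]]                       # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, len(O)):
        alpha = (alpha @ A) * B[:, O[t]]          # alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) * b_j(O_t)
    return alpha.sum()                            # p(O) = sum_i alpha_T(i)

# Ex.: p = hmm_likelihood([0, 2, 1, 1], A, B, pi)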

• Decoding: given a sequence of observations O = (O1 , . . . , OT ), find the sequence of states Q =


(q1 , . . . , qT ) that has highest probability p(Q|O; A, B, π) (many state sequences are possible,
but some are likelier than others to have generated the observation sequence O). Again, this can
be done exactly in O(N 2 T ) using the Viterbi algorithm (a dynamic programming algorithm).
Speech recognition: states = phonemes, observations = acoustic measurements (e.g. mel-frequency cepstral coefficients (MFCCs)).

• Training: given a training set of observation sequences, learn the HMM parameters (A, B, π)
that maximize the probability of generating those sequences. This can be solved with a form of
the EM algorithm (the Baum-Welch algorithm). E step: compute “p(Q|O; A, B, π)” for the current parameters
(A, B, π). M step: reestimate the parameters (A, B, π), as in a discrete Markov process.✐

In an HMM, the state variables follow a first-order Markov process: p(qt |q1 , . . . , qt−1 ) = p(qt |qt−1 ).
But the observed variables do not: p(O_t|O_{t−1}, . . . , O_1) ≠ p(O_t|O_{t−1}). In fact, O_t depends (implicitly) on all previous observations. An HMM achieves this long-range dependency with fewer parameters than using a Markov process of order k. The latter needs N^{k+1} transition probabilities, and a huge training set where every possible sequence of k + 1 symbols (bigrams, trigrams, . . . , (k+1)-grams) should be observed.

Continuous states and observations: tracking (dynamical models)


• Both the states and the observations are continuous. We want to predict the state at time t + 1
given the state qt and the observation Ot at time t.
• Many applications in control, computer vision, robotics, etc. Ex: predicting the trajectory of
a guided missile, mobile robot, pedestrian, etc. State qt = (xt , yt , zt ) spatial coordinates of robot at time t,
observation Ot = sensor information at time t (camera image, depth sensor, etc.).

• A Kalman filter (or linear dynamical system) is like an HMM, but:


– Transition prob. p(qt+1 |qt ): qt+1 is a linear function of qt plus zero-mean Gaussian noise.
– Emission prob. p(Ot |qt ): Ot is another linear function of qt plus zero-mean Gaussian noise.
Training can also be done with an EM algorithm.
• Extensions to nonlinear functions and/or nongaussian distributions: extended Kalman filter,
particle filters.
17 Combining multiple learners
• By suitably combining multiple learners the accuracy can be improved (but it need not be).
– How to generate base learners that complement each other?
– How to combine their outputs for maximum accuracy?
• Ex: train an ensemble of L decision trees on L different subsets of the training set and define
the ensemble output for a test instance as the majority vote (for classification) or the average
(for regression) of the L trees.
• Ensembles of decision trees (random forest, boosted decision trees) are practically among the
most accurate models in machine learning.
• Disadvantages:
– An ensemble of learners is computationally more costly in time and space than a single
learner, both at training and test time.
– A decision tree is interpretable (as rules), but an ensemble of trees is hard to interpret.

17.1 Generating diverse learners


• If the learners behave identically, i.e., the outputs of the learners are the same for any given
input, their combination will be identical to any individual learner. The accuracy doesn’t
improve and the computation is slower.
☞ Diversity: we need learners whose decisions differ and complement each other.
• If each learner is extremely accurate, or extremely inaccurate, the ensemble will barely improve
the individual learners (if at all).
☞ Accuracy: the learners have to be sufficiently accurate, but not very accurate.
• Good ensembles need a careful interplay of accuracy and diversity. Having somewhat inaccurate
base learners can be compensated by making them diverse and independent from each other.
• Although one can use any kind of models to construct ensembles, practically it is best to use
base learners that are simple and unstable.

Mechanisms to generate diversity


• Different models: each model (linear, neural net, decision tree. . . ) makes different assumptions about the data and leads to a different classifier.
• Different hyperparameters: within the same model (polynomials, neural nets, RBF networks. . . ),
using different hyperparameters leads to different trained models. Ex: number of hidden units
in multilayer perceptrons or of basis functions in RBF networks, k in k-nearest-neighbor clas-
sifiers, error threshold in decision trees, etc.
• Different optimization algorithm or initialization: for nonconvex problems (e.g. neural nets),
each local optimum of the objective function corresponds to a different trained model. The
local optimum found depends on the optimization algorithm used (gradient descent, alternating
optimization, etc.) and on the initialization given to it.
• Different features: each learner can use a different (possibly random) subset of features from
the whole feature vector. This also makes each learner faster, since it uses fewer features.
• Different training sets: each learner is trained on a different subset of the data. We can do this:
– In parallel, by drawing independent random subsets from the training set, as in bagging.
– Sequentially, by giving more emphasis to instances on which the preceding learners are
not accurate, as in boosting or cascading.
– By making the subsets local, e.g. obtained by clustering.
– By defining the main task in terms of several subtasks to be implemented by the learners,
as in error-correcting output codes.

17.2 Model combination schemes


Given L trained models (learners), their outputs y1 , . . . , yL for an input x can be combined:
• In parallel or multiexpert:
– Global approach (learner fusion): all learners generate an output and all these outputs are
combined. Ex: a fixed combination (voting, averaging) or a learned one (stacking).
– Local approach (learner selection): a “gating” model selects one learner as responsible to
generate the final output. Ex: mixture of experts.
• In sequence or multistage: the learners are sorted (usually in increasing complexity) and we
apply them in sequence to the input until one of them is confident. Ex: boosting, cascading.
In the case of K-class classification, each learner may have K outputs (for the discriminant of each
class, or from a softmax). We can have each learner output a single class (the one with largest
discriminant or softmax), or have the combination use all K outputs of all L learners.
[Figure: the four combination schemes — voting or averaging (fixed weights w_l on the learners’ outputs d_1, . . . , d_L), stacking (a learned combiner f applied to d_1, . . . , d_L), mixture of experts (a gating model weights the learners depending on the input x), and boosting or cascading (learners applied in sequence until one is confident).]

17.3 Voting and averaging


• For discrete outputs (classification):
– Majority vote: the class with most votes wins.
– Weighted vote with the posterior prob. pl (C1 |x), . . . , pl (CK |x) from each learner l = 1, . . . , L.
• For continuous outputs (regression):
– Average: y = \frac{1}{L}\sum_{l=1}^{L} y_l. Possibly weighted: y = \sum_{l=1}^{L} w_l y_l with \sum_{l=1}^{L} w_l = 1 and w_1, . . . , w_L ∈ (0, 1).

– Median: more robust to outlying outputs.


• Bayesian model combination: in classification, the posterior prob. marginalized over all models is p(C_k|x) = \sum_{\text{all models } M_i} p(C_k|x, M_i)\, p(M_i), which can be seen as weighted averaging using
as weights the model prior probabilities. Simple voting corresponds to a uniform prior (all
models equally likely).
Why (and when) does diversity help? (Bias and variance.)
• Consider L independent binary classifiers with success probability > ½ (i.e., better than random
guessing) combined by taking a majority vote. One can prove the accuracy increases with L.
Not necessarily true if they are not independent (e.g. if the L classifiers are equal the accuracy will not change).

• Consider L iid random variables y_1, . . . , y_L with expected value E{y_l} = µ and variance var{y_l} = σ². Then, the average y = \frac{1}{L}\sum_{l=1}^{L} y_l has the following moments:

  \mathrm{E}\{y\} = \mathrm{E}\Big\{\frac{1}{L}\sum_{l=1}^{L} y_l\Big\} = \frac{1}{L}\sum_{l=1}^{L} \mathrm{E}\{y_l\} = \mu

  \mathrm{var}\{y\} = \mathrm{var}\Big\{\frac{1}{L}\sum_{l=1}^{L} y_l\Big\} = \frac{1}{L^2}\,\mathrm{var}\Big\{\sum_{l=1}^{L} y_l\Big\} = \frac{1}{L^2}\sum_{l=1}^{L} \mathrm{var}\{y_l\} = \frac{1}{L}\sigma^2.

So the expected value (hence the bias) doesn’t change, but the variance (hence the mean squared error) decreases as L increases. If y_1, . . . , y_L are identically but not independently distributed:

  \mathrm{var}\{y\} = \frac{1}{L^2}\,\mathrm{var}\Big\{\sum_{l=1}^{L} y_l\Big\} = \frac{1}{L^2}\left(\sum_{l=1}^{L} \mathrm{var}\{y_l\} + \sum_{i,j=1,\, i\ne j}^{L} \mathrm{cov}\{y_i, y_j\}\right) = \frac{1}{L}\sigma^2 + \frac{1}{L^2}\sum_{i,j=1,\, i\ne j}^{L} \sigma_{ij}.

So, if the learners are positively correlated (σij > 0), the variance (and error) is larger than
if they are independent. Hence, a well-designed ensemble combination will aim at reducing, if
not completely eliminating, positive correlation between learners.
Further decrease in variance is possible with negatively correlated learners. However, it is
impossible to have many learners that are both accurate and negatively correlated.
• Ensemble averaging and noise: if we view each learner as a random noise function added to the
true discriminant/regression function, and if these noise functions are independent with zero
mean, then the ensemble average is smoother than each ensemble member (fig. 4.5).
Voting or averaging over models with low bias and high variance produces an ensemble with low
bias but lower variance.

17.5 Bagging
• Bootstrap: given a training set X = {x1 , . . . , xN } containing N samples, the bootstrap generates
L subsets X1 , . . . , XL of X , each of size N, as follows: subset Xl is obtained by sampling at
random with replacement N points from X .
– This means that some points xn will appear repeated multiple times in Xl and some other
points xn will not appear in Xl .
– The L subsets are partly different from each other. On average, each subset contains ≈ 63% of the training set. Proof: the probability that we don’t pick instance x_n in any of the N draws is (1 − 1/N)^N ≈ e^{−1} ≈ 0.37.


• Stable vs unstable learning algorithms:


– A learning algorithm is unstable if small changes in the training set cause a large difference
in the trained model, i.e., the training algorithm has large variance.
– Running a stable algorithm on resampled versions of the training set leads to learners
with high positive correlation, so little diversity. Using an unstable algorithm reduces this
correlation and thus the ensemble has lower variance.
– Ex. of unstable algorithms: decision trees (the bigger the more unstable), neural nets.
– Ex. of stable algorithms: linear classifiers or regressors, nearest-neighbor classifier.
• Bagging (bootstrap aggregating): applicable to classification and regression.
– We generate L (partly different) subsets of the training set with the bootstrap.
– We train L learners, each on a different subset, using an unstable learning algorithm.
Since the training sets are partly different, the resulting learners are diverse.

– The ensemble output is defined as the vote or average (or median) of the learners’ outputs.
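A short sketch of bagging (decision trees as the unstable base learner and L = 25 are my choices; integer class labels are assumed):

# Bagging: bootstrap resampling + unstable base learners + majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, L=25, rng=np.random.default_rng(0)):
    trees = []
    for _ in range(L):
        idx = rng.choice(len(X), len(X), replace=True)        # bootstrap sample (~63% unique points)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])           # L x N' predicted labels
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])  # majority vote

# trees = bagging_fit(X_train, y_train);  y_hat = bagging_predict(trees, X_test)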
• Random forest: a variation of bagging using decorrelated decision trees as base learners.
– As in bagging, each tree is trained on a bootstrap sample of the training set, and the
ensemble output is defined as the vote or average (or median) of the learners’ outputs.
– In addition, the lth tree is constructed with a randomized algorithm, independently of the
other trees. Typically, we do random feature selection at each node of the tree: we select
the feature to split not among all possible features, but among a random subset of them.
It is recommended to grow a full tree (no pruning), because it has lower bias and higher variance; and to use as size of the
random subset 1 + ⌊log2 D⌋ (where D = dim x is the dimensionality of the instances).

• Random forests are among the best classifiers in practice, and simpler to train than boosting.

17.6 Boosting
• Bagging generates complementary learners through random sampling and unstable learning
algorithms. Boosting actively generates complementary learners by training the next learner on
the mistakes of the previous learners.
• Weak learner: a learner that has probability of error < ½ (i.e., better than random guessing on
binary classification). Ex: decision trees, decision stumps (tree grown to only 1 or 2 levels).
Strong learner : a learner that can have arbitrarily small probability of error. Ex: neural net.
• There are many versions of boosting, we focus on AdaBoost.M1, for classification (it can be
applied to regression with some modifications):
– It combines L weak learners. They have high bias, but the decrease in variance in the
ensemble compensates for that.
– Each learner l = 1, . . . , L is trained on the entire training set X = {x1 , . . . , xN }, but each
point x_n has an associated probability p_n^{(l)} indicating how “important” it is for that learner.
The training algorithm must be able to use these probabilities p^{(l)} together with the training set X. Otherwise, this can be simulated by sampling a training set of size N from X according to p^{(l)}.

– The first learner uses p_n^{(1)} = 1/N (all points equally important).


– Assume that after training learner l on (X, p^{(l)}) its error rate is ǫ_l. If ǫ_l ≥ ½ we stop, since the learner is not weak. Otherwise, we update the probabilities as follows: for each point x_n, n = 1, . . . , N (fig. 17.2):

  p_n^{(l+1)} = \begin{cases} \beta_l\, p_n^{(l)} & \text{if learner } l \text{ correctly classifies } x_n \\ p_n^{(l)} & \text{otherwise} \end{cases} \qquad \text{where } \beta_l = \frac{\epsilon_l}{1 - \epsilon_l} \in [0, 1).

Then we renormalize the probabilities so they sum to 1: p_n^{(l+1)} \leftarrow p_n^{(l+1)} / \sum_{n'=1}^{N} p_{n'}^{(l+1)}.
– This decreases the probability of a point if it is correctly classified, so the next learner
focuses on the misclassified points. That is why learners are chosen to be weak. If they are strong (= very
accurate), the next learner’s training set will emphasize a few points, many of which could be very noisy or outliers.

• After training, the ensemble output for a given instance is a weighted vote, where the weight
of learner l is wl ∝ log (1/βl ), so the weaker learners have a lower weight.
Other variations of boosting make the overall decision by applying learners in sequence, as in cascading.
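A rough sketch of AdaBoost.M1 with decision stumps as the weak learners (the stumps use the probabilities p^{(l)} directly as sample weights; binary labels in {−1, +1} are assumed, and the names are mine):

# AdaBoost.M1 for binary classification (y_n in {-1, +1}) with decision stumps.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, L=50):
    N = len(X)
    p = np.full(N, 1.0 / N)                       # p^(1): all points equally important
    learners, weights = [], []
    for _ in range(L):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        correct = stump.predict(X) == y
        eps = p[~correct].sum()                   # weighted error rate eps_l
        if eps >= 0.5 or eps == 0:                # stop if not weak (or trivially perfect)
            break
        beta = eps / (1.0 - eps)
        p = np.where(correct, beta * p, p)        # shrink the weight of correctly classified points
        p /= p.sum()                              # renormalize to sum 1
        learners.append(stump)
        weights.append(np.log(1.0 / beta))        # w_l proportional to log(1/beta_l)
    return learners, np.array(weights)

def adaboost_predict(learners, w, X):
    votes = sum(wl * l.predict(X) for wl, l in zip(w, learners))   # weighted vote
    return np.sign(votes)

# learners, w = adaboost_fit(X_train, y_train);  y_hat = adaboost_predict(learners, w, X_test)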

• Boosting usually achieves very good classification performance, although it does depend on the
dataset and the type of learner used (it should be weak but not too weak). It is sensitive to
noise and outliers.
A very successful application in computer vision: the Viola-Jones face detector. This is a cascade of AdaBoost ensembles of decision
stumps, each trained on a large set of Haar features. It is robust and (with clever image-based operations) very fast for test images.

Cascading is similar to boosting, but the learners are trained sequentially. For example, in boosting
the next learner could be trained on the residual error of the previous learner. Another way:
• When applied to an instance x, each learner gives an output (e.g. class label) and a confidence
(which can be defined as the largest posterior probability p(Ck |x)).
• Learner l is trained on instances for which the previous learner is not confident enough.
• When applying the ensemble to a test instance, we apply the learners in sequence until one is
confident enough. We use learner l only if the previous learners 1, . . . , l − 1 are not confident
enough on their outputs.
• The goal is to order the learners in increasing complexity, so the early learners are simple and
classify the easy instances quickly.

17.4 Error-correcting output codes (ECOC)


• An ensemble learning method for K-class classification.
• Idea: instead of solving the main classification task directly with one classifier, which may be
difficult, create simpler classification subtasks which can be combined to get the main classifier.
• Base learners: L binary classifiers with outputs in {−1, +1}.
• Code matrix W of K × L with elements in {−1, +1}: if wkl = −1 (+1) then class k should be
on the negative (positive) side of learner l. Hence, each learner tries to classify one subset of
classes vs the rest (it partitions the K classes into two groups). Ex. (for K = 4):

  one-vs-all:          one-vs-one:               all possible learners:
   +1 −1 −1 −1          +1 +1 +1  0  0  0          −1 −1 −1 −1 −1 −1 −1
   −1 +1 −1 −1          −1  0  0 +1 +1  0          −1 −1 −1 +1 +1 +1 +1
   −1 −1 +1 −1           0 −1  0 −1  0 +1          −1 +1 +1 −1 −1 +1 +1
   −1 −1 −1 +1           0  0 −1  0 −1 −1          +1 −1 +1 −1 +1 −1 +1

• Particular cases of the code matrix:


– One-vs-all: L = K and W has +1 along the diagonal and −1 elsewhere.
– One-vs-one: L = K(K − 1)/2 and each column of W has one −1, one +1 and 0 elsewhere
(“0” means don’t care).
An ECOC code matrix has L between K (one-vs-all, fewest learners) and 2^{K−1} − 1 (all possible learners), because negating a column gives the same learner, and a column of all −1s (or all +1s) is useless.
• The code matrix allows us to define a K-class classification problem in terms of several binary
classification problems. We can use any binary classifier for the latter (decision trees, etc.).
• To classify a test input x, we apply the L classifiers to it and obtain a row vector y with
L entries in {−1, +1}. Ideally, if the classifiers were perfect, this would equal the row of W
corresponding to x’s class, but in practice some classifiers will classify x incorrectly. Then, we
find the row 1 ≤ k ≤ K in W that is closest to this y in Hamming distance, and output k as the label (as in error-correcting codes). Equivalently, we can compute a weighted vote Σ_{l=1}^L w_kl y_l for class k (where the weights w_kl are the elements of W) and then pick the class with most votes.
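A minimal sketch of this decoding (assuming numpy arrays; with outputs in {−1, +1}, minimizing the Hamming distance and maximizing the weighted vote select the same class):

import numpy as np

def ecoc_decode(W, y):
    """W: K x L code matrix with entries in {-1, 0, +1}.
       y: length-L vector of binary classifier outputs in {-1, +1}.
       Returns the index k of the predicted class."""
    votes = W @ y                  # weighted vote: sum_l w_kl * y_l for each class k
    return int(np.argmax(votes))   # class whose codeword best matches y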

• The point of ECOC is to introduce robustness against errors of the learners by making their
codewords be farther from each other in Hamming distance (number of mismatching bits). This
requires sufficiently many learners, and introduces redundancy.

• Given a value of L (the number of classifiers, with L > K) selected by the user, we generate
W so its rows are as different as possible in Hamming distance (redundant codes give protection
against errors), and its columns are as different as possible (so the learners are diverse).

• An ECOC can also be seen as an ensemble of classifiers where, to obtain each classifier, we
manipulate the output targets (rather than the input features or the training set, which in
ECOC are equal to the original training set for each classifier).

• Practically, the main benefit of ECOC seems to be in variance reduction.

17.9 Stacked generalization (stacking)


• The combination of the learners’ outputs is itself learned, but on a validation set.

1. We train L learners using the training set (in whatever way that introduces diversity).
2. We apply these learners to a validation set. For each validation point xn , we obtain the
outputs of the L learners (y1 , . . . , yL ). For classification, the output of learner l could be the class label or
the posterior probabilities (softmax). For regression, the output of learner l is its real-valued output vector.

3. We train a model f to predict the ground-truth output for a given instance xn from the
learners’ outputs for that instance.

• If we choose f to be a linear function, this produces a weighted average or vote, where the
weights are learned on the validation set. But we can choose f to be nonlinear.

• By training f on the validation set (rather than the training set where the learners were trained),
it learns to correct for mistakes that the learners make.

• Whether it works better than a fixed, simple combination rule (such as majority vote or average)
depends on the problem.
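A minimal sketch of steps 1–3 above for a regression task, assuming scikit-learn-style base learners with fit/predict and a linear combiner f fit by least squares (all names here are illustrative):

import numpy as np

def train_stacking(learners, X_train, y_train, X_val, y_val):
    # 1. Train the L base learners on the training set.
    for learner in learners:
        learner.fit(X_train, y_train)
    # 2. Apply them to the validation set to build the combiner's inputs.
    Z = np.column_stack([learner.predict(X_val) for learner in learners])
    # 3. Train f (here a linear model fit by least squares) to predict the
    #    ground-truth output from the learners' outputs.
    Z1 = np.column_stack([Z, np.ones(len(Z))])          # add a bias column
    w, *_ = np.linalg.lstsq(Z1, y_val, rcond=None)
    return w

def stacking_predict(learners, w, X):
    Z = np.column_stack([learner.predict(X) for learner in learners])
    Z1 = np.column_stack([Z, np.ones(len(Z))])
    return Z1 @ w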

17.10 Fine-tuning an ensemble


• Constructing an ensemble that does decrease the error requires some art in selecting the right
number of learners and in making them diverse and of the right accuracy.

• Typically, the resulting ensemble contains learners that are somewhat correlated with each
other, or that are useless.

• It often helps to postprocess the ensemble by removing some learners (ensemble pruning), so
we obtain a smaller ensemble (hence faster at test time) having about the same error.

• This is similar to feature selection and we can indeed apply feature selection algorithms. The
most usual is forward selection: starting with an empty ensemble, we sequentially add one
learner at a time, the one that gives highest accuracy when added to the previous ones.
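A minimal sketch of this forward selection, where accuracy(subset) is an assumed function that evaluates a sub-ensemble (e.g. by majority vote on a validation set):

def prune_ensemble(learners, accuracy, max_size):
    """Greedily grow a sub-ensemble: at each step add the learner that most
    improves the sub-ensemble's accuracy; stop when no learner improves it."""
    selected = []
    best_acc = 0.0
    for _ in range(max_size):
        best_candidate = None
        for learner in learners:
            if learner in selected:
                continue
            acc = accuracy(selected + [learner])   # assumed: evaluates the sub-ensemble
            if acc > best_acc:
                best_acc, best_candidate = acc, learner
        if best_candidate is None:                 # no improvement: stop early
            break
        selected.append(best_candidate)
    return selected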

18 Reinforcement learning
• How an autonomous agent that senses and acts in its environment can learn to choose optimal
actions to achieve a goal. Ex.: board games (e.g. chess, backgammon), robot navigation (e.g.
looking for the exit of a maze).
– Each time the agent performs an action, it may receive a reward (or penalty) to indicate
how good the resulting state of the environment is.
– Often the reward is delayed.
At the end of the game (positive reward if we win, negative if we lose); or when (if) the robot reaches the exit.

– The task of the agent is to learn from this reward to choose sequences of actions that
produce the greatest cumulative reward.
• As in other machine learning settings, the task is to learn a function, here a control policy
π: S → A that maps states s ∈ S to actions a ∈ A, but with the following differences:
– Delayed reward : we are not given pairs (state,action) to train the function, as in supervised
learning. Instead, we (may) have a reward when the agent executes an action. There is
no such thing as the best move at any given time; what matters is the sequence of moves.
Supervised learning is “learning with a teacher”. Reinforcement learning is “learning with a critic”. The critic doesn’t tell
us what to do but only how well we have done in the past. Its feedback is scarce and when it comes, it comes late.

– Credit assignment problem: after taking several actions and getting the reward, which
actions helped? So that we can record and recall them later on.
A reinforcement learning program learns to generate an internal value for the intermediate
states and actions in terms of how good they are in leading us to the goal. Having learned
this internal reward mechanism, the agent can just take local actions to maximize it.
– Exploration vs exploitation: the agent influences the distribution of training examples by
the actions it chooses. It should trade off exploration of unknown states and actions (to
gain information) vs exploitation of states and actions that it has already learned will
yield high reward (to maximize its cumulative reward).
– Partially observed states: practically, sensors provide only partial information about the
environment state (e.g. a camera doesn’t provide the robot location).
• Basic elements of a reinforcement learning problem:
– Agent: the decision maker (e.g. game player, robot). It has
sensors to observe the environment (e.g. robot camera).

– Environment (e.g. board, maze). At any time t, the environment is in a certain state st that is one of a set of possible states S (e.g. board state, robot position). Often, there is an initial state and a goal state.

– A set A of possible actions at (e.g. legal chess movements, possible robot steps). The state changes after an action: st+1 = δ(st , at ). The solution requires a sequence of actions.

– Reward rt = r(st , at ) ∈ R: the feedback we receive, usually at the end of the game. It helps to learn the policy.
  [Figure: the resulting trajectory s0 −a0 /r0 → s1 −a1 /r1 → s2 −a2 /r2 → · · ·]

– Policy π: S → A: a control strategy for choosing actions that achieve a goal.

• The task of the agent: perform sequences of actions, observe their consequences and learn a
control policy that, from any initial state, chooses actions that maximize the reward accumulated
over time.

The learning task


• There are several settings of the reinforcement learning problem. We may assume:
– that the agent’s actions are deterministic, or that they are nondeterministic;
– that the agent can predict the next state that will result from each action, or that it
cannot;
– that the agent is trained by giving it examples of optimal action sequences, or that it must
train itself by performing actions of its own choice.
We consider a simple setting: deterministic actions that satisfy the Markov property. At each
discrete time step t:
– The agent senses the current state st and performs an action at .
– The environment gives a reward rt = r(st , at ) and a next state st+1 = δ(st , at ).
The functions r, δ are part of the environment and are not necessarily known to the agent; they
depend only on the current state and action and not on the previous ones (Markov assumption).
• The task of the agent: learn a policy π: S → A for selecting its next action given the current
observed state st : π(st ) = at . We seek a policy that produces the greatest cumulative reward
over time:

    V^π (st ) = rt + γ rt+1 + γ^2 rt+2 + · · · = Σ_{i=0}^∞ γ^i rt+i

i.e., the weighted sum of rewards obtained by starting at st and following the policy π: ai = π(si )
for i ≥ t. The constant 0 ≤ γ < 1 (discount rate) determines the relative value of delayed vs
immediate rewards (γ = 0 considers only the most immediate reward). We take γ < 1 because,
practically, we prefer rewards sooner rather than later (e.g. robot running on a battery).
• Hence: learn a policy that maximizes V π (s) for all states s ∈ S: π ∗ = arg maxπ V π (s) ∀s ∈ S,
and V ∗ (s) is the value function of such an optimal policy π ∗ .

Ex.: grid world with a GOAL state, squares = states, arrows = actions (γ = 0.9).
[Figure: four panels for this grid world showing the immediate rewards r(s, a) (100 for actions entering GOAL, 0 otherwise), the values Q(s, a), the optimal values V ∗ (s) (100, 90, 81, . . . , decaying by a factor γ per step away from GOAL), and one optimal policy.]
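As a quick numerical check of the discounted return defined above (a small Python illustration):

def discounted_return(rewards, gamma=0.9):
    """Sum_{i>=0} gamma^i * r_{t+i} for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A single reward of 100 received three steps in the future is worth
# discounted_return([0, 0, 0, 100]) = 0.9**3 * 100 = 72.9 now, consistent with
# V*(s) decaying by a factor gamma per step away from GOAL in the example above.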

Q-learning
• Learning π ∗ : S → A directly is difficult because we don’t have examples (s, a). Instead, we
learn a numerical evaluation function Q(s, a) of states and actions, and derive π from it.
• Define the Q function: Q(s, a) = r(s, a) + γV ∗ (δ(s, a)).
• We want to find π ∗ (s) = arg maxa Q(s, a). This can be done with the following iterative
algorithm. Noting that V ∗ (s) = maxa′ Q(s, a′ ), we can write the following recursive expression
for Q: Q(s, a) = r(s, a) + γ maxa′ Q(δ(s, a), a′ ). The algorithm starts with a table Q̂(s, a) = 0
for every (s, a), and repeatedly does the following:

– observe its current state s


– execute some action a
– observe the new state s′ = δ(s, a) and reward r = r(s, a)
– update the table entry (s, a) like this: Q̂(s, a) ← r + γ maxa′ Q̂(s′ , a′ ).

• Each update of Q̂(s, a) affects the old state s (not the new one s′ ).

• Assuming r(s, a) is bounded and every possible (state,action) pair is visited infinitely often,
one can prove: 1) the Q̂ values never decrease, and are between 0 and their optimal values Q;
2) Q-learning converges to the correct Q function.

• Note the agent need not know the functions r and δ ahead of time (which is often not practical,
e.g. for a mobile robot). It just needs to know their particular values as it takes actions, moves
to new states and receives the corresponding rewards (i.e., it samples those functions).
In choosing new actions to apply, the agent has an exploration-exploitation tradeoff (repeat actions that seem currently good vs
trying new ones). Commonly, one moves from pure exploration at the beginning towards exploitation as training proceeds.

If we have perfect knowledge of the functions r and δ (i.e., we can evaluate them for any (s, a)),
we can find the optimal policy more efficiently with a dynamic programming algorithm to solve
Bellman’s equation: V ∗ (s) = E {r(s, π(s)) + γ V ∗ (δ(s, π(s)))} ∀s ∈ S. This is convenient in
some applications, e.g. factory automation and job scheduling problems.
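A minimal sketch of such a dynamic-programming solution for the deterministic case (value iteration; here, unlike in Q-learning, r and δ are assumed to be known functions, and δ is assumed to always return a valid state, e.g. the goal is absorbing):

def value_iteration(states, actions, delta, r, gamma=0.9, tol=1e-6):
    """Compute V*(s) by repeatedly applying V(s) <- max_a [ r(s,a) + gamma*V(delta(s,a)) ]."""
    V = {s: 0.0 for s in states}
    while True:
        max_change = 0.0
        for s in states:
            v_new = max(r(s, a) + gamma * V[delta(s, a)] for a in actions)
            max_change = max(max_change, abs(v_new - V[s]))
            V[s] = v_new
        if max_change < tol:        # stop once the values have (approximately) converged
            return V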

Ex.: grid world with a particular current Q̂ table, at state s1 , taking action aR (“move right”) and moving to state s2 , whose outgoing actions currently have Q̂ values 66, 81 and 100. The update is:

    Q̂(s1 , aR ) ← r + γ max_{a′} Q̂(s2 , a′ ) = 0 + 0.9 max(66, 81, 100) = 90

[Figure: the Q̂ values around s1 and s2 before and after the update.]

Training consists of a series of episodes, each beginning at a random state and executing actions
until reaching the goal state. The first update occurs when reaching the goal state and receiving a
nonzero reward; in subsequent episodes, updates propagate backward, eventually filling the entire
Q̂(·, ·) table.
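A minimal sketch of this training procedure for a deterministic environment; the state and action sets and the functions delta and r below stand for an environment simulator and are assumptions for illustration:

import random
from collections import defaultdict

def q_learning(states, actions, delta, r, goal, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic MDP.
    delta(s, a): next state; r(s, a): immediate reward (both sampled by acting)."""
    Q = defaultdict(float)                      # table of Q̂(s, a), initialized to 0
    for _ in range(episodes):
        s = random.choice(states)               # each episode starts at a random state
        while s != goal:
            a = random.choice(actions)          # pure exploration, for simplicity
            s_next, reward = delta(s, a), r(s, a)
            # update: Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
            Q[s, a] = reward + gamma * max(Q[s_next, a2] for a2 in actions)
            s = s_next
    return Q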

Nondeterministic rewards and actions


• Instead of deterministic functions r(s, a) and δ(s, a), we have probability distributions p(r|s, a)
and P (s′|s, a) (Markov assumption).
Ex.: roll of dice in backgammon, noisy sensors and effectors in robot navigation.

• Q-learning generalizes by using expectations wrt these distributions instead of deterministic


functions:
– Value of a policy: V^π (st ) = E{ Σ_{i=0}^∞ γ^i rt+i }.

– Optimal policy (as before): π ∗ = arg max_π V^π (s) ∀s ∈ S.
– Q function: Q(s, a) = E{r(s, a) + γ V ∗ (δ(s, a))} = E{r(s, a)} + γ E{V ∗ (δ(s, a))} = E{r(s, a)} + γ Σ_{s′} P (s′ |s, a) V ∗ (s′ ).
– Recursive expression for Q: Q(s, a) = E{r(s, a)} + γ Σ_{s′} P (s′ |s, a) max_{a′} Q(s′ , a′ ).

– Update for Q̂ at iteration k: Q̂k (s, a) ← (1 − αk ) Q̂k−1 (s, a) + αk (r + γ max_{a′} Q̂k−1 (s′ , a′ )),
where the step size αk decreases towards zero as k → ∞ for convergence to occur.
This is similar to SGD compared to GD. Convergence is slow and may require many thousands of iterations.
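In code, only the update step of the earlier Q-learning sketch changes; a hedged sketch using one common choice of step size, αk = 1/(number of visits to (s, a)):

def q_update(Q, n, s, a, reward, s_next, actions, gamma=0.9):
    """One nondeterministic Q-learning update; Q and n are dicts keyed by (s, a)."""
    n[s, a] = n.get((s, a), 0) + 1
    alpha = 1.0 / n[s, a]                                   # step size decreasing towards zero
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[s, a] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target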

Generalizing from examples


• Using a table of states × actions to represent Q has two limitations:

– In practice, there may be a large or infinite number of states and actions (e.g. turning the
steering wheel by a continuous angle).
– Even if we could store the table, the previous Q-learning algorithm performs a kind of rote
learning: it simply stores seen (state,action) pairs and doesn’t generalize to unseen ones.
Indeed, the convergence proof assumes every possible (state,action) pair is visited (infinitely often).

• Instead, we can represent Q(s, a) with a function approximator, such as a neural net Q(s, a; W)
with inputs (s, a), weights W, and a single output corresponding to the value of Q(s, a). Each
Q̂(s, a) update in the Q-learning algorithm provides one training example, which is used to
update the weights via SGD.
A classic example: the 1995 TD-gammon program for playing backgammon (which has ≈ 1020 states and randomness because of
the dice roll) used a neural net trained on 1.5 million self-generated games (playing against itself) and achieved master-level play.
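A minimal sketch of this idea with a linear function approximator Q(s, a; w) = w · φ(s, a) and one SGD step per Q-learning update (the feature map φ is an assumption for illustration; in practice a neural net such as TD-gammon's would typically be used):

import numpy as np

class LinearQ:
    """Q(s, a; w) = w . phi(s, a) for a user-provided feature map phi."""
    def __init__(self, phi, n_features, lr=0.01, gamma=0.9):
        self.phi, self.lr, self.gamma = phi, lr, gamma
        self.w = np.zeros(n_features)

    def value(self, s, a):
        return float(self.w @ self.phi(s, a))

    def update(self, s, a, reward, s_next, actions):
        # The tabular target r + gamma * max_a' Q(s', a') becomes the regression
        # target for one SGD step on the squared error of Q(s, a; w).
        target = reward + self.gamma * max(self.value(s_next, a2) for a2 in actions)
        error = target - self.value(s, a)
        self.w += self.lr * error * self.phi(s, a)      # gradient step on 0.5 * error^2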

Partially observable states


• With a Markov decision process, we assume the agent can observe the state directly. In some
applications, this is not possible (e.g. a mobile robot with a camera has a certain view of its
environment but does not observe its location directly).

• This can be modeled using a Partially Observable Markov Decision Process (POMDP), which
uses P (o|s, a) to model an observation o given the current state s (unobserved) and action a,
rather than P (s′ |s, a) (e.g. o = camera image, s = 3D location of the robot).
This is similar to the difference between a discrete (observable) Markov process and a hidden Markov model (HMM).
