Lecture Notes CSE176
These are notes for a one-semester undergraduate course on machine learning given by Prof.
Miguel Á. Carreira-Perpiñán at the University of California, Merced. The notes are largely based on
the book “Introduction to machine learning” by Ethem Alpaydın (MIT Press, 3rd ed., 2014), with
some additions.
These notes may be used for educational, non-commercial purposes.
© 2015–2016 Miguel Á. Carreira-Perpiñán
1 Introduction
1.1 What is machine learning (ML)?
• Data is being produced and stored continuously (“big data”).
• Data is not random; it contains structure that can be used to predict outcomes, or gain knowl-
edge in some way.
Ex: patterns of Amazon purchases can be used to recommend items.
• It is more difficult to design algorithms for such tasks (compared to, say, sorting an array or
calculating a payroll). Such algorithms need data.
Ex: construct a spam filter, using a collection of email messages labelled as spam/not spam.
• ML relies on:
• A model:
– Classification: the labels to be predicted are discrete (classes). Ex:
∗ Speech recognition, machine translation, biometrics. . .
∗ Credit scoring: classify customers into high- and low-risk, based on their income and
savings, using data about past loans (whether they were paid or not).
– Regression: the labels to be predicted are continuous:
∗ Predict the price of a car from its mileage.
∗ Navigating a car: angle of the steering.
∗ Kinematics of a robot arm: predict workspace location from angles.
Figures: the credit-scoring example, where thresholds θ1 and θ2 on income and savings separate low-risk from high-risk customers; the regression example of car price y as a function of mileage x.
– Learning associations:
∗ Basket analysis: let p(Y |X) = “probability that a customer who buys product X
also buys product Y ”, estimated from past purchases. If p(Y |X) is large (say 0.7),
associate “X → Y ”. When someone buys X, recommend them Y .
– Clustering: group similar data points.
– Density estimation: where are data points likely to lie?
– Dimensionality reduction: data lies in a low-dimensional manifold.
– Feature selection: keep only useful features.
– Outlier/novelty detection.
• Reinforcement learning: find a sequence of actions (policy) that reaches a goal. No supervised
output but delayed reward.
Ex: playing chess or a computer game, robot in a maze.
2 Supervised learning
2.1 Learning a class from examples: two-class problems
• We are given a training set of labeled examples (positive and negative) and want to learn a
classifier that we can use to predict unseen examples, or to understand the data.
• Input representation: we need to decide what attributes (features) to use to describe the input
patterns (examples, instances). This implies ignoring other attributes as irrelevant.
Figures: training examples in feature space (x1: price, x2); the hypothesis with the largest margin; noise and a more complex hypothesis (a simple hypothesis h1 vs a complex hypothesis h2).
2.4 Noise
• Noise is any unwanted anomaly in the data. It can be due to:
– Imprecision in recording the input attributes: xn .
– Errors in labeling the input vectors: yn .
– Attributes not considered that affect the label (hidden or latent attributes, may be unob-
servable).
• Noise makes learning harder.
• Should we keep the hypothesis class simple rather than complex?
– Easier to use and to train (fewer parameters, faster).
– Easier to explain or interpret.
– Less variance in the learned model than for a complex model (less affected by single
instances), but also higher bias.
Given comparable empirical error, a simple model will generalize better than a complex one.
(Occam’s razor : simpler explanations are more plausible; eliminate unnecessary complexity.)
2.5 Learning multiple classes
• With K > 2 classes, we can learn K separate two-class problems (class k vs the rest), where yn is coded as one-of-K and hk is the two-class classifier for problem k, i.e., hk(x) ∈ {0, 1}.
• Ideally, for a given pattern x only one hk (x) is one. When no, or more than one, hk (x) is one
then the classifier is in doubt and may reject the pattern.
Figures: a 3-class example (sports car, luxury sedan, family car) in the (price, engine power) plane; regression of car price y on mileage x with polynomial fits of order 1, 2, 6 (✐ solve for order 1: optimal w0, w1).
2.6 Regression
• Training set X = {(xn, yn)}_{n=1}^N, where the label for a pattern xn ∈ R^D is a real value yn ∈ R.
In multivariate regression, yn ∈ R^d is a real vector.
• We assume the labels were produced by applying an unknown function f to the instances, and
we want to learn (or estimate) that function using functions h from a hypothesis class H.
Ex: H= class of linear functions: h(x) = w0 + w1 x1 + · · · + wD xD = wT x + w0 .
• Interpolation: we learn a function h(x) that passes through each training pair (xn , yn ) (no
noise): yn = h(xn ), n = 1, . . . , N.
Ex: polynomial interpolation (requires a polynomial of degree N − 1 with N points in general position).
• We can always enlarge the class of functions that can be learned by using a larger hypothesis
class H (higher capacity, or complexity of H).
Ex: a union of rectangles; a polynomial of order N .
• How to choose the right inductive bias, in particular the right hypothesis class H? This is the
model selection problem.
• The goal of ML is not to replicate the training data, but to predict unseen data well, i.e., to
generalize well.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data:
– If H is less complex: underfitting. Ex: fitting a line to data generated from a cubic polynomial.
– If H is more complex: overfitting. Ex: fitting a cubic polynomial to data generated from a line.
• Validation set:
– Used to minimize the generalization error.
– Optimize hyperparameters or model structure.
– Usually done with a “grid search”. Ex: try all values of H ∈ {10, 50, 100} and λ ∈ {10^−5, 10^−3, 10^−1}.
Ex: select the number of hidden units H, the architecture, the regularization parameter λ, or how long to train for a neural net;
select k for a k-nearest-neighbor classifier.
• Test set:
– Used to report the generalization error.
– We optimize nothing on it, we just evaluate the final model on it.
Model selection procedure:
1. For each class Hi , fit its optimal hypothesis hi using the training set.
2. Of all the optimal hypotheses, pick the one that is most accurate in the validation set.
We can then train the selected one on the training and validation set together (useful if we have little data altogether).
Then, it is better to introduce a binary random variable zdn that models whether xdn is missing.
8.7 Outlier (novelty, anomaly) detection
• An outlier is an instance that is very different from the other instances in the sample. Reasons:
– Abnormal behaviour. Fraud in credit card transactions, intruder in network traffic, etc.
• Not usually cast as a two-class classification problem because there are typically few outliers
and they don’t fit a consistent pattern that can be easily learned.
• We can also identify outliers as points that are far away from other samples. p. 200
• We want to learn a useful approximation to the underlying function that generated the data.
• We must choose:
– A model h(x; Θ) with parameters Θ (the hypothesis class H).
– A loss function L(·, ·) to compute the difference between the desired output (label) yn and our prediction for it, h(xn; Θ). Approximation error (loss):
E(Θ; X) = ∑_{n=1}^N L(yn, h(xn; Θ)) = sum of errors over instances.
– An optimization procedure (learning algorithm) to find the parameters Θ that minimize the error.
• The model, loss and learning algorithm are chosen by the ML system designer so that:
– The model class is large enough to contain a good approximation to the underlying function
that generated the data in X in a noisy form.
– The learning algorithm is efficient and accurate.
– We must have sufficient training data to pinpoint the right model.
3 Bayesian decision theory
Probability review (appendix A):
Joint distribution: p(X = x, Y = y).
Conditioning (product rule): p(Y = y | X = x) = p(X = x, Y = y) / p(X = x).
Marginalizing (sum rule): p(X = x) = ∑_y p(X = x, Y = y).
Bayes’ theorem (inverse probability): p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y).
Probability theory (and Bayes’ rule) is the framework for making decisions under uncertainty. It can
also be used to make rational decisions among multiple actions to minimize expected risk.
3.2 Classification
• Binary classification with random variables x ∈ R^D (example) and C ∈ {0, 1} (label).
– Joint probability p(x, C): how likely it is to observe an example x and a class label C.
– Prior probability p(C): how likely it is to observe a class label C, regardless of x.
p(C) ≥ 0 and p(C = 1) + p(C = 0) = 1.
– Class likelihood p(x|C): how likely it is that, having observed an example with class label
C, the example is at x.
This represents how the examples are distributed for each class. We need a model of x for each class.
– Posterior probability p(C|x): how likely it is that, having observed an example x, its class
label is C.
p(C = 1|x) + p(C = 0|x) = 1.
This is what we need to classify x. We infer it from p(C) and p(x|C) using Bayes’ rule.
• Making a decision on a new example: given x, classify it as class 1 iff p(C = 1|x) > p(C = 0|x).
• Examples:
– Gaussian classes in 1D: p(x|Ck) = (1 / (√(2π) σk)) exp(−½ ((x − µk)/σk)^2).
• K > 2 classes:
– Prior probability: p(Ck) ≥ 0, k = 1, . . . , K, and ∑_{k=1}^K p(Ck) = 1.
– Class likelihood: p(x|Ck).
– Bayes’ rule: p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / ∑_{i=1}^K p(x|Ci) p(Ci).
• We learn the distributions p(C) and p(x|C) for each class from data using an algorithm.
• Naive Bayes classifier: a very simple classifier where we assume p(x|Ck) = p(x1|Ck) · · · p(xD|Ck) for each class k = 1, . . . , K, i.e., within each class the features are independent from each other.
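A minimal sketch of Bayes’ rule with 1D Gaussian class likelihoods; the priors, means and variances below are hypothetical numbers, not taken from the notes:

```python
import math

def gauss(x, mu, sigma):
    """1D Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical class models: priors p(Ck) and 1D Gaussian likelihoods p(x|Ck).
priors = [0.6, 0.4]                  # p(C1), p(C2); must sum to 1
params = [(-2.0, 1.0), (3.0, 2.0)]   # (mu_k, sigma_k) for each class

x = 0.5                              # new example to classify
likelihoods = [gauss(x, mu, s) for mu, s in params]
evidence = sum(p * l for p, l in zip(priors, likelihoods))            # p(x)
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # Bayes' rule

print(posteriors, "-> choose class", 1 + posteriors.index(max(posteriors)))
```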
3.3 Losses and risks
• Sometimes the decision of what class to choose for x based on p(Ck |x) has a cost that depends
on the class. In that case, we should choose the class with lowest risk.
Ex: medical diagnosis, earthquake prediction.
• Define:
– λik : loss (cost) incurred for choosing class i when the input actually belongs to class k.
– Expected risk for choosing class i: Ri(x) = ∑_{k=1}^K λik p(Ck|x).
• Decision: we choose the class with minimum risk: arg min_{k=1,...,K} Rk(x).
Ex: C1 = no cancer, C2 = cancer; (λik) = [0 1000; 1 0] (rows: chosen class i, columns: true class k); p(C1|x) = 0.2.
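A small sketch of the risk computation for this example, using the loss matrix and posterior given above:

```python
# Loss matrix: lam[i][k] = cost of choosing class i+1 when the true class is k+1.
lam = [[0, 1000],   # choose C1 (no cancer): very costly if the truth is cancer
       [1, 0]]      # choose C2 (cancer): small cost of a false alarm
post = [0.2, 0.8]   # p(C1|x), p(C2|x)

# Expected risk R_i(x) = sum_k lam[i][k] p(Ck|x); choose the class of minimum risk.
risks = [sum(l * p for l, p in zip(row, post)) for row in lam]
print(risks)                                    # [800.0, 0.2]
print("choose C", 1 + risks.index(min(risks)))  # choose C2 (cancer)
```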
3.4 Discriminant functions
• Classification can equivalently be seen as computing a discriminant function gk(x) for each class and choosing the class with maximum gk(x).
• Examples:
– Risk based on Bayes’ classifier: gk = −Rk.
– With the 0/1 loss: gk(x) = p(Ck|x), or? gk(x) = p(x|Ck) p(Ck), or? gk(x) = log p(x|Ck) + log p(Ck).
– Many others: SVMs, neural nets, etc.
• They divide the feature space into K decision regions Rk = {x: k = arg maxi gi (x)}, k =
1, . . . , K. The regions are separated by decision boundaries, where ties occur.
• With two classes we can define a single discriminant? g(x) = g1 (x) − g2 (x) and choose class 1
iff g(x) > 0.
3.5 Association rules
• Association rule: implication of the form X → Y (X: antecedent, Y : consequent).
• Application: basket analysis (recommend product Y when a customer buys product X).
• Define the following measures of the association rule X → Y :
– Support = p(X, Y) = (# purchases of X and Y) / (# purchases).
For the rule to be significant, the support should be large. High support means items X and Y are frequently bought together.
– Confidence = p(Y|X) = p(X, Y) / p(X) = (# purchases of X and Y) / (# purchases of X).
For the rule to hold with enough confidence, p(Y|X) should be ≫ p(Y) and close to 1.
• Generalization to more than 2 items: X, Z → Y has confidence p(Y |X, Z), etc.
• Exe 3.7: example database of purchases.
• Given a database of purchases, we want to find all possible rules having enough support and
confidence.
• Algorithm Apriori :
1. Find itemsets with enough support (by finding frequent itemsets).
We don’t need to enumerate all possible subsets of items: for (X, Y, Z) to have enough
support, all its subsets (X, Y ), (Y, Z), (X, Z) must have enough support? . Hence: start
by finding frequent one-item sets (by doing a pass over the database). Then, inductively,
from frequent k-item sets generate k + 1-item sets and check whether they have enough
support (by doing a pass over the database).
2. Convert the found itemsets into rules with enough confidence (by splitting the itemset
into antecedent and consequent).
Again, we don’t need to enumerate all possible subsets: for X → Y, Z to have enough
confidence, X, Y → Z must have enough confidence? . Hence: start by considering a single
consequent and test the confidence for all possible single consequents (adding as a valid
rule if it has enough confidence). Then, inductively, consider two consequents for the
added rules; etc.
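A small sketch of the support and confidence computations and of the first Apriori pass; the items and baskets below are made up for illustration (they are not the database of Exe. 3.7):

```python
from itertools import combinations

# Hypothetical purchase database: one set of items per customer.
baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"}, {"milk", "diapers", "beer"},
           {"bread", "milk", "diapers", "beer"}, {"bread", "milk", "diapers"}]
N = len(baskets)

def support(itemset):
    """p(itemset) = fraction of baskets containing all the items."""
    return sum(itemset <= b for b in baskets) / N

def confidence(antecedent, consequent):
    """p(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# First Apriori pass: frequent single items (support >= threshold).
min_support = 0.4
frequent1 = [frozenset({i}) for i in set().union(*baskets) if support({i}) >= min_support]
# Candidate 2-item sets built from frequent 1-item sets, kept if they have enough support.
frequent2 = [a | b for a, b in combinations(frequent1, 2) if support(a | b) >= min_support]

print(frequent2)
print(confidence({"diapers"}, {"beer"}))   # confidence of the rule diapers -> beer
```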
• A more general way to establish rules (which may include hidden variables) are graphical
models.
19.7 Measuring classifier performance
Binary classification problems
• The most basic measure is the classification error (or, equivalently, the accuracy = 1 − error) in %: error = (# misclassified patterns) / (# patterns).
• It is often of interest to distinguish the different types of errors: confusing the positive class with the negative one, or vice versa.
Ex: voice authentication to log into a user account. A false positive (allowing an impostor) is much worse than a false negative (refusing a valid user).
Confusion matrix for two classes:
True class    Predicted positive     Predicted negative     Total
Positive      tp: true positive      fn: false negative     p
Negative      fp: false positive     tn: true negative      n
Total         p′                     n′                     N
• Having trained a classifier, we can control fn vs fp through a threshold θ ∈ [0, 1]. Let C1 be
the positive class and C2 the negative class. If the classifier returns p(C1 |x), let us choose the
positive class if p(C1|x) > θ (so θ = 1/2 gives the usual classifier):
– θ ≈ 1: almost always choose C2 , nearly no false positives but nearly no true positives.
– Decreasing θ increases the number of true positives but risks introducing false positives.
• ROC curve (“receiver operating characteristic”): the pair of values (fp-rate, tp-rate) as a function of θ ∈ [0, 1]. Properties:
– It is always increasing.
– An ideal classifier is at (0, 1) (top left corner).
– Diagonal (fp-rate = tp-rate): random classifier? . This is the worst we can do. Any classifier that is below the diagonal can be improved by flipping its decision? .
• Area under the curve (AUC): reduces the ROC curve to a number. Ideal classifier: AUC = 1.
Measure                        Formula
error                          (fp + fn)/N
accuracy = 1 − error           (tp + tn)/N
tp-rate (hit rate)             tp/p
fp-rate (false alarm rate)     fp/n
precision                      tp/p′
recall = tp-rate               tp/p
F-score                        2 · precision · recall / (precision + recall)
sensitivity = tp-rate          tp/p
specificity = 1 − fp-rate      tn/n
• ROC and AUC allow us to compare classifiers over different loss conditions, and choose a value
of θ accordingly. Often, there is not a dominant classifier.
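A minimal sketch of how an ROC curve is traced by sweeping the threshold θ over the classifier scores p(C1|x); the scores and labels below are made up:

```python
# Hypothetical classifier outputs p(C1|x) and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.8, 0.7, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,    0,   1,   1,    0,    0,   1,   0,   0  ]
p, n = labels.count(1), labels.count(0)

roc = []
for theta in sorted(set(scores) | {0.0, 1.0}, reverse=True):
    tp = sum(s > theta and y == 1 for s, y in zip(scores, labels))
    fp = sum(s > theta and y == 0 for s, y in zip(scores, labels))
    roc.append((fp / n, tp / p))   # (fp-rate, tp-rate) for this theta

print(roc)   # from (0, 0) at theta ~ 1 towards (1, 1) at theta = 0
```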
• Information retrieval : we have a database of records (e.g. documents, images) and a query, for which
some of the records are relevant (“positive”) and the rest are not (“negative”). A retrieval sys-
tem returns K records for the query. Its performance is measured using a precision/recall curve:
– Precision: #(relevant retrieved records) / #(retrieved records).
– Recall : #(relevant retrieved records) / #(relevant records). We can always achieve perfect
recall by returning the entire database (which will contain many irrelevant records).
Figures: (a) ROC curve; (b) ROC curves for 3 classifiers; precision (= a/(a + b)) and recall (= a/(a + c)) illustrated with Venn diagrams.
K > 2 classes
• Again, the most basic measure is the classification error.
• Confusion matrix: K × K matrix where entry (i, j) contains the number of instances of class Ci that are classified as Cj.
Ex: confusion matrix for the MNIST handwritten digits (rows = true class, columns = predicted class).
4 (Univariate) parametric methods
• How to learn probability distributions from data (in order to use them to make decisions).
• We assume such distributions follow a particular parametric form (e.g. Gaussian), so we need to estimate its parameters (µ, σ).
• Several ways to learn them:
– by optimizing an objective function (e.g. maximum likelihood)
– by Bayesian estimation.
• This chapter: univariate case; next chapter: multivariate case.
For more complicated distributions, we usually need an algorithm to find the MLE.
• Ex: assume x|Ck ∼ N(µk, σk^2). Estimate the parameters from a data sample {(xn, yn)}_{n=1}^N:
MLE: p̂(Ck) = Nk/N (class proportions), µ̂k = (1/Nk) ∑_{yn=k} xn and σ̂k^2 = (1/Nk) ∑_{yn=k} (xn − µ̂k)^2 (sample mean and variance of each class).
Figure: for two Gaussian classes in 1D, (a) likelihoods p(x|Ci), (b) posteriors p(Ci|x) with equal priors, and (c) expected risks R(αi|x) including a reject option (the “rej” regions).
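A small sketch of these per-class maximum-likelihood estimates in 1D; the sample below is made up:

```python
import numpy as np

# Hypothetical 1D sample with two classes.
x = np.array([-2.1, -1.7, -2.5, -1.9, 2.8, 3.3, 2.5, 3.1, 2.9])
y = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2])

for k in (1, 2):
    xk = x[y == k]
    prior = len(xk) / len(x)        # MLE of p(Ck): class proportion
    mu = xk.mean()                  # MLE of mu_k: sample mean
    var = ((xk - mu) ** 2).mean()   # MLE of sigma_k^2: sample variance (1/Nk)
    print(f"class {k}: p(Ck)={prior:.2f}, mu={mu:.2f}, var={var:.2f}")
```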
4.6 Maximum likelihood estimation: parametric regression
• Assume there exists an unknown function f that maps inputs x to outputs y = f (x), but that
what we observe as output is a noisy version y = f (x) + ǫ, where ǫ is a random error. We
want to estimate f by a parametric function h(x; Θ). In ch. 2 we saw the least-squares error
was a good loss function to use for that purpose. We will now show that maximum likelihood
estimation under Gaussian noise is equivalent to that.
• Log-likelihood of Θ given a sample {(xn, yn)}_{n=1}^N drawn iid from p(x, y):
L(Θ; X) = log ∏_{n=1}^N p(xn, yn) =? ∑_{n=1}^N log p(yn|xn; Θ) + constant.
• Assume an error ǫ ∼ N(0, σ^2), so p(y|x) ∼ N(h(x; Θ), σ^2). Then maximizing the log-likelihood is equivalent? to minimizing E(Θ; X) = ∑_{n=1}^N (yn − h(xn; Θ))^2, i.e., the least-squares error. p. 78
• Examples:
– Linear regression: h(x; w0, w1) = w1 x + w0. LSQ estimate? :
w = A^{−1} y, where A = [ 1, (1/N)∑_{n=1}^N xn ; (1/N)∑_{n=1}^N xn, (1/N)∑_{n=1}^N xn^2 ], w = (w0, w1)^T, y = ( (1/N)∑_{n=1}^N yn ; (1/N)∑_{n=1}^N yn xn ).
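A minimal sketch that solves this 2 × 2 linear system for (w0, w1); the mileage/price data are placeholders:

```python
import numpy as np

# Hypothetical training data: car mileage (x) and price (y).
x = np.array([10.0, 30.0, 50.0, 80.0, 120.0])
y = np.array([19.0, 16.5, 14.0, 11.0, 7.5])

# Normal equations A w = b with A, b as in the text (averages of x, x^2, y, xy).
A = np.array([[1.0,      x.mean()],
              [x.mean(), (x ** 2).mean()]])
b = np.array([y.mean(), (x * y).mean()])
w0, w1 = np.linalg.solve(A, b)   # better numerically than forming A^{-1} explicitly

print(w0, w1)                    # fitted line h(x) = w1*x + w0
```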
• Variance of the estimator: var{d} = EX{(d(X) − EX{d(X)})^2}. How much the estimator varies around its expected value from one sample to another.
If var{d} → 0 as N → ∞: consistent estimator.
Ex: the sample average is a consistent estimator of the true mean? .
• Bias of the estimator: bθ(d) = EX{d(X)} − θ. How much the expected value of the estimator differs from the true parameter.
Figure: estimates di from different samples scatter around their expectation E[d] (variance), which may differ from the true value θ (bias).
• Bias-variance decomposition: p. 70
error(d, θ) = EX{(d − θ)^2} = EX{(d − EX{d})^2} + (EX{d} − θ)^2 = var{d} + bθ^2(d) (variance + bias^2).
• In regression, the expected squared error at each x decomposes as E{(y − h(x))^2 | x} = E{(y − E{y|x})^2 | x} + (E{y|x} − h(x))^2:
– Noise term: variance of y given x. Equal to the variance of the noise ǫ added, independent
of h or X . Can never be removed no matter which estimator we use.
– Squared error term: how much the estimate h(x) deviates (at each x) from the regression
function E {y|x}. Depends on h and X . It is zero if h(x) = E {y|x} for each x.
The ideal? , optimal regression function is h(x) = f (x) = E {y|x} (conditional mean of p(y|x)).
• Consider h(x) as an estimate (for each x) of the true f(x). Bias-variance decomposition at x:
EX{(E{y|x} − h(x))^2 | x} = (E{y|x} − EX{h(x)})^2 + EX{(h(x) − EX{h(x)})^2} = bias^2 + variance.
The expectation EX{·} is over samples X of size N drawn from p(x, y). The other expectations are over p(y|x).
Figures: (a) true function and data; (b) polynomial fits of order 1 and higher to several samples Xm; bias and variance estimated from the samples Xm using the true function f in panel (a), plotted together with the error as a function of the polynomial order.
5 Multivariate parametric methods
• Sample X = {xn}_{n=1}^N where each example xn contains D features (continuous or discrete).
• We often write the sample (or dataset) as a matrix X = (x1 , . . . , xN ) (examples = columns).
Sometimes as its transpose (examples = rows).
• Correlation matrix: corr(Xi, Xj) = σij / (σi σj) ∈ [−1, 1].
If corr(Xi, Xj) = 0 (equivalently σij = 0) then Xi and Xj are uncorrelated (but not necessarily independent).
If corr(Xi, Xj) = ±1 then Xi and Xj are linearly related.
• Properties of the multivariate Gaussian x ∼ N_D(µ, Σ):
– Σ diagonal ⇒ the D features are uncorrelated and independent: p(x) = p(x1) · · · p(xD).
– x is Gaussian ⇒ each marginal p(xi) and conditional distribution p(xi|xj) is also Gaussian.
– x ∼ N_D(µ, Σ), W of D × K ⇒ W^T x ∼ N_K(W^T µ, W^T Σ W). A linear projection of a Gaussian is Gaussian.
5.5 Multivariate classification with Gaussian classes
Assume the class-conditional densities are Gaussian: x|Ck ∼ N (µk , Σk ). The maths carry over from
the 1D case in a straightforward way: p. 100
• MLE? : class proportions for p(Ck ); sample mean and sample covariance for µk and Σk , resp.
• Discriminant function? gk(x) = log p(x|Ck) + log p(Ck) = quadratic form on x.
• Number of parameters per class k = 1, . . . , K: p(Ck): 1; µk: D; Σk: D(D+1)/2? .
A lot of parameters for the covariances! Estimating them may need a large dataset.
– Simplifications reduce the number of parameters, e.g. diagonal Σk ✐ ⇒ MLE for σ^2_kd = dth diagonal element of Σk = variance of class k for feature d.
Figure: two-class Gaussian classification in 2D, showing the class-conditional density p(x|C1) and the posterior p(C1|x) over (x1, x2) for four covariance models: Σ1 ≠ Σ2 full; Σ1 = Σ2 full; Σ1 = Σ2 diagonal; Σ1 = Σ2 = σ^2 I.
5.6 Tuning complexity
• Again a bias-variance dilemma:
– Too few parameters (e.g. Σ1 = Σ2 = σ^2 I): high bias.
– Too many parameters (e.g. Σ1 ≠ Σ2, full): high variance.
• We can use cross-validation, regularization, Bayesian priors on the covariances, etc.
• We can fit a nonlinear function h(x) similarly to a linear regression by defining additional, nonlinear features such as x2 = x^2, x3 = e^x, etc. (just as happens with polynomial regression). Then, using a linear model in the augmented space x = (x, x^2, e^x)^T will correspond to a nonlinear model in the original space x. Radial basis function networks, kernel SVMs. . .
6 Dimensionality reduction
• If we want to train a classifier (or regressor) on a sample {(xn, yn)}_{n=1}^N where the number of features D in x (the dimension of x) is large, it is often useful to reduce the dimension first (fewer parameters, less risk of overfitting, faster training and testing, and easier visualization and interpretation).
6.2 Feature selection: forward selection
• Problem: given a sample {(xn, yn)}_{n=1}^N with xn ∈ R^D, determine the best subset of the D
features such that the number of selected features L is as small as possible and the classification
accuracy (using a given classifier, e.g. a linear SVM) is as high as possible.
Using all D features will always give the highest accuracy on the training set but not necessarily on the validation set.
• Useful when some features are unnecessary (e.g. irrelevant for classification or pure noise) or
redundant (so we don’t need them all).
Useful with e.g. microarray data. Not useful with e.g. image pixels.
• Best-subset selection: for each subset of features, train a classifier and evaluate it on a validation
set. Pick the subset having highest accuracy and up to L features.
Combinatorial optimization: 2^D possible subsets of D features? . Ex: D = 3: {∅, {x1}, {x2}, {x3}, {x1, x2}, {x1, x3}, {x2, x3}, {x1, x2, x3}}.
Brute-force search only possible for small D ⇒ approximate search.
• Forward selection: starting with an empty subset F , sequentially add one new feature at a
time. We add the feature d ∈ {1, . . . , D} such that the classifier trained on F ∪ {d} has highest
classification accuracy in the validation set. Stop when the accuracy doesn’t improve much, or when we reach L features (see the sketch after these bullets).
• Backward selection: same thing but start with F = {1, . . . , D} and remove one feature at a time.
• Both are greedy algorithms and are not guaranteed to find an optimal subset.
• Forward selection is preferable when we expect the optimal subset to contain few features (faster
than backward selection). It trains O(D^2) classifiers if we try up to D features.
• These feature selection algorithms are supervised : they use the labels yn when training the
classifier. The features selected depend on the classifier we use. There are also unsupervised algorithms.
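A sketch of greedy forward selection. The classifier and its validation accuracy are abstracted into a callback `accuracy(F)` that the user must supply; the toy accuracy function below is hypothetical, purely for illustration:

```python
def forward_selection(D, accuracy, L):
    """Greedy forward selection of up to L features out of {0, ..., D-1}.
    `accuracy(F)` should train a classifier using feature subset F and
    return its validation accuracy (this callback is left to the user)."""
    F, best_acc = [], 0.0
    while len(F) < L:
        candidates = [d for d in range(D) if d not in F]
        scored = [(accuracy(F + [d]), d) for d in candidates]
        acc, d = max(scored)
        if acc <= best_acc:            # stop when accuracy no longer improves
            break
        F, best_acc = F + [d], acc
    return F, best_acc

# Toy example: a hypothetical accuracy function where only features 0 and 2 help.
useful = {0: 0.2, 2: 0.15}
toy_accuracy = lambda F: 0.5 + sum(useful.get(d, -0.01) for d in F)
print(forward_selection(D=4, accuracy=toy_accuracy, L=4))   # -> ([0, 2], 0.85)
```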
Ex: forward selection on the Iris dataset (D = 4 features). p. 119
Figure: pairwise scatterplots of the four Iris features F1–F4.
6.3 Dimensionality reduction: principal component analysis (PCA)
• PCA (unsupervised): center the data and project it onto the L leading eigenvectors (principal components) of the sample covariance matrix, i.e., the directions of largest variance. The eigenvalues give the variance captured by each component; the scree graph (eigenvalues in decreasing order) and the proportion of variance explained help choose L.
Figures: (a) scree graph for Optdigits; (b) proportion of variance explained as a function of the number of eigenvectors; the Optdigits handwritten digits projected onto the first two principal components (PC1, PC2), where the digit classes form visible groups; comparison of a PCA projection and an LDA projection of the same dataset.
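A minimal PCA sketch via eigendecomposition of the sample covariance of centered data; the data matrix below is a random placeholder (rows = examples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # N x D data matrix (rows = examples)

Xc = X - X.mean(axis=0)                  # center the data
C = Xc.T @ Xc / len(X)                   # D x D sample covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigenvalues in increasing order
order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

L = 2
Z = Xc @ evecs[:, :L]                    # project onto the L leading eigenvectors
print(evals / evals.sum())               # proportion of variance per component (scree)
print(Z.shape)
```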
Dimensionality reduction: linear discriminant analysis (LDA)
• Aims at preserving most of the signal information that is useful to discriminate among the classes.
Find a low-dimensional space such that when x is projected there, classes are well separated.
• Define:
– Number of points in class k: Nk. Mean of class k: µk = (1/Nk) ∑_{yn=k} xn.
– Within-class scatter matrix for class k: Sk = ∑_{yn=k} (xn − µk)(xn − µk)^T.
– Total within-class scatter matrix: SW = ∑_{k=1}^K Sk.
– Between-class scatter matrix: SB = ∑_{k=1}^K Nk (µk − µ)(µk − µ)^T, where µ = (1/K) ∑_{k=1}^K µk.
• In the latent space, the between-class and within-class scatter matrices are W^T SB W and W^T SW W (of L × L).
• Fisher discriminant: max_W J(W) = |W^T SB W| / |W^T SW W| = between-class scatter / within-class scatter.
• This is an eigenproblem whose solution is W = (u1, . . . , uL) = eigenvectors associated with the largest L eigenvalues of SW^{−1} SB.
rank(SB) ≤ K − 1? (so we can only use values of L that satisfy 1 ≤ L ≤ K − 1).
SW must be invertible (if it is not, apply PCA to the data and eliminate directions with zero variance).
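A small sketch of the Fisher/LDA computation following the definitions above; the 3-class synthetic data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 3-class data in D = 4 dimensions.
means = [np.array([0, 0, 0, 0]), np.array([3, 1, 0, 0]), np.array([0, 3, 1, 0])]
X = np.vstack([m + rng.normal(size=(50, 4)) for m in means])
y = np.repeat([0, 1, 2], 50)

K = 3
mu_k = np.array([X[y == k].mean(axis=0) for k in range(K)])
mu = mu_k.mean(axis=0)                                   # mean of the class means
S_W = sum((X[y == k] - mu_k[k]).T @ (X[y == k] - mu_k[k]) for k in range(K))
S_B = sum((y == k).sum() * np.outer(mu_k[k] - mu, mu_k[k] - mu) for k in range(K))

# Eigenvectors of S_W^{-1} S_B; keep the L <= K-1 leading ones.
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[:K - 1]].real
Z = X @ W                                                # projected data, N x (K-1)
print(Z.shape)
```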
6.7 Dimensionality reduction: multidimensional scaling (MDS)
Figure: MDS example: given the true distances along the earth’s surface between pairs of European cities (Helsinki, Moscow, Dublin, London, Paris, Berlin, Zurich, Lisbon, Madrid, Istanbul, Rome, Athens), MDS estimates their positions on a 2D map.
• Unsupervised DR method: given the matrix of squared Euclidean distances d^2_nm = ||xn − xm||^2 between N data points, MDS finds points z1, . . . , zN ∈ R^L that approximate those distances:
min_Z ∑_{n,m=1}^N (d^2_nm − ||zn − zm||^2)^2.
• MDS does not use as training data the actual feature vectors xn ∈ RD , only the pairwise
distances dnm . Hence, it is applicable even when the “distances” are computed between objects
that are not represented by features. Ex: perceptual distance between two different colors according to a subject.
• If d^2_nm = ||xn − xm||^2 where xn ∈ R^D and D ≥ L, then MDS is equivalent to PCA on {xn}_{n=1}^N. p. 137
• MDS does not produce a projection or reconstruction mapping, only the actual L-dimensional projections z1, . . . , zN ∈ R^L for the N training points x1, . . . , xN ∈ R^D.
• How to learn a projection mapping F: x ∈ R^D → z ∈ R^L with parameters Θ? Two options:
– Direct fit: find the projections z1, . . . , zN by MDS and then solve a nonlinear regression:
min_Θ ∑_{n=1}^N ||zn − F(xn; Θ)||^2.
– Parametric embedding (requires nonlinear optimization):
min_Θ ∑_{n,m=1}^N (d^2_nm − ||F(xn; Θ) − F(xm; Θ)||^2)^2.
• Generalizations of MDS:
– Spectral methods: Isomap, Locally Linear Embedding, Laplacian eigenmaps. . .
Require solving an eigenproblem.
Isomap: define dnm = geodesic distances (approximated by shortest paths in a nearest-neighbor graph of the sample).
7 Clustering
• Basic types:
– Centroid-based : find a prototype (“centroid”) for each cluster.
k-means, k-medoids, k-modes, etc.
– Probabilistic models: mixture densities, where each component models one cluster.
Gaussian mixtures. Usually trained with the EM algorithm.
– Graph-based (or distance- or similarity-based ): construct a graph with the data points as
vertices and edges weighted by distance values, and partition it.
Hierarchical clustering, connected-components clustering, spectral clustering, etc.
• Ill-defined problem: what does “similar” mean, and how similar should points be to be in the
same cluster? Hence, many different definitions of clustering and many different algorithms.
This is typical of exploratory tasks, where we want to understand the structure of the data before
committing to specific analyses. In practice, one should try different clustering algorithms.
• k-means: given K, find centroids {µk}_{k=1}^K and assignments znk ∈ {0, 1} minimizing the squared distortion E({µk}, Z; X) = ∑_{n=1}^N ∑_{k=1}^K znk ||xn − µk||^2, by alternating an assignment step (assign each point to its closest centroid) and a centroid step (set each centroid to the mean of its assigned points).
Figure: k-means on a 2D dataset (x1, x2): initial centroids and assignments, and the result after successive iterations.
• Each iteration (centroid + assignment step) reduces E or leaves it unchanged. After a finite
number of iterations, no more changes occur and we stop.
• The result depends on the initialization. We usually initialize Z by a random assignment of
points to clusters, run k-means from several such random Z, and pick the best result.
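A minimal k-means sketch of the two alternating steps just described (random restarts left out; the data are placeholders):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # initialize from random points
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Centroid step: each centroid becomes the mean of its assigned points.
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
                        for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 6])])
centroids, labels = kmeans(X, K=3)
print(centroids)
```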
• Vector quantization: compress a continuous space of dimension D, represented by a sample X = {xn}_{n=1}^N, into a finite set of reference vectors (codebook), with minimal distortion.
– Ex: color quantization. Represent 24-bit color images using only 256 colors with minimal distortion.
– It can be done with k-means clustering using large K. Better than uniform quantization, because it
adapts to density variations in the data and does not require a codebook of size exponential on D.
Figure: vector quantization as an encoder/decoder: the encoder finds the codebook vector mi closest to x and sends its index i over the communication line; the decoder reconstructs x′ = mi.
7.1 Mixture densities and the EM algorithm
• Mixture density with K components: p(x; Θ) = ∑_{k=1}^K p(x|k) p(k), where the p(x|k) are the component densities and p(k) = πk the mixture proportions.
Similar to k-means, where the assignment and centroid steps correspond to the E and M steps.
But in EM the assignments are soft (znk ∈ [0, 1]), while in k-means they are hard (znk ∈ {0, 1}).
• If we knew which component xn came from for each n = 1, . . . , N, we’d not need the E step:
a single M step that estimates each component’s parameters on its set of points would suffice.
This was the case in classification (where x|Ck is Gaussian): we are given (xn , yn ).
• Each EM step increases L or leaves it unchanged, but it takes an infinite number of iterations
to converge. In practice, we stop when the parameters don’t change much, or when the number
of iterations reaches a limit.
• EM converges to a local optimum that depends on the initial value of Θ. Usually from k-means? .
• User parameter: number of clusters K. Output: posterior probabilities {p(k|xn)} and parameters {πk, µk, Σk}_{k=1}^K.
Density-based clustering: mean-shift clustering
Figure: a density p(x) with three modes, each defining a cluster (C1, C2, C3).
• Define a function p(x) that represents the density of the dataset {xn}_{n=1}^N ⊂ R^D, then declare each mode (maximum) of p as a cluster representative and assign each xn to a mode via the mean-shift algorithm. Most useful with low-dimensional data, e.g. image segmentation (where xn = features of the nth pixel).
• Kernel density estimate with bandwidth σ: a mixture having one component for each data point:
p(x) = ∑_{n=1}^N p(x|n) p(n) = (1/(N σ^D)) ∑_{n=1}^N K((x − xn)/σ), x ∈ R^D.
Usually the kernel K is Gaussian: K((x − xn)/σ) ∝ exp(−½ ||(x − xn)/σ||^2).
• Mean-shift algorithm: starting from an initial value of x, it iterates the following expression:
x ← ∑_{n=1}^N p(n|x) xn, where p(n|x) = p(x|n) p(n) / p(x) = exp(−½ ||(x − xn)/σ||^2) / ∑_{n′=1}^N exp(−½ ||(x − xn′)/σ||^2).
∑_{n=1}^N p(n|x) xn can be understood as the weighted average of the N data points using as weights the posterior probabilities p(n|x). The mean-shift algorithm converges to a mode of p(x). Which one it converges to depends on the initialization. By running mean-shift starting at a data point xn, we effectively assign xn to a mode. We repeat for all points x1, . . . , xN.
• User parameter: σ, which determines the number of clusters (σ ↓: N clusters, σ ↑: 1 cluster).
Output: modes and clusters.
• Nonparametric clustering: no assumption on the shape of the clusters or their number.
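A minimal mean-shift sketch implementing the iteration above with a Gaussian kernel; the data and the bandwidth σ are placeholders:

```python
import numpy as np

def mean_shift_point(x, X, sigma, iters=100, tol=1e-6):
    """Iterate x <- sum_n p(n|x) x_n until convergence to a mode of the KDE."""
    for _ in range(iters):
        w = np.exp(-0.5 * ((X - x) ** 2).sum(axis=1) / sigma ** 2)   # unnormalized p(n|x)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([4, 4], 0.5, (50, 2))])
modes = np.array([mean_shift_point(x, X, sigma=1.0) for x in X])  # run from every data point
# Points whose modes (nearly) coincide belong to the same cluster.
print(np.unique(modes.round(2), axis=0))
```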
Graph-based clustering: connected-components clustering
• Define a distance d(·, ·) between pairs of points. Ex:
– Minkowski distance d(x, y) = (∑_{d=1}^D |xd − yd|^p)^{1/p}, if x, y ∈ R^D.
– Number of words that two documents have in common; “edit” distance between strings (Exe. 6).
• Define a neighborhood graph with vertices = data points and edges xn ∼ xm if:
– d(xn , xm ) ≤ ǫ (ǫ-ball graph)
– xn and xm are among the k-nearest-neighbors of each other (k-nearest-neighbor graph).
• Clusters = connected-components of the graph (which can be found by depth-first search).
• User parameter: ǫ > 0 or k ∈ N. The number of clusters depends on that (ǫ ↓: N, ǫ ↑: 1).
• It can handle complex cluster shapes, but works only with non-overlapping clusters.
7.8 Graph-based clustering: hierarchical clustering
• Generates a nested sequence of clusterings.
• Agglomerative clustering: start with N clusters (= singleton points) and repeatedly merge pairs
of clusters until there is only one cluster left containing all points.
Divisive clustering: start with one cluster containing all points and repeatedly divide it until we have N clusters (= singleton points).
• We merge the two closest clusters Ci , Cj according to a distance δ(Ci , Cj ) between clusters:
– Single-link clustering: δ(Ci , Cj ) = minxn ∈Ci ,xm ∈Cj d(xn , xm ).
Tends to produce elongated clusters (“chaining” effect).
Equivalent to Kruskal’s algorithm to find a minimum spanning tree of a weighted graph G = (V, E, w) where V = {xn}_{n=1}^N, E = V × V and wnm = d(xn, xm).
Figure: example of single-link agglomerative clustering of points a–f shown as a dendrogram: the height h at which two clusters merge is the distance between them.
8 Nonparametric methods
Parametric methods
• Examples: a Gaussian class-conditional density with parameters (µ, Σ), a linear regression with parameters (w0, w1), etc.
• Training is the process of choosing the best parameter values Θ from a given dataset.
• The number of parameters is fixed and independent of (typically much smaller than) the size N of the training set. The model “compresses” the dataset into the parameters Θ, and the complexity of the model doesn’t grow with N. Ex: a linear function f(x) = w1 x + w0 has 2 parameters (w0, w1) regardless of the size N of the training set.
• After training, we discard the dataset, and use only the parameters to apply the model to future data. Ex: keep the 2 parameters (w0, w1) of the learned linear function and discard the training set {(xn, yn)}_{n=1}^N.
• The hypothesis space is restricted: only models of a certain parametric form (linear, etc.).
• Small complexity at test time as a function of N: both memory and time are O(1).
Nonparametric methods
• There is no training.
• The size of the model is given by the size of the dataset N, hence the complexity of the model
can grow with N.
• We keep the training set, which we need to apply the model to future data.
• The hypothesis space is less restricted (fewer assumptions). Often based on a smoothness
assumption: similar instances have similar outputs, and the output for an instance is a local
function of its neighboring instances. Essentially, the model behaves like a lookup table that we
use to interpolate future instances.
• Large complexity at test time as a function of N: both memory and time are O(N), because
we have to store all training instances and look through them to make predictions.
• They are particularly preferable to parametric methods with small training sets, where they
give better models and are relatively efficient computationally.
• Examples: kernel density estimate, nearest-neighbor classifier, running mean smoother, etc.
8.2 Nonparametric density estimation
• Given a sample X = {xn}_{n=1}^N drawn iid from an unknown density, we want to construct an estimator p(x) of the density.
• Histogram (consider first x ∈ R): split the real line into bins [x0 + mh, x0 + (m + 1)h] of width h for m ∈ Z, and count points in each bin:
p(x) = (1/(Nh)) × (number of xn in the same bin as x), x ∈ R.
– We need to select the bin width h and the origin x0 .
– x0 has a small but annoying effect on the histogram (near bin boundaries).
– h controls the histogram smoothness: spiky if h ↓ and smooth if h ↑.
– p(x) is discontinuous at bin boundaries.
– We don’t have to retain the training set once we have computed the counts.
– They generalize to D dimensions, but are practically useful only for D ≲ 2.
In D dimensions, it requires an exponential number of bins, most of which are empty.
– Also possible to define a different bandwidth hn for each data point xn (adaptive KDE ).
– The KDE quality degrades as the dimension D increases (no matter how h is chosen).
Could be improved by using a full covariance Σn per point, but it is preferable to use a mixture with K < N components.
• k-nearest-neighbor density estimate: p(x) = k / (2 N dk(x)) for x ∈ R^D, where dk(x) = (Euclidean) distance of x to its kth nearest sample in X.
– Like using a KDE with an adaptive bandwidth h = 2 dk(x).
Instead of fixing h and counting how many samples fall in the bin, we fix k and compute the bin size containing k samples.
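A small sketch of three 1D density estimators discussed here (histogram, Gaussian KDE, k-nearest-neighbor), evaluated at a single point x; the sample, h and k are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(2, 0.5, 60), rng.normal(5, 1.0, 40)])   # 1D sample
N = len(sample)
x = 2.3

h, x0 = 0.5, 0.0
# Histogram: count points in the same bin as x, divided by N*h.
p_hist = np.sum(np.floor((sample - x0) / h) == np.floor((x - x0) / h)) / (N * h)

# Gaussian KDE: (1/(N h)) sum_n K((x - x_n)/h).
p_kde = np.mean(np.exp(-0.5 * ((x - sample) / h) ** 2) / (np.sqrt(2 * np.pi) * h))

# k-NN estimate: k / (2 N d_k(x)).
k = 5
d_k = np.sort(np.abs(sample - x))[k - 1]    # distance to the k-th nearest sample
p_knn = k / (2 * N * d_k)

print(p_hist, p_kde, p_knn)
```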
Figure: density estimates of a 1D sample: histograms, naive/kernel estimators and k-nearest-neighbor estimates for several values of the bandwidth h (1, 0.5, 0.25) and of k (3, 1).
8.4 Nonparametric classification
• Given {(xn, yn)}_{n=1}^N where xn ∈ R^D is a feature vector and yn ∈ {1, . . . , K} a class label.
• Estimate the class-conditional densities p(x|Ck ) with a KDE each. The resulting discriminant
function can be written (ignoring constant factors) as? :
gk(x) = ∑_{n=1, yn=k}^N K((x − xn)/h), k = 1, . . . , K,
so each data point xn votes only for its class, with a weight given by K(·) (instances closer to
x have bigger weight).
• k-nearest-neighbor classifier: assigns x to the class having most instances among the k nearest neighbors of x.
– Most common case: k = 1, nearest-neighbor classifier. It defines a Voronoi tesselation in R^D. It works with as few as one data point per class!
– k is chosen to be an odd number to minimize ties.
– Simple; good classification accuracy; but slow at test time.
Condensed nearest-neighbor classifier: heuristic way to approximate the nearest-neighbor classifier using a subset of the N data points (to reduce its space and time complexity).
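A minimal k-nearest-neighbor classifier sketch (Euclidean distance); the two-class data are placeholders:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k=3):
    """Assign x to the majority class among its k nearest training points."""
    d2 = ((X - x) ** 2).sum(axis=1)      # squared Euclidean distances to all points
    nearest = np.argsort(d2)[:k]         # indices of the k nearest neighbors
    return Counter(y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (30, 2)), rng.normal([4, 4], 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)
print(knn_classify(np.array([3.5, 3.0]), X, y, k=3))   # most likely class 2
```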
Nonparametric regression: smoothing models
• Kernel smoother: g(x) = ∑_{n=1}^N K((x − xn)/h) yn / ∑_{n=1}^N K((x − xn)/h), a locally weighted average of the outputs yn.
g(x) can be seen as the conditional mean E {y|x} of a KDE p(x, y) constructed on the sample? .
Also, as with KDEs:
– Regressogram: with an origin x0 and bin width h, and discontinuous at boundaries. eq. (8.24)
– Running mean smoother : K = uniform kernel. No origin x0 , only bin width h, but still discontinuous at boundaries. eq. (8.25)
– Running median smoother : like the running mean but using the median instead; more robust to outliers and noise.
– k-nearest-neighbor smoother : fixes k instead of h, adapting to the density around x.
• Running-line smoother : we use the data points around x (as defined by h or k) to estimate a
line rather than a constant, hence obtaining a local regression line.
Figure: smoothers applied to a 1D sample for several bandwidths: regressogram (h = 6, 3, 1), running mean smoother (h = 6, 3, 1), kernel smoother (h = 1, 0.5, 0.25) and running-line smoother (h = 6, 3, 1).
8.9 How to choose the smoothing parameter
• Smoothing parameter : number of neighbors k or kernel bandwidth h.
9 Decision trees
• Nonparametric methods applicable to classification and regression given a sample {(xn, yn)}_{n=1}^N.
• Efficient at test time. May not even use all the features in the training set.
• The tree can be transformed into a set of IF-THEN rules that are interpretable for people.
• Widely used in practice, sometimes preferred over more accurate but less interpretable models.
• A decision tree consists of:
– Internal decision nodes, each having ≥ 2 children. Decision node m selects one of its
children based on a test (a split) applied to the input x.
∗ Continuous feature xd : “xd > sm ” for some sm ∈ R.
∗ Discrete feature xd : n-way split for the n possible values of xd .
– Leaves, each containing a value (class label or output value y). Instances xn falling in the
same leaf should have identical (or similar) output values yn .
∗ Classification: class label y ∈ {1, . . . , K} (or proportion of each class p1 , . . . , pK ).
∗ Regression: numeric value y ∈ R (average of the outputs for the leaf’s instances).
The predicted output for an instance x is obtained by following a path from the root to a leaf.
In the best case (balanced tree) for binary trees, the path has length log2 L if there are L leaves.
– Each leaf defines a “bin” (box) in input space. Different bins may have different sizes and
contain different numbers of instances.
(Unlike in nonparametric methods using a fixed bandwidth h or a fixed number of neighbors k.)
– The bin for an instance x can be found very quickly (a few feature comparisons).
No need to scan the whole training set.
– We can discard the training set once the tree is constructed, and keep only its nodes (and
associated thresholds and output values).
Then, why do we call it nonparametric?
– The size of the tree can still be O(N) if each leaf represents one instance. We stop the
tree from growing before this happens, so each leaf represents several instances.
We can also grow the full tree and then prune it back to a desired size.
9.2 Univariate trees
Figure: a univariate decision tree for two classes: the root tests “x1 > w10” and a second node tests “x2 > w20”; the leaves are labeled C1 or C2. The tree partitions the (x1, x2) plane into axis-aligned regions.
• The test at node m compares one feature with a threshold: “xd > sm ” for some d ∈ {1, . . . , D}
and sm ∈ R. This defines a binary split into two regions: {x ∈ R^D: xd ≤ sm} and {x ∈ R^D: xd > sm}. The overall tree defines box-shaped, axis-aligned regions in input space.
Simplest and most often used. More complex tests exist, e.g. “wm^T x > sm”, which define oblique regions (multivariate trees).
• With discrete features, the number of children equals the number nd of values the feature can
take, and the test selects the child corresponding to the value of xd (n-way split).
Ex: xd ∈ {red, green, blue} ⇒ nd = 3 children.
• Tree induction is learning the tree from a training set {(xn, yn)}_{n=1}^N, i.e., determining its nodes:
– For each internal node, its test (feature d ∈ {1, . . . , D} and threshold sm ).
– For each leaf, its output value y.
• For a given sample, many (sufficiently big) trees exist that code it with zero error. We want to
find the smallest such tree. This is NP-complete so an approximate, greedy algorithm is used: Fig. 9.3
pseudocode
– Starting at the root with the entire training set, select a best split according to a purity
criterion, and associate each child with the subset of instances that fall in it.
– Continue splitting each child (with its subset of the training set) recursively until each
child is pure (hence a leaf) and no more splits are necessary.
Ex. algorithms: CART, ID3, C4.5.
Classification trees
• Purity criterion: a split is pure if each child it produces contains instances of the same class. Consider a node and all the training set instances that reach it, and call pk the proportion of those instances that belong to class k, k = 1, . . . , K. A common impurity measure is the entropy φ(p1, . . . , pK) = −∑_{k=1}^K pk log2 pk (for two classes, −p log2 p − (1 − p) log2(1 − p), which is maximal at p = 1/2 and zero at p ∈ {0, 1}).
Other measures possible: Gini index φ(p) = ∑_{i≠j} pi pj, misclassification error φ(p) = 1 − max(p1, . . . , pK).
• If a node is pure, i.e., all its instances are of the same class k, there is no need to split it. It
becomes a leaf with output value k.
We can also store the proportions p = (p1 , . . . , pK ) in the node (e.g. to compute risks).
• If a node m is not pure, we split it. We evaluate all possible features d = 1, . . . , D and all
possible split thresholds sm for each feature and pick the one with minimum impurity φ.
– If the number of instances that reach node m is Nm, there are Nm − 1 possible thresholds (halfway between consecutive values of xd, assuming we have sorted them), and the best threshold must be between values of different classes.
– For discrete features, there is no threshold but an n-way split.
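A sketch of the split search for one node of a classification tree: for every feature and every midway threshold, compute the children’s weighted entropy and keep the best. The data are placeholders:

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    """Return (impurity, feature d, threshold s) minimizing the weighted entropy."""
    best = (np.inf, None, None)
    for d in range(X.shape[1]):
        values = np.unique(X[:, d])
        for s in (values[:-1] + values[1:]) / 2:          # midway thresholds
            left, right = y[X[:, d] <= s], y[X[:, d] > s]
            impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if impurity < best[0]:
                best = (impurity, d, s)
    return best

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (20, 2)), rng.normal([3, 0], 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(best_split(X, y))     # expect a split on feature 0 near x1 = 1.5
```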
Regression trees
• Purity criterion: the squared error E(g) = ∑_{n∈node} (yn − g)^2 where g ∈ R, which is minimal when g is the mean of the yn values? , and then E(g) is the variance of the yn values.
If there is much noise or outliers, it is preferable to set g to the median of the yn values.
• We consider a node to be pure if E ≤ θ for a user threshold θ > 0. In that case, we do not
split it. It becomes a leaf with output value g.
• If a node m is not pure, we split it. We evaluate all possible features d = 1, . . . , D and all
possible split thresholds sm for each feature and pick the one with minimum impurity (= sum
of the variances E of each of the children), as in the classification case.
• Rather than assigning a constant output value to a leaf, we can assign it a regression function
(e.g. linear), as in a running-mean smoother.
• Pruning is slower than early stopping but it usually leads to trees that generalize better.
An intuitive reason why is as follows. In the greedy algorithm used to grow the tree, once we make a decision at a given node (to
select a split) we never backtrack and try a different, maybe better, possibility.
Figure: regression trees fit to a 1D sample for decreasing purity thresholds (θr = 0.5, θr = 0.2, . . .): a smaller θr allows more splits (e.g. “x < 3.16”, “x < 1.36”, “x < 5.96”), giving a deeper tree and a finer piecewise-constant fit.
10 Linear discrimination
• Consider learning a classifier given a sample {(xn, yn)}_{n=1}^N where xn ∈ R^D and yn ∈ {1, . . . , K}.
– Discriminative approach: we learn only the class boundaries, through discriminant func-
tions gk (x), k = 1, . . . , K. We don’t learn the class densities, and gk (x) need not be
modeled using probabilities. Hence, it requires assumptions only about the class bound-
aries but not about their densities. It solves a simpler problem.
This and future chapters, using linear and nonlinear discriminant functions.
• We define a parametric model gk(x; Θk) for each class discriminant. In linear discrimination, this model is linear: gk(x; wk) = ∑_{d=1}^D wkd xd + wk0 = wk^T x (assuming an extra constant feature x0 = 1). We assume that, in the training set, instances from any one class are linearly separable from instances of the other classes.
• Learning means finding parameter values {Θk}_{k=1}^K that optimize the quality of the separation.
In the parametric generative approach, learning means finding parameter values Θk for each class p(x|Ck ; Θk ) that maximize the
likelihood, separately for each class.
Radial basis function (RBF) networks and kernel support vector machines (SVMs) do this. ch. 12–13
10.3 Geometry of the linear discriminant
Figures: geometry of the linear discriminant in 2D: the boundary g(x) = w1 x1 + w2 x2 + w0 = 0 is a line separating the half-spaces g(x) > 0 and g(x) < 0; the distance of the origin to the boundary is |w0|/||w|| and that of a point x is |g(x)|/||w||, with w orthogonal to the boundary; a linearly separable 3-class problem with one-vs-all hyperplanes H1, H2, H3; and a pairwise linearly separable 3-class problem with one-vs-one hyperplanes H12, H23, H31.
Two classes
• One discriminant function is sufficient? : g1(x) − g2(x) = w^T x + w0 = g(x).
Testing: choose C1 if g(x) > 0 and C2 if g(x) < 0.
• This defines a hyperplane where w is the weight vector and w0 the threshold (or bias). It divides the input space R^D into two half-spaces, the decision regions R1 for C1 (positive side) and R2 for C2 (negative side). The hyperplane itself is the boundary or decision surface.
• The origin x = 0 is on the positive side of the boundary if w0 > 0, on the boundary if w0 = 0, and on the negative side if w0 < 0.
• w is orthogonal to the hyperplane. Pf. Pick x, y on the hyperplane.
• So w determines the orientation of the hyperplane and w0 its location wrt the origin.
K > 2 classes There are two common ways to define the decision for a point x ∈ R^D:
• One-vs-all (linear separability): we use K discriminant functions gk (x) = wkT x + wk0 .
– Training: discriminant function gk is trained to classify the training points of class Ck vs
the training points of all other classes.
– Testing: choose Ck if k = arg max_{i=1,...,K} gi(x), i.e., pick the class having the largest discriminant. Ideal case: gk(x) > 0 and gi(x) < 0 ∀i ≠ k.
• One-vs-one (pairwise linear separability): we use K(K − 1)/2 discriminants gij(x) = wij^T x + wij0.
– Training: discriminant function gij (for i 6= j) is trained to classify the training points of
class Ci vs the training points of class Cj (training points from other classes are not used).
– Testing: choose Ck if k = arg max_{i=1,...,K} ∑_{j≠i} gij(x), i.e., pick the class having the largest summed discriminant. Ideal case: gkj(x) > 0 ∀j and gij(x) < 0 ∀i ≠ k, ∀j.
Also possible: for each class k, count the number of times gkj (x) > 0 for j 6= k, and pick
the class with most votes.
• Both divide the input space R^D into K convex decision regions R1, . . . , RK.
10.6 Minimizing functions by (stochastic) gradient descent
• When the minimization problem minw∈RD E(w) cannot be solved in closed form (solution
w∗ = some formula), we use iterative methods (w(0) → w(1) → · · · → w(∞) = w∗ ). Many
such methods exist (Newton’s method, conjugate gradients. . . ). One of the simplest ones,
reasonably effective in many cases (although sometimes very slow), is gradient descent.
– Gradient descent: repeatedly iterate w ← w + ∆w with ∆w = −η ∇E(w), where η > 0 is the step size (learning rate).
– Stop iterating:
∗ When ∇E(w) ≈ 0 (since ∇E(w∗ ) = 0 at a minimizer w∗ of E). But this overfits.
∗ When the error on a validation set starts increasing (early stopping), which will
happen before the training error is minimized. The model w generalizes better to unseen data.
– It will find a local minimizer (not necessarily global ).
• Stochastic gradient descent (SGD): applicable when E(w) = ∑_{n=1}^N e(w; xn), i.e., the total error is the sum of the error at each data point. We update w as soon as we process xn: repeatedly iterate w ← w + ∆w with an update ∆w = −η ∇e(w; xn).
– Epoch = one pass over the whole dataset {x1, . . . , xN}. It corresponds to one “noisy” iteration of gradient descent: it need not always decrease E(w) because it doesn’t use the correct gradient ∇E(w) = ∑_{n=1}^N ∇e(w; xn).
– Much faster than (batch) gradient descent to get an approximate solution for w if N
is large (so there is redundancy in the dataset), or if data points come one at a time
and we don’t store them (online learning). However, very slow convergence thereafter.
– The step size has to decrease slowly over epochs for convergence to occur.
One typically takes η(t) = α/(β + t) at epoch t for suitable α, β > 0.
– Shuffling: in each epoch, process the N points in a random order. Better than a fixed order.
– Minibatches: computing the update ∆w = −η ∑_{n∈B} ∇e(w; xn) based on subsets B of 1 < |B| < N points works better than with |B| = 1 (pure online learning) or |B| = N (pure batch learning). In practice |B| = 10 to 1 000 points typically.
• Ex: E(w) = ½ ∑_{n=1}^N (yn − w^T xn)^2 (least-squares error for linear regression). The updates:
GD: ∆w = η ∑_{n=1}^N (yn − w^T xn) xn        SGD: ∆w = η (yn − w^T xn) xn
have the form (ignoring η): “Error × Input” where Error = DesiredOutput − ActualOutput.
This has the following effect: if the prediction equals the desired output there is no update; otherwise, the weights move so as to reduce the error on that instance (see the sketch below).
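A small sketch contrasting the batch and stochastic updates above on a toy linear-regression problem; the data, step sizes and iteration counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]      # x0 = 1 absorbs the bias w0
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0, 0.3, 100)    # underlying line plus noise

w_gd = np.zeros(2)
for _ in range(500):                                  # batch gradient descent
    w_gd += 0.01 * X.T @ (y - X @ w_gd)               # eta * sum_n (y_n - w^T x_n) x_n

w_sgd = np.zeros(2)
for epoch in range(20):                               # stochastic gradient descent
    eta = 0.02 / (1 + epoch)                          # decreasing step size over epochs
    for n in rng.permutation(len(y)):                 # shuffle each epoch
        w_sgd += eta * (y[n] - X[n] @ w_sgd) * X[n]

print(w_gd, w_sgd)   # both should approach (w0, w1) close to (3, 2)
```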
10.7 Logistic regression (a classification method, not a regression method!)
• For two classes, model the posterior probability with the logistic (sigmoid) function:
p(C1|x) = θ = σ(w^T x + w0) = 1 / (1 + exp(−(w^T x + w0))).
Figure: the logistic function σ(t).
• It defines a parametric estimator for p(C1 |x) directly, without having a model for p(x|C1 ), p(C1 )
and p(x|C2), p(C2). We learn its parameters (w, w0) from a training set {(xn, yn)}_{n=1}^N.
• Another way to derive the logistic classifier: let us model the log-ratio of the class-conditional densities as a linear function: log(p(x|C1)/p(x|C2)) = w^T x + w0^0. This implies? p(C1|x) = σ(w^T x + w0) (where w0 = w0^0 + log(p(C1)/p(C2))). Note that, unlike in generative models, we don’t have a model of the class-conditional densities themselves.
• When the data are Gaussian distributed, logistic regression has a comparable error rate to
a parametric Gaussian linear discriminant. Logistic regression can still be applied when the
class-conditional densities are non-Gaussian, as long as they are linearly separable.
• Geometry of the logistic classifier: like the geometry of the linear discriminant (w determines
the orientation of the decision boundary and w0 its location wrt the origin), and the magnitude
of w and w0 determines how steep the logistic function becomes.
• Logit function or log odds of θ: logit(θ) = log(θ/(1 − θ)) ∈ (−∞, ∞) for θ ∈ (0, 1). It is the inverse of the logistic function.
K > 2 classes
• We use a softmax function to model p(Ck|x) = exp(wk^T x + wk0) / ∑_{j=1}^K exp(wj^T x + wj0) for k = 1, . . . , K.
– It satisfies p(Ck|x) ∈ (0, 1) and ∑_{k=1}^K p(Ck|x) = 1. For K = 2, softmax = logistic function.
– “Soft” max(·): if wk^T x + wk0 ≫ wj^T x + wj0 ∀j ≠ k then p(Ck|x) ≈ 1, p(Cj|x) ≈ 0 ∀j ≠ k.
It is differentiable (unlike the max(·) function).
In this sense, the logistic function is a soft step function.
Figure: softmax discriminants for a 3-class problem in 2D: the discriminant values gi(x1, x2) and the resulting posterior probabilities P(Ci|x1, x2).
10.8 Learning logistic regression
Given a sample {(xn, yn)}_{n=1}^N, we want to learn an estimator:
• K = 2: p(C1|x) = σ(w^T x + w0) = 1 / (1 + exp(−(w^T x + w0))) (logistic) with yn ∈ {0, 1}.
• K > 2: p(Ck|x) = exp(wk^T x + wk0) / ∑_{j=1}^K exp(wj^T x + wj0) (softmax) for each class, with yn ∈ {1, . . . , K}.
10.8.1 By maximum likelihood
• K = 2: maximizing the log-likelihood of the labels is equivalent to minimizing the cross-entropy error
E(w, w0; X) = −∑_{n=1}^N (yn log θn + (1 − yn) log(1 − θn)) = −∑_{n: yn=0} log(1 − θn) − ∑_{n: yn=1} log θn, where θn = σ(w^T xn + w0).
• There is no closed-form solution for the minimizer, so we use gradient descent. Using dσ/dt = σ(t)(1 − σ(t)), the gradients are
∂E/∂w = −∑_{n=1}^N (yn − θn) xn        ∂E/∂w0 = −∑_{n=1}^N (yn − θn).
If the classes are linearly separable, ||w|| → ∞ and θn → yn ∈ {0, 1} as training proceeds. Instead, we stop early, when the number of misclassifications is zero. Fig. 10.6 pseudocode
Newton’s method (suitably modified) is much more effective than gradient descent for this problem.
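A minimal sketch of logistic regression trained by batch gradient descent with the gradients above; the data, step size and iteration count are placeholders:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)                  # labels y_n in {0, 1}

w, w0, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    theta = sigmoid(X @ w + w0)                    # theta_n = p(C1|x_n)
    w += eta * X.T @ (y - theta) / len(y)          # -eta * dE/dw (averaged over the sample)
    w0 += eta * (y - theta).mean()                 # -eta * dE/dw0
    if ((theta > 0.5) == y).all():                 # stop early: zero misclassifications
        break

print("training error:", ((sigmoid(X @ w + w0) > 0.5) != y).mean())
```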
Multinomial distribution: p. 600
• A die with K faces, each with probability θk ∈ [0, 1] for k = 1, . . . , K with ∑_{k=1}^K θk = 1.
• p(y; θ) = θ1^y1 · · · θK^yK = θk if yk = 1, where y ∈ {0, 1}^K has exactly one yk = 1 and the rest are 0.
10.8.2 By least-squares regression
Two classes: yn ∈ {0, 1}
• We define a least-squares regression problem min_Θ E(Θ) = ½ ∑_{n=1}^N (yn − f(xn; Θ))^2 where:
– The labels {yn } are considered as real values (which happen to be either 0 or 1), and can
be seen as the desired posterior probabilities for the points {xn }.
– The parametric function to be learned is f(x; w, w0) = σ(w^T x + w0) = 1 / (1 + exp(−(w^T x + w0))).
– Hence, learning the classifier by regression means trying to estimate the desired posterior
probabilities with the function f (whose outputs are in [0, 1]).
– The error is then E(w, w0; X) = ½ ∑_{n=1}^N (yn − θn)^2, where θn = σ(w^T xn + w0).
• As before, if the classes are linearly separable, this would drive kwk → ∞ and θn → yn ∈ {0, 1}.
Instead, we stop iterating when the right output dimension is larger than the others for each
instance.
K > 2 classes: yn ∈ {1, . . . , K}
– We represent the labels yn as a one-of-K binary vector yn ∈ {0, 1}^K (containing a single
1 in position k if yn corresponds to class k), and consider the ykn values as real values
(which happen to be either 0 or 1).
– We learn a separate function fk(x; wk, wk0) = σ(wk^T x + wk0) = 1 / (1 + exp(−(wk^T x + wk0))) per output dimension k = 1, . . . , K.
• This formulation ignores the fact that the components of f(x) should sum to 1. A correct
formulation, but more complicated to optimize, would use the softmax function for f.
11 Multilayer perceptrons (artificial neural nets)
• Parametric nonlinear function approximators f(x; Θ) for classification or regression.
• Originally inspired by neuronal circuits in the brain (McCulloch–Pitts neurons, Rosenblatt’s
perceptron, etc.), and by the brain’s ability to solve intelligent problems (visual and speech
perception, navigation, planning, etc.).
Synapses between neurons = MLP weight parameters (which are modified during learning). Neuron firing = MLP unit nonlinearity.
• They are usually implemented in software in serial computers and more recently in parallel or
distributed computers. There also exist VLSI implementations.
Figures: a biological neuron; the organization of neurons in the retina.
11.3 Training a perceptron
• Apply stochastic gradient descent: to minimize the error E(w) = ∑_{n=1}^N e(w; xn, yn), repeatedly update the weights w ← w + ∆w with ∆w = −η ∇e(w; xn, yn) and n = 1, . . . , N (one epoch).
– Regression by least-squares error: e(w; xn, yn) = ½ (yn − w^T xn)^2 ⇒ ∆w = η (yn − w^T xn) xn.
– Classification by maximum likelihood, or equivalently cross-entropy:
∗ For two classes: yn ∈ {0, 1} and e(w; xn , yn ) = −yn log θn − (1 − yn ) log (1 − θn ) where
θn = σ(wT xn ) ⇒ ∆w = η(yn − θn ) xn .
∗ For K > 2 classes: yn ∈ {0, 1}^K coded as 1-of-K and e(w; xn, yn) = −∑_{k=1}^K ykn log θkn, where θkn = exp(wk^T xn) / ∑_{j=1}^K exp(wj^T xn) ⇒ ∆wkd = η (ykn − θkn) xdn for d = 0, . . . , D and k = 1, . . . , K. Fig. 11.3 pseudocode
The original perceptron algorithm was a variation of stochastic gradient descent. For linearly separable problems, it converges in a finite (possibly large) number of iterations. For problems that are not linearly separable, it never converges.
Perceptrons can implement linearly separable Boolean functions such as AND, but not XOR:
AND:  x1 x2 | y        XOR:  x1 x2 | y
      0  0  | 0              0  0  | 0
      0  1  | 0              0  1  | 1
      1  0  | 0              1  0  | 1
      1  1  | 1              1  1  | 0
A perceptron computes AND as y = σ(x1 + x2 − 3/2) (with x0 = +1 as the bias input), since only the point (1, 1) lies on the positive side of the line x1 + x2 = 3/2. The XOR classes are not linearly separable, so no single perceptron can compute XOR; an MLP with a single nonlinear hidden layer (units z1, z2) solves the XOR problem (see the sketch below).
47
11.7 Learning MLPs: the backpropagation algorithm
• The backpropagation algorithm is (stochastic) gradient descent applied to an MLP using a
least-squares or cross-entropy error function. Since the MLP consists of a nested sequence of
functions (layers) x → z1 → · · · → y = f(x; Θ), with weight matrices Θ = {W1, . . . , WK} (one per
layer), each layer being a nonlinear perceptron, the gradient is computed with the chain rule, and
can be interpreted as backpropagating the error at the output layer through the hidden layers back
to the input.
• Ex: an MLP with one hidden layer having H units, with inputs x ∈ R^D and outputs y ∈ R^{D′},
where the hidden units are sigmoidal and the output units are linear. The gradient is as follows:
– wrt a second-layer weight vih:  ∂E/∂vih = (∂E/∂fi)(∂fi/∂vih) = δi ∂fi/∂vih  (as with a perceptron
given fixed inputs zh), with "error" δi = ∂E/∂fi.
– wrt a first-layer weight whd:  ∂E/∂whd = Σ_{i=1}^{D′} (∂E/∂fi)(∂fi/∂zh)(∂zh/∂whd) = Σ_{i=1}^{D′} δi (∂fi/∂zh)(∂zh/∂whd).
• With more hidden layers, the gradient wrt each layer’s weights is computed recursively.
• The updates using stochastic gradient descent with a minibatch B to minimize an error function
E(Θ; {(xn, yn)}_{n=1}^N) = Σ_{n=1}^N e(Θ; xn, yn) are of the form Θ ← Θ + ∆Θ with ∆Θ =
−η Σ_{n∈B} ∇Θ e(Θ; xn, yn), where the gradients are given below. B = {1, . . . , N} gives batch gradient descent.
Regression We assume outputs y of dimension D′. Θ = {W, V}. Least-squares error (Fig. 11.11: pseudocode):
e(W, V; xn, yn) = en(W, V) = ½ ‖yn − f(xn; W, V)‖² = ½ Σ_{i=1}^{D′} (yin − fi(xn; W, V))²
where the output of output unit i is fi = Σ_{h=0}^H vih zh (with z0 = +1) and the output of hidden unit h is
zh = σ(Σ_{d=0}^D whd xd) (with x0 = +1). The gradients are:
∂en/∂vih = −(yin − fi(xn)) zhn        (the factor δi = yin − fi(xn) is the error of output unit i)
∂en/∂whd = −Σ_{i=1}^{D′} (yin − fi(xn)) vih zhn (1 − zhn) xdn        (the factor zhn(1 − zhn) comes from σ′).
Classification, two classes We need a single output unit f (xn ). Θ = {W, v}. Cross-entropy error :
e(W, v; xn, yn) = en(W, v) = −yn log f(xn; W, v) − (1 − yn) log(1 − f(xn; W, v))
∂en/∂vh = −(yn − f(xn)) zhn        ∂en/∂whd = −(yn − f(xn)) vh zhn (1 − zhn) xdn.
Classification, K > 2 classes We need one output unit fi (xn ) per class i = 1, . . . , K, given by
the softmax function. Θ = {W, V}. Cross-entropy error :
e(W, V; xn, yn) = en(W, V) = −Σ_{i=1}^K yin log fi(xn; W, V),   fi(xn) = exp(oin) / Σ_{k=1}^K exp(okn),   oin = Σ_{h=0}^H vih zhn
∂en/∂vih = −(yin − fi(xn)) zhn        ∂en/∂whd = −Σ_{i=1}^K (yin − fi(xn)) vih zhn (1 − zhn) xdn.
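A minimal sketch (random data, one sigmoidal hidden layer, linear outputs, fixed learning rate are all illustrative assumptions) of the backpropagation gradients for the regression case above, run as stochastic gradient descent one point at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, Dp, N = 3, 5, 2, 100                       # input dim, hidden units, output dim, # points
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, Dp))                     # targets (illustrative)
W = rng.normal(scale=0.01, size=(H, D + 1))      # first layer (includes bias column)
V = rng.normal(scale=0.01, size=(Dp, H + 1))     # second layer (includes bias column)
eta = 0.01

def forward(x):
    z = 1.0 / (1.0 + np.exp(-W @ np.append(1.0, x)))   # hidden activations z_h = sigma(w_h^T x)
    f = V @ np.append(1.0, z)                           # linear outputs f_i
    return z, f

for epoch in range(50):
    for n in rng.permutation(N):
        z, f = forward(X[n])
        delta = Y[n] - f                          # output errors delta_i = y_in - f_i(x_n)
        zb, xb = np.append(1.0, z), np.append(1.0, X[n])
        dV = np.outer(delta, zb)                  # -dE/dV  (error x hidden outputs)
        dW = np.outer((V[:, 1:].T @ delta) * z * (1 - z), xb)   # -dE/dW (backpropagated error)
        V += eta * dV
        W += eta * dW

_, f = forward(X[0])
print("prediction for first point:", f)
```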
(Figure: an MLP with H = 2 hidden units trained on a 2D two-class problem; the error on the training and validation sets over training epochs; and, after 100, 200 and 300 epochs: (a) the hyperplanes of the hidden units in layer 1, (b) the hidden unit outputs, (c) the hidden unit outputs × weights in layer 2.)
• Gradient descent converges very slowly in optimization problems with many parameters.
• Vanishing gradients problem: there is one product σ ′ (z) = σ(z) (1 − σ(z)) per layer. If the
value of z is large, which will happen if the weights or inputs at that layer are large, then the
sigmoid saturates and σ ′ (z) ≈ 0, so the gradient becomes tiny, and it takes many iterations to
make progress. This is particularly problematic with deep networks (having several layers of
hidden units). Other nonlinearities have less of a problem, e.g. ReLU.
• Initial weights: small random values (e.g. uniform in [−0.01, 0.01]) so as not to saturate the
sigmoids.
• Normalizing the inputs so they have zero mean and unit variance speeds up the optimization
(since we use a single step size η for all parameters).
• Rescaling the learning rate ηw for each weight w by using a diagonal approximation to the
Hessian of the objective function E(w). This helps to correct for the fact that weights in
different layers, or subject to weight sharing, may have different effects on the output.
• It is also possible to use other optimization methods (modified Newton’s method, conjugate
gradients, L-BFGS, etc.) but, for problems with large N and redundant samples, they don’t
seem to improve significantly over SGD (with properly tuned learning rate, minibatch size and
momentum term).
Overtraining
• Model selection for the number of hidden units (hence weights): the bias-variance tradeoff
applies as usual:
– Neural nets with many hidden units can achieve a very low training error but memorize
the noise as well as the signal, hence overfit.
– Neural nets with few hidden units can have a large bias, hence underfit.
The number of hidden units can be estimated by cross-validation.
• More generally, one needs to select the overall architecture of the neural net (number of layers
and of hidden units in each layer, connectivity between them, choice of nonlinearity, etc.). This
is done by trial and error and can be very time-consuming.
• Early stopping: during the (stochastic) gradient descent optimization (with a given number of
hidden units), we monitor the error on a validation set and stop training when the validation
error doesn’t keep decreasing. This helps to prevent overfitting as well.
• Weight decay: we discourage weights from taking large values by adding to the objective
function, or equivalently to the update, a penalty term:
E′(w) = E(w) + (λ/2) Σ_i wi²   ⟺   ∆wi = −η (∂E/∂wi + λ wi).
This has the effect of choosing networks with small weights as long as they have a low training
error (depending on λ). Such networks are smoother and generalize better to unseen data. The
value of λ is set by cross-validation.
(Figure: mean squared error on the training and validation sets as a function of the number of hidden units (left) and of the number of training epochs (right).)
• Weight sharing: the values of certain weights are constrained to be equal. For example, with
convolutional neural nets, the weights in the window are the same for every hidden unit. This
corresponds to using a filter that is homogeneous in space or time, and helps to detect features
regardless of their location in the input. Ex: edges at different locations and of different orientations.
• Convolutional neural nets with weight sharing have a much smaller number of weights, can be
trained faster, and usually give better results than fully-connected networks.
• This process may be repeated in successive layers, making it possible for the network to learn
a hierarchical representation of the data, with each layer learning progressively more complex
and abstract concepts.
Ex: pixels → edges in different orientations → corners, T-junctions, contour segments → parts → objects → classes of objects.
• With temporal data, as in speech recognition or machine translation, one may use time-delay
neural nets or recurrent neural nets.
• Deep learning refers to neural networks having multiple layers that can learn hierarchical rep-
resentations of complex data such as images. Properly trained on large, labeled datasets (using
GPUs), they have achieved impressive results in recent years in difficult tasks such as object
recognition, speech recognition, machine translation or game learning. This success is due to
the ability of the network to learn nonlinear mappings in high-dimensional spaces, and to the
fact that the network is learned end-to-end, i.e., all its weights are optimized with little human
contribution (rather than e.g. having a preprocessing layer that is fixed by hand by an expert).
Ex: instead of extracting preselected features from an image (e.g. SIFT features) or from the speech waveform (e.g. MFCCs), the
first layer(s) of the net learn those features optimally.
Hints
• Hints are properties of the target function that are known to us independent of the training
data. They should be built into the neural net if possible, because they help to learn a good
network, particularly when the training data is limited.
• Ex: invariance. The output of a classifier is often invariant to certain known transformations
of the input (e.g. translating, rotating or slightly scaling an object in an image does not change its class).
11.11 Dimensionality reduction: autoencoders
• Autoencoder: an MLP that, given a training set {xn}_{n=1}^N ⊂ R^D without labels, tries to recon-
struct its input ("labels" yn = xn), but has a bottleneck layer, whose number of hidden units
H is smaller than the dimension D of the input x. Its output layer is linear and it is trained
by least-squares error:
min_{W1,W2} E(W1, W2) = ½ Σ_{n=1}^N ‖yn − f(F(xn; W1); W2)‖²
• This forces the overall network f(F(x)) to learn an optimal low-dimensional representation
zn = F(xn ) ∈ RH of each instance xn ∈ RD in the bottleneck (or “code”) layer. The codes (=
outputs of the bottleneck hidden layer) are optimal for reconstructing the input instances.
• If the hidden layers are all linear then the network becomes equivalent to PCA? .
The hidden unit weights need not be the H leading eigenvectors of the data covariance matrix, but they span the same subspace.
If they are nonlinear (e.g. sigmoidal) then it learns a nonlinear dimensionality reduction.
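A minimal sketch (illustrative data) of the claim above for the linear case: the optimal linear autoencoder with H code units reconstructs each (centered) point by projecting it onto the subspace spanned by the H leading principal directions, which we can obtain in closed form from an SVD instead of training with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 500, 5, 2
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, D))   # data lying close to a low-dim subspace
Xc = X - X.mean(axis=0)

# PCA subspace: the H leading right singular vectors of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_basis = Vt[:H]                                       # H x D

codes = Xc @ pca_basis.T                                 # encoder F(x): the H-dimensional codes
recon = codes @ pca_basis                                # linear decoder f(z)
print("relative reconstruction error:", np.linalg.norm(Xc - recon) / np.linalg.norm(Xc))
```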
(Figure: an autoencoder with one bottleneck hidden layer, in linear and nonlinear versions (encoder, codes z1 . . . zH, decoder, outputs y1 . . . yd); the 2D hidden representation ("codes") learned for handwritten digit images 0–9, where digits of the same class cluster together.)
• If a neural network is trained for classification and has a bottleneck (a hidden layer with
dimension smaller than the input), then the network will learn a low-dimensional representation
in that hidden layer that is optimal for classification (similar to LDA if using linear layers).
12 Local models: radial basis function networks
12.3 Radial basis function networks
• Assume a dataset {xn, yn}_{n=1}^N with xn ∈ R^D and yn ∈ R^{D′} for regression or yn ∈ {1, . . . , K}
for classification.
• They can be seen as a feedforward neural net with a single hidden layer of Gaussian units and
an output layer of linear units. They are not usually constructed with more hidden layers, unlike MLPs.
• Two types of representation of information in neural nets (assume for simplicity that units
output either 0 (not activated) or 1 (activated)):
– Distributed representation (as in MLPs): each input is encoded by the simultaneous activation
of many hidden units, and each hidden unit is active for many different inputs.
– Local representation (as in RBF nets): each unit is locally tuned to a small area of the input
space called its receptive field, and the input space is paved with such units.
Neurons in the primary visual cortex respond only to highly localized stimuli in retinal position and angle of visual orientation.
(Figure: an RBF network with output yi, weights wih, basis function units ph and bias p0 = +1; in input space, local tuning to centroids m1, m2, m3 (receptive fields) compared with the global hyperplanes w1, w2 of an MLP, for points xa, xb, xc.)
• Training for regression: we minimize the least-squares error
E({wh, µh}_{h=1}^H, σ) = ½ Σ_{n=1}^N ‖yn − f(xn)‖² + λ Σ_{h=1}^H ‖wh‖²
where λ ≥ 0 is a regularization user parameter which controls the smoothness of f. We can use
gradient descent, where the gradient is computed using the chain rule and is similar to that of
an MLP (p. 330). However, RBF nets can be trained in an approximate but much simpler and faster
way as follows:
1. Set the centroids {µh}_{h=1}^H in an unsupervised way using only the input points {xn}_{n=1}^N,
usually by running K-means with H centroids, or by selecting H points at random.
2. Set the width σ by cross-validation (see below).
It is also possible to set σ to a rough value as follows. Having run k-means, we compute the distance from each centroid µh
to the farthest point xn in its cluster. We then set σ to the average such distance over the H centroids.
3. Given the centroids and width, the values φh(xn) are fixed, and the weights {wh}_{h=1}^H are
determined by optimizing E, which reduces to a simple linear regression. We solve the
linear system✐ (recall the normal equations of chapter 5):
(ΦΦ^T + λI) W = ΦY^T   with   Φ_{H×N} = (φh(xn))_{hn},   W_{H×D′} = (w1, . . . , wH)^T,   Y_{D′×N} = (y1, . . . , yN).
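A minimal sketch (synthetic 1D regression data; for simplicity the centroids are randomly chosen points rather than K-means centroids, and the width is a rough average distance) of the fast approximate training above: fix centroids and width, then solve the regularized linear system for the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, lam = 200, 10, 1e-3
x = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=N)       # targets, D' = 1

mu = x[rng.choice(N, H, replace=False)]                  # step 1: H centroids chosen at random
sigma = np.mean(np.abs(mu - mu.T)) + 1e-12               # step 2: a rough width

Phi = np.exp(-((x - mu.T) ** 2) / (2 * sigma ** 2)).T    # H x N matrix, Phi_hn = phi_h(x_n)
# Step 3: solve (Phi Phi^T + lambda I) W = Phi Y^T for the weights.
W = np.linalg.solve(Phi @ Phi.T + lam * np.eye(H), Phi @ y[:, None])

f = (Phi.T @ W)[:, 0]                                    # predictions on the training inputs
print("training RMSE:", np.sqrt(np.mean((y - f) ** 2)))
```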
• Training for classification: we minimize the cross-entropy error
E({wh, µh}_{h=1}^H, σ) = −Σ_{n=1}^N Σ_{i=1}^K yin log θin + λ Σ_{h=1}^H ‖wh‖²
θin = exp(fi(xn)) / Σ_{k=1}^K exp(fk(xn)),   fi(x) = Σ_{h=1}^H wih φh(x).
That is, we use an RBF net with H BFs and K outputs, which we pass through a softmax.
This can either be optimized by gradient descent, or trained approximately as above (but the
optimization over the weights cannot be solved by a linear system and requires gradient descent
itself).
• Model selection: the complexity of a RBF net is controlled by:
– The number H of basis functions: H↓ high bias, low variance; H↑ low bias, high variance.
– The regularization parameter λ ≥ 0: the larger, the smoother the RBF net.
Numerically, we typically need λ ≳ 10⁻⁷ to make the linear system sufficiently well conditioned.
– The width σ (if not optimized over or set by hand): the larger, the smoother the RBF net.
They are all set by cross-validation. We usually search over a range of values of H, log λ and log σ.
• Because they are localized, RBF nets may need many basis functions to achieve a desired
accuracy.
• Universal approximation: any continuous function (satisfying mild assumptions) from R^D to
R^{D′} can be approximated by an RBF net with an error as small as desired (by using sufficiently
many basis functions).
Intuitive proof: we can enclose every input instance (or region) with a basis function having as centroid that instance and a small
enough width so it has little overlap with other BFs. This way we “grid” the input space RD with sufficiently many BFs, so that
only one BF is active for a given instance xn . Finally, we set the weight of each BF to the desired function value. This gives
an approximately piecewise constant approximation to the function (essentially equal to a nonparametric kernel smoother). Its
accuracy may be increased by using more hidden units and placing a finer grid in the input.
12.5 Normalized basis functions
• With Gaussian BFs, it is possible that φh(x) = 0 for all h = 1, . . . , H for some x ∈ R^D. This
happens if x is far enough (wrt σ) from each centroid. We can avoid this by normalizing the
basis functions so they sum to one over h, e.g. φ̃h(x) = φh(x) / Σ_{h′=1}^H φ_{h′}(x).
Now, even if x is far (wrt σ) from each centroid, at least one BF will have a nonzero value.
• φh (x) can be seen as the posterior probability p(h|x) of BF h given input x assuming a Gaussian
mixture model p(x), where component h has mean µh , and all components have the same
proportion H1 and the same, isotropic covariance matrix Σh = σ 2 I.
• The nonparametric kernel smoother of chapter 8 using a Gaussian kernel of bandwidth h > 0,
g(x) = Σ_{n=1}^N [ K(‖x − xn‖/h) / Σ_{n′=1}^N K(‖x − xn′‖/h) ] yn,
has the same form as an RBF net with normalized basis functions, with one basis function per data point.
13 Kernel machines (support vector machines, SVMs)
• Discriminant-based method: it models the classification boundary, not the classes themselves.
• It can produce linear classifiers (linear SVMs) or nonlinear classifiers (kernel SVMs).
• Very effective in practice with certain problems:
– It gives good generalization to test cases.
– The optimization problem is convex and has a unique solution (no local optima).
– It can be trained on reasonably large datasets.
– Special kernels can be defined for many applications.
• It can be extended beyond classification to solve regression, dimensionality reduction, outlier
detection and other problems.
The basic approach is the same in all cases: maximize the margin and penalize deviations, resulting in a convex quadratic program.
• We have seen several linear classifiers: logistic regression, Gaussian classes with common co-
variance, now linear SVMs. . . They give different results because they have different inductive
bias, i.e., make different assumptions (the objective function, etc.).
• The signed distance of a point xn to the hyperplane is (w^T xn + w0)/‖w‖ ∈ R, hence its
absolute value is yn(w^T xn + w0)/‖w‖ ≥ 0 (since yn ∈ {−1, +1}). Let x_{n∗} be the closest
instance to the hyperplane and ρ = y_{n∗}(w^T x_{n∗} + w0)/‖w‖ > 0 its distance (the margin), i.e.,
yn(w^T xn + w0)/‖w‖ ≥ ρ ∀n = 1, . . . , N. To make this distance as large as possible, we can
maximize ρ subject to yn(w^T xn + w0)/‖w‖ ≥ ρ ∀n = 1, . . . , N. But yn(w^T xn + w0)/‖w‖
is invariant to rescaling both w and w0 by a constant factor, so we arbitrarily fix this factor
by requiring ‖w‖ = 1/ρ (or, equivalently, by requiring yn(w^T xn + w0) = 1 for the instance xn
closest to the hyperplane). Then, since maximizing ρ is the same as minimizing 1/(2ρ²) = ½‖w‖²,
we finally reach the following maximum margin optimization problem:
Primal QP:   min_{w∈R^D, w0∈R} ½‖w‖²   s.t.   yn(w^T xn + w0) ≥ 1,   n = 1, . . . , N.
• This is a constrained optimization problem, specifically a convex quadratic program (QP), de-
fined on D + 1 variables (w, w0) with N linear constraints and a quadratic objective function. It
has a unique minimizer (w, w0), which can be found with different QP algorithms. The closest
instances to each side of the hyperplane are at a distance? 1/‖w‖ from the hyperplane.
• Solving the primal problem, a QP on D + 1 variables, is equivalent to solving the following
dual problem, another convex QP but on N variables α = (α1 , . . . , αN )T and N + 1 constraints
(α1 , . . . , αN are the Lagrange multipliers of the N constraints in the primal QP):
Dual QP:   min_{α∈R^N} ½ Σ_{n,m} yn ym (xn^T xm) αn αm − Σ_{n=1}^N αn   s.t.   Σ_{n=1}^N yn αn = 0,   αn ≥ 0, n = 1, . . . , N.
This can be solved in O(N³ + N²D) time and O(N²) space (because we need the dot products
xn^T xm for n, m = 1, . . . , N). The optimal α ∈ R^N has the property that αn > 0 only for those
instances that lie on the margin (the closest ones to the hyperplane), i.e., yn(w^T xn + w0) = 1,
called the support vectors (SVs). For all other instances, which lie beyond the margin, αn = 0.
• From the optimal α ∈ R^N, we recover the solution (w, w0) as follows:
w = Σ_{n=1}^N αn yn xn = Σ_{n∈SVs} αn yn xn        w0 = yn − w^T xn for any support vector xn?
so the optimal weight vector can be written as a l.c. of the SVs. This allows kernelization later.
(Figure: left, the soft-margin hyperplane with cases (a)–(d) of instances relative to the margin; right, the 0/1 loss, hinge loss and cross-entropy loss as a function of yn g(xn).)
In terms of ξn : (a) ξn = 0, (b) ξn = 0, (c) 0 < ξn < 1, (d) ξn > 1;
In terms of αn : (a) αn = 0, (b) 0 < αn < C, (c)–(d) αn = C;
(b)–(d) are SVs (circled instances ⊕ ⊙, on or beyond the margin)
13.3 Binary classification, not linearly separable case:
soft margin hyperplane
• If the dataset is not linearly separable, we look for the hyperplane that incurs least classification
error while having the largest margin. For each data point xn , we allow a deviation ξn ≥ 0 (a
slack variable) from the margin, but penalize such deviations proportionally to C > 0:
Primal QP:   min_{w∈R^D, w0∈R, ξ∈R^N} ½‖w‖² + C Σ_{n=1}^N ξn   s.t.   yn(w^T xn + w0) ≥ 1 − ξn,   ξn ≥ 0,   n = 1, . . . , N.
C > 0 is a user parameter that controls the tradeoff between maximizing the margin (2/‖w‖)
and minimizing the deviations or "soft error" (Σ_{n=1}^N ξn):
– C ↑: tries to avoid every single error but reduces the margin, so it can overfit;
– C ↓: enlarges the margin but ignores most errors, so it can underfit.
C is set by cross-validation. Typically, one chooses C on a log scale, e.g. C ∈ {10⁻⁶, 10⁻⁵, . . . , 10⁺⁵, 10⁺⁶}.
• This is again a convex QP, on D + 1 + N variables, and its unique minimizer (w, w0, ξ) can be
found with different QP algorithms. It is equivalent to solving the dual QP on N variables:
Dual QP:   min_{α∈R^N} ½ Σ_{n,m} yn ym (xn^T xm) αn αm − Σ_{n=1}^N αn   s.t.   Σ_{n=1}^N yn αn = 0,   C ≥ αn ≥ 0,   n = 1, . . . , N
which differs from the dual QP of the linearly separable case in the extra “αn ≤ C” constraints.
The optimal α ∈ RN has the property that αn > 0 only for those instances that lie on or within
the margin or are misclassified, i.e., yn (wT xn + w0 ) ≤ 1, called the support vectors (SVs). For
all other instances, which lie beyond the margin, αn = 0. As a result, for each training point
n = 1, . . . , N we can have:
– ξn = 0: xn is correctly classified and beyond the margin (here, αn = 0).
– 0 < ξn < 1: xn is correctly classified but within the margin (here, 0 < αn ≤ C).
– ξn ≥ 1: xn is misclassified (here, αn = C).
• From the optimal α ∈ R^N, we recover the solution (w, w0) as before:
w = Σ_{n=1}^N αn yn xn = Σ_{n∈SVs} αn yn xn        w0 = yn − w^T xn for any xn with 0 < αn < C.
• The primal problem can be written equivalently but without constraints by eliminating ξn as:
min_{w∈R^D, w0∈R} ½‖w‖² + C Σ_{n=1}^N [1 − yn g(xn)]₊   where g(x) = w^T x + w0.
This defines the hinge loss, which penalizes errors linearly:
[1 − yn g(xn)]₊ = max(0, 1 − yn g(xn)) = { 0 if yn g(xn) ≥ 1;  1 − yn g(xn) otherwise }
and is more robust against outliers compared to the quadratic loss (1 − yn g(xn))².
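A minimal sketch (synthetic data, plain subgradient descent with a fixed step size; a real SVM solver would use a QP) of training a linear classifier with the unconstrained hinge-loss formulation above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
C, eta = 1.0, 0.01
w, w0 = np.zeros(2), 0.0

for it in range(500):
    margins = y * (X @ w + w0)
    viol = margins < 1                       # points on the wrong side of the margin
    # Subgradient of 1/2 ||w||^2 + C * sum_n max(0, 1 - y_n (w^T x_n + w0)):
    gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    g0 = -C * y[viol].sum()
    w, w0 = w - eta * gw, w0 - eta * g0

print("training error:", np.mean(np.sign(X @ w + w0) != y))
```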
13.5 Kernelization: kernel SVMs for nonlinear classification
• We map the data points x to a higher-dimensional space. Learning a linear model in the new
space corresponds to learning a nonlinear model in the original space.
• Assume we map x ∈ R^D → z = φ(x) = (φ1(x), . . . , φK(x))^T ∈ R^K where K > D (usually, K is much
larger or even infinite), and z1 = φ1(x) ≡ 1 (so we don't have to write a bias term explicitly).
The discriminant is now g(x) = w^T φ(x) = Σ_{i=1}^K wi φi(x).
• The primal and dual soft-margin QPs are as before but replacing xn with φ(xn):
Primal QP:   min_{w∈R^K, ξ∈R^N} ½‖w‖² + C Σ_{n=1}^N ξn   s.t.   yn(w^T φ(xn)) ≥ 1 − ξn,   ξn ≥ 0,   n = 1, . . . , N
Dual QP:   min_{α∈R^N} ½ Σ_{n,m} yn ym (φ(xn)^T φ(xm)) αn αm − Σ_{n=1}^N αn   s.t.   Σ_{n=1}^N yn αn = 0,   C ≥ αn ≥ 0,   n = 1, . . . , N
where φ(xn)^T φ(xm) = K(xn, xm).
Optimal solution:   w = Σ_{n=1}^N αn yn φ(xn) = Σ_{n∈SVs} αn yn φ(xn)
Discriminant:   g(x) = w^T φ(x) = Σ_{n=1}^N αn yn (φ(xn)^T φ(x)),   with φ(xn)^T φ(x) = K(xn, x).
T
• Kernelization: we replace the inner product φ(xn ) φ(xm ) between basis functions with a kernel
function K(xn , xm ) that operates on a pair of instances in the original input space. This saves
the work of having to construct φ(·), whose dimension can be very high or infinite, and taking
the dot product. All we need in practice is the kernel K(·, ·), not the basis functions φ(·).
• The resulting, nonlinear discriminant (the kernel SVM) is
g(x) = w^T φ(x) = Σ_{n=1}^N αn yn φ(xn)^T φ(x) = Σ_{n=1}^N αn yn K(xn, x) = Σ_{n∈SVs} αn yn K(xn, x).
Again, we can use the kernel K(·, ·) directly and don't need the basis functions.
• Many other algorithms that depend on dot products of input instances have been kernelized
(hence made nonlinear) in the same way: PCA, LDA, etc.
• The fundamental object in a kernel SVM, which determines the resulting discriminant, is the
kernel, and the N × N Gram matrix of dot products it defines on a training set of N points:
K = (K(xn, xm))_{nm} = (φ(xn)^T φ(xm))_{nm}.
For this matrix and the dual QP to be well defined, the kernel K(·, ·) must be a symmetric positive definite function, i.e., it must
satisfy u^T Ku ≥ 0 ∀u ∈ R^N, u ≠ 0.
• A kernel SVM can be seen as a basis function expansion (as in RBF networks), but the number
of BFs is not set by the user (possibly by cross-validation); it is determined automatically
during training, and each BF corresponds to one input instance that is a support vector.
• At test time, a kernel SVM works like a template classifier, by "comparing" the test pattern x
with the SVs (templates) by means of the kernel, and combining these comparisons into g(x)
to compute the final discriminant. Ex: for MNIST handwritten digits, with the SVs being selected training images,
g(x) = Σ_{n∈SVs} αn yn K(xn, x) = α1 y1 K(x1, x) + α2 y2 K(x2, x) + · · ·
where each term K(xn, x) measures how similar x is to support vector xn.
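A minimal sketch (the data points and dual variables are made up for illustration; a real kernel SVM would obtain the αn by solving the dual QP) of how the kernel SVM evaluates its discriminant: only the kernel K(·, ·) is needed, never the basis functions φ(·).

```python
import numpy as np

def gaussian_kernel(A, B, s2=0.5):
    # K(a, b) = exp(-||a - b||^2 / (2 s^2)) for every pair of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * s2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                  # pretend these 6 points are the support vectors
y = np.array([+1, +1, +1, -1, -1, -1])
alpha = rng.uniform(0.1, 1.0, size=6)        # illustrative dual variables (alpha_n > 0 for SVs)

K = gaussian_kernel(X, X)                    # 6 x 6 Gram matrix (symmetric, PSD)
x_test = np.array([[0.3, -0.2]])
g = (alpha * y) @ gaussian_kernel(X, x_test)[:, 0]   # g(x) = sum_n alpha_n y_n K(x_n, x)
print("discriminant value g(x):", g, "-> class", "+1" if g > 0 else "-1")
```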
(Figure: kernel SVM decision boundaries and margins for a polynomial kernel of degree 2 and for Gaussian kernels of widths s² = 2, 0.5, 0.25, 0.1; circled instances ⊕, ⊙ are SVs; +, · are non-SVs.)
13.9 Multiclass kernel machines
With K > 2 classes, there are two common ways to define the decision for a point x ∈ RD :
• One-vs-all : we use K SVMs gk (x), k = 1, . . . , K.
– Training: SVM gk is trained to classify the training points of class Ck (label +1) vs the
training points of all other classes (label −1).
– Testing: choose Ck if k = arg maxi=1,...,K gi (x).
• One-vs-one: we use K(K − 1)/2 SVMs gij(x), i, j = 1, . . . , K, i ≠ j.
– Training: SVM gij is trained to classify the training points of class Ci (label +1) vs the
training points of class Cj (label −1); training points from other classes are not used.
– Testing: choose Ck if k = arg max_{i=1,...,K} Σ_{j≠i} gij(x). Also possible: for each class k, count
the number of times gkj(x) > 0 for j ≠ k, and pick the class with most votes.
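A minimal sketch (the pairwise discriminant values are made up for illustration) of combining one-vs-one classifiers by voting: for each class we count how many pairwise machines decide in its favor.

```python
import numpy as np

K = 3
# g[(i, j)] = value of the machine trained on class i (+1) vs class j (-1) at a test point x.
g = {(0, 1): +0.7, (0, 2): -0.2, (1, 2): +1.1}

votes = np.zeros(K, dtype=int)
for (i, j), gij in g.items():
    winner = i if gij > 0 else j             # gij > 0 votes for class i, otherwise for class j
    votes[winner] += 1

print("votes per class:", votes, "-> choose class", int(votes.argmax()))
```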
Support vector regression (ǫ-insensitive loss)
• The SVM can be extended to regression by fitting f(x) = w^T x + w0 while tolerating deviations of
up to ǫ from each target yn (the "ǫ-tube") and penalizing larger deviations, with slack variables
ξn⁺, ξn⁻, proportionally to C > 0.
• This is a convex QP on D + 1 + 2N variables and its unique minimizer (w, w0, ξ⁺, ξ⁻) can be
found with different QP algorithms. It is equivalent to solving the dual QP on 2N variables:
min_{α⁺,α⁻∈R^N} ½ Σ_{n,m} (αn⁺ − αn⁻)(αm⁺ − αm⁻)(xn^T xm) + ǫ Σ_{n=1}^N (αn⁺ + αn⁻) − Σ_{n=1}^N yn (αn⁺ − αn⁻)
s.t.   Σ_{n=1}^N (αn⁺ − αn⁻) = 0,   C ≥ αn⁺, αn⁻ ≥ 0,   n = 1, . . . , N.
The optimal α ∈ RN has the property that αn+ = αn− = 0 only for those instances that lie
within the ǫ-tube; these are the instances that are fitted with enough precision. The support
vectors satisfy either αn+ > 0 or αn− > 0. As a result, for each training point n = 1, . . . , N we
can have:
– αn+ = αn− = 0: xn is fitted within the ǫ-tube.
– Either αn+ or αn− is in (0, C): xn is fitted on the boundary of the ǫ-tube.
We use these instances to calculate w0 , since they satisfy yn = wT xn + w0 ± ǫ if αn∓ > 0.
– αn+ = C or αn− = C: xn is fitted outside the ǫ-tube.
• Hence, the fitted line can be written as a weighted sum of the support vectors:
f(x) = w^T x + w0 = Σ_{n=1}^N (αn⁺ − αn⁻)(xn^T x) + w0.
• Kernelization: we replace the dot products xTn xm (in the dual QP) or xTn x (in the regression
line) with K(xn , xm ) or K(xn , x), respectively, where K(·, ·) is a kernel (polynomial, Gaussian,
etc.). This results in a nonlinear regression function.
14 Graphical models
Basic probability rules:
Joint distribution: p(X = x, Y = y).
Conditioning (product rule): p(Y = y | X = x) = p(X = x, Y = y) / p(X = x).
Marginalizing (sum rule): p(X = x) = Σ_y p(X = x, Y = y).
Bayes' theorem (inverse probability): p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y).
Directed graphical models
– X and Y are conditionally independent given Z iff p(X, Y |Z) = p(X|Z) p(Y |Z).
Equivalently? p(X|Y, Z) = p(X|Z) and p(Y |X, Z) = p(Y |Z).
Example ("modeling wet grass and rain"). Notation: p(R) means p(R = 1), p(W |R) means p(W = 0|R = 1), etc.
A graphical model with completely specified conditional distributions at each node (in the example,
tables giving p(R) and p(W |R) for all combinations of values of R and W) specifies the joint
distribution over all variables (in the example, p(R, W) = p(R) p(W |R)), from which we can
compute any specific distribution over groups of variables.
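A minimal sketch of this example with made-up numbers for the tables p(R) and p(W|R): the model R → W specifies p(R, W) = p(R) p(W|R), from which we can compute, e.g., the marginal p(W = 1) (sum rule) and p(R = 1 | W = 1) (Bayes' theorem).

```python
p_R = {1: 0.2, 0: 0.8}                                   # p(R = r), made-up values
p_W_given_R = {(1, 1): 0.9, (0, 1): 0.1,                 # p(W = w | R = r), made-up values
               (1, 0): 0.2, (0, 0): 0.8}

def p_joint(r, w):
    return p_R[r] * p_W_given_R[(w, r)]                  # p(R = r, W = w) = p(R) p(W|R)

p_W1 = sum(p_joint(r, 1) for r in (0, 1))                # marginal p(W = 1)
p_R1_given_W1 = p_joint(1, 1) / p_W1                     # Bayes' theorem: p(R = 1 | W = 1)
print("p(W=1) =", p_W1, " p(R=1 | W=1) =", round(p_R1_given_W1, 3))
```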
Summary For d variables X1, X2, . . . , Xd (binary, but this generalizes to discrete or continuous variables):
• The general expression for a joint distribution without assumptions (requires 2^d − 1 parameters):
p(X1, X2, . . . , Xd) = p(X1) p(X2|X1) p(X3|X1, X2) · · · p(Xd|X1, X2, . . . , Xd−1)
simplifies under the conditional independence assumptions of the graphical model to:
p(X1, X2, . . . , Xd) = Π_{i=1}^d p(Xi | parents(Xi))
• Training, i.e., estimating the parameters of the graphical model given a dataset, is typically
done by maximum likelihood.
• The computational complexity of training and inference depends on properties of the graphical
model:
– If the graph is a tree: exact inference takes polynomial time (junction-tree algorithm).
– Otherwise, exact inference is NP-complete and becomes intractable for large models. In
that case one uses approximate algorithms (variational methods, Markov chain Monte
Carlo, etc.).
14.3 Generative models
• Generative model : a graphical model that represents the process we believe created the data.
• Convenient to design specific graphical models for supervised and unsupervised problems in
machine learning (classification, regression, clustering, etc.).
• Ex: classification with inputs x = (x1 , . . . , xD ) and class labels C:
– Generative model: C → x means p(C, x) = p(C) p(x|C), or "first pick a class C by
sampling from p(C), then (given C) pick an instance x by sampling from p(x|C)”. This
allows us to sample pairs (C, x) if we know the model for p(x|C) (the class-conditional
probability), and the model parameters (p(C) and p(x|C)).
Ex: p(C) = πC ∈ (0, 1) and p(x|C) = N (µC , ΣC ), for continuous variables.
p(x|C) p(C)
– Bayes’ rule inverts the generative process to classify an instance x: p(C|x) = .
p(x)
– If, in addition, we assume that x = (x1 , . . . , xD ) are conditionally independent given C
(i.e., we ignore possible correlations between features for instances within the same class),
then we achieve a simpler model, the naive Bayes classifier : p(x|C) = p(x1 |C) · · · p(xD |C).
Undirected graphical models (Markov random fields)
• The joint distribution is not defined using parents (which imply a directed graph), but using:
– Cliques (clique = set of nodes such that there is a link between every pair of nodes in the clique).
– Potential functions ψC(XC), where XC is the set of variables in clique C.
For X = (X1, . . . , Xd):   p(X) = (1/Z) Π_{all cliques C} ψC(XC),   Z = Σ_X Π_{all cliques C} ψC(XC).
The partition function Z is a normalization constant that ensures that Σ_X p(X) = 1.
• Ex: consider an image X with d black or white pixels (Xi ∈ {0, 1}, so X = (X1, . . . , Xd) can
take 2^d different values). Connecting each pixel to its 4 neighboring pixels defines two types of
cliques: single pixels Xi, and pairs of neighboring pixels (Xi, Xj). Then:
p(X) = (1/Z) (Π_{i=1}^d ψ1(Xi)) (Π_{i∼j} ψ2(Xi, Xj)),   ψ1(Xi) = e^{−|Yi−Xi|},   ψ2(Xi, Xj) = e^{−α|Xi−Xj|}
where Y1 , . . . , Yd are the pixel values for an observed, noisy image Y and α > 0. Then, p(X)
is high if X is close to Y (single-pixel cliques), and if X is smooth (neighboring pixels tend to
have the same value), as controlled by α. We can compute a denoised image as arg maxX p(X).
In statistical mechanics, the Ising model for magnetic materials is similar to an MRF: pixel = atom with spin ∈ {↑, ↓}.
15 Discrete Markov models and hidden Markov models
• Up to now, we have regarded data points x1, . . . , xN as independently and identically distributed
(iid): each xn is an independent sample from a (possibly unknown) distribution p(x; Θ) with
parameters Θ. Hence, the log-likelihood is simply the sum of the individual log-likelihoods of the
data points: log p(x1, . . . , xN) = Σ_{n=1}^N log p(xn; Θ). As a graphical model: x1 x2 · · · xN (no links between the points).
• First-order Markov model : assumes the state at time t + 1 depends only on the state at time t
(it is conditionally independent of all other times given time t, or “the future is independent of the past given the present”):
• Homogeneous Markov model: assumes the transition probability from Si to Sj (for any pair of
states Si, Sj) is independent of the time (so going from Si to Sj has the same probability no matter when it happens):
aij ≡ p(q_{t+1} = Sj | q_t = Si),   Σ_{j=1}^N aij = 1, i = 1, . . . , N,   aij ∈ [0, 1], i, j = 1, . . . , N.
• Initial state probabilities: distribution π = (πi), with πi ≡ p(q1 = Si) and Σ_{i=1}^N πi = 1.
• The sequence of states (each a random variable) can be seen as a graphical model: q1 → q2 → · · · → qT.
• Sampling: given A and π, we can generate a sequence of states of length T as follows: sample
q1 from π; sample q2 from Aq1 • ; . . . ; sample qT from AqT −1 • .
• Training: given K observed sequences of length T , we want to estimate the parameters (A, π).
This can be done by maximum likelihood:
max_{π,A} Σ_{k=1}^K log P(sequence k)   where   P(sequence k) = π_{q_{k1}} Π_{t=2}^T a_{q_{k,t−1} q_{kt}}.
What can we do with an HMM?
• Sampling: given (A, B, π) we can generate a sequence of observations as follows: sample q1
from π and O1 from Bq1 • ; sample q2 from Aq1 • and O2 from Bq2 • ; . . . ; sample qT from AqT −1 •
and OT from BqT • .
• Evaluating the probability p(O; A, B, π) of a given observation sequence O = (O1, . . . , OT). The
joint probability of a given sequence of both observations and states (if they were observed) is
(a small numeric example follows this list):
p(O, Q; A, B, π) = p(q1) Π_{t=2}^T p(qt|q_{t−1}) Π_{t=1}^T p(Ot|qt) = π_{q1} b_{q1 O1} a_{q1 q2} b_{q2 O2} · · · a_{q_{T−1} q_T} b_{q_T O_T}.
• Training: given a training set of observation sequences, learn the HMM parameters (A, B, π)
that maximize the probability of generating those sequences. This can be solved with a form of
the EM algorithm (the Baum-Welch algorithm). E step: compute “p(Q|O; A, B, π)” for the current parameters
(A, B, π). M step: reestimate the parameters (A, B, π), as in a discrete Markov process.✐
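A minimal sketch (the HMM parameters and sequences are made up for illustration) of evaluating the joint probability p(O, Q; A, B, π) of one observation sequence and one state sequence using the factorization above.

```python
import numpy as np

pi = np.array([0.6, 0.4])                    # initial state probabilities pi_i
A = np.array([[0.7, 0.3],                    # A[i, j] = p(q_{t+1} = S_j | q_t = S_i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                    # B[i, m] = p(O_t = v_m | q_t = S_i)
              [0.2, 0.8]])

Q = [0, 0, 1, 1]                             # a state sequence (indices of S_1, S_2)
O = [0, 0, 1, 1]                             # an observation sequence (indices of v_1, v_2)

p = pi[Q[0]] * B[Q[0], O[0]]                 # pi_{q1} * b_{q1 O1}
for t in range(1, len(Q)):
    p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]   # a_{q_{t-1} q_t} * b_{q_t O_t}
print("p(O, Q):", p)
```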
In an HMM, the state variables follow a first-order Markov process: p(qt|q1, . . . , q_{t−1}) = p(qt|q_{t−1}).
But the observed variables do not: p(Ot|O_{t−1}, . . . , O1) ≠ p(Ot|O_{t−1}). In fact, Ot depends (implicitly)
on all previous observations. An HMM achieves this long-range dependency with fewer parameters
than a Markov process of order k. The latter needs N^{k+1} transition probabilities, and a huge training set where
every possible sequence of k + 1 symbols (bigrams, trigrams, . . . , (k+1)-grams) should be observed.
17 Combining multiple learners
• Consider L iid random variables y1, . . . , yL with expected value E{yl} = µ and variance
var{yl} = σ². Then, the average ȳ = (1/L) Σ_{l=1}^L yl has the following moments:
E{ȳ} = E{(1/L) Σ_{l=1}^L yl} = (1/L) Σ_{l=1}^L E{yl} = µ
var{ȳ} = var{(1/L) Σ_{l=1}^L yl} = (1/L²) var{Σ_{l=1}^L yl} = (1/L²) Σ_{l=1}^L var{yl} = σ²/L.
So the expected value (hence the bias) doesn't change, but the variance (hence the mean squared
error) decreases as L increases. If y1, . . . , yL are identically but not independently distributed:
var{ȳ} = (1/L²) var{Σ_{l=1}^L yl} = (1/L²) (Σ_{l=1}^L var{yl} + 2 Σ_{i<j} cov{yi, yj}) = σ²/L + (2/L²) Σ_{i<j} σij.
So, if the learners are positively correlated (σij > 0), the variance (and error) is larger than
if they are independent. Hence, a well-designed ensemble combination will aim at reducing, if
not completely eliminating, positive correlation between learners.
Further decrease in variance is possible with negatively correlated learners. However, it is
impossible to have many learners that are both accurate and negatively correlated.
• Ensemble averaging and noise: if we view each learner as a random noise function added to the
true discriminant/regression function, and if these noise functions are independent with zero
mean, then the ensemble average is smoother than each ensemble member. fig. 4.5
Voting or averaging over models with low bias and high variance produces an ensemble with the
same low bias but lower variance.
17.5 Bagging
• Bootstrap: given a training set X = {x1 , . . . , xN } containing N samples, the bootstrap generates
L subsets X1 , . . . , XL of X , each of size N, as follows: subset Xl is obtained by sampling at
random with replacement N points from X .
– This means that some points xn will appear repeated multiple times in Xl and some other
points xn will not appear in Xl .
– The L subsets are partly different from each other. On average, each subset contains ≈ 63%
of the training set. Proof: the probability that we don't pick instance xn in N draws is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.37
(see the small simulation at the end of this section).
– The ensemble output is defined as the vote or average (or median) of the learners’ outputs.
• Random forest: a variation of bagging using decorrelated decision trees as base learners.
– As in bagging, each tree is trained on a bootstrap sample of the training set, and the
ensemble output is defined as the vote or average (or median) of the learners’ outputs.
– In addition, the lth tree is constructed with a randomized algorithm, independently of the
other trees. Typically, we do random feature selection at each node of the tree: we select
the feature to split not among all possible features, but among a random subset of them.
It is recommended to grow a full tree (no pruning), because it has lower bias and higher variance; and to use as size of the
random subset 1 + ⌊log2 D⌋ (where D = dim x is the dimensionality of the instances).
• Random forests are among the best classifiers in practice, and simpler to train than boosting.
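A minimal sketch (illustrative sizes) checking the ≈ 63% claim above: each bootstrap subset (N draws with replacement from N points) contains, on average, about 1 − e⁻¹ ≈ 63% of the distinct training points.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 1000, 200
fractions = []
for l in range(L):
    idx = rng.integers(0, N, size=N)                 # sample N points with replacement
    fractions.append(len(np.unique(idx)) / N)        # fraction of distinct points included

print("average fraction of distinct points:", np.mean(fractions))   # close to 0.632
```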
17.6 Boosting
• Bagging generates complementary learners through random sampling and unstable learning
algorithms. Boosting actively generates complementary learners by training the next learner on
the mistakes of the previous learners.
• Weak learner: a learner that has probability of error < 1/2 (i.e., better than random guessing on
binary classification). Ex: decision trees, decision stumps (tree grown to only 1 or 2 levels).
Strong learner : a learner that can have arbitrarily small probability of error. Ex: neural net.
• There are many versions of boosting, we focus on AdaBoost.M1, for classification (it can be
applied to regression with some modifications):
– It combines L weak learners. They have high bias, but the decrease in variance in the
ensemble compensates for that.
– Each learner l = 1, . . . , L is trained on the entire training set X = {x1, . . . , xN}, but each
point xn has an associated probability pn^{(l)} indicating how "important" it is for that learner.
The training algorithm must be able to use these probabilities p^{(l)} together with the training set X. Otherwise, this can be
simulated by sampling a training set of size N from X according to p^{(l)}.
• After training, the ensemble output for a given instance is a weighted vote, where the weight
of learner l is wl ∝ log(1/βl) (with βl = εl/(1 − εl) and εl the weighted error of learner l), so the weaker learners have a lower weight.
Other variations of boosting make the overall decision by applying learners in sequence, as in cascading.
• Boosting usually achieves very good classification performance, although it does depend on the
dataset and the type of learner used (it should be weak but not too weak). It is sensitive to
noise and outliers.
A very successful application in computer vision: the Viola-Jones face detector. This is a cascade of AdaBoost ensembles of decision
stumps, each trained on a large set of Haar features. It is robust and (with clever image-based operations) very fast for test images.
Cascading is similar to boosting, but the learners are trained sequentially. For example, in boosting
the next learner could be trained on the residual error of the previous learner. Another way:
• When applied to an instance x, each learner gives an output (e.g. class label) and a confidence
(which can be defined as the largest posterior probability p(Ck |x)).
• Learner l is trained on instances for which the previous learner is not confident enough.
• When applying the ensemble to a test instance, we apply the learners in sequence until one is
confident enough. We use learner l only if the previous learners 1, . . . , l − 1 are not confident
enough on their outputs.
• The goal is to order the learners in increasing complexity, so the early learners are simple and
classify the easy instances quickly.
Error-correcting output codes (ECOC)
• We train L binary learners and assign each class k a codeword, the kth row of a K × L code
matrix W with entries wkl ∈ {−1, +1}; learner l is trained to output +1 for the classes with
wkl = +1 and −1 for the rest. At test time, we compute the vote Σ_{l=1}^L wkl gl(x) for each
class k (where the weights wkl are the elements of W) and then pick the class with most votes.
• The point of ECOC is to introduce robustness against errors of the learners by making their
codewords be farther from each other in Hamming distance (number of mismatching bits). This
requires sufficiently many learners, and introduces redundancy.
• Given a value of L (the number of classifiers, with L > K) selected by the user, we generate
W so its rows are as different as possible in Hamming distance (redundant codes so protection
against errors), and its columns are as different as possible (so the learners are diverse).
• An ECOC can also be seen as an ensemble of classifiers where, to obtain each classifier, we
manipulate the output targets (rather than the input features or the training set, which in
ECOC are equal to the original training set for each classifier).
Stacking (stacked generalization)
1. We train L learners using the training set (in whatever way that introduces diversity).
2. We apply these learners to a validation set. For each validation point xn , we obtain the
outputs of the L learners (y1 , . . . , yL ). For classification, the output of learner l could be the class label or
the posterior probabilities (softmax). For regression, the output of learner l is its real-valued output vector.
3. We train a model f to predict the ground-truth output for a given instance xn from the
learners’ outputs for that instance.
• If we choose f to be a linear function, this produces a weighted average or vote, where the
weights are learned on the validation set. But we can choose f to be nonlinear.
• By training f on the validation set (rather than the training set where the learners were trained),
it learns to correct for mistakes that the learners make.
• Whether it works better than a fixed, simple combination rule (such as majority vote or average)
depends on the problem.
• Typically, the resulting ensemble contains learners that are somewhat correlated with each
other, or that are useless.
• It often helps to postprocess the ensemble by removing some learners (ensemble pruning), so
we obtain a smaller ensemble (hence faster at test time) having about the same error.
• This is similar to feature selection and we can indeed apply feature selection algorithms. The
most usual is forward selection: starting with an empty ensemble, we sequentially add one
learner at a time, the one that gives highest accuracy when added to the previous ones.
18 Reinforcement learning
• How an autonomous agent that senses and acts in its environment can learn to choose optimal
actions to achieve a goal. Ex.: board games (e.g. chess, backgammon), robot navigation (e.g.
looking for the exit of a maze).
– Each time the agent performs an action, it may receive a reward (or penalty) to indicate
how good the resulting state of the environment is.
– Often the reward is delayed.
At the end of the game (positive reward if we win, negative if we lose); or when (if) the robot reaches the exit.
– The task of the agent is to learn from this reward to choose sequences of actions that
produce the greatest cumulative reward.
• As in other machine learning settings, the task is to learn a function, here a control policy
π: S → A that maps states s ∈ S to actions a ∈ A, but with the following differences:
– Delayed reward : we are not given pairs (state,action) to train the function, as in supervised
learning. Instead, we (may) have a reward when the agent executes an action. There is
no such thing as the best move at any given time, what matters is the sequence of moves.
Supervised learning is “learning with a teacher”. Reinforcement learning is “learning with a critic”. The critic doesn’t tell
us what to do but only how well we have done in the past. Its feedback is scarce and when it comes, it comes late.
– Credit assignment problem: after taking several actions and getting the reward, which
actions helped? So that we can record and recall them later on.
A reinforcement learning program learns to generate an internal value for the intermediate
states and actions in terms of how good they are in leading us to the goal. Having learned
this internal reward mechanism, the agent can just take local actions to maximize it.
– Exploration vs exploitation: the agent influences the distribution of training examples by
the actions it chooses. It should trade off exploration of unknown states and actions (to
gain information) vs exploitation of states and actions that it has already learned will
yield high reward (to maximize its cumulative reward).
– Partially observed states: practically, sensors provide only partial information about the
environment state (e.g. a camera doesn’t provide the robot location).
• Basic elements of a reinforcement learning problem:
– Agent: the decision maker (e.g. game player, robot). It has
sensors to observe the environment (e.g. robot camera).
• The task of the agent: perform sequences of actions, observe their consequences and learn a
control policy that, from any initial state, chooses actions that maximize the reward accumulated
over time.
Ex.: grid world with GOAL state, squares = states, arrows = actions (γ = 0.9). (Figure: the immediate rewards r(s, a), which are 100 for actions entering the GOAL state and 0 otherwise; the corresponding Q(s, a) values; and the V*(s) values of each state, e.g. 100, 90, 81, 72.)
Q-learning
• Learning π ∗ : S → A directly is difficult because we don’t have examples (s, a). Instead, we
learn a numerical evaluation function Q(s, a) of states and actions, and derive π from it.
• Define the Q function: Q(s, a) = r(s, a) + γV ∗ (δ(s, a)).
• We want to find π*(s) = arg max_a Q(s, a). This can be done with the following iterative
algorithm. Noting that V*(s) = max_{a′} Q(s, a′), we can write the following recursive expression
for Q: Q(s, a) = r(s, a) + γ max_{a′} Q(δ(s, a), a′). The algorithm starts with a table Q̂(s, a) = 0
for every (s, a), and repeatedly does the following: observe the current state s; choose and execute
an action a; receive the immediate reward r = r(s, a) and observe the new state s′ = δ(s, a); and
update the table entry Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′).
• Each update of Q̂(s, a) affects the old state s (not the new one s′ ).
• Assuming r(s, a) is bounded and every possible (state,action) pair is visited infinitely often,
one can prove: 1) the Q̂ values never decrease, and are between 0 and their optimal values Q;
2) Q-learning converges to the correct Q function.
• Note the agent need not know the functions r and δ ahead of time (which is often not practical,
e.g. for a mobile robot). It just needs to know their particular values as it takes actions, moves
to new states and receives the corresponding rewards (i.e., it samples those functions).
In choosing new actions to apply, the agent has an exploration-exploitation tradeoff (repeat actions that seem currently good vs
trying new ones). Commonly, one moves from pure exploration at the beginning towards exploitation as training proceeds.
If we have perfect knowledge of the functions r and δ (i.e., we can evaluate them for any (s, a)),
we can find the optimal policy more efficiently with a dynamic programming algorithm to solve
Bellman’s equation: V ∗ (s) = E {r(s, π(s)) + γ V ∗ (δ(s, π(s)))} ∀s ∈ S. This is convenient in
some applications, e.g. factory automation and job scheduling problems.
Training consists of a series of episodes, each beginning at a random state and executing actions
until reaching the goal state. The first update occurs when reaching the goal state and receiving a
nonzero reward; in subsequent episodes, updates propagate backward, eventually filling the entire
Q̂(·, ·) table.
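A minimal sketch (a made-up 1D corridor world with deterministic transitions, γ = 0.9 and purely random exploration) of tabular Q-learning run over episodes that start at a random state and end at the goal state, using the update Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′) described above.

```python
import numpy as np

n_states, goal, gamma = 6, 5, 0.9
actions = [-1, +1]                                   # move left / move right
Q = np.zeros((n_states, len(actions)))
rng = np.random.default_rng(0)

for episode in range(200):
    s = rng.integers(0, goal)                        # random non-goal start state
    while s != goal:
        a = rng.integers(len(actions))               # pure exploration: random action
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 100.0 if s_next == goal else 0.0         # reward only on reaching the goal
        Q[s, a] = r + gamma * Q[s_next].max()        # deterministic-world Q-learning update
        s = s_next

print(np.round(Q, 1))                                # values propagate back from the goal
```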
In the nondeterministic case (the reward r(s, a) and the next state are random, with transition probabilities P(s′|s, a)):
– Optimal policy (as before): π* = arg max_π V^π(s) ∀s ∈ S.
– Q function: Q(s, a) = E{r(s, a) + γ V*(δ(s, a))} = E{r(s, a)} + γ E{V*(δ(s, a))} =
E{r(s, a)} + γ Σ_{s′} P(s′|s, a) V*(s′).
– Recursive expression for Q: Q(s, a) = E{r(s, a)} + γ Σ_{s′} P(s′|s, a) max_{a′} Q(s′, a′).
– Update for Q̂ at iteration k: Q̂k(s, a) ← (1 − αk) Q̂_{k−1}(s, a) + αk (r + γ max_{a′} Q̂_{k−1}(s′, a′)),
where the step size αk decreases towards zero as k → ∞ for convergence to occur.
This is similar to SGD compared to GD. Convergence is slow and may require many thousands of iterations.
– In practice, there may be a large or infinite number of states and actions (e.g. turning the
steering wheel by a continuous angle).
– Even if we could store the table, the previous Q-learning algorithm performs a kind of rote
learning: it simply stores seen (state,action) pairs and doesn’t generalize to unseen ones.
Indeed, the convergence proof assumes every possible (state,action) pair is visited (infinitely often).
• Instead, we can represent Q(s, a) with a function approximator, such as a neural net Q(s, a; W)
with inputs (s, a), weights W, and a single output corresponding to the value of Q(s, a). Each
Q̂(s, a) update in the Q-learning algorithm provides one training example, which is used to
update the weights via SGD.
A classic example: the 1995 TD-gammon program for playing backgammon (which has ≈ 1020 states and randomness because of
the dice roll) used a neural net trained on 1.5 million self-generated games (playing against itself) and achieved master-level play.
• This can be modeled using a Partially Observable Markov Decision Process (POMDP), which
uses P (o|s, a) to model an observation o given the current state s (unobserved) and action a,
rather than P (s′ |s, a) (e.g. o = camera image, s = 3D location of the robot).
This is similar to the difference between a discrete (observable) Markov process and a hidden Markov model (HMM).