Contents
Preface
1 Introduction
  1.1 A Taste of Machine Learning
    1.1.1 Applications
    1.1.2 Data
    1.1.3 Problems
  1.2 Probability Theory
    1.2.1 Random Variables
    1.2.2 Distributions
    1.2.3 Mean and Variance
    1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule
  1.3 Basic Algorithms
    1.3.1 Naive Bayes
    1.3.2 Nearest Neighbor Estimators
    1.3.3 A Simple Classifier
    1.3.4 Perceptron
    1.3.5 K-Means
2 Density Estimation
  2.1 Limit Theorems
    2.1.1 Fundamental Laws
    2.1.2 The Characteristic Function
    2.1.3 Tail Bounds
    2.1.4 An Example
  2.2 Parzen Windows
    2.2.1 Discrete Density Estimation
    2.2.2 Smoothing Kernel
    2.2.3 Parameter Estimation
    2.2.4 Silverman's Rule
    2.2.5 Watson-Nadaraya Estimator
  2.3 Exponential Families
    2.3.1 Basics
    2.3.2 Examples
  2.4 Estimation
    2.4.1 Maximum Likelihood Estimation
    2.4.2 Bias, Variance and Consistency
    2.4.3 A Bayesian Approach
    2.4.4 An Example
  2.5 Sampling
    2.5.1 Inverse Transformation
    2.5.2 Rejection Sampler
3 Optimization
  3.1 Preliminaries
    3.1.1 Convex Sets
    3.1.2 Convex Functions
    3.1.3 Subgradients
    3.1.4 Strongly Convex Functions
    3.1.5 Convex Functions with Lipschitz Continuous Gradient
    3.1.6 Fenchel Duality
    3.1.7 Bregman Divergence
  3.2 Unconstrained Smooth Convex Minimization
    3.2.1 Minimizing a One-Dimensional Convex Function
    3.2.2 Coordinate Descent
    3.2.3 Gradient Descent
    3.2.4 Mirror Descent
    3.2.5 Conjugate Gradient
    3.2.6 Higher Order Methods
    3.2.7 Bundle Methods
  3.3 Constrained Optimization
    3.3.1 Projection Based Methods
    3.3.2 Lagrange Duality
    3.3.3 Linear and Quadratic Programs
  3.4 Stochastic Optimization
    3.4.1 Stochastic Gradient Descent
  3.5 Nonconvex Optimization
    3.5.1 Concave-Convex Procedure
  3.6 Some Practical Advice
4 Online Learning and Boosting
  4.1 Halving Algorithm
  4.2 Weighted Majority
1 Introduction
Over the past two decades Machine Learning has become one of the main-
stays of information technology and with that, a rather central, albeit usually
hidden, part of our life. With the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
The purpose of this chapter is to provide the reader with an overview of the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. After that,
we will discuss some basic tools from statistics and probability theory, since
they form the language in which many machine learning problems must be
phrased to become amenable to solving. Finally, we will outline a set of fairly
basic yet effective algorithms to solve an important problem, namely that of
classification. More sophisticated tools, a discussion of more general
problems and a detailed analysis will follow in later parts of the book.
1.1 A Taste of Machine Learning
1.1.1 Applications
Most readers will be familiar with the concept of web page ranking. That is,
the process of submitting a query to a search engine, which then finds webpages relevant to the query and returns them in their order of relevance. See e.g. Figure 1.1 for an example of the query results for “machine learning”.
That is, the search engine returns a sorted list of webpages given a query. To
achieve this goal, a search engine needs to ‘know’ which
Fig. 1.1. The 5 top scoring webpages for the query “machine learning”
pages are relevant and which pages match the query. Such knowledge can be
gained from several sources: the link structure of webpages, their content,
the frequency with which users will follow the suggested links in a query, or
from examples of queries in combination with manually ranked webpages.
Increasingly machine learning rather than guesswork and clever engineering
is used to automate the process of designing a good search engine [RPB06].
A rather related application is collaborative filtering. Internet book-
stores such as Amazon, or video rental sites such as Netflix use this informa-
tion extensively to entice users to purchase additional goods (or rent more
movies). The problem is quite similar to the one of web page ranking. As
before, we want to obtain a sorted list (in this case of articles). The key dif-
ference is that an explicit query is missing and instead we can only use past
purchase and viewing decisions of the user to predict future viewing and
purchase habits. The key side information here consists of the decisions made by similar users, hence the collaborative nature of the process. See Figure 1.2
for an example. It is clearly desirable to have an automatic system to solve
this problem, thereby avoiding guesswork and time [BK07].
An equally ill-defined problem is that of automatic translation of doc-
uments. At one extreme, we could aim at fully understanding a text before
translating it using a curated set of rules crafted by a computational linguist
well versed in the two languages we would like to translate. This is a rather
arduous task, in particular given that text is not always grammatically cor-
rect, nor is the document understanding part itself a trivial one. Instead, we
could simply use examples of translated documents, such as the proceedings
of the Canadian parliament or other multilingual entities (United Nations, European Union, Switzerland) to learn how to translate between the two languages.
Fig. 1.2. Books recommended by Amazon.com when viewing Tom Mitchell’s Machine Learning Book [Mit97]. It is desirable for the vendor to recommend relevant books which a user might purchase.
Fig. 1.3. 11 Pictures of the same person taken from the Yale face recognition
database. The challenge is to recognize that we are dealing with the same per-
son in all 11 cases.
HAVANA (Reuters) - The European Union’s top development aid official
left Cuba on Sunday convinced that EU diplomatic sanctions against
the communist island should be dropped after Fidel Castro’s
retirement, his main aide said.
<TYPE="ORGANIZATION">HAVANA</> (<TYPE="ORGANIZATION">Reuters</>) - The
<TYPE="ORGANIZATION">European Union</>’s top development aid official left
<TYPE="ORGANIZATION">Cuba</> on Sunday convinced that EU diplomatic sanctions
against the communist <TYPE="LOCATION">island</> should be dropped after
<TYPE="PERSON">Fidel Castro</>’s retirement, his main aide said.
Fig. 1.4. Named entity tagging of a news article (using LingPipe). The relevant
locations, organizations and persons are tagged for further information extraction.
For instance, while “bush” and “rice” are clearly terms from agriculture, it is equally clear that in the context of contemporary politics they refer to members of the Republican Party.
Other applications which take advantage of learning are speech recog-
nition (annotate an audio sequence with text, such as the system shipping
with Microsoft Vista), the recognition of handwriting (annotate a sequence
of strokes with text, a feature common to many PDAs), trackpads of com-
puters (e.g. Synaptics, a major manufacturer of such pads derives its name
from the synapses of a neural network), the detection of failure in jet en-
gines, avatar behavior in computer games (e.g. Black and White), direct
marketing (companies use past purchase behavior to guesstimate whether
you might be willing to purchase even more) and floor cleaning robots (such
as iRobot’s Roomba). The overarching theme of learning problems is that
there exists a nontrivial dependence between some observations, which we
will commonly refer to as x and a desired response, which we refer to as y,
for which a simple set of deterministic rules is not known. By using learning
we can infer such a dependency between x and y in a systematic fashion.
We conclude this section by discussing the problem of classification, since it will serve as a prototypical problem for a significant part of this book. It occurs frequently in practice: for instance, when performing spam filtering, we are interested in a yes/no answer as to whether an e-mail contains relevant information or not. Note that this issue is quite user dependent: for a frequent traveller e-mails from an airline informing him about recent discounts might prove valuable information, whereas for many other recipients this might prove more of a nuisance (e.g. when the e-mail relates to products available only overseas). Moreover, the nature of annoying e-mails might change over time, e.g. through the availability of new products (Viagra, Cialis, Levitra, . . . ), different opportunities for fraud (the Nigerian 419 scam which took a new twist after the Iraq war), or different data types (e.g. spam which consists mainly of images). To combat these problems we want to build a system which is able to learn how to classify new e-mails.
Fig. 1.5. Binary classification; separate stars from diamonds. In this example we are able to do so by drawing a straight line which separates both sets. We will see later that this is an important example of what is called a linear classifier.
A seemingly unrelated problem, that of cancer diagnosis shares a common
structure: given histological data (e.g. from a microarray analysis of a pa-
tient’s tissue) infer whether a patient is healthy or not. Again, we are asked
to generate a yes/no answer given a set of observations. See Figure 1.5 for
an example.
1.1.2 Data
It is useful to characterize learning problems according to the type of data
they use. This is a great help when encountering new challenges, since quite
often problems on similar data types can be solved with very similar tech-
niques. For instance natural language processing and bioinformatics use very
similar tools for strings of natural language text and for DNA sequences.
Vectors constitute the most basic entity we might encounter in our work. For
instance, a life insurance company might be interested in obtaining the
vector of variables (blood pressure, heart rate, height, weight, cholesterol
level, smoker, gender) to infer the life expectancy of a potential customer.
A farmer might be interested in determining the ripeness of fruit based on
(size, weight, spectral data). An engineer might want to find dependencies
in (voltage, current) pairs. Likewise one might want to represent documents
by a vector of counts which describe the occurrence of words. The latter is
commonly referred to as bag of words features.
One of the challenges in dealing with vectors is that the scales and units of different coordinates may vary widely. For instance, we could measure the weight in kilograms, pounds, grams, tons, or stones, all of which would amount to multiplicative changes. Likewise, when representing temperatures, we have a full class of affine transformations, depending on whether we represent them in terms of Celsius, Kelvin or Fahrenheit. One way of dealing with those issues in an automatic fashion is to normalize the data.
1.1.3 Problems
The range of learning problems is clearly large, as we saw when discussing
applications. That said, researchers have identified an ever growing number
of templates which can be used to address a large set of situations. It is those
templates which make deployment of machine learning in practice easy and
our discussion will largely focus on a choice set of such problems. We now
give a by no means complete list of templates.
Binary Classification is probably the most frequently studied problem
in machine learning and it has led to a large number of important algorithmic
and theoretic developments over the past century. In its simplest form it
reduces to the question: given a pattern x drawn from a domain X, estimate
which value an associated binary random variable y ∈ {±1} will assume.
For instance, given pictures of apples and oranges, we might want to state
whether the object in question is an apple or an orange. Equally well, we
might want to predict whether a home owner might default on his loan,
given income data, his credit history, or whether a given e-mail is spam or
ham. The ability to solve this basic problem already allows us to address a
large variety of practical settings.
Many variants exist with regard to the protocol in which we are required to make our estimation:
Fig. 1.6. Left: binary classification. Right: 3-class classification. Note that in the latter case there is much more room for ambiguity. For instance, being able to distinguish
stars from diamonds may not suffice to identify either of them correctly, since we
also need to distinguish both of them from triangles.
• We might see a sequence of (xi, yi) pairs for which yi needs to be estimated
in an instantaneous online fashion. This is commonly referred to as online
learning.
• We might observe a collection X := {x1, . . . , xm} and Y := {y1, . . . , ym} of pairs (xi, yi) which are then used to estimate y for a (set of) so-far unseen X′ := {x′1, . . . , x′m′}. This is commonly referred to as batch learning.
• We might be allowed to know X′ already at the time of constructing the
model. This is commonly referred to as transduction.
• We might be allowed to choose X for the purpose of model building. This
is known as active learning.
• We might not have full information about X, e.g. some of the coordinates
of the xi might be missing, leading to the problem of estimation with
missing variables.
• The sets X and X′ might come from different data sources, leading to the
problem of covariate shift correction.
• We might be given observations stemming from two problems at the same
time with the side information that both problems are somehow related.
This is known as co-training.
• Mistakes of estimation might be penalized differently depending on the
type of error, e.g. when trying to distinguish diamonds from rocks a very
asymmetric loss applies.
Multiclass Classification is the logical extension of binary classification. The main difference is that now y ∈ {1, . . . , n} may assume a range of
different values. For instance, we might want to classify a document ac-
cording to the language it was written in (English, French, German, Spanish,
Hindi, Japanese, Chinese, . . . ). See Figure 1.6 for an example. The main dif-
ference to before is that the cost of error may heavily depend on the type of
error we make. For instance, in the problem of assessing the risk of cancer, it
makes a significant difference whether we mis-classify an early stage of can-
cer as healthy (in which case the patient is likely to die) or as an advanced
stage of cancer (in which case the patient is likely to be inconvenienced from
overly aggressive treatment).
Structured Estimation goes beyond simple multiclass estimation by
assuming that the labels y have some additional structure which can be used
in the estimation process. For instance, y might be a path in an ontology,
when attempting to classify webpages, y might be a permutation, when
attempting to match objects, to perform collaborative filtering, or to rank
documents in a retrieval setting. Equally well, y might be an annotation of
a text, when performing named entity recognition. Each of those problems
has its own properties in terms of the set of y which we might consider
admissible, or how to search this space. We will discuss a number of those
problems in Chapter ??.
Regression is another prototypical application. Here the goal is to esti-
mate a real-valued variable y ∈ R given a pattern x (see e.g. Figure 1.7). For
instance, we might want to estimate the value of a stock the next day, the
yield of a semiconductor fab given the current process, the iron content of
ore given mass spectroscopy measurements, or the heart rate of an athlete,
given accelerometer data. One of the key issues in which regression problems
differ from each other is the choice of a loss. For instance, when estimating
stock values our loss for a put option will be decidedly one-sided. On the
other hand, a hobby athlete might only care that our estimate of the heart rate matches the actual one on average.
Novelty Detection is a rather ill-defined problem. It describes the issue
of determining “unusual” observations given a set of past measurements.
Clearly, the choice of what is to be considered unusual is very subjective.
A commonly accepted notion is that unusual events occur rarely. Hence a
possible goal is to design a system which assigns to each observation a rating
Fig. 1.8. Left: typical digits contained in the database of the US Postal Service.
Right: unusual digits found by a novelty detection algorithm [SPST+ 01] (for a
description of the algorithm see Section 7.4). The score below the digits indicates the
degree of novelty. The numbers on the lower right indicate the class associated with
the digit.
as to how novel it is. Readers familiar with density estimation might contend
that the latter would be a reasonable solution. However, we neither need a
score which sums up to 1 on the entire domain, nor do we care particularly
much about novelty scores for typical observations. We will later see how this
somewhat easier goal can be achieved directly. Figure 1.8 has an example of
novelty detection when applied to an optical character recognition database.
Fig. 1.9. The random variable ξ maps from the set of outcomes of an experiment
(denoted here by X) to real numbers. As an illustration here X consists of the
patients a physician might encounter, and they are mapped via ξ to their weight
and height.
1.2.2 Distributions
Perhaps the most important way to characterize a random variable is to
associate probabilities with the values it can take. If the random variable is
discrete, i.e., it takes on a finite number of values, then this assignment of
probabilities is called a probability mass function or PMF for short. A PMF
must be, by definition, non-negative and must sum to one. For instance,
if the coin is fair, i.e., heads and tails are equally likely, then the random
variable X described above takes on values of +1 and −1 with probability
0.5. This can be written as
Pr(X = +1) = 0.5 and Pr(X = −1) = 0.5. (1.1)
When there is no danger of confusion we will use the slightly informal no-
tation p(x) := Pr(X = x).
In case of a continuous random variable the assignment of probabilities
results in a probability density function or PDF for short. With some abuse of
terminology, but keeping in line with convention, we will often use density or
distribution instead of probability density function. As in the case of the PMF,
a PDF must also be non-negative and integrate to one. Figure 1.10 shows two
distributions: the uniform distribution
p(x) = 1/(b − a) if x ∈ [a, b] and 0 otherwise, (1.2)
Fig. 1.10. Two common densities. Left: uniform distribution over the interval
[−1, 1]. Right: Normal distribution with zero mean and unit variance.
The CDF F(x′) allows us to perform range queries on p efficiently. For instance, by integral calculus we obtain
Pr(a ≤ X ≤ b) = ∫ₐᵇ p(x) dx = F(b) − F(a).
The values of x′ for which F(x′) assumes a specific value, such as 0.1 or 0.5, have a special name. They are called the quantiles of the distribution p.
Definition 1.2 (Quantiles) Let q ∈ (0, 1). Then the value of x′ for which Pr(X < x′) ≤ q and Pr(X > x′) ≤ 1 − q is the q-quantile of the distribution p. Moreover, the value x′ associated with q = 0.5 is called the median.
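As an illustration of Definition 1.2, the following short Python sketch (not part of the original text; the sample and all names are purely illustrative) estimates quantiles and the median of a sample:

import numpy as np

# Draw an illustrative sample from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Empirical q-quantiles: the value below which roughly a fraction q of the sample lies.
for q in (0.1, 0.5, 0.9):
    print(f"{q}-quantile: {np.quantile(sample, q):+.3f}")

# The 0.5-quantile is the median.
print("median:", np.median(sample))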
Fig. 1.11. Quantiles of a distribution correspond to the area under the integral of the density p(x) for which the integral takes on a pre-specified value. Illustrated are the 0.1, 0.5 and 0.9 quantiles respectively.
For instance, in the case of a dice we have equal probabilities of 1/6 for all
6 possible outcomes. It is easy to see that this translates into a mean of (1
+ 2 + 3 + 4 + 5 + 6)/6 = 3.5.
The mean of a random variable is useful in assessing expected losses and
benefits. For instance, as a stock broker we might be interested in the ex-
pected value of our investment in a year’s time. In addition to that, however,
we also might want to investigate the risk of our investment. That is, how
likely it is that the value of the investment might deviate from its expecta-
tion since this might be more relevant for our decisions. This means that we
need a variable to quantify the risk inherent in a random variable. One such
measure is the variance of a random variable.
The variance measures by how much on average f (X) deviates from its ex-
pected value. As we shall see in Section 2.1, an upper bound on the variance
can be used to give guarantees on the probability that f (X) will be within
ϵ of its expected value. This is one of the reasons why the variance is often
associated with the risk of a random variable. Note that often one discusses
properties of a random variable in terms of its standard deviation, which is
defined as the square root of the variance.
We say that X and Y are independent, i.e., the values that X takes do not depend on the values that Y takes, whenever p(x, y) = p(x) p(y).
Independence is useful when it comes to dealing with large numbers of random variables whose behavior we want to estimate jointly. For instance, whenever
we perform repeated measurements of a quantity, such as when
Fig. 1.12. Left: a sample from two dependent random variables. Knowing the first coordinate allows us to improve our guess about the second coordinate. Right:
a sample drawn from two independent random variables, obtained by randomly
permuting the dependent sample.
measuring the voltage of a device, we will typically assume that the individ-
ual measurements are drawn from the same distribution and that they are
independent of each other. That is, having measured the voltage a number
of times will not affect the value of the next measurement. We will call such random variables independently and identically distributed, or in short,
iid random variables. See Figure 1.12 for an example of a pair of random
variables drawn from dependent and independent distributions respectively.
Conversely, dependence can be vital in classification and regression prob-
lems. For instance, the traffic lights at an intersection are dependent on each
other. This allows a driver to perform the inference that when the lights are
green in his direction there will be no traffic crossing his path, i.e. the other
lights will indeed be red. Likewise, whenever we are given a picture x of a
digit, we hope that there will be dependence between x and its label y.
Especially in the case of dependent random variables, we are interested in conditional probabilities, i.e., the probability that X takes on a particular value given the value of Y. Clearly Pr(X = rain|Y = cloudy) is higher than Pr(X = rain|Y = sunny). In other words, knowledge about the value of Y significantly influences the distribution of X. This is captured via conditional probabilities:
p(x|y) := p(x, y) / p(y). (1.14)
1.2.4.1 An Example
We illustrate our reasoning by means of a simple example — inference using
an AIDS test. Assume that a patient would like to have such a test carried
out on him. The physician recommends a test which is guaranteed to detect
HIV-positive whenever a patient is infected. On the other hand, for healthy
patients it has a 1% error rate. That is, with probability 0.01 it diagnoses
a patient as HIV-positive even when he is, in fact, HIV-negative. Moreover,
assume that 0.15% of the population is infected.
Now assume that the patient has the test carried out and the test re-
turns ’HIV-negative’. In this case, logic implies that he is healthy, since the
test has 100% detection rate. In the converse case things are not quite as
straightforward. Denote by X and T the random variables associated with
the health status of the patient and the outcome of the test respectively. We
are interested in p(X = HIV+|T = HIV+). By Bayes rule we may write
p(X = HIV+|T = HIV+) = [p(T = HIV+|X = HIV+) p(X = HIV+)] / p(T = HIV+)
While we know all terms in the numerator, p(T = HIV+) itself is unknown.
That said, it can be computed via
p(T = HIV+) = Σ_{x∈{HIV+,HIV−}} p(T = HIV+, x) = Σ_{x∈{HIV+,HIV−}} p(T = HIV+|x) p(x).
Fig. 1.13. A graphical description of our HIV testing scenario. Knowing the age of
the patient influences our prior on whether the patient is HIV positive (the random
variable X). The outcomes of the tests 1 and 2 are independent of each other given
the status X. We observe the shaded random variables (age, test 1, test 2) and
would like to infer the un-shaded random variable X. This is a special case of a
graphical model which we will discuss in Chapter ??.
Let us now think how we could improve the diagnosis. One way is to ob-
tain further information about the patient and to use this in the diagnosis.
For instance, information about his age is quite useful. Suppose the patient is
35 years old. In this case we would want to compute p(X = HIV+|T = HIV+,
A = 35) where the random variable A denotes the age. The corre- sponding
expression yields:
p(X = HIV+|T = HIV+, A) = [p(T = HIV+|X = HIV+, A) p(X = HIV+|A)] / p(T = HIV+|A)
Here we simply conditioned all random variables on A in order to take addi-
tional information into account. We may assume that the test is independent
of the age of the patient, i.e.
p(t|x, a) = p(t|x).
What remains therefore is p(X = HIV+|A). Recent US census data pegs this number at approximately 0.9%. Plugging all data back into the conditional expression yields 1 · 0.009 / (1 · 0.009 + 0.01 · 0.991) ≈ 0.48. What has happened here is that the more specific prior p(X = HIV+|A) changed the posterior probability of infection considerably.
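The computations above are easy to check numerically. The following Python sketch (illustrative only; the detection rate, false positive rate and the two priors are the numbers quoted in the text) evaluates the posterior via Bayes rule:

def posterior_infected(p_pos_given_inf, p_pos_given_healthy, prior_inf):
    """Posterior p(X = HIV+ | T = HIV+) via Bayes rule."""
    p_pos = p_pos_given_inf * prior_inf + p_pos_given_healthy * (1.0 - prior_inf)
    return p_pos_given_inf * prior_inf / p_pos

# Population-wide prior of 0.15%:
print(posterior_infected(1.0, 0.01, 0.0015))   # roughly 0.13

# Age-specific prior of 0.9% as used in the text:
print(posterior_infected(1.0, 0.01, 0.009))    # roughly 0.48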
Asset Capital Group, Inc., (ACGU) announced that it is expanding the marketing of bio-remediation fluids and cleaning equipment. After
its recent acquisition of interest in American Bio-Clean Corporation and an 80
News is expected to be released next week on this growing company and could drive the price even higher. Buy (ACGU) Monday at open. I
believe those involved at this stage could enjoy a nice ride up.
x1: The quick brown fox jumped over the lazy dog.
x2: The dog hunts a fox.
word:  the  quick  brown  fox  jumped  over  lazy  dog  hunts  a
x1:     2     1      1     1     1       1     1    1     0     0
x2:     1     0      0     1     0       0     0    1     1     1
and associated labels yi, denoted by Y := {y1, . . . , ym}. Here the labels satisfy yi ∈ {spam, ham}. The key assumption we make here is that the pairs (xi,
yi) are drawn jointly from some distribution p(x, y) which represents the e-
mail generating process for a user. Moreover, we assume that there is
sufficiently strong dependence between x and y that we will be able to
estimate y given x and a set of labeled instances X, Y.
Before we do so we need to address the fact that e-mails such as Figure 1.14
are text, whereas the three algorithms we present will require data to be
represented in a vectorial fashion. One way of converting text into a vector
is by using the so-called bag of words representation [Mar61, Lew98]. In its
simplest version it works as follows: Assume we have a list of all possible
words occurring in X, that is a dictionary, then we are able to assign a unique number to each of those words (e.g. the position in the dictionary). Now we may simply count for each document xi the number of times a given word j occurs. This is then used as the value of the j-th coordinate of xi.
Figure 1.15 gives an example of such a representation. Once we have the
latter it is easy to compute distances, similarities, and other statistics directly
from the vectorial representation.
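As a rough illustration of this construction (a simplified sketch, not the book's code; tokenization here is naive whitespace splitting), one could build bag-of-words vectors in Python as follows:

from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog hunts a fox."]

def tokenize(text):
    # Lowercase and strip punctuation in a very naive way.
    return [w.strip(".,!?").lower() for w in text.split()]

# Build the dictionary: every word gets a unique coordinate index.
vocab = sorted({w for d in docs for w in tokenize(d)})

# Represent each document by its vector of word counts.
vectors = []
for d in docs:
    counts = Counter(tokenize(d))
    vectors.append([counts[w] for w in vocab])

print(vocab)
for v in vectors:
    print(v)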
The key problem, however, is that we do not know p(x|y) or p(x). We may
dispose of the requirement of knowing p(x) by settling for a likelihood ratio
L(x) := p(spam|x) / p(ham|x) = [p(x|spam) p(spam)] / [p(x|ham) p(ham)]. (1.17)
Whenever L(x) exceeds a given threshold c we decide that x is spam and consequently reject the e-mail. If c is large then our algorithm is conservative and classifies an email as spam only if p(spam|x) ≫ p(ham|x). On the other hand, if c is small then the algorithm aggressively classifies emails as spam.
The key obstacle is that we have no access to p(x|y). This is where we make
our key approximation. Recall Figure 1.13. In order to model the distribution
of the test outcomes T1 and T2 we made the assumption that they are
conditionally independent of each other given the diagnosis. Analogously,
we may now treat the occurrence of each word in a document as a separate
test and combine the outcomes in a naive fashion by assuming that
p(x|y) = ∏_{j=1}^{# of words in x} p(w^j | y), (1.18)
where w j denotes the j-th word in document x. This amounts to the as-
sumption that the probability of occurrence of a word in a document is
independent of all other words given the category of the document. Even
though this assumption does not hold in general – for instance, the word “York” is much more likely to occur after the word “New” – it suffices for our purposes (see
Figure 1.16).
This assumption reduces the difficulty of knowing p(x|y) to that of esti-
mating the probabilities of occurrence of individual words w. Estimates for
Fig. 1.16. Naive Bayes model. The occurrence of individual words is independent
of each other, given the category of the text. For instance, the word Viagra is fairly
frequent if y = spam but it is considerably less frequent if y = ham, except when
considering the mailbox of a Pfizer sales representative.
p(w|y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate
p(w|spam) ≈ [Σ_{i=1}^m Σ_{j=1}^{# of words in xi} {yi = spam and w_i^j = w}] / [Σ_{i=1}^m Σ_{j=1}^{# of words in xi} {yi = spam}]
Here yi = spam and w ji = w equals 1 if and only if xi is labeled as spam
and w occurs as the j-th word in xi. The denominator is simply the total
number of words in spam documents. Similarly one can compute p(w|ham).
In principle we could perform the above summation whenever we see a new
document x. This would be terribly inefficient, since each such computation
requires a full pass through X and Y. Instead, we can perform a single pass
through X and Y and store the resulting statistics as a good estimate of the
conditional probabilities. Algorithm 1.1 has details of an implementation.
Note that we performed a number of optimizations: Firstly, the normalization by m_spam^{-1} and m_ham^{-1} respectively is independent of x, hence we incorporate it as a fixed offset.
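For concreteness, here is a minimal Python sketch of the word-counting estimates and the likelihood-ratio decision described above. It is only loosely modeled on Algorithm 1.1; the additive smoothing constant, the toy documents and all names are illustrative assumptions:

import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate p(w|spam) and p(w|ham) by counting words, with additive smoothing."""
    counts = {"spam": Counter(), "ham": Counter()}
    class_docs = Counter(labels)
    for words, y in zip(docs, labels):
        counts[y].update(words)
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {y: sum(counts[y].values()) for y in counts}
    def log_p(word, y):
        return math.log((counts[y][word] + alpha) / (totals[y] + alpha * len(vocab)))
    prior = {y: class_docs[y] / len(docs) for y in counts}
    return log_p, prior

def classify(words, log_p, prior, log_threshold=0.0):
    """Label as spam when the log-likelihood ratio exceeds log(c), here log(c) = 0."""
    score = math.log(prior["spam"]) - math.log(prior["ham"])
    score += sum(log_p(w, "spam") - log_p(w, "ham") for w in words)
    return "spam" if score > log_threshold else "ham"

docs = [["cheap", "viagra", "buy", "now"], ["meeting", "agenda", "attached"],
        ["buy", "cheap", "stocks", "now"], ["lunch", "agenda", "tomorrow"]]
labels = ["spam", "ham", "spam", "ham"]
log_p, prior = train_naive_bayes(docs, labels)
print(classify(["cheap", "stocks"], log_p, prior))  # -> "spam"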
Fig. 1.17. 1 nearest neighbor classifier. Depending on whether the query point x is
closest to the star, diamond or triangles, it uses one of the three labels for it.
Fig. 1.18. k-Nearest neighbor classifiers using Euclidean distances. Left: decision
boundaries obtained from a 1-nearest neighbor classifier. Middle: color-coded sets
of where the number of red / blue points ranges between 7 and 0. Right: decision
boundary determining where the blue or red dots are in the majority.
flexible. For instance, we could use string edit distances to compare two
documents or information theory based measures.
However, the problem with nearest neighbor classification is that the esti-
mates can be very noisy whenever the data itself is very noisy. For instance,
if a spam email is erroneously labeled as nonspam then all emails which
are similar to this email will share the same fate. See Figure 1.18 for an
example. In this case it is beneficial to pool together a number of neighbors,
say the k-nearest neighbors of x and use a majority vote to decide the class
membership of x. Algorithm 1.2 has a description of the algorithm. Note
that nearest neighbor algorithms can yield excellent performance when used
with a good distance measure. For instance, the technology underlying the
Netflix progress prize [BK07] was essentially nearest neighbours based.
Note that it is trivial to extend the algorithm to regression. All we need
to change in Algorithm 1.2 is to return the average of the values yi instead of
their majority vote. Figure 1.19 has an example.
Note that the distance computation d(xi, x) for all observations can be-
Algorithm 1.2 k-Nearest Neighbor Classification
Classify(X, Y, x) {reads documents X, labels Y and query x}
for i = 1 to m do
Compute distance d(xi, x)
end for
Compute set I containing indices for the k smallest distances d(xi, x).
return majority label of {yi where i ∈ I}.
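A compact Python version of Algorithm 1.2 might look as follows (a sketch assuming vectorial inputs and Euclidean distances; the toy data and names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(X, Y, x, k=3):
    """k-nearest-neighbor classification of a query point x.

    X: (m, d) array of training points, Y: list of m labels, x: (d,) query point.
    """
    # Compute the Euclidean distance from x to every training point.
    dists = np.linalg.norm(X - x, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Return the majority label among the k neighbors.
    return Counter(Y[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
Y = ["ham", "ham", "spam", "spam"]
print(knn_classify(X, Y, np.array([0.95, 1.05]), k=3))  # -> "spam"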
Fig. 1.19. k-Nearest neighbor regression estimator using Euclidean distances. Left:
some points (x, y) drawn from a joint distribution. Middle: 1-nearest neighbour
classifier. Right: 7-nearest neighbour classifier. Note that the regression estimate is much smoother.
Here ⟨·, ·⟩ denotes the standard dot product between vectors. Taking differences between the two distances yields
f(x) := ‖µ+ − x‖² − ‖µ− − x‖² = 2 ⟨µ− − µ+, x⟩ + ‖µ−‖² − ‖µ+‖². (1.21)
This is a linear function in x and its sign corresponds to the labels we esti-
mate for x. Our algorithm sports an important property: The classification rule
can be expressed via dot products. This follows from
‖µ+‖² = ⟨µ+, µ+⟩ = m_+^{-2} Σ_{yi=yj=1} ⟨xi, xj⟩ and ⟨µ+, x⟩ = m_+^{-1} Σ_{yi=1} ⟨xi, x⟩.
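A minimal Python sketch of this mean-based classifier (assuming labels in {±1} and vectorial inputs; all names and data are illustrative, and this is not the book's implementation) could read:

import numpy as np

def train_mean_classifier(X, y):
    """Compute the class means; X is (m, d), y contains +1 / -1 labels."""
    mu_pos = X[y == +1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    # f(x) = 2 <mu_- - mu_+, x> + ||mu_-||^2 - ||mu_+||^2, see (1.21)
    w = 2.0 * (mu_neg - mu_pos)
    b = np.dot(mu_neg, mu_neg) - np.dot(mu_pos, mu_pos)
    return w, b

def predict(w, b, x):
    # Negative f(x) means x is closer to mu_+, i.e. we assign label +1.
    return -1 if np.dot(w, x) + b > 0 else +1

X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
y = np.array([+1, +1, -1, -1])
w, b = train_mean_classifier(X, y)
print(predict(w, b, np.array([0.9, 1.1])))    # -> +1
print(predict(w, b, np.array([-1.1, -0.9])))  # -> -1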
Fig. 1.21. The feature map φ maps observations x from X into a feature space H.
The map φ is a convenient way of encoding pre-processing steps systematically.
As a consequence we have now moved from a fairly simple and pedestrian linear classifier to one which yields a nonlinear function f(x) with a rather nontrivial decision boundary.
1.3.4 Perceptron
In the previous sections we assumed that our classifier had access to a train-
ing set of spam and non-spam emails. In real life, such a set might be difficult
to obtain all at once. Instead, a user might want to have instant results when-
ever a new e-mail arrives and he would like the system to learn immediately
from any corrections to mistakes the system makes.
To overcome both these difficulties one could envisage working with the
following protocol: As emails arrive our algorithm classifies them as spam or
non-spam, and the user provides feedback as to whether the classification is
correct or incorrect. This feedback is then used to improve the performance
of the classifier over a period of time.
This intuition can be formalized as follows: Our classifier maintains a
parameter vector. At the t-th time instance it receives a data point xt, to which
it assigns a label ŷt using its current parameter vector. The true label yt is
then revealed, and used to update the parameter vector of the classifier.Such
algorithms are said to be online. We will now describe perhaps the simplest classifier of this kind, namely the Perceptron [Heb49, Ros58].
Let us assume that the data points xt ∈ Rd, and labels yt ∈ {±1}. As before we represent an email as a bag-of-words vector and we assign +1 to spam emails and −1 to non-spam emails. The Perceptron maintains a weight vector w ∈ Rd and an offset b ∈ R, and predicts the label of xt as
ŷt := sign(⟨w, xt⟩ + b), (1.25)
Fig. 1.22. The Perceptron without bias. Left: at time t we have a weight vector wt
denoted by the dashed arrow with corresponding separating plane (also dashed). For
reference we include the linear separator w∗ and its separating plane (both denoted
by a solid line). As a new observation xt arrives which happens to be mis-classified by
the current weight vector wt we perform an update. Also note the margin between
the point xt and the separating hyperplane defined by w∗. Right: This leads to the
weight vector wt+1 which is more aligned with w∗.
where ⟨ w, xt⟩ denotes the usual Euclidean dot product and b is an offset. Note
the similarity of (1.25) to (1.21) of the simple classifier. Just as the latter,
the Perceptron is a linear classifier which separates its domain Rd into two
halfspaces, namely {x| ⟨ w, x⟩ + b > 0} and its complement. If ŷt = yt then
no updates are made. On the other hand, if ŷt /= yt the weight vector si
updated as
Figure 1.22 shows an update step of the Perceptron algorithm. For simplicity
we illustrate the case without bias, that is, where b = 0 and where it remains
unchanged. A detailed description of the algorithm is given in Algorithm 1.3.
An important property of the algorithm is that it performs updates on w
by multiples of the observations xi on which it makes a mistake. Hence we may express w as w = Σ_{i∈Error} yi xi. Just as before, we can replace xi and x by φ(xi) and φ(x) to obtain a kernelized version of the Perceptron algorithm [FS99] (Algorithm 1.4).
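A short Python sketch of the Perceptron update with bias (mirroring Algorithm 1.3 only in spirit; the data and names are illustrative assumptions) is:

import numpy as np

def perceptron(X, y, epochs=10):
    """Online Perceptron: update w and b whenever a point is misclassified."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xt, yt in zip(X, y):
            y_hat = 1 if np.dot(w, xt) + b > 0 else -1
            if y_hat != yt:
                # Update rule (1.26): w <- w + y_t x_t, b <- b + y_t
                w += yt * xt
                b += yt
                mistakes += 1
        if mistakes == 0:   # no mistakes in a full pass: a separating hyperplane was found
            break
    return w, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron(X, y)
print(w, b)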
If the dataset (X, Y) is linearly separable, then the Perceptron algorithm
eventually converges and correctly classifies all the points in X. The rate of
convergence however depends on the margin. Roughly speaking, the margin
quantifies how linearly separable a dataset is, and hence how easy it is to
solve a given classification problem.
Geometrically speaking (see Figure 1.22) the margin measures the distance of x from the hyperplane defined by {x| ⟨w, x⟩ + b = 0}. The larger the margin, the better separated the data and hence the easier it is to find a hyperplane which correctly classifies the dataset. The following theorem asserts that if
there exists a linear classifier which can classify a dataset with a large mar-
gin, then the Perceptron will also correctly classify the same dataset after
making a small number of mistakes.
Theorem 1.7 (Novikoff’s theorem) Let (X, Y) be a dataset with at least one example labeled +1 and one example labeled −1. Let R := max_t ‖xt‖, and assume that there exists (w∗, b∗) such that ‖w∗‖ = 1 and γt := yt(⟨w∗, xt⟩ + b∗) ≥ γ for all t. Then the Perceptron will make at most (1 + R²)(1 + (b∗)²)/γ² mistakes.
This result is remarkable since it does not depend on the dimensionality
of the problem. Instead, it only depends on the geometry of the setting,
as quantified via the margin γ and the radius R of a ball enclosing the
observations. Interestingly, a similar bound can be shown for Support Vector
Machines [Vap95] which we will be discussing in Chapter 7.
Proof We can safely ignore the iterations where no mistakes were made and
hence no updates were carried out. Therefore, without loss of generality
assume that the t-th update was made after seeing the t-th observation and
let wt denote the weight vector after the update. Furthermore, for simplicity
assume that the algorithm started with w0 = 0 and b0 = 0. By the update
equation (1.26) we have
⟨wt, w∗⟩ + bt b∗ = ⟨wt−1, w∗⟩ + bt−1 b∗ + yt(⟨xt, w∗⟩ + b∗)
              ≥ ⟨wt−1, w∗⟩ + bt−1 b∗ + γ.
1.3.5 K-Means
All the algorithms we discussed so far are supervised, that is, they assume
that labeled training data is available. In many applications this is too much
to hope for; labeling may be expensive, error prone, or sometimes impossi-
ble. For instance, it is very easy to crawl and collect every page within the
www.purdue.edu domain, but rather time consuming to assign a topic to
each page based on its contents. In such cases, one has to resort to unsuper-
vised learning. A prototypical unsupervised learning algorithm is K-means, which is a clustering algorithm. Given X = {x1, . . . , xm} the goal of K-means is to partition it into k clusters such that each point in a cluster is more similar to points from its own cluster than to points from some other cluster.
and 0 otherwise.
Stage 2 Keep the r fixed and determine µ. Since the r’s are fixed, J is a quadratic function of µ. It can be minimized by setting the derivative with respect to µj to be 0:
Σ_{i=1}^m rij (xi − µj) = 0 for all j. (1.31)
Rearranging obtains
µj = Σ_i rij xi / Σ_i rij. (1.32)
Since Σ_i rij counts the number of points assigned to cluster j, we are essentially setting µj to be the sample mean of the points assigned to cluster j.
The algorithm stops when the cluster assignments do not change signifi-
cantly. Detailed pseudo-code can be found in Algorithm 1.5.
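A compact Python sketch of the two alternating stages (hard assignment, then mean update) might look as follows; initialization by picking k random data points is one of the heuristics mentioned in the text, and all names and data are illustrative:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd-style K-means: alternate cluster assignment and mean updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster centers
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Stage 1: assign each point to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Stage 2: recompute each center as the mean of its points, cf. (1.32).
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                           else mu[j] for j in range(k)])
        if np.allclose(new_mu, mu):   # centers (and hence assignments) no longer change
            break
        mu = new_mu
    return mu, assign

X = np.vstack([np.random.randn(50, 2) + [5, 5], np.random.randn(50, 2) - [5, 5]])
centers, labels = kmeans(X, k=2)
print(centers)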
Two issues with K-Means are worth noting. First, it is sensitive to the choice
of the initial cluster centers µ. A number of practical heuristics have been
developed. For instance, one could randomly choose k points from the given
dataset as cluster centers. Other methods try to pick k points from X which
are farthest away from each other. Second, it makes a hard assignment of
every point to a cluster center. Variants which we will encounter later in
the book will relax this. Instead of letting rij ∈ {0, 1} these soft variants
will replace it with the probability that a given xi belongs to cluster j.
The K-Means algorithm concludes our discussion of a set of basic machine
learning methods for classification and regression. They provide a useful
starting point for an aspiring machine learning researcher. In this book we will
see many more such algorithms as well as connections between these basic
algorithms and their more advanced counterparts.
Problems
Problem 1.1 (Eyewitness) Assume that an eyewitness is 90% certain that a given person committed a crime in a bar. Moreover, assume that there were 50 people in the restaurant at the time of the crime. What is the posterior probability of the person actually having committed the crime?
Problem 1.2 (DNA Test) Assume the police have a DNA library of 10
million records. Moreover, assume that the false recognition probability is
below 0.00001% per record. Suppose a match is found after a database search
for an individual. What are the chances that the identification is correct? You
can assume that the total population is 100 million people. Hint: compute the
probability of no match occurring first.
Problem 1.3 (Bomb Threat) Suppose that the probability that one of a
thousand passengers on a plane has a bomb is 1 : 1, 000, 000. Assuming that the
probability to have a bomb is evenly distributed among the passengers,
the probability that two passengers have a bomb is roughly equal to 10⁻¹².
Therefore, one might decide to take a bomb on a plane to decrease chances
that somebody else has a bomb. What is wrong with this argument?
Problem 1.6 (Two Dice) Assume you have a game which uses the maximum of two dice. Compute the probability of seeing any of the events {1, . . . , 6}. Hint: prove first that the cumulative distribution function of the maximum of a pair of random variables is the square of the original cumulative distribution function.
Problem 1.7 (Matching Coins) Consider the following game: two players bring a coin each. The first player bets that when tossing the coins both will match and the second one bets that they will not match. Show that even if one of the players were to bring a tainted coin, the game still would be fair. Show that it is in the interest of each player to bring a fair coin to the game. Hint: assume that the second player knows that the first coin favors heads over tails.
Problem 1.9 Prove that the Normal distribution (1.3) has mean µ and
variance σ2. Hint: exploit the fact that p is symmetric around µ.
Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds the median. Hint: the mean depends on the value of the high-quantile terms, whereas the median does not.
Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory Naive Bayes the optimal decision is given by
y∗(x) := argmax_y p(y) ∏_{i=1}^n p([x]_i | y) (1.34)
Moreover, show that y∗(x) is the label incurring the smallest misclassification error.
Problem 1.14 (Nearest Neighbor Loss) Show that the expected loss incurred by the nearest neighbor classifier does not exceed twice the loss of the Bayes optimal decision.
2 Density Estimation
After looking at this figure you decide that things are probably reasonable.
And, in fact, they are consistent with the convergence behavior of a sim-
ulated dice in Figure 2.1. In computing (2.1) we have learned something
useful: the expansion is a special case of a binomial series. The first term
Fig. 2.1. Convergence of empirical means to expectations. From left to right: em-
pirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100, 200, and
500 times respectively. Note that after 20 throws we still have not observed a single ’6’, an event which occurs with only (5/6)²⁰ ≈ 2.6% probability.
Let X̄m := m⁻¹ Σ_{i=1}^m Xi be the empirical average over the random variables Xi. Then for any ϵ > 0 the following holds:
lim_{m→∞} Pr(|X̄m − µ| ≤ ϵ) = 1. (2.3)
Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line
denotes the mean 3.5. The uneven solid line denotes the actual mean X̄n as a
function of the number of draws, given as a semilogarithmic plot. The crosses denote
the outcomes of the dice. Note how X̄n ever more closely approaches the mean 3.5 as we obtain an increasing number of observations.
This establishes that, indeed, for large enough sample sizes, the average will
converge to the expectation. The strong law strengthens this as follows:
The strong law implies that almost surely (in a measure theoretic sense) X̄m
converges to µ, whereas the weak law only states that for every ϵ the random
variable X̄m will be within the interval [µ − ϵ, µ+ϵ]. Clearly the strong implies
the weak law since the measure of the events X̄m = µ converges to 1, hence
any ϵ-ball around µ would capture this.
Both laws justify that we may take sample averages, e.g. over a number
of events such as the outcomes of a dice and use the latter to estimate their
means, their probabilities (here we treat the indicator variable of the event
as a {0; 1}-valued random variable), their variances or related quantities. We
postpone a proof until Section 2.1.2, since an effective way of proving Theo-
rem 2.1 relies on the theory of characteristic functions which we will discuss
in the next section. For the moment, we only give a pictorial illustration in
Figure 2.2.
Once we have established that the random variable X̄m = m⁻¹ Σ_{i=1}^m Xi converges to its mean µ, a natural second question is to establish how quickly it converges and what the properties of the limiting distribution of X̄m − µ are.
Note in Figure 2.2 that the initial deviation from the mean is large whereas
as we observe more data the empirical mean approaches the true one.
Fig. 2.3. Five instantiations of a running average over outcomes of a toss of a dice. Note that all of them converge to the mean 3.5. Moreover note that they all are well contained within the upper and lower envelopes given by µ ± √(Var_X[x]/m).
Note that just like the law of large numbers the central limit theorem (CLT)
is an asymptotic result. That is, only in the limit of an infinite number of
observations will it become exact. That said, it often provides an excellent
approximation even for finite numbers of observations, as illustrated in Fig-
ure 2.4. In fact, the central limit theorem and related limit theorems build the
foundation of what is known as asymptotic statistics.
which all have mean µ = 3.5 and variance (see Problem 2.1)
Var_X[x] = E_X[x²] − E_X[x]² = (1 + 4 + 9 + 16 + 25 + 36)/6 − 3.5² ≈ 2.92.
We now study the random variable Wm := m⁻¹ Σ_{i=1}^m [Xi − 3.5]. Since each of the terms in the sum has zero mean, Wm’s mean vanishes as well. Moreover, Wm is a multiple of Zm of (2.4). Hence Wm converges to a normal distribution with zero mean and variance 2.92 m⁻¹.
Consequently the average of m tosses of the dice yields a random variable with mean 3.5 which approaches a normal distribution with variance 2.92 m⁻¹. In other words, the empirical mean converges to its expectation at rate O(m^{-1/2}). Figure 2.3 gives an illustration of the quality of the bounds implied by the CLT.
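The convergence behavior illustrated in Figures 2.2 and 2.3 is easy to reproduce. The following Python sketch (illustrative only, not the code used to produce the figures) compares the empirical mean of m dice rolls with the CLT envelope:

import numpy as np

rng = np.random.default_rng(0)
var = (1 + 4 + 9 + 16 + 25 + 36) / 6 - 3.5**2     # about 2.92

for m in (10, 100, 1000, 10000):
    rolls = rng.integers(1, 7, size=m)            # m tosses of a fair dice
    mean = rolls.mean()
    envelope = np.sqrt(var / m)                   # CLT standard deviation of the mean
    print(f"m={m:5d}  empirical mean={mean:.3f}  expected deviation ~ {envelope:.3f}")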
One remarkable property of functions of random variables is that in many
conditions convergence properties of the random variables are bestowed upon
the functions, too. This is manifest in the following two results: a variant
of Slutsky’s theorem and the so-called delta method. The former deals with
limit behavior whereas the latter deals with an extension of the central limit
theorem.
For a proof see e.g. [Bil68]. Theorem 2.4 is often referred to as the continuous
mapping theorem (Slutsky only proved the result for affine functions). It
means that for functions of random variables it is possible to pull the limiting
procedure into the function. Such a device is useful when trying to prove
asymptotic normality and in order to obtain characterizations of the limiting
distribution.
In other words, φX (ω) is the inverse Fourier transform applied to the prob-
ability measure p(x). Consequently φX (ω) uniquely characterizes p(x) and
moreover, p(x) can be recovered from φX (ω) via the forward Fourier trans-
form. One of the key utilities of characteristic functions is that they allow
us to deal in easy ways with sums of random variables.
The result for characteristic functions follows form the property of the
Fourier transform.
For sums of several random variables the characteristic function is the prod-
uct of the individual characteristic functions. This allows us to prove both
the weak law of large numbers and the central limit theorem (see Figure 2.4
for an illustration) by proving convergence in the Fourier domain.
Proof [Weak Law of Large Numbers] At the heart of our analysis lies
a Taylor expansion of the exponential into
exp(iwx) = 1 + i ⟨ w, x⟩ + o(|w|)
and hence φX (ω) = 1 + iwEX [x] + o(|w|).
(Footnote 1: In Chapter ?? we will discuss more general descriptions of distributions of which φX is a special case. In particular, we will replace the exponential exp(i⟨ω, x⟩) by a kernel function k(x, x′).)
Fig. 2.4. A working example of the central limit theorem. The top row contains distributions of sums of uniformly distributed random variables on the interval [−0.5, 0.5]. From left to right we have sums of 1, 2, 4, 8 and 16 random variables. The bottom row contains the same distribution with the means rescaled by √m, where m is the number of observations. Note how the distribution converges increasingly to the normal distribution.
Given m random variables Xi with mean E_X[x] = µ this means that their average X̄m := (1/m) Σ_{i=1}^m Xi has the characteristic function
φ_{X̄m}(ω) = (1 + (i/m) wµ + o(m⁻¹|w|))^m (2.11)
Proof [Central Limit Theorem] We use the same idea as above to prove
the CLT. The main difference, though, is that we need to assume that the
second moments of the random variables Xi exist. To avoid clutter we only
prove the case of constant mean EXi [xi] = µ and variance VarXi [xi] = σ2.
Let Zm := (1/√(mσ²)) Σ_{i=1}^m (Xi − µ). Our proof relies on showing convergence of the characteristic function of Zm, i.e. φ_{Zm}, to that of a normally distributed random variable W with zero mean and unit variance. Expanding the exponential to second order yields
exp(iwx) = 1 + iwx − ½ w²x² + o(|w|²)
and hence φ_X(ω) = 1 + iw E_X[x] − ½ w² Var_X[x] + o(|w|²).
Since the mean of Zm vanishes by centering (Xi − µ) and the variance per variable is m⁻¹ we may write the characteristic function of Zm via
φ_{Zm}(ω) = (1 − w²/(2m) + o(m⁻¹|w|²))^m.
Its proof is left as an exercise. See Problem 2.2 for details. This connection
also implies (subject to regularity conditions) that if we know the moments
of a distribution we are able to reconstruct it directly since it allows us
to reconstruct its characteristic function. This idea has been exploited in
density estimation [Cra46] in the form of Edgeworth and Gram-Charlier
expansions [Hal92].
The simplest of these bounds requires only that we know the mean of the random variables. This leads to the Gauss-Markov inequality. If we know their mean and their variance we are able to state a stronger bound, the Chebyshev inequality. For an even stronger setting, when we know that each variable has bounded range, we will be able to state a Chernoff bound. Those bounds are progressively more tight and also more difficult to prove. We state them in order of technical sophistication.
This bound is quite reasonable for large δ but it means that for high levels
of confidence we need a huge number of observations.
Much stronger results can be obtained if we are able to bound the range
of the random variables. Using the latter, we reap an exponential improve-
ment in the quality of the bounds in the form of the McDiarmid [McD89]
inequality. We state the latter without proof:
Example 2.3 (Hoeffding bound) As in example 2.2 assume that Xi are iid
random variables and let X̄m be their average. Moreover, assume that
While this bound is tight (see Problem 2.5 for details), it is possible to ob-
tain better bounds if we know additional information. In particular knowing
a bound on the variance of a random variable in addition to knowing that it
has bounded range would allow us to strengthen the statement considerably.
The Bernstein inequality captures this connection. For details see [BBL05] or
works on empirical process theory [vdVW96, SW86, Vap82].
2.1.4 An Example
It is probably easiest to illustrate the various bounds using a concrete exam-
ple. In a semiconductor fab processors are produced on a wafer. A typical300mm
wafer holds about 400 chips. A large number of processing steps are required
to produce a finished microprocessor and often it is impossibleto assess the
effect of a design decision until the finished product has been produced.
Assume that the production manager wants to change some step from
process ’A’ to some other process ’B’. The goal is to increase the yield of
the process, that is, the number of chips of the 400 potential chips on the
wafer which can be sold. Unfortunately this number is a random variable,
i.e. the number of working chips per wafer can vary widely between different
wafers. Since process ’A’ has been running in the factory for a very long
time we may assume that the yield is well known, say it is µA = 350 out of
400 processors on average. It is our goal to determine whether process ’B’
is better and what its yield may be. Obviously, since production runs are
expensive we want to be able to determine this number as quickly as
possible, i.e. using as few wafers as possible. The production manager is risk
averse and wants to ensure that the new process is really better. Hence he
requires a confidence level of 95% before he will change the production.
A first step is to formalize the problem. Since we know process ’A’ exactly
we only need to concern ourselves with ’B’. We associate the random variable Xi
with wafer i. A reasonable (and somewhat simplifying) assumption is to posit
that all Xi are independent and identically distributed where all Xi have the
mean µB. Obviously we do not know µB — otherwise there would be no
reason for testing! We denote by X̄m the average of the yields of m wafers
using process ’B’. What we are interested in is the accuracy ϵ for which the
probability
δ = Pr(|X̄m − µB | > ϵ) satisfies δ ≤ 0.05.
Let us now discuss how the various bounds behave. For the sake of the
argument assume that µB − µA = 20, i.e. the new process produces on
average 20 additional usable chips.
Hoeffding Since the yields are in the interval {0, . . . , 400} we have an explicit bound on the range of observations. Recall the inequality (2.16) which bounds the failure probability δ = 0.05 by an exponential term. Solving this for m yields
m ≥ 0.5 |b − a|² ϵ⁻² log(2/δ) ≈ 737.8 (2.19)
In other words, we need at least 738 wafers to determine whether process ’B’ is better. While this is a significant improvement of almost two orders of magnitude, it still seems wasteful and we would like to do better.
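The sample-size computation in (2.19) can be checked with a few lines of Python (a sketch using the numbers of the example: yields in the range 0 to 400, ϵ = 20, δ = 0.05):

import math

b_minus_a = 400.0   # yields lie in {0, ..., 400}
eps = 20.0          # desired accuracy
delta = 0.05        # allowed failure probability

m = 0.5 * b_minus_a**2 / eps**2 * math.log(2.0 / delta)
print(m, "->", math.ceil(m))   # about 737.8 -> 738 wafers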
Rates and Constants The astute reader will have noticed that all three
confidence bounds had scaling behavior m = O(ϵ^{−2}). That is, in all cases the
number of observations was a fairly ill behaved function of the amount of
confidence required. If we were just interested in convergence per se, a
statement like that of the Chebyshev inequality would have been entirely
sufficient. The various laws and bounds can often be used to obtain con-
siderably better constants for statistical confidence guarantees. For more
complex estimators, such as methods to classify, rank, or annotate data,
a reasoning such as the one above can become highly nontrivial. See e.g.
[MYA94, Vap98] for further details.
2.2 Parzen Windows
Let us discuss a concrete case. We assume that we have 12 documents and would
like to estimate the probability of occurrence of the word 'dog' from it. As
raw data we have:
Document ID 1 2 3 4 5 6 7 8 9 10 11 12
Occurrences of ‘dog’ 1 0 2 0 4 6 3 0 6 2 0 1
This means that the word ‘dog’ occurs the following number of times:
Occurrences of ‘dog’ 0 1 2 3 4 5 6
Number of documents 4 2 2 1 1 0 2
Something unusual is happening here: for some reason we never observed
5 instances of the word 'dog' in our documents, only 4 or fewer, or alternatively
6 times. So what about 5 times? It is reasonable to assume that
the corresponding value should not be 0 either. Maybe we did not sample
enough. One possible strategy is to add pseudo-counts to the observations.
This amounts to the following estimate:
p̂_X(x) := (m + |X|)^{−1} [1 + Σ_{i=1}^m {x_i = x}]  (2.24)
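As a sketch of (2.24), the snippet below computes the smoothed estimate for the 'dog' counts above; the domain X = {0, …, 6} and the pseudo-count of 1 per outcome follow the text, everything else is just bookkeeping.

from collections import Counter

occurrences = [1, 0, 2, 0, 4, 6, 3, 0, 6, 2, 0, 1]  # 'dog' counts per document
domain = range(7)                                    # X = {0, ..., 6}
counts = Counter(occurrences)
m = len(occurrences)

# p_hat(x) = (1 + #{x_i = x}) / (m + |X|), cf. (2.24)
p_hat = {x: (1.0 + counts[x]) / (m + len(domain)) for x in domain}
print(p_hat[5])  # no document had 5 occurrences, yet p_hat(5) = 1/19 > 0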
We may choose to smooth it out by a smoothing kernel h(x) such that the
probability mass becomes somewhat more spread out. For a density estimate on
X ⊆ R^d this is achieved by

p̂(x) = (1/m) Σ_{i=1}^m r^{−d} h((x − x_i)/r).  (2.26)
This expansion is commonly known as the Parzen windows estimate. Note
that obviously h must be chosen such that h(x) ≥ 0 for all x ∈ X and
moreover that ∫ h(x) dx = 1 in order to ensure that (2.26) is a proper
probability distribution. We now formally justify this smoothing. Let R be a
small region such that

q = ∫_R p(x) dx.

Out of the m samples drawn from p(x), the probability that k of them fall
in region R is given by the binomial distribution

binom(m, k) q^k (1 − q)^{m−k}.
The expected fraction of points falling inside the region can easily be com-
puted from the expected value of the Binomial distribution: E[k/m] = q.
Similarly, the variance can be computed as Var[k/m] = q(1 − q)/m. As m
→ ∞ the variance goes to 0 and hence the estimate peaks around the
expectation. We can therefore set k ≈ mq. Moreover, if R is a small region of
volume V around x over which p is approximately constant, then q ≈ p(x) · V.
If we let

k = Σ_{i=1}^m h((x − x_i)/r),

then one can use (2.27) to estimate p via

p̂(x) = (1/m) Σ_{i=1}^m r^{−d} h((x − x_i)/r),
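The following is a minimal sketch of the Parzen windows estimate (2.26) for one-dimensional data with a Gaussian smoothing kernel; the sample of weights and the kernel width r are invented purely for illustration.

import numpy as np

def parzen_density(x, samples, r):
    """Parzen windows estimate p_hat(x) = 1/m * sum_i r^{-d} h((x - x_i)/r)
    with a Gaussian kernel h(u) = (2*pi)^{-1/2} exp(-u^2/2); here d = 1."""
    u = (x - samples[:, None]) / r                    # shape (m, len(x))
    h = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return (h / r).mean(axis=0)

weights = np.array([62., 70., 75., 81., 58., 90., 66., 73.])  # hypothetical data
grid = np.linspace(40, 110, 5)
print(parzen_density(grid, weights, r=3.0))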
Fig. 2.5. Left: a naive density estimate given a sample of the weight of 18 persons.
Right: the underlying weight distribution.
Fig. 2.6. Parzen windows density estimate associated with the 18 observations of the
figure above. From left to right: Gaussian kernel density estimate with kernel of
width 0.3, 1, 3, and 10 respectively.
Fig. 2.7. Some kernels for Parzen windows density estimation. From left to right:
Gaussian kernel, Laplace kernel, Epanechnikov kernel, and uniform density.
Moreover, there is the issue of choosing a suitable kernel function. The fact
that a large variety of them exists might suggest that this is a crucial issue. In
practice, this turns out not to be the case and instead, the choice of a
suitable kernel width is much more vital for good estimates. In other words,
size matters, shape is secondary.
The problem is that we do not know which kernel width is best for the
data. If the problem is one-dimensional, we might hope to be able to eyeball
the size of r. Obviously, in higher dimensions this approach fails. A second
Fig. 2.8. Nonuniform density. Left: original density with samples drawn from the
distribution. Middle: density estimate with a uniform kernel. Right: density estimate
using Silverman’s adjustment.
Fig. 2.9. Watson-Nadaraya estimate. Left: a binary classifier. The optimal solution
would be a straight line since both classes were drawn from a normal distribution
with the same variance. Right: a regression estimator. The data was generated from
a sinusoid with additive noise. The regression tracks the sinusoid reasonably well.
2.3 Exponential Families
2.3.1 Basics
Densities from the exponential family are defined by

p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)),

where φ(x) denotes the sufficient statistics and the log-partition function
g(θ) = log ∫_X exp(⟨φ(x), θ⟩) dx ensures normalization.
Example 2.5 (Binary Model) Assume that X = {0, 1} and that φ(x) = x.
In this case we have g(θ) = log(e^0 + e^θ) = log(1 + e^θ). It follows that
p(x = 0; θ) = 1/(1 + e^θ) and p(x = 1; θ) = e^θ/(1 + e^θ). In other words, by choosing
different values of θ one can recover different Bernoulli distributions.
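The binary model can be checked numerically in a few lines; the sketch below uses g(θ) = log(1 + e^θ) and confirms by finite differences that the first and second derivatives of g reproduce the mean and variance of φ(x) = x, as derived below.

import numpy as np

def g(theta):
    # log-partition function of the binary model
    return np.log1p(np.exp(theta))

def p_one(theta):
    # p(x = 1; theta) = e^theta / (1 + e^theta)
    return 1.0 / (1.0 + np.exp(-theta))

theta, eps = 0.7, 1e-4
mean = (g(theta + eps) - g(theta - eps)) / (2 * eps)                 # ~ g'(theta)
var = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps ** 2    # ~ g''(theta)
print(mean, p_one(theta))                       # both equal p(x = 1; theta)
print(var, p_one(theta) * (1 - p_one(theta)))   # both equal the variance of x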
Proof Note that ∇²_θ g(θ) = Var_x[φ(x)] implies that g is convex, since the
covariance matrix is positive semidefinite. To show (2.42) we expand

∇_θ g(θ) = [∫_X φ(x) exp⟨φ(x), θ⟩ dx] / [∫_X exp⟨φ(x), θ⟩ dx] = ∫ φ(x) p(x; θ) dx = E_x[φ(x)].  (2.43)

Next we take the second derivative to obtain

∇²_θ g(θ) = ∫_X φ(x) [φ(x) − ∇_θ g(θ)]^T p(x; θ) dx  (2.44)
          = E_x[φ(x) φ(x)^T] − E_x[φ(x)] E_x[φ(x)]^T  (2.45)
which proves the claim. For the first equality we used (2.43). For the second
line we used the definition of the variance.
One may show that higher derivatives ∇^n_θ g(θ) generate higher order cumulants
of φ(x) under p(x; θ). This is why g is often also referred to as the
cumulant-generating function. Note that in general, closed-form computation
of g(θ) may not be possible.
For the binary model of Example 2.5 these cumulants match what we would
have obtained from direct computation of the mean p(x = 1; θ) and the variance
p(x = 1; θ) − p(x = 1; θ)² under the distribution p(x; θ).
2.3.2 Examples
A large number of densities are members of the exponential family. Note,
however, that in statistics it is not common to express them in the dot
product formulation for historic reasons and for reasons of notational com-
pactness. We discuss a number of common densities below and show why
they can be written in terms of an exponential family. A detailed description
of the most commonly occurring types is given in the table below.
Gaussian Let x, µ ∈ R^d and let Σ ∈ R^{d×d} where Σ > 0, that is, Σ is a positive
definite matrix. In this case the normal distribution can be expressed via

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − µ)^T Σ^{−1} (x − µ))  (2.46)
     = exp(x^T Σ^{−1} µ + tr(−½ x x^T Σ^{−1}) − c(µ, Σ))

where c(µ, Σ) = ½ µ^T Σ^{−1} µ + (d/2) log 2π + ½ log |Σ|. By combining the
terms in x into φ(x) := (x, −½ x x^T) we obtain the sufficient statistics.

Poisson The Poisson distribution with rate λ is given by

p(x) = e^{−λ} λ^x / x! = (1/x!) exp(x log λ − λ)  where x ∈ N₀.  (2.48)
The Poisson distribution arises as the limit of a binomial distribution with
success probability λ/n as n → ∞. Denoting by z_n such a binomial random
variable we have

p(z_n = k) = [n! / ((n − k)! k!)] (λ^k / n^k) (1 − λ/n)^{n−k}  (2.49)
           = (λ^k / k!) [n! / (n^k (n − k)!)] (1 − λ/n)^n (1 − λ/n)^{−k}
The Beta distribution is a distribution on the unit interval [0, 1]. It is given by

p(x) = [Γ(a + b) / (Γ(a) Γ(b))] x^{a−1} (1 − x)^{b−1}.  (2.50)
Figure 2.10 has a graphical description of the Poisson distribution and the Beta
distribution. For a more comprehensive list of exponential family distributions
see the table below and [Fel71, FT94, MN83]. In principle any map φ(x) and
domain X with underlying measure µ are suitable, as long as the log-partition
function g(θ) can be computed efficiently.
Name            Domain X   Measure              φ(x)                     g(θ)                                                          Domain Θ
Gaussian        R^n        Lebesgue             (x, −½ x x^T)            (n/2) log 2π − ½ log |θ₂| + ½ θ₁^T θ₂^{−1} θ₁                 R^n × C_n
Inverse Normal  [0, ∞)     x^{−3/2}             (−x, −1/x)               ½ log π − ½ log θ₂ − 2 √(θ₁ θ₂)                               (0, ∞)²
Beta            [0, 1]     (x(1 − x))^{−1}      (log x, log(1 − x))      log Γ(θ₁) + log Γ(θ₂) − log Γ(θ₁ + θ₂)                        (0, ∞)²
Gamma           [0, ∞)     x^{−1}               (log x, −x)              log Γ(θ₁) − θ₁ log θ₂                                         (0, ∞)²
Wishart         C_n        |x|^{−(n+1)/2}       (log |x|, −½ x)          −θ₁ log |θ₂| + θ₁ n log 2 + Σ_{i=1}^n log Γ(θ₁ + (1 − i)/2)   R × C_n
Dirichlet       S_n        (Π_{i=1}^n x_i)^{−1} (log x₁, …, log x_n)     Σ_{i=1}^n log Γ(θ_i) − log Γ(Σ_{i=1}^n θ_i)                   (R⁺)^n
Inverse χ²      R⁺         e^{−1/(2x)}          −log x                   (θ − 1) log 2 + log Γ(θ − 1)                                  (0, ∞)
Logarithmic     N          1/x                  x                        log(−log(1 − e^θ))                                            (−∞, 0)

S_n denotes the probability simplex in n dimensions. C_n is the cone of positive semidefinite matrices in R^{n×n}.
2.4 Estimation
In many statistical problems the challenge is to estimate parameters of in-
terest. For instance, in the context of exponential families, we may want
to estimate a parameter θˆ such that it is close to the “true” parameter θ∗
in the distribution. While the problem is fully general, we will describe the
relevant steps in obtaining estimates for the special case of the exponential
family. This is done for two reasons — firstly, exponential families are an
important special case and we will encounter slightly more complex variants
on the reasoning in later chapters of the book. Secondly, they are of a suffi-
ciently simple form that we are able to show a range of different techniques.
In more advanced applications only a small subset of those methods may be
practically feasible. Hence exponential families provide us with a working
example based on which we can compare the consequences of a number of
different techniques.
Here µ[X] is the empirical average of the map φ(x). Maximization of p(X; θ)
is equivalent to minimizing the negative log-likelihood − log p(X; θ). The latter
is a common practical choice since for independently drawn data the
product of probabilities decomposes into the sum of the logarithms of the
individual likelihoods. This leads to the following objective function to be
minimized
Put another way, the above conditions state that we aim to find the distribu-
tion p(x; θ) which has the same expected value of φ(x) as what we observed
empirically via µ[X]. Under very mild technical conditions a solution to (2.56)
exists.
In general, (2.56) cannot be solved analytically. In certain special cases,
though, this is easily possible. We discuss two such choices in the following:
Multinomial and Poisson distributions.
1 Often the Poisson distribution is specified using the rate parameter λ := e^θ. In this case it is
written as p(x; λ) = λ^x e^{−λ}/x!. The advantage of the natural parametrization using
θ is that we can directly take advantage of the properties of the log-partition function as
generating the cumulants of x.
It is easy to check that (2.58) is satisfied for e^{θ_i} = Σ_{j=1}^m {x_j = i}. In other
words, the MLE for a discrete distribution is simply given by the empirical
frequencies of occurrence.
The multinomial setting also exhibits two rather important aspects of ex-
ponential families: firstly, choosing θ_i = c + log Σ_{j=1}^m {x_j = i} for any c ∈ R
will lead to an equivalent distribution. This is the case since the sufficient
statistic φ(x) is not minimal. In our context this means that the coordinates
of φ(x) are linearly dependent — for any x we have that Σ_j [φ(x)]_j = 1,
hence we could eliminate one dimension. This is precisely the additional
degree of freedom which is reflected in the scaling freedom in θ.
Secondly, for data where some events do not occur at all, the expression
log[Σ_{j=1}^m {x_j = i}] = log 0 is ill defined. This is due to the fact that this
particular set of counts occurs on the boundary of the convex set within which
the natural parameters θ are well defined. We will see how different types of
priors can alleviate the issue.
Using the MLE is not without problems. As we saw in Figure 2.1, conver-
gence can be slow, since we are not using any side information. The latter can
provide us with problems which are both numerically better conditioned and
which show better convergence, provided that our assumptions are ac-
curate. Before discussing a Bayesian approach to estimation, let us discuss
basic statistical properties of the estimator.
Fig. 2.11. Left: unbiased estimator; the estimates, denoted by circles have as mean
the true parameter, as denoted by a star. Middle: consistent estimator. While the
true model is not within the class we consider (as denoted by the ellipsoid), the
estimates converge to the white star which is the best model within the class that
approximates the true model, denoted by the solid star. Right: different estimators
have different regions of uncertainty, as made explicit by the ellipses around the true
parameter (solid star).
In other words, in expectation the parameter estimate matches the true pa-
rameter. Note that this only makes sense if a true parameter actually exists.
For instance, if the data is Poisson distributed and we attempt modeling it
by a Gaussian we will obviously not obtain unbiased estimates.
For finite sample sizes MLE is often biased. For instance, for the normal
distribution the variance estimates carry a bias of O(m^{−1}). See Problem 2.19 for
details. In general, under fairly mild conditions, MLE is asymptotically
unbiased [DGL96]. We prove this for exponential families. For more general
settings the proof depends on the dimensionality and smoothness of the
family of densities that we have at our disposition.
Theorem 2.19 (Cramér and Rao [Rao73]) Assume that X is drawn from
p(X; θ) and let θ̂[X] be an asymptotically unbiased estimator. Denote by I
the Fisher information matrix and by B the variance of θ̂[X], where

I := Cov[∇_θ log p(X; θ)]  and  B := Var[θ̂[X]].  (2.60)

Then det(IB) ≥ 1.
Proof We prove the claim for the scalar case. The extension to matrices is
straightforward. Using the Cauchy-Schwarz inequality we have

Cov²[∇_θ log p(X; θ), θ̂[X]] ≤ Var[∇_θ log p(X; θ)] Var[θ̂[X]] = IB.  (2.61)

Note that at the true parameter the expected log-likelihood score vanishes

E_X[∇_θ log p(X; θ)] = ∫ ∇_θ p(X; θ) dX = ∇_θ 1 = 0.  (2.62)
Hence we may simplify the covariance formula by dropping the means via

Cov[∇_θ log p(X; θ), θ̂[X]] = E_X[∇_θ log p(X; θ) θ̂[X]]
                            = ∫ p(X; θ) θ̂(X) ∇_θ log p(X; θ) dX
                            = ∫ ∇_θ p(X; θ) θ̂(X) dX = ∇_θ θ = 1.
and with respect to which norm. Under fairly general conditions this turns
out to be true for finite-dimensional parameters and smoothly parametrized
densities. See [DGL96, vdG00] for proofs and further details.
2.4.3 A Bayesian Approach
While the maximum likelihood estimator is supported by asymptotic optimality
theorems, it ignores the fact that in practice we almost always have some
idea about what to expect of our solution. It would be foolish to ignore such
additional information. For instance, when trying to determine the voltage of
a battery, it is reasonable to expect a measurement in the order of 1.5V or
less. Consequently such prior knowledge should be incorporated into the
estimation process. In fact, the use of side information to guide estimation
turns out to be the tool to building estimators which work well in high
dimensions.
Recall Bayes' rule (1.15) which states that p(θ|x) = p(x|θ) p(θ) / p(x). In our context
this means that if we are interested in the posterior probability of θ
assuming a particular value, we may obtain this using the likelihood (often
referred to as evidence) of x having been generated by θ via p(x|θ) and our
prior belief p(θ) that θ might be chosen in the distribution generating x.
Observe the subtle but important difference to MLE: instead of treating θ
as a parameter of a density model, we treat θ as an unobserved random
variable which we may attempt to infer given the observations X.
This can be done for a number of different purposes: we might want to
infer the most likely value of the parameter given the posterior distribution
p(θ|X). This is achieved by
θ̂_MAP(X) := argmax_θ p(θ|X) = argmin_θ [− log p(X|θ) − log p(θ)].  (2.64)
The second equality follows since p(X) does not depend on θ. This estimator
is also referred to as the Maximum a Posteriori, or MAP estimator. It differs
from the maximum likelihood estimator by adding the negative log-prior
to the optimization problem. For this reason it is sometimes also referred
to as Penalized MLE. Effectively we are penalizing unlikely choices θ via
− log p(θ).
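As a sketch of (2.64) for the binary model of Example 2.5, the snippet below computes the MAP estimate of the natural parameter θ under a Gaussian prior p(θ) ∝ exp(−θ²/2σ²) by gradient descent on the penalized negative log-likelihood; the observations and σ² are invented for illustration.

import numpy as np

def map_estimate_binary(xs, sigma2=1.0, lr=0.1, steps=500):
    """MAP estimate for the binary exponential family model with a
    Gaussian prior on theta (a minimal sketch of penalized MLE, cf. (2.64))."""
    m, s = len(xs), sum(xs)
    theta = 0.0
    for _ in range(steps):
        # gradient of -log p(X|theta) - log p(theta) = m*g'(theta) - s + theta/sigma2
        grad = m / (1.0 + np.exp(-theta)) - s + theta / sigma2
        theta -= lr * grad
    return theta

xs = [1, 1, 1, 0, 1]                  # hypothetical observations
theta_map = map_estimate_binary(xs)
theta_mle = np.log(4.0 / 1.0)         # MLE: log(count of ones / count of zeros)
print(theta_map, theta_mle)           # the prior shrinks theta towards 0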
Note that using θ̂MAP (X) as the parameter of choice is not quite accurate.
After all, we can only infer a distribution over θ and in general there is no
guarantee that the posterior is indeed concentrated around its mode. A more
accurate treatment is to use the distribution p(θ|X) directly via
p(x|X) = ∫ p(x|θ) p(θ|X) dθ.  (2.65)
In other words, we integrate out the unknown parameter θ and obtain the
density estimate directly. As we will see, it is generally impossible to solve
(2.65) exactly, an important exception being conjugate priors. In the other
cases one may resort to sampling from the posterior distribution to approx-
imate the integral.
While it is possible to design a wide variety of prior distributions, this book
focuses on a small family of particularly convenient norm-based priors.
That is, they restrict the deviation of the parameter value θ from some guess
θ0. The intuition is that extreme values of θ are much less likely than more
moderate choices of θ which will lead to more smooth and even distributions
p(x|θ).
A popular choice is the Gaussian prior which we obtain for p = d = 1
and λ = 1/(2σ²). Typically one sets θ₀ = 0 in this case. Note that in (2.66)
we did not spell out the normalization of p(θ) — in the context of MAP
estimation this is not needed since it simply becomes a constant offset in
the optimization problem (2.64). We have
Note that p(θ|n, ν) itself is a member of the exponential family with the feature
map φ(θ) = (θ, −g(θ)). Hence h(ν, n) is convex in (nν, n). Moreover, the
posterior distribution has the form
Fig. 2.12. From left to right: regions of equal prior probability in R² for priors using
the ℓ1, ℓ2 and ℓ∞ norm. Note that only the ℓ2 norm is invariant with regard to the
coordinate system. As we shall see later, the ℓ1 norm prior leads to solutions where
only a small number of coordinates is nonzero.
That is, the posterior distribution has the same form as a conjugate prior with
parameters (mµ[X] + nν)/(m + n) and m + n. In other words, n acts like a
phantom sample size and ν is the corresponding mean parameter. Such an
interpretation is reasonable given our desire to design a prior which, when
combined with the likelihood, remains in the same model class: we treat prior
knowledge as having observed virtual data beforehand which is then added to the
actual set of observations. In this sense data and prior become completely
equivalent — we obtain our knowledge either from actual observations or from
virtual observations which describe our belief into how the data generation
process is supposed to behave.
Eq. (2.70) has the added benefit of allowing us to provide an exact nor-
malized version of the posterior. Using (2.68) we obtain that

p(θ|X) = exp(⟨mµ[X] + nν, θ⟩ − (m + n) g(θ) − h((mµ[X] + nν)/(m + n), m + n)).
Combining terms one may check that the integrand amounts to the normalization constant.
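The phantom-sample interpretation fits in a couple of lines of code: the posterior mean parameter is simply the convex combination (mµ[X] + nν)/(m + n) of data and prior. The sketch below uses the 'dog' word counts from earlier under a Poisson model, with an invented prior mean ν and phantom sample size n.

def posterior_mean_parameter(mu_data, m, nu, n):
    """Mean parameter of the conjugate posterior: (m*mu[X] + n*nu) / (m + n).
    n acts as a phantom sample size, nu as the prior mean."""
    return (m * mu_data + n * nu) / (m + n)

mu_data = 25.0 / 12.0   # empirical mean 'dog' count per document (sum = 25, m = 12)
print(posterior_mean_parameter(mu_data, m=12, nu=1.0, n=3))   # shrunk towards nu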
2.4.4 An Example
Assume we would like to build a language model based on available doc-
uments. For instance, a linguist might be interested in estimating the fre-
quency of words in Shakespeare’s collected works, or one might want to
compare the change with respect to a collection of webpages. While mod- els
describing documents by treating them as bags of words which all have been
obtained independently of each other are exceedingly simple, they are
valuable for quick-and-dirty content filtering and categorization, e.g. a spam
filter on a mail server or a content filter for webpages.
Hence we model a document d as a multinomial distribution: denote by
wi for i ∈ {1, . . . , md} the words in d. Moreover, denote by p(w|θ) the probability
of occurrence of word w, then under the assumption that the words are
independently drawn, we have
p(d|θ) = Π_{i=1}^{m_d} p(w_i|θ).  (2.71)
It is our goal to find parameters θ such that p(d|θ) is accurate. For a given
collection D of documents denote by mw the number of counts for word w in
the entire collection. Moreover, denote by m the total number of words in
the entire collection. In this case we have

p(D|θ) = Π_i p(d_i|θ) = Π_w p(w|θ)^{m_w}.  (2.72)
where G and H are the first and second derivatives of − log p(D, θ) with respect
to θ. The quadratic expression can be minimized with respect to δ by choosing
δ = −H^{−1}G, and we can fashion an update algorithm from this by letting
θ ← θ − H^{−1}G, iterating until the updates are convergent. Note that the prior
on θ ensures that H is well conditioned even in the case where the variance of
φ(x) is not. In practice this means that the prior ensures fast convergence of
the optimization algorithm.

  H = Var_{x∼p(x|θ)}[φ(x)] + (1/(mσ²)) · 1
  Update θ ← θ − H^{−1}G
end while
return θ
2.5 Sampling
So far we considered the problem of estimating the underlying probability
density, given a set of samples drawn from that density. Now let us turn to
the converse problem, that is, how to generate random variables given the
underlying probability density. In other words, we want to design a random
variable generator. This is useful for a number of reasons:
We may encounter probability distributions where optimization over suit-
able model parameters is essentially impossible and where it is equally im-
possible to obtain a closed form expression of the distribution. In these cases
it may still be possible to perform sampling to draw examples of the kind
of data we expect to see from the model. Chapter ?? discusses a number of
graphical models where this problem arises.
Secondly, assume that we are interested in testing the performance of a
network router under different load conditions. Instead of introducing the under-
development router in a live network and wreaking havoc, one could
78 2 Density Estimation
estimate the probability density of the network traffic under various load
conditions and build a model. The behavior of the network can then be
simulated by using a probabilistic model. This involves drawing random
variables from an estimated probability distribution.
Carrying on, suppose that we generate data packets by sampling and see
anomalous behavior in our router. In order to reproduce and debug
this problem one needs access to the same set of random packets which
caused the problem in the first place. In other words, it is often convenient
if our random variable generator is reproducible. At first blush this seems like
a contradiction. After all, our random number generator is supposedto
generate random variables. This is less of a contradiction if we consider how
random numbers are generated in a computer — given a particular
initialization (which typically depends on the state of the system, e.g. time,
disk size, bios checksum, etc.) the random number algorithm produces a
sequence of numbers which, for all practical purposes, can be treated as iid.
A simple method is the linear congruential generator [PTVF94]
xi+1 = (axi + b) mod c.
The performance of these iterations depends significantly on the choice of the
constants a, b, c. For instance, the GNU C compiler uses a = 1103515245, b =
12345 and c = 2³². In general b and c need to be relatively prime and a − 1
needs to be divisible by all prime factors of c and by 4. It is very much
advisable not to attempt implementing such generators on one’s own unless
it is absolutely necessary.
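Purely for illustration (and, following the advice above, not for serious use), a linear congruential generator with the constants quoted in the text might be sketched as follows.

def lcg(seed, a=1103515245, b=12345, c=2 ** 32):
    """Linear congruential generator x_{i+1} = (a*x_i + b) mod c.
    Constants as quoted in the text; a toy illustration, not a real PRNG."""
    x = seed
    while True:
        x = (a * x + b) % c
        yield x / c          # rescale to [0, 1)

gen = lcg(seed=42)
print([next(gen) for _ in range(3)])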
Useful desiderata for a pseudo random number generator (PRNG) are that
for practical purposes it is statistically indistinguishable from a sequence of
iid data. That is, when applying a number of statistical tests, we will accept
the null-hypothesis that the random variables are iid. See Chapter ?? for
a detailed discussion of statistical testing procedures for random variables.
In the following we assume that we have access to a uniform RNG U [0, 1]
which draws random numbers uniformly from the range [0, 1].
Fig. 2.13. Left: discrete probability distribution over 5 possible outcomes. Right:
associated cumulative distribution function. When sampling, we draw x uniformly
at random from U [0, 1] and compute the inverse of F .
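The procedure of Figure 2.13 can be sketched directly: build the cumulative distribution function of a discrete distribution and invert it on uniform draws. The probability vector below is invented for illustration.

import numpy as np

def sample_discrete(p, size, seed=0):
    """Inverse transform sampling for a discrete distribution p over {0,...,k-1}:
    draw z ~ U[0,1] and return the smallest index whose CDF value exceeds z."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(p)
    z = rng.random(size)
    return np.searchsorted(cdf, z)

p = np.array([0.1, 0.3, 0.05, 0.35, 0.2])   # hypothetical distribution over 5 outcomes
draws = sample_discrete(p, size=10000)
print(np.bincount(draws) / len(draws))       # empirical frequencies approximate p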
variable x := φ(z) is drawn from |∇_x φ^{−1}(x)| · p(φ^{−1}(x)). Here |∇_x φ^{−1}(x)|
denotes the determinant of the Jacobian of φ^{−1}. In particular, if F denotes the
cumulative distribution function F(x) = ∫_{−∞}^x p(x′) dx′, then x := F^{−1}(z)
converts samples z ∼ U[0, 1] to samples drawn from p(x).
We now apply this strategy to a number of univariate distributions. One of
the most common cases is sampling from a discrete distribution.
Fig. 2.15. Red: true density of the standard normal distribution (red line) is con-
trasted with the histogram of 20,000 random variables generated by the Box-Müller
transform.
The key observation is that the joint distribution p(x, y) is radially symmetric,
i.e. it only depends on the radius r² = x² + y². Hence we may perform a
variable substitution in polar coordinates via the map φ where

x = r cos θ and y = r sin θ, hence (x, y) = φ^{−1}(r, θ).  (2.81)

This allows us to express the density in terms of (r, θ) via

p(r, θ) = p(φ^{−1}(r, θ)) |∇_{r,θ} φ^{−1}(r, θ)| = (1/2π) e^{−½ r²} |det [cos θ, sin θ; −r sin θ, r cos θ]| = (r/2π) e^{−½ r²}.
The fact that p(r, θ) is constant in θ means that we can easily sample θ ∈
[0, 2π] by drawing a random variable, say zθ from U [0, 1] and rescaling it with
2π. To obtain a sampler for r we need to compute the cumulative distribution
function for p(r) = r e^{−½ r²}:

F(r′) = ∫_0^{r′} r e^{−½ r²} dr = 1 − e^{−½ r′²}  and hence  r = F^{−1}(z) = √(−2 log(1 − z)).  (2.82)
Observing that z ∼ U [0, 1] implies that 1 − z ∼ U [0, 1] yields the following
sampler: draw zθ, zr ∼ U [0, 1] and compute x and y by
x = √(−2 log z_r) cos(2π z_θ)  and  y = √(−2 log z_r) sin(2π z_θ).
Note that the Box-Müller transform yields two independent Gaussian ran-
dom variables. See Figure 2.15 for an example of the sampler.
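A direct transcription of the Box-Müller sampler as a short sketch; it reproduces the histogram of Figure 2.15.

import numpy as np

def box_muller(n, seed=0):
    """Draw n pairs of standard normal samples from uniform random variables."""
    rng = np.random.default_rng(seed)
    z_r = 1.0 - rng.random(n)            # in (0, 1] so the log is finite
    z_theta = rng.random(n)
    r = np.sqrt(-2.0 * np.log(z_r))
    return r * np.cos(2 * np.pi * z_theta), r * np.sin(2 * np.pi * z_theta)

x, y = box_muller(20000)
print(x.mean(), x.std())   # approximately 0 and 1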
p(x, y) = 1/π if x² + y² ≤ 1 and 0 otherwise. Using the variable transform (2.81) yields

p(r, θ) = p(φ^{−1}(r, θ)) |∇_{r,θ} φ^{−1}(r, θ)| = r/π if r ≤ 1 and 0 otherwise.  (2.84)
Fig. 2.16. Rejection sampler. Left: samples drawn from the uniform distribution on [0,
1]2. Middle: the samples drawn from the uniform distribution on the unit disc are all
the points in the grey shaded area. Right: the same procedure allows us to sample
uniformly from arbitrary sets.
Fig. 2.17. Accept reject sampling for the Beta(2, 5) distribution. Left: Samples are
generated uniformly from the blue rectangle (shaded area). Only those samples
which fall under the red curve of the Beta(2, 5) distribution (darkly shaded area)
are accepted. Right: The true density of the Beta(2, 5) distribution (red line) is
contrasted with the histogram of 10,000 samples drawn by the rejection sampler.
repeat
  draw x ∼ q(x) and t ∼ U[0, 1]
until ct ≤ p(x)/q(x)
return x
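A sketch of this rejection sampler for the Beta(2, 5) example of Figure 2.17: the proposal q is uniform on [0, 1] and c = 2.5 is an upper bound on the density, whose maximum p(0.2) ≈ 2.46.

import numpy as np

def beta25_density(x):
    # Beta(2,5): Gamma(7)/(Gamma(2)Gamma(5)) * x * (1-x)^4 = 30 x (1-x)^4
    return 30.0 * x * (1.0 - x) ** 4

def rejection_sample_beta25(n, seed=0):
    """Accept x ~ U[0,1] whenever c*t <= p(x)/q(x), with q(x) = 1 and c = 2.5."""
    rng = np.random.default_rng(seed)
    c, samples = 2.5, []
    while len(samples) < n:
        x, t = rng.random(), rng.random()
        if c * t <= beta25_density(x):
            samples.append(x)
    return np.array(samples)

s = rejection_sample_beta25(10000)
print(s.mean())   # approximately 2/7, the mean of Beta(2, 5)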
Proof Denote by Z the event that the sample drawn from q is accepted.
Then by Bayes rule the probability Pr(x|Z) can be written as follows

Pr(x|Z) = Pr(Z|x) Pr(x) / Pr(Z) = [p(x)/(c q(x))] · q(x) / c^{−1} = p(x).  (2.85)

Here we used that Pr(Z) = ∫ Pr(Z|x) q(x) dx = ∫ c^{−1} p(x) dx = c^{−1}.
Note that the algorithm of Example 2.12 is a special case of such a rejection
sampler — we majorize p_X by the uniform distribution rescaled by 1/p(X).
To sample from the Normal distribution with zero mean and unit variance using
a Laplace distribution q(x|λ) = (λ/2) e^{−λ|x|} as the proposal, it turns out that
choosing λ = 1 yields the most efficient sampling scheme (see Problem 2.27) with

p(x) ≤ √(2e/π) q(x|λ = 1).

As illustrated in Figure 2.18, we first generate x ∼ q(x|λ = 1) using the
inverse transform method (see Example 2.9 and Problem 2.21) and t ∼ U[0, 1].
If t ≤ p(x)/(√(2e/π) q(x|λ = 1)) we accept x, otherwise we reject it. The
efficiency of this scheme is √(π/2e).
Fig. 2.18. Rejection sampling for the Normal distribution (red curve). Samples are
generated uniformly from the Laplace distribution rescaled by √(2e/π). Only those
samples which fall under the red curve of the standard normal distribution (darkly
shaded area) are accepted.
Problems
Problem 2.1 (Bias Variance Decomposition {1}) Prove that the vari-
ance VarX [x] of a random variable can be written as EX [x2 ] − EX [x]2 .
Problem 2.2 (Moment Generating Function {2}) Prove that the char-
acteristic function can be used to generate moments as given in (2.12). Hint:
use the Taylor expansion of the exponential and apply the differential oper-
ator before the expectation.
Problem 2.4 (Weak Law of Large Numbers {2}) In analogy to the proof
of the central limit theorem prove the weak law of large numbers. Hint: use
a first order Taylor expansion of e^{iωt} = 1 + iωt + o(t) to compute an approx-
imation of the characteristic function. Next compute the limit m → ∞ for
φX̄m . Finally, apply the inverse Fourier transform to associate the constant
distribution at the mean µ with it.
Problem 2.5 (Rates and confidence bounds {3}) Show that the rate
of Hoeffding is tight — get the bound from the central limit theorem and compare it
to the Hoeffding rate.
Problem 2.6 Why can’t we just use each chip on the wafer as a random
variable? Give a counterexample. Give bounds if we actually were allowed to
do this.
Problem 2.7 (Union Bound) Work on many bounds at the same time.
We only have logarithmic penalty.
Problem 2.9 (Randomized Projections {3}) Prove that the random- ized
projections converge.
Problem 2.10 (The Count-Min Sketch {5}) Prove the projection trick
Problem 2.11 (Parzen windows with triangle kernels {1}) Suppose you
are given the following data: X = {2, 3, 3, 5, 5}. Plot the estimated density
using a kernel density estimator with the following kernel:

k(u) = 0.5 − 0.25 |u|  if |u| ≤ 2,  and 0 otherwise.
Problem 2.12 Gaussian process link with Gaussian prior on natural pa-
rameters
Problem 2.15 (Multivariate Gaussian {1}) Prove that Σ > 0 is a necessary
and sufficient condition for the normal distribution to be well defined
using the exponential families parametrization. Next show that while the
mean estimate µ̂ is unbiased, the variance estimate has a slight bias of O(1/m).
To see this, take the expectation with respect to σ̂².
Problem 2.20 (cdf of Logistic random variable {1}) Show that the cdf of
the Logistic random variable (??) is given by (??).
Problem 2.25 (Argmax of the Beta(a, b) distribution {1}) Show that the
mode of the Beta(a, b) distribution is given by (2.87).
Problem 2.26 (Accept reject sampling for the unit disk {2}) Give at
least TWO different accept-reject based sampling schemes to generate sam-
ples uniformly at random from the unit disk. Compute their efficiency.
Problem 2.30 (Uniform samples from a disk {2}) Show how the ideas
described in Section ?? can be generalized to draw samples uniformly at random
from an axis parallel ellipse: {(x, y) : x²/a² + y²/b² ≤ 1}.
3
Optimization
Here x_i are the training instances and y_i are the corresponding labels. The loss
function l measures the discrepancy between y_i and the predictions f(x_i).
Finding the optimal f involves solving an optimization problem.
This chapter provides a self-contained overview of some basic concepts and
tools from optimization, especially geared towards solving machine learning
problems. In terms of concepts, we will cover topics related to convexity,
duality, and Lagrange multipliers. In terms of tools, we will cover a variety
of optimization algorithms including gradient descent, stochastic gradient
descent, Newton’s method, and Quasi-Newton methods. We will also look
at some specialized algorithms tailored towards solving Linear Programming
and Quadratic Programming problems which often arise in machine learning
problems.
3.1 Preliminaries
Minimizing an arbitrary function is, in general, very difficult, but if the ob-
jective function to be minimized is convex then things become considerably
simpler. As we will see shortly, the key advantage of dealing with convex
functions is that a local optimum is also the global optimum. Therefore, well
developed tools exist to find the global minimum of a convex function. Conse-
quently, many machine learning algorithms are now formulated in terms of
convex optimization problems. We briefly review the concept of convex sets
and functions in this section.
Intuitively, what this means is that the line joining any two points x and y from
the set C lies inside C (see Figure 3.1). It is easy to see (Exercise 3.1) that
intersections of convex sets are also convex.
Fig. 3.1. The convex set (left) contains the line joining any two points that belong
to the set. A non-convex set (right) does not satisfy this property.
Σ Σ
A vector sum i λix i is called a convex combination if λi ≥ 0 and i λi =
1. Convex combinations are helpful in defining a convex hull:
Definition 3.2 (Convex Hull) The convex hull, conv(X), of a finite sub-
set X = {x1, . . . , xn} of Rn consists of all convex combinations of x1, . . . , x n.
1.5
1.0
0.5
f(x)
f(x)
0.0
0.5
1.0
1.5
x x
Fig. 3.2. A convex function (left) satisfies (3.4); the shaded region denotes its epi-
graph. A nonconvex function (right) does not satisfy (3.4).
If f is twice differentiable, then f is convex if, and only if, its Hessian is
positive semi-definite, that is,

∇²f(x) ≥ 0.  (3.9)

For twice differentiable strictly convex functions, the Hessian matrix is pos-
itive definite, that is, ∇²f(x) > 0. We briefly summarize some operations
which preserve convexity:
Addition If f1 and f2 are convex, then f1 + f2 is also convex.
Scaling If f is convex, then αf is convex for α > 0.
Affine Transform If f is convex, then g(x) = f (Ax + b) for some matrix
A and vector b is also convex.
Adding a Linear Function If f is convex, then g(x) = f(x) + ⟨a, x⟩ for some vector
a is also convex.
Subtracting a Linear Function If f is convex, then g(x) = f(x) − ⟨a, x⟩ for some vector
a is also convex.
Pointwise Maximum If f_i are convex, then g(x) = max_i f_i(x) is also convex.
Scalar Composition If f (x) = h(g(x)), then f is convex if a) g is convex,
and h is convex, non-decreasing or b) g is concave, and
h is convex, non-increasing.
Fig. 3.3. Left: Convex Function in two variables. Right: the corresponding convex
below-sets {x|f (x) ≤ c}, for different values of c. This is also called a contour plot.
is convex.
Hence, for all 0 < λ < 1, we have (λx + (1 − λ)x′) ∈ X_c, which proves the
claim. Figure 3.3 depicts this situation graphically.
which shows that for small values of λ we have f (z(λ)) < f (x), thus showing
that x is not optimal.
The reverse implication follows from (3.7) by noting that f(x′) ≥ f(x)
whenever (3.12) holds.
One way to ensure that (3.12) holds is to set ∇f(x) = 0. In other words,
minimizing a convex function is equivalent to finding an x such that ∇f(x) =
0. Therefore, the first order conditions are both necessary and sufficient
when minimizing a convex function.
3.1.3 Subgradients
So far, we worked with differentiable convex functions. The subgradient is a
generalization of gradients appropriate for convex functions, including those
which are not necessarily smooth.
The set of all subgradients at a point is called the subdifferential, and is de-
noted by ∂xf (x). If this set is not empty then f is said to be subdifferentiable
at x. On the other hand, if this set is a singleton then, the function is said
to be differentiable at x. In this case we use ∇f (x) to denote the gradient
of f . Convex functions are subdifferentiable everywhere in their domain. We now
state some simple rules of subgradient calculus:
Example 3.2 (Negative Entropy) Let ∆_n = {x : Σ_i x_i = 1 and x_i ≥ 0}
be the n-dimensional simplex, and f : ∆_n → R be the negative entropy:

f(x) = Σ_i x_i log x_i.  (3.15)

Then f is 1-strongly convex with respect to the ‖·‖₁ norm on the simplex
(see Problem 3.7).
If f is a σ-strongly convex function then one can show the following prop-
erties (Exercise 3.8). Here x, x′ are arbitrary and µ ∈ ∂f(x) and µ′ ∈ ∂f(x′).

f(x′) ≥ f(x) + ⟨x′ − x, µ⟩ + (σ/2) ‖x − x′‖²  (3.16)
f(x′) ≤ f(x) + ⟨x′ − x, µ⟩ + (1/2σ) ‖µ − µ′‖²  (3.17)
⟨x − x′, µ − µ′⟩ ≥ σ ‖x − x′‖²  (3.18)
⟨x − x′, µ − µ′⟩ ≤ (1/σ) ‖µ − µ′‖²  (3.19)
L·I ≥ ∇²f(x).  (3.21)

f*(x*) = sup_x {⟨x, x*⟩ − f(x)}.  (3.26)
Fig. 3.4. A convex function is always lower bounded by its first order Taylor ap-
proximation. This is true even if the function is not differentiable (see Figure 3.5)
Fig. 3.6. f(x) is the value of the function at x, while f(x′) + ⟨x − x′, ∇f(x′)⟩ denotes the
first order Taylor expansion of f around x′, evaluated at x. The difference between the
two is the Bregman divergence ∆_f(x, x′).
One can calculate ∇f(x) = log x, where log x is the component-wise logarithm
of the entries of x, and write the Bregman divergence

∆_f(x, x′) = Σ_i x_i log x_i − Σ_i x_i − Σ_i x_i′ log x_i′ + Σ_i x_i′ − ⟨x − x′, log x′⟩
           = Σ_i x_i log(x_i/x_i′) + x_i′ − x_i.
Similarly, for f(x) = ½ ‖x‖_p² one obtains

∆_f(x, x′) = ½ ‖x‖_p² − ½ ‖x′‖_p² − Σ_i (x_i − x_i′) sign(x_i′) |x_i′|^{p−1} ‖x′‖_p^{2−p}.
Proof
∆f (x, y) + ∆f (y, z) = f (x) − f (y) − ⟨ x − y, ∇f (y)⟩ + f (y) − f (z) − ⟨ y − z, ∇f (z)⟩
= f (x) − f (z) − ⟨ x − y, ∇f (y)⟩ − ⟨ y − z, ∇f (z)⟩
= ∆f (x, z) + ⟨ ∇f (z) − ∇f (y), x − y⟩ .
4:  if J′((a_t + b_t)/2) > 0 then
5:    a_{t+1} = a_t and b_{t+1} = (a_t + b_t)/2
6:  else
7:    a_{t+1} = (a_t + b_t)/2 and b_{t+1} = b_t
8:  end if
9:  t = t + 1
10: end while
11: Return: (a_t + b_t)/2
simple problem has many applications. As we will see later, many optimiza-
tion methods find a direction of descent and minimize the objective function
along this direction1; this subroutine is called a line search. Algorithm 3.1
depicts a simple line search routine based on interval bisection.
Before we show that Algorithm 3.1 converges, let us first derive an im-
portant property of convex functions of one variable. For a differentiable one-
dimensional convex function J (3.7) reduces to
Instead of the simple bisecting line search, more sophisticated line searches
such as the Moré-Thuente line search or the golden bisection rule can also be
used to speed up convergence (see [NW99] Chapter 3 for an extensive
discussion).
Inexact Line Search: Instead of minimizing J(w_t − η∇J(w_t)) we could simply
look for a stepsize which results in sufficient decrease in the objective function
value. One popular set of such conditions is the Wolfe conditions

J(w_{t+1}) ≤ J(w_t) + c₁ η_t ⟨∇J(w_t), w_{t+1} − w_t⟩  (sufficient decrease)  (3.42)
⟨∇J(w_{t+1}), w_{t+1} − w_t⟩ ≥ c₂ ⟨∇J(w_t), w_{t+1} − w_t⟩  (curvature)  (3.43)

with 0 < c₁ < c₂ < 1 (see Figure 3.7). The Wolfe conditions are also called
the Armijo-Goldstein conditions. If only the sufficient decrease condition (3.42)
is enforced, then it is called the Armijo rule.
Fig. 3.7. The sufficient decrease condition (left) places an upper bound on the
acceptable stepsizes while the curvature condition (right) places a lower bound on
the acceptable stepsizes.
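A sketch of a backtracking line search that enforces the Armijo rule (3.42); the constant c₁ and the shrinking factor are typical but arbitrary choices, and the quadratic test function is invented.

import numpy as np

def backtracking_step(J, grad_J, w, c1=1e-4, eta0=1.0, shrink=0.5, max_iter=50):
    """Shrink the stepsize until J(w - eta*g) <= J(w) - c1*eta*||g||^2,
    i.e. the sufficient decrease condition (3.42) for the direction -g."""
    g = grad_J(w)
    eta = eta0
    for _ in range(max_iter):
        if J(w - eta * g) <= J(w) - c1 * eta * np.dot(g, g):
            break
        eta *= shrink
    return w - eta * g

J = lambda w: np.sum(w ** 2) + np.sum(w)     # a simple convex test function
grad_J = lambda w: 2 * w + 1
print(backtracking_step(J, grad_J, np.array([3.0, -2.0])))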
Proof Plugging in η_t = 1/L and rearranging (3.45) obtains

min_{0≤t≤T} ‖∇J(w_t)‖ ≤ √(2L(J(w_0) − J(w_*))/(T + 1)).

Solving for

√(2L(J(w_0) − J(w_*))/(T + 1)) = ϵ

shows that T is O(1/ϵ²) as claimed.
If, in addition to having a Lipschitz continuous gradient, J is σ-strongly
convex, then more can be said. First, one can translate convergence in
‖∇J(w_t)‖ to convergence in function values. Towards this end, use (3.17) to
write

J(w_t) ≤ J(w_*) + (1/2σ) ‖∇J(w_t)‖².

Therefore, it follows that whenever ‖∇J(w_t)‖ < ϵ we have J(w_t) − J(w_*) <
ϵ²/2σ. Furthermore, we can strengthen the rates of convergence.
In this case gradient descent reaches an ϵ accurate solution, that is
J(w_T) − J(w_*) ≤ ϵ, after at most

T = log((J(w_0) − J(w_*))/ϵ) / log(1/c)  (3.47)

iterations.
Proof Combining (3.46) with ‖∇J(w_t)‖² ≥ 2σ(J(w_t) − J(w_*)), and using
the definition of c one can write
c(J (w t ) − J(w ∗)) ≥ J(wt+1 ) − J(w ∗).
Applying the above equation recursively
cT (J(w 0 ) − J(w ∗)) ≥ J(wT ) − J (w ∗).
Solving for
ϵ = cT (J (w 0 ) − J (w ∗))
and rearranging yields (3.47).
When applied to practical problems which are not strongly convex, gradient
descent yields a low accuracy solution within a few iterations. However, as
the iterations progress the method "stalls" and no further increase in accuracy
is obtained because of the O(1/ϵ²) rates of convergence. On the other hand,
if the function is strongly convex, then gradient descent converges linearly, that
is, in O(log(1/ϵ)) iterations. However, the number
Fig. 3.8. Convergence of gradient descent with exact line search on two quadratic
problems (3.48). The problem on the left is ill-conditioned, whereas the problem
on the right is well-conditioned. We plot the contours of the objective function,
and the steps taken by gradient descent. As can be seen gradient descent converges
fast on the well conditioned problem, while it zigzags and takes many iterations to
converge on the ill-conditioned problem.
It is easy to verify that choosing f(·) = ½ ‖·‖² recovers the usual gradient
descent updates. On the other hand if we choose f to be the un-normalized
entropy (3.30) then ∇f(·) = log(·) and therefore (3.54) specializes to

w_{t+1} = exp(log(w_t) − η_t ∇J(w_t)) = w_t exp(−η_t ∇J(w_t)),  (3.55)
which is sometimes called the Exponentiated Gradient (EG) update.
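A sketch of the exponentiated gradient update (3.55); for a problem over the probability simplex one would typically renormalize w after each step, which we do here for illustration, and the loss gradient is an invented constant vector.

import numpy as np

def eg_step(w, grad, eta):
    """Exponentiated gradient update w <- w * exp(-eta * grad),
    renormalized onto the simplex (a common variant)."""
    w = w * np.exp(-eta * grad)
    return w / w.sum()

w = np.ones(4) / 4                        # start at the uniform distribution
grad = np.array([0.3, 0.1, 0.7, 0.2])     # hypothetical gradient of J at w
for _ in range(10):
    w = eg_step(w, grad, eta=0.5)
print(w)   # mass concentrates on the coordinate with the smallest gradient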
Theorem 3.14 Let J be a convex function and J(w_*) denote its minimum
value. The mirror descent updates (3.54) with a σ-strongly convex function
f satisfy

min_t J(w_t) − J(w_*) ≤ [∆_f(w_*, w_1) + (1/2σ) Σ_t η_t² ‖∇J(w_t)‖²] / Σ_t η_t.
Proof Using the convexity of J (see (3.7)) and (3.54) we can write

J(w_*) ≥ J(w_t) + ⟨w_* − w_t, ∇J(w_t)⟩
       ≥ J(w_t) − (1/η_t) ⟨w_* − w_t, ∇f(w_{t+1}) − ∇f(w_t)⟩.

Summing over t = 1, …, T and rearranging in terms of Bregman divergences yields

∆_f(w_*, w_1) − ∆_f(w_*, w_{T+1}) + Σ_t ∆_f(w_t, w_{t+1}) ≥ Σ_t η_t (J(w_t) − J(w_*)).

Since ∆_f(w_*, w_{T+1}) ≥ 0, it follows that

[∆_f(w_*, w_1) + Σ_t ∆_f(w_t, w_{t+1})] / Σ_t η_t ≥ min_t J(w_t) − J(w_*).  (3.56)

By the strong convexity of f,

∆_f(w_t, w_{t+1}) ≤ (1/2σ) ‖∇f(w_t) − ∇f(w_{t+1})‖² = (1/2σ) η_t² ‖∇J(w_t)‖².  (3.57)

The proof is completed by plugging (3.57) into (3.56).
Aw = b.  (3.59)

Definition 3.16 (Conjugate Directions) Non-zero vectors p_t and p_{t′} are said
to be conjugate with respect to a symmetric positive definite matrix A if
p_{t′}^T A p_t = 0 whenever t ≠ t′.

Conjugate directions {p_0, …, p_{n−1}} are linearly independent and form a basis.
To see this, suppose the p_t's are not linearly independent. Then there exist
non-zero coefficients σ_t such that Σ_t σ_t p_t = 0. Since the p_t's are conjugate
directions, p_{t′}^T A(Σ_t σ_t p_t) = Σ_t σ_t p_{t′}^T A p_t = σ_{t′} p_{t′}^T A p_{t′} = 0 for all t′.
Since A is positive definite this implies that σ_{t′} = 0 for all t′, a contradiction.
As it turns out, the conjugate directions can be generated iteratively as
follows: Starting with any w_0 ∈ R^n define p_0 = −g_0 = b − Aw_0, and set

α_t = −g_t^T p_t / (p_t^T A p_t)  (3.60a)
w_{t+1} = w_t + α_t p_t  (3.60b)
g_{t+1} = A w_{t+1} − b  (3.60c)
β_{t+1} = g_{t+1}^T A p_t / (p_t^T A p_t)  (3.60d)
p_{t+1} = −g_{t+1} + β_{t+1} p_t  (3.60e)
The following theorem asserts that the pt generated by the above procedure
are indeed conjugate directions.
Theorem 3.17 Suppose the t-th iterate generated by the conjugate gradient
method (3.60) is not the solution of (3.59), then the following properties hold:
Step 1: We first prove that (3.63) holds. Using (3.60c), (3.60b) and (3.60a)

p_j^T g_{t+1} = p_j^T (A w_{t+1} − b)
            = p_j^T (A w_t + α_t A p_t − b)
            = p_j^T (A w_t − b) − (g_t^T p_t / (p_t^T A p_t)) p_j^T A p_t
            = p_j^T g_t − (p_j^T A p_t / (p_t^T A p_t)) g_t^T p_t.

For j = t, both terms cancel out, while for j < t both terms vanish due to
the induction hypothesis.
Step 2: Next we prove that (3.61) holds. Using (3.60e) and (3.60d)

p_{t+1}^T A p_j = −g_{t+1}^T A p_j + β_{t+1} p_t^T A p_j.
By the definition of βt+1 (3.60d) the above expression vanishes for j = t. For
j < t, the first term is zero because Apj ∈ span{p0 , p1 , . . . , pj +1 }, a subspace
orthogonal to gt+1 as already shown in Step 1. The induction hypothesis
guarantees that the second term is zero.
Step 4 Clearly, (3.61) and (3.60e) imply (3.62). This concludes the proof.
Proof Let w denote the minimizer of (3.48). Since the pt’s form a basis
w − w0 = σ0p0 + . . . + σn− 1pn− 1,
for some scalars σt. Our proof strategy will be to show that the coefficients
Algorithm 3.3 Conjugate Gradient
1: Input: Initial point w_0, residual norm tolerance ϵ
2: Set t = 0, g_0 = A w_0 − b, and p_0 = −g_0
3: while ‖A w_t − b‖ ≥ ϵ do
4:   α_t = g_t^T g_t / (p_t^T A p_t)
5:   w_{t+1} = w_t + α_t p_t
6:   g_{t+1} = g_t + α_t A p_t
7:   β_{t+1} = g_{t+1}^T g_{t+1} / (g_t^T g_t)
8:   p_{t+1} = −g_{t+1} + β_{t+1} p_t
9:   t = t + 1
10: end while
11: Return: w_t
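The following is a line-by-line transcription of Algorithm 3.3 as a sketch; for a symmetric positive definite A it reaches the solution of Aw = b in at most n iterations (up to numerical error). The 2×2 test problem is invented.

import numpy as np

def conjugate_gradient(A, b, w0=None, eps=1e-10):
    """Conjugate gradient for Aw = b with symmetric positive definite A."""
    w = np.zeros_like(b) if w0 is None else w0.astype(float)
    g = A @ w - b
    p = -g
    while np.linalg.norm(A @ w - b) >= eps:
        alpha = (g @ g) / (p @ (A @ p))
        w = w + alpha * p
        g_new = g + alpha * (A @ p)
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # both give the same answer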
On the other hand, following the iterative process (3.60b) from w0 until wt
yields
wt − w0 = α0p0 + . . . + αt− 1pt− 1.
Again premultiplying with p_t^T A and using conjugacy

p_t^T A(w_t − w_0) = 0.  (3.68)

Substituting (3.68) into (3.67) produces

σ_t = p_t^T A(w − w_t) / (p_t^T A p_t) = −g_t^T p_t / (p_t^T A p_t),  (3.69)

thus showing that σ_t = α_t.
Observe that the g t+1 computed via (3.60c) is nothing but the gradient of
J(wt+1). Furthermore, consider the following one dimensional optimization
problem:
min φt(α) := J(wt + αpt).
α∈ R
Fletcher-Reeves: Set α_t = argmin_α J(w_t + α p_t) and β_{t+1} = g_{t+1}^T g_{t+1} / (g_t^T g_t).
Q_t(w) := J(w_t) + ⟨∇J(w_t), w − w_t⟩ + ½ (w − w_t)^T ∇²J(w_t) (w − w_t)  (3.70)
of J.
Proof First notice that

∇J(w_t) − ∇J(w_*) = ∫_0^1 ∇²J(w_t + τ(w_* − w_t)) (w_t − w_*) dτ.  (3.74)

Next using the fact that ∇²J(w_t) is invertible and the gradient vanishes at
the optimum (∇J(w_*) = 0), write

w_{t+1} − w_* = w_t − w_* − ∇²J(w_t)^{−1} ∇J(w_t)
             = ∇²J(w_t)^{−1} [∇²J(w_t)(w_t − w_*) − (∇J(w_t) − ∇J(w_*))].  (3.75)

Using (3.75), (3.74), and the Lipschitz continuity of ∇²J,

‖∇J(w_t) − ∇J(w_*) − ∇²J(w_t)(w_t − w_*)‖
  = ‖∫_0^1 [∇²J(w_t + τ(w_* − w_t)) − ∇²J(w_t)] (w_t − w_*) dτ‖
  ≤ ∫_0^1 ‖∇²J(w_t + τ(w_* − w_t)) − ∇²J(w_t)‖ ‖w_t − w_*‖ dτ
  ≤ ‖w_t − w_*‖² ∫_0^1 M τ dτ = (M/2) ‖w_t − w_*‖².  (3.76)

Finally use (3.75) and (3.76) to conclude that

‖w_{t+1} − w_*‖ ≤ (M/2) ‖∇²J(w_t)^{−1}‖ ‖w_t − w_*‖² ≤ (NM/2) ‖w_t − w_*‖².
Fig. 3.9. The blue solid line depicts the one dimensional convex function J(w) = w⁴
+ 20w² + w. The green dotted-dashed line represents the first order Taylor
approximation to J(w), while the red dashed line represents the second order Taylor
approximation, both evaluated at w = 2.
Unlike Newton’s method which uses the Hessian to build its quadratic model
(3.70), BFGS uses the matrix Ht > 0, which is a positive-definite estimate
of the Hessian. A quasi-Newton direction of descent is found by minimizing
Qt(w):
The stepsize ηt > 0 is found by a line search obeying the Wolfe conditions
(3.80)
When updating our model it is reasonable to expect that the gradient of
Qt+1 should match the gradient of J at wt and wt+1. Clearly,
∇Qt+1(w) = ∇J(wt+1) + Ht+1(w − wt+1), (3.81)
which implies that ∇Qt+1(wt+1 ) = ∇J(wt+1), and hence our second con-
dition is automatically satisfied. In order to satisfy our first condition, we
require
∇Qt+1(wt) = ∇J(wt+1 ) + Ht+1(wt − wt+1) = ∇J(wt). (3.82)
By rearranging, we obtain the so-called secant equation:
Ht+1st = yt, (3.83)
where st := wt+1 − wt and yt := ∇J(wt+1) − ∇J(wt) denote the most recent
step along the optimization trajectory in parameter and gradient space,
respectively. Since H_{t+1} is a positive definite matrix, pre-multiplying the
secant equation by s_t^T yields the curvature condition

s_t^T y_t > 0.  (3.84)
If the curvature condition is satisfied, then there are an infinite number
of matrices Ht+1 which satisfy the secant equation (the secant equation
represents n linear equations, but the symmetric matrix Ht+1 has n(n+1)/2
degrees of freedom). To resolve this issue we choose the closest matrix to
Ht which satisfies the secant equation. The key insight of the BFGS comes
from the observation that the descent direction computation (3.78) involves
the inverse matrix B_t := H_t^{−1}. Therefore, we choose a matrix B_{t+1} := H_{t+1}^{−1}
where αt+1 is a scalar and I denotes the identity matrix. Then the secant
equation (3.83) becomes
αt+1st = yt. (3.90)
In general, the above equation cannot be solved. Therefore we use the α_{t+1}
which minimizes ‖α_{t+1} s_t − y_t‖², which yields the Barzilai-Borwein (BB) stepsize

α_{t+1} = s_t^T y_t / (s_t^T s_t).  (3.91)
As it turns out, αt+1 lies between the minimum and maximum eigenvalue of
the average Hessian in the direction st, hence the name Spectral Gradient
method. The parameter update (3.79) is now given by
w_{t+1} = w_t − (1/α_t) ∇J(w_t).  (3.92)
A practical implementation uses safeguards to ensure that the stepsize αt+1
is neither too small nor too large. Given 0 < αmin < αmax < ∞ we compute
α_{t+1} = min(α_max, max(α_min, s_t^T y_t / (s_t^T s_t))).  (3.93)
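A sketch of the spectral gradient method with the safeguarded Barzilai-Borwein stepsize (3.91) and (3.93), applied to an invented ill-conditioned quadratic.

import numpy as np

def spectral_gradient(grad, w0, steps=100, alpha0=1.0,
                      alpha_min=1e-8, alpha_max=1e8):
    """Gradient method with stepsize 1/alpha_t, where alpha_{t+1} = (s'y)/(s's)
    is clipped to [alpha_min, alpha_max] as in (3.93)."""
    w, alpha = w0.astype(float), alpha0
    g = grad(w)
    for _ in range(steps):
        w_new = w - g / alpha
        g_new = grad(w_new)
        if np.linalg.norm(g_new) < 1e-12:
            return w_new
        s, y = w_new - w, g_new - g
        alpha = min(alpha_max, max(alpha_min, (s @ y) / (s @ s)))
        w, g = w_new, g_new
    return w

A = np.diag([1.0, 10.0, 100.0])          # hypothetical quadratic 0.5*w'Aw
grad = lambda w: A @ w
print(spectral_gradient(grad, np.array([1.0, 1.0, 1.0])))   # approaches 0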
its linearization (i.e., first order Taylor approximation). See Figures 3.4 and
3.5 for geometric intuition, and recall (3.7) and (3.13):
This iteratively refines the piecewise linear lower bound JCP and allows us
to get close to the minimum of J (see Figure 3.10 for an illustration).
If w_* denotes the minimizer of J, then clearly each J(w_i) ≥ J(w_*) and
hence min_{0≤i≤t} J(w_i) ≥ J(w_*). On the other hand, since J ≥ J_t^CP it follows
that J(w_*) ≥ J_t^CP(w_t). In other words, J(w_*) is sandwiched between
min_{0≤i≤t} J(w_i) and J_t^CP(w_t) (see Figure 3.11 for an illustration). The cutting
plane method monitors the monotonically decreasing quantity

ϵ_t := min_{0≤i≤t} J(w_i) − J_t^CP(w_t),  (3.97)
Fig. 3.10. A convex function (blue solid curve) is bounded from below by its lin-
earizations (dashed lines). The gray area indicates the piecewise linear lower bound
obtained by using the linearizations. We depict a few iterations of the cutting plane
method. At each iteration the piecewise linear lower bound is minimized and a new
linearization is added at the minimizer (red rectangle). As can be seen, adding more
linearizations improves the lower bound.
Fig. 3.11. A convex function (blue solid curve) with four linearizations evaluated at
four different locations (magenta circles). The approximation gap ϵ3 at the end of
fourth iteration is indicated by the height of the cyan horizontal band i.e., difference
between lowest value of J(w) evaluated so far and the minimum of J4CP(w) (red
diamond).
well known (see e.g., [LNN95, Bel05]) that it can be very slow when new
iterates move too far away from the previous ones (i.e., causing unstable “zig-
zag” behavior in the iterates). In fact, in the worst case the cutting plane
method might require exponentially many steps to converge to an ϵ optimum
solution.
Bundle methods stabilize CPM by augmenting the piecewise linear lower bound
(e.g., J_t^CP(w) in (3.95)) with a prox-function (i.e., a proximity control function)
which prevents overly large steps in the iterates [Kiw90]. Roughly speaking,
there are 3 popular types of bundle methods, namely, proximal [Kiw90], trust
region [SZ92], and level set [LNN95]. All three versions use ½ ‖·‖² as their
prox-function, but differ in the way they compute the new iterate:

proximal:     w_t := argmin_w {(ζ_t/2) ‖w − ŵ_{t−1}‖² + J_t^CP(w)},  (3.98)
trust region: w_t := argmin_w {J_t^CP(w) | ½ ‖w − ŵ_{t−1}‖² ≤ κ_t},  (3.99)
level set:    w_t := argmin_w {½ ‖w − ŵ_{t−1}‖² | J_t^CP(w) ≤ τ_t},  (3.100)
where ŵt− 1 is the current prox-center, and ζt , κt , and τt are positive trade-
off parameters of the stabilization. Although (3.98) can be shown to be
equivalent to (3.99) for appropriately chosen ζt and κt, tuning ζt is rather
difficult while a trust region approach can be used for automatically tuning
where both ci and ei are convex functions. We say that w is feasible if and
only if it satisfies the constraints, that is, ci(w) ≤ 0 for i ∈ I and ei(w) = 0
for i ∈ E.
Recall that w is the minimizer of an unconstrained problem if and only if
‖∇J(w)‖ = 0 (see Lemma 3.6). Unfortunately, when constraints are present
one cannot use this simple characterization of the solution. For instance, the
w at which ‖∇J(w)‖ = 0 may not be a feasible point. To illustrate, consider
the following simple minimization problem (see Figure 3.12):
min_w ½ w²  (3.102a)
s.t. 1 ≤ w ≤ 2.  (3.102b)
for some pre-defined tolerance ϵ > 0. Of course, J_* is unknown and hence the
gap J(w) − J_* cannot be computed in practice. Furthermore, as we showed in
Section 3.3, for constrained optimization problems ‖∇J(w)‖ does not
vanish at the optimal solution. Therefore, we will use the following stopping
Proof First assume that w is feasible, that is, c_i(w) ≤ 0 for i ∈ I and
e_i(w) = 0 for i ∈ E. Since α_i ≥ 0 we have

Σ_{i∈I} α_i c_i(w) + Σ_{i∈E} β_i e_i(w) ≤ 0,  (3.108)

see that max_{α≥0,β} L(w, α, β) → ∞. Similarly, when e_{i′}(w) ≠ 0 let β_{i′} → ∞
if e_{i′}(w) > 0 or β_{i′} → −∞ if e_{i′}(w) < 0 to arrive at the same conclusion.
If we define the Lagrange dual function

D(α, β) = min_w L(w, α, β),  (3.109)
for α ≥ 0 and β, then one can prove the following property, which is often
called as weak duality.
Theorem 3.21 (Weak Duality) The Lagrange dual function (3.109) sat-
isfies
D(α, β) ≤ J(w)
Therefore

D(α, β) = min_w L(w, α, β) = min_w [J(w) + Σ_{i∈I} α_i c_i(w) + Σ_{i∈E} β_i e_i(w)] ≤ J(w)
The proof of the above theorem is quite technical and can be found in
any standard reference (e.g., [BV04]). Therefore we will omit the proof and
proceed to discuss various implications of strong duality. First note that
min_w max_{α≥0,β} L(w, α, β) = max_{α≥0,β} min_w L(w, α, β).  (3.112)
In other words, one can switch the order of minimization over w with max-
imization over α and β. This is called the saddle point property of convex
functions.
Suppose strong duality holds. Given any α ≥ 0 and β such that D(α, β) >
−∞ and a feasible w we can immediately write the duality gap

J(w) − J_* = J(w) − D_* ≤ J(w) − D(α, β),
Suppose the primal and dual optimal values are attained at w_* and (α_*, β_*)
respectively, and consider the following line of argument:

J(w_*) = D(α_*, β_*)  (3.113a)
       = min_w [J(w) + Σ_{i∈I} α*_i c_i(w) + Σ_{i∈E} β*_i e_i(w)]  (3.113b)
       ≤ J(w_*) + Σ_{i∈I} α*_i c_i(w_*) + Σ_{i∈E} β*_i e_i(w_*)  (3.113c)
       ≤ J(w_*).  (3.113d)
c_i(w_*) ≤ 0  ∀i ∈ I  (3.114a)
e_i(w_*) = 0  ∀i ∈ E  (3.114b)
α*_i ≥ 0  (3.114c)
α*_i c_i(w_*) = 0  (3.114d)
∇J(w_*) + Σ_{i∈I} α*_i ∇c_i(w_*) + Σ_{i∈E} β*_i ∇e_i(w_*) = 0.  (3.114e)
The above conditions are called the KKT conditions. If the primal problem is
convex, then the KKT conditions are both necessary and sufficient. In other
words, if ŵ and (α̂, β̂) satisfy (3.114) then ŵ and (α̂, β̂) are primal and dual
optimal with zero duality gap. To see this note that the first two conditions
show that ŵ is feasible. Since αi ≥ 0, L(w, α, β) is convex in w. Finally the
last condition states that ŵ minimizes L(w, α̂, β̂). Since α̂i ci (ŵ) = 0 and
e_i(ŵ) = 0, we have

D(α̂, β̂) = L(ŵ, α̂, β̂) = J(ŵ).
3.3.3 Linear and Quadratic Programs
Linear programs in canonical form are written as

min_w c^T w  (3.115a)
s.t. Aw = b, w ≥ 0.  (3.115b)

A linear program with inequality constraints,

min_w c^T w  (3.116a)
s.t. Aw ≥ b,  (3.116b)

can be brought into this form by introducing non-negative slack variables ξ:

min_{w,ξ} c^T w  (3.117a)
s.t. Aw − ξ = b, ξ ≥ 0.  (3.117b)
Next, we split w into its positive and negative parts w⁺ and w⁻ respectively
by setting w⁺_i = max(0, w_i) and w⁻_i = max(0, −w_i). Using these new
variables, (3.117) becomes

min_{w⁺, w⁻, ξ}  c^T w⁺ − c^T w⁻  (3.118a)
s.t.  A w⁺ − A w⁻ − ξ = b,  w⁺, w⁻, ξ ≥ 0,  (3.118b)

thus yielding a canonical LP (3.115) in the variables w⁺, w⁻ and ξ.
By introducing non-negative Lagrange multipliers α and β one can write
the Lagrangian of (3.115) as
Taking gradients with respect to the primal and dual variables and setting
them to zero obtains
A^T β − α = c  (3.120a)
Aw = b  (3.120b)
α^T w = 0  (3.120c)
w ≥ 0  (3.120d)
α ≥ 0.  (3.120e)
Condition (3.120c) can be simplified by noting that both w and α are con-
strained to be non-negative, therefore αT w = 0 if, and only if, αi w i = 0 for
i = 1, . . . , n.
Using (3.120a), (3.120c), and (3.120b) we can write

c^T w = (A^T β − α)^T w = β^T A w = β^T b.
Substituting this into (3.115) and eliminating the primal variable w yields
the following dual LP
max_{α,β} b^T β  (3.121a)
s.t. A^T β − α = c, α ≥ 0.  (3.121b)
Splitting β into its positive and negative parts in the same manner yields the
canonical form constraints

A^T β⁺ − A^T β⁻ − α = c,  β⁺, β⁻, α ≥ 0.  (3.122b)
It can be easily verified that the primal-dual problem is symmetric; by taking
the dual of the dual we recover the primal (Problem 3.17). One important
thing to note however is that the primal (3.115) involves n variables and
n + m constraints, while the dual (3.122) involves 2m + n variables and 4m
+ 2n constraints.
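As an aside, small LPs in the canonical form (3.115) can be solved directly with off-the-shelf software; the following sketch uses scipy.optimize.linprog on an invented instance.

import numpy as np
from scipy.optimize import linprog

# min c'w  s.t.  Aw = b, w >= 0   (canonical form (3.115)); the data is made up
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)
print(res.x, res.fun)   # optimum puts all mass on the cost-free coordinate: w = (0, 0, 1)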
s.t. a_i^T w = b_i for i ∈ E  (3.123b)
     a_i^T w ≤ b_i for i ∈ I  (3.123c)

s.t. Aw = b  (3.124b)
To find the saddle point of the Lagrangian we take gradients with respect
Let w_* denote the minimizer of (3.123). If we define the active set A(w_*) as

A(w_*) = {i : i ∈ I and a_i^T w_* = b_i},

then the KKT conditions (3.114) for this problem can be written as

a_i^T w_* − b_i < 0  ∀i ∈ I \ A(w_*)  (3.128a)
a_i^T w_* − b_i = 0  ∀i ∈ E ∪ A(w_*)  (3.128b)
Different algorithms for solving (3.123) can be viewed as different ways to identify
the active set. See [NW99] for a detailed discussion.
Classical optimization techniques must compute this sum in its entirety for
each evaluation of the objective, respectively its gradient. As available data
sets grow ever larger, such “batch” optimizers therefore become increasingly
inefficient. They are also ill-suited for the incremental setting, where partial
data must be modeled as it arrives.
J_t(w) = λ Ω(w) + (1/k) Σ_{i=1}^k l(w, x_{ti}, y_{ti}).  (3.129)
Setting k = 1 obtains an algorithm which processes data points as they
arrive.
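A sketch of minibatch stochastic gradient descent on an objective of the form (3.129), here for ℓ2-regularized least squares; the data, stepsize schedule and minibatch size k are arbitrary illustrations.

import numpy as np

def sgd(X, y, lam=0.1, k=10, epochs=20, seed=0):
    """Minimize lam/2*||w||^2 + mean_i 0.5*(x_i'w - y_i)^2 with minibatch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // k):
            t += 1
            Xb, yb = X[idx], y[idx]
            grad = lam * w + Xb.T @ (Xb @ w - yb) / len(idx)
            w -= grad / (lam * (t + 50))   # shifted schedule keeps early steps stable
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(sgd(X, y))   # roughly recovers the (regularized) weight vector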
where f and g are convex functions. Clearly, J is not convex, but there
exists a reasonably simple algorithm, namely the Concave-Convex Procedure
(CCP), for finding a local minimum of J. The basic idea is simple: in the
t-th iteration replace g by its first order Taylor expansion at w_t, that is,
g(w_t) + ⟨w − w_t, ∇g(w_t)⟩, and minimize the resulting convex upper bound

w_{t+1} = argmin_w f(w) − g(w_t) − ⟨w − w_t, ∇g(w_t)⟩.
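As a sketch, consider the toy one-dimensional decomposition J(w) = f(w) − g(w) with f(w) = w⁴ and g(w) = 2w² (both convex, and both invented for illustration). The convex subproblem then has the closed-form solution w_{t+1} = w_t^{1/3}, and the iterates approach the local minimum at w = 1.

def ccp(w0, steps=30):
    """Concave-convex procedure for J(w) = f(w) - g(w) with
    f(w) = w**4 and g(w) = 2*w**2 (a made-up example).
    Each step minimizes f(w) - w*g'(w_t), i.e. solves 4w^3 = 4*w_t."""
    w = w0
    for _ in range(steps):
        w = abs(w) ** (1.0 / 3.0) * (1 if w >= 0 else -1)  # real cube root of w_t
    return w

print(ccp(4.0))   # approaches 1.0, a local minimum of w^4 - 2w^2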
Fig. 3.13. Given the function on the left we decompose it into the difference of two
convex functions depicted on the right panel. The CCP algorithm generates iterates
by matching points on the two convex curves which have the same tangent vectors.
As can be seen, the iterates approach the solution x = 2.0.
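The procedure can be sketched in a few lines. The sketch below assumes user-supplied callables f and grad_g for the two convex components and uses a generic numerical solver for the convex subproblem; it illustrates the idea rather than reproducing the book's implementation.

import numpy as np
from scipy.optimize import minimize

def ccp(f, grad_g, w0, iterations=50):
    """Sketch of the Concave-Convex Procedure for J(w) = f(w) - g(w), f and g convex.
    At step t we linearize g at w_t and minimize the resulting convex surrogate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iterations):
        gg = grad_g(w)                                   # gradient of g at the current iterate
        surrogate = lambda v: f(v) - np.dot(gg, np.asarray(v) - w)
        w = minimize(surrogate, w).x                     # convex subproblem, solved numerically here
    return w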
Adding the two inequalities, rearranging, and using (3.135) shows that J(wt) =
f (wt) − g(wt) ≥ f (wt+1) − g(wt+1) = J(wt+1), as claimed.
Let w* be a stationary point of the iterates. Then ∇f(w*) = ∇g(w*), which in turn implies that w* is a local minimum of J because ∇J(w*) = 0.
Duality: Sometimes problems which are hard to optimize in the primal may
become simpler in the dual. For instance, if the objective function is strongly
convex but non-smooth, its Fenchel conjugate is smooth with a Lipschitz
continuous gradient.
Problems
Problem 3.1 (Intersection of Convex Sets {1}) If C_1 and C_2 are convex sets, then show that C_1 ∩ C_2 is also convex. Extend your result to show that ∩_{i=1}^n C_i is convex if the C_i are convex.
Problem 3.2 (Linear Transform of Convex Sets {1}) Given a set C ⊂
Rn and a linear transform A ∈ Rm× n, define AC := {y = Ax : x ∈ C}. If C
is convex then show that AC is also convex.
Problem 3.4 (Convex Hull {2}) Show that the convex hull, conv(X) is
the smallest convex set which contains X.
Problem 3.7 (Strong convexity of the negative entropy {3}) Show that the negative entropy (3.15) is 1-strongly convex with respect to the ‖·‖_1 norm on the simplex. Hint: First show that φ(t) := (t − 1) log t − 2 (t − 1)²/(t + 1) ≥ 0 for all t ≥ 0. Next substitute t = x_i/y_i to show that
Σ_i (x_i − y_i) log(x_i/y_i) ≥ ‖x − y‖_1².
Problem 3.8 (Strongly Convex Functions {2}) Prove 3.16, 3.17, 3.18
and 3.19.
Problem 3.12 (Gradient of the p-norm {1}) Show that the gradient of
the p-norm (3.31) is given by (3.32).
f(x) = ½ x^T Ax where A is a positive definite matrix
f(x) = −log(x)
f(x) = exp(x)
f(x) = x log(x)
Problem 3.17 (Dual of a LP {1}) Show that the dual of the LP (3.122)
is (3.115). In other words, we recover the primal by computing the dual of
the dual.
4
Online Learning and Boosting
So far the learning algorithms we considered assumed that all the training
data is available before building a model for predicting labels on unseen data
points. In many modern applications data is available only in a streaming
fashion, and one needs to predict labels on the fly. To describe a concrete
example, consider the task of spam filtering. As emails arrive the learning
algorithm needs to classify them as spam or ham. Tasks such as these are
tackled via online learning. Online learning proceeds in rounds. At each
round a training example is revealed to the learning algorithm, which uses
its current model to predict the label. The true label is then revealed to
the learner which incurs a loss and updates its model based on the feedback
provided. This protocol is summarized in Algorithm 4.1. The goal of online
learning is to minimize the total loss incurred. By an appropriate choice
of labels and loss functions, this setting encompasses a large number of tasks
such as classification, regression, and density estimation. In our spam
detection example, if an email is misclassified the user can provide feedback
which is used to update the spam filter, and the goal is to minimize the
number of misclassified emails.
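The protocol summarized in Algorithm 4.1 can be sketched as a simple loop; the model interface (predict, loss, update) used here is an assumption made for illustration only.

def online_learning(model, stream):
    """Sketch of the online protocol: predict, observe the true label, incur a loss,
    update the model with the feedback."""
    total_loss = 0.0
    for x, y_true in stream:                 # one round per training example
        y_pred = model.predict(x)            # predict with the current model
        total_loss += model.loss(y_pred, y_true)
        model.update(x, y_true)              # incorporate the feedback
    return total_loss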
Lemma 4.1 The Halving algorithm makes at most log2 (n) mistakes.
Proof Let M denote the total number of mistakes. The halving algorithm
makes a mistake at iteration t if at least half the consistent experts Ct predict
the wrong label. This in turn implies that
|C_{t+1}| ≤ |C_t|/2 ≤ · · · ≤ |C_0|/2^M = n/2^M.
On the other hand, since one of the experts is consistent it follows that
1 ≤ |Ct+1|. Therefore, 2M ≤ n. Solving for M completes the proof.
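A minimal sketch of the Halving algorithm, assuming the experts are callables returning a label and that at least one expert is always correct:

def halving(experts, rounds):
    """Keep the experts consistent so far and predict with their majority vote.
    Every mistake eliminates at least half of the consistent experts."""
    consistent = list(experts)
    mistakes = 0
    for x, y_true in rounds:
        votes = [e(x) for e in consistent]
        prediction = max(set(votes), key=votes.count)           # majority vote
        if prediction != y_true:
            mistakes += 1                                       # at least half were wrong ...
        consistent = [e for e in consistent if e(x) == y_true]  # ... and are discarded
    return mistakes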
For the rest of this chapter we will make the following standard assumptions:
• The gradient ∇ψ, and its inverse (∇ψ)− 1 = ∇ψ∗ can be computed.
The method we employ to solve (4.2) is given in Algorithm 4.2. Before analyzing the performance of the algorithm we would like to discuss three special cases: first, the squared Euclidean distance, which recovers projected stochastic gradient descent; second, the entropy, which recovers exponentiated gradient descent; and third, the p-norms for p > 2, which recover the p-norm Perceptron. BUGBUG TODO.
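As an illustration of the update in Algorithm 4.2, the following sketch spells out two of these special cases; the projection step in the Euclidean case is left as an assumed callable.

import numpy as np

def md_step_euclidean(w, g, eta, project=lambda v: v):
    """One step of the update for psi(w) = 0.5*||w||^2, where grad psi is the
    identity: this is projected (stochastic) gradient descent. 'project' is an
    assumed Bregman projection onto Omega."""
    return project(w - eta * g)

def md_step_entropy(w, g, eta):
    """Same step for the negative entropy on the simplex: inverting grad psi and
    projecting back onto the simplex gives exponentiated gradient descent."""
    v = w * np.exp(-eta * g)      # multiplicative update
    return v / v.sum()            # renormalize onto the probability simplex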
Our key result is Lemma 4.3 given below. It can be found in various guises
in different places most notably Lemma 2.1 and 2.2 in [?], Theorem 4.1 and
Eq. (4.21) and (4.15) in [?], in the proof of Theorem 1 of [?], as well as Lemma
3 of [?]. We prove a slightly general variant; we allow for projections with
an arbitrary Bregman divergence and also take into account a generalized
version of strong convexity of ft. Both these modifications will allow us to deal
with general settings within a unified framework.
Proof We prove the result in three steps. First we upper bound ∆ ψ(w, wt+1) by
∆ψ (w, ŵt+1 ). This is a consequence of (4.4) and the non-negativity of the
Bregman divergence which allows us to write
∆ψ (w, w t+1 ) ≤ ∆ψ (w, ŵ t+1 ). (4.9)
In the next step we use Lemma 3.11 to write
∆ψ (w, w t ) + ∆ψ (w t , ŵ t+1 ) − ∆ψ (w, ŵt+1 ) = ⟨ ∇ψ(ŵ t+1 ) − ∇ ψ(w t ), w − w t ⟩ .
Since ∇ψ∗ = (∇ψ)− 1, the update in step 3 of Algorithm 4.2 can equivalently be
written as ∇ ψ(ŵ t+1 ) − ∇ ψ(w t ) = −ηt gt . Plugging this in the above
equation and rearranging
∆ψ (w, ŵ t+1 ) = ∆ψ (w, wt ) − ηt ⟨ gt , wt − w⟩ + ∆ψ (w t , ŵ t+1 ). (4.10)
Finally we upper bound ∆ψ (w t , ŵt+1 ). For this we need two observations:
First, ⟨x, y⟩ ≤ (1/2σ)‖x‖² + (σ/2)‖y‖² for all x, y ∈ R^n and σ > 0. Second, the strong convexity of ψ allows us to bound ∆_ψ(ŵ_{t+1}, w_t) ≥ (σ/2)‖w_t − ŵ_{t+1}‖².
Using these two observations
∆ψ (wt , ŵ t+1 ) = ψ(wt ) − ψ(ŵ t+1 ) − ⟨ ∇ψ(ŵ t+1 ), wt − ŵ t+1 ⟩
Inequality (4.7) follows by putting together (4.9), (4.10), and (4.11), while (4.8) follows by using (4.6) with f = f_t and w' = w_t and substituting into (4.7).
Now we are ready to prove regret bounds.
Lemma 4.4 Let w* ∈ Ω denote the best parameter chosen in hindsight, and let ‖g_t‖ ≤ L for all t. Then the regret of Algorithm 4.2 can be bounded via
Σ_{t=1}^T f_t(w_t) − f_t(w*) ≤ F (1/η_T − Tλ) + (L²/2σ) Σ_{t=1}^T η_t.  (4.12)
Summing over t,
Σ_{t=1}^T f_t(w_t) − f_t(w*) ≤ Σ_{t=1}^T (1/η_t)((1 − η_t λ) ∆_ψ(w*, w_t) − ∆_ψ(w*, w_{t+1})) + Σ_{t=1}^T (η_t/2σ) ‖g_t‖²,
where we denote the first sum on the right-hand side by T_1 and the second by T_2.
Since the diameter of Ω is bounded by F and ∆_ψ is non-negative,
T_1 = (1/η_1 − λ) ∆_ψ(w*, w_1) − (1/η_T) ∆_ψ(w*, w_{T+1}) + Σ_{t=2}^T (1/η_t − 1/η_{t−1} − λ) ∆_ψ(w*, w_t)
≤ (1/η_1 − λ) ∆_ψ(w*, w_1) + Σ_{t=2}^T (1/η_t − 1/η_{t−1} − λ) ∆_ψ(w*, w_t)
≤ (1/η_1 − λ) F + F Σ_{t=2}^T (1/η_t − 1/η_{t−1} − λ) = F (1/η_T − Tλ).
On the other hand, since the subgradients are Lipschitz continuous with
constant L it follows that
T_2 ≤ (L²/2σ) Σ_{t=1}^T η_t.
Putting together the bounds for T1 and T2 yields (4.12).
Corollary 4.5 If λ > 0 and we set η_t = 1/(λt) then
Σ_{t=1}^T f_t(w_t) − f_t(w*) ≤ (L²/2σλ)(1 + log(T)),
On the other hand, if λ = 0 and we set η_t = 1/√t, then
Σ_{t=1}^T f_t(w_t) − f_t(w*) ≤ (F + L²/σ) √T.
This follows since
Σ_{t=1}^T f_t(w_t) − f_t(w*) ≤ F √T + (L²/σ) Σ_{t=1}^T 1/(2√t) ≤ F √T + (L²/σ) √T.
Problems
Problem 4.1 (Generalized Cauchy-Schwartz {1}) Show that ⟨x, y⟩ ≤ (1/2σ)‖x‖² + (σ/2)‖y‖² for all x, y ∈ R^n and σ > 0.
Problem 4.2 (Bounding sum of a series {1}) Show that Σ_{t=a}^b 1/(2√t) ≤ √(b − a + 1). Hint: Upper bound the sum by an integral.
5
Conditional Densities
Because we do not have any prior knowledge about the data, we choose a zero mean unit variance isotropic normal distribution for p(θ). This yields
−log p(θ|X, Y) = ½‖θ‖² − Σ_{i=1}^m log p(y_i|x_i, θ) + const.  (5.3)
then
−log p(θ|X, Y) = ½‖θ‖² + Σ_{i=1}^m g(θ|x_i) − ⟨φ(x_i, y_i), θ⟩ + const.  (5.5)
where
g(θ|x) = log Σ_{y∈Y} exp(⟨φ(x, y), θ⟩),  (5.6)
can be used to obtain the maximum a posteriori (MAP) estimate for θ. Given the optimal θ, the class label at any given x can be predicted using
y* = argmax_y p(y|x, θ).  (5.7)
Define φ̂(x) := φ(x, +1) − φ(x, −1). Plugging (5.8) into (5.4), using the
definition of φ̂ and rearranging
p(y = +1|x, θ) = 1/(1 + exp(−⟨φ̂(x), θ⟩)) and p(y = −1|x, θ) = 1/(1 + exp(⟨φ̂(x), θ⟩)),
or more compactly
p(y|x, θ) = 1/(1 + exp(−y⟨φ̂(x), θ⟩)).  (5.9)
Since p(y|x, θ) is a logistic function, this model is called logistic regression. The classification rule (5.7) in this case specializes as follows: predict +1 whenever p(y = +1|x, θ) ≥ p(y = −1|x, θ), otherwise predict −1. However,
log [p(y = +1|x, θ) / p(y = −1|x, θ)] = ⟨φ̂(x), θ⟩,
therefore one can equivalently use sign(⟨φ̂(x), θ⟩) as our prediction function. Using (5.9) we can write the objective function of logistic regression as
½‖θ‖² + Σ_{i=1}^m log(1 + exp(−y_i ⟨φ̂(x_i), θ⟩)).
Notice that the second term of the gradient vanishes whenever p(yi |xi , θ) =
1. Therefore, one way to interpret logistic regression is to view it as a method
to maximize p(yi |xi , θ) for each point (xi , yi ) in the training set. Since the
objective function of logistic regression is twice differentiable, one can also compute its Hessian
∇²J(θ) = I + Σ_{i=1}^m p(y_i|x_i, θ)(1 − p(y_i|x_i, θ)) φ̂(x_i) φ̂(x_i)^T,
where we used y_i² = 1. The Hessian can be used in the Newton method.
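As a rough illustration, the following Newton iteration uses the gradient and Hessian derived above; the matrix Phi of rows φ̂(x_i), the label encoding y ∈ {−1, +1}, and the fixed number of iterations are assumptions made for the sketch.

import numpy as np

def newton_logistic(Phi, y, iterations=20):
    """Sketch of a Newton solver for J(theta) = 0.5*||theta||^2
    + sum_i log(1 + exp(-y_i <phihat(x_i), theta>))."""
    m, d = Phi.shape
    theta = np.zeros(d)
    for _ in range(iterations):
        margins = y * (Phi @ theta)
        p = 1.0 / (1.0 + np.exp(-margins))            # p(y_i | x_i, theta)
        grad = theta - Phi.T @ (y * (1.0 - p))        # gradient of J
        W = p * (1.0 - p)                             # per-example curvature weights
        H = np.eye(d) + Phi.T @ (Phi * W[:, None])    # Hessian: I + sum_i W_i phihat phihat^T
        theta -= np.linalg.solve(H, grad)             # Newton step
    return theta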
5.2 Regression
5.2.1 Conditionally Normal Models
fixed variance
5.7.1 Optimization
issues in optimization (blows up with number of classes). structure is not
there. can we do better?
Problems
Problem 5.1 Poisson models
6
Kernels and Function Spaces
k(x, x') = ⟨Φ(x), Φ(x')⟩  (6.1)
where Φ is a feature map which maps X into some dot product space H. In
other words, kernels correspond to dot products in some dot product space.
The main advantages of using a kernel as a similarity measure are threefold: First, if the feature space is rich enough, then simple estimators such as hyperplanes and half-spaces may be sufficient. For instance, to classify the points in Figure BUGBUG, we need a nonlinear decision boundary, but once we map the points to a 3 dimensional space a hyperplane suffices. Second, kernels allow us to construct machine learning algorithms in the dot product space H without explicitly computing Φ(x). Third, we need not make any assumptions about the input space X other than for it to be a set. As we will see later in this chapter, this allows us to compute similarity between discrete objects such as strings, trees, and graphs. In the first half of this chapter we will present some examples of kernels, and discuss some theoretical properties of kernels in the second half.
6.1.1 Examples
6.1.1.1 Linear Kernel
Linear kernels are perhaps the simplest of all kernels. We assume that x ∈ Rn
and define
k(x, x') = ⟨x, x'⟩ = Σ_i x_i x'_i.
If x and x' are dense then computing the kernel takes O(n) time. On the other hand, for sparse vectors this can be reduced to O(|nnz(x) ∩ nnz(x')|), where nnz(·) denotes the set of non-zero indices of a vector and |·| denotes the size of a set. Linear kernels are a natural representation to use for vectorial data. They are also widely used in text mining where documents are represented by a vector containing the frequency of occurrence of words (recall that we encountered this so-called bag of words representation in Chapter 1). Instead of a simple bag of words, one can also map a text to the set of pairs of words that co-occur in a sentence for a richer representation.
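A minimal sketch of this sparse evaluation, assuming each vector is stored as a dictionary from non-zero index to value:

def sparse_linear_kernel(x, xprime):
    """Linear kernel for sparse inputs: cost is proportional to the number of
    indices that are non-zero in both vectors."""
    if len(x) > len(xprime):                    # iterate over the smaller support
        x, xprime = xprime, x
    return sum(v * xprime[i] for i, v in x.items() if i in xprime)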
Proposition 6.1 Let Φ(x) (resp. Φ(x')) denote the vector whose entries are all possible d-th degree ordered products of the entries of x (resp. x'). Then
k(x, x') = ⟨Φ(x), Φ(x')⟩ = ⟨x, x'⟩^d.  (6.2)
The kernel (6.2) is called the polynomial kernel. A useful extension is the inhomogeneous polynomial kernel
k(x, x') = (⟨x, x'⟩ + c)^d,  (6.3)
[k_1 ⋆ · · · ⋆ k_P](x, x') = Σ_{x̄∈R(x), x̄'∈R(x')} Π_{p=1}^P k_p(x̄_p, x̄'_p).  (6.4)
Here, the sum is over all possible ways in which we can decompose x and x' into x̄_1, . . . , x̄_P and x̄'_1, . . . , x̄'_P respectively. If the cardinality of R(x) is finite, then it can be shown that (6.4) results in a valid kernel. Although convolution kernels provide the abstract framework, specific instantiations of this idea lead to a rich set of kernels on discrete objects. We will now discuss some of them in detail.
that is, we count all occurrences of s in x but now the weight associated with a subsequence depends on its length. To illustrate, consider the subsequence abc, which occurs in the string abcebc twice, namely, abcebc and abcebc. The first occurrence is counted with weight λ³ while the second occurrence is counted with the weight λ⁶. As it turns out, this kernel can be computed by a dynamic programming algorithm (problem BUGBUG) in O(|x| · |x'|) time.
i.e., (Ã^t)_{ij} is the probability of a transition from vertex v_j to vertex v_i via a walk of length t (problem BUGBUG). If p_0 is an initial probability distribution over vertices, then the probability distribution p_t describing the location of our random walker at time t is p_t = Ã^t p_0. The j-th component of p_t denotes the probability of finishing a t-length walk at vertex v_j. A random walk need not continue indefinitely; to model this, we associate every node v_i in the graph with a stopping probability q_i. The overall probability of stopping after t steps is given by q^T p_t.
Given two graphs G(V, E) and G'(V', E'), their direct product G_× is a graph with vertex set
Fig. 6.1. Two graphs (G1 & G2) and their direct product (G×). Each node of the
direct product graph is labeled with a pair of nodes (6.7); an edge exists in the
direct product if and only if the corresponding nodes are adjacent in both original
graphs (6.8). For instance, nodes 11′ and 32′ are adjacent because there is an edge
between nodes 1 and 3 in the first, and 1′ and 2′ in the second graph.
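For unlabeled graphs, the adjacency matrix of the direct product graph is the Kronecker product of the two adjacency matrices, which the sketch below uses together with a truncated, uniformly weighted walk sum. The choice µ(t) = µ^t and the uniform starting and stopping distributions are assumptions made for illustration; they are not the only possibilities.

import numpy as np

def direct_product_adjacency(A1, A2):
    """Adjacency matrix of the direct product graph: an edge exists in G_x
    iff the corresponding node pairs are adjacent in both input graphs."""
    return np.kron(A1, A2)

def random_walk_kernel(A1, A2, mu=0.1, steps=10):
    """Truncated random-walk kernel sketch: sum of mu^t weighted walk counts on
    the product graph, with uniform start and stopping weights assumed."""
    Ax = direct_product_adjacency(A1, A2)
    n = Ax.shape[0]
    p = np.ones(n) / n                       # uniform initial distribution
    q = np.ones(n)                           # uniform stopping weights
    value, pt = 0.0, p
    for t in range(1, steps + 1):
        pt = Ax @ pt                         # one more step of the walk
        value += (mu ** t) * (q @ pt)
    return value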
This idea can be extended to graphs whose nodes are associated with labels by replacing the matrix Ã_× with a matrix of label similarities. For appropriate choices of µ(t) the above sum converges and efficient algorithms for
computing the kernel can be devised. See [?] for details.
As it turns out, the simple idea of performing a random walk on the prod-
6.2 Kernels
6.2.1 Feature Maps
give examples, linear classifier, nonlinear ones with r2-r3 map
6.3 Algorithms
6.3.1 Kernel Perceptron
6.3.2 Trivial Classifier
6.3.3 Kernel Principal Component Analysis
6.4 Reproducing Kernel Hilbert Spaces
As it turns out, this class of functions coincides with the class of positive semi-
definite functions. Intuitively, the notion of a positive semi-definite function is an
extension of the familiar notion of a positive semi-definite matrix (also see
Appendix BUGBUG):
is called the Gram matrix or the kernel matrix of k with respect to x1, . . . , xn.
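As a small illustration, the sketch below builds the Gram matrix for a given kernel and checks numerically that it is positive semi-definite; the function names are assumptions made here.

import numpy as np

def gram_matrix(k, xs):
    """Build the kernel (Gram) matrix K_ij = k(x_i, x_j) for the points xs."""
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(xs[i], xs[j])
    return K

def is_positive_semidefinite(K, tol=1e-10):
    """A kernel restricted to finitely many points must yield a PSD Gram matrix;
    here we simply check that all eigenvalues are numerically non-negative."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol))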
We now establish the converse, that is, we show that every positive semi-definite kernel function can be written as (6.1). Towards this end, define a map Φ from X into the space of functions mapping X to R (denoted R^X) via Φ(x) = k(·, x). In other words, Φ(x) : X → R is a function which assigns the value k(x', x) to x' ∈ X. Next construct a vector space by taking all possible finite linear combinations of such functions.
To see that the above dot product is well defined even though it contains the expansion coefficients (which need not be unique), note that ⟨f, g⟩ = Σ_{j=1}^{n'} β_j f(x'_j), independent of α_i. Similarly, for g, note that ⟨f, g⟩ = Σ_{i=1}^n α_i g(x_i), this time independent of β_j. This also shows that ⟨f, g⟩ is bilinear. Symmetry follows because ⟨f, g⟩ = ⟨g, f⟩, while the positive semi-definiteness of k implies that
⟨f, f⟩ = Σ_{i,j} α_i α_j k(x_i, x_j) ≥ 0.  (6.14)
6.4.3 Regularization
Representer theorem, regularization
Problems
Problem 6.1 Show that (6.17) holds for an arbitrary positive semi-definite
function k.
Problem 6.3 (k-spectrum kernel {2}) Given two strings x and x', show how one can compute the k-spectrum kernel (section 6.1.1.5) in O((|x| + |x'|)k) time. Hint: You need to use a trie.
7
Linear Models
Fig. 7.1. A linearly separable toy binary classification problem of separating the diamonds from the circles. We normalize (w, b) to ensure that min_{i=1,...,m} |⟨w, x_i⟩ + b| = 1. In this case, the margin is given by 1/‖w‖, as the calculation in the inset shows.
or equivalently
min_{w,b} ½‖w‖²  (7.4a)
Given any w and b the constraints can now be satisfied by making ξi large
enough. This renders the whole optimization problem useless. Therefore, one
has to penalize large ξi. This is done via the following modified optimization
problem:
min_{w,b,ξ} ½‖w‖² + (C/m) Σ_{i=1}^m ξ_i  (7.5a)
L(w, b, ξ, α, β) = ½‖w‖² + (C/m) Σ_{i=1}^m ξ_i + Σ_{i=1}^m α_i (1 − ξ_i − y_i(⟨w, x_i⟩ + b)) − Σ_{i=1}^m β_i ξ_i.
Next take gradients with respect to w, b and ξ and set them to zero.
∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  (7.6a)
∇_b L = −Σ_{i=1}^m α_i y_i = 0  (7.6b)
∇_{ξ_i} L = C/m − α_i − β_i = 0.  (7.6c)
Substituting (7.6) into the Lagrangian and simplifying yields the dual ob-
jective function:
−½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i, x_j⟩ + Σ_{i=1}^m α_i,  (7.7)
which boils down to
min_α ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m α_i  (7.8a)
s.t. α^T y = 0  (7.9b)
0 ≤ α_i ≤ C/m.  (7.9c)
Before turning our attention to algorithms for solving (7.9), a number of
observations are in order. First, note that computing H only requires com-
puting dot products between training examples. If we map the input data to
a Reproducing Kernel Hilbert Space (RKHS) via a feature map φ, then we
can still compute the entries of H and solve for the optimal α. In this case,
Hij = yi yj ⟨ φ(xi ), φ(xj )⟩ = yi yj k(xi , xj ), where k is the kernel associated
with the RKHS. Given the optimal α, one can easily recover the decision
boundary. This is a direct consequence of (7.6a), which allows us to write w
as a linear combination of the training data:
w = Σ_{i=1}^m α_i y_i φ(x_i),
Fig. 7.2. The picture depicts the well classified points (y_i(⟨w, x_i⟩ + b) > 1) in black, the support vectors (y_i(⟨w, x_i⟩ + b) = 1) in blue, and margin errors (y_i(⟨w, x_i⟩ + b) < 1) in red.
y_i(⟨w, x_i⟩ + b) < 1: In this case, ξ_i > 0, and hence the KKT conditions imply that β_i = 0. Consequently, α_i = C/m (see (7.6c)). Such points are said to be margin errors.
yi(⟨ w, xi⟩ + b) > 1: In this case, ξi = 0, (1− ξi − yi(⟨ w, xi⟩ +b)) < 0, and
by the KKT conditions αi = 0. Such points are said to be well
classified. It is easy to see that the decision boundary (7.10) does not
change even if these points are removed from the training set.
y_i(⟨w, x_i⟩ + b) = 1: In this case ξ_i = 0 and β_i ≥ 0. Since α_i is non-negative and satisfies (7.6c) it follows that 0 ≤ α_i ≤ C/m. Such points are said to be on the margin. They are also sometimes called support vectors.
Since the support vectors satisfy y_i(⟨w, x_i⟩ + b) = 1 and y_i ∈ {±1}, it follows that b = y_i − ⟨w, x_i⟩ for any support vector x_i. However, in practice to recover b we average
b = y_i − ⟨w, x_i⟩  (7.11)
over all support vectors, that is, points x_i for which 0 < α_i < C/m. Because it uses support vectors, the overall algorithm is called the C-Support Vector classifier or C-SV classifier for short.
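A sketch of how the classifier is recovered from a dual solution follows; the kernel k and the dual coefficients alpha are assumed to be given (e.g. by an SMO solver), and the helper names are illustrative.

import numpy as np

def svm_decision_function(alpha, y, X, b, k):
    """By (7.6a), w = sum_i alpha_i y_i phi(x_i), hence
    f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
    def f(x):
        return sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return f

def recover_bias(alpha, y, X, k, C, m):
    """Average b = y_i - <w, x_i> over the support vectors, i.e. the points with
    0 < alpha_i < C/m, in the spirit of (7.11)."""
    wx = lambda x: sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y, X))
    sv = [i for i, a in enumerate(alpha) if 0.0 < a < C / m]
    return float(np.mean([y[i] - wx(X[i]) for i in sv]))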
Fig. 7.3. The binary hinge loss. Note that the loss is convex but non-differentiable
at the kink point. Furthermore, it increases linearly as the distance from the decision
hyperplane y(⟨ w, x⟩ + b) decreases.
In the absence of any prior knowledge about the data, we choose a zero mean
unit variance isotropic normal distribution for p(θ). This yields
−log p(θ|X, Y) = ½‖θ‖² − Σ_{i=1}^m log p(y_i|x_i, θ) + const.  (7.16)
Of course, our aim is not just to maximize p(y_i|x_i, θ) but also to ensure that p(y|x_i, θ) is small for all y ≠ y_i. This, for instance, can be achieved by requiring
p(y_i|x_i, θ) / p(y|x_i, θ) ≥ η for all y ≠ y_i and some η ≥ 1.  (7.18)
Here φ(x, y) is a joint feature map which depends on both the input data x and the label y, while g(θ|x) is the log-partition function. Now (7.18) boils down to
p(y_i|x_i, θ) / max_{y≠y_i} p(y|x_i, θ) = exp(⟨φ(x_i, y_i), θ⟩ − max_{y≠y_i} ⟨φ(x_i, y), θ⟩) ≥ η.  (7.20)
If we choose η such that log η = 1, set φ(x, y) = (y/2)φ(x), and observe that y ∈ {±1}, we can rewrite (7.20) as
⟨(y_i/2)φ(x_i) − (−y_i/2)φ(x_i), θ⟩ = y_i ⟨φ(x_i), θ⟩ ≥ 1.  (7.21)
By replacing − log p(yi |xi , θ) in (7.16) with the condition (7.21) we obtain
the following objective function:
min_θ ½‖θ‖²  (7.22a)
which recovers (7.4), but without the bias b. The prediction function is recovered by noting that (7.17) specializes to
y* = argmax_{y∈{±1}} ⟨φ(x, y), θ⟩ = argmax_{y∈{±1}} (y/2)⟨φ(x), θ⟩ = sign(⟨φ(x), θ⟩).  (7.23)
min_{α_B} ½ [α_B^T, α_B̄^T] [H_BB, H_BB̄; H_B̄B, H_B̄B̄] [α_B; α_B̄] − [α_B^T, α_B̄^T] e  (7.24a)
s.t. [α_B^T, α_B̄^T] y = 0  (7.24b)
0 ≤ α_i ≤ C/m.  (7.24c)
Dropping constant terms and rearranging, one can simplify the above problem to
min_{α_B} ½ α_B^T H_BB α_B + α_B^T (H_BB̄ α_B̄ − e)  (7.25a)
s.t. α_B^T y_B = −α_B̄^T y_B̄  (7.25b)
0 ≤ α_i ≤ C/m.  (7.25c)
m(α) ≤ M (α)
where
m(α) = max −yi ∇J (α)i and M (α) = min −yi ∇ J(α)i . (7.36)
i∈ Iup i∈ I down
Therefore, a natural stopping criterion is to stop when the KKT gap falls
below a desired tolerance ϵ, that is,
Finally, we turn our attention to the issue of working set selection. The
first order approximation to the objective function J(α) can be written as
approximation as
∇J(α)_B^T d_B ≈ J(α + d) − J(α).
From among all possible directions dB we wish to choose one which decreases
the objective function the most while maintaining feasibility. This is best
expressed as the following optimization problem:
min_{d_B} ∇J(α)_B^T d_B  (7.38a)
s.t. y_B^T d_B = 0  (7.38b)
d_i ≥ 0 if α_i = 0 and i ∈ B  (7.38c)
d_i ≤ 0 if α_i = C/m and i ∈ B  (7.38d)
−1 ≤ d_i ≤ 1.  (7.38e)
s.t. y_i d_i + y_j d_j = 0  (7.39b)
d_k ≥ 0 if α_k = 0 and k ∈ {i, j}  (7.39c)
d_k ≤ 0 if α_k = C/m and k ∈ {i, j}  (7.39d)
−1 ≤ d_k ≤ 1 for k ∈ {i, j}.  (7.39e)
At first glance, it seems that choosing the optimal i and j from the set
{1, . . . , m}×{1, . . . m} requires O(m2) effort. We now show that O(m) effort
suffices.
Define new variables dˆk = yk dk for k ∈ {i, j}, and use the observation
yk ∈ {±1} to rewrite the objective function as
(−yi ∇J (α)i + yj ∇ J(α)j ) dˆj .
which clearly can be solved in O(m) time. Comparison with (7.36) shows
that at every iteration of SMO we choose to update coefficients α i and αj
which maximally violate the KKT conditions.
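The selection rule can be sketched as follows. The index sets I_up and I_down are not spelled out in the surviving text, so the standard definitions are assumed here for illustration.

def select_working_set(alpha, y, grad, C, m, tol=1e-3):
    """Sketch of O(m) working set selection: pick the pair (i, j) that maximally
    violates the KKT conditions, in the spirit of (7.36). 'grad' holds the
    entries of nabla J(alpha)."""
    up = [t for t in range(m)
          if (y[t] > 0 and alpha[t] < C / m) or (y[t] < 0 and alpha[t] > 0)]
    down = [t for t in range(m)
            if (y[t] > 0 and alpha[t] > 0) or (y[t] < 0 and alpha[t] < C / m)]
    i = max(up, key=lambda t: -y[t] * grad[t])       # argmax over I_up
    j = min(down, key=lambda t: -y[t] * grad[t])     # argmin over I_down
    if -y[i] * grad[i] - (-y[j] * grad[j]) < tol:    # KKT gap below tolerance
        return None                                  # stop
    return i, j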
7.2 Extensions
7.2.1 The ν trick
In the soft margin formulation the parameter C is a trade-off between two
conflicting requirements namely maximizing the margin and minimizing the
training error. Unfortunately, this parameter is rather unintuitive and hence
difficult to tune. The ν-SVM was proposed to address this issue. As Theorem
7.3 below shows, ν controls the number of support vectors and margin errors.
The primal problem for the ν-SVM can be written as
min_{w,b,ξ,ρ} ½‖w‖² − ρ + (1/(νm)) Σ_{i=1}^m ξ_i  (7.40a)
min_α ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i, x_j⟩  (7.41a)
Lemma 7.2 Let ν ∈ [0, 1] and (7.41) be feasible. Then there is at least one solution α which satisfies Σ_i α_i = 1. Furthermore, if the final objective value of (7.41) is non-zero then all solutions satisfy Σ_i α_i = 1.
This implies that either ½ α^T H α = 0, in which case ᾱ is an optimal solution with the desired property, or ½ α^T H α ≠ 0, in which case all optimal solutions satisfy Σ_i α_i = 1.
In view of the above theorem one can equivalently replace (7.41) by the
following simplified optimization problem with two equality constraints
min_α ½ Σ_{i,j} y_i y_j α_i α_j ⟨x_i, x_j⟩  (7.42a)
Theorem 7.3 Suppose we run ν-SVM with kernel k on some data and obtain ρ > 0. Then
(i) ν is an upper bound on the fraction of margin errors, that is points for which y_i(⟨w, x_i⟩ + b) < ρ.
Theorem 7.4 If (7.40) leads to a decision function with ρ > 0, then (7.5) with C = 1/ρ leads to the same decision function.
This loss is difficult to work with because it is non-convex (see Figure 7.4). In
Fig. 7.4. The 0-1 loss which is non-convex and intractable is depicted in red. The hinge
loss is a convex upper bound to the 0-1 loss and shown in blue. The square hinge loss
is a differentiable convex upper bound to the 0-1 loss and is depicted in green.
fact, it has been shown that finding the optimal (w, b) pair which minimizes
the 0-1 loss on a training dataset of m labeled points is NP hard [BDEL03].
Therefore various proxy functions such as the binary hinge loss (7.13) which we discussed in Section 7.1.1 are used. Another popular proxy is the square hinge loss:
Besides being a proxy for the 0-1 loss, the squared hinge loss, unlike the hinge
loss, is also differentiable everywhere. This sometimes makes the opti-
mization in the primal easier. Just like in the case of the hinge loss one can
derive the dual of the regularized risk minimization problem and show that
it is a quadratic programming problem (problem 7.5).
parameterized by s ≤ 0 is another proxy for the 0-1 loss (see Figure 7.5).
Although not convex, it can be expressed as the difference of two convex
functions
Fig. 7.5. The ramp loss depicted here with s = −0.3 can be viewed as the sum of a convex function, namely the binary hinge loss (left), and a concave function min(0, 1 − y(⟨w, x⟩ + b)) (right). Viewed alternatively, the ramp loss can be written as the difference of two convex functions.
The concave-convex procedure of Section 3.5.1 can be used to solve the resulting regularized risk minimization problem with the ramp loss. Towards this end write
J(w) = ½‖w‖² + (C/m) Σ_{i=1}^m l_conc(w, x_i, y_i) − (C/m) Σ_{i=1}^m l_cave(w, x_i, y_i),  (7.46)
where we denote the first two terms by J_conc(w) and the last term by J_cave(w).
Recall that at every iteration of the CCP we replace J_cave(w) by its first order Taylor approximation, computing which requires
∂_w J_cave(w) = (C/m) Σ_{i=1}^m ∂_w l_cave(w, x_i, y_i).  (7.47)
Ignoring constant terms, each iteration of the CCP algorithm involves solving the following minimization problem (also see (3.134)):
J(w) = ½‖w‖² + (C/m) Σ_{i=1}^m l_conc(w, x_i, y_i) − ((C/m) Σ_{i=1}^m δ_i y_i x_i)^T w.  (7.49)
s.t. α^T y = 0  (7.50b)
−(C/m) δ ≤ α ≤ (C/m)(e − δ).  (7.50c)
In fact, this problem can be solved by a SMO solver with minor modifica-
tions. Putting everything together yields Algorithm 7.1.
Fig. 7.6. The ϵ insensitive loss. All points which lie within the ϵ tube shaded in
gray incur zero loss while points outside incur a linear loss.
In other words, we want to find a hyperplane such that all the training data
lies within an ϵ tube around the hyperplane. We may not always be able to
find such a hyperplane, hence we relax the above condition by introducing
slack variables ξ+ and ξ− and write the corresponding primal problem as
i i
1 2 CΣ
m + −
min
+ −
ǁwǁ + (ξ i + ξ i ) (7.52a)
w,b,ξ ,ξ 2 m i=1
s.t. yi − (⟨ w, xi⟩ + b) ≤ ϵ + ξi + for all i (7.52b)
(⟨ w, xi⟩ + b) − yi ≤ ϵ + ξi− for all i (7.52c)
ξ+i ≥ 0, and ξ−i ≥ 0. (7.52d)
L(w, b, ξ⁺, ξ⁻, α⁺, α⁻, β⁺, β⁻) = ½‖w‖² + (C/m) Σ_{i=1}^m (ξ⁺_i + ξ⁻_i) − Σ_{i=1}^m (β⁺_i ξ⁺_i + β⁻_i ξ⁻_i)
+ Σ_{i=1}^m α⁺_i (y_i − (⟨w, x_i⟩ + b) − ϵ − ξ⁺_i) + Σ_{i=1}^m α⁻_i ((⟨w, x_i⟩ + b) − y_i − ϵ − ξ⁻_i).
Taking gradients with respect to the primal variables and setting them to 0, we obtain the following conditions:
w = Σ_{i=1}^m (α⁺_i − α⁻_i) x_i  (7.53)
Σ_{i=1}^m α⁺_i = Σ_{i=1}^m α⁻_i  (7.54)
α⁺_i + β⁺_i = C/m  (7.55)
α⁻_i + β⁻_i = C/m.  (7.56)
Noting that α⁺_i, α⁻_i, β⁺_i, β⁻_i ≥ 0 and substituting the above conditions into the Lagrangian yields the dual
min_{α⁺,α⁻} ½ Σ_{i,j} (α⁺_i − α⁻_i)(α⁺_j − α⁻_j) ⟨x_i, x_j⟩ + ϵ Σ_{i=1}^m (α⁺_i + α⁻_i) − Σ_{i=1}^m y_i (α⁺_i − α⁻_i)  (7.57a)
s.t. Σ_{i=1}^m α⁺_i = Σ_{i=1}^m α⁻_i  (7.57b)
0 ≤ α⁺_i ≤ C/m  (7.57c)
0 ≤ α⁻_i ≤ C/m.  (7.57d)
This is a quadratic programming problem with one equality constraint, and
hence a SMO like decomposition method can be derived for finding the
optimal coefficients α+ and α− (Problem 7.7).
As a consequence of (7.53), analogous to the classification case, one can
map the data via a feature map φ into an RKHS with kernel k and recover
the decision boundary f (x) = ⟨ w, φ(x)⟩ + b via
f(x) = Σ_{i=1}^m (α⁺_i − α⁻_i) ⟨φ(x_i), φ(x)⟩ + b = Σ_{i=1}^m (α⁺_i − α⁻_i) k(x_i, x) + b.  (7.58)
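A sketch of evaluating (7.58), assuming the dual coefficients and the bias have already been computed by some solver:

def svr_predict(alpha_plus, alpha_minus, X, b, k):
    """Regression estimate (7.58): f(x) = sum_i (alpha_i^+ - alpha_i^-) k(x_i, x) + b."""
    def f(x):
        return sum((ap - am) * k(xi, x)
                   for ap, am, xi in zip(alpha_plus, alpha_minus, X)) + b
    return f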
Finally, the KKT conditions
(C/m − α⁺_i) ξ⁺_i = 0 and (C/m − α⁻_i) ξ⁻_i = 0
α⁺_i (y_i − (⟨w, x_i⟩ + b) − ϵ − ξ⁺_i) = 0 and α⁻_i ((⟨w, x_i⟩ + b) − y_i − ϵ − ξ⁻_i) = 0,
αi− = 0. In other words, points which lie inside the ϵ tube around the
It turns out that the support vector regression framework can be easily
extended to handle other, more general, convex loss functions such as the
ones found in Table 7.1. Different losses have different properties and hence
lead to different estimators. For instance, the square loss leads to penalized
least squares (LS) regression, while the Laplace loss leads to the penalized
least absolute deviations (LAD) estimator. Huber’s loss on the other hand is
a combination of the penalized LS and LAD estimators, and the pinball loss
with parameter τ ∈ [0, 1] is used to estimate τ -quantiles. Setting τ = 0.5
in the pinball loss leads to a scaled version of the Laplace loss. If we define
ξ = y −⟨ w, x⟩ , then it is easily verified that all these losses can all be written
as
l(w, x, y) = l⁺(ξ − ϵ) if ξ > ϵ,  l⁻(−ξ − ϵ) if ξ < −ϵ,  and 0 if ξ ∈ [−ϵ, ϵ].  (7.60)
For all these different loss functions, the support vector regression formu-
min_{α⁺,α⁻} ½ Σ_{i,j} (α⁺_i − α⁻_i)(α⁺_j − α⁻_j) ⟨x_i, x_j⟩ − (C/m) Σ_{i=1}^m (T⁺(ξ⁺_i) + T⁻(ξ⁻_i)) + ϵ Σ_{i=1}^m (α⁺_i + α⁻_i) − Σ_{i=1}^m y_i (α⁺_i − α⁻_i)  (7.62a)
s.t. Σ_{i=1}^m α⁺_i = Σ_{i=1}^m α⁻_i  (7.62b)
0 ≤ α^{+,−}_i ≤ (C/m) ∂_ξ l^{+,−}(ξ^{+,−}_i)  (7.62c)
0 ≤ ξ^{+,−}_i  (7.62d)
ξ^{+,−}_i = inf{ ξ : (C/m) ∂_ξ l^{+,−}(ξ) ≥ α^{+,−}_i }.  (7.62e)
Here T⁺(ξ) = l⁺(ξ) − ξ ∂_ξ l⁺(ξ) and T⁻(ξ) = l⁻(ξ) − ξ ∂_ξ l⁻(ξ). We now show how (7.62) can be specialized to the pinball loss. Clearly, l⁺(ξ) = τξ while l⁻(−ξ) = (τ − 1)ξ, and hence l⁻(ξ) = (1 − τ)ξ. Therefore, T⁺(ξ) = τξ − ξτ = 0 and likewise T⁻(ξ) = (1 − τ)ξ − ξ(1 − τ) = 0. If we further
Table 7.1. Various loss functions which can be used in support vector regression. For brevity we denote y − ⟨w, x⟩ as ξ and write the loss l(w, x, y) in terms of ξ.
ϵ-insensitive loss: max(0, |ξ| − ϵ)
Laplace loss: |ξ|
Square loss: ½ |ξ|²
Huber's robust loss: (1/2σ) ξ² if |ξ| ≤ σ, and |ξ| − σ/2 otherwise
Pinball loss: τξ if ξ ≥ 0, and (τ − 1)ξ otherwise.
observe that ϵ = 0 for the pinball loss then (7.62) specializes as follows:
min_α ½ Σ_{i,j} α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m y_i α_i  (7.63a)
s.t. Σ_{i=1}^m α_i = 0  (7.63b)
(C/m)(τ − 1) ≤ α_i ≤ (C/m) τ.  (7.63c)
Similar specializations of (7.62) for other loss functions in Table 7.1 can be
derived.
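As a small illustration of the pinball loss from Table 7.1 (the loss minimized via (7.63)), the sketch below simply evaluates it for a single observation; the argument names are illustrative.

import numpy as np

def pinball_loss(w, x, y, tau):
    """Pinball loss with xi = y - <w, x>: tau*xi if xi >= 0 and (tau - 1)*xi
    otherwise. It is used to estimate the tau-quantile of y given x."""
    xi = y - np.dot(w, x)
    return tau * xi if xi >= 0 else (tau - 1.0) * xi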
s.t. Σ_{i=1}^m (α⁻_i − α⁺_i) = 0  (7.65b)
Σ_{i=1}^m (α⁺_i + α⁻_i) = 1  (7.65c)
0 ≤ α⁺_i ≤ 1/(νm)  (7.65d)
0 ≤ α⁻_i ≤ 1/(νm).  (7.65e)
Given the input data X one can compute the empirical density
p̂(x) = 1/m if x ∈ X and 0 otherwise,
Fig. 7.7. The novelty detection problem can be viewed as finding a large margin
hyperplane which separates ν fraction of the data points away from the origin.
L(w, ξ, ρ, α, β) = ½‖w‖² + (1/(νm)) Σ_{i=1}^m ξ_i − ρ + Σ_{i=1}^m α_i (ρ − ξ_i − ⟨w, x_i⟩) − Σ_{i=1}^m β_i ξ_i.
By taking gradients with respect to the primal variables and setting them
to 0 we obtain
w = Σ_{i=1}^m α_i x_i  (7.68)
α_i = 1/(νm) − β_i ≤ 1/(νm)  (7.69)
Σ_{i=1}^m α_i = 1.  (7.70)
Noting that αi, βi ≥ 0 and substituting the above conditions into the La-
grangian yields the dual
min_α ½ Σ_{i,j} α_i α_j ⟨x_i, x_j⟩  (7.71a)
s.t. 0 ≤ α_i ≤ 1/(νm)  (7.71b)
Σ_{i=1}^m α_i = 1.  (7.71c)
Theorem 7.5 Assume that the solution of (7.71) satisfies ρ ≠ 0, then the following statements hold:
(i) ν is an upper bound on the fraction of support vectors, that is points for which ⟨w, x_i⟩ ≤ ρ.
(ii) Suppose the data X were generated independently from a distribution p(x) which does not contain discrete components. Moreover, assume that the kernel k is analytic and non-constant. With probability 1, asymptotically, ν equals the fraction of support vectors.
Recall that the joint feature map φ(x, y) was introduced in section 7.1.2. One
way to interpret the above equation is to view f (x, y) as a compatibility score
between instance x and label y; we assign the label with the highest
compatibility score to x. There are a number of extensions of the binary
hinge loss (7.13) which can be used to estimate this score function. In all
these cases the objective function is written as
min_w J(w) := (λ/2)‖w‖² + (1/m) Σ_{i=1}^m l(w, x_i, y_i).  (7.73)
Here λ is a scalar which trades off the regularizer ½‖w‖² with the empirical risk (1/m) Σ_{i=1}^m l(w, x_i, y_i). Plugging in different loss functions yields classifiers
for different settings. Two strategies exist for finding the optimal w. Just like
in the binary SVM case, one can compute and maximize the dual of (7.73).
However, the number of dual variables becomes m|Y|, where m is the number of
training points and |Y| denotes the size of the label set. The second strategy is to
optimize (7.73) directly. However, the loss functions we discuss below are non-
smooth, therefore non-smooth optimization algorithms such as bundle
methods (section 3.2.7) need to be used.
l(w, x, y) := max(0, max_{y'≠y} (1 − ⟨φ(x, y) − φ(x, y'), w⟩)).  (7.75)
as the log odds ratio in the exponential family. Towards this end choose η such that log η = 1 and rewrite (7.20):
log [p(y|x, w) / max_{y'≠y} p(y'|x, w)] = ⟨φ(x, y), w⟩ − max_{y'≠y} ⟨φ(x, y'), w⟩ ≥ 1.
l(w, x, y) = Σ_{y∈Y_x, y'∉Y_x} max(0, 1 − ⟨φ(x, y) − φ(x, y'), w⟩).  (7.77)
One can immediately verify that specializing the above losses to the mul-
ticlass case recovers (7.74) and (7.75) respectively, while the binary case
recovers (7.13). The above losses are zero only when
min_{y∈Y_x} f(x, y) = min_{y∈Y_x} ⟨φ(x, y), w⟩ ≥ 1 + max_{y'∉Y_x} ⟨φ(x, y'), w⟩ = 1 + max_{y'∉Y_x} f(x, y').
This can be interpreted as follows: The losses ensure that all the labels
assigned to x have larger scores compared to labels not assigned to x with the
margin of separation of at least 1.
Although the above loss functions are compatible with multiple labels, the prediction function argmax_y f(x, y) only takes into account the label with the highest score. This is a significant drawback of such models, which can be overcome by using a multiclass approach instead. Let |Y| be the size of the label set and z ∈ R^{|Y|} denote a vector with ±1 entries. We set
where A(y) denotes the set of all possible subgraphs of y. The maximum
margin version, on the other hand, is given by
l(w, x, y) = max max (0, 1 − (f (x, i) − f (x, j))) . (7.81)
G∈ A(y) (i,j)∈ G
7.8 Applications
7.8.1 Sequence Annotation
7.8.2 Matching
7.8.3 Ranking
7.8.4 Shortest Path Planning
7.8.5 Image Annotation
7.8.6 Contingency Table Loss
7.9 Optimization
7.9.1 Column Generation
subdifferentials
Derive the dual of (7.83) and contrast it with (7.9). What changes to the SMO
algorithm would you make to solve this dual?
Problem 7.3 (Deriving the simplified ν-SVM dual {2}) In Lemma 7.2 we used (7.41) to show that the constraint Σ_i α_i ≥ 1 can be replaced by Σ_i α_i = 1. Show that an equivalent way to arrive at the same conclusion is
by arguing that the constraint ρ ≥ 0 is redundant in the primal (7.40). Hint:
Observe that whenever ρ < 0 the objective function is always non-negative.
On the other hand, setting w = ξ = b = ρ = 0 yields an objective function
value of 0.
Problem 7.4 (Fenchel and Lagrange Duals {2}) We derived the Lagrange dual of (7.12) in Section 7.1 and showed that it is (7.9). Derive the Fenchel dual of (7.12) and relate it to (7.9). Hint: See theorem 3.3.5 of [BL00].
Problem 7.5 (Dual of the square hinge loss {1}) The analog of (7.5)
when working with the square hinge loss is the following
min_{w,b,ξ} ½‖w‖² + (C/m) Σ_{i=1}^m ξ_i²  (7.84a)
Problem 7.6 (Dual of the ramp loss {1}) Derive the Lagrange dual of (7.49) and show that it is the quadratic programming problem (7.50).
Problem 7.7 (SMO for various SVM formulations {2}) Derive an SMO
like decomposition algorithm for solving the dual of the following problems:
• ν-SVM (7.41).
• SV regression (7.57).
• SV novelty detection (7.71).
Problem 7.8 (Novelty detection with Balls {2}) In Section 7.4 we assumed that we wanted to estimate a halfspace which contains a major fraction of the input data. An alternative approach is to use balls, that is, we estimate a ball of small radius in feature space which encloses a majority of the input data. Write the corresponding optimization problem and its dual. Show that if the kernel is translation invariant, that is, k(x, x') depends only on ‖x − x'‖, then the optimization problem with balls is equivalent to (7.71).
Problem 7.9 (Multiclass and Multilabel loss from Ranking Loss {1})
Show how the multiclass (resp. multilabel) losses (7.74) and (7.75) (resp.
(7.77) and (7.79)) can be derived as special cases of (7.80) and (7.81) re-
spectively.
Our proof presentation by and large follows [?]. We first show that
Lemma 1.2 For any arbitrary vector α ∈ R^d let q_i denote the i-th component of f(α). Then q_i ∼ N(0, ‖α‖²/k) and hence
E[‖f(α)‖²] = Σ_{i=1}^k E[q_i²] = ‖α‖².  (1.4)
In other words, the expected length of vectors are preserved even after em-
bedding them in a k dimensional space. Next we show that the lengths of the
embedded vectors are tightly concentrated around their mean.
Lemma 1.3 For any ϵ > 0 and any unit vector α ∈ R^d we have
Pr[‖f(α)‖² > 1 + ϵ] < exp(−(k/2)(ϵ²/2 − ϵ³/3))  (1.5)
Pr[‖f(α)‖² < 1 − ϵ] < exp(−(k/2)(ϵ²/2 − ϵ³/3)).  (1.6)
Pr[(1 − ϵ)‖α‖² ≤ ‖f(α)‖² ≤ (1 + ϵ)‖α‖²] ≥ 1 − 2/n^{2+β}.  (1.7)
2 exp(−(k/2)(ϵ²/2 − ϵ³/3)) ≤ 2/n^{2+β},
E[q_i] = (1/√k) Σ_j α_j E[R_ij] = 0.
Since Rij are independent zero mean unit variance random variables, E [Rij Ril ] =
1 if j = l and 0 otherwise. Using this
E[q_i²] = E[((1/√k) Σ_{j=1}^d R_ij α_j)²] = (1/k) Σ_{j=1}^d Σ_{l=1}^d α_j α_l E[R_ij R_il] = (1/k) Σ_{j=1}^d α_j² = (1/k)‖α‖².
Pr[‖f(α)‖² > 1 + ϵ] = Pr[exp(λ‖f(α)‖²) > exp(λ(1 + ϵ))].
≤ E[exp(λ Σ_{i=1}^k q_i²)] / exp(λ(1 + ϵ))
= E[Π_{i=1}^k exp(λ q_i²)] / exp(λ(1 + ϵ))
= (E[exp(λ q_i²)] / exp(λ(1 + ϵ)/k))^k.  (1.8)
The last equality is because the q_i's are i.i.d. Since α is a unit vector, from the previous lemma q_i ∼ N(0, 1/k). Therefore, k q_i² is a χ² random variable with moment generating function
E[exp(λ q_i²)] = E[exp((λ/k) k q_i²)] = 1/√(1 − 2λ/k).
Plugging this into (1.8)
Pr[exp(λ‖f(α)‖²) > exp(λ(1 + ϵ))] ≤ exp(−λ(1 + ϵ)) / (1 − 2λ/k)^{k/2}.
Setting λ = kϵ/(2(1 + ϵ)) in the above inequality and simplifying
Pr[exp(λ‖f(α)‖²) > exp(λ(1 + ϵ))] ≤ (exp(−ϵ)(1 + ϵ))^{k/2}.
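A sketch of the embedding analysed in this section; drawing R with i.i.d. standard normal entries is one common choice consistent with the assumptions above, and the function name is illustrative.

import numpy as np

def random_projection(X, k, rng=np.random.default_rng(0)):
    """f(alpha) = R alpha / sqrt(k) with R a k x d matrix of i.i.d. standard normal
    entries, so squared lengths are preserved in expectation and concentrate as in
    (1.5)-(1.7). Each row of X is projected to k dimensions."""
    d = X.shape[1]
    R = rng.standard_normal((k, d))
    return X @ R.T / np.sqrt(k)       # each row is the k-dimensional image f(x)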
A1.3.4 Operators
spectrum, norm, bounded, unbounded operators
Conjugate Distributions
Binomial — Beta
φ(x) = x
e^{h(nν,n)} = Γ(nν + 1) Γ(n(1 − ν) + 1) / Γ(n + 2) = B(nν + 1, n(1 − ν) + 1)
In traditional notation one represents the conjugate as
p(z; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] z^{α−1} (1 − z)^{β−1}
where α = nν + 1 and β = n(1 − ν) + 1.
Multinomial — Dirichlet
φ(x) = e_x
e^{h(nν,n)} = Π_{i=1}^d Γ(nν_i + 1) / Γ(n + d)
In traditional notation one represents the conjugate as
p(z; α) = [Γ(Σ_{i=1}^d α_i) / Π_{i=1}^d Γ(α_i)] Π_{i=1}^d z_i^{α_i − 1}
where α_i = nν_i + 1
Poisson — Gamma
φ(x) = x
e^{h(nν,n)} = n^{−nν} Γ(nν)
Loss Functions
Scalar loss functions and their derivatives, depending on f := ⟨w, x⟩ and on y.
Logistic [CSS00]: loss log(1 + exp(−yf)); derivative −y/(1 + exp(−yf))
Novelty [SPST+01]: loss max(0, ρ − f); derivative 0 if f ≥ ρ and −1 otherwise
Least mean squares [Wil98]: loss ½(f − y)²; derivative f − y
Least absolute deviation: loss |f − y|; derivative sign(f − y)
Quantile regression [Koe05]: loss max(τ(f − y), (1 − τ)(y − f)); derivative τ if f > y and τ − 1 otherwise
ϵ-insensitive [VGS97]: loss max(0, |f − y| − ϵ); derivative 0 if |f − y| ≤ ϵ, else sign(f − y)
Huber's robust loss [MSR+97]: loss ½(f − y)² if |f − y| ≤ 1, else |f − y| − ½; derivative f − y if |f − y| ≤ 1, else sign(f − y)
Poisson regression [Cre93]: loss exp(f) − yf; derivative exp(f) − y

Vectorial loss functions and their derivatives, depending on the vector f := Wx and on y.
Soft-Margin Multiclass [TGK04, CS03]: loss max_{y'}(f_{y'} − f_y + ∆(y, y')); derivative e_{y*} − e_y, where y* is the argmax of the loss
Scaled Soft-Margin Multiclass [TJHA05]: loss max_{y'} Γ(y, y')(f_{y'} − f_y + ∆(y, y')); derivative Γ(y, y*)(e_{y*} − e_y), where y* is the argmax of the loss
Softmax Multiclass [CDLS99]: loss log Σ_{y'} exp(f_{y'}) − f_y; derivative [Σ_{y'} e_{y'} exp(f_{y'})] / [Σ_{y'} exp(f_{y'})] − e_y
Multivariate Regression: loss ½(f − y)^T M(f − y) where M ⪰ 0; derivative M(f − y)
For instance, for squared loss we have
l̄(⟨w, x⟩, y) = ½(⟨w, x⟩ − y)² and l̄'(⟨w, x⟩, y) = ⟨w, x⟩ − y.
Here φ(x, y) is a joint feature map, ∆(y, y') ≥ 0 describes the cost of misclassifying y by y', and Γ(y, y') ≥ 0 is a scaling term which indicates by how much the large margin property should be enforced. For instance, [TGK04] choose Γ(y, y') = 1. On the other hand [TJHA05] suggest Γ(y, y') = ∆(y, y'), which reportedly yields better performance. Finally, [McA07] recently suggested generic functions Γ(y, y').
The logistic loss can also be interpreted as the negative log-likelihood of
a conditional exponential family model:
where the normalizing constant g(w|x), often called the log-partition func-
tion, reads
g(w|x) := log Σ_{y'∈Y} exp(⟨w, φ(x, y')⟩).  (3.8)
where p(y|x) is the exponential family model (3.7). In the case of (3.6) we
denote by ȳ(x) the argmax of the RHS, that is
In the case where the loss is maximized for more than one distinct value ȳ(x)
we may average over the individual values, since any convex combination of
such terms lies in the subdifferential.
Note that (3.6) majorizes ∆(y, y*), where y* := argmax_{y'} ⟨w, φ(x, y')⟩ [TJHA05]. This can be seen via the following series of inequalities:
∆(y, y*) ≤ Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ + ∆(y, y*) ≤ l(x, y, w).
The first inequality follows because Γ(y, y*) ≥ 0 and y* maximizes ⟨w, φ(x, y')⟩, thus implying that Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ ≥ 0. The second inequality follows by definition of the loss.
We conclude this section with a simple lemma which is at the heart of
several derivations of [Joa05]. While the proof in the original paper is far from
trivial, it is straightforward in our setting:
Lemma 3.1 Denote by δ(y, y') a loss and let φ(x_i, y_i) be a feature map for observations (x_i, y_i) with 1 ≤ i ≤ m. Moreover, denote by X, Y the set of all m patterns and labels respectively. Finally let
Φ(X, Y) := Σ_{i=1}^m φ(x_i, y_i) and ∆(Y, Y') := Σ_{i=1}^m δ(y_i, y'_i).  (3.13)
Then the following two terms are equal:
Σ_{i=1}^m max_{y'} ⟨w, φ(x_i, y') − φ(x_i, y_i)⟩ + δ(y_i, y') and max_{Y'} ⟨w, Φ(X, Y') − Φ(X, Y)⟩ + ∆(Y, Y').
This is immediately obvious, since both feature map and loss decompose, which allows us to perform maximization over Y' by maximizing each of its m components. In doing so, we showed that aggregating all data and labels into a single feature map and loss yields results identical to minimizing the sum over all individual losses. This holds, in particular, for the sample error loss of [Joa05]. Also note that this equivalence does not hold whenever Γ(y, y') is not constant.
Likewise note that [TGK04] use relaxations when solving structured estimation problems of the form
l(x, y, w) = max_{y'} Γ(y, y') ⟨w, φ(x, y') − φ(x, y)⟩ + ∆(y, y'),  (3.16)
by enlarging the domain of maximization with respect to y'. For instance, instead of an integer programming problem we might relax the setting to a linear program which is much cheaper to solve. This, again, provides an upper bound on the original loss function.
In summary, we have demonstrated that convex relaxation strategies are well applicable for bundle methods. In fact, the results of the corresponding optimization procedures can be used directly for further optimization steps.
(1/M) Σ_{y_i < y_j} C(y_i, y_j) I(⟨w, x_i⟩ > ⟨w, x_j⟩) where M = ½ (m² − Σ_i m_i²).  (3.19)
Using the same convex majorization as above when we were maximizing the
ROC score, we obtain an empirical risk of the form
R_emp(w) = (1/M) Σ_{y_i < y_j} C(y_i, y_j) max(0, 1 + ⟨w, x_i − x_j⟩)  (3.20)
Now the goal is to find an efficient algorithm for obtaining the number of
times when the individual losses are nonzero such as to compute both the
value and the gradient of Remp(w). The complication arises from the fact
that observations xi with label yi may appear in either side of the inequality
depending on whether yj < yi or yj > y i. This problem can be solved as
follows: sort f = Xw in ascending order and traverse it while keeping track
of how many items with a lower value yj are no more than 1 apart in terms
of their value of fi. This way we may compute the count statistics efficiently.
Algorithm 3.3 describes the details, generalizing the results of [Joa06]. Again,
its runtime is O(m log m), thus allowing for efficient computation.
(1/|P|) Σ_{(i,j)∈P} C(i, j) I(⟨w, x_i⟩ > ⟨w, x_j⟩)  (3.21)
5: Rescale C ← C/M
6: π ← {1, . . . , 2m} sorted in ascending order of c
7: for i = 1 to 2m do
8: j = πi mod m
9: if πi ≤ m then
10: for k = 1 to yj − 1 do
11: r ← r − C(k, yj)ukcj
12: gj ← gj − C(k, yj )u k
13: end for
14: lyj ← lyj + 1
15: else
16: for k = yj + 1 to n do
17: r ← r + C(yj, k)lkcj+m
18: gj ← gj + C(yj , k)lk
19: end for
20: uyj ← uyj − 1
21: end if
22: end for
23: g ← g^T X
R_emp(w) = (1/|P|) Σ_{(i,j)∈P} C(i, j) max(0, 1 + ⟨w, x_i⟩ − ⟨w, x_j⟩)  (3.22)
where ∂_w R_emp(w) = (1/|P|) Σ_{(i,j)∈P} C(i, j) · (0 if ⟨w, x_j − x_i⟩ ≥ 1, and x_i − x_j otherwise).  (3.23)
A3.1.3.4 Ranking
In webpage and document ranking we are often in a situation similar to that
described in Section A3.1.3.2, however with the difference that we do not
only care about objects xi being ranked according to scores yi but moreover
that different degrees of importance are placed on different documents.
The information retrieval literature is full with a large number of differ- ent
scoring functions. Examples are criteria such as Normalized Discounted
Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), Precision@n, or
Expected Rank Utility (ERU). They are used to address the issue of evaluat- ing
rankers, search engines or recommender sytems [Voo01, JK02, BHK98,
BH04]. For instance, in webpage ranking only the first k retrieved docu- ments
that matter, since users are unlikely to look beyond the first k, say10,
retrieved webpages in an internet search. [LS07] show that these scores can
be optimized directly by minimizing the following loss:
l(X, y, w) = max_π Σ_i c_i ⟨w, x_{π(i)} − x_i⟩ + ⟨a − a(π), b(y)⟩.  (3.24)
7: g = c(π⁻¹) − c and g ← g^T X
which can be easily solved by the Hungarian Marriage algorithm, that is,
the Kuhn-Munkres algorithm.
The original papers by [Kuh55] and [Mun57] implied an algorithm with O(m³) cost in the number of terms. Later, [Kar80] suggested an algorithm with expected quadratic time in the size of the assignment problem (ignoring log-factors). Finally, [OL93] propose a linear time algorithm for large problems. Since in our case the number of pages is fairly small (in the order of 50 to 200 per query) the scaling behavior per query is not too important. We used an existing implementation due to [JV87].
Note also that training sets consist of a collection of ranking problems, that is, we have several ranking problems of size 50 to 200. By means of parallelization we are able to distribute the work onto a cluster of workstations, which is able to overcome the issue of the rather costly computation per collection of queries. Algorithm 3.5 spells out the steps in detail.
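As an illustration, the inner maximization over permutations in (3.24) can be delegated to an off-the-shelf assignment solver; the construction of the score matrix is assumed to happen elsewhere, and SciPy's Hungarian-algorithm routine is used here in place of the implementation of [JV87].

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_permutation(scores):
    """Solve the linear assignment problem max_pi sum_i scores[i, pi(i)], where
    scores[i, j] is assumed to hold the gain of assigning position j to document i."""
    row, col = linear_sum_assignment(scores, maximize=True)
    pi = np.empty_like(col)
    pi[row] = col                      # pi[i] = position assigned to document i
    return pi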
           y > 0    y < 0
y' > 0     T+       F+
y' < 0     F−       T−
l(x, y, W) = max_{y'} Γ(y, y') ⟨W_{y'} − W_y, x⟩ + ∆(y, y')  (3.29)
Here Γ and ∆ are defined as in Section A3.1.2 and y* denotes the value of y'
for which the RHS of (3.29) is maximized. This means that for unstructured multiclass settings we may simply compute xW. Since this needs to be performed for all observations x_i we may take advantage of fast linear algebra routines and compute f = XW for efficiency. Likewise note that computing the gradient over m observations is now a matrix-matrix multiplication, too: denote by G the matrix of rows of gradients Γ(y_i, y_i*)(e_{y_i*} − e_{y_i}). Then ∂_W R_emp(X, y, W) = G^T X. Note that G is very sparse with at most two nonzero entries per row, which makes the computation of G^T X essentially as expensive as two matrix vector multiplications. Whenever we have many classes, this may yield significant computational gains.
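A sketch of this batched computation for the Γ(y, y') = 1 case follows; the label set is assumed to be {0, . . . , |Y| − 1} and ∆ to be given as a matrix, both of which are conventions chosen here for illustration.

import numpy as np

def multiclass_hinge_batch(X, y, W, Delta):
    """Compute f = XW, the per-example loss max_{y'} (f_{y'} - f_y + Delta(y, y')),
    and the gradient G^T X / m, where each row of G is e_{y*} - e_y."""
    m, _ = X.shape
    F = X @ W                                        # scores, one row per example
    margins = F - F[np.arange(m), y][:, None] + Delta[y]
    y_star = margins.argmax(axis=1)                  # maximiser of the loss
    losses = margins[np.arange(m), y_star]
    G = np.zeros_like(F)
    G[np.arange(m), y_star] += 1.0                   # +e_{y*}
    G[np.arange(m), y] -= 1.0                        # -e_y
    return losses.mean(), G.T @ X / m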
Log-likelihood scores of exponential families share similar expansions. We
have
l(x, y, W) = log Σ_{y'} exp(⟨w, φ(x, y')⟩) − ⟨w, φ(x, y)⟩ = log Σ_{y'} exp(⟨W_{y'}, x⟩) − ⟨W_y, x⟩  (3.31)
∂_W l(x, y, W) = [Σ_{y'} (e_{y'} ⊗ x) exp(⟨W_{y'}, x⟩)] / [Σ_{y'} exp(⟨W_{y'}, x⟩)] − e_y ⊗ x.  (3.32)
The main difference to the soft-margin setting is that the gradients are
not sparse in the number of classes. This means that the computation of
gradients is slightly more costly.
A3.1.4.2 Ontologies
Fig. A3.1. Two ontologies. Left: a binary hierarchy with internal nodes {1, . . . , 7} and labels {8, . . . , 15}. Right: a generic directed acyclic graph with internal nodes {1, . . . , 6, 12} and labels {7, . . . , 11, 13, . . . , 15}. Note that node 5 has two parents, namely nodes 2 and 3. Moreover, the labels need not be found at the same level of the tree: nodes 14 and 15 are one level lower than the rest of the nodes.