Understanding Machine Learning P2
Understanding Machine Learning P2
Understanding Machine Learning P2
Prediction Problems
sifiers, each of which discriminates between one class and the rest of the classes.
That is, given a training set S = (x1 , y1 ), . . . , (xm , ym ), where every yi is in Y, we
construct k binary training sets, S1 , . . . , Sk , where Si = (x1 , ( 1)1[y1 6=i] ), . . . , (xm , ( 1)1[ym 6=i] ).
In words, Si is the set of instances labeled 1 if their label in S was i, and 1
otherwise. For every i 2 [k] we train a binary predictor hi : X ! {±1} based on
Si , hoping that hi (x) should equal 1 if and only if x belongs to class i. Then,
given h1 , . . . , hk , we construct a multiclass predictor using the rule
When more than one binary hypothesis predicts “1” we should somehow decide
which class to predict (e.g., we can arbitrarily decide to break ties by taking the
minimal index in argmaxi hi (x)). A better approach can be applied whenever
each hi hides additional information, which can be interpreted as the confidence
in the prediction y = i. For example, this is the case in halfspaces, where the
actual prediction is sign(hw, xi), but we can interpret hw, xi as the confidence
in the prediction. In such cases, we can apply the multiclass rule given in Equa-
tion (17.1) on the real valued predictions. A pseudocode of the One-versus-All
approach is given in the following.
One-versus-All
input:
training set S = (x1 , y1 ), . . . , (xm , ym )
algorithm for binary classification A
foreach i 2 Y
let Si = (x1 , ( 1)1[y1 6=i] ), . . . , (xm , ( 1)1[ym 6=i] )
let hi = A(Si )
output:
the multiclass hypothesis defined by h(x) 2 argmaxi2Y hi (x)
All-Pairs
input:
training set S = (x1 , y1 ), . . . , (xm , ym )
algorithm for binary classification A
foreach i, j 2 Y s.t. i < j
initialize Si,j to be the empty sequence
for t = 1, . . . , m
If yt = i add (xt , 1) to Si,j
If yt = j add (xt , 1) to Si,j
let hi,j = A(Si,j )
output:
the multiclass hypothesis
⇣P defined by ⌘
h(x) 2 argmaxi2Y j2Y sign(j i) h i,j (x)
1 2 3
Suppose that the probability masses of classes 1, 2, 3 are 40%, 20%, and 40%,
respectively. Consider the application of One-versus-All to this problem, and as-
sume that the binary classification algorithm used by One-versus-All is ERM
with respect to the hypothesis class of halfspaces. Observe that for the prob-
lem of discriminating between class 2 and the rest of the classes, the optimal
halfspace would be the all negative classifier. Therefore, the multiclass predic-
tor constructed by One-versus-All might err on all the examples from class 2
(this will be the case if the tie in the definition of h(x) is broken by the nu-
⌘ label). In contrast, if we⇣ choose⌘ hi (x) = hwi , xi,
merical value ⇣of the class
where w1 = p1 , p1 , w2 = (0, 1), and w3 = p1 , p1 , then the classi-
2 2 2 2
fier defined by h(x) = argmaxi hi (x) perfectly predicts all the examples. We see
230 Multiclass, Ranking, and Complex Prediction Problems
that even though the approximation error of the class of predictors of the form
h(x) = argmaxi hwi , xi is zero, the One-versus-All approach might fail to find a
good predictor from this class.
That is, the prediction of h for the input x is the label that achieves the highest
weighted score, where weighting is according to the vector w.
Let W be some set of vectors in Rd , for example, W = {w 2 Rd : kwk B},
for some scalar B > 0. Each pair ( , W ) defines a hypothesis class of multiclass
predictors:
H ,W = {x 7! argmax hw, (x, y)i : w 2 W }.
y2Y
Chapter 16 and as we will discuss in more detail in Chapter 25). Two examples
of useful constructions are given in the following.
(x, y) = [ 0, . . . , 0 , x1 , . . . , xn , 0, . . . , 0 ]. (17.2)
| {z } | {z } | {z }
2R(y 1)n 2Rn 2R(k y)n
w3 w4
TF-IDF:
The previous definition of (x, y) does not incorporate any prior knowledge
about the problem. We next describe an example of a feature function that
does incorporate prior knowledge. Let X be a set of text documents and Y be a
set of possible topics. Let d be a size of a dictionary of words. For each word in the
dictionary, whose corresponding index is j, let T F (j, x) be the number of times
the word corresponding to j appears in the document x. This quantity is called
Term-Frequency. Additionally, let DF (j, y) be the number of times the word
corresponding to j appears in documents in our training set that are not about
topic y. This quantity is called Document-Frequency and measures whether word
j is frequent in other topics. Now, define : X ⇥ Y ! Rd to be such that
⇣ ⌘
m
j (x, y) = T F (j, x) log DF (j,y) ,
where m is the total number of documents in our training set. The preced-
ing quantity is called term-frequency-inverse-document-frequency or TF-IDF for
232 Multiclass, Ranking, and Complex Prediction Problems
17.2.3 ERM
We have defined the hypothesis class H ,W and specified a loss function . To
learn the class with respect to the loss function, we can apply the ERM rule with
respect to this class. That is, we search for a multiclass hypothesis h 2 H ,W ,
parameterized by a vector w, that minimizes the empirical risk with respect to
,
m
1 X
LS (h) = (h(xi ), yi ).
m i=1
We now show that when W = Rd and we are in the realizable case, then it is
possible to solve the ERM problem efficiently using linear programming. Indeed,
in the realizable case, we need to find a vector w 2 Rd that satisfies
8i 2 [m], yi = argmaxhw, (xi , y)i.
y2Y
Equivalently, we need that w will satisfy the following set of linear inequalities
8i 2 [m], 8y 2 Y \ {yi }, hw, (xi , yi )i > hw, (xi , y)i.
Finding w that satisfies the preceding set of linear equations amounts to solving
a linear program.
As in the case of binary classification, it is also possible to use a generalization
of the Perceptron algorithm for solving the ERM problem. See Exercise 2.
In the nonrealizable case, solving the ERM problem is in general computa-
tionally hard. We tackle this difficulty using the method of convex surrogate
17.2 Linear Multiclass Predictors 233
loss functions (see Section 12.3). In particular, we generalize the hinge loss to
multiclass problems.
Recall that a surrogate convex loss should upper bound the original nonconvex
loss, which in our case is (hw (x), y). To derive an upper bound on (hw (x), y)
we first note that the definition of hw (x) implies that
Therefore,
Since hw (x) 2 Y we can upper bound the right-hand side of the preceding by
def
max
0
( (y 0 , y) + hw, (x, y 0 ) (x, y)i) = `(w, (x, y)). (17.3)
y 2Y
We use the term “generalized hinge loss” to denote the preceding expression. As
we have shown, `(w, (x, y)) (hw (x), y). Furthermore, equality holds when-
ever the score of the correct label is larger than the score of any other label, y 0 ,
by at least (y 0 , y), namely,
It is also immediate to see that `(w, (x, y)) is a convex function with respect to w
since it is a maximum over linear functions of w (see Claim 12.5 in Chapter 12),
and that `(w, (x, y)) is ⇢-Lipschitz with ⇢ = maxy0 2Y k (x, y 0 ) (x, y)k.
Remark 17.2 We use the name “generalized hinge loss” since in the binary
case, when Y = {±1}, if we set (x, y) = yx 2 , then the generalized hinge loss
becomes the vanilla hinge loss for binary classification,
Geometric Intuition:
The feature function : X ⇥ Y ! Rd maps each x into |Y| vectors in Rd .
The value of `(w, (x, y)) will be zero if there exists a direction w such that
when projecting the |Y| vectors onto this direction we obtain that each vector is
represented by the scalar hw, (x, y)i, and we can rank the di↵erent points on
the basis of these scalars so that
• For each y 0 6= y, the di↵erence between hw, (x, y)i and hw, (x, y 0 )i is larger
than the loss of predicting y 0 instead of y. The di↵erence hw, (x, y)i
hw, (x, y 0 )i is also referred to as the “margin” (see Section 15.1).
This is illustrated in the following figure:
w
(x, y)
)
00
,y
(x, y 00 )
(y
(y
,y 0
)
(x, y 0 )
We can solve the optimization problem associated with multiclass SVM us-
ing generic convex optimization algorithms (or using the method described in
Section 15.5). Let us analyze the risk of the resulting hypothesis. The analysis
seamlessly follows from our general analysis for convex-Lipschitz problems given
in Chapter 13. In particular, applying Corollary 13.8 and using the fact that the
generalized hinge loss upper bounds the loss, we immediately obtain an analog
of Corollary 15.7:
corollary 17.1 Let D be a distribution over X ⇥ Y, let : X ⇥ Y ! Rd ,
and assume that for all x 2 X and y 2 Y we have k (x, y)k ⇢/2. Let B > 0.
17.2 Linear Multiclass Predictors 235
q
2
Consider running Multiclass SVM with = B2⇢2 m on a training set S ⇠ Dm
and let hw be the output of Multiclass SVM. Then,
r
g hinge g hinge 8⇢2 B 2
E m [LD (hw )] E m [LD (w)] min LD (u) + ,
S⇠D S⇠D u:kukB m
where LD (h) = E(x,y)⇠D [ (h(x), y)] and LgD hinge (w) = E(x,y)⇠D [`(w, (x, y))]
with ` being the generalized hinge-loss as defined in Equation (17.3).
We can also apply the SGD learning framework for minimizing LgD hinge (w) as
described in Chapter 14. Recall Claim 14.6, which dealt with subgradients of max
functions. In light of this claim, in order to find a subgradient of the generalized
hinge loss all we need to do is to find y 2 Y that achieves the maximum in the
definition of the generalized hinge loss. This yields the following algorithm:
SGD for Multiclass Learning
parameters:
Scalar ⌘ > 0, integer T > 0
loss function : Y ⇥ Y ! R+
class-sensitive feature mapping : X ⇥ Y ! Rd
initialize: w(1) = 0 2 Rd
for t = 1, 2, . . . , T
sample (x, y) ⇠ D
find ŷ 2 argmaxy0 2Y (y 0 , y) + hw(t) , (x, y 0 ) (x, y)i
set vt = (x, ŷ) (x, y)
update w(t+1) = w(t) ⌘vt
PT
output w̄ = T1 t=1 w(t)
Remark 17.3 It is interesting to note that the risk bounds given in Corol-
lary 17.1 and Corollary 17.2 do not depend explicitly on the size of the label
set Y, a fact we will rely on in the next section. However, the bounds may de-
pend implicitly on the size of Y via the norm of (x, y) and the fact that the
bounds are meaningful only when there exists some vector u, kuk B, for which
LgD hinge (u) is not excessively large.
236 Multiclass, Ranking, and Complex Prediction Problems
In the previous section we have already shown that the sample complexity of
learning a linear multiclass predictor does not depend explicitly on the number
of classes. We just need to make sure that the norm of the range of is not too
large. This will take care of the overfitting problem. To tackle the computational
challenges we rely on the structure of the problem, and define the functions and
so that calculating the maximization problems in the definition of hw and in
the SGD algorithm can be performed efficiently. In the following we demonstrate
one way to achieve these goals for the OCR task mentioned previously.
To simplify the presentation, let us assume that all the words in Y are of length
r and that the number of di↵erent letters in our alphabet is q. Let y and y0 be two
17.3 Structured Output Prediction 237
That is, we sum the value of the i’th pixel only over the images for which y
assigns the letter j. The triple index (i, j, 1) indicates that we are dealing with
feature (i, j) of type 1. Intuitively, such features can capture pixels in the image
whose gray level values are indicative of a certain letter. The second type of
features take the form
r
1X
i,j,2 (x, y) = 1[y =i] 1[yt 1 =j]
.
r t=2 t
That is, we sum the number of times the letter i follows the letter j. Intuitively,
these features can capture rules like “It is likely to see the pair ‘qu’ in a word”
or “It is unlikely to see the pair ‘rz’ in a word.” Of course, some of these features
will not be very useful, so the goal of the learning process is to assign weights to
features by learning the vector w, so that the weighted score will give us a good
prediction via
hw (x) = argmax hw, (x, y)i.
y2Y
Clearly, the maximum of hw, (x, y)i equals maxs Ms,r . Furthermore, we can
calculate M in a recursive manner:
Ms,⌧ = max
0
(Ms0 ,⌧ 1 + hw, (x, s, s0 )i) . (17.5)
s
17.4 Ranking
We next turn to describe loss functions for ranking. There are many possible ways
to define such loss functions, and here we list a few examples. In all the examples
S1
we define `(h, (x̄, y)) = (h(x̄), y), for some function : r=1 (Rr ⇥ Rr ) ! R+ .
• 0–1 Ranking loss: (y0 , y) is zero if y and y0 induce exactly the same
ranking and (y0 , y) = 1 otherwise. That is, (y0 , y) = 1[⇡(y0 )6=⇡(y)] . Such
a loss function is almost never used in practice as it does not distinguish
between the case in which ⇡(y0 ) is almost equal to ⇡(y) and the case in
which ⇡(y0 ) is completely di↵erent from ⇡(y).
• Kendall-Tau Loss: We count the number of pairs (i, j) that are in di↵erent
order in the two permutations. This can be written as
r 1 X
X r
2
(y0 , y) = 1[sign(yi0 yj0 )6=sign(yi yj )] .
r(r 1) i=1 j=i+1
This loss function is more useful than the 0–1 loss as it reflects the level of
similarity between the two rankings.
• Normalized Discounted Cumulative Gain (NDCG): This measure em-
phasizes the correctness at the top of the list by using a monotonically
nondecreasing discount function D : N ! R+ . We first define a discounted
cumulative gain measure:
r
X
G(y0 , y) = D(⇡(y0 )i ) yi .
i=1
We can easily see that (y0 , y) 2 [0, 1] and that (y0 , y) = 0 whenever
⇡(y0 ) = ⇡(y).
A typical way to define the discount function is by
(
1
if i 2 {r k + 1, . . . , r}
D(i) = log2 (r i+2)
0 otherwise
where k < r is a parameter. This means that we care more about elements
that are ranked higher, and we completely ignore elements that are not at
the top-k ranked elements. The NDCG measure is often used to evaluate
the performance of search engines since in such applications it makes sense
completely to ignore elements that are not at the top of the ranking.
Once we have a hypothesis class and a ranking loss function, we can learn a
ranking function using the ERM rule. However, from the computational point of
view, the resulting optimization problem might be hard to solve. We next discuss
how to learn linear predictors for ranking.
As we discussed in Chapter 16, we can also apply a feature mapping that maps
instances into some feature space and then takes the inner products with w in the
feature space. For simplicity, we focus on the simpler form as in Equation (17.6).
Given some W ⇢ Rd , we can now define the hypothesis class HW = {hw :
w 2 W }. Once we have defined this hypothesis class, and have chosen a ranking
loss function, we can apply the ERM rule as follows: Given a training set, S =
(x̄1 , y1 ), . . . , (x̄m , ym ), where each (x̄i , yi ) is in (X ⇥ R)ri , for some ri 2 N, we
Pm
should search w 2 W that minimizes the empirical loss, i=1 (hw (x̄i ), yi ).
As in the case of binary classification, for many loss functions this problem is
computationally hard, and we therefore turn to describe convex surrogate loss
functions. We describe the surrogates for the Kendall tau loss and for the NDCG
loss.
In our case, yi0 yj0 = hw, xi xj i. It follows that we can use the hinge loss upper
bound as follows:
1[sign(yi yj )(yi0 yj0 )0] max {0, 1 sign (yi yj ) hw, xi xj i} .
Taking the average over the pairs we obtain the following surrogate convex loss
for the Kendall tau loss function:
r 1 X
X r
2
(hw (x̄), y) max {0, 1 sign(yi yj ) hw, xi xj i} .
r(r 1) i=1 j=i+1
The right-hand side is convex with respect to w and upper bounds the Kendall
tau loss. It is also a ⇢-Lipschitz function with parameter ⇢ maxi,j kxi xj k.
On the basis of this observation, we can use the generalized hinge loss for cost-
sensitive multiclass classification as a surrogate loss function for the NDCG loss
as follows:
(hw (x̄), y) (hw (x̄), y) + hw, (x̄, ⇡(hw (x̄)))i hw, (x̄, ⇡(y))i
max [ (v, y) + hw, (x̄, v)i hw, (x̄, ⇡(y))i]
v2V
" r
#
X
= max (v, y) + (vi ⇡(y)i ) hw, xi i . (17.8)
v2V
i=1
where ↵i = hw, xi i and i = yi /G(y, y). We can think of this problem a little
bit di↵erently by defining a matrix A 2 Rr,r where
Now, let us think about each j as a “worker,” each i as a “task,” and Ai,j as
the cost of assigning task i to worker j. With this view, the problem of finding
v becomes the problem of finding an assignment of the tasks to workers of
minimal cost. This problem is called “the assignment problem” and can be solved
efficiently. One particular algorithm is the “Hungarian method” (Kuhn 1955).
Another way to solve the assignment problem is using linear programming. To
do so, let us first write the assignment problem as
r
X
argmin Ai,j Bi,j (17.9)
B2Rr,r
+ i,j=1
r
X
s.t. 8i 2 [r], Bi,j = 1
j=1
Xr
8j 2 [r], Bi,j = 1
i=1
The following claim states that every doubly stochastic matrix is a convex
combination of permutation matrices.
claim 17.3 ((Birkho↵ 1946, Von Neumann 1953)) The set of doubly stochastic
matrices in Rr,r is the convex hull of the set of permutation matrices in Rr,r .
On the basis of the claim, we easily obtain the following:
lemma 17.4 There exists an optimal solution of Equation (17.10) that is also
an optimal solution of Equation (17.9).
Proof Let B be a solution of Equation (17.10). Then, by Claim 17.3, we can
P
write B = i i Ci , where each Ci is a permutation matrix, each i > 0, and
P
i i = 1. Since all the Ci are also doubly stochastic, we clearly have that
hA, Bi hA, Ci i for every i. We claim that there is some i for which hA, Bi =
hA, Ci i. This must be true since otherwise, if for every i hA, Bi < hA, Ci i, we
would have that
* +
X X X
hA, Bi = A, i Ci = i hA, Ci i > i hA, Bi = hA, Bi,
i i i
which cannot hold. We have thus shown that some permutation matrix, Ci ,
satisfies hA, Bi = hA, Ci i. But, since for every other permutation matrix C we
have hA, Bi hA, Ci we conclude that Ci is an optimal solution of both Equa-
tion (17.9) and Equation (17.10).
problem stems from the inadequacy of the zero-one loss for what we are really
interested in. A more adequate performance measure should take into account
the predictions over the entire set of instances. For example, in the previous
section we have defined the NDCG loss, which emphasizes the correctness of the
top-ranked items. In this section we describe additional loss functions that are
specifically adequate for bipartite ranking problems.
As in the previous section, we are given a sequence of instances, x̄ = (x1 , . . . , xr ),
and we predict a ranking vector y0 2 Rr . The feedback vector is y 2 {±1}r . We
define a loss that depends on y0 and y and depends on a threshold ✓ 2 R. This
threshold transforms the vector y0 2 Rr into the vector (sign(yi0 ✓), . . . , sign(yr0
✓)) 2 {±1}r . Usually, the value of ✓ is set to be 0. However, as we will see, we
sometimes set ✓ while taking into account additional constraints on the problem.
The loss functions we define in the following depend on the following 4 num-
bers:
True positives: a = |{i : yi = +1 ^ sign(yi0 ✓) = +1}|
False positives: b = |{i : yi = 1^ sign(yi0 ✓) = +1}|
(17.11)
False negatives: c = |{i : yi = +1 ^ sign(yi0 ✓) = 1}|
True negatives: d = |{i : yi = 1^ sign(yi0 ✓) = 1}|
2
F = (1+ (1+ )a
2 )a+b+ 2 c . Again, we set ✓ = 0, and the loss function becomes
(y0 , y) = 1 F .
• Recall at k: We measure the recall while the prediction must contain at most
k positive labels. That is, we should set ✓ so that a + b k. This is conve-
nient, for example, in the application of a fraud detection system, where a
bank employee can only handle a small number of suspicious transactions.
• Precision at k: We measure the precision while the prediction must contain
at least k positive labels. That is, we should set ✓ so that a + b k.
This is clearly true for the case ✓ = 0 if we choose V = {±1}r . The two measures
for which ✓ is not taken to be 0 are precision at k and recall at k. For precision
at k we can take V to be the set V k , containing all vectors in {±1}r whose
number of ones is at least k. For recall at k, we can take V to be Vk , which is
defined analogously. See Exercise 5.
246 Multiclass, Ranking, and Complex Prediction Problems
Any vector v 2 V falls into Ȳa,b for some a, b 2 [r]. Furthermore, if Ȳa,b \ V
is not empty for some a, b 2 [r] then Ȳa,b \ V = Ȳa,b . Therefore, we can search
within each Ȳa,b that has a nonempty intersection with V separately, and then
take the optimal value. The key observation is that once we are searching only
within Ȳa,b , the value of is fixed so we only need to maximize the expression
r
X
max vi hw, xi i.
v2Ȳa,b
i=1
Suppose the examples are sorted so that hw, x1 i ··· hw, xr i. Then, it is
easy to verify that we would like to set vi to be positive for the smallest indices
i. Doing this, with the constraint on a, b, amounts to setting vi = 1 for the a
top ranked positive examples and for the b top-ranked negative examples. This
yields the following procedure.
17.6 Summary 247
17.6 Summary
Many real world supervised learning problems can be cast as learning a multiclass
predictor. We started the chapter by introducing reductions of multiclass learning
to binary learning. We then described and analyzed the family of linear predictors
for multiclass learning. We have shown how this family can be used even if the
number of classes is extremely large, as long as we have an adequate structure
on the problem. Finally, we have described ranking problems. In Chapter 29 we
study the sample complexity of multiclass learning in more detail.
The One-versus-All and All-Pairs approach reductions have been unified un-
der the framework of Error Correction Output Codes (ECOC) (Dietterich &
Bakiri 1995, Allwein, Schapire & Singer 2000). There are also other types of re-
ductions such as tree-based classifiers (see, for example, Beygelzimer, Langford
& Ravikumar (2007)). The limitations of reduction techniques have been studied
248 Multiclass, Ranking, and Complex Prediction Problems
in (Daniely et al. 2011, Daniely, Sabato & Shwartz 2012). See also Chapter 29,
in which we analyze the sample complexity of multiclass learning.
Direct approaches to multiclass learning with linear predictors have been stud-
ied in (Vapnik 1998, Weston & Watkins 1999, Crammer & Singer 2001). In par-
ticular, the multivector construction is due to Crammer & Singer (2001).
Collins (2000) has shown how to apply the Perceptron algorithm for structured
output problems. See also Collins (2002). A related approach is discriminative
learning of conditional random fields; see La↵erty, McCallum & Pereira (2001).
Structured output SVM has been studied in (Weston, Chapelle, Vapnik, Elissee↵
& Schölkopf 2002, Taskar, Guestrin & Koller 2003, Tsochantaridis, Hofmann,
Joachims & Altun 2004).
The dynamic procedure we have presented for calculating the prediction hw (x)
in the structured output section is similar to the forward-backward variables
calculated by the Viterbi procedure in HMMs (see, for instance, (Rabiner &
Juang 1986)). More generally, solving the maximization problem in structured
output is closely related to the problem of inference in graphical models (see, for
example, Koller & Friedman (2009)).
Chapelle, Le & Smola (2007) proposed to learn a ranking function with respect
to the NDCG loss using ideas from structured output learning. They also ob-
served that the maximization problem in the definition of the generalized hinge
loss is equivalent to the assignment problem.
Agarwal & Roth (2005) analyzed the sample complexity of bipartite ranking.
Joachims (2005) studied the applicability of structured output SVM to bipartite
ranking with multivariate performance measures.
17.8 Exercises
(x, y) = [ 0, . . . , 0 , x1 , . . . , xn , 1 , 0, . . . , 0 ].
| {z } | {z } | {z }
2R(y 1)(n+1) 2Rn+1 2R(k y)(n+1)
Show that there exists a vector w 2 Rk(n+1) such that `(w, (x, y)) = 0 for
every (x, y) 2 S.
Hint: Observe that for every example (x, y) 2 S we can write x = µy + v for
some kvk r. Now, take w = [w1 , . . . , wk ], where wi = [µi , kµi k2 /2].
2. Multiclass Perceptron: Consider the following algorithm:
17.8 Exercises 249
Color?
pale green to pale yellow
other
not-tasty Softness?
not-tasty tasty
To check if a given papaya is tasty or not, the decision tree first examines
the color of the Papaya. If this color is not in the range pale green to pale
yellow, then the tree immediately predicts that the papaya is not tasty without
additional tests. Otherwise, the tree turns to examine the softness of the papaya.
If the softness level of the papaya is such that it gives slightly to palm pressure,
the decision tree predicts that the papaya is tasty. Otherwise, the prediction is
“not-tasty.” The preceding example underscores one of the main advantages of
decision trees – the resulting classifier is very simple to understand and interpret.
A popular splitting rule at internal nodes of the tree is based on thresholding the
value of a single feature. That is, we move to the right or left child of the node on
the basis of 1[xi <✓] , where i 2 [d] is the index of the relevant feature and ✓ 2 R
is the threshold. In such cases, we can think of a decision tree as a splitting of
the instance space, X = Rd , into cells, where each leaf of the tree corresponds
to one cell. It follows that a tree with k leaves can shatter a set of k instances.
Hence, if we allow decision trees of arbitrary size, we obtain a hypothesis class
of infinite VC dimension. Such an approach can easily lead to overfitting.
To avoid overfitting, we can rely on the minimum description length (MDL)
principle described in Chapter 7, and aim at learning a decision tree that on one
hand fits the data well while on the other hand is not too large.
For simplicity, we will assume that X = {0, 1}d . In other words, each instance
is a vector of d bits. In that case, thresholding the value of a single feature
corresponds to a splitting rule of the form 1[xi =1] for some i = [d]. For instance,
we can model the “papaya decision tree” earlier by assuming that a papaya is
parameterized by a two-dimensional bit vector x 2 {0, 1}2 , where the bit x1
represents whether the color is pale green to pale yellow or not, and the bit x2
represents whether the softness is gives slightly to palm pressure or not. With
this representation, the node Color? can be replaced with 1[x1 =1] , and the node
Softness? can be replaced with 1[x2 =1] . While this is a big simplification, the
algorithms and analysis we provide in the following can be extended to more
general cases.
With the aforementioned simplifying assumption, the hypothesis class becomes
finite, but is still very large. In particular, any classifier from {0, 1}d to {0, 1}
can be represented by a decision tree with 2d leaves and depth of d + 1 (see
Exercise 1). Therefore, the VC dimension of the class is 2d , which means that
the number of examples we need to PAC learn the hypothesis class grows with
2d . Unless d is very small, this is a huge number of examples.
To overcome this obstacle, we rely on the MDL scheme described in Chapter 7.
The underlying prior knowledge is that we should prefer smaller trees over larger
trees. To formalize this intuition, we first need to define a description language
for decision trees, which is prefix free and requires fewer bits for smaller decision
trees. Here is one possible way: A tree with n nodes will be described in n + 1
blocks, each of size log2 (d + 3) bits. The first n blocks encode the nodes of the
tree, in a depth-first order (preorder), and the last block marks the end of the
code. Each block indicates whether the current node is:
Overall, there are d + 3 options, hence we need log2 (d + 3) bits to describe each
block.
Assuming each internal node has two children,1 it is not hard to show that
this is a prefix-free encoding of the tree, and that the description length of a tree
with n nodes is (n + 1) log2 (d + 3).
By Theorem 7.7 we have that with probability of at least 1 over a sample
of size m, for every n and every decision tree h 2 H with n nodes it holds that
r
(n + 1) log2 (d + 3) + log(2/ )
LD (h) LS (h) + . (18.1)
2m
This bound performs a tradeo↵: on the one hand, we expect larger, more complex
decision trees to have a smaller training risk, LS (h), but the respective value of
n will be larger. On the other hand, smaller decision trees will have a smaller
value of n, but LS (h) might be larger. Our hope (or prior knowledge) is that we
can find a decision tree with both low empirical risk, LS (h), and a number of
nodes n not too high. Our bound indicates that such a tree will have low true
risk, LD (h).
The bound on LD (h) given in Equation (18.1) suggests a learning rule for decision
trees – search for a tree that minimizes the right-hand side of Equation (18.1).
Unfortunately, it turns out that solving this problem is computationally hard.2
Consequently, practical decision tree learning algorithms are based on heuristics
such as a greedy approach, where the tree is constructed gradually, and locally
optimal decisions are made at the construction of each node. Such algorithms
cannot guarantee to return the globally optimal decision tree but tend to work
reasonably well in practice.
A general framework for growing a decision tree is as follows. We start with
a tree with a single leaf (the root) and assign this leaf a label according to a
majority vote among all labels over the training set. We now perform a series of
iterations. On each iteration, we examine the e↵ect of splitting a single leaf. We
define some “gain” measure that quantifies the improvement due to this split.
Then, among all possible splits, we either choose the one that maximizes the
gain and perform it, or choose not to split the leaf at all.
In the following we provide a possible implementation. It is based on a popular
decision tree algorithm known as “ID3” (short for “Iterative Dichotomizer 3”).
We describe the algorithm for the case of binary features, namely, X = {0, 1}d ,
1 We may assume this without loss of generality, because if a decision node has only one
child, we can replace the node by its child without a↵ecting the predictions of the decision
tree.
2 More precisely, if NP6=P then no algorithm can solve Equation (18.1) in time polynomial
in n, d, and m.
18.2 Decision Tree Algorithms 253
and therefore all splitting rules are of the form 1[xi =1] for some feature i 2 [d].
We discuss the case of real valued features in Section 18.2.3.
The algorithm works by recursive calls, with the initial call being ID3(S, [d]),
and returns a decision tree. In the pseudocode that follows, we use a call to a
procedure Gain(S, i), which receives a training set S and an index i and evaluates
the gain of a split of the tree according to the ith feature. We describe several
gain measures in Section 18.2.1.
ID3(S, A)
Input: training set S, feature subset A ✓ [d]
if all examples in S are labeled by 1, return a leaf 1
if all examples in S are labeled by 0, return a leaf 0
if A = ;, return a leaf whose value = majority of labels in S
else :
Let j = argmaxi2A Gain(S, i)
if all examples in S have the same label
Return a leaf whose value = majority of labels in S
else
Let T1 be the tree returned by ID3({(x, y) 2 S : xj = 1}, A \ {j}).
Let T2 be the tree returned by ID3({(x, y) 2 S : xj = 0}, A \ {j}).
Return the tree:
xj = 1?
T2 T1
Therefore, we can define Gain to be the di↵erence between the two, namely,
Gain(S, i) := C(P[y = 1])
S
⇣ ⌘
P[xi = 1] C(P[y = 1|xi = 1]) + P[xi = 0]C(P[y = 1|xi = 0]) .
S S S S
254 Decision Trees
Information Gain: Another popular gain measure that is used in the ID3
and C4.5 algorithms of Quinlan (1993) is the information gain. The information
gain is the di↵erence between the entropy of the label before and after the split,
and is achieved by replacing the function C in the previous expression by the
entropy function,
C(a) = a log(a) (1 a) log(1 a).
Gini Index: Yet another definition of a gain, which is used by the CART
algorithm of Breiman, Friedman, Olshen & Stone (1984), is the Gini index,
C(a) = 2a(1 a).
Both the information gain and the Gini index are smooth and concave upper
bounds of the train error. These properties can be advantageous in some situa-
tions (see, for example, Kearns & Mansour (1996)).
18.2.2 Pruning
The ID3 algorithm described previously still su↵ers from a big problem: The
returned tree will usually be very large. Such trees may have low empirical risk,
but their true risk will tend to be high – both according to our theoretical
analysis, and in practice. One solution is to limit the number of iterations of ID3,
leading to a tree with a bounded number of nodes. Another common solution is
to prune the tree after it is built, hoping to reduce it to a much smaller tree,
but still with a similar empirical error. Theoretically, according to the bound in
Equation (18.1), if we can make n much smaller without increasing LS (h) by
much, we are likely to get a decision tree with a smaller true risk.
Usually, the pruning is performed by a bottom-up walk on the tree. Each node
might be replaced with one of its subtrees or with a leaf, based on some bound
or estimate of LD (h) (for example, the bound in Equation (18.1)). A pseudocode
of a common template is given in the following.
Generic Tree Pruning Procedure
input:
function f (T, m) (bound/estimate for the generalization error
of a decision tree T , based on a sample of size m),
tree T .
foreach node j in a bottom-up walk on T (from leaves to root):
find T 0 which minimizes f (T 0 , m), where T 0 is any of the following:
the current tree after replacing node j with a leaf 1.
the current tree after replacing node j with a leaf 0.
the current tree after replacing node j with its left subtree.
the current tree after replacing node j with its right subtree.
the current tree.
let T := T 0 .
18.3 Random Forests 255
As mentioned before, the class of decision trees of arbitrary size has infinite VC
dimension. We therefore restricted the size of the decision tree. Another way
to reduce the danger of overfitting is by constructing an ensemble of trees. In
particular, in the following we describe the method of random forests, introduced
by Breiman (2001).
A random forest is a classifier consisting of a collection of decision trees, where
each tree is constructed by applying an algorithm A on the training set S and
an additional random vector, ✓, where ✓ is sampled i.i.d. from some distribution.
The prediction of the random forest is obtained by a majority vote over the
predictions of the individual trees.
To specify a particular random forest, we need to define the algorithm A and
the distribution over ✓. There are many ways to do this and here we describe one
particular option. We generate ✓ as follows. First, we take a random subsample
from S with replacements; namely, we sample a new training set S 0 of size m0
using the uniform distribution over S. Second, we construct a sequence I1 , I2 , . . .,
where each It is a subset of [d] of size k, which is generated by sampling uniformly
at random elements from [d]. All these random variables form the vector ✓. Then,
256 Decision Trees
the algorithm A grows a decision tree (e.g., using the ID3 algorithm) based on
the sample S 0 , where at each splitting stage of the algorithm, the algorithm is
restricted to choosing a feature that maximizes Gain from the set It . Intuitively,
if k is small, this restriction may prevent overfitting.
18.4 Summary
Many algorithms for learning decision trees (such as ID3 and C4.5) have been
derived by Quinlan (1986). The CART algorithm is due to Breiman et al. (1984).
Random forests were introduced by Breiman (2001). For additional reading we
refer the reader to (Hastie, Tibshirani & Friedman 2001, Rokach 2007).
The proof of the hardness of training decision trees is given in Hyafil & Rivest
(1976).
18.6 Exercises
1. 1. Show that any binary classifier h : {0, 1}d 7! {0, 1} can be implemented
as a decision tree of height at most d + 1, with internal nodes of the form
(xi = 0?) for some i 2 {1, . . . , d}.
2. Conclude that the VC dimension of the class of decision trees over the
domain {0, 1}d is 2d .
2. (Suboptimality of ID3)
Consider the following training set, where X = {0, 1}3 and Y = {0, 1}:
((1, 1, 1), 1)
((1, 0, 0), 1)
((1, 1, 0), 0)
((0, 0, 1), 0)
Suppose we wish to use this training set in order to build a decision tree of
depth 2 (i.e., for each input we are allowed to ask two questions of the form
(xi = 0?) before deciding on the label).
18.6 Exercises 257
1. Suppose we run the ID3 algorithm up to depth 2 (namely, we pick the root
node and its children according to the algorithm, but instead of keeping
on with the recursion, we stop and pick leaves according to the majority
label in each subtree). Assume that the subroutine used to measure the
quality of each feature is based on the entropy function (so we measure the
information gain), and that if two features get the same score, one of them
is picked arbitrarily. Show that the training error of the resulting decision
tree is at least 1/4.
2. Find a decision tree of depth 2 that attains zero training error.
19 Nearest Neighbor
Nearest Neighbor algorithms are among the simplest of all machine learning
algorithms. The idea is to memorize the training set and then to predict the
label of any new instance on the basis of the labels of its closest neighbors in
the training set. The rationale behind such a method is based on the assumption
that the features that are used to describe the domain points are relevant to
their labelings in a way that makes close-by points likely to have the same label.
Furthermore, in some situations, even when the training set is immense, finding
a nearest neighbor can be done extremely fast (for example, when the training
set is the entire Web and distances are based on links).
Note that, in contrast with the algorithmic paradigms that we have discussed
so far, like ERM, SRM, MDL, or RLM, that are determined by some hypothesis
class, H, the Nearest Neighbor method figures out a label on any test point
without searching for a predictor within some predefined class of functions.
In this chapter we describe Nearest Neighbor methods for classification and
regression problems. We analyze their performance for the simple case of binary
classification and discuss the efficiency of implementing these methods.
Throughout the entire chapter we assume that our instance domain, X , is en-
dowed with a metric function ⇢. That is, ⇢ : X ⇥X ! R is a function that returns
the distance between any two elements of X . For example,
qP if X = Rd then ⇢ can
d
be the Euclidean distance, ⇢(x, x0 ) = kx x0 k = i=1 (xi x0i )2 .
Let S = (x1 , y1 ), . . . , (xm , ym ) be a sequence of training examples. For each
x 2 X , let ⇡1 (x), . . . , ⇡m (x) be a reordering of {1, . . . , m} according to their
distance to x, ⇢(x, xi ). That is, for all i < m,
For a number k, the k-NN rule for binary classification is defined as follows:
Figure 19.1 An illustration of the decision boundaries of the 1-NN rule. The points
depicted are the sample points, and the predicted label of any new point will be the
label of the sample point in the center of the cell it belongs to. These cells are called a
Voronoi Tessellation of the space.
k-NN
input: a training sample S = (x1 , y1 ), . . . , (xm , ym )
output: for every point x 2 X ,
return the majority label among {y⇡i (x) : i k}
It is easy to verify that we can cast the prediction by majority of labels (for
classification) or by the averaged target (for regression) as in Equation (19.1) by
an appropriate choice of . The generality can lead to other rules; for example, if
Y = R, we can take a weighted average of the targets according to the distance
from x:
Xk
⇢(x, x⇡i (x) )
hS (x) = Pk y⇡i (x) .
i=1 j=1 ⇢(x, x⇡j (x) )
19.2 Analysis
Since the NN rules are such natural learning methods, their generalization prop-
erties have been extensively studied. Most previous results are asymptotic con-
sistency results, analyzing the performance of NN rules when the sample size, m,
260 Nearest Neighbor
goes to infinity, and the rate of convergence depends on the underlying distribu-
tion. As we have argued in Section 7.4, this type of analysis is not satisfactory.
One would like to learn from finite training samples and to understand the gen-
eralization performance as a function of the size of such finite training sets and
clear prior assumptions on the data distribution. We therefore provide a finite-
sample analysis of the 1-NN rule, showing how the error decreases as a function
of m and how it depends on properties of the distribution. We will also explain
how the analysis can be generalized to k-NN rules for arbitrary values of k. In
particular, the analysis specifies the number of examples required to achieve a
true error of 2LD (h? ) + ✏, where h? is the Bayes optimal hypothesis, assuming
that the labeling rule is “well behaved” (in a sense we will define later).
Recall that the Bayes optimal rule (that is, the hypothesis that minimizes LD (h)
over all functions) is
h? (x) = 1[⌘(x)>1/2] .
lemma 19.1 Let X = [0, 1]d , Y = {0, 1}, and D be a distribution over X ⇥ Y
for which the conditional probability function, ⌘, is a c-Lipschitz function. Let
S = (x1 , y1 ), . . . , (xm , ym ) be an i.i.d. sample and let hS be its corresponding
1-NN hypothesis. Let h? be the Bayes optimal rule for ⌘. Then,
Proof Since LD (hS ) = E(x,y)⇠D [1[hS (x)6=y] ], we obtain that ES [LD (hS )] is the
probability to sample a training set S and an additional example (x, y), such
that the label of ⇡1 (x) is di↵erent from y. In other words, we can first sample
m unlabeled examples, Sx = (x1 , . . . , xm ), according to DX , and an additional
unlabeled example, x ⇠ DX , then find ⇡1 (x) to be the nearest neighbor of x in
Sx , and finally sample y ⇠ ⌘(x) and y⇡1 (x) ⇠ ⌘(⇡1 (x)). It follows that
E[LD (hS )] = E
m ,x⇠D ,y⇠⌘(x),y 0 ⇠⌘(⇡ (x))
[1[y6=y0 ] ]
S Sx ⇠DX X 1
= E
m ,x⇠D
P [y 6= y 0 ] . (19.2)
Sx ⇠DX X y⇠⌘(x),y 0 ⇠⌘(⇡1 (x))
We next upper bound Py⇠⌘(x),y0 ⇠⌘(x0 ) [y 6= y 0 ] for any two domain points x, x0 :
P [y 6= y 0 ] 2⌘(x)(1 ⌘(x)) + c kx x0 k.
y⇠⌘(x),y 0 ⇠⌘(x0 )
The next step is to bound the expected distance between a random x and its
closest element in S. We first need the following general probability lemma. The
lemma bounds the probability weight of subsets that are not hit by a random
sample, as a function of the size of that sample.
ma 1
Finally, by a standard calculus, maxa ae me and this concludes the proof.
Equipped with the preceding lemmas we are now ready to state and prove the
main result of this section – an upper bound on the expected error of the 1-NN
learning rule.
theorem 19.3 Let X = [0, 1]d , Y = {0, 1}, and D be a distribution over X ⇥ Y
for which the conditional probability function, ⌘, is a c-Lipschitz function. Let
hS denote the result of applying the 1-NN rule to a sample S ⇠ Dm . Then,
p 1
E m [LD (hS )] 2 LD (h? ) + 4 c d m d+1 .
S⇠D
Proof Fix some ✏ = 1/T , for some integer T , let r = T d and let C1 , . . . , Cr be the
cover of the set X using boxes of length ✏: Namely, for every (↵1 , . . . , ↵d ) 2 [T ]d ,
there exists a set Ci of the form {x : 8j, xj 2 [(↵j 1)/T, ↵j /T ]}. An illustration
for d = 2, T = 5 and the set corresponding to ↵ = (2, 4) is given in the following.
p p
For each x, x0 in the same box we have kx x0 k d ✏. Otherwise, kx x0 k d.
Therefore,
2 2 3 2 3 3
[ p [ p
E [kx x⇡1 (x) k] E 4P 4 Ci 5 d + P 4 Ci 5 ✏ d5 ,
x,S S
i:Ci \S=; i:Ci \S6=;
S
and by combining Lemma 19.2 with the trivial bound P[ i:Ci \S6=; Ci ] 1 we
get that
p
E [kx x⇡1 (x) k] d me r
+✏ .
x,S
19.2 Analysis 263
19.4 Summary
The k-NN rule is a very simple learning algorithm that relies on the assumption
that “things that look alike must be alike.” We formalized this intuition using
the Lipschitzness of the conditional probability. We have shown that with a suf-
ficiently large training set, the risk of the 1-NN is upper bounded by twice the
risk of the Bayes optimal rule. We have also derived a lower bound that shows
the “curse of dimensionality” – the required sample size might increase expo-
nentially with the dimension. As a result, NN is usually performed in practice
after a dimensionality reduction preprocessing step. We discuss dimensionality
reduction techniques later on in Chapter 23.
Cover & Hart (1967) gave the first analysis of 1-NN, showing that its risk con-
verges to twice the Bayes optimal error under mild conditions. Following a lemma
due to Stone (1977), Devroye & Györfi (1985) have shown that the k-NN rule
19.6 Exercises 265
19.6 Exercises
In this exercise we will prove the following theorem for the k-NN rule.
theorem 19.5 Let X = [0, 1]d , Y = {0, 1}, and D be a distribution over X ⇥ Y
for which the conditional probability function, ⌘, is a c-Lipschitz function. Let hS
denote the result of applying the k-NN rule to a sample S ⇠ Dm , where k 10.
Let h? be the Bayes optimal hypothesis. Then,
r ! ⇣ p ⌘
8
E[LD (hS )] 1 + LD (h? ) + 6 c d + k m 1/(d+1) .
S k
1. Prove the following lemma.
lemma 19.6 Let C1 , . . . , Cr be a collection of subsets of some domain set,
X . Let S be a sequence of m points sampled i.i.d. according to some probability
distribution, D over X . Then, for every k 2,
2 3
X 2rk
E 4 P[Ci ]5 .
S⇠D m m
i:|Ci \S|<k
Hints:
• Show that
2 3
X r
X
E4 P[Ci ]5 = P[Ci ] P [|Ci \ S| < k] .
S S
i:|Ci \S|<k i=1
• Fix some i and suppose that k < P[Ci ] m/2. Use Cherno↵’s bound to show
that
P[Ci ] m/8
P [|Ci \ S| < k] P [|Ci \ S| < P[Ci ]m/2] e .
S S
ma 1
• Use the inequality maxa ae me to show that for such i we have
P[Ci ] m/8 8
P[Ci ] P [|Ci \ S| < k] P[Ci ]e .
S me
• Conclude the proof by using the fact that for the case k P[Ci ] m/2 we
clearly have:
2k
P[Ci ] P [|Ci \ S| < k] P[Ci ] .
S m
266 Nearest Neighbor
W.l.o.g. assume that p 1/2. Now use Lemma 19.7 to show that
r !
8
P P [hS (x) 6= y] 1 + P [1[p>1/2] 6= y].
y1 ,...,yj y⇠p k y⇠p
• Show that
P [1[p>1/2] 6= y] = p = min{p, 1 p} min{⌘(x), 1 ⌘(x)} + |p ⌘(x)|.
y⇠p
• Combine all the preceding to obtain that the second summand in Equa-
tion (19.3) is bounded by
r !
8 p
1+ LD (h? ) + 3 c ✏ d.
k
hope it will find a reasonable solution (as happens to be the case in several
practical tasks). In Section 20.6 we describe how to implement SGD for neural
networks. In particular, the most complicated operation is the calculation of the
gradient of the loss function with respect to the parameters of the network. We
present the backpropagation algorithm that efficiently calculates the gradient.
The idea behind neural networks is that many neurons can be joined together
by communication links to carry out complex computations. It is common to
describe the structure of a neural network as a graph whose nodes are the neurons
and each (directed) edge in the graph links the output of some neuron to the
input of another neuron. We will restrict our attention to feedforward network
structures in which the underlying graph does not contain cycles.
A feedforward neural network is described by a directed acyclic graph, G =
(V, E), and a weight function over the edges, w : E ! R. Nodes of the graph
correspond to neurons. Each single neuron is modeled as a simple scalar func-
tion, : R ! R. We will focus on three possible functions for : the sign
function, (a) = sign(a), the threshold function, (a) = 1[a>0] , and the sig-
moid function, (a) = 1/(1 + exp( a)), which is a smooth approximation to the
threshold function. We call the “activation” function of the neuron. Each edge
in the graph links the output of some neuron to the input of another neuron.
The input of a neuron is obtained by taking a weighted sum of the outputs of
all the neurons connected to it, where the weighting is according to w.
To simplify the description of the calculation performed by the network, we
further assume that the network is organized in layers. That is, the set of nodes
can be decomposed into a union of (nonempty) disjoint subsets, V = [· Tt=0 Vt ,
such that every edge in E connects some node in Vt 1 to some node in Vt , for
some t 2 [T ]. The bottom layer, V0 , is called the input layer. It contains n + 1
neurons, where n is the dimensionality of the input space. For every i 2 [n], the
output of neuron i in V0 is simply xi . The last neuron in V0 is the “constant”
neuron, which always outputs 1. We denote by vt,i the ith neuron of the tth layer
and by ot,i (x) the output of vt,i when the network is fed with the input vector x.
Therefore, for i 2 [n] we have o0,i (x) = xi and for i = n + 1 we have o0,i (x) = 1.
We now proceed with the calculation in a layer by layer manner. Suppose we
have calculated the outputs of the neurons at layer t. Then, we can calculate
the outputs of the neurons at layer t + 1 as follows. Fix some vt+1,j 2 Vt+1 .
Let at+1,j (x) denote the input to vt+1,j when the network is fed with the input
vector x. Then,
X
at+1,j (x) = w((vt,r , vt+1,j )) ot,r (x),
r: (vt,r ,vt+1,j )2E
270 Neural Networks
and
ot+1,j (x) = (at+1,j (x)) .
That is, the input to vt+1,j is a weighted sum of the outputs of the neurons in Vt
that are connected to vt+1,j , where weighting is according to w, and the output
of vt+1,j is simply the application of the activation function on its input.
Layers V1 , . . . , VT 1 are often called hidden layers. The top layer, VT , is called
the output layer. In simple prediction problems the output layer contains a single
neuron whose output is the output of the network.
We refer to T as the number of layers in the network (excluding V0 ), or the
“depth” of the network. The size of the network is |V |. The “width” of the
network is maxt |Vt |. An illustration of a layered feedforward neural network of
depth 2, size 10, and width 5, is given in the following. Note that there is a
neuron in the hidden layer that has no incoming edges. This neuron will output
the constant (0).
v1,1
x1 v0,1
v1,2
x2 v0,2
v1,3 v2,1 Output
x3 v0,3
v1,4
constant v0,4
v1,5
That is, the parameters specifying a hypothesis in the hypothesis class are the
weights over the edges of the network.
We can now study the approximation error, estimation error, and optimization
error of such hypothesis classes. In Section 20.3 we study the approximation
error of HV,E, by studying what type of functions hypotheses in HV,E, can
implement, in terms of the size of the underlying graph. In Section 20.4 we
study the estimation error of HV,E, , for the case of binary classification (i.e.,
VT = 1 and is the sign function), by analyzing its VC dimension. Finally, in
Section 20.5 we show that it is computationally hard to learn the class HV,E, ,
even if the underlying graph is small, and in Section 20.6 we present the most
commonly used heuristic for training HV,E, .
In this section we study the expressive power of neural networks, namely, what
type of functions can be implemented using a neural network. More concretely,
we will fix some architecture, V, E, , and will study what functions hypotheses
in HV,E, can implement, as a function of the size of V .
We start the discussion with studying which type of Boolean functions (i.e.,
functions from {±1}n to {±1}) can be implemented by HV,E,sign . Observe that
for every computer in which real numbers are stored using b bits, whenever we
calculate a function f : Rn ! R on such a computer we in fact calculate a
function g : {±1}nb ! {±1}b . Therefore, studying which Boolean functions can
be implemented by HV,E,sign can tell us which functions can be implemented on
a computer that stores real numbers using b bits.
We begin with a simple claim, showing that without restricting the size of the
network, every Boolean function can be implemented using a neural network of
depth 2.
claim 20.1 For every n, there exists a graph (V, E) of depth 2, such that
HV,E,sign contains all functions from {±1}n to {±1}.
The preceding claim shows that neural networks can implement any Boolean
function. However, this is a very weak property, as the size of the resulting
network might be exponentially large. In the construction given at the proof of
Claim 20.1, the number of nodes in the hidden layer is exponentially large. This
is not an artifact of our proof, as stated in the following theorem.
theorem 20.2 For every n, let s(n) be the minimal integer such that there
exists a graph (V, E) with |V | = s(n) such that the hypothesis class HV,E,sign
contains all the functions from {0, 1}n to {0, 1}. Then, s(n) is exponential in n.
Similar results hold for HV,E, where is the sigmoid function.
Proof Suppose that for some (V, E) we have that HV,E,sign contains all functions
from {0, 1}n to {0, 1}. It follows that it can shatter the set of m = 2n vectors in
{0, 1}n and hence the VC dimension of HV,E,sign is 2n . On the other hand, the
VC dimension of HV,E,sign is bounded by O(|E| log(|E|)) O(|V |3 ), as we will
show in the next section. This implies that |V | ⌦(2n/3 ), which concludes our
proof for the case of networks with the sign activation function. The proof for
the sigmoid case is analogous.
Remark 20.1 It is possible to derive a similar theorem for HV,E, for any , as
long as we restrict the weights so that it is possible to express every weight using
a number of bits which is bounded by a universal constant. We can even con-
sider hypothesis classes where di↵erent neurons can employ di↵erent activation
functions, as long as the number of allowed activation functions is also finite.
Which functions can we express using a network of polynomial size? The pre-
ceding claim tells us that it is impossible to express all Boolean functions using
a network of polynomial size. On the positive side, in the following we show
that all Boolean functions that can be calculated in time O(T (n)) can also be
expressed by a network of size O(T (n)2 ).
theorem 20.3 Let T : N ! N and for every n, let Fn be the set of functions
that can be implemented using a Turing machine using runtime of at most T (n).
Then, there exist constants b, c 2 R+ such that for every n, there is a graph
(Vn , En ) of size at most c T (n)2 + b such that HVn ,En ,sign contains Fn .
The proof of this theorem relies on the relation between the time complexity
of programs and their circuit complexity (see, for example, Sipser (2006)). In a
nutshell, a Boolean circuit is a type of network in which the individual neurons
20.3 The Expressive Power of Neural Networks 273
lemma 20.4 Suppose that a neuron v, that implements the sign activation
function, has k incoming edges, connecting it to neurons whose outputs are in
{±1}. Then, by adding one more edge, linking a “constant” neuron to v, and
by adjusting the weights on the edges to v, the output of v can implement the
conjunction or the disjunction of its inputs.
theorem 20.5 Fix some ✏ 2 (0, 1). For every n, let s(n) be the minimal integer
such that there exists a graph (V, E) with |V | = s(n) such that the hypothesis class
HV,E, , with being the sigmoid function, can approximate, to within precision
of ✏, every 1-Lipschitz function f : [ 1, 1]n ! [ 1, 1]. Then s(n) is exponential
in n.
Let us start with a depth 2 network, namely, a network with a single hidden
layer. Each neuron in the hidden layer implements a halfspace predictor. Then,
the single neuron at the output layer applies a halfspace on top of the binary
outputs of the neurons in the hidden layer. As we have shown before, a halfspace
can implement the conjunction function. Therefore, such networks contain all
hypotheses which are an intersection of k 1 halfspaces, where k is the number
of neurons in the hidden layer; namely, they can express all convex polytopes
with k 1 faces. An example of an intersection of 5 halfspaces is given in the
following.
Next we discuss the sample complexity of learning the class HV,E, . Recall that
the fundamental theorem of learning tells us that the sample complexity of learn-
ing a hypothesis class of binary classifiers depends on its VC dimension. There-
fore, we focus on calculating the VC dimension of hypothesis classes of the form
HV,E, , where the output layer of the graph contains a single neuron.
We start with the sign activation function, namely, with HV,E,sign . What is
the VC dimension of this class? Intuitively, since we learn |E| parameters, the
VC dimension should be order of |E|. This is indeed the case, as formalized by
the following theorem.
Proof To simplify the notation throughout the proof, let us denote the hy-
pothesis class by H. Recall the definition of the growth function, ⌧H (m), from
Section 6.5.1. This function measures maxC⇢X :|C|=m |HC |, where HC is the re-
striction of H to functions from C to {0, 1}. We can naturally extend the defi-
nition for a set of functions from X to some finite set Y, by letting HC be the
restriction of H to functions from C to Y, and keeping the definition of ⌧H (m)
intact.
Our neural network is defined by a layered graph. Let V0 , . . . , VT be the layers
of the graph. Fix some t 2 [T ]. By assigning di↵erent weights on the edges
between Vt 1 and Vt , we obtain di↵erent functions from R|Vt 1 | ! {±1}|Vt | . Let
H(t) be the class of all possible such mappings from R|Vt 1 | ! {±1}|Vt | . Then,
H can be written as a composition, H = H(T ) . . . H(1) . In Exercise 4 we show
that the growth function of a composition of hypothesis classes is bounded by
the products of the growth functions of the individual classes. Therefore,
T
Y
⌧H (m) ⌧H(t) (m).
t=1
Let dt,i be the number of edges that are headed to the ith neuron of layer t.
Since the neuron is a homogenous halfspace hypothesis and the VC dimension
of homogenous halfspaces is the dimension of their input, we have by Sauer’s
lemma that
⇣ ⌘dt,i
⌧H(t,i) (m) dem
t,i
(em)dt,i .
In the previous sections we have shown that the class of neural networks with an
underlying graph of polynomial size can express all functions that can be imple-
mented efficiently, and that the sample complexity has a favorable dependence
on the size of the network. In this section we turn to the analysis of the time
complexity of training neural networks.
We first show that it is NP hard to implement the ERM rule with respect to
HV,E,sign even for networks with a single hidden layer that contain just 4 neurons
in the hidden layer.
theorem 20.7 Let k 3. For every n, let (V, E) be a layered graph with n
input nodes, k + 1 nodes at the (single) hidden layer, where one of them is the
constant neuron, and a single output node. Then, it is NP hard to implement the
ERM rule with respect to HV,E,sign .
The proof relies on a reduction from the k-coloring problem and is left as
Exercise 6.
One way around the preceding hardness result could be that for the purpose
of learning, it may suffice to find a predictor h 2 H with low empirical error,
not necessarily an exact ERM. However, it turns out that even the task of find-
ing weights that result in close-to-minimal empirical error is computationally
infeasible (see (Bartlett & Ben-David 2002)).
One may also wonder whether it may be possible to change the architecture
of the network so as to circumvent the hardness result. That is, maybe ERM
with respect to the original network structure is computationally hard but ERM
with respect to some other, larger, network may be implemented efficiently (see
Chapter 8 for examples of such cases). Another possibility is to use other acti-
vation functions (such as sigmoids, or any other type of efficiently computable
activation functions). There is a strong indication that all of such approaches
are doomed to fail. Indeed, under some cryptographic assumption, the problem
of learning intersections of halfspaces is known to be hard even in the repre-
sentation independent model of learning (see Klivans & Sherstov (2006)). This
implies that, under the same cryptographic assumption, any hypothesis class
which contains intersections of halfspaces cannot be learned efficiently.
A widely used heuristic for training neural networks relies on the SGD frame-
work we studied in Chapter 14. There, we have shown that SGD is a successful
learner if the loss function is convex. In neural networks, the loss function is
highly nonconvex. Nevertheless, we can still implement the SGD algorithm and
20.6 SGD and Backpropagation 277
hope it will find a reasonable solution (as happens to be the case in several
practical tasks).
The problem of finding a hypothesis in HV,E, with a low risk amounts to the
problem of tuning the weights over the edges. In this section we show how to
apply a heuristic search for good weights using the SGD algorithm. Throughout
this section we assume that is the sigmoid function, (a) = 1/(1 + e a ), but
the derivation holds for any di↵erentiable scalar function.
Since E is a finite set, we can think of the weight function as a vector w 2 R|E| .
Suppose the network has n input neurons and k output neurons, and denote by
hw : Rn ! Rk the function calculated by the network if the weight function is
defined by w. Let us denote by (hw (x), y) the loss of predicting hw (x) when
the target is y 2 Y. For concreteness, we will take to be the squared loss,
1 2
(hw (x), y) = 2 khw (x) yk ; however, similar derivation can be obtained for
every di↵erentiable function. Finally, given a distribution D over the examples
domain, Rn ⇥ Rk , let LD (w) be the risk of the network, namely,
Recall the SGD algorithm for minimizing the risk function LD (w). We repeat
the pseudocode from Chapter 14 with a few modifications, which are relevant
to the neural network application because of the nonconvexity of the objective
function. First, while in Chapter 14 we initialized w to be the zero vector, here
we initialize w to be a randomly chosen vector with values close to zero. This
is because an initialization with the zero vector will lead all hidden neurons to
have the same weights (if the network is a full layered network). In addition,
the hope is that if we repeat the SGD procedure several times, where each time
we initialize the process with a new random vector, one of the runs will lead
to a good local minimum. Second, while a fixed step size, ⌘, is guaranteed to
be good enough for convex problems, here we utilize a variable step size, ⌘t , as
defined in Section 14.4.2. Because of the nonconvexity of the loss function, the
choice of the sequence ⌘t is more significant, and it is tuned in practice by a trial
and error manner. Third, we output the best performing vector on a validation
set. In addition, it is sometimes helpful to add regularization on the weights,
with parameter . That is, we try to minimize LD (w) + 2 kwk2 . Finally, the
gradient does not have a closed form solution. Instead, it is implemented using
the backpropagation algorithm, which will be described in the sequel.
278 Neural Networks
Backpropagation
input:
example (x, y), weight vector w, layered graph (V, E),
activation function : R ! R
initialize:
denote layers of the graph V0 , . . . , VT where Vt = {vt,1 , . . . , vt,kt }
define Wt,i,j as the weight of (vt,j , vt+1,i )
(where we set Wt,i,j = 0 if (vt,j , vt+1,i ) 2/ E)
forward:
set o0 = x
for t = 1, . . . , T
for i = 1, . . . , kt
P kt 1
set at,i = j=1 Wt 1,i,j ot 1,j
set ot,i = (at,i )
backward:
set T = oT y
for t = T 1, T 2, . . . , 1
for i = 1, . . . , kt
Pkt+1 0
t,i = j=1 Wt,j,i t+1,j (at+1,j )
output:
foreach edge (vt 1,j , vt,i ) 2 E
set the partial derivative to t,i 0 (at,i ) ot 1,j
20.6 SGD and Backpropagation 279
The chain rule for taking the derivative of a composition of functions can be
written in terms of the Jacobian as follows. Given two functions f : Rn ! Rm
and g : Rk ! Rn , we have that the Jacobian of the composition function,
(f g) : Rk ! Rm , at w, is
Next, we discuss how to calculate the partial derivatives with respect to the
edges from Vt 1 to Vt , namely, with respect to the elements in Wt 1 . Since we
fix all other weights of the network, it follows that the outputs of all the neurons
in Vt 1 are fixed numbers which do not depend on the weights in Wt 1 . Denote
the corresponding vector by ot 1 . In addition, let us denote by `t : Rkt ! R the
loss function of the subnetwork defined by layers Vt , . . . , VT as a function of the
outputs of the neurons in Vt . The input to the neurons of Vt can be written as
at = Wt 1 ot 1 and the output of the neurons of Vt is ot = (at ). That is, for
every j we have ot,j = (at,j ). We obtain that the loss, as a function of Wt 1 ,
can be written as
Let us also denote t = Jot (`t ). Then, we can further rewrite the preceding as
0
Jwt 1 (gt ) = t,1 (at,1 ) o>
t 1 , ... , t,kt
0
(at,kt ) o>
t 1 . (20.3)
It is left to calculate the vector t = Jot (`t ) for every t. This is the gradient
of `t at ot . We calculate this in a recursive manner. First observe that for the
last layer we have that `T (u) = (u, y), where is the loss function. Since we
1 2
assume that (u, y) = 2 ku yk we obtain that Ju (`T ) = (u y). In particular,
T = JoT (`T ) = (oT y). Next, note that
In particular,
0
t = Jot (`t ) = J (Wt ot ) (`t+1 )diag( (Wt ot ))Wt
= Jot+1 (`t+1 )diag( 0 (at+1 ))Wt
0
= t+1 diag( (at+1 ))Wt .
In summary, we can first calculate the vectors {at , ot } from the bottom of
the network to its top. Then, we calculate the vectors { t } from the top of
the network back to its bottom. Once we have all of these vectors, the partial
derivatives are easily obtained using Equation (20.3). We have thus shown that
the pseudocode of backpropagation indeed calculates the gradient.
20.7 Summary
Neural networks were extensively studied in the 1980s and early 1990s, but with
mixed empirical success. In recent years, a combination of algorithmic advance-
ments, as well as increasing computational power and data size, has led to a
breakthrough in the e↵ectiveness of neural networks. In particular, “deep net-
works” (i.e., networks of more than 2 layers) have shown very impressive practical
performance on a variety of domains. A few examples include convolutional net-
works (Lecun & Bengio 1995), restricted Boltzmann machines (Hinton, Osindero
& Teh 2006), auto-encoders (Ranzato, Huang, Boureau & Lecun 2007, Bengio &
LeCun 2007, Collobert & Weston 2008, Lee, Grosse, Ranganath & Ng 2009, Le,
Ranzato, Monga, Devin, Corrado, Chen, Dean & Ng 2012), and sum-product
networks (Livni, Shalev-Shwartz & Shamir 2013, Poon & Domingos 2011). See
also (Bengio 2009) and the references therein.
The expressive power of neural networks and the relation to circuit complexity
have been extensively studied in (Parberry 1994). For the analysis of the sample
complexity of neural networks we refer the reader to (Anthony & Bartlet 1999).
Our proof technique of Theorem 20.6 is due to Kakade and Tewari lecture notes.
282 Neural Networks
Klivans & Sherstov (2006) have shown that for any c > 0, intersections of nc
halfspaces over {±1}n are not efficiently PAC learnable, even if we allow repre-
sentation independent learning. This hardness result relies on the cryptographic
assumption that there is no polynomial time solution to the unique-shortest-
vector problem. As we have argued, this implies that there cannot be an efficient
algorithm for training neural networks, even if we allow larger networks or other
activation functions that can be implemented efficiently.
The backpropagation algorithm has been introduced in Rumelhart, Hinton &
Williams (1986).
20.9 Exercises
if we feed the network with the real number 0.x1 x2 . . . xn , then the output
of the network will be x.
Hint: Denote ↵ = 0.x1 x2 . . . xn and observe that 10k ↵ 0.5 is at least 0.5
if xk = 1 and is at most 0.3 if xk = 1.
2. Construct a network, N2 , with O(n) weights, which implements a function
from [n] to {0, 1}n such that N2 (i) = ei for all i. That is, upon receiving
the input i, the network outputs the vector of all zeros except 1 at the i’th
neuron.
(i) (i) (i)
3. Let ↵1 , . . . , ↵n be n real numbers such that every ↵i is of the form 0.a1 a2 . . . an ,
(i)
with aj 2 {0, 1}. Construct a network, N3 , with O(n) weights, which im-
plements a function from [n] to R, and satisfies N2 (i) = ↵i for every i 2 [n].
4. Combine N1 , N3 to obtain a network that receives i 2 [n] and output a(i) .
(i)
5. Construct a network N4 that receives (i, j) 2 [n] ⇥ [n] and outputs aj .
2
Hint: Observe that the AND function over {0, 1} can be calculated using
O(1) weights.
6. Conclude that there is a graph with O(n) weights such that the VC di-
mension of the resulting hypothesis class is n2 .
6. Prove Theorem 20.7.
Hint: The proof is similar to the hardness of learning intersections of halfs-
paces – see Exercise 32 in Chapter 8.
Part III
Additional Learning Models
21 Online Learning
Our goal is to study which hypothesis classes are learnable in the online model,
and in particular to find good learning algorithms for a given hypothesis class.
Remark 21.1 Throughout this section and the next, we ignore the computa-
21.1 Online Classification in the Realizable Case 289
Proof We simply note that whenever the algorithm errs we have |Vt+1 | |Vt |/2,
(hence the name Halving). Therefore, if M is the total number of mistakes, we
have
M
1 |VT +1 | |H| 2 .
Rearranging this inequality we conclude our proof.
Of course, Halving’s mistake bound is much better than Consistent’s mistake
bound. We already see that online learning is di↵erent from PAC learning—while
in PAC, any ERM hypothesis is good, in online learning choosing an arbitrary
ERM hypothesis is far from being optimal.
v1
h1 h2 h3 h4
v2 v3 v1 0 0 1 1
v2 0 1 ⇤ ⇤
v3 ⇤ ⇤ 0 1
The definition of Ldim and the discussion above immediately imply the fol-
lowing:
lemma 21.6 No algorithm can have a mistake bound strictly smaller than
Ldim(H); namely, for every algorithm, A, we have MA (H) Ldim(H).
Proof Let T = Ldim(H) and let v1 , . . . , v2T 1 be a sequence that satisfies the
requirements in the definition of Ldim. If the environment sets xt = vit and
yt = 1 pt for all t 2 [T ], then the learner makes T mistakes while the definition
of Ldim implies that there exists a hypothesis h 2 H such that yt = h(xt ) for all
t.
1/2
1/4 3/4
This tree is shattered by H. And, because of the density of the reals, this tree
can be made arbitrarily deep.
Lemma 21.6 states that Ldim(H) lower bounds the mistake bound of any
algorithm. Interestingly, there is a standard algorithm whose mistake bound
matches this lower bound. The algorithm is similar to the Halving algorithm.
Recall that the prediction of Halving is made according to a majority vote of
the hypotheses which are consistent with previous examples. We denoted this
set by Vt . Put another way, Halving partitions Vt into two sets: Vt+ = {h 2 Vt :
h(xt ) = 1} and Vt = {h 2 Vt : h(xt ) = 0}. It then predicts according to the
larger of the two groups. The rationale behind this prediction is that whenever
Halving makes a mistake it ends up with |Vt+1 | 0.5 |Vt |.
The optimal algorithm we present in the following uses the same idea, but
instead of predicting according to the larger class, it predicts according to the
class with larger Ldim.
The following lemma formally establishes the optimality of the preceding al-
gorithm.
21.1 Online Classification in the Realizable Case 293
lemma 21.7 SOA enjoys the mistake bound MSOA (H) Ldim(H).
Proof It suffices to prove that whenever the algorithm makes a prediction mis-
take we have Ldim(Vt+1 ) Ldim(Vt ) 1. We prove this claim by assuming the
contrary, that is, Ldim(Vt+1 ) = Ldim(Vt ). If this holds true, then the definition
(r)
of pt implies that Ldim(Vt ) = Ldim(Vt ) for both r = 1 and r = 0. But, then
we can construct a shaterred tree of depth Ldim(Vt ) + 1 for the class Vt , which
leads to the desired contradiction.
corollary 21.8 Let H be any hypothesis class. Then, the standard optimal
algorithm enjoys the mistake bound MSOA (H) = Ldim(H) and no other algorithm
can have MA (H) < Ldim(H).
Comparison to VC Dimension
In the PAC learning model, learnability is characterized by the VC dimension of
the class H. Recall that the VC dimension of a class H is the maximal number
d such that there are instances x1 , . . . , xd that are shattered by H. That is, for
any sequence of labels (y1 , . . . , yd ) 2 {0, 1}d there exists a hypothesis h 2 H
that gives exactly this sequence of labels. The following theorem relates the VC
dimension to the Littlestone dimension.
theorem 21.9 For any class H, VCdim(H) Ldim(H), and there are classes
for which strict inequality holds. Furthermore, the gap can be arbitrarily larger.
x1
x2 x2
x3 x3 x3 x3
Now, the definition of a shattered set clearly implies that we got a valid shattered
tree of depth d, and we conclude that VCdim(H) Ldim(H). To show that the
gap can be arbitrarily large simply note that the class given in Example 21.4 has
VC dimension of 1 whereas its Littlestone dimension is infinite.
294 Online Learning
We restate the learner’s goal as having the lowest possible regret relative to H.
An interesting question is whether we can derive an algorithm with low regret,
meaning that RegretA (H, T ) grows sublinearly with the number of rounds, T ,
which implies that the di↵erence between the error rate of the learner and the
best hypothesis in H tends to zero as T goes to infinity.
We first show that this is an impossible mission—no algorithm can obtain a
sublinear regret bound even if |H| = 2. Indeed, consider H = {h0 , h1 }, where h0
is the function that always returns 0 and h1 is the function that always returns
1. An adversary can make the number of mistakes of any online algorithm be
equal to T , by simply waiting for the learner’s prediction and then providing
the opposite label as the true label. In contrast, for any sequence of true labels,
y1 , . . . , yT , let b be the majority of labels in y1 , . . . , yT , then the number of
mistakes of hb is at most T /2. Therefore, the regret of any online algorithm
might be at least T T /2 = T /2, which is not sublinear in T . This impossibility
result is attributed to Cover (Cover 1965).
To sidestep Cover’s impossibility result, we must further restrict the power
of the adversarial environment. We do so by allowing the learner to randomize
his predictions. Of course, this by itself does not circumvent Cover’s impossibil-
ity result, since in deriving this result we assumed nothing about the learner’s
strategy. To make the randomization meaningful, we force the adversarial envir-
onment to decide on yt without knowing the random coins flipped by the learner
on round t. The adversary can still know the learner’s forecasting strategy and
even the random coin flips of previous rounds, but it does not know the actual
value of the random coin flips used by the learner on round t. With this (mild)
change of game, we analyze the expected number of mistakes of the algorithm,
where the expectation is with respect to the learner’s own randomization. That
is, if the learner outputs ŷt where P[ŷt = 1] = pt , then the expected loss he pays
21.2 Online Classification in the Unrealizable Case 295
on round t is
P[ŷt 6= yt ] = |pt yt |.
Put another way, instead of having the predictions of the learner being in {0, 1}
we allow them to be in [0, 1], and interpret pt 2 [0, 1] as the probability to predict
the label 1 on round t.
With this assumption it is possible to derive a low regret algorithm. In partic-
ular, we will prove the following theorem.
theorem 21.10 For every hypothesis class H, there exists an algorithm for
online classification, whose predictions come from [0, 1], that enjoys the regret
bound
T
X T
X p
8h 2 H, |pt yt | |h(xt ) yt | 2 min{log(|H|) , Ldim(H) log(eT )} T .
t=1 t=1
Furthermore,
⇣p no ⌘algorithm can achieve an expected regret bound smaller than
⌦ Ldim(H) T .
We will provide a constructive proof of the upper bound part of the preceding
theorem. The proof of the lower bound part can be found in (Ben-David, Pal, &
Shalev-Shwartz 2009).
The proof of Theorem 21.10 relies on the Weighted-Majority algorithm for
learning with expert advice. This algorithm is important by itself and we dedicate
the next subsection to it.
21.2.1 Weighted-Majority
Weighted-majority is an algorithm for the problem of prediction with expert ad-
vice. In this online learning problem, on round t the learner has to choose the
advice of d given experts. We also allow the learner to randomize his choice by
defining a distribution over the d experts, that is, picking a vector w(t) 2 [0, 1]d ,
P (t) (t)
with i wi = 1, and choosing the ith expert with probability wi . After the
learner chooses an expert, it receives a vector of costs, vt 2 [0, 1]d , where vt,i
is the cost of following the advice of the ith expert. If the learner’s predic-
tions are randomized, then its loss is defined to be the averaged cost, namely,
P (t) (t)
i wi vt,i = hw , vt i. The algorithm assumes that the number of rounds T is
given. In Exercise 4 we show how to get rid of this dependence using the doubling
trick.
296 Online Learning
Weighted-Majority
input: number ofp experts, d ; number of rounds, T
parameter: ⌘ = 2 log(d)/T
initialize: w̃(1) = (1, . . . , 1)
for t = 1, 2, . . .
P (t)
set w(t) = w̃(t) /Zt where Zt = i w̃i
(t)
choose expert i at random according to P[i] = wi
receive costs of all experts vt 2 [0, 1]d
pay cost hw(t) , vt i
(t+1) (t)
update rule 8i, w̃i = w̃i e ⌘vt,i
The following theorem is key for analyzing the regret bound of Weighted-
Majority.
Proof We have:
Using the inequality e a 1 a + a2 /2, which holds for all a 2 (0, 1), and the
P (t)
fact that i wi = 1, we obtain
Zt+1 X (t)
log log wi 1 ⌘vt,i + ⌘ 2 vt,i
2
/2
Zt i
X (t)
= log 1 wi ⌘vt,i ⌘ 2 vt,i
2
/2 .
i
| {z }
def
=b
Next, note that b 2 (0, 1). Therefore, taking log of the two sides of the inequality
1 b e b we obtain the inequality log(1 b) b, which holds for all b 1,
and obtain
Zt+1 X (t)
log wi ⌘vt,i ⌘ 2 vt,i
2
/2
Zt i
X (t)
= ⌘ hw(t) , vt i + ⌘ 2 2
wi vt,i /2
i
⌘ hw(t) , vt i + ⌘ 2 /2.
21.2 Online Classification in the Unrealizable Case 297
Combining the preceding with Equation (21.3) and using the fact that log(Z1 ) =
log(d) we get that
X T
X T ⌘2
⌘ min vt,i log(d) ⌘ hw(t) , vt i + ,
i
t t=1
2
Expert(i1 , i2 , . . . , iL )
input A hypothesis class H ; Indices i1 < i2 < · · · < iL
initialize: V1 = H
for t = 1, 2, . . . , T
receive xt
(r)
for r 2 {0, 1} let Vt = {h⇣ 2 Vt⌘: h(xt ) = r}
(r)
define ỹt = argmaxr Ldim Vt
(in case of a tie set ỹt = 0)
if t 2 {i1 , i2 , . . . , iL }
predict ŷt = 1 ỹt
else
predict ŷt = ỹt
(ŷ )
update Vt+1 = Vt t
Note that each such expert can give us predictions at every round t while only
observing the instances x1 , . . . , xt . Our generic online learning algorithm is now
an application of the Weighted-Majority algorithm with these experts.
To analyze the algorithm we first note that the number of experts is
Ldim(H) ✓ ◆
X T
d= . (21.4)
L
L=0
It can be shown that when T Ldim(H) + 2, the right-hand side of the equation
Ldim(H)
is bounded by (eT /Ldim(H)) (the proof can be found in Lemma A.5).
21.2 Online Classification in the Unrealizable Case 299
lemma 21.13 Let H be any hypothesis class with Ldim(H) < 1. Let x1 , x2 , . . . , xT
be any sequence of instances. For any h 2 H, there exists L Ldim(H) and in-
dices 1 i1 < i2 < · · · < iL T such that when running Expert(i1 , i2 , . . . , iL )
on the sequence x1 , x2 , . . . , xT , the expert predicts h(xt ) on each online round
t = 1, 2, . . . , T .
The previous lemma holds in particular for the hypothesis in H that makes the
least number of mistakes on the sequence of examples, and we therefore obtain
the following:
T
X
min |h(xt ) yt |
h2H
t=1
Together with Theorem 21.11, the upper bound part of Theorem 21.10 is
proven.
300 Online Learning
theorem 21.15 The Online Gradient Descent algorithm enjoys the following
regret bound for every w? 2 H,
T
kw? k2 ⌘X
RegretA (w? , T ) + kvt k2 .
2⌘ 2 t=1
p
If we further assume that ft is ⇢-Lipschitz for all t, then setting ⌘ = 1/ T yields
1 p
RegretA (w? , T ) (kw? k2 + ⇢2 ) T .
2
B
If we further assume that H is B-bounded and we set ⌘ = p
⇢ T
then
p
RegretA (H, T ) B ⇢ T .
Proof The analysis is similar to the analysis of Stochastic Gradient Descent
1
with projections. Using the projection lemma, the definition of w(t+ 2 ) , and the
definition of subgradients, we have that for every t,
kw(t+1) w ? k2 kw(t) w? k2
1 1
= kw(t+1) w? k2 kw(t+ 2 ) w? k2 + kw(t+ 2 ) w? k2 kw(t) w? k2
1
kw(t+ 2 ) w ? k2 kw(t) w? k2
= kw(t) ⌘vt w ? k2 kw(t) w? k2
= 2⌘hw(t) w? , vt i + ⌘ 2 kvt k2
2⌘(ft (w(t) ) ft (w? )) + ⌘ 2 kvt k2 .
Summing over t and observing that the left-hand side is a telescopic sum we
obtain that
T
X T
X
kw(T +1) w ? k2 kw(1) w? k2 2⌘ (ft (w(t) ) ft (w? )) + ⌘ 2 kvt k2 .
t=1 t=1
(1)
Rearranging the inequality and using the fact that w = 0, we get that
T
X T
kw(1) w ? k2 kw(T +1) w? k2 ⌘X
(ft (w(t) ) ft (w? )) + kvt k2
t=1
2⌘ 2 t=1
T
kw? k2 ⌘X
+ kvt k2 .
2⌘ 2 t=1
This proves the first bound in the theorem. The second bound follows from the
assumption that ft is ⇢-Lipschitz, which implies that kvt k ⇢.
The Perceptron is a classic online learning algorithm for binary classification with
the hypothesis class of homogenous halfspaces, namely, H = {x 7! sign(hw, xi) :
302 Online Learning
On rounds on which the algorithm makes a prediction mistake, we shall use the
hinge-loss as a surrogate convex loss function
• ft is a convex function
• For all w, ft (w) `(w, (xt , yt )). In particular, this holds for w(t) .
for some vt 2 @ft (w(t) ). In our case, if yt hw(t) , xt i > 0 then ft is the zero
21.4 The Online Perceptron Algorithm 303
This form implies that the predictions of the Perceptron algorithm and the set
M do not depend on the actual value of ⌘ as long as ⌘ > 0. We have therefore
obtained the Perceptron algorithm:
Perceptron
initialize: w1 = 0
for t = 1, 2, . . . , T
receive xt
predict pt = sign(hw(t) , xt i)
if yt hw(t) , xt i 0
w(t+1) = w(t) + yt xt
else
w(t+1) = w(t)
p T
X
|M| Rkw? k |M| ft (w? ) 0. (21.6)
t=1
|M| R2 kw? k2 .
Proof The theorem follows from Equation (21.6) and the following claim: Given
p p
x, b, c 2 R+ , the inequality x b x c 0 implies that x c + b2 + b c. The
last claim can be easily derived by analyzing the roots of the convex parabola
Q(y) = y 2 by c.
The last assumption of Theorem 21.16 is called separability with large margin
(see Chapter 15). That is, there exists w? that not only satisfies that the point
xt lies on the correct side of the halfspace, it also guarantees that xt is not too
close to the decision boundary. More specifically, the distance from xt to the
decision boundary is at least = 1/kw? k and the bound becomes (R/ )2 .
When the separability assumption does not hold, the bound involves the term
[1 yt hw? , xt i]+ which measures how much the separability with margin require-
ment is violated.
As a last remark we note that there can be cases in which there exists some
w? that makes zero errors on the sequence but the Perceptron will make many
errors. Indeed, this is a direct consequence of the fact that Ldim(H) = 1. The
way we sidestep this impossibility result is by assuming more on the sequence of
examples – the bound in Theorem 21.16 will be meaningful only if the cumulative
P
surrogate loss, t ft (w? ) is not excessively large.
21.5 Summary
In this chapter we have studied the online learning model. Many of the results
we derived for the PAC learning model have an analog in the online model. First,
we have shown that a combinatorial dimension, the Littlestone dimension, char-
acterizes online learnability. To show this, we introduced the SOA algorithm (for
the realizable case) and the Weighted-Majority algorithm (for the unrealizable
case). We have also studied online convex optimization and have shown that
online gradient descent is a successful online learner whenever the loss function
is convex and Lipschitz. Finally, we presented the online Perceptron algorithm
as a combination of online gradient descent and the concept of surrogate convex
loss functions.
21.6 Bibliographic Remarks 305
The Standard Optimal Algorithm was derived by the seminal work of Lit-
tlestone (1988). A generalization to the nonrealizable case, as well as other
variants like margin-based Littlestone’s dimension, were derived in (Ben-David
et al. 2009). Characterizations of online learnability beyond classification have
been obtained in (Abernethy, Bartlett, Rakhlin & Tewari 2008, Rakhlin, Srid-
haran & Tewari 2010, Daniely et al. 2011). The Weighted-Majority algorithm is
due to (Littlestone & Warmuth 1994) and (Vovk 1990).
The term “online convex programming” was introduced by Zinkevich (2003)
but this setting was introduced some years earlier by Gordon (1999). The Per-
ceptron dates back to Rosenblatt (Rosenblatt 1958). An analysis for the re-
alizable case (with margin assumptions) appears in (Agmon 1954, Minsky &
Papert 1969). Freund and Schapire (Freund & Schapire 1999) presented an anal-
ysis for the unrealizable case with a squared-hinge-loss based on a reduction to
the realizable case. A direct analysis for the unrealizable case with the hinge-loss
was given by Gentile (Gentile 2003).
For additional information we refer the reader to Cesa-Bianchi & Lugosi (2006)
and Shalev-Shwartz (2011).
21.7 Exercises
p
Show that if the regret of A on each period of 2m rounds is at most ↵ 2m ,
then the total regret is at most
p
2 p
p ↵ T.
2 1
5. Online-to-batch Conversions: In this exercise we demonstrate how a suc-
cessful online learning algorithm can be used to derive a successful PAC
learner as well.
Consider a PAC learning problem for binary classification parameterized
by an instance domain, X , and a hypothesis class, H. Suppose that there exists
an online learning algorithm, A, which enjoys a mistake bound MA (H) < 1.
Consider running this algorithm on a sequence of T examples which are sam-
pled i.i.d. from a distribution D over the instance space X , and are labeled by
some h? 2 H. Suppose that for every round t, the prediction of the algorithm
is based on a hypothesis ht : X ! {0, 1}. Show that
MA (H)
E[LD (hr )] ,
T
where the expectation is over the random choice of the instances as well as a
random choice of r according to the uniform distribution over [T ].
Hint: Use similar arguments to the ones appearing in the proof of Theo-
rem 14.8.
22 Clustering
Clustering is one of the most widely used techniques for exploratory data anal-
ysis. Across all disciplines, from social sciences to biology to computer science,
people try to get a first intuition about their data by identifying meaningful
groups among the data points. For example, computational biologists cluster
genes on the basis of similarities in their expression in di↵erent experiments; re-
tailers cluster customers, on the basis of their customer profiles, for the purpose
of targeted marketing; and astronomers cluster stars on the basis of their spacial
proximity.
The first point that one should clarify is, naturally, what is clustering? In-
tuitively, clustering is the task of grouping a set of objects such that similar
objects end up in the same group and dissimilar objects are separated into dif-
ferent groups. Clearly, this description is quite imprecise and possibly ambiguous.
Quite surprisingly, it is not at all clear how to come up with a more rigorous
definition.
There are several sources for this difficulty. One basic problem is that the
two objectives mentioned in the earlier statement may in many cases contradict
each other. Mathematically speaking, similarity (or proximity) is not a transi-
tive relation, while cluster sharing is an equivalence relation and, in particular,
it is a transitive relation. More concretely, it may be the case that there is a
long sequence of objects, x1 , . . . , xm such that each xi is very similar to its two
neighbors, xi 1 and xi+1 , but x1 and xm are very dissimilar. If we wish to make
sure that whenever two elements are similar they share the same cluster, then
we must put all of the elements of the sequence in the same cluster. However,
in that case, we end up with dissimilar elements (x1 and xm ) sharing a cluster,
thus violating the second requirement.
To illustrate this point further, suppose that we would like to cluster the points
in the following picture into two clusters.
A clustering algorithm that emphasizes not separating close-by points (e.g., the
Single Linkage algorithm that will be described in Section 22.1) will cluster this
input by separating it horizontally according to the two lines:
Another basic problem is the lack of “ground truth” for clustering, which is a
common problem in unsupervised learning. So far in the book, we have mainly
dealt with supervised learning (e.g., the problem of learning a classifier from
labeled training data). The goal of supervised learning is clear – we wish to
learn a classifier which will predict the labels of future examples as accurately
as possible. Furthermore, a supervised learner can estimate the success, or the
risk, of its hypotheses using the labeled training data by computing the empirical
loss. In contrast, clustering is an unsupervised learning problem; namely, there
are no labels that we try to predict. Instead, we wish to organize the data in
some meaningful way. As a result, there is no clear success evaluation procedure
for clustering. In fact, even on the basis of full knowledge of the underlying data
distribution, it is not clear what is the “correct” clustering for that data or how
to evaluate a proposed clustering.
Consider, for example, the following set of points in R2 :
and suppose we are required to cluster them into two clusters. We have two
highly justifiable solutions:
Clustering 309
This phenomenon is not just artificial but occurs in real applications. A given
set of objects can be clustered in various di↵erent meaningful ways. This may
be due to having di↵erent implicit notions of distance (or similarity) between
objects, for example, clustering recordings of speech by the accent of the speaker
versus clustering them by content, clustering movie reviews by movie topic versus
clustering them by the review sentiment, clustering paintings by topic versus
clustering them by style, and so on.
To summarize, there may be several very di↵erent conceivable clustering so-
lutions for a given data set. As a result, there is a wide variety of clustering
algorithms that, on some input data, will output very di↵erent clusterings.
A Clustering Model:
Clustering tasks can vary in terms of both the type of input they have and the
type of outcome they are expected to compute. For concreteness, we shall focus
on the following common setup:
Input — a set of elements, X , and a distance function over it. That is, a function
d : X ⇥ X ! R+ that is symmetric, satisfies d(x, x) = 0 for all x 2 X
and often also satisfies the triangle inequality. Alternatively, the function
could be a similarity function s : X ⇥ X ! [0, 1] that is symmetric
and satisfies s(x, x) = 1 for all x 2 X . Additionally, some clustering
algorithms also require an input parameter k (determining the number
of required clusters).
Output — a partition of the domain set X into subsets. That is, C = (C1 , . . . Ck )
Sk
where i=1 Ci = X and for all i 6= j, Ci \ Cj = ;. In some situations the
clustering is “soft,” namely, the partition of X into the di↵erent clusters
is probabilistic where the output is a function assigning to each domain
point, x 2 X , a vector (p1 (x), . . . , pk (x)), where pi (x) = P[x 2 Ci ] is
the probability that x belongs to cluster Ci . Another possible output is
a clustering dendrogram (from Greek dendron = tree, gramma = draw-
ing), which is a hierarchical tree of domain subsets, having the singleton
sets in its leaves, and the full domain as its root. We shall discuss this
formulation in more detail in the following.
310 Clustering
3. Max Linkage clustering, in which the distance between two clusters is defined
as the maximum distance between their elements, namely,
def
D(A, B) = max{d(x, y) : x 2 A, y 2 B}.
The linkage-based clustering algorithms are agglomerative in the sense that they
start from data that is completely fragmented and keep building larger and
larger clusters as they proceed. Without employing a stopping rule, the outcome
of such an algorithm can be described by a clustering dendrogram: that is, a tree
of domain subsets, having the singleton sets in its leaves, and the full domain as
its root. For example, if the input is the elements X = {a, b, c, d, e} ⇢ R2 with
the Euclidean distance as depicted on the left, then the resulting dendrogram is
the one depicted on the right:
22.2 k-Means and Other Cost Minimization Clusterings 311
{a, b, c, d, e}
a {b, c, d, e}
e
d {b, c} {d, e}
c
b {a} {b} {c} {d} {e}
The single linkage algorithm is closely related to Kruskal’s algorithm for finding
a minimal spanning tree on a weighted graph. Indeed, consider the full graph
whose vertices are elements of X and the weight of an edge (x, y) is the distance
d(x, y). Each merge of two clusters performed by the single linkage algorithm
corresponds to a choice of an edge in the aforementioned graph. It is also possible
to show that the set of edges the single linkage algorithm chooses along its run
forms a minimal spanning tree.
If one wishes to turn a dendrogram into a partition of the space (a clustering),
one needs to employ a stopping criterion. Common stopping criteria include
• Fixed number of clusters – fix some parameter, k, and stop merging clusters
as soon as the number of clusters is k.
• Distance upper bound – fix some r 2 R+ . Stop merging as soon as all the
between-clusters distances are larger than r. We can also set r to be
↵ max{d(x, y) : x, y 2 X } for some ↵ < 1. In that case the stopping
criterion is called “scaled distance upper bound.”
The previous examples can all be viewed as center-based objectives. The so-
lution to such a clustering problem is determined by a set of cluster centers,
and the clustering assigns each instance to the center closest to it. More gener-
ally, center-based objective is determined by choosing some monotonic function
f : R+ ! R+ and then defining
k X
X
Gf ((X , d), (C1 , . . . Ck )) = min f (d(x, µi )),
µ1 ,...µk 2X 0
i=1 x2Ci
and the MinCut objective that we shall discuss in Section 22.3 are not center-
based objectives.
k-Means
input: X ⇢ Rn ; Number of clusters k
initialize: Randomly choose initial centroids µ1 , . . . , µk
repeat until convergence
8i 2 [k] set Ci = {x 2 X : i = argminj kx µj k}
(break ties in some arbitrary manner)
P
8i 2 [k] update µi = |C1i | x2Ci x
lemma 22.1 Each iteration of the k-means algorithm does not increase the
k-means objective function (as given in Equation (22.1)).
314 Clustering
Proof To simplify the notation, let us use the shorthand G(C1 , . . . , Ck ) for the
k-means objective, namely,
k X
X
G(C1 , . . . , Ck ) = min kx µ i k2 . (22.2)
µ1 ,...,µk 2Rn
i=1 x2Ci
P P
It is convenient to define µ(Ci ) = |C1i | x2Ci x and note that µ(Ci ) = argminµ2Rn x2Ci kx
µk2 . Therefore, we can rewrite the k-means objective as
k X
X
G(C1 , . . . , Ck ) = kx µ(Ci )k2 . (22.3)
i=1 x2Ci
(t 1) (t 1)
Consider the update at iteration t of the k-means algorithm. Let C1 , . . . , Ck
(t 1) (t 1) (t) (t)
be the previous partition, let µi = µ(Ci ), and let C1 , . . . , Ck be the
new partition assigned at iteration t. Using the definition of the objective as
given in Equation (22.2) we clearly have that
k
X X
(t) (t) (t 1) 2
G(C1 , . . . , Ck ) kx µi k . (22.4)
i=1 x2C (t)
i
(t) (t)
In addition, the definition of the new partition (C1 , . . . , Ck ) implies that it
Pk P (t 1) 2
minimizes the expression i=1 x2Ci kx µi k over all possible partitions
(C1 , . . . , Ck ). Hence,
k
X X k
X X
(t 1) 2 (t 1) 2
kx µi k kx µi k . (22.5)
i=1 x2C (t) i=1 x2C (t 1)
i i
Using Equation (22.3) we have that the right-hand side of Equation (22.5) equals
(t 1) (t 1)
G(C1 , . . . , Ck ). Combining this with Equation (22.4) and Equation (22.5),
(t) (t) (t 1) (t 1)
we obtain that G(C1 , . . . , Ck ) G(C1 , . . . , Ck ), which concludes our
proof.
While the preceding lemma tells us that the k-means objective is monotonically
nonincreasing, there is no guarantee on the number of iterations the k-means al-
gorithm needs in order to reach convergence. Furthermore, there is no nontrivial
lower bound on the gap between the value of the k-means objective of the al-
gorithm’s output and the minimum possible value of that objective function. In
fact, k-means might converge to a point which is not even a local minimum (see
Exercise 2). To improve the results of k-means it is often recommended to repeat
the procedure several times with di↵erent randomly chosen initial centroids (e.g.,
we can choose the initial centroids to be random points from the data).
22.3 Spectral Clustering 315
The preceding objective assumes smaller values if the clusters are not too small.
Unfortunately, introducing this balancing makes the problem computationally
hard to solve. Spectral clustering is a way to relax the problem of minimizing
RatioCut.
The following lemma underscores the relation between RatioCut and the Lapla-
cian matrix.
Hi,j = p 1 1[i2Cj ] .
|Cj |
Proof Let h1 , . . . , hk be the columns of H. The fact that these vectors are
orthonormal is immediate from the definition. Next, by standard algebraic ma-
Pk
nipulations, it can be shown that trace(H > L H) = i=1 h> i Lhi and that for
any vector v we have
!
1 X X X 1X
>
v Lv = Dr,r vr2 2 vr vs Wr,s + Ds,s vs2 = Wr,s (vr vs ) 2 .
2 r r,s s
2 r,s
Applying this with v = hi and noting that (hi,r hi,s )2 is nonzero only if
r 2 Ci , s 2
/ Ci or the other way around, we obtain that
1 X
h>
i Lhi = Wr,s .
|Ci |
r2Ci ,s2C
/ i
The spectral clustering algorithm starts with finding the matrix H of the k
eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian
matrix. It then represents points according to the rows of H. It is due to the
properties of the graph Laplacians that this change of representation is useful.
In many situations, this change of representation enables the simple k-means
algorithm to detect the clusters seamlessly. Intuitively, if H is as defined in
Lemma 22.3 then each point in the new representation is an indicator vector
whose value is nonzero only on the element corresponding to the cluster it belongs
to.
So far, we have mainly listed various useful clustering tools. However, some fun-
damental questions remain unaddressed. First and foremost, what is clustering?
What is it that distinguishes a clustering algorithm from any arbitrary function
that takes an input space and outputs a partition of that space? Are there any
basic properties of clustering that are independent of any specific algorithm or
task?
One method for addressing such questions is via an axiomatic approach. There
have been several attempts to provide an axiomatic definition of clustering. Let
us demonstrate this approach by presenting the attempt made by Kleinberg
(2003).
Consider a clustering function, F , that takes as input any finite domain X
with a dissimilarity function d over its pairs and returns a partition of X .
Consider the following three properties of such a function:
Scale Invariance (SI) For any domain set X , dissimilarity function d, and
any ↵ > 0, the following should hold: F (X , d) = F (X , ↵d) (where
def
(↵d)(x, y) = ↵ d(x, y)).
Richness (Ri) For any finite X and every partition C = (C1 , . . . Ck ) of X (into
nonempty subsets) there exists some dissimilarity function d over X such
that F (X , d) = C.
P P ⇣ ⌘
p(a,b)
I(x; C) = a b p(a, b) log p(a)p(b) , where the sum is over all values x can take and all
values C can take.
2 A sufficient statistic is a function of the data which has the property of sufficiency with
respect to a statistical model and its associated unknown parameter, meaning that “no
other statistic which can be calculated from the same sample provides any additional
information as to the value of the parameter.” For example, if we assume that a variable is
distributed normally with a unit variance and an unknown expectation, then the average
function is a sufficient statistic.
22.5 A High Level View of Clustering 319
Alternatively, one can relax the Consistency property. For example, say that two
clusterings C = (C1 , . . . Ck ) and C 0 = (C10 , . . . Cl0 ) are compatible if for every
clusters Ci 2 C and Cj0 2 C 0 , either Ci ✓ Cj0 or Cj0 ✓ Ci or Ci \ Cj0 = ; (it is
worthwhile noting that for every dendrogram, every two clusterings that are ob-
tained by trimming that dendrogram are compatible). “Refinement Consistency”
is the requirement that, under the assumptions of the Consistency property, the
new clustering F (X , d0 ) is compatible with the old clustering F (X , d). Many
common clustering functions satisfy this requirement as well as Scale Invariance
and Richness. Furthermore, one can come up with many other, di↵erent, prop-
erties of clustering functions that sound intuitive and desirable and are satisfied
by some common clustering functions.
There are many ways to interpret these results. We suggest to view it as indi-
cating that there is no “ideal” clustering function. Every clustering function will
inevitably have some “undesirable” properties. The choice of a clustering func-
tion for any given task must therefore take into account the specific properties
of that task. There is no generic clustering solution, just as there is no clas-
sification algorithm that will learn every learnable task (as the No-Free-Lunch
theorem shows). Clustering, just like classification prediction, must take into
account some prior knowledge about the specific task at hand.
22.6 Summary
22.8 Exercises
where diam(Cj ) = maxx,x0 2Cj d(x, x0 ) (we use the convention diam(Cj ) = 0
if |Cj | < 2).
Similarly to the k-means objective, it is NP-hard to minimize the k-
diam objective. Fortunately, we have a very simple approximation algorithm:
Initially, we pick some x 2 X and set µ1 = x. Then, the algorithm iteratively
sets
8j 2 {2, . . . , k}, µj = argmax min d(x, µi ).
x2X i2[j 1]
Finally, we set
Hint: Consider the point µk+1 (in other words, the next center we would have
chosen, if we wanted k + 1 clusters). Let r = minj2[k] d(µj , µk+1 ). Prove the
following inequalities
Prove that for every k > 1 the k-diam clustering function defined in the
previous exercise is not a center-based clustering function.
Hint: Given a clustering input (X , d), with |X | > 2, consider the e↵ect of
adding many close-by points to some (but not all) of the members of X , on
either the k-diam clustering or any given center-based clustering.
5. Recall that we discussed three clustering “properties”: Scale Invariance, Rich-
ness, and Consistency. Consider the Single Linkage clustering algorithm.
1. Find which of the three properties is satisfied by Single Linkage with the
Fixed Number of Clusters (any fixed nonzero number) stopping rule.
2. Find which of the three properties is satisfied by Single Linkage with the
Distance Upper Bound (any fixed nonzero upper bound) stopping rule.
3. Show that for any pair of these properties there exists a stopping criterion
for Single Linkage clustering, under which these two axioms are satisfied.
6. Given some number k, let k-Richness be the following requirement:
For any finite X and every partition C = (C1 , . . . Ck ) of X (into nonempty subsets)
there exists some dissimilarity function d over X such that F (X , d) = C.
Prove that, for every number k, there exists a clustering function that
satisfies the three properties: Scale Invariance, k-Richness, and Consistency.
23 Dimensionality Reduction
To solve this problem we first show that the optimal solution takes a specific
form.
lemma 23.1 Let (U, W ) be a solution to Equation (23.1). Then the columns of
U are orthonormal (namely, U > U is the identity matrix of Rn ) and W = U > .
Proof Fix any U, W and consider the mapping x 7! U W x. The range of this
mapping, R = {U W x : x 2 Rd }, is an n dimensional linear subspace of Rd . Let
V 2 Rd,n be a matrix whose columns form an orthonormal basis of this subspace,
namely, the range of V is R and V > V = I. Therefore, each vector in R can be
written as V y where y 2 Rn . For every x 2 Rd and y 2 Rn we have
kx V yk22 = kxk2 + y> V > V y 2y> V > x = kxk2 + kyk2 2y> (V > x),
where we used the fact that V > V is the identity matrix of Rn . Minimizing the
preceding expression with respect to y by comparing the gradient with respect
to y to zero gives that y = V > x. Therefore, for each x we have that
V V > x = argmin kx x̃k22 .
x̃2R
Since this holds for every U, W the proof of the lemma follows.
On the basis of the preceding lemma, we can rewrite the optimization problem
given in Equation (23.1) as follows:
m
X
argmin kxi U U > xi k22 . (23.2)
U 2Rd,n :U > U =I i=1
23.1 Principal Component Analysis (PCA) 325
Note that B > B = U > VV > U = U > U = I. Therefore, the columns of B are
Pd Pn 2
also orthonormal, which implies that j=1 i=1 Bj,i = n. In addition, let B̃ 2
R d,d
be a matrix such that its first n columns are the columns of B and in
Pd
addition B̃ > B̃ = I. Then, for every j we have i=1 B̃j,i
2
= 1, which implies that
Pn 2
B
i=1 j,i 1. It follows that:
d
X
trace(U > AU ) max Dj,j j .
2[0,1]d : k k1 n
j=1
326 Dimensionality Reduction
It is not hard to verify (see Exercise 2) that the right-hand side equals to
Pn
j=1 Dj,j . We have therefore shown that for every matrix U 2 Rd,n with or-
>
Pn
thonormal columns it holds that trace(U AU ) j=1 Dj,j . On the other hand,
if we set U to be the matrix whose columns are the n leading eigenvectors of A
Pn
we obtain that trace(U > AU ) = j=1 Dj,j , and this concludes our proof.
Remark 23.1 The proof of Theorem 23.2 also tells us that the value of the
Pn
objective of Equation (23.4) is i=1 Di,i . Combining this with Equation (23.3)
Pm Pd
and noting that i=1 kxi k2 = trace(A) = i=1 Di,i we obtain that the optimal
Pd
objective value of Equation (23.1) is i=n+1 Di,i .
Remark 23.3 The previous discussion also implies that to calculate the PCA
solution we only need to know how to calculate inner products between vectors.
This enables us to calculate PCA implicitly even when d is very large (or even
infinite) using kernels, which yields the kernel PCA algorithm.
1.5
0.5
−0.5
−1
−1.5
−1.5 −1 −0.5 0 0.5 1 1.5
Figure 23.1 A set of vectors in R2 (blue x’s) and their reconstruction after
dimensionality reduction to R1 using PCA (red circles).
PCA
input
A matrix of m examples X 2 Rm,d
number of components n
if (m > d)
A = X >X
Let u1 , . . . , un be the eigenvectors of A with largest eigenvalues
else
B = XX >
Let v1 , . . . , vn be the eigenvectors of B with largest eigenvalues
for i = 1, . . . , n set ui = kX >1 vi k X > vi
output: u1 , . . . , un
o o oo oo o
x xxx x xx
+ +
+++
++ *
* ** **
Figure 23.2 Images of faces extracted from the Yale data set. Top-Left: the original
images in R50x50 . Top-Right: the images after dimensionality reduction to R10 and
reconstruction. Middle row: an enlarged version of one of the images before and after
PCA. Bottom: The images after dimensionality reduction to R2 . The di↵erent marks
indicate di↵erent individuals.
Some images of faces are depicted on the top-left side of Figure 23.2. Using
PCA, we reduced the dimensionality to R10 and reconstructed back to the orig-
inal dimension, which is 502 . The resulting reconstructed images are depicted
on the top-right side of Figure 23.2. Finally, on the bottom of Figure 23.2 we
depict a 2 dimensional representation of the images. As can be seen, even from a
2 dimensional representation of the images we can still roughly separate di↵erent
individuals.
23.2 Random Projections 329
In this section we show that reducing the dimension by using a random linear
transformation leads to a simple compression scheme with a surprisingly low
distortion. The transformation x 7! W x, when W is a random matrix, is often
referred to as a random projection. In particular, we provide a variant of a famous
lemma due to Johnson and Lindenstrauss, showing that random projections do
not distort Euclidean distances too much.
Let x1 , x2 be two vectors in Rd . A matrix W does not distort too much the
distance between x1 and x2 if the ratio
kW x1 W x2 k
kx1 x2 k
is close to 1. In other words, the distances between x1 and x2 before and after
the transformation are almost the same. To show that kW x1 W x2 k is not too
far away from kx1 x2 k it suffices to show that W does not distort the norm of
the di↵erence vector x = x1 x2 . Therefore, from now on we focus on the ratio
kW xk
kxk .
We start with analyzing the distortion caused by applying a random projection
to a single vector.
lemma 23.3 Fix some x 2 Rd . Let W 2 Rn,d be a random matrix such that
each Wi,j is an independent normal random variable. Then, for every ✏ 2 (0, 3)
we have
" p 2
#
k(1/ n)W xk 2
P 2
1 >✏ 2 e ✏ n/6 .
kxk
Let wi be the ith row of W . The random variable hwi , xi is a weighted sum of
d independent normal random variables and therefore it is normally distributed
P
with zero mean and variance j x2j = kxk2 = 1. Therefore, the random vari-
P n
able kW xk2 = 2
i=1 (hwi , xi) has a
2
n distribution. The claim now follows
directly from a measure concentration property of 2 random variables stated in
Lemma B.12 given in Section B.7.
Interestingly, the bound given in Lemma 23.4 does not depend on the original
dimension of x. In fact, the bound holds even if x is in an infinite dimensional
Hilbert space.
Formally,
definition 23.5 (RIP) A matrix W 2 Rn,d is (✏, s)-RIP if for all x 6= 0 s.t.
kxk0 s we have
kW xk22
1 ✏.
kxk22
The first theorem establishes that RIP matrices yield a lossless compression
scheme for sparse vectors. It also provides a (nonefficient) reconstruction scheme.
theorem 23.6 Let ✏ < 1 and let W be a (✏, 2s)-RIP matrix. Let x be a vector
s.t. kxk0 s, let y = W x be the compression of x, and let
x̃ 2 argmin kvk0
v:W v=y
theorem 23.7 Assume that the conditions of Theorem 23.6 holds and that
✏ < 1+1p2 . Then,
x = argmin kvk0 = argmin kvk1 .
v:W v=y v:W v=y
In fact, we will prove a stronger result, which holds even if x is not a sparse
vector.
theorem 23.8 Let ✏ < 1+1p2 and let W be a (✏, 2s)-RIP matrix. Let x be an
arbitrary vector and denote
xs 2 argmin kx vk1 .
v:kvk0 s
That is, xs is the vector which equals x on the s largest elements of x and equals
0 elsewhere. Let y = W x be the compression of x and let
x? 2 argmin kvk1
v:W v=y
23.3.1 Proofs*
Proof of Theorem 23.8
We follow a proof due to Candès (2008).
Let h = x? x. Given a vector v and a set of indices I we denote by vI the
vector whose ith element is vi if i 2 I and 0 otherwise.
The first trick we use is to partition the set of indices [d] = {1, . . . , d} into
disjoint sets of size s. That is, we will write [d] = T0 [· T1 [· T2 . . . Td/s 1 where
for all i, |Ti | = s, and we assume for simplicity that d/s is an integer. We define
the partition as follows. In T0 we put the s indices corresponding to the s largest
elements in absolute values of x (ties are broken arbitrarily). Let T0c = [d] \ T0 .
Next, T1 will be the s indices corresponding to the s largest elements in absolute
c
value of hT0c . Let T0,1 = T0 [ T1 and T0,1 = [d] \ T0,1 . Next, T2 will correspond to
the s largest elements in absolute value of hT0,1
c . And, we will construct T3 , T4 , . . .
lemma 23.10 Let W be an (✏, 2s)-RIP matrix. Then, for any two disjoint sets
I, J, both of size at most s, and for any vector u we have that hW uI , W uJ i
✏kuI k2 kuJ k2 .
Proving Claim 1:
To prove this claim we do not use the RIP condition at all but only use the fact
that x? minimizes the `1 norm. Take j > 1. For each i 2 Tj and i0 2 Tj 1 we
have that |hi | |hi0 |. Therefore, khTj k1 khTj 1 k1 /s. Thus,
khTj k2 s1/2 khTj k1 s 1/2
khTj 1
k1 .
Summing this over j = 2, 3, . . . and using the triangle inequality we obtain that
X
khT0,1
c k2 khTj k2 s 1/2 khT0c k1 (23.6)
j 2
Next, we show that khT0c k1 cannot be large. Indeed, from the definition of x?
we have that kxk1 kx? k1 = kx + hk1 . Thus, using the triangle inequality we
obtain that
X X
kxk1 kx+hk1 = |xi +hi |+ |xi +hi | kxT0 k1 khT0 k1 +khT0c k1 kxT0c k1
i2T0 i2T0c
(23.7)
and since kxT0c k1 = kx xs k1 = kxk1 kxT0 k1 we get that
khT0c k1 khT0 k1 + 2kxT0c k1 . (23.8)
Combining this with Equation (23.6) we get that
1/2 1/2
khT0,1
c k2 s khT0 k1 + 2kxT0c k1 khT0 k2 + 2s kxT0c k1 ,
which concludes the proof of claim 1.
Proving Claim 2:
For the second claim we use the RIP condition to get that
✏)khT0,1 k22 kW hT0,1 k22 .
(1 (23.9)
P P
Since W hT0,1 = W h j 2 W h Tj = j 2 W hTj we have that
X X
kW hT0,1 k22 = hW hT0,1 , W hTj i = hW hT0 + W hT1 , W hTj i.
j 2 j 2
From the RIP condition on inner products we obtain that for all i 2 {1, 2} and
j 2 we have
|hW hTi , W hTj i| ✏khTi k2 khTj k2 .
23.3 Compressed Sensing 335
p
Since khT0 k2 + khT1 k2 2khT0,1 k2 we therefore get that
p X
kW hT0,1 k22 2✏khT0,1 k2 khTj k2 .
j 2
Q0 = {x 2 Rd : 8j 2 [d], 9i 2 { k, k + 1, . . . , k} s.t. xj = ki }.
Clearly, |Q0 | = (2k + 1)d . We shall set Q = Q0 \ B2 (1), where B2 (1) is the unit
`2 ball of Rd . Since the points in Q0 are distributed evenly on the unit `1 ball,
the size of Q is the size of Q0 times the ratio between the volumes of the unit `2
and `1 balls. The volume of the `1 ball is 2d and the volume of B2 (1) is
⇡ d/2
.
(1 + d/2)
For simplicity, assume that d is even and therefore
⇣ ⌘d/2
d/2
(1 + d/2) = (d/2)! e ,
336 Dimensionality Reduction
Now lets specify k. For each x 2 B2 (1) let v 2 Q be the vector whose ith element
is sign(xi ) b|xi | kc/k. Then, for each element we have that |xi vi | 1/k and
thus
p
d
kx vk .
k
p
To ensure that the right-hand side will be at most ✏ we shall set k = d d/✏e.
Plugging this value into Equation (23.10) we conclude that
p ⇣ q ⌘d
d
|Q| (3 d/(2✏))d (⇡/e)d/2 (d/2) d/2 = 3✏ 2e ⇡
3✏ .
Applying Lemma 23.4 on the set {UI v : v 2 Q} we obtain that for n satisfying
23.3 Compressed Sensing 337
the condition given in the lemma, the following holds with probability of at least
1 :
kW UI vk2
sup 1 ✏/2,
v2Q kUI vk2
Clearly a < 1. Our goal is to show that a ✏. This follows from the fact that
for any x 2 S of unit norm there exists v 2 Q such that kx UI vk ✏/4 and
therefore
Thus,
kW xk
8x 2 S, 1 + (✏/2 + (1 + a)✏/4) .
kxk
✏/2 + ✏/4
a ✏/2 + (1 + a)✏/4 ) a ✏.
1 ✏/4
kW xk
This proves that for all x 2 S we have kxk 1 ✏. The other side follows from
this as well since
(1 ✏) kW xk (1 + ✏),
(1 2 ✏) kW xk2 (1 + 3 ✏).
The proof of Theorem 23.9 follows from this by a union bound over all choices
of I.
338 Dimensionality Reduction
23.5 Summary
23.7 Exercises
1. In this exercise we show that in the general case, exact recovery of a linear
compression scheme is impossible.
1. let A 2 Rn,d be an arbitrary compression matrix where n d 1. Show
that there exists u, v 2 Rn , u 6= v such that Au = Av.
2. Conclude that exact recovery of a linear compression scheme is impossible.
2. Let ↵ 2 Rd such that ↵1 ↵2 · · · ↵d 0. Show that
d
X n
X
max ↵j j = ↵j .
2[0,1]d :k k1 n
j=1 j=1
Hint: Take every vector 2 [0, 1]d such that k k1 n. Let i be the minimal
index for which i < 1. If i = n + 1 we are done. Otherwise, show that we can
increase i , while possibly decreasing j for some j > i, and obtain a better
solution. This will imply that the optimal solution is to set i = 1 for i n
and i = 0 for i > n.
3. Kernel PCA: In this exercise we show how PCA can be used for construct-
ing nonlinear dimensionality reduction on the basis of the kernel trick (see
Chapter 16).
Let X be some instance space and let S = {x1 , . . . , xm } be a set of points
in X . Consider a feature mapping : X ! V , where V is some Hilbert space
(possibly of infinite dimension). Let K : X ⇥ X be a kernel function, that is,
k(x, x0 ) = h (x), (x0 )i. Kernel PCA is the process of mapping the elements
in S into V using , and then applying PCA over { (x1 ), . . . , (xm )} into
Rn . The output of this process is the set of reduced elements.
Show how this process can be done in polynomial time in terms of m
and n, assuming that each evaluation of K(·, ·) can be calculated in a con-
stant time. In particular, if your implementation requires multiplication of
two matrices A and B, verify that their product can be computed. Similarly,
340 Dimensionality Reduction
Show that the solution of the problem is to set w to be the first principle
vector of x1 , . . . , xm .
2. Let w1 be the first principal component as in the previous question. Now,
suppose we would like to find a second unit vector, w2 2 Rd , that maxi-
mizes the variance of hw2 , xi, but is also uncorrelated to hw1 , xi. That is,
we would like to solve:
Show that the solution to this problem is to set w to be the second principal
component of x1 , . . . , xm .
Hint: Note that
hw1 , wi = 0.
5. The Relation between SVD and PCA: Use the SVD theorem (Corol-
lary C.6) for providing an alternative proof of Theorem 23.2.
6. Random Projections Preserve Inner Products: The Johnson-Lindenstrauss
lemma tells us that a random projection preserves distances between a finite
set of vectors. In this exercise you need to prove that if the set of vectors are
within the unit ball, then not only are the distances between any two vectors
preserved, but the inner product is also preserved.
Let Q be a finite set of vectors in Rd and assume that for every x 2 Q we
have kxk 1.
1. Let 2 (0, 1) and n be an integer such that
r
6 log(|Q|2 / )
✏= 3.
n
Prove that with probability of at least 1 over a choice of a random
23.7 Exercises 341
When solving a given problem, try to avoid a more general problem as an intermediate
step.
Let us start with a simple example. A drug company developed a new drug to
treat some deadly disease. We would like to estimate the probability of survival
when using the drug. To do so, the drug company sampled a training set of m
people and gave them the drug. Let S = (x1 , . . . , xm ) denote the training set,
where for each i, xi = 1 if the ith person survived and xi = 0 otherwise. We can
model the underlying distribution using a single parameter, ✓ 2 [0, 1], indicating
the probability of survival.
We now would like to estimate the parameter ✓ on the basis of the training
set S. A natural idea is to use the average number of 1’s in S as an estimator.
That is,
m
1 X
✓ˆ = xi . (24.1)
m i=1
We define the log likelihood of S, given the parameter ✓, as the log of the preceding
expression:
X X
L(S; ✓) = log (P[S = (x1 , . . . , xm )]) = log(✓) xi + log(1 ✓) (1 xi ).
i i
The maximum likelihood estimator is the parameter that maximizes the likeli-
hood
✓ˆ 2 argmax L(S; ✓). (24.3)
✓
Next, we show that in our case, Equation (24.1) is a maximum likelihood esti-
mator. To see this, we take the derivative of L(S; ✓) with respect to ✓ and equate
it to zero:
P P
i xi i (1 xi )
= 0.
✓ 1 ✓
Solving the equation for ✓ we obtain the estimator given in Equation (24.1).
344 Generative Models
To find a parameter ✓ = (µ, ) that optimizes this we take the derivative of the
likelihood w.r.t. µ and w.r.t. and compare it to 0. We obtain the following two
equations:
m
d 1 X
L(S; ✓) = 2 (xi µ) = 0
dµ i=1
m
d 1 X m
L(S; ✓) = 3 (xi µ)2 =0
d i=1
Note that the maximum likelihood estimate is not always an unbiased estimator.
For example, while µ̂ is unbiased, it is possible to show that the estimate ˆ of
the variance is biased (Exercise 1).
Simplifying Notation
To simplify our notation, we use P[X = x] in this chapter to describe both the
probability that X = x (for discrete random variables) and the density of the
distribution at x (for continuous variables).
24.1 Maximum Likelihood Estimator 345
That is, `(✓, x) is the negation of the log-likelihood of the observation x, assuming
the data is distributed according to P✓ . This loss function is often referred to as
the log-loss. On the basis of this definition it is immediate that the maximum
likelihood principle is equivalent to minimizing the empirical risk with respect
to the loss function given in Equation (24.4). That is,
m
X m
X
argmin ( log(P✓ [xi ])) = argmax log(P✓ [xi ]).
✓ i=1 ✓ i=1
where DRE is called the relative entropy, and H is called the entropy func-
tion. The relative entropy is a divergence measure between two probabilities.
For discrete variables, it is always nonnegative and is equal to 0 only if the two
distributions are the same. It follows that the true risk is minimal when P✓ = P.
The expression given in Equation (24.5) underscores how our generative as-
sumption a↵ects our density estimation, even in the limit of infinite data. It
shows that if the underlying distribution is indeed of a parametric form, then by
choosing the correct parameter we can make the risk be the entropy of the distri-
bution. However, if the distribution is not of the assumed parametric form, even
the best parameter leads to an inferior model and the suboptimality is measured
by the relative entropy divergence.
To answer this question we need to define how we assess the quality of an approxi-
mated solution of the density estimation problem. Unlike discriminative learning,
where there is a clear notion of “loss,” in generative learning there are various
ways to define the loss of a model. On the basis of the previous subsection, one
natural candidate is the expected log-loss as given in Equation (24.5).
In some situations, it is easy to prove that the maximum likelihood principle
guarantees low true risk as well. For example, consider the problem of estimating
the mean of a Gaussian variable of unit variance. We saw previously that the
1
P ?
maximum likelihood estimator is the average: µ̂ = m i xi . Let µ be the optimal
parameter. Then,
✓ ◆
Pµ? [x]
E ? [`(µ̂, x) `(µ? , x)] = E ? log
x⇠N (µ ,1) x⇠N (µ ,1) Pµ̂ [x]
✓ ◆
1 1
= E? (x µ? )2 + (x µ̂)2
x⇠N (µ ,1) 2 2
2 ? 2
µ̂ (µ )
= + (µ? µ̂) E [x]
2 2 x⇠N (µ? ,1)
µ̂2 (µ? )2
= + (µ? µ̂) µ?
2 2
1
= (µ̂ µ? )2 . (24.6)
2
Next, we note that µ̂ is the average of m Gaussian variables and therefore it is
also distributed normally with mean µ? and variance ? /m. From this fact we
can derive bounds of the form: with probability of at least 1 we have that
? ?
|µ̂ µ | ✏ where ✏ depends on /m and on .
In some situations, the maximum likelihood estimator clearly overfits. For
example, consider a Bernoulli random variable X and let P[X = 1] = ✓? . As
we saw previously, using Hoe↵ding’s inequality we can easily derive a guarantee
on |✓? ✓| ˆ that holds with high probability (see Equation (24.2)). However, if
our goal is to obtain a small value of the expected log-loss function as defined in
Equation (24.5) we might fail. For example, assume that ✓? is nonzero but very
small. Then, the probability that no element of a sample of size m will be 1 is
?
(1 ✓? )m , which is greater than e 2✓ m . It follows that whenever m log(2) 2✓ ? ,
the probability that the sample is all zeros is at least 50%, and in that case, the
maximum likelihood rule will set ✓ˆ = 0. But the true risk of the estimate ✓ˆ = 0
is
ˆ x)] = ✓? `(✓,
E [`(✓, ˆ 1) + (1 ˆ 0)
✓? )`(✓,
x⇠✓ ?
ˆ + (1
= ✓? log(1/✓) ✓? ) log(1/(1 ˆ
✓))
= ✓? log(1/0) = 1.
This simple example shows that we should be careful in applying the maximum
likelihood principle.
To overcome overfitting, we can use the variety of tools we encountered pre-
24.2 Naive Bayes 347
With this assumption and using Bayes’ rule, the Bayes optimal classifier can be
further simplified:
hBayes (x) = argmax P[Y = y|X = x]
y2{0,1}
w = (µ1 µ0 )T ⌃ 1
and b= 1
2 µT0 ⌃ 1
µ0 µT1 ⌃ 1
µ1 . (24.8)
Note that Y is a hidden variable that we do not observe in our data. Neverthe-
less, we introduce Y since it helps us describe a simple parametric form of the
probability of X.
More generally, let ✓ be the parameters of the joint distribution of X and Y
(e.g., in the preceding example, ✓ consists of cy , µy , and ⌃y , for all y = 1, . . . , k).
Then, the log-likelihood of an observation x can be written as
k
!
X
log (P✓ [X = x]) = log P✓ [X = x, Y = y] .
y=1
In many situations, the summation inside the log makes the preceding opti-
mization problem computationally hard. The Expectation-Maximization (EM)
algorithm, due to Dempster, Laird, and Rubin, is an iterative procedure for
searching a (local) maximum of L(✓). While EM is not guaranteed to find the
global maximum, it often works reasonably well in practice.
EM is designed for those cases in which, had we known the values of the latent
variables Y , then the maximum likelihood optimization problem would have been
tractable. More precisely, define the following function over m ⇥ k matrices and
the set of parameters ✓:
m X
X k
F (Q, ✓) = Qi,y log (P✓ [X = xi , Y = y]) .
i=1 y=1
350 Generative Models
If each row of Q defines a probability over the ith latent variable given X = xi ,
then we can interpret F (Q, ✓) as the expected log-likelihood of a training set
(x1 , y1 ), . . . , (xm , ym ), where the expectation is with respect to the choice of
each yi on the basis of the ith row of Q. In the definition of F , the summation is
outside the log, and we assume that this makes the optimization problem with
respect to ✓ tractable:
assumption 24.1 For any matrix Q 2 [0, 1]m,k , such that each row of Q sums
to 1, the optimization problem
argmax F (Q, ✓)
✓
is tractable.
The intuitive idea of EM is that we have a “chicken and egg” problem. On one
hand, had we known Q, then by our assumption, the optimization problem of
finding the best ✓ is tractable. On the other hand, had we known the parameters
✓ we could have set Qi,y to be the probability of Y = y given that X = xi .
The EM algorithm therefore alternates between finding ✓ given Q and finding Q
given ✓. Formally, EM finds a sequence of solutions (Q(1) , ✓ (1) ), (Q(2) , ✓ (2) ), . . .
where at iteration t, we construct (Q(t+1) , ✓ (t+1) ) by performing two steps.
The second term is the sum of the entropies of the rows of Q. Let
( k
)
X
m,k
Q= Q 2 [0, 1] : 8i, Qi,y = 1
y=1
be the set of matrices whose rows define probabilities over [k]. The following
lemma shows that EM performs alternate maximization iterations for maximiz-
ing G.
Therefore, we only need to show that for any ✓, the solution of argmaxQ2Q G(Q, ✓)
is to set Qi,y = P✓ [Y = y|X = xi ]. Indeed, by Jensen’s inequality, for any Q 2 Q
we have that
m
X k
X ✓ ◆!
P✓ [X = xi , Y = y]
G(Q, ✓) = Qi,y log
i=1 y=1
Qi,y
m k
!!
X X P✓ [X = xi , Y = y]
log Qi,y
i=1 y=1
Qi,y
m k
!
X X
= log P✓ [X = xi , Y = y]
i=1 y=1
Xm
= log (P✓ [X = xi ]) = L(✓),
i=1
352 Generative Models
where the last equality follows from the definition of conditional probability. If
✓ is continuous we replace P[✓] with the density function and the sum becomes
an integral:
Z
P[X = x] = P[✓]P[X = x|✓] d✓.
✓
The second inequality follows from the assumption that X and S are independent
when we condition on ✓. Using Bayes’ rule we have
P[S|✓] P[✓]
P[✓|S] = ,
P[S]
and together with the assumption that points are independent conditioned on ✓,
we can write
m
P[S|✓] P[✓] 1 Y
P[✓|S] = = P[X = xi |✓] P[✓].
P[S] P[S] i=1
We therefore obtain the following expression for Bayesian prediction:
m
1 X Y
P[X = x|S] = P[X = x|✓] P[X = xi |✓] P[✓]. (24.16)
P[S] i=1
✓
Getting back to our drug company example, we can rewrite P[X = x|S] as
Z P P
1
P[X = x|S] = ✓x+ i xi (1 ✓)1 x+ i (1 xi ) P[✓] d✓.
P [S]
24.6 Summary 355
Maximum A Posteriori
In many situations, it is difficult to find a closed form solution to the integral
given in Equation (24.16). Several numerical methods can be used to approxi-
mate this integral. Another popular solution is to find a single ✓ which maximizes
P[✓|S]. The value of ✓ which maximizes P[✓|S] is called the Maximum A Poste-
riori estimator. Once this value is found, we can calculate the probability that
X = x given the maximum a posteriori estimator and independently on S.
24.6 Summary
The maximum likelihood principle was studied by Ronald Fisher in the beginning
of the 20th century. Bayesian statistics follow Bayes’ rule, which is named after
the 18th century English mathematician Thomas Bayes.
There are many excellent books on the generative and Bayesian approaches
to machine learning. See, for example, (Bishop 2006, Koller & Friedman 2009,
MacKay 2003, Murphy 2012, Barber 2012).
356 Generative Models
24.8 Exercises
• Show that the preceding objective is equivalent to the usual empirical error
had we added two pseudoexamples to the training set. Conclude that
the regularized maximum likelihood estimator would be
m
!
1 X
ˆ
✓= 1+ xi .
m+2 i=1
• Derive a high probability bound on |✓ˆ ✓? |. Hint: Rewrite this as |✓ˆ E[✓]+
ˆ
ˆ
E[✓] ✓ | and then use the triangle inequality and Hoe↵ding inequality.
?
• Use this to bound the true risk. Hint: Use the fact that now ✓ˆ m+2 1
to
ˆ ?
relate |✓ ✓ | to the relative entropy.
3. • Consider a general optimization problem of the form:
k
X X
max ⌫y log(cy ) s.t. cy > 0, cy = 1 ,
c
y=1 y
We emphasize that while there are some common techniques for feature learn-
ing one may want to try, the No-Free-Lunch theorem implies that there is no ulti-
mate feature learner. Any feature learning algorithm might fail on some problem.
In other words, the success of each feature learner relies (sometimes implicitly)
on some form of prior assumption on the data distribution. Furthermore, the
relative quality of features highly depends on the learning algorithm we are later
going to apply using these features. This is illustrated in the following example.
Example 25.1 Consider a regression problem in which X = R2 , Y = R, and
the loss function is the squared loss. Suppose that the underlying distribution
is such that an example (x, y) is generated as follows: First, we sample x1 from
the uniform distribution over [ 1, 1]. Then, we deterministically set y = x1 2 .
Finally, the second feature is set to be x2 = y + z, where z is sampled from the
uniform distribution over [ 0.01, 0.01]. Suppose we would like to choose a single
feature. Intuitively, the first feature should be preferred over the second feature
as the target can be perfectly predicted based on the first feature alone, while it
cannot be perfectly predicted based on the second feature. Indeed, choosing the
first feature would be the right choice if we are later going to apply polynomial
regression of degree at least 2. However, if the learner is going to be a linear
regressor, then we should prefer the second feature over the first one, since the
optimal linear predictor based on the first feature will have a larger risk than
the optimal linear predictor based on the second feature.
Throughout this section we assume that X = Rd . That is, each instance is repre-
sented as a vector of d features. Our goal is to learn a predictor that only relies
on k ⌧ d features. Predictors that use only a small subset of features require a
smaller memory footprint and can be applied faster. Furthermore, in applications
such as medical diagnostics, obtaining each possible “feature” (e.g., test result)
can be costly; therefore, a predictor that uses only a small number of features
is desirable even at the cost of a small degradation in performance, relative to
a predictor that uses more features. Finally, constraining the hypothesis class to
use a small subset of features can reduce its estimation error and thus prevent
overfitting.
Ideally, we could have tried all subsets of k out of d features and choose the
subset which leads to the best performing predictor. However, such an exhaustive
search is usually computationally intractable. In the following we describe three
computationally feasible approaches for feature selection. While these methods
cannot guarantee finding the optimal subset, they often work reasonably well in
practice. Some of the methods come with formal guarantees on the quality of the
selected subsets under certain assumptions. We do not discuss these guarantees
here.
25.1 Feature Selection 359
25.1.1 Filters
Maybe the simplest approach for feature selection is the filter method, in which
we assess individual features, independently of other features, according to some
quality measure. We can then select the k features that achieve the highest score
(alternatively, decide also on the number of features to select according to the
value of their scores).
Many quality measures for features have been proposed in the literature.
Maybe the most straightforward approach is to set the score of a feature ac-
cording to the error rate of a predictor that is trained solely by that feature.
To illustrate this, consider a linear regression problem with the squared loss.
Let v = (x1,j , . . . , xm,j ) 2 Rm be a vector designating the values of the jth
feature on a training set of m examples and let y = (y1 , . . . , ym ) 2 Rm be the
values of the target on the same m examples. The empirical squared loss of an
ERM linear predictor that uses only the jth feature would be
1
min kav + b yk2 ,
a,b2R m
where the meaning of adding a scalar b to a vector v is adding b to all coordinates
1
Pm
of v. To solve this problem, let v̄ = m i=1 vi be the averaged value of the
1
P m
feature and let ȳ = m i=1 y i be the averaged value of the target. Clearly (see
Exercise 1),
1 1
min kav + b yk2 = min ka(v v̄) + b (y ȳ)k2 . (25.1)
a,b2R m a,b2R m
Taking the derivative of the right-hand side objective with respect to b and
comparing it to zero we obtain that b = 0. Similarly, solving for a (once we know
that b = 0) yields a = hv v̄, y ȳi/kv v̄k2 . Plugging this value back into the
objective we obtain the value
(hv v̄, y ȳi)2
ky ȳk2 .
kv v̄k2
Ranking the features according to the minimal loss they achieve is equivalent
to ranking them according to the absolute value of the following score (where
now a higher score yields a better feature):
1
hv v̄, y ȳi hv v̄, y ȳi
= q m q . (25.2)
kv v̄k ky ȳk 1
v̄k2 m1
ky ȳk2
m kv
If Pearson’s coefficient equals zero it means that the optimal linear function
from v to y is the all-zeros function, which means that v alone is useless for
predicting y. However, this does not mean that v is a bad feature, as it might
be the case that together with other features v can perfectly predict y. Indeed,
consider a simple example in which the target is generated by the function y =
x1 + 2x2 . Assume also that x1 is generated from the uniform distribution over
{±1}, and x2 = 12 x1 + 12 z, where z is also generated i.i.d. from the uniform
distribution over {±1}. Then, E[x1 ] = E[x2 ] = E[y] = 0, and we also have
Therefore, for a large enough training set, the first feature is likely to have a
Pearson’s correlation coefficient that is close to zero, and hence it will most
probably not be selected. However, no function can predict the target value well
without knowing the first feature.
There are many other score functions that can be used by a filter method.
Notable examples are estimators of the mutual information or the area under
the receiver operating characteristic (ROC) curve. All of these score functions
su↵er from similar problems to the one illustrated previously. We refer the reader
to Guyon & Elissee↵ (2003).
We will maintain a vector ✓ t which minimizes the right-hand side of the equation.
Initially, we set I0 = ;, V0 = ;, and ✓ 1 to be the empty vector. At round t, for
every j, we decompose Xj = vj + uj where vj = Vt 1 Vt> 1 Xj is the projection
of Xj onto the subspace spanned by Vt 1 and uj is the part of Xj orthogonal to
Vt 1 (see Appendix C). Then,
(huj , yi)2
jt = argmax .
j kuj k2
where ej is the all zeros vector except 1 in the jth element. That is, we keep
the weights of the previously chosen coordinates intact and only optimize over
the new variable. Therefore, for each j we need to solve an optimization problem
over a single variable, which is a much easier task than optimizing over t.
An even simpler approach is to upper bound R(w) using a “simple” function
and then choose the feature which leads to the largest decrease in this upper
bound. For example, if R is a -smooth function (see Equation (12.5) in Chap-
ter 12), then
@R(w)
R(w + ⌘ej ) R(w) + ⌘ + ⌘ 2 /2.
@wj
See Exercise 3.
Backward Elimination
Another popular greedy selection approach is backward elimination. Here, we
start with the full set of features, and then we gradually remove one feature at a
time from the set of features. Given that our current set of selected features is I,
we go over all i 2 I, and apply the learning algorithm on the set of features I \{i}.
Each such application yields a di↵erent predictor, and we choose to remove the
feature i for which the predictor obtained from I \ {i} has the smallest risk (on
the training set or on a validation set).
Naturally, there are many possible variants of the backward elimination idea.
It is also possible to combine forward and backward greedy steps.
where1
kwk0 = |{i : wi 6= 0}|.
where is a regularization parameter. Since for any k1 there exists a such that
1 The function k · k0 is often referred to as the `0 norm. Despite the use of the “norm”
notation, k · k0 is not really a norm; for example, it does not satisfy the positive
homogeneity property of norms, kawk0 6= |a| kwk0 .
364 Feature Selection and Generation
Equation (25.4) and Equation (25.5) lead to the same solution, the two problems
are in some sense equivalent.
The `1 regularization often induces sparse solutions. To illustrate this, let us
start with the simple optimization problem
✓ ◆
1 2
min w xw + |w| . (25.6)
w2R 2
It is easy to verify (see Exercise 2) that the solution to this problem is the “soft
thresholding” operator
w = sign(x) [|x| ]+ , (25.7)
def
where [a]+ = max{a, 0}. That is, as long as the absolute value of x is smaller
than , the optimal solution will be zero.
Next, consider a one dimensional regression problem with respect to the squared
loss:
m
!
1 X 2
argmin (xi w yi ) + |w| .
w2Rm 2m i=1
1
P Pm
For simplicity let us assume that m i x2i = 1, and denote hx, yi = i=1 x i yi ;
then the optimal solution is
That is, the solution will be zero unless the correlation between the feature x
and the labels vector y is larger than .
Remark 25.4 Unlike the `1 norm, the `2 norm does not induce sparse solutions.
Indeed, consider the problem above with an `2 regularization, namely,
m
!
1 X
argmin (xi w yi )2 + w2 .
w2Rm 2m i=1
should rely on our prior assumptions on the problem. In the aforementioned ex-
ample, a prior assumption that may lead us to use the “clipping” transformation
is that features that get values larger than a predefined threshold value give us no
additional useful information, and therefore we can clip them to the predefined
threshold.
Centering:
This transformation makes the feature have zero mean, by setting fi fi f¯.
Unit Range:
This transformation makes the range of each feature be [0, 1]. Formally, let
fi fmin
fmax = maxi fi and fmin = mini fi . Then, we set fi fmax fmin . Similarly,
we can make the range of each feature be [ 1, 1] by the transformation fi
fi fmin
2 fmax fmin 1. Of course, it is easy to make the range [0, b] or [ b, b], where b is
a user-specified parameter.
Standardization:
This transformation makes all features have a zero mean and unit variance.
Pm
Formally, let ⌫ = m1
i=1 (fi f¯)2 be the empirical variance of the feature.
fp
i f
¯
Then, we set fi ⌫
.
Clipping:
This transformation clips high or low values of the feature. For example, fi
sign(fi ) max{b, |fi |}, where b is a user-specified parameter.
Sigmoidal Transformation:
As its name indicates, this transformation applies a sigmoid function on the
1
feature. For example, fi 1+exp(b fi ) , where b is a user-specified parameter.
This transformation can be thought of as a “soft” version of clipping: It has a
small e↵ect on values close to zero and behaves similarly to clipping on values
far away from zero.
368 Feature Selection and Generation
Logarithmic Transformation:
The transformation is fi log(b+fi ), where b is a user-specified parameter. This
is widely used when the feature is a “counting” feature. For example, suppose
that the feature represents the number of appearances of a certain word in a
text document. Then, the di↵erence between zero occurrences of the word and
a single occurrence is much more important than the di↵erence between 1000
occurrences and 1001 occurrences.
Remark 25.5 In the aforementioned transformations, each feature is trans-
formed on the basis of the values it obtains on the training set, independently
of other features’ values. In some situations we would like to set the parameter
of the transformation on the basis of other features as well. A notable example
is a transformation in which one applies a scaling to the features so that the
empirical average of some norm of the instances becomes 1.
in {0, 1}k that indicates the closest centroid to x, while takes as input an
indicator vector and returns the centroid representing this vector.
An important property of the k-means construction, which is key in allowing
k to be larger than d, is that maps instances into sparse vectors. In fact, in
k-means only a single coordinate of (x) is nonzero. An immediate extension of
the k-means construction is therefore to restrict the range of to be vectors with
at most s nonzero elements, where s is a small integer. In particular, let and
be functions that depend on µ1 , . . . , µk . The function maps an instance vector
x to a vector (x) 2 Rk , where (x) should have at most s nonzero elements.
Pk
The function (v) is defined to be i=1 vi µi . As before, our goal is to have a
small reconstruction error, and therefore we can define
where kvk0 = |{j : vj 6= 0}|. Note that when s = 1 and we further restrict kvk1 =
1 then we obtain the k-means encoding function; that is, (x) is the indicator
vector of the centroid closest to x. For larger values of s, the optimization problem
in the preceding definition of becomes computationally difficult. Therefore, in
practice, we sometime use `1 regularization instead of the sparsity constraint and
define to be
⇥ ⇤
(x) = argmin kx (v)k2 + kvk1 ,
v
25.4 Summary
Guyon & Elissee↵ (2003) surveyed several feature selection procedures, including
many types of filters.
Forward greedy selection procedures for minimizing a convex objective sub-
ject to a polyhedron constraint date back to the Frank-Wolfe algorithm (Frank
& Wolfe 1956). The relation to boosting has been studied by several authors,
including, (Warmuth, Liao & Ratsch 2006, Warmuth, Glocer & Vishwanathan
2008, Shalev-Shwartz & Singer 2008). Matching pursuit has been studied in the
signal processing community (Mallat & Zhang 1993). Several papers analyzed
greedy selection methods under various conditions. See, for example, Shalev-
Shwartz, Zhang & Srebro (2010) and the references therein.
The use of the `1 -norm as a surrogate for sparsity has a long history (e.g. Tib-
shirani (1996) and the references therein), and much work has been done on un-
derstanding the relationship between the `1 -norm and sparsity. It is also closely
related to compressed sensing (see Chapter 23). The ability to sparsify low `1
norm predictors dates back to Maurey (Pisier 1980-1981). In Section 26.4 we
also show that low `1 norm can be used to bound the estimation error of our
predictor.
Feature learning and dictionary learning have been extensively studied recently
in the context of deep neural networks. See, for example, (Lecun & Bengio 1995,
Hinton et al. 2006, Ranzato et al. 2007, Collobert & Weston 2008, Lee et al.
2009, Le et al. 2012, Bengio 2009) and the references therein.
25.6 Exercises
We have shown that if S is an ✏/2 representative sample then the ERM rule
is ✏-consistent, namely, LD (ERMH (S)) minh2H LD (h) + ✏.
To simplify our notation, let us denote
def def
F = ` H = {z 7! `(h, z) : h 2 H},
and given f 2 F, we define
m
1 X
LD (f ) = E [f (z)] , LS (f ) = f (zi ).
z⇠D m i=1
Taking supremum over f 2 F of both sides, and using the fact that the supremum
of expectation is smaller than expectation of the supremum we obtain
sup LD (f ) LS (f ) = sup E0 [LS 0 (f ) LS (f )]
f 2F f 2F S
" #
E0 sup LS 0 (f ) LS (f ) .
S f 2F
Next, we note that for each j, zj and zj0 are i.i.d. variables. Therefore, we can
replace them without a↵ecting the expectation:
2 0 13
X
E 0 4 sup @(f (zj0 ) f (zj )) + (f (zi0 ) f (zi ))A5 =
S,S f 2F
i6=j
2 0 13 (26.7)
X
E 4 sup @(f (zj ) f (zj0 )) + (f (zi0 ) f (zi ))A5 .
S,S 0 f 2F
i6=j
Finally,
X X X
0 0
sup i (f (zi ) f (zi )) sup i f (zi ) + sup i f (zi )
f 2F i f 2F i f 2F i
and since the probability of is the same as the probability of , the right-hand
side of Equation (26.9) can be bounded by
" #
X X
0
E0 sup i f (zi ) + sup i f (zi )
S,S , f 2F f 2F
i i
0
= m E0 [R(F S )] + m E[R(F S)] = 2m E[R(F S)].
S S S
The lemma immediately yields that, in expectation, the ERM rule finds a
hypothesis which is close to the optimal hypothesis in H.
theorem 26.3 We have
E [LD (ERMH (S)) LS (ERMH (S))] 2 E R(` H S).
S⇠D m S⇠D m
Proof The first inequality follows directly from Lemma 26.2. The second in-
equality follows because for any fixed h? ,
The third inequality follows from the previous inequality by relying on Markov’s
inequality (note that the random variable LD (ERMH (S)) LD (h? ) is nonnega-
tive).
Next, we derive bounds similar to the bounds in Theorem 26.3 with a better
dependence on the confidence parameter . To do so, we first introduce the
following bounded di↵erences concentration inequality.
theorem 26.5 Assume that for all z and h 2 H we have that |`(h, z)| c.
Then,
Proof First note that the random variable RepD (F, S) = suph2H (LD (h) LS (h))
satisfies the bounded di↵erences condition of Lemma 26.4 with a constant 2c/m.
Combining the bounds in Lemma 26.4 with Lemma 26.2 we obtain that with
probability of at least 1 ,
r r
2 ln(2/ ) 0 2 ln(2/ )
RepD (F, S) E RepD (F, S) + c 2 E0 R(` H S ) + c .
m S m
The first inequality of the theorem follows from the definition of RepD (F, S).
For the second inequality we note that the random variable R(` H S) also
satisfies the bounded di↵erences condition of Lemma 26.4 with a constant 2c/m.
Therefore, the second inequality follows from the first inequality, Lemma 26.4,
and the union bound. Finally, for the last inequality, denote hS = ERMH (S)
and note that
LD (hS ) LD (h? )
= LD (hS ) LS (hS ) + LS (hS ) LS (h? ) + LS (h? ) LD (h? )
(LD (hS ) LS (hS )) + (LS (h? ) LD (h? )) . (26.10)
The first summand on the right-hand side is bounded by the second inequality of
the theorem. For the second summand, we use the fact that h? does not depend
on S; hence by using Hoe↵ding’s inequality we obtain that with probaility of at
least 1 /2,
r
? ? ln(4/ )
LS (h ) LD (h ) c . (26.11)
2m
Combining this with the union bound we conclude our proof.
The preceding theorem tells us that if the quantity R(` H S) is small then it
is possible to learn the class H using the ERM rule. It is important to emphasize
that the last two bounds given in the theorem depend on the specific training
set S. That is, we use S both for learning a hypothesis from H as well as for
estimating the quality of it. This type of bound is called a data-dependent bound.
The following lemma tells us that the convex hull of A has the same complexity
as A.
380 Rademacher Complexities
PN
lemma 26.7 Let A be a subset of Rm and let A0 = { j=1 ↵j a(j) : N 2
N, 8j, a(j) 2 A, ↵j 0, k↵k1 = 1}. Then, R(A0 ) = R(A).
Proof The main idea follows from the fact that for any vector v we have
N
X
sup ↵j vj = max vj .
↵ 0:k↵k1 =1 j=1 j
Therefore,
m
X N
X
0 (j)
m R(A ) = E sup sup i ↵j ai
↵ 0:k↵k1 =1 a(1) ,...,a(N ) i=1 j=1
N
X m
X (j)
=E sup ↵j sup i ai
↵ 0:k↵k1 =1 j=1 a(j) i=1
m
X
= E sup i ai
a2A i=1
= m R(A),
and we conclude our proof.
The next lemma, due to Massart, states that the Rademacher complexity of
a finite set grows logarithmically with the size of the set.
lemma 26.8 (Massart lemma) Let A = {a1 , . . . , aN } be a finite set of vectors
PN
in Rm . Define ā = N1 i=1 ai . Then,
p
2 log(N )
R(A) max ka āk .
a2A m
Proof Based on Lemma 26.6, we can assume without loss of generality that
ā = 0. Let > 0 and let A0 = { a1 , . . . , aN }. We upper bound the Rademacher
complexity as follows:
✓ ◆
mR(A0 ) = E max0 h , ai = E log max0 eh ,ai
a2A a2A
" !#
X
E log eh ,ai
a2A0
" #!
X
log E eh ,ai
// Jensen’s inequality
a2A0
m
!
XY
i ai
= log E [e ] ,
i
a2A0 i=1
where the last equality occurs because the Rademacher variables are indepen-
dent. Next, using Lemma A.6 we have that for all ai 2 R,
i ai
exp(ai ) + exp( ai )
Ee = exp(a2i /2),
i 2
26.1 The Rademacher Complexity 381
and therefore
m
XY ✓ ◆! X
!
0 a2i 2
mR(A ) log exp = log exp kak /2
2
a2A0 i=1 a2A0
✓ ◆
log |A0 | max0 exp kak2 /2 = log(|A0 |) + max0 (kak2 /2).
a2A a2A
Proof For simplicity, we prove the lemma for the case ⇢ = 1. The case ⇢ 6=
1 will follow by defining 0 = ⇢1 and then using Lemma 26.6. Let Ai =
{(a1 , . . . , ai 1 , i (ai ), ai+1 , . . . , am ) : a 2 A}. Clearly, it suffices to prove that
for any set A and all i we have R(Ai ) R(A). Without loss of generality we will
prove the latter claim for i = 1 and to simplify notation we omit the subscript
from 1 . We have
" m
#
X
mR(A1 ) = E sup i ai
a2A1 i=1
" m
#
X
= E sup 1 (a1 ) + i ai
a2A i=2
" m
! m
!#
1 X X
= E sup (a1 ) + i ai + sup (a1 ) + i ai
2 2 ,..., m a2A i=2 a2A i=2
" m m
!#
1 X X
= E sup (a1 ) (a01 ) + i ai + 0
i ai
2 2 ,..., m a,a0 2A i=2 i=2
" m m
!#
1 X X
E sup |a1 a01 | + i ai + 0
i ai , (26.12)
2 2 ,..., m a,a0 2A i=2 i=2
where in the last inequality we used the assumption that is Lipschitz. Next,
we note that the absolute value on |a1 a01 | in the preceding expression can
382 Rademacher Complexities
be omitted since both a and a0 are from the same set A and the rest of the
expression in the supremum is not a↵ected by replacing a and a0 . Therefore,
" m m
!#
1 X X
0 0
mR(A1 ) E sup a1 a1 + i ai + i ai . (26.13)
2 2 ,..., m a,a0 2A i=2 i=2
But, using the same equalities as in Equation (26.12), it is easy to see that the
right-hand side of Equation (26.13) exactly equals m R(A), which concludes our
proof.
X m
X ⇥ 2
⇤
= hxi , xj i E [ i j] + hxi , xi i E i
i6=j i=1
Xm
= kxi k22 m max kxi k22 .
i
i=1
Combining this with Equation (26.15) and Equation (26.16) we conclude our
proof.
Next we bound the Rademacher complexity of H1 S.
lemma 26.11 Let S = (x1 , . . . , xm ) be vectors in Rn . Then,
r
2 log(2n)
R(H1 S) max kxi k1 .
i m
Proof Using Holder’s inequality we know that for any vectors w, v we have
hw, vi kwk1 kvk1 . Therefore,
" m
#
X
mR(H1 S) = E sup i ai
a2H1 S i=1
" m
#
X
=E sup i hw, xi i
w:kwk1 1 i=1
" m
#
X
=E sup hw, i xi i
w:kwk1 1 i=1
" m
#
X
E k i xi k1 . (26.17)
i=1
p
For each j 2 [n], let vj = (x1,j , . . . , xm,j ) 2 Rm . Note that kvj k2 m maxi kxi k1 .
Let V = {v1 , . . . , vn , v1 , . . . , vn }. The right-hand side of Equation (26.17) is
m R(V ). Using Massart lemma (Lemma 26.8) we have that
p
R(V ) max kxi k1 2 log(2n)/m,
i
Proof Let F = {(x, y) 7! (hw, xi, y) : w 2 H}. We will show that with
p
probability 1, R(F S) ⇢BR/ m and then the theorem will follow from
Theorem 26.5. Indeed, the set F S can be written as
and the bound on R(F S) follows directly by combining Lemma 26.9, Lemma 26.10,
and the assumption that kxk2 R with probability 1.
theorem 26.13 Consider a distribution D over X ⇥{±1} such that there exists
some vector w? with P(x,y)⇠D [yhw? , xi 1] = 1 and such that kxk2 R with
probability 1. Let wS be the output of Equation (26.19). Then, with probability
of at least 1 over the choice of S ⇠ Dm , we have that
r
2 R kw? k ? 2 ln(2/ )
P [y 6= sign(hwS , xi)] p + (1 + R kw k) .
(x,y)⇠D m m
26.3 Generalization Bounds for SVM 385
Proof Throughout the proof, let the loss function be the ramp loss (see Sec-
tion 15.2.3). Note that the range of the ramp loss is [0, 1] and that it is a
1-Lipschitz function. Since the ramp loss upper bounds the zero-one loss, we
have that
P [y 6= sign(hwS , xi)] LD (wS ).
(x,y)⇠D
Let B = kw? k2 and consider the set H = {w : kwk2 B}. By the definition of
hard-SVM and our assumption on the distribution, we have that wS 2 H with
probability 1 and that LS (wS ) = 0. Therefore, using Theorem 26.12 we have
that
r
2BR 2 ln(2/ )
LD (wS ) LS (wS ) + p + .
m m
Remark 26.1 Theorem 26.13 implies that the sample complexity of hard-SVM
2 ? 2
grows like R kw
✏2
k
. Using a more delicate analysis and the separability assump-
R2 kw? k2
tion, it is possible to improve the bound to an order of ✏ .
?
The bound in the preceding theorem depends on kw k, which is unknown.
In the following we derive a bound that depends on the norm of the output of
SVM; hence it can be calculated from the training set itself. The proof is similar
to the derivation of bounds for structure risk minimization (SRM).
theorem 26.14 Assume that the conditions of Theorem 26.13 hold. Then,
with probability of at least 1 over the choice of S ⇠ Dm , we have that
s
4RkwS k ln( 4 log2 (kwS k) )
P [y 6= sign(hwS , xi)] p + .
(x,y)⇠D m m
Remark 26.2 Note that all the bounds we have derived do not depend on the
dimension of w. This property is utilized when learning SVM with kernels, where
the dimension of w can be extremely large.
Our proof of the concentration lemma is due to Kakade and Tewari lecture
notes. Kakade, Sridharan & Tewari (2008) gave a unified framework for deriving
bounds on the Rademacher complexity of linear classes with respect to di↵erent
assumptions on the norms.
27 Covering Numbers
In this chapter we describe another way to measure the complexity of sets, which
is called covering numbers.
27.1 Covering
Example 27.1 (Subspace) Suppose that A ⇢ Rm , let c = maxa2A kak, p and as-
sume that A lies in a d-dimensional subspace of Rm . Then, N (r, A) (2c d/r)d .
To see this, let v1 , . . . , vd be an orthonormal basis of the subspace. Then, any
Pd
a 2 A can be written as a = i=1 ↵i vi with k↵k1 k↵k2 = kak2 c. Let
✏ 2 R and consider the set
( d )
X
0 0 0
A = ↵i vi : 8i, ↵i 2 { c, c + ✏, c + 2✏, . . . , c} .
i=1
Pd
Given a 2 A s.t. a = i=1↵i vi with k↵k1 c, there exists a0 2 A0 such that
X X
ka a0 k2 = k (↵i0 ↵i )vi k2 ✏2 kvi k2 ✏2 d.
i i
p
Choose ✏ = r/ d; then ka a0 k r and therefore A0 is an r-cover of A. Hence,
✓ ◆d p !d
0 2c 2c d
N (r, A) |A | = = .
✏ r
27.1.1 Properties
The following lemma is immediate from the definition.
N (⇢ r, A) N (r, A).
Hence, B 0 is an (⇢ r)-cover of B.
lemma 27.4 Let c = minā maxa2A ka āk. Then, for any integer M > 0,
M q
c2 M
6c X k k , A)).
R(A) p + 2 log(N (c 2
m m
k=1
kb(k) b(k 1)
k kb(k) a⇤ k + ka⇤ b(k 1)
k c (2 k
+2 (k 1)
) = 3c 2 k
.
B̂k = {(a a0 ) : a 2 Bk , a0 2 Bk 1 , ka a0 k 3 c 2 k
}.
390 Covering Numbers
Therefore,
M q
c2 M
6c X k k , A)).
R(A) p + 2 log(N (c2
m m
k=1
q p p
d log(2 d) + k d
q p p
d log(2 d) + d k.
Hence Lemma 27.5 yields
✓q ◆ p !
6c p p c d log(d)
R(A) d log(2 d) + 2 d = O .
m m
27.3 Bibliographic Remarks 391
The chaining technique is due to Dudley (1987). For an extensive study of cover-
ing numbers as well as other complexity measures that can be used to bound the
rate of uniform convergence we refer the reader to (Anthony & Bartlet 1999).
28 Proof of the Fundamental Theorem
of Learning Theory
In this chapter we prove Theorem 6.8 from Chapter 6. We remind the reader
the conditions of the theorem, which will hold throughout this chapter: H is a
hypothesis class of functions from a domain X to {0, 1}, the loss function is the
0 1 loss, and VCdim(H) = d < 1.
We shall prove the upper bound for both the realizable and agnostic cases
and shall prove the lower bound for the agnostic case. The lower bound for the
realizable case is left as an exercise.
For the upper bound we need to prove that there exists C such that H is agnostic
PAC learnable with sample complexity
d + ln(1/ )
mH (✏, ) C .
✏2
We will prove the slightly looser bound:
d log(d/✏) + ln(1/ )
mH (✏, ) C . (28.1)
✏2
The tighter bound in the theorem statement requires a more involved proof, in
which a more careful analysis of the Rademacher complexity using a technique
called “chaining” should be used. This is beyond the scope of this book.
To prove Equation (28.1), it suffices to show that applying the ERM with a
sample size
✓ ◆
32d 64d 8
m 4 2 · log 2
+ 2 · (8d log(e/d) + 2 log(4/ ))
✏ ✏ ✏
yields an ✏, -learner for H. We prove this result on the basis of Theorem 26.5.
Let (x1 , y1 ), . . . , (xm , ym ) be a classification training set. Recall that the Sauer-
Shelah lemma tells us that if VCdim(H) = d then
⇣ e m ⌘d
|{(h(x1 ), . . . , h(xm )) : h 2 H}| .
d
Denote A = {(1[h(x1 )6=y1 ] , . . . , 1[h(xm )6=ym ] ) : h 2 H}. This clearly implies that
⇣ e m ⌘d
|A| .
d
Combining this with Lemma 26.8 we obtain the following bound on the Rademacher
complexity:
r
2d log(em/d)
R(A) .
m
Using Theorem 26.5 we obtain that with probability of at least 1 , for every
h 2 H we have that
r r
8d log(em/d) 2 log(2/ )
LD (h) LS (h) + .
m m
Repeating the previous argument for minus the zero-one loss and applying the
union bound we obtain that with probability of at least 1 , for every h 2 H
it holds that
r r
8d log(em/d) 2 log(4/ )
|LD (h) LS (h)| +
m m
r
8d log(em/d) + 2 log(4/ )
2 .
m
To ensure that this is smaller than ✏ we need
4
m · (8d log(m) + 8d log(e/d) + 2 log(4/ )) .
✏2
Using Lemma A.2, a sufficient condition for the inequality to hold is that
✓ ◆
32d 64d 8
m 4 2 · log + 2 · (8d log(e/d) + 2 log(4/ )) .
✏ ✏2 ✏
Here, we prove that there exists C such that H is agnostic PAC learnable with
sample complexity
d + ln(1/ )
mH (✏, ) C .
✏2
We will prove the lower bound in two parts. First, we will show that m(✏, )
0.5 log(1/(4 ))/✏2 , and second we will show that for every 1/8 we have that
m(✏, ) 8d/✏2 . These two bounds will conclude the proof.
that there are h+ , h 2 H for which h+ (c) = 1 and h (c) = 1. Define two
distributions, D+ and D , such that for b 2 {±1} we have
(
1+yb✏
2 if x = c
Db ({(x, y)}) =
0 otherwise.
That is, all the distribution mass is concentrated on two examples (c, 1) and
(c, 1), where the probability of (c, b) is 1+b✏ 2 and the probability of (c, b) is
1 b✏
2 .
Let A be an arbitrary algorithm. Any training set sampled from Db has the
form S = (c, y1 ), . . . , (c, ym ). Therefore, it is fully characterized by the vector
y = (y1 , . . . , ym ) 2 {±1}m . Upon receiving a training set S, the algorithm A
returns a hypothesis h : X ! {±1}. Since the error of A w.r.t. Db only depends
on h(c), we can think of A as a mapping from {±1}m into {±1}. Therefore,
we denote by A(y) the value in {±1} corresponding to the prediction of h(c),
where h is the hypothesis that A outputs upon receiving the training set S =
(c, y1 ), . . . , (c, ym ).
Note that for any hypothesis h we have
1 h(c)b✏
LDb (h) = .
2
(
1 A(y)b✏ 1 ✏ ✏ if A(y) 6= b
LDb (A(y)) LDb (hb ) = =
2 2 0 otherwise.
Fix A. For b 2 {±1}, let Y b = {y 2 {0, 1}m : A(y) 6= b}. The distribution Db
induces a probability Pb over {±1}m . Hence,
X
P [LDb (A(y)) LDb (hb ) = ✏] = Db (Y b ) = Pb [y]1[A(y)6=b] .
y
Therefore,
max P [LDb (A(y)) LDb (hb ) = ✏]
b2{±1}
X
= max Pb [y]1[A(y)6=b]
b2{±1}
y
1X 1X
P+ [y]1[A(y)6=+] + P [y]1[A(y)6= ]
2 y 2 y
1 X 1 X
= (P+ [y]1[A(y)6=+] + P [y]1[A(y)6= ] ) + (P+ [y]1[A(y)6=+] + P [y]1[A(y)6= ] )
2 +
2
y2N y2N
1 X 1 X
(P [y]1[A(y)6=+] + P [y]1[A(y)6= ] ) + (P+ [y]1[A(y)6=+] + P+ [y]1[A(y)6= ] )
2 +
2
y2N y2N
1 X 1 X
= P [y] + P+ [y] .
2 +
2
y2N y2N
P P
Next note that y2N + P [y] = y2N P+ [y], and both values are the prob-
ability that a Binomial (m, (1 ✏)/2) random variable will have value greater
than m/2. Using Lemma B.11, this probability is lower bounded by
1⇣ p ⌘ 1⇣ p ⌘
1 1 exp( m✏2 /(1 ✏2 )) 1 1 exp( 2m✏2 ) ,
2 2
where we used the assumption that ✏2 1/2. It follows that if m 0.5 log(1/(4 ))/✏2
then there exists b such that
P [LDb (A(y)) LDb (hb ) = ✏]
✓ q ◆
1 p
1 1 4 ,
2
where the last inequality follows by standard algebraic manipulations. This con-
cludes our proof.
1 ⇢
h 2 H such that h(ci ) = bi for all i 2 [d], and its error is 2 . In addition, for
any other function f : X ! {±1}, it is easy to verify that
1 + ⇢ |{i 2 [d] : f (ci ) 6= bi }| 1 ⇢ |{i 2 [d] : f (ci ) = bi }|
LDb (f ) = · + · .
2 d 2 d
Therefore,
|{i 2 [d] : f (ci ) 6= bi }|
LDb (f ) min LDb (h) = ⇢ · . (28.2)
h2H d
Next, fix some learning algorithm A. As in the proof of the No-Free-Lunch
theorem, we have that
max E m LDb (A(S)) min LDb (h) (28.3)
Db :b2{±1}d S⇠Db h2H
E E m LDb (A(S)) min LDb (h) (28.4)
Db :b⇠U ({±1}d ) S⇠Db h2H
|{i 2 [d] : A(S)(ci ) 6= bi |
= E Em ⇢ · (28.5)
Db :b⇠U ({±1}d ) S⇠Db d
d
⇢X
= E E 1[A(S)(ci )6=bi ] , (28.6)
d i=1 Db :b⇠U ({±1}d ) S⇠Dbm
where the first equality follows from Equation (28.2). In addition, using the
definition of Db , to sample S ⇠ Db we can first sample (j1 , . . . , jm ) ⇠ U ([d])m , set
xr = cji , and finally sample yr such that P[yr = bji ] = (1 + ⇢)/2. Let us simplify
the notation and use y ⇠ b to denote sampling according to P[y = b] = (1 + ⇢)/2.
Therefore, the right-hand side of Equation (28.6) equals
d
⇢X
E E E 1[A(S)(ci )6=bi ] . (28.7)
d i=1 j⇠U ([d])m b⇠U ({±1}d ) 8r,yr ⇠bjr
We now proceed in two steps. First, we show that among all learning algorithms,
A, the one which minimizes Equation (28.7) (and hence also Equation (28.4))
is the Maximum-Likelihood learning rule, denoted AM L . Formally, for each i,
AM L (S)(ci ) is the majority vote among the set {yr : r 2 [m], xr = ci }. Second,
we lower bound Equation (28.7) for AM L .
lemma 28.1 Among all algorithms, Equation (28.4) is minimized for A being
the Maximum-Likelihood algorithm, AM L , defined as
!
X
8i, AM L (S)(ci ) = sign yr .
r:xr =ci
Proof Fix some j 2 [d]m . Note that given j and y 2 {±1}m , the training set
S is fully determined. Therefore, we can write A(j, y) instead of A(S). Let us
also fix i 2 [d]. Denote b¬i the sequence (b1 , . . . , bi 1 , bi+1 , . . . , bm ). Also, for any
28.2 The Lower Bound for the Agnostic Case 397
E E 1[A(S)(ci )6=bi ]
b⇠U ({±1}d ) 8r,yr ⇠bjr
1 X X
= E P [y|b¬i , bi ]1[A(j,y)(ci )6=bi ]
2 b¬i ⇠U ({±1}d 1)
y
bi 2{±1}
0 1
X 1 X X
= E P [y ¬I |b¬i ] @ P [y I |bi ]1[A(j,y)(ci )6=bi ] A .
b¬i ⇠U ({±1}d 1) 2 I
y ¬I y bi 2{±1}
The sum within the parentheses is minimized when A(j, y)(ci ) is the maximizer
of P [y I |bi ] over bi 2 {±1}, which is exactly the Maximum-Likelihood rule. Re-
peating the same argument for all i we conclude our proof.
Fix i. For every j, let ni (j) = {|t : jt = i|} be the number of instances in which
the instance is ci . For the Maximum-Likelihood rule, we have that the quantity
is exactly the probability that a binomial (ni (j), (1 ⇢)/2) random variable will
be larger than ni (j)/2. Using Lemma B.11, and the assumption ⇢2 1/2, we
have that
1⇣ p ⌘
P [B ni (j)/2] 1 1 e 2ni (j)⇢2 .
2
We have thus shown that
d
⇢X
E E E 1[A(S)(ci )6=bi ]
d i=1 j⇠U ([d])m b⇠U ({±1}d ) 8r,yr ⇠bjr
⇢ X
d ⇣ p ⌘
E m 1 1 e 2⇢2 ni (j)
2d i=1 j⇠U ([d])
⇢ X
d ⇣ p ⌘
E m 1 2⇢2 ni (j) ,
2d i=1 j⇠U ([d])
⇢ X⇣ ⌘
d p
= 1 2⇢2 m/d
2d i=1
⇢⇣ p ⌘
= 1 2⇢2 m/d .
2
398 Proof of the Fundamental Theorem of Learning Theory
Finally, Let = ⇢1 (LD (A(S)) minh2H LD (h)) and note that 2 [0, 1] (see
Equation (28.5)). Therefore, using Lemma B.1, we get that
✏ ✏
P[LD (A(S)) min LD (h) > ✏] = P > E[ ]
h2H ⇢ ⇢
1 ✏
.
4 ⇢
Choosing ⇢ = 8✏ we conclude that if m < 512d ✏2 , then with probability of at least
1/8 we will have LD (A(S)) minh2H LD (h) ✏.
Here we prove that there exists C such that H is PAC learnable with sample
complexity
d ln(1/✏) + ln(1/ )
mH (✏, ) C .
✏
We do so by showing that for m C d ln(1/✏)+ln(1/
✏
)
, H is learnable using the
ERM rule. We prove this claim based on the notion of ✏-nets.
8h 2 H : D(h) ✏ ) h \ S 6= ;.
theorem 28.3 Let H ⇢ 2X with VCdim(H) = d. Fix ✏ 2 (0, 1), 2 (0, 1/4)
and let
✓ ✓ ◆ ✓ ◆◆
8 16e 2
m 2d log + log .
✏ ✏
Then, with probability of at least 1 over a choice of S ⇠ Dm we have that S
is an ✏-net for H.
Proof Let
B = {S ⇢ X : |S| = m, 9h 2 H, D(h) ✏, h \ S = ;}
be the set of sets which are not ✏-nets. We need to bound P[S 2 B]. Define
Claim 1
P[S 2 B] 2 P[(S, T ) 2 B 0 ].
Proof of Claim 1 : Since S and T are chosen independently we can write
⇥ ⇤ h ⇥ ⇤i
P[(S, T ) 2 B 0 ] = E 2m 1[(S,T )2B 0 ] = E m E m 1[(S,T )2B 0 ] .
(S,T )⇠D S⇠D T ⇠D
Note that (S, T ) 2 B 0 implies S 2 B and therefore 1[(S,T )2B 0 ] = 1[(S,T )2B 0 ] 1[S2B] ,
which gives
P[(S, T ) 2 B 0 ] = E E 1[(S,T )2B 0 ] 1[S2B]
S⇠D m T ⇠D m
Fix some S. Then, either 1[S2B] = 0 or S 2 B and then 9hS such that D(hS ) ✏
and |hS \ S| = 0. It follows that a sufficient condition for (S, T ) 2 B 0 is that
|T \ hS | > ✏m
2 . Therefore, whenever S 2 B we have
Claim 2 (Symmetrization):
P[(S, T ) 2 B 0 ] e ✏m/4 ⌧H (2m).
Proof of Claim 2 : To simplify notation, let ↵ = m✏/2 and for a sequence A =
(x1 , . . . , x2m ) let A0 = (x1 , . . . , xm ). Using the definition of B 0 we get that
P[A 2 B 0 ] = E max 1[D(h) ✏] 1[|h\A0 |=0] 1[|h\A| ↵]
A⇠D 2m h2H
Let J = {j ⇢ [2m] : |j| = m}. For any j 2 J and A = (x1 , . . . , x2m ) define
Aj = (xj1 , . . . , xjm ). Since the elements of A are chosen i.i.d., we have that
for any j 2 J and any function f (A, A0 ) it holds that EA⇠D2m [f (A, A0 )] =
400 Proof of the Fundamental Theorem of Learning Theory
EA⇠D2m [f (A, Aj )]. Since this holds for any j it also holds for the expectation of
j chosen at random from J. In particular, it holds for the function f (A, A0 ) =
P
h2HA 1[|h\A0 |=0] 1[|h\A| ↵] . We therefore obtain that
X
P[A 2 B 0 ] E 2m E 1[|h\Aj |=0] 1[|h\A| ↵]
A⇠D j⇠J
h2HA
X
= E 2m 1[|h\A| ↵] E 1[|h\Aj |=0] .
A⇠D j⇠J
h2HA
Now, fix some A s.t. |h \ A| ↵. Then, Ej 1[|h\Aj |=0] is the probability that
when choosing m balls from a bag with at least ↵ red balls, we will never choose
a red ball. This probability is at most
Using the definition of the growth function we conclude the proof of Claim 2.
Completing the Proof: By Sauer’s lemma we know that ⌧H (2m) (2em/d)d .
Combining this with the two claims we obtain that
We would like the right-hand side of the inequality to be at most ; that is,
2(2em/d)d e ✏m/4
.
The proof of Natarajan’s lemma shares the same spirit of the proof of Sauer’s
lemma and is left as an exercise (see Exercise 3).
In this section we show how to calculate (or estimate) the Natarajan dimen-
sion of several popular classes, some of which were studied in Chapter 17. As
these calculations indicate, the Natarajan dimension is often proportional to the
number of parameters required to define a hypothesis.
If there are two labels that maximize hi (x), we choose the smaller one. Also, let
OvA,k k
Hbin = {T (h̄) : h̄ 2 (Hbin ) }.
OvA,k
What “should” be the Natarajan dimension of Hbin ? Intuitively, to specify a
hypothesis in Hbin we need d = VCdim(Hbin ) parameters. To specify a hypothe-
OvA,k
sis in Hbin , we need to specify k hypotheses in Hbin . Therefore, kd parameters
should suffice. The following lemma establishes this intuition.
The proof follows by taking the logarithm and applying Lemma A.1.
OvA,k
How tight is Lemma 29.5? It is not hard to see that for some classes, Ndim(Hbin )
can be much smaller than dk (see Exercise 1). However there are several natural
OvA,k
binary classes, Hbin (e.g., halfspaces), for which Ndim(Hbin ) = ⌦(dk) (see
Exercise 6).
learnable by some ERM, but other ERMs will fail to learn it. Clearly, this also
implies that the class is learnable but it does not have the uniform convergence
property. For simplicity, we consider only the realizable case.
The class we consider is defined as follows. The instance space X will be any
finite or countable set. Let Pf (X ) be the collection of all finite and cofinite
subsets of X (that is, for each A 2 Pf (X ), either A or X \ A must be finite).
Instead of [k], the label set is Y = Pf (X ) [ {⇤}, where ⇤ is some special label.
For every A 2 Pf (X ) define hA : X ! Y by
(
A x2A
hA (x) =
⇤ x2 /A
Finally, the hypothesis class we take is
H = {hA : A 2 Pf (X )}.
Agood (S) = h; ;
that is, it outputs the hypothesis which predicts ‘*’ for every x 2 X . The second
ERM, Abad , is defined by
The following claim shows that the sample complexity of Abad is about |X |-times
larger than the sample complexity of Agood . This establishes a gap between
di↵erent ERMs. If X is infinite, we even obtain a learnable class that is not
learnable by every ERM.
claim 29.9
Proof Let D be a distribution over X and suppose that the correct labeling
is hA . For any sample, Agood returns either h; or hA . If it returns hA then its
true error is zero. Thus, it returns a hypothesis with error ✏ only if all the m
examples in the sample are from X \ A while the error of h; , LD (h; ) = PD [A],
is ✏. Assume m 1✏ log( 1 ); then the probability of the latter event is no more
than (1 ✏)m e ✏m . This establishes item 1.
Next we prove item 2. We restrict the proof to the case that |X | = d < 1.
The proof for infinite X is similar. Suppose that X = {x0 , . . . , xd 1 }.
Let a > 0 be small enough such that 1 2✏ e 4✏ for every ✏ < a and fix
some ✏ < a. Define a distribution on X by setting P[x0 ] = 1 2✏ and for all
1 i d 1, P[xi ] = d2✏1 . Suppose that the correct hypothesis is h; and let the
sample size be m. Clearly, the hypothesis returned by Abad will err on all the
examples from X which are not in the sample. By Cherno↵’s bound, if m d6✏1 ,
1
then with probability e 6 , the sample will include no more than d 2 1 examples
from X . Thus the returned hypothesis will have error ✏.
We emphasize that the Õ notation may hide only poly-log factors of ✏, , and
Ndim(H), but no factor of k.
The Natarajan dimension is due to Natarajan (1989). That paper also established
the Natarajan lemma and the generalization of the fundamental theorem. Gen-
eralizations and sharper versions of the Natarajan lemma are studied in Haussler
& Long (1995). Ben-David, Cesa-Bianchi, Haussler & Long (1995) defined a large
family of notions of dimensions, all of which generalize the VC dimension and
may be used to estimate the sample complexity of multiclass classification.
The calculation of the Natarajan dimension, presented here, together with
calculation of other classes, can be found in Daniely et al. (2012). The example
of good and bad ERMs, as well as conjecture 29.10, are from Daniely et al.
(2011).
29.6 Exercises 409
29.6 Exercises
1. Let d, k > 0. Show that there exists a binary hypothesis Hbin of VC dimension
OvA,k
d such that Ndim(Hbin ) = d.
2. Prove Lemma 29.6.
3. Prove Natarajan’s lemma.
Hint: Fix some x0 2 X . For i, j 2 [k], denote by Hij all the functions f :
X \ {x0 } ! [k] that can be extended to a function in H both by defining
P
f (x0 ) = i and by defining f (x0 ) = j. Show that |H| |HX \{x0 } | + i6=j |Hij |
and use induction.
4. Adapt the proof of the binary fundamental theorem and Natarajan’s lemma
to prove that, for some universal constant C > 0 and for every hypothesis
class of Natarajan dimension d, the agnostic sample complexity of H is
kd
d log + log(1/ )
✏
mH (✏, ) C .
✏2
5. Prove that, for some universal constant C > 0 and for every hypothesis class
of Natarajan dimension d, the agnostic sample complexity of H is
d + log(1/ )
mH (✏, ) C .
✏2
Hint: Deduce it from the binary fundamental theorem.
6. Let H be the binary hypothesis class of (nonhomogenous) halfspaces in Rd .
The goal of this exercise is to prove that Ndim(HOvA,k ) (d 1) · (k 1).
1. Let Hdiscrete be the class of all functions f : [k 1] ⇥ [d 1] ! {0, 1} for
which there exists some i0 such that, for every j 2 [d 1]
8i < i0 , f (i, j) = 1 while 8i > i0 , f (i, j) = 0.
OvA,k
Show that Ndim(Hdiscrete ) = (d 1) · (k 1).
2. Show that Hdiscrete can be realized by H. That is, show that there exists
a mapping : [k 1] ⇥ [d 1] ! Rd such that
Hdiscrete ⇢ {h : h 2 H} .
Hint: You can take (i, j) to be the vector whose jth coordinate is 1, whose
last coordinate is i and the rest are zeros.
3. Conclude that Ndim(HOvA,k ) (d 1) · (k 1).
30 Compression Bounds
To motivate the results, let us first consider the following learning protocol.
First, we sample a sequence of k examples denoted T . On the basis of these
examples, we construct a hypothesis denoted hT . Now we would like to estimate
the performance of hT so we sample a fresh sequence of m k examples, denoted
V , and calculate the error of hT on V . Since V and T are independent, we
immediately get the following from Bernstein’s inequality (see Lemma B.10).
lemma 30.1 Assume that the range of the loss function is [0, 1]. Then,
" s #
2LV (hT ) log(1/ ) 4 log(1/ )
P LD (hT ) LV (hT ) + .
|V | |V |
To derive this bound, all we needed was independence between T and V .
Therefore, we can redefine the protocol as follows. First, we agree on a sequence
of k indices I = (i1 , . . . , ik ) 2 [m]k . Then, we sample a sequence of m examples
S = (z1 , . . . , zm ). Now, define T = SI = (zi1 , . . . , zik ) and define V to be the
rest of the examples in S. Note that this protocol is equivalent to the protocol
we defined before – hence Lemma 30.1 still holds.
Applying a union bound over the choice of the sequence of indices we obtain
the following theorem.
mk .
corollary 30.3 Assuming the conditions of Theorem 30.2, and further as-
suming that LV (A(S)) = 0, then, with probability of at least 1 over the choice
of S we have
8k log(m/ )
LD (A(S)) .
m
These results motivate the following definition:
The following lemma shows that the existence of a compression scheme for
the realizable case also implies the existence of a compression scheme for the
unrealizable case.
lemma 30.6 Let H be a hypothesis class for binary classification, and assume
it has a compression scheme of size k in the realizable case. Then, it has a
compression scheme of size k for the unrealizable case as well.
Proof Consider the following scheme: First, find an ERM hypothesis and denote
it by h. Then, discard all the examples on which h errs. Now, apply the realizable
compression scheme on the examples that have not been removed. The output of
the realizable compression scheme, denoted h0 , must be correct on the examples
that have not been removed. Since h errs on the removed examples it follows
that the error of h0 cannot be larger than the error of h; hence h0 is also an ERM
hypothesis.
30.2 Examples
In the examples that follows, we present compression schemes for several hy-
pothesis classes for binary classification. In light of Lemma 30.6 we focus on the
realizable case. Therefore, to show that a certain hypothesis class has a com-
pression scheme, it is necessary to show that there exist A, B, and k for which
LS (h0 ) = 0.
30.2.2 Halfspaces
Let X = Rd and consider the class of homogenous halfspaces, {x 7! sign(hw, xi) :
w 2 Rd }.
30.2 Examples 413
A Compression Scheme:
W.l.o.g. assume all labels are positive (otherwise, replace xi by yi xi ). The com-
pression scheme we propose is as follows. First, A finds the vector w which is
in the convex hull of {x1 , . . . , xm } and has minimal norm. Then, it represents it
as a convex combination of d points in the sample (it will be shown later that
this is always possible). The output of A are these d points. The algorithm B
receives these d points and set w to be the point in their convex hull of minimal
norm.
Next we prove that this indeed is a compression sceme. Since the data is
linearly separable, the convex hull of {x1 , . . . , xm } does not contain the origin.
Consider the point w in this convex hull closest to the origin. (This is a unique
point which is the Euclidean projection of the origin onto this convex hull.) We
claim that w separates the data.1 To see this, assume by contradiction that
2
hw, xi i 0 for some i. Take w0 = (1 ↵)w + ↵xi for ↵ = kxi kkwk 2 +kwk2 2 (0, 1).
< kwk2 ,
Note that p(x) can be rewritten as hw, (x)i where the elements of (x) are all
the monomials of x up to degree r. Therefore, the problem of constructing a com-
pression scheme for p(x) reduces to the problem of constructing a compression
0
scheme for halfspaces in Rd where d0 = O(dr ).
The Minimum Description Length (MDL) and Occam’s razor principles allow a
potentially very large hypothesis class but define a hierarchy over hypotheses and
prefer to choose hypotheses that appear higher in the hierarchy. In this chapter
we describe the PAC-Bayesian approach that further generalizes this idea. In
the PAC-Bayesian approach, one expresses the prior knowledge by defining prior
distribution over the hypothesis class.
By the linearity of expectation, the generalization loss and training loss of Q can
be written as
def def
LD (Q) = E [LD (h)] and LS (Q) = E [LS (h)].
h⇠Q h⇠Q
The following theorem tells us that the di↵erence between the generalization
loss and the empirical loss of a posterior Q is bounded by an expression that
depends on the Kullback-Leibler divergence between Q and the prior distribu-
tion P . The Kullback-Leibler is a natural measure of the distance between two
distributions. The theorem suggests that if we would like to minimize the gen-
eralization loss of Q, we should jointly minimize both the empirical loss of Q
and the Kullback-Leibler distance between Q and the prior distribution. We will
later show how in some cases this idea leads to the regularized risk minimization
principle.
where
def
D(Q||P ) = E [ln(Q(h)/P (h))]
h⇠Q
ES [ef (S) ]
P[f (S) ✏] = P[ef (S) e✏ ] . (31.1)
S S e✏
Let (h) = LD (h) LS (h). We will apply Equation (31.1) with the function
✓ ◆
2
f (S) = sup 2(m 1) E ( (h)) D(Q||P ) .
Q h⇠Q
We now turn to bound ES [ef (S) ]. The main trick is to upper bound f (S) by
using an expression that does not depend on Q but rather depends on the prior
probability P . To do so, fix some S and note that from the definition of D(Q||P )
we get that for all Q,
1) (h)2
2(m 1) E ( (h))2 D(Q||P ) = E [ln(e2(m P (h)/Q(h))]
h⇠Q h⇠Q
1) (h)2
ln E [e2(m P (h)/Q(h)]
h⇠Q
1) (h)2
= ln E [e2(m ], (31.2)
h⇠P
where the inequality follows from Jensen’s inequality and the concavity of the
log function. Therefore,
1) (h)2
E[ef (S) ] E E [e2(m ]. (31.3)
S S h⇠P
The advantage of the expression on the right-hand side stems from the fact that
we can switch the order of expectations (because P is a prior that does not
depend on S), which yields
1) (h)2
E[ef (S) ] E E[e2(m ]. (31.4)
S h⇠P S
31.2 Bibliographic Remarks 417
1) (h)2
Next, we claim that for all h we have ES [e2(m ] m. To do so, recall that
Hoe↵ding’s inequality tells us that
2m✏2
P[ (h) ✏] e .
S
2
This implies that ES [e2(m 1) (h) ] m (see Exercise 1). Combining this with
Equation (31.4) and plugging into Equation (31.1) we get
m
P[f (S) ✏] . (31.5)
S e✏
Denote the right-hand side of the above , thus ✏ = ln(m/ ), and we therefore
obtain that with probability of at least 1 we have that for all Q
Rearranging the inequality and using Jensen’s inequality again (the function x2
is convex) we conclude that
✓ ◆2
ln(m/ ) + D(Q||P )
E (h) E ( (h))2 . (31.6)
h⇠Q h⇠Q 2(m 1)
This rule is similar to the regularized risk minimization principle. That is, we
jointly minimize the empirical loss of Q on the sample and the Kullback-Leibler
“distance” between Q and P .
PAC-Bayes bounds were first introduced by McAllester (1998). See also (McAllester
1999, McAllester 2003, Seeger 2003, Langford & Shawe-Taylor 2003, Langford
2006).
31.3 Exercises
2m✏2
1. Let X be a random variable that satisfies P[X ✏] e . Prove that
2
E[e2(m 1)X ] m.
418 PAC-Bayes
2. • Suppose that H is a finite hypothesis class, set the prior to be uniform over
H, and set the posterior to be Q(hS ) = 1 for some hS and Q(h) = 0 for
all other h 2 H. Show that
s
ln(|H|) + ln(m/ )
LD (hS ) LS (h) + .
2(m 1)
Compare to the bounds we derived using uniform convergence.
• Derive a bound similar to the Occam bound given in Chapter 7 using the
PAC-Bayes bound
Appendix A Technical Lemmas
p
Proof For all i = 0, 1, 2, . . . denote ti = a (i+ log(b)). Since ti is monotonically
increasing we have that
p 1
X
E[|X x0 |] a log(b) + ti P[|X x0 | > ti 1 ].
i=1
1
X 1
X p p
log(b))2
ti P[|X x0 | > ti 1] 2ab (i + log(b))e (i 1+
i=1 i=1
Z 1
(x 1)2
2ab p xe dx
1+ log(b)
Z 1
y2
= 2ab p (y + 1)e dy
log(b)
Z 1
2
4ab p ye y dy
log(b)
h i1
y2
= 2ab e p
log(b)
= 2 a b/b = 2 a.
d ✓ ◆
X ⇣ e m ⌘d
m
.
k d
k=0
Proof We prove the claim by induction. For d = 1 the left-hand side equals
1 + m while the right-hand side equals em; hence the claim is true. Assume that
the claim holds for d and let us prove it for d + 1. By the induction assumption
we have
d+1 ✓ ◆
X ⇣ e m ⌘d ◆✓
m m
+
k d d+1
k=0
✓ ◆d !
⇣ e m ⌘d d m(m 1)(m 2) · · · (m d)
= 1+
d em (d + 1)d!
✓ ◆d !
⇣ em ⌘d d (m d)
1+ .
d e (d + 1)d!
Technical Lemmas 421
Therefore,
X1
ea + e a
a2n
= ,
2 n=0
(2n)!
and
X1
a2 /2 a2n
e = .
n=0
2n n!
Therefore,
1 µ a+µ 1
P[Z > 1 a] 1 = .
a a
The proof follows by denoting the right-hand side and solving for a.
The deviation between the empirical average and the mean given previously
decreases polynomially with m. It is possible to obtain a significantly faster
decrease. In the sections that follow we derive bounds that decrease exponentially
fast.
monotonicity of the exponent function and Markov’s inequality, we have that for
every t > 0
E[etZ ]
P[Z > (1 + )p] = P[etZ > et(1+ )p
] . (B.5)
e(1+ )tp
Next,
P Y
E[etZ ] = E[et i Zi
] = E[ etZi ]
i
Y
tZi
= E[e ] by independence
i
Y
= pi et + (1 pi )e0
i
Y
= 1 + pi (et 1)
i
Y t
epi (e 1)
using 1 + x ex
i
P
pi (et 1)
=e i
t
= e(e 1)p
.
Combining the above with Equation (B.5) and choosing t = log(1 + ) we obtain
h( ) p
P[Z > (1 + )p] e ,
where
h( ) = (1 + ) log(1 + ) .
2
p
P[Z > (1 + )p] e 2+2 /3 .
tZ t(1 )p E[e tZ
]
P[Z < (1 )p] = P[ Z > (1 )p] = P[e >e ] , (B.6)
e (1 )tp
B.4 Hoe↵ding’s Inequality 425
and,
P Y
tZ t Zi tZi
E[e ] = E[e i ] = E[ e ]
i
Y
tZi
= E[e ] by independence
i
Y
t
= 1 + pi (e 1)
i
Y t
epi (e 1)
using 1 + x ex
i
t
(e 1)p
=e .
Therefore,
Y 2 (b
a)2 2 (b a)2
✏ ✏+
P[X̄ ✏] e e 8m2 =e 8m .
i
lemma B.7 (Hoe↵ding’s lemma) Let X be a random variable that takes values
in the interval [a, b] and such that E[X] = 0. Then, for every > 0,
2 (b a)2
X
E[e ]e 8 .
x
Proof Since f (x) = e is a convex function, we have that for every ↵ 2 (0, 1),
and x 2 [a, b],
x b x a x a b
e e + e .
b a b a
Taking the expectation, we obtain that
X b E[X] a E[x] a b b a a
E[e ] e + e = e e b,
b a b a b a b a
where we used the fact that E[X] = 0. Denote h = (b a), p = b aa , and
L(h) = hp + log(1 p + peh ). Then, the expression on the right-hand side of
the above can be rewritten as eL(h) . Therefore, to conclude our proof it suffices
2
to show that L(h) h8 . This follows from Taylor’s theorem using the facts:
L(0) = L0 (0) = 0 and L00 (h) 1/4 for all h.
Bennet’s and Bernsein’s inequalities are similar to Cherno↵’s bounds, but they
hold for any sequence of independent random variables. We state the inequalities
without proof, which can be found, for example, in Cesa-Bianchi & Lugosi (2006).
where
h(a) = (1 + a) log(1 + a) a.
By using the inequality h(a) a2 /(2 + 2a/3) it is possible to derive the
following:
lemma B.9 (Bernstein’s inequality) Let Z1 , . . . , Zm be i.i.d. random variables
with a zero mean. If for all i, P(|Zi | < M ) = 1, then for all t > 0 :
" m
# !
X t2 /2
P Zi > t exp P .
i=1
E Zj2 + M t/3
B.5.1 Application
Bernstein’s inequality can be used to interpolate between the rate 1/✏ we derived
for PAC learning in the realizable case (in Chapter 2) and the rate 1/✏2 we derived
for the unrealizable case (in Chapter 4).
lemma B.10 Let ` : H ⇥ Z ! [0, 1] be a loss function. Let D be an arbitrary
distribution over Z. Fix some h. Then, for any 2 (0, 1) we have
" r #
2LD (h) log(1/ ) 2 log(1/ )
1. P LS (h) LD (h) + +
S⇠D m 3m m
" r #
2LS (h) log(1/ ) 4 log(1/ )
2. P LD (h) LS (h) + +
S⇠D m m m
2
B.7 Concentration of Variables
On the basis of the equation and using Cherno↵’s bounding method we have
h i
P[Z (1 + ✏)k)] = P e Z e(1+✏)k
⇥ ⇤
e (1+✏)k E e Z
(1+✏)k k/2
=e (1 2 )
(1+✏)k
e ek = e ✏k
,
where the last inequality occurs because (1 a) e a . Setting = ✏/6 (which
is in (0, 1/2) by our assumption) we obtain the second inequality stated in the
lemma.
Finally, the last inequality follows from the first two inequalities and the union
bound.
Appendix C Linear Algebra
In this chapter we only deal with linear algebra over finite dimensional Euclidean
spaces. We refer to vectors as column vectors.
Given two d dimensional vectors u, v 2 Rd , their inner product is
d
X
hu, vi = ui v i .
i=1
p
The Euclidean norm (a.k.a. the `2 norm) is kuk = hu, ui. We also use the `1
Pd
norm, kuk1 = i=1 |ui | and the `1 norm kuk1 = maxi |ui |.
A subspace of Rd is a subset of Rd which is closed under addition and scalar
multiplication. The span of a set of vectors u1 , . . . , uk is the subspace containing
all vectors of the form
k
X
↵i ui
i=1
A symmetric matrix A 2 Rd,d is positive definite if all its eigenvalues are positive.
A is positive semidefinite if all its eigenvalues are nonnegative.
theorem C.2 Let A 2 Rd,d be a symmetric matrix. Then, the following are
equivalent definitions of positive semidefiniteness of A:
Av = u and A> u = v.
We first show that if we can find r orthonormal singular vectors with positive
singular values, then we can decompose A = U DV > , with the columns of U and
V containing the left and right singular vectors, and D being a diagonal r ⇥ r
matrix with the singular values on its diagonal.
432 Linear Algebra
It follows that if U is a matrix whose columns are the ui ’s, V is a matrix whose
columns are the vi ’s, and D is a diagonal matrix with Di,i = i , then
A = U DV > .
Proof Any right singular vector of A must be in the range of A> (otherwise,
the singular value will have to be zero). Therefore, v1 , . . . , vr is an orthonormal
basis of the range of A. Let us complete it to an orthonormal basis of Rn by
Pr
adding the vectors vr+1 , . . . , vn . Define B = i=1 i ui vi> . It suffices to prove
that for all i, Avi = Bvi . Clearly, if i > r then Avi = 0 and Bvi = 0 as well.
For i r we have
r
X
>
Bvi = j uj vj vi = i ui = Avi ,
j=1
and
1
A> u = A> Av = v = v.
C.4 Singular Value Decomposition (SVD) 433
lemma C.5 Let A 2 Rm,n with rank r. Define the following vectors:
v1 = argmax kAvk
v2Rn :kvk=1
v2 = argmax kAvk
v2Rn :kvk=1
hv,v1 i=0
..
.
vr = argmax kAvk
v2Rn :kvk=1
8i<r, hv,vi i=0
Proof First note that since the rank of A is r, the range of A is a subspace of
dimension r, and therefore it is easy to verify that for all i = 1, . . . , r, kAvi k > 0.
Let W 2 Rn,n be an orthonormal matrix obtained by the eigenvalue decompo-
sition of A> A, namely, A> A = W DW > , with D being a diagonal matrix with
D1,1 D2,2 ··· 0. We will show that v1 , . . . , vr are eigenvectors of A> A
that correspond to nonzero eigenvalues, and, hence, using Lemma C.4 it follows
that these are also right singular vectors of A. The proof is by induction. For the
basis of the induction, note that any unit vector v can be written as v = W x,
for x = W > v, and note that kxk = 1. Therefore,
n
X
kAvk2 = kAW xk2 = kW DW > W xk2 = kW Dxk2 = kDxk2 = 2
Di,i xi 2 .
i=1
Therefore,
n
X
max kAvk2 = max 2
Di,i xi 2 .
v:kvk=1 x:kxk=1
i=1
The solution of the right-hand side is to set x = (1, 0, . . . , 0), which implies that
v1 is the first eigenvector of A> A. Since kAv1 k > 0 it follows that D1,1 > 0 as
required. For the induction step, assume that the claim holds for some 1 t
r 1. Then, any v which is orthogonal to v1 , . . . , vt can be written as v = W x
with all the first t elements of x being zero. It follows that
n
X
max kAvk2 = max 2
Di,i xi 2 .
v:kvk=1,8it,v> vi =0 x:kxk=1
i=t+1
The solution of the right-hand side is the all zeros vector except xt+1 = 1. This
implies that vt+1 is the (t + 1)th column of W . Finally, since kAvt+1 k > 0 it
follows that Dt+1,t+1 > 0 as required. This concludes our proof.
434 Linear Algebra
corollary C.6 (The SVD theorem) Let A 2 Rm,n with rank r. Then A =
U DV > where D is an r ⇥ r matrix with nonzero singular values of A and the
columns of U, V are orthonormal left and right singular vectors of A. Further-
2
more, for all i, Di,i is an eigenvalue of A> A, the ith column of V is the cor-
responding eigenvector of A> A and the ith column of U is the corresponding
eigenvector of AA> .
Notes
Abernethy, J., Bartlett, P. L., Rakhlin, A. & Tewari, A. (2008), Optimal strategies and
minimax lower bounds for online convex games, in ‘Proceedings of the Nineteenth
Annual Conference on Computational Learning Theory’.
Ackerman, M. & Ben-David, S. (2008), Measures of clustering quality: A working set
of axioms for clustering, in ‘Proceedings of Neural Information Processing Systems
(NIPS)’, pp. 121–128.
Agarwal, S. & Roth, D. (2005), Learnability of bipartite ranking functions, in ‘Pro-
ceedings of the 18th Annual Conference on Learning Theory’, pp. 16–31.
Agmon, S. (1954), ‘The relaxation method for linear inequalities’, Canadian Journal
of Mathematics 6(3), 382–392.
Aizerman, M. A., Braverman, E. M. & Rozonoer, L. I. (1964), ‘Theoretical foundations
of the potential function method in pattern recognition learning’, Automation and
Remote Control 25, 821–837.
Allwein, E. L., Schapire, R. & Singer, Y. (2000), ‘Reducing multiclass to binary: A uni-
fying approach for margin classifiers’, Journal of Machine Learning Research 1, 113–
141.
Alon, N., Ben-David, S., Cesa-Bianchi, N. & Haussler, D. (1997), ‘Scale-sensitive dimen-
sions, uniform convergence, and learnability’, Journal of the ACM 44(4), 615–631.
Anthony, M. & Bartlet, P. (1999), Neural Network Learning: Theoretical Foundations,
Cambridge University Press.
Baraniuk, R., Davenport, M., DeVore, R. & Wakin, M. (2008), ‘A simple proof of
the restricted isometry property for random matrices’, Constructive Approximation
28(3), 253–263.
Barber, D. (2012), Bayesian reasoning and machine learning, Cambridge University
Press.
Bartlett, P., Bousquet, O. & Mendelson, S. (2005), ‘Local rademacher complexities’,
Annals of Statistics 33(4), 1497–1537.
Bartlett, P. L. & Ben-David, S. (2002), ‘Hardness results for neural network approxi-
mation problems’, Theor. Comput. Sci. 284(1), 53–66.
Bartlett, P. L., Long, P. M. & Williamson, R. C. (1994), Fat-shattering and the learn-
ability of real-valued functions, in ‘Proceedings of the seventh annual conference on
Computational learning theory’, ACM, pp. 299–310.
Bartlett, P. L. & Mendelson, S. (2001), Rademacher and Gaussian complexities: Risk
bounds and structural results, in ‘14th Annual Conference on Computational Learn-
ing Theory, COLT 2001’, Vol. 2111, Springer, Berlin, pp. 224–240.
Breiman, L. (1996), Bias, variance, and arcing classifiers, Technical Report 460, Statis-
tics Department, University of California at Berkeley.
Breiman, L. (2001), ‘Random forests’, Machine learning 45(1), 5–32.
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984), Classification and
Regression Trees, Wadsworth & Brooks.
Candès, E. (2008), ‘The restricted isometry property and its implications for com-
pressed sensing’, Comptes Rendus Mathematique 346(9), 589–592.
Candes, E. J. (2006), Compressive sampling, in ‘Proc. of the Int. Congress of Math.,
Madrid, Spain’.
Candes, E. & Tao, T. (2005), ‘Decoding by linear programming’, IEEE Trans. on
Information Theory 51, 4203–4215.
Cesa-Bianchi, N. & Lugosi, G. (2006), Prediction, learning, and games, Cambridge
University Press.
Chang, H. S., Weiss, Y. & Freeman, W. T. (2009), ‘Informative sensing’, arXiv preprint
arXiv:0901.4275 .
Chapelle, O., Le, Q. & Smola, A. (2007), Large margin optimization of ranking mea-
sures, in ‘NIPS Workshop: Machine Learning for Web Search’.
Collins, M. (2000), Discriminative reranking for natural language parsing, in ‘Machine
Learning’.
Collins, M. (2002), Discriminative training methods for hidden Markov models: Theory
and experiments with perceptron algorithms, in ‘Conference on Empirical Methods
in Natural Language Processing’.
Collobert, R. & Weston, J. (2008), A unified architecture for natural language process-
ing: deep neural networks with multitask learning, in ‘International Conference on
Machine Learning (ICML)’.
Cortes, C. & Vapnik, V. (1995), ‘Support-vector networks’, Machine Learning
20(3), 273–297.
Cover, T. (1965), ‘Behavior of sequential predictors of binary sequences’, Trans. 4th
Prague Conf. Information Theory Statistical Decision Functions, Random Processes
pp. 263–272.
Cover, T. & Hart, P. (1967), ‘Nearest neighbor pattern classification’, Information
Theory, IEEE Transactions on 13(1), 21–27.
Crammer, K. & Singer, Y. (2001), ‘On the algorithmic implementation of multiclass
kernel-based vector machines’, Journal of Machine Learning Research 2, 265–292.
Cristianini, N. & Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines,
Cambridge University Press.
Daniely, A., Sabato, S., Ben-David, S. & Shalev-Shwartz, S. (2011), Multiclass learn-
ability and the erm principle, in ‘Conference on Learning Theory (COLT)’.
Daniely, A., Sabato, S. & Shwartz, S. S. (2012), Multiclass learning approaches: A
theoretical comparison with implications, in ‘NIPS’.
Davis, G., Mallat, S. & Avellaneda, M. (1997), ‘Greedy adaptive approximation’, Jour-
nal of Constructive Approximation 13, 57–98.
Devroye, L. & Györfi, L. (1985), Nonparametric Density Estimation: The L B1 S View,
Wiley.
Devroye, L., Györfi, L. & Lugosi, G. (1996), A Probabilistic Theory of Pattern Recog-
nition, Springer.
440 References
Dietterich, T. G. & Bakiri, G. (1995), ‘Solving multiclass learning problems via error-
correcting output codes’, Journal of Artificial Intelligence Research 2, 263–286.
Donoho, D. L. (2006), ‘Compressed sensing’, Information Theory, IEEE Transactions
on 52(4), 1289–1306.
Dudley, R., Gine, E. & Zinn, J. (1991), ‘Uniform and universal glivenko-cantelli classes’,
Journal of Theoretical Probability 4(3), 485–510.
Dudley, R. M. (1987), ‘Universal Donsker classes and metric entropy’, Annals of Prob-
ability 15(4), 1306–1326.
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’, Philo-
sophical Transactions of the Royal Society of London. Series A, Containing Papers
of a Mathematical or Physical Character 222, 309–368.
Floyd, S. (1989), Space-bounded learning and the Vapnik-Chervonenkis dimension, in
‘Conference on Learning Theory (COLT)’, pp. 349–364.
Floyd, S. & Warmuth, M. (1995), ‘Sample compression, learnability, and the Vapnik-
Chervonenkis dimension’, Machine Learning 21(3), 269–304.
Frank, M. & Wolfe, P. (1956), ‘An algorithm for quadratic programming’, Naval Res.
Logist. Quart. 3, 95–110.
Freund, Y. & Schapire, R. (1995), A decision-theoretic generalization of on-line learning
and an application to boosting, in ‘European Conference on Computational Learning
Theory (EuroCOLT)’, Springer-Verlag, pp. 23–37.
Freund, Y. & Schapire, R. E. (1999), ‘Large margin classification using the perceptron
algorithm’, Machine Learning 37(3), 277–296.
Garcia, J. & Koelling, R. (1996), ‘Relation of cue to consequence in avoidance learning’,
Foundations of animal behavior: classic papers with commentaries 4, 374.
Gentile, C. (2003), ‘The robustness of the p-norm algorithms’, Machine Learning
53(3), 265–299.
Georghiades, A., Belhumeur, P. & Kriegman, D. (2001), ‘From few to many: Illumina-
tion cone models for face recognition under variable lighting and pose’, IEEE Trans.
Pattern Anal. Mach. Intelligence 23(6), 643–660.
Gordon, G. (1999), Regret bounds for prediction problems, in ‘Conference on Learning
Theory (COLT)’.
Gottlieb, L.-A., Kontorovich, L. & Krauthgamer, R. (2010), Efficient classification for
metric data, in ‘23rd Conference on Learning Theory’, pp. 433–440.
Guyon, I. & Elissee↵, A. (2003), ‘An introduction to variable and feature selection’,
Journal of Machine Learning Research, Special Issue on Variable and Feature Selec-
tion 3, 1157–1182.
Hadamard, J. (1902), ‘Sur les problèmes aux dérivées partielles et leur signification
physique’, Princeton University Bulletin 13, 49–52.
Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning,
Springer.
Haussler, D. (1992), ‘Decision theoretic generalizations of the PAC model for neural
net and other learning applications’, Information and Computation 100(1), 78–150.
Haussler, D. & Long, P. M. (1995), ‘A generalization of sauer’s lemma’, Journal of
Combinatorial Theory, Series A 71(2), 219–240.
Hazan, E., Agarwal, A. & Kale, S. (2007), ‘Logarithmic regret algorithms for online
convex optimization’, Machine Learning 69(2–3), 169–192.
References 441
Hinton, G. E., Osindero, S. & Teh, Y.-W. (2006), ‘A fast learning algorithm for deep
belief nets’, Neural Computation 18(7), 1527–1554.
Hiriart-Urruty, J.-B. & Lemaréchal, C. (1996), Convex Analysis and Minimization Al-
gorithms: Part 1: Fundamentals, Vol. 1, Springer.
Hsu, C.-W., Chang, C.-C. & Lin, C.-J. (2003), ‘A practical guide to support vector
classification’.
Hyafil, L. & Rivest, R. L. (1976), ‘Constructing optimal binary decision trees is NP-
complete’, Information Processing Letters 5(1), 15–17.
Joachims, T. (2005), A support vector method for multivariate performance measures,
in ‘Proceedings of the International Conference on Machine Learning (ICML)’.
Kakade, S., Sridharan, K. & Tewari, A. (2008), On the complexity of linear prediction:
Risk bounds, margin bounds, and regularization, in ‘NIPS’.
Karp, R. M. (1972), Reducibility among combinatorial problems, Springer.
Kearns, M. J., Schapire, R. E. & Sellie, L. M. (1994), ‘Toward efficient agnostic learn-
ing’, Machine Learning 17, 115–141.
Kearns, M. & Mansour, Y. (1996), On the boosting ability of top-down decision tree
learning algorithms, in ‘ACM Symposium on the Theory of Computing (STOC)’.
Kearns, M. & Ron, D. (1999), ‘Algorithmic stability and sanity-check bounds for leave-
one-out cross-validation’, Neural Computation 11(6), 1427–1453.
Kearns, M. & Valiant, L. G. (1988), Learning Boolean formulae or finite automata is
as hard as factoring, Technical Report TR-14-88, Harvard University Aiken Compu-
tation Laboratory.
Kearns, M. & Vazirani, U. (1994), An Introduction to Computational Learning Theory,
MIT Press.
Kleinberg, J. (2003), ‘An impossibility theorem for clustering’, Advances in Neural
Information Processing Systems pp. 463–470.
Klivans, A. R. & Sherstov, A. A. (2006), Cryptographic hardness for learning intersec-
tions of halfspaces, in ‘FOCS’.
Koller, D. & Friedman, N. (2009), Probabilistic Graphical Models: Principles and Tech-
niques, MIT Press.
Koltchinskii, V. & Panchenko, D. (2000), Rademacher processes and bounding the risk
of function learning, in ‘High Dimensional Probability II’, Springer, pp. 443–457.
Kuhn, H. W. (1955), ‘The hungarian method for the assignment problem’, Naval re-
search logistics quarterly 2(1-2), 83–97.
Kutin, S. & Niyogi, P. (2002), Almost-everywhere algorithmic stability and general-
ization error, in ‘Proceedings of the 18th Conference in Uncertainty in Artificial
Intelligence’, pp. 275–282.
La↵erty, J., McCallum, A. & Pereira, F. (2001), Conditional random fields: Probabilistic
models for segmenting and labeling sequence data, in ‘International Conference on
Machine Learning’, pp. 282–289.
Langford, J. (2006), ‘Tutorial on practical prediction theory for classification’, Journal
of machine learning research 6(1), 273.
Langford, J. & Shawe-Taylor, J. (2003), PAC-Bayes & margins, in ‘NIPS’, pp. 423–430.
Le Cun, L. (2004), Large scale online learning., in ‘Advances in Neural Information
Processing Systems 16: Proceedings of the 2003 Conference’, Vol. 16, MIT Press,
p. 217.
442 References
Le, Q. V., Ranzato, M.-A., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J. &
Ng, A. Y. (2012), Building high-level features using large scale unsupervised learning,
in ‘International Conference on Machine Learning (ICML)’.
Lecun, Y. & Bengio, Y. (1995), Convolutional Networks for Images, Speech and Time
Series, The MIT Press, pp. 255–258.
Lee, H., Grosse, R., Ranganath, R. & Ng, A. (2009), Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations, in ‘International
Conference on Machine Learning (ICML)’.
Littlestone, N. (1988), ‘Learning quickly when irrelevant attributes abound: A new
linear-threshold algorithm’, Machine Learning 2, 285–318.
Littlestone, N. & Warmuth, M. (1986), Relating data compression and learnability.
Unpublished manuscript.
Littlestone, N. & Warmuth, M. K. (1994), ‘The weighted majority algorithm’, Infor-
mation and Computation 108, 212–261.
Livni, R., Shalev-Shwartz, S. & Shamir, O. (2013), ‘A provably efficient algorithm for
training deep networks’, arXiv preprint arXiv:1304.7045 .
Livni, R. & Simon, P. (2013), Honest compressions and their application to compression
schemes, in ‘Conference on Learning Theory (COLT)’.
MacKay, D. J. (2003), Information theory, inference and learning algorithms,
Cambridge university press.
Mallat, S. & Zhang, Z. (1993), ‘Matching pursuits with time-frequency dictionaries’,
IEEE Transactions on Signal Processing 41, 3397–3415.
McAllester, D. A. (1998), Some PAC-Bayesian theorems, in ‘Conference on Learning
Theory (COLT)’.
McAllester, D. A. (1999), PAC-Bayesian model averaging, in ‘Conference on Learning
Theory (COLT)’, pp. 164–170.
McAllester, D. A. (2003), Simplified PAC-Bayesian margin bounds., in ‘Conference on
Learning Theory (COLT)’, pp. 203–215.
Minsky, M. & Papert, S. (1969), Perceptrons: An Introduction to Computational Ge-
ometry, The MIT Press.
Mukherjee, S., Niyogi, P., Poggio, T. & Rifkin, R. (2006), ‘Learning theory: stability is
sufficient for generalization and necessary and sufficient for consistency of empirical
risk minimization’, Advances in Computational Mathematics 25(1-3), 161–193.
Murata, N. (1998), ‘A statistical study of on-line learning’, Online Learning and Neural
Networks. Cambridge University Press, Cambridge, UK .
Murphy, K. P. (2012), Machine learning: a probabilistic perspective, The MIT Press.
Natarajan, B. (1995), ‘Sparse approximate solutions to linear systems’, SIAM J. Com-
puting 25(2), 227–234.
Natarajan, B. K. (1989), ‘On learning sets and functions’, Mach. Learn. 4, 67–97.
Nemirovski, A., Juditsky, A., Lan, G. & Shapiro, A. (2009), ‘Robust stochastic ap-
proximation approach to stochastic programming’, SIAM Journal on Optimization
19(4), 1574–1609.
Nemirovski, A. & Yudin, D. (1978), Problem complexity and method efficiency in opti-
mization, Nauka Publishers, Moscow.
Nesterov, Y. (2005), Primal-dual subgradient methods for convex problems, Technical
report, Center for Operations Research and Econometrics (CORE), Catholic Univer-
sity of Louvain (UCL).
References 443
Shelah, S. (1972), ‘A combinatorial problem; stability and order for models and theories
in infinitary languages’, Pac. J. Math 4, 247–261.
Sipser, M. (2006), Introduction to the Theory of Computation, Thomson Course Tech-
nology.
Slud, E. V. (1977), ‘Distribution inequalities for the binomial law’, The Annals of
Probability 5(3), 404–412.
Steinwart, I. & Christmann, A. (2008), Support vector machines, Springerverlag New
York.
Stone, C. (1977), ‘Consistent nonparametric regression’, The annals of statistics
5(4), 595–620.
Taskar, B., Guestrin, C. & Koller, D. (2003), Max-margin markov networks, in ‘NIPS’.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, J. Royal.
Statist. Soc B. 58(1), 267–288.
Tikhonov, A. N. (1943), ‘On the stability of inverse problems’, Dolk. Akad. Nauk SSSR
39(5), 195–198.
Tishby, N., Pereira, F. & Bialek, W. (1999), The information bottleneck method, in
‘The 37’th Allerton Conference on Communication, Control, and Computing’.
Tsochantaridis, I., Hofmann, T., Joachims, T. & Altun, Y. (2004), Support vector
machine learning for interdependent and structured output spaces, in ‘Proceedings
of the Twenty-First International Conference on Machine Learning’.
Valiant, L. G. (1984), ‘A theory of the learnable’, Communications of the ACM
27(11), 1134–1142.
Vapnik, V. (1992), Principles of risk minimization for learning theory, in J. E. Moody,
S. J. Hanson & R. P. Lippmann, eds, ‘Advances in Neural Information Processing
Systems 4’, Morgan Kaufmann, pp. 831–838.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer.
Vapnik, V. N. (1982), Estimation of Dependences Based on Empirical Data, Springer-
Verlag.
Vapnik, V. N. (1998), Statistical Learning Theory, Wiley.
Vapnik, V. N. & Chervonenkis, A. Y. (1971), ‘On the uniform convergence of relative
frequencies of events to their probabilities’, Theory of Probability and its applications
XVI(2), 264–280.
Vapnik, V. N. & Chervonenkis, A. Y. (1974), Theory of pattern recognition, Nauka,
Moscow. (In Russian).
Von Luxburg, U. (2007), ‘A tutorial on spectral clustering’, Statistics and computing
17(4), 395–416.
von Neumann, J. (1928), ‘Zur theorie der gesellschaftsspiele (on the theory of parlor
games)’, Math. Ann. 100, 295—320.
Von Neumann, J. (1953), ‘A certain zero-sum two-person game equivalent to the opti-
mal assignment problem’, Contributions to the Theory of Games 2, 5–12.
Vovk, V. G. (1990), Aggregating strategies, in ‘Conference on Learning Theory
(COLT)’, pp. 371–383.
Warmuth, M., Glocer, K. & Vishwanathan, S. (2008), Entropy regularized lpboost, in
‘Algorithmic Learning Theory (ALT)’.
Warmuth, M., Liao, J. & Ratsch, G. (2006), Totally corrective boosting algorithms
that maximize the margin, in ‘Proceedings of the 23rd international conference on
Machine learning’.
446 References
Weston, J., Chapelle, O., Vapnik, V., Elissee↵, A. & Schölkopf, B. (2002), Kernel depen-
dency estimation, in ‘Advances in neural information processing systems’, pp. 873–
880.
Weston, J. & Watkins, C. (1999), Support vector machines for multi-class pattern
recognition, in ‘Proceedings of the Seventh European Symposium on Artificial Neural
Networks’.
Wolpert, D. H. & Macready, W. G. (1997), ‘No free lunch theorems for optimization’,
Evolutionary Computation, IEEE Transactions on 1(1), 67–82.
Zhang, T. (2004), Solving large scale linear prediction problems using stochastic gradi-
ent descent algorithms, in ‘Proceedings of the Twenty-First International Conference
on Machine Learning’.
Zhao, P. & Yu, B. (2006), ‘On model selection consistency of Lasso’, Journal of Machine
Learning Research 7, 2541–2567.
Zinkevich, M. (2003), Online convex programming and generalized infinitesimal gradi-
ent ascent, in ‘International Conference on Machine Learning’.
Index