
Mehryar Mohri

Foundations of Machine Learning 2020


Courant Institute of Mathematical Sciences
Homework assignment 3
Nov 14, 2020
Due: Nov 24, 2020 [before class starts].

A. Kernel methods

1. Graph kernel. Let G = (V, E) be an undirected graph with vertex set V and edge set E. V could represent a set of documents or biosequences and E the set of connections between them. Let w[e] ∈ R denote the weight assigned to edge e ∈ E. The weight of a path is the product of the weights of its constituent edges. Show that the kernel K over V × V, where K(p, q) is the sum of the weights of all paths of length two between p and q, is PDS. (Hint: you could introduce the matrix W = (W_{pq}), where W_{pq} = 0 when there is no edge between p and q, and W_{pq} is equal to the weight of the edge between p and q otherwise.)

Solution:
For any i, j ∈ V, let E[i, j] denote the set of edges between i and j, which is either reduced to a single edge or empty. For convenience, let V = {1, . . . , n} and define the matrix W = (W_{ij}) ∈ R^{n×n} by W_{ij} = 0 if E[i, j] = ∅ and W_{ij} = w[e] if e ∈ E[i, j]. Since any path of length two from p to q goes through some intermediate vertex r, we can write

    K(p, q) = \sum_{r \in V} \sum_{e \in E[p,r],\, e' \in E[r,q]} w[e]\, w[e'] = \sum_{r \in V} W_{pr} W_{rq} = (W^2)_{pq}.

Let K = (K_{pq}) denote the kernel matrix, so that K = W^2. Since W is symmetric, for any vector X ∈ R^n,

    X^\top K X = X^\top W^2 X = X^\top W^\top W X = \|W X\|^2 \geq 0.

Thus, the eigenvalues of K are non-negative. Any kernel matrix over a finite subset of V is a principal submatrix of K = W^2 and is therefore also positive semidefinite; thus K is PDS.
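As a quick numerical sanity check (not part of the required proof), one can form W for a small weighted graph, square it, and verify that the resulting kernel matrix has non-negative eigenvalues. The graph and the variable names below are illustrative only.

```python
import numpy as np

# Illustrative weighted undirected graph on 4 vertices: (i, j, weight); weights may be negative.
edges = [(0, 1, 0.5), (1, 2, 2.0), (2, 3, -1.0), (0, 3, 0.3)]

n = 4
W = np.zeros((n, n))
for i, j, w in edges:
    W[i, j] = W[j, i] = w          # the graph is undirected, so W is symmetric

# Kernel matrix: K[p, q] = sum of the weights of all length-2 paths from p to q.
K = W @ W

# Since W is symmetric, K = W^T W is positive semidefinite.
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues)                  # all >= 0 up to numerical precision
assert np.all(eigenvalues >= -1e-12)
```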
2. Pixel kernel. This problem consists of showing that a kernel useful in
applications is PDS.
(a) Show that S : (z, z') ↦ \int_0^{+\infty} 1_{t \in [0,z]}\, 1_{t \in [0,z']}\, dt is a PDS kernel defined over R × R.
Solution: (f, g) ↦ \int_0^{+\infty} f(t) g(t)\, dt is a positive semi-definite inner product over L^2(R). Let ⟨·, ·⟩ denote that inner product. Then, for any z, z' ∈ R, S(z, z') = ⟨t ↦ 1_{t \in [0,z]}, t ↦ 1_{t \in [0,z']}⟩ = ⟨Ψ(z), Ψ(z')⟩, where Ψ is the feature mapping Ψ : z ↦ (t ↦ 1_{t \in [0,z]}). Thus, S can be expressed as an inner product and is therefore PDS.
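As an aside, for non-negative arguments the integral evaluates to S(z, z') = min(z, z'), which suggests a simple numerical sanity check (purely illustrative, not part of the proof): build the Gram matrix of S on a few points and confirm it is positive semidefinite.

```python
import numpy as np

def S(z, zp):
    # S(z, z') = ∫_0^∞ 1_{t in [0, z]} 1_{t in [0, z']} dt; for z, z' >= 0 this equals min(z, z'),
    # and it is 0 as soon as one of the intervals [0, z], [0, z'] is empty (negative argument).
    return max(min(z, zp), 0.0)

points = [0.0, 0.7, 1.3, 2.5, 4.0]
gram = np.array([[S(z, zp) for zp in points] for z in points])

# A PDS kernel has a positive semidefinite Gram matrix on every finite sample.
print(np.linalg.eigvalsh(gram))     # all eigenvalues >= 0
```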
(b) Use the previous question to prove that the kernel K_μ : (x, x') ↦ \prod_{k=1}^{N} e^{\min(|x_k|^μ, |x'_k|^μ)}, where x_k is the kth coordinate of x, is a PDS kernel defined over R^N × R^N, for any μ ≥ 0.

Solution: Observe that, for any μ ≥ 0, \min(|u|^μ, |u'|^μ) = S(|u|^μ, |u'|^μ), since |u|^μ and |u'|^μ are non-negative. Thus, for any μ ≥ 0 and k ∈ [N], (x, x') ↦ \min(|x_k|^μ, |x'_k|^μ) is a PDS kernel. The sum of these PDS kernels (over all k ∈ [N]) is therefore also PDS, and its composition with the power series exp, whose coefficients are non-negative and whose radius of convergence is infinite, is also PDS. Since the exponential of the sum coincides with the product \prod_{k=1}^{N} e^{\min(|x_k|^μ, |x'_k|^μ)}, the kernel K_μ is PDS.
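A similar numerical sanity check for K_μ (illustrative only): build its Gram matrix on a handful of random points in R^N and confirm the eigenvalues are non-negative.

```python
import numpy as np

def pixel_kernel(x, xp, mu=1.0):
    # K_mu(x, x') = prod_k exp(min(|x_k|^mu, |x'_k|^mu))
    return float(np.exp(np.sum(np.minimum(np.abs(x) ** mu, np.abs(xp) ** mu))))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                     # 6 points in R^3
gram = np.array([[pixel_kernel(x, xp, mu=0.5) for xp in X] for x in X])

print(np.linalg.eigvalsh(gram))                 # all eigenvalues >= 0 up to precision
```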

B. Boosting

1. Logistic loss boosting.

(a) Show that Φ : u ↦ \log_2(1 + e^{-u}) defines a convex and decreasing function upper-bounding u ↦ 1_{u \leq 0}.

Solution: This is straightforward. For any u ∈ R,

    Φ'(u) = -\frac{1}{\log(2)} \cdot \frac{e^{-u}}{1 + e^{-u}} = -\frac{1}{\log(2)} \cdot \frac{1}{1 + e^{u}} < 0.

Thus, Φ is decreasing. For any u ∈ R,

    Φ''(u) = \frac{1}{\log(2)} \cdot \frac{e^{u}}{(1 + e^{u})^2} > 0.

Thus, Φ is convex. For u ≤ 0, we have e^{-u} ≥ 1, thus \log_2(1 + e^{-u}) ≥ \log_2(1 + 1) = 1 = 1_{u \leq 0}. For u > 0, since e^{-u} > 0 and \log_2 is increasing, \log_2(1 + e^{-u}) ≥ \log_2(1) = 0 = 1_{u \leq 0}.
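A quick numerical illustration of these properties (not required by the exercise): evaluate Φ on a grid and check the upper bound, monotonicity, and convexity.

```python
import numpy as np

u = np.linspace(-5.0, 5.0, 1001)
phi = np.log2(1.0 + np.exp(-u))
indicator = (u <= 0).astype(float)

assert np.all(phi >= indicator)         # Phi(u) >= 1_{u <= 0}
assert np.all(np.diff(phi) < 0)         # Phi is decreasing
assert np.all(np.diff(phi, n=2) > 0)    # positive second differences: Phi is convex
```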
(b) Let H = {h1 , . . . , hN } be a finite hypothesis set serving as a
family of base predictors taking values in {−1, +1}. Define the
objective F (α) for a boosting-type algorithm using Φ, instead of
the exponential function used in AdaBoost, for a given sample of
size m and show that it is a convex function of α.

Solution: The objective function can be written, for any α ∈ R^N with α ≥ 0, as follows:

    F(α) = \frac{1}{m} \sum_{i=1}^{m} Φ\left( y_i \sum_{j=1}^{N} α_j h_j(x_i) \right).

This is a convex function of α, since Φ is convex and composition with an affine function of α preserves convexity.
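For concreteness, a minimal sketch of this objective in code (assuming the base-predictor outputs are collected in a matrix; the names H_values and logistic_boost_objective are ours, not from the course):

```python
import numpy as np

def logistic_boost_objective(alpha, H_values, y):
    """F(alpha) = (1/m) * sum_i log2(1 + exp(-y_i * sum_j alpha_j * h_j(x_i))).

    H_values: (m, N) matrix with H_values[i, j] = h_j(x_i) in {-1, +1}.
    y:        (m,) array of labels in {-1, +1}.
    """
    margins = y * (H_values @ alpha)            # y_i * sum_j alpha_j h_j(x_i)
    return float(np.mean(np.log2(1.0 + np.exp(-margins))))
```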
(c) Determine the best direction at iteration t if you apply coordinate descent to F. You should adopt a notation similar to the one used in class and define D_t, Z_t, and ε_{t,k} for t ∈ [T] and k ∈ [N]. What should be the boosting condition on ε_{t,k} for the best direction k?

Solution: For any η > 0 and direction k ∈ [N], we can write:

    F(α_{t-1} + η e_k) = \frac{1}{m} \sum_{i=1}^{m} Φ\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) + η y_i h_k(x_i) \right).

Thus, the directional derivative of F along e_k is given by:

    F'(α_{t-1}, e_k) = \frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right)
                     = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{\log(2)} \cdot \frac{y_i h_k(x_i)}{1 + e^{y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i)}}
                     = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i) D_t(i) Z_t
                     = -\frac{Z_t}{m} \left[ (1 - ε_{t,k}) - ε_{t,k} \right]
                     = \frac{Z_t}{m} (2 ε_{t,k} - 1),

where D_t(i) Z_t = \frac{1}{\log(2)} \cdot \frac{1}{1 + e^{y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i)}}, Z_t is the normalization factor making D_t a distribution over the sample, and ε_{t,k} = \sum_{i : y_i h_k(x_i) < 0} D_t(i) is the D_t-weighted error of h_k. In view of that, the best descent direction is the one maximizing |F'(α_{t-1}, e_k)|, that is, the one with the minimal weighted error ε_{t,k}. For coordinate descent to make progress along that direction, we need ε_{t,k} < 1/2 for that best direction.
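A minimal sketch of this direction choice in code (illustrative; it reuses the H_values convention introduced above and our own helper name best_direction):

```python
import numpy as np

def best_direction(alpha_prev, H_values, y):
    """Return the coordinate k with minimal weighted error eps_{t,k}, plus D_t and Z_t."""
    margins = y * (H_values @ alpha_prev)                     # y_i * sum_j alpha_{t-1,j} h_j(x_i)
    weights = 1.0 / (np.log(2.0) * (1.0 + np.exp(margins)))   # D_t(i) * Z_t
    Z_t = weights.sum()
    D_t = weights / Z_t                                       # distribution over the sample
    # eps_{t,k}: D_t-weighted error of base predictor h_k.
    errors = np.array([D_t[(y * H_values[:, k]) < 0].sum()
                       for k in range(H_values.shape[1])])
    k_best = int(np.argmin(errors))
    return k_best, errors[k_best], D_t, Z_t
```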
(d) Show that for any (u, v) ∈ R^2, the following inequality holds:

    Φ(u + v) - Φ(u) \leq -Φ'(u)\,(e^{-v} - 1).
Solution:

    Φ(u + v) - Φ(u) = \log_2\left[ \frac{1 + e^{-(u+v)}}{1 + e^{-u}} \right]
                    = \log_2\left[ \frac{1 + e^{-u} + e^{-(u+v)} - e^{-u}}{1 + e^{-u}} \right]
                    = \log_2\left[ 1 + \frac{e^{-v} - 1}{1 + e^{u}} \right]
                    \leq \frac{1}{\log(2)} \cdot \frac{e^{-v} - 1}{1 + e^{u}}   \qquad (\log(1 + z) \leq z, \text{ valid here since } z = \tfrac{e^{-v} - 1}{1 + e^{u}} > -1)
                    = -Φ'(u)\,(e^{-v} - 1).
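A brief numerical check of this inequality on random pairs (u, v) (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=(2, 10000)) * 3.0

phi = lambda t: np.log2(1.0 + np.exp(-t))
dphi = lambda t: -1.0 / (np.log(2.0) * (1.0 + np.exp(t)))    # Phi'(t)

lhs = phi(u + v) - phi(u)
rhs = -dphi(u) * (np.exp(-v) - 1.0)
assert np.all(lhs <= rhs + 1e-12)    # Phi(u+v) - Phi(u) <= -Phi'(u)(e^{-v} - 1)
```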

(e) Use that to show the following:

    F(α_{t-1} + η e_k) - F(α_{t-1}) \leq \frac{1}{m} \sum_{i=1}^{m} D_t(i) Z_t \left[ e^{-η y_i h_k(x_i)} - 1 \right].

Solution: Observe that D_t(i) Z_t = -Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right). Then, applying the result of the previous question with u = y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) and v = η y_i h_k(x_i) for each term,

    F(α_{t-1} + η e_k) - F(α_{t-1}) \leq \frac{1}{m} \sum_{i=1}^{m} -Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right) \left[ e^{-η y_i h_k(x_i)} - 1 \right]
                                       = \frac{1}{m} \sum_{i=1}^{m} D_t(i) Z_t \left[ e^{-η y_i h_k(x_i)} - 1 \right].

(f) To determine the best step η for a given direction k, we can


minimize the upper bound of the previous question. Show that
this is syntactically the same minimization as the one for finding
the best step in AdaBoost. Give the expression of the step at
iteration t.

Solution: Minimizing that upper bound is equivalent to minimizing \frac{1}{m} \sum_{i=1}^{m} D_t(i)\, e^{-η y_i h_k(x_i)}, which is syntactically the same quantity minimized by AdaBoost to determine the step. The expression of D_t is different here, but we can immediately use the result known for AdaBoost to determine the step at iteration t: η = \frac{1}{2} \log\left[ \frac{1 - ε_{t,k}}{ε_{t,k}} \right].

(g) Give the full pseudocode of the algorithm.

Solution: This is straightforward based on the previous results: at each round t, compute D_t, pick the base predictor h_{k_t} with the smallest weighted error ε_t = ε_{t,k_t}, update α_{k_t} by the step η_t = \frac{1}{2} \log\frac{1 - ε_t}{ε_t}, and repeat for T rounds; an illustrative sketch is given below.
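A minimal, illustrative implementation sketch of the resulting algorithm (it assumes, as in the sketches above, a matrix H_values with H_values[i, j] = h_j(x_i) ∈ {-1, +1}; the function name and structure are ours, not the official pseudocode):

```python
import numpy as np

def logistic_loss_boost(H_values, y, T=100):
    """Coordinate descent on F(alpha) = (1/m) sum_i log2(1 + exp(-y_i (H alpha)_i)).

    H_values: (m, N) matrix of base-predictor outputs in {-1, +1}.
    y:        (m,) labels in {-1, +1}.
    Returns the weight vector alpha after at most T rounds.
    """
    m, N = H_values.shape
    alpha = np.zeros(N)
    for t in range(T):
        margins = y * (H_values @ alpha)
        # D_t(i) Z_t = (1 / log 2) * 1 / (1 + exp(margin_i)); normalize to obtain D_t.
        weights = 1.0 / (np.log(2.0) * (1.0 + np.exp(margins)))
        D = weights / weights.sum()
        # Weighted error eps_{t,k} of every base predictor; pick the best direction.
        errors = np.array([D[(y * H_values[:, k]) < 0].sum() for k in range(N)])
        k = int(np.argmin(errors))
        eps = float(np.clip(errors[k], 1e-12, 1.0))
        if eps >= 0.5:                       # boosting condition violated: stop
            break
        # Step size, identical in form to AdaBoost's.
        alpha[k] += 0.5 * np.log((1.0 - eps) / eps)
    return alpha

# Prediction on new points: sign of the weighted vote, np.sign(H_values_new @ alpha).
```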


(h) Give a margin-based generalization bound for this algorithm.

Solution: Straightforward using the ensemble Rademacher complexity bound presented in class: fix ρ > 0; then, for any δ > 0, with probability at least 1 - δ,

    R\left( \frac{\sum_{t=1}^{T} α_t h_t}{\|α\|_1} \right) \leq \hat{R}_{S,ρ}\left( \frac{\sum_{t=1}^{T} α_t h_t}{\|α\|_1} \right) + \frac{2}{ρ} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{δ}}{2m}}.

(i) Compare empirically AdaBoost and this logistic-loss boosting algorithm on the same dataset as the one used in the previous homework, choosing T via cross-validation among multiples of 100. A possible experimental setup is sketched below.
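A possible experimental setup (illustrative only): it assumes the logistic_loss_boost function sketched above, a feature matrix X with labels y in {-1, +1} from the previous homework's dataset, and hypothetical helpers build_stumps / stump_outputs that produce the {-1, +1} base-predictor matrix; scikit-learn's AdaBoostClassifier is used for the AdaBoost side.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# X, y: loaded from the previous homework's dataset (not shown here).
T_grid = [100, 200, 300, 400, 500]          # candidate numbers of boosting rounds

# AdaBoost: select T by 5-fold cross-validation (default base learner: decision stumps).
ada_cv = {T: cross_val_score(AdaBoostClassifier(n_estimators=T), X, y, cv=5).mean()
          for T in T_grid}
best_T_ada = max(ada_cv, key=ada_cv.get)

# Logistic-loss boosting: same protocol, using the implementation sketched above.
def logistic_cv_accuracy(T, n_splits=5):
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        stumps = build_stumps(X[train_idx], y[train_idx])          # hypothetical helper
        H_train = stump_outputs(X[train_idx], stumps)              # hypothetical helper
        alpha = logistic_loss_boost(H_train, y[train_idx], T=T)
        preds = np.sign(stump_outputs(X[val_idx], stumps) @ alpha)
        scores.append(np.mean(preds == y[val_idx]))
    return float(np.mean(scores))

log_cv = {T: logistic_cv_accuracy(T) for T in T_grid}
best_T_log = max(log_cv, key=log_cv.get)
print(ada_cv, best_T_ada, log_cv, best_T_log)
```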
