
Mehryar Mohri

Foundations of Machine Learning 2020


Courant Institute of Mathematical Sciences
Homework assignment 3
Nov 14, 2020
Due: Nov 24, 2020 [before class starts].

A. Kernel methods

1. Graph kernel. Let G = (V, E) be an undirected graph with vertex set V and edge set E. V could represent a set of documents or biosequences and E the set of connections between them. Let w[e] ∈ R denote the weight assigned to edge e ∈ E. The weight of a path is the product of the weights of its constituent edges. Show that the kernel K over V × V, where K(p, q) is the sum of the weights of all paths of length two between p and q, is PDS. (Hint: you could introduce the matrix W = (W_{pq}), where W_{pq} = 0 when there is no edge between p and q, and W_{pq} is equal to the weight of the edge between p and q otherwise.)

Solution:
For any i, j ∈ V, let E[i, j] denote the set of edges between i and j, which is either reduced to a single edge or empty. For convenience, let V = {1, . . . , n} and define the matrix W = (W_{ij}) ∈ R^{n×n} by W_{ij} = 0 if E[i, j] = ∅ and W_{ij} = w[e] if e ∈ E[i, j]. Since any path of length two from p to q goes through some intermediate vertex r, we can write

    K(p, q) = \sum_{r \in V} \sum_{e \in E[p,r],\, e' \in E[r,q]} w[e]\, w[e'] = \sum_{r \in V} W_{pr} W_{rq} = (W^2)_{pq}.

Let K = (K_{pq}) denote the kernel matrix, so that K = W^2. Since W is symmetric, for any vector X ∈ R^n,

    X^\top K X = X^\top W^2 X = X^\top W^\top W X = \|W X\|^2 \geq 0.

Thus, the eigenvalues of K are non-negative. Any kernel matrix over a finite subset of V is a principal submatrix of K = W^2 and is therefore also positive semidefinite; thus K is PDS.
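As a quick numerical sanity check (not part of the required proof), one can form W for a small weighted graph, square it, and verify that the resulting kernel matrix has non-negative eigenvalues. The graph and the variable names below are illustrative only.

```python
import numpy as np

# Illustrative weighted undirected graph on 4 vertices: (i, j, weight); weights may be negative.
edges = [(0, 1, 0.5), (1, 2, 2.0), (2, 3, -1.0), (0, 3, 0.3)]

n = 4
W = np.zeros((n, n))
for i, j, w in edges:
    W[i, j] = W[j, i] = w          # the graph is undirected, so W is symmetric

# Kernel matrix: K[p, q] = sum of the weights of all length-2 paths from p to q.
K = W @ W

# Since W is symmetric, K = W^T W is positive semidefinite.
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues)                  # all >= 0 up to numerical precision
assert np.all(eigenvalues >= -1e-12)
```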
2. Pixel kernel. This problem consists of showing that a kernel useful in
applications is PDS.
(a) Show that S : (z, z') ↦ \int_0^{+\infty} 1_{t \in [0,z]}\, 1_{t \in [0,z']}\, dt is a PDS kernel defined over R × R.
Solution: (f, g) ↦ \int_0^{+\infty} f(t) g(t)\, dt is a positive semi-definite inner product over L^2(R). Let ⟨·, ·⟩ denote that inner product. Then, for any z, z' ∈ R, S(z, z') = ⟨t ↦ 1_{t \in [0,z]}, t ↦ 1_{t \in [0,z']}⟩ = ⟨Ψ(z), Ψ(z')⟩, where Ψ is the feature mapping Ψ : z ↦ (t ↦ 1_{t \in [0,z]}). Thus, S can be expressed as an inner product and is therefore PDS.
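As an aside, for non-negative arguments the integral evaluates to S(z, z') = min(z, z'), which suggests a simple numerical sanity check (purely illustrative, not part of the proof): build the Gram matrix of S on a few points and confirm it is positive semidefinite.

```python
import numpy as np

def S(z, zp):
    # S(z, z') = ∫_0^∞ 1_{t in [0, z]} 1_{t in [0, z']} dt; for z, z' >= 0 this equals min(z, z'),
    # and it is 0 as soon as one of the intervals [0, z], [0, z'] is empty (negative argument).
    return max(min(z, zp), 0.0)

points = [0.0, 0.7, 1.3, 2.5, 4.0]
gram = np.array([[S(z, zp) for zp in points] for z in points])

# A PDS kernel has a positive semidefinite Gram matrix on every finite sample.
print(np.linalg.eigvalsh(gram))     # all eigenvalues >= 0
```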
(b) Use the previous question to prove that the kernel K_μ : (x, x') ↦ \prod_{k=1}^{N} e^{\min(|x_k|^μ, |x'_k|^μ)}, where x_k is the kth coordinate of x, is a PDS kernel defined over R^N × R^N, for any μ ≥ 0.

Solution: Observe that, for any μ ≥ 0, \min(|u|^μ, |u'|^μ) = S(|u|^μ, |u'|^μ), since |u|^μ and |u'|^μ are non-negative. Thus, for any μ ≥ 0 and k ∈ [N], (x, x') ↦ \min(|x_k|^μ, |x'_k|^μ) is a PDS kernel. The sum of these PDS kernels (over all k ∈ [N]) is therefore also PDS, and its composition with the power series exp, whose coefficients are non-negative and whose radius of convergence is infinite, is also PDS. Since the exponential of the sum coincides with the product \prod_{k=1}^{N} e^{\min(|x_k|^μ, |x'_k|^μ)}, the kernel K_μ is PDS.
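A similar numerical sanity check for K_μ (illustrative only): build its Gram matrix on a handful of random points in R^N and confirm the eigenvalues are non-negative.

```python
import numpy as np

def pixel_kernel(x, xp, mu=1.0):
    # K_mu(x, x') = prod_k exp(min(|x_k|^mu, |x'_k|^mu))
    return float(np.exp(np.sum(np.minimum(np.abs(x) ** mu, np.abs(xp) ** mu))))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                     # 6 points in R^3
gram = np.array([[pixel_kernel(x, xp, mu=0.5) for xp in X] for x in X])

print(np.linalg.eigvalsh(gram))                 # all eigenvalues >= 0 up to precision
```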

B. Boosting

1. Logistic loss boosting.

(a) Show that Φ : u ↦ \log_2(1 + e^{-u}) defines a convex and decreasing function upper-bounding u ↦ 1_{u \leq 0}.

Solution: This is straightforward. For any u ∈ R,

    Φ'(u) = -\frac{1}{\log(2)} \cdot \frac{e^{-u}}{1 + e^{-u}} = -\frac{1}{\log(2)} \cdot \frac{1}{1 + e^{u}} < 0.

Thus, Φ is decreasing. For any u ∈ R,

    Φ''(u) = \frac{1}{\log(2)} \cdot \frac{e^{u}}{(1 + e^{u})^2} > 0.

Thus, Φ is convex. For u ≤ 0, we have e^{-u} ≥ 1, thus \log_2(1 + e^{-u}) ≥ \log_2(1 + 1) = 1 = 1_{u \leq 0}. For u > 0, since e^{-u} > 0 and \log_2 is increasing, \log_2(1 + e^{-u}) ≥ \log_2(1) = 0 = 1_{u \leq 0}.
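A quick numerical illustration of these properties (not required by the exercise): evaluate Φ on a grid and check the upper bound, monotonicity, and convexity.

```python
import numpy as np

u = np.linspace(-5.0, 5.0, 1001)
phi = np.log2(1.0 + np.exp(-u))
indicator = (u <= 0).astype(float)

assert np.all(phi >= indicator)         # Phi(u) >= 1_{u <= 0}
assert np.all(np.diff(phi) < 0)         # Phi is decreasing
assert np.all(np.diff(phi, n=2) > 0)    # positive second differences: Phi is convex
```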
(b) Let H = {h1 , . . . , hN } be a finite hypothesis set serving as a
family of base predictors taking values in {−1, +1}. Define the
objective F (α) for a boosting-type algorithm using Φ, instead of
the exponential function used in AdaBoost, for a given sample of
size m and show that it is a convex function of α.

Solution: The objective function can be written, for any α ∈ R^N with α ≥ 0, as follows:

    F(α) = \frac{1}{m} \sum_{i=1}^{m} Φ\left( y_i \sum_{j=1}^{N} α_j h_j(x_i) \right).

This is a convex function of α, since Φ is convex and composition with an affine function of α preserves convexity.
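For concreteness, a minimal sketch of this objective in code (assuming the base-predictor outputs are collected in a matrix; the names H_values and logistic_boost_objective are ours, not from the course):

```python
import numpy as np

def logistic_boost_objective(alpha, H_values, y):
    """F(alpha) = (1/m) * sum_i log2(1 + exp(-y_i * sum_j alpha_j * h_j(x_i))).

    H_values: (m, N) matrix with H_values[i, j] = h_j(x_i) in {-1, +1}.
    y:        (m,) array of labels in {-1, +1}.
    """
    margins = y * (H_values @ alpha)            # y_i * sum_j alpha_j h_j(x_i)
    return float(np.mean(np.log2(1.0 + np.exp(-margins))))
```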
(c) Determine the best direction at iteration t if you apply coordinate descent to F. You should adopt a notation similar to the one used in class and define D_t, Z_t, and ε_{t,k} for t ∈ [T] and k ∈ [N]. What should be the boosting condition on ε_{t,k} for the best direction k?

Solution: For any η > 0 and direction k ∈ [N], we can write:

    F(α_{t-1} + η e_k) = \frac{1}{m} \sum_{i=1}^{m} Φ\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) + η y_i h_k(x_i) \right).

Thus, the directional derivative of F along e_k is given by:

    F'(α_{t-1}, e_k) = \frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i)\, Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right)
                     = -\frac{1}{m} \sum_{i=1}^{m} \frac{1}{\log(2)} \cdot \frac{y_i h_k(x_i)}{1 + e^{y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i)}}
                     = -\frac{1}{m} \sum_{i=1}^{m} y_i h_k(x_i) D_t(i) Z_t
                     = -\frac{Z_t}{m} \left[ (1 - ε_{t,k}) - ε_{t,k} \right]
                     = \frac{Z_t}{m} (2 ε_{t,k} - 1),

where D_t(i) Z_t = \frac{1}{\log(2)} \cdot \frac{1}{1 + e^{y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i)}}, Z_t is the normalization factor making D_t a distribution over the sample, and ε_{t,k} = \sum_{i : y_i h_k(x_i) < 0} D_t(i) is the D_t-weighted error of h_k. In view of that, the best descent direction is the one maximizing |F'(α_{t-1}, e_k)|, that is, the one with the minimal weighted error ε_{t,k}. For coordinate descent to make progress along that direction, we need ε_{t,k} < 1/2 for that best direction.
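A minimal sketch of this direction choice in code (illustrative; it reuses the H_values convention introduced above and our own helper name best_direction):

```python
import numpy as np

def best_direction(alpha_prev, H_values, y):
    """Return the coordinate k with minimal weighted error eps_{t,k}, plus D_t and Z_t."""
    margins = y * (H_values @ alpha_prev)                     # y_i * sum_j alpha_{t-1,j} h_j(x_i)
    weights = 1.0 / (np.log(2.0) * (1.0 + np.exp(margins)))   # D_t(i) * Z_t
    Z_t = weights.sum()
    D_t = weights / Z_t                                       # distribution over the sample
    # eps_{t,k}: D_t-weighted error of base predictor h_k.
    errors = np.array([D_t[(y * H_values[:, k]) < 0].sum()
                       for k in range(H_values.shape[1])])
    k_best = int(np.argmin(errors))
    return k_best, errors[k_best], D_t, Z_t
```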
(d) Show that for any (u, v) ∈ R^2, the following inequality holds:

    Φ(u + v) - Φ(u) \leq -Φ'(u)\,(e^{-v} - 1).
Solution:

    Φ(u + v) - Φ(u) = \log_2\left[ \frac{1 + e^{-(u+v)}}{1 + e^{-u}} \right]
                    = \log_2\left[ \frac{1 + e^{-u} + e^{-(u+v)} - e^{-u}}{1 + e^{-u}} \right]
                    = \log_2\left[ 1 + \frac{e^{-v} - 1}{1 + e^{u}} \right]
                    \leq \frac{1}{\log(2)} \cdot \frac{e^{-v} - 1}{1 + e^{u}}   \qquad (\log(1 + z) \leq z, \text{ valid here since } z = \tfrac{e^{-v} - 1}{1 + e^{u}} > -1)
                    = -Φ'(u)\,(e^{-v} - 1).
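A brief numerical check of this inequality on random pairs (u, v) (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=(2, 10000)) * 3.0

phi = lambda t: np.log2(1.0 + np.exp(-t))
dphi = lambda t: -1.0 / (np.log(2.0) * (1.0 + np.exp(t)))    # Phi'(t)

lhs = phi(u + v) - phi(u)
rhs = -dphi(u) * (np.exp(-v) - 1.0)
assert np.all(lhs <= rhs + 1e-12)    # Phi(u+v) - Phi(u) <= -Phi'(u)(e^{-v} - 1)
```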

(e) Use that to show the following:

    F(α_{t-1} + η e_k) - F(α_{t-1}) \leq \frac{1}{m} \sum_{i=1}^{m} D_t(i) Z_t \left[ e^{-η y_i h_k(x_i)} - 1 \right].

Solution: Observe that D_t(i) Z_t = -Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right). Then, applying the result of the previous question with u = y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) and v = η y_i h_k(x_i) for each term,

    F(α_{t-1} + η e_k) - F(α_{t-1}) \leq \frac{1}{m} \sum_{i=1}^{m} -Φ'\left( y_i \sum_{j=1}^{N} α_{t-1,j} h_j(x_i) \right) \left[ e^{-η y_i h_k(x_i)} - 1 \right]
                                       = \frac{1}{m} \sum_{i=1}^{m} D_t(i) Z_t \left[ e^{-η y_i h_k(x_i)} - 1 \right].

(f) To determine the best step η for a given direction k, we can


minimize the upper bound of the previous question. Show that
this is syntactically the same minimization as the one for finding
the best step in AdaBoost. Give the expression of the step at
iteration t.

Solution: Minimizing that upper bound is equivalent to minimizing \frac{1}{m} \sum_{i=1}^{m} D_t(i)\, e^{-η y_i h_k(x_i)}, which is syntactically the same quantity minimized by AdaBoost to determine the step. The expression of D_t is different here, but we can immediately use the result known for AdaBoost to determine the step at iteration t: η = \frac{1}{2} \log\left[ \frac{1 - ε_{t,k}}{ε_{t,k}} \right].

(g) Give the full pseudocode of the algorithm.

Solution: This is straightforward based on the previous results: at each round t, compute D_t, pick the base predictor h_{k_t} with the smallest weighted error ε_t = ε_{t,k_t}, update α_{k_t} by the step η_t = \frac{1}{2} \log\frac{1 - ε_t}{ε_t}, and repeat for T rounds; an illustrative sketch is given below.
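A minimal, illustrative implementation sketch of the resulting algorithm (it assumes, as in the sketches above, a matrix H_values with H_values[i, j] = h_j(x_i) ∈ {-1, +1}; the function name and structure are ours, not the official pseudocode):

```python
import numpy as np

def logistic_loss_boost(H_values, y, T=100):
    """Coordinate descent on F(alpha) = (1/m) sum_i log2(1 + exp(-y_i (H alpha)_i)).

    H_values: (m, N) matrix of base-predictor outputs in {-1, +1}.
    y:        (m,) labels in {-1, +1}.
    Returns the weight vector alpha after at most T rounds.
    """
    m, N = H_values.shape
    alpha = np.zeros(N)
    for t in range(T):
        margins = y * (H_values @ alpha)
        # D_t(i) Z_t = (1 / log 2) * 1 / (1 + exp(margin_i)); normalize to obtain D_t.
        weights = 1.0 / (np.log(2.0) * (1.0 + np.exp(margins)))
        D = weights / weights.sum()
        # Weighted error eps_{t,k} of every base predictor; pick the best direction.
        errors = np.array([D[(y * H_values[:, k]) < 0].sum() for k in range(N)])
        k = int(np.argmin(errors))
        eps = float(np.clip(errors[k], 1e-12, 1.0))
        if eps >= 0.5:                       # boosting condition violated: stop
            break
        # Step size, identical in form to AdaBoost's.
        alpha[k] += 0.5 * np.log((1.0 - eps) / eps)
    return alpha

# Prediction on new points: sign of the weighted vote, np.sign(H_values_new @ alpha).
```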


(h) Give a margin-based generalization bound for this algorithm.

Solution: Straightforward using the ensemble Rademacher complexity bound presented in class: fix ρ > 0; then, for any δ > 0, with probability at least 1 - δ,

    R\left( \frac{\sum_{t=1}^{T} α_t h_t}{\|α\|_1} \right) \leq \hat{R}_{S,ρ}\left( \frac{\sum_{t=1}^{T} α_t h_t}{\|α\|_1} \right) + \frac{2}{ρ} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{δ}}{2m}}.

(i) Compare empirically AdaBoost and this logistic-loss boosting algorithm on the same dataset as the one used in the previous homework, choosing T via cross-validation among multiples of 100. A possible experimental setup is sketched below.
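A possible experimental setup (illustrative only): it assumes the logistic_loss_boost function sketched above, a feature matrix X with labels y in {-1, +1} from the previous homework's dataset, and hypothetical helpers build_stumps / stump_outputs that produce the {-1, +1} base-predictor matrix; scikit-learn's AdaBoostClassifier is used for the AdaBoost side.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

# X, y: loaded from the previous homework's dataset (not shown here).
T_grid = [100, 200, 300, 400, 500]          # candidate numbers of boosting rounds

# AdaBoost: select T by 5-fold cross-validation (default base learner: decision stumps).
ada_cv = {T: cross_val_score(AdaBoostClassifier(n_estimators=T), X, y, cv=5).mean()
          for T in T_grid}
best_T_ada = max(ada_cv, key=ada_cv.get)

# Logistic-loss boosting: same protocol, using the implementation sketched above.
def logistic_cv_accuracy(T, n_splits=5):
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        stumps = build_stumps(X[train_idx], y[train_idx])          # hypothetical helper
        H_train = stump_outputs(X[train_idx], stumps)              # hypothetical helper
        alpha = logistic_loss_boost(H_train, y[train_idx], T=T)
        preds = np.sign(stump_outputs(X[val_idx], stumps) @ alpha)
        scores.append(np.mean(preds == y[val_idx]))
    return float(np.mean(scores))

log_cv = {T: logistic_cv_accuracy(T) for T in T_grid}
best_T_log = max(log_cv, key=log_cv.get)
print(ada_cv, best_T_ada, log_cv, best_T_log)
```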
