Recitation 11: Based On Nesterov, Yurii. Introductory Lectures On Convex Optimization: A Basic Course
Recitation 11
Lecturer: Calvin Wylie    Scribe: Woo-Hyung Cho
1 Smooth Convex Optimization
Recall that $f : \mathbb{R}^n \to \mathbb{R}$ is convex if, for all $x, y$ and $t \in [0, 1]$, $f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y)$.
Lemma 1 If $f$ is convex and continuously differentiable, then $f(x) + \langle \nabla f(x), y - x \rangle \le f(y)$ for all $x, y$.
Lemma 2 If $f$ is convex and continuously differentiable, then $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$ for all $x, y$, i.e., $\nabla f$ is monotone.
Proof: Apply Lemma 1 twice, once at $x$ and once with the roles of $x$ and $y$ interchanged:
\[ f(x) + \langle \nabla f(x), y - x \rangle \le f(y), \qquad f(y) + \langle \nabla f(y), x - y \rangle \le f(x). \]
Adding the two inequalities gives $\langle \nabla f(x), y - x \rangle + \langle \nabla f(y), x - y \rangle \le 0$, that is, $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$. $\Box$
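As a quick numerical sanity check of the two lemmas, here is a minimal Python sketch; the choice $f(x) = \|x\|^2$, the sample size, and the tolerances are illustrative assumptions rather than anything prescribed by the lemmas.
\begin{verbatim}
import numpy as np

# Illustrative example (assumption): f(x) = ||x||^2 is convex with gradient 2x.
def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # Lemma 1: f(x) + <grad f(x), y - x> <= f(y)
    assert f(x) + grad_f(x) @ (y - x) <= f(y) + 1e-9
    # Lemma 2: <grad f(x) - grad f(y), x - y> >= 0 (monotonicity of grad f)
    assert (grad_f(x) - grad_f(y)) @ (x - y) >= -1e-9
print("Lemma 1 and Lemma 2 hold on all sampled pairs.")
\end{verbatim}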
Theorem 3 Suppose $f$ is continuously differentiable and $\nabla f$ is Lipschitz continuous with constant $L$. Then for all $x, y$,
\[ \left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \le \frac{L}{2} \| y - x \|^2. \]
Proof: By the fundamental theorem of calculus,
\[ f(y) - f(x) = \int_0^1 \frac{d}{dt} f(x + t(y - x)) \, dt = \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \, dt. \]
Then,
\begin{align*}
\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right|
&= \left| \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \, dt \right| \\
&\le \int_0^1 \left| \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \right| dt \\
&\le \int_0^1 \| \nabla f(x + t(y - x)) - \nabla f(x) \| \, \| y - x \| \, dt && \text{(Cauchy--Schwarz)} \\
&\le \int_0^1 L \, \| x + t(y - x) - x \| \, \| y - x \| \, dt && \text{(Lipschitz)} \\
&= L \| y - x \|^2 \int_0^1 t \, dt \\
&= \frac{L}{2} \| y - x \|^2. \qquad \Box
\end{align*}
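Theorem 3 is easy to test numerically. The sketch below assumes the quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A$ symmetric positive semidefinite, for which $\nabla f(x) = A x$ is Lipschitz with constant $L = \lambda_{\max}(A)$; the matrix, dimensions, and tolerance are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
# Illustrative example (assumption): f(x) = 0.5 x^T A x, grad f(x) = A x,
# whose gradient is Lipschitz with constant L = lambda_max(A).
M = rng.normal(size=(4, 4))
A = M @ M.T
L = float(np.linalg.eigvalsh(A).max())

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # Theorem 3: |f(y) - f(x) - <grad f(x), y - x>| <= (L/2) ||y - x||^2
    gap = abs(f(y) - f(x) - grad(x) @ (y - x))
    assert gap <= 0.5 * L * (y - x) @ (y - x) + 1e-9
print("Quadratic upper bound verified on all sampled pairs.")
\end{verbatim}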
Theorem 4 Suppose $f$ is convex and continuously differentiable. Then $x^*$ is a global minimizer of $f$ if and only if $\nabla f(x^*) = 0$.
Proof: ($\Leftarrow$) This direction follows immediately from Lemma 1: $f(x^*) + \langle \nabla f(x^*), y - x^* \rangle \le f(y)$ for all $y$. If $\nabla f(x^*) = 0$, then $f(x^*) \le f(y)$ for all $y$. Hence, $x^*$ is a global minimizer.
($\Rightarrow$) For a proof by contradiction, suppose $\nabla f(x^*) \ne 0$ and let $d = -\nabla f(x^*)$. Then $\langle d, \nabla f(x^*) \rangle < 0$. Now recall the mean value theorem: for all $x, y \in \mathbb{R}^n$, $f(y) = f(x) + \langle \nabla f(x + \tau (y - x)), y - x \rangle$ for some $\tau \in (0, 1)$. Since $\nabla f$ is continuous, there is some $T > 0$ such that $\langle \nabla f(x^* + t d), d \rangle < 0$ for all $0 \le t \le T$. For $t \in (0, T]$, the mean value theorem gives $f(x^* + t d) = f(x^*) + \langle \nabla f(x^* + \tau t d), t d \rangle$ for some $\tau \in (0, 1)$, where $\langle \nabla f(x^* + \tau t d), t d \rangle < 0$. This shows that $x^*$ is not a global minimizer, and we have a contradiction. Therefore, $\nabla f(x^*) = 0$. $\Box$
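For a concrete instance of the optimality condition, consider the convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A$ positive definite: the unique point where the gradient vanishes is $x^* = A^{-1} b$, and it attains the smallest value. The sketch below checks this on random sample points; the specific $A$, $b$, and tolerances are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
# Illustrative example (assumption): f(x) = 0.5 x^T A x - b^T x with A positive definite.
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)        # make A positive definite
b = rng.normal(size=4)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

x_star = np.linalg.solve(A, b)     # the unique point with grad f(x*) = 0
assert np.linalg.norm(grad(x_star)) < 1e-8
# A stationary point of a convex function is a global minimizer.
assert all(f(x_star) <= f(rng.normal(size=4)) + 1e-9 for _ in range(1000))
print("Stationary point attains the smallest sampled value.")
\end{verbatim}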
Theorem 5 Suppose $f$ is convex and continuously differentiable, and $\nabla f$ is Lipschitz continuous with constant $L$. Then for all $x, y$,
\[ \frac{1}{L} \| \nabla f(x) - \nabla f(y) \|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \]
Proof: Let $y \in \mathbb{R}^n$ and define $g(x) = f(x) - \langle \nabla f(y), x \rangle$. Note that $\nabla g(y) = \nabla f(y) - \nabla f(y) = 0$, i.e., $y$ minimizes the convex function $g$ (Theorem 4). Because $g(y) \le g(\cdot)$, in particular $g(y) \le g(x - \frac{1}{L} \nabla g(x))$ for all $x$. Since $\nabla g$ is also $L$-Lipschitz, we apply Theorem 3:
\begin{align*}
g\!\left( x - \tfrac{1}{L} \nabla g(x) \right)
&\le g(x) + \left\langle \nabla g(x), -\tfrac{1}{L} \nabla g(x) \right\rangle + \frac{L}{2} \left\| -\tfrac{1}{L} \nabla g(x) \right\|^2 \\
&= g(x) - \frac{1}{L} \| \nabla g(x) \|^2 + \frac{1}{2L} \| \nabla g(x) \|^2 \\
&= g(x) - \frac{1}{2L} \| \nabla g(x) \|^2
\end{align*}
We use the definition $g(x) = f(x) - \langle \nabla f(y), x \rangle$, so that $\nabla g(x) = \nabla f(x) - \nabla f(y)$, together with the inequality $g(y) \le g(x) - \frac{1}{2L} \| \nabla g(x) \|^2$, to derive
\[ f(y) - \langle \nabla f(y), y \rangle - f(x) + \langle \nabla f(y), x \rangle \le -\frac{1}{2L} \| \nabla f(x) - \nabla f(y) \|^2. \]
Interchanging $x$ and $y$,
\[ f(x) - \langle \nabla f(x), x \rangle - f(y) + \langle \nabla f(x), y \rangle \le -\frac{1}{2L} \| \nabla f(x) - \nabla f(y) \|^2. \]
We add the two inequalities to get
\[ \frac{1}{L} \| \nabla f(x) - \nabla f(y) \|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \qquad \Box \]
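Theorem 5 can also be checked numerically in the same illustrative quadratic setting, where $\nabla f(x) = A x$ and $L = \lambda_{\max}(A)$; the matrix and tolerance below are assumptions for the sketch.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
# Illustrative example (assumption): grad f(x) = A x with L = lambda_max(A).
M = rng.normal(size=(4, 4))
A = M @ M.T
L = float(np.linalg.eigvalsh(A).max())
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    g = grad(x) - grad(y)
    # Theorem 5: (1/L) ||grad f(x) - grad f(y)||^2 <= <grad f(x) - grad f(y), x - y>
    assert (g @ g) / L <= g @ (x - y) + 1e-9
print("Co-coercivity verified on all sampled pairs.")
\end{verbatim}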
Corollary 6 $I - \frac{2}{L} \nabla f$ is non-expansive.
Proof:
\begin{align*}
\left\| \left( x - \tfrac{2}{L} \nabla f(x) \right) - \left( y - \tfrac{2}{L} \nabla f(y) \right) \right\|^2
&= \left\| (x - y) - \tfrac{2}{L} \left( \nabla f(x) - \nabla f(y) \right) \right\|^2 \\
&= \| x - y \|^2 + \frac{4}{L^2} \| \nabla f(x) - \nabla f(y) \|^2 - \frac{4}{L} \langle x - y, \nabla f(x) - \nabla f(y) \rangle \\
&= \| x - y \|^2 + \frac{4}{L} \left( \frac{1}{L} \| \nabla f(x) - \nabla f(y) \|^2 - \langle x - y, \nabla f(x) - \nabla f(y) \rangle \right) \\
&\le \| x - y \|^2 && \text{by Theorem 5}
\end{align*}
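Likewise, a direct numerical check of Corollary 6 in the illustrative quadratic setting: the map $T = I - \frac{2}{L} \nabla f$ should not increase distances (matrix and tolerance are assumptions for the sketch).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
# Illustrative example (assumption): grad f(x) = A x with L = lambda_max(A).
M = rng.normal(size=(4, 4))
A = M @ M.T
L = float(np.linalg.eigvalsh(A).max())
T = lambda x: x - (2.0 / L) * (A @ x)   # T = I - (2/L) grad f

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # Corollary 6: ||T(x) - T(y)|| <= ||x - y||
    assert np.linalg.norm(T(x) - T(y)) <= np.linalg.norm(x - y) + 1e-9
print("Non-expansiveness verified on all sampled pairs.")
\end{verbatim}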
A Krasnoselskii--Mann (KM) iteration
\[ x^{k+1} = \frac{1}{2} \left( I - \frac{2}{L} \nabla f \right)(x^k) + \frac{1}{2} x^k = x^k - \frac{1}{L} \nabla f(x^k) \]
performs gradient descent with step size $\frac{1}{L}$. If we apply the KM algorithm iteratively, the sequence $x^k$ converges to a fixed point $x^*$ such that $x^* = (I - \frac{2}{L} \nabla f)(x^*)$, which implies $\nabla f(x^*) = 0$; by Theorem 4, $x^*$ is a global minimizer. More generally, steepest descent converges to a minimizer when the step size is chosen between $0$ and $\frac{2}{L}$.
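The following minimal Python sketch runs this iteration, $x^{k+1} = x^k - \frac{1}{L} \nabla f(x^k)$, on the same illustrative quadratic assumption (now with a linear term so that the minimizer is $x^* = A^{-1} b$), and reports the final gradient norm and the distance to the minimizer; the problem data and iteration count are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
# Illustrative example (assumption): f(x) = 0.5 x^T A x - b^T x with A positive definite,
# so grad f(x) = A x - b is Lipschitz with constant L = lambda_max(A).
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
b = rng.normal(size=4)
L = float(np.linalg.eigvalsh(A).max())
grad = lambda x: A @ x - b

x = rng.normal(size=4)
for k in range(2000):
    # KM form: x_{k+1} = 0.5 * (x - (2/L) grad f(x)) + 0.5 * x = x - (1/L) grad f(x)
    x = x - (1.0 / L) * grad(x)

x_star = np.linalg.solve(A, b)           # the minimizer, where grad f vanishes
print("final gradient norm:", np.linalg.norm(grad(x)))
print("distance to minimizer:", np.linalg.norm(x - x_star))
\end{verbatim}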