

ORIE 6300 Mathematical Programming I November 2, 2016

Recitation 11
Lecturer: Calvin Wylie    Scribe: Woo-Hyung Cho

1 Smooth Convex Optimization
Recall that $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ and $t \in [0, 1]$, $f(tx + (1-t)y) \le tf(x) + (1-t)f(y)$.

Lemma 1 If $f$ is convex and continuously differentiable, then $f(x) + \langle \nabla f(x), y - x \rangle \le f(y)$ for all $x, y$.

Proof: Let $x, y \in \mathbb{R}^n$ and $t \in [0, 1)$. Let $x_t = tx + (1-t)y$.

If $f$ is convex, then $f(x_t) \le tf(x) + (1-t)f(y)$. Since $t \neq 1$, we can divide by $1 - t$:
\begin{align*}
f(y) &\ge \frac{1}{1-t}\bigl(f(x_t) - tf(x)\bigr) \\
&= f(x) + \frac{1}{1-t}\bigl(f(x_t) - f(x)\bigr) \\
&= f(x) + \frac{1}{1-t}\bigl(f(x + (1-t)(y-x)) - f(x)\bigr).
\end{align*}
Letting $t \to 1$ gives $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$.
The proof of the reverse direction is left to the reader as an exercise. □
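As a concrete illustration (an added example, not from the original notes), take the convex function $f(x) = \|x\|^2$ with $\nabla f(x) = 2x$. The inequality of Lemma 1 becomes
\[ \|x\|^2 + \langle 2x, y - x \rangle \le \|y\|^2, \]
which rearranges to $0 \le \|y - x\|^2$ and therefore always holds.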

Lemma 2 If $f$ is convex and continuously differentiable, then $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$ for all $x, y$, i.e., $\nabla f$ is monotone.

Proof: We show that the inequality of Lemma 1 implies monotonicity; the reverse implication also holds, but we prove only this direction.


Let $x, y \in \mathbb{R}^n$. By Lemma 1,
\begin{align}
f(x) + \langle \nabla f(x), y - x \rangle &\le f(y) \tag{1} \\
f(y) + \langle \nabla f(y), x - y \rangle &\le f(x) \tag{2}
\end{align}
Adding (1) and (2) gives $\langle \nabla f(x), y - x \rangle + \langle \nabla f(y), x - y \rangle \le 0$, and hence $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$. □

Theorem 3 If $f$ is continuously differentiable (but not necessarily convex) and $\nabla f$ is $L$-Lipschitz, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, then
\[ |f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2}\|y - x\|^2. \]
In other words, for fixed $x$, $\varphi_1(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ is an upper bound on $f(y)$, and likewise $\varphi_2(y) = f(x) + \langle \nabla f(x), y - x \rangle - \frac{L}{2}\|y - x\|^2$ is a lower bound.
¹Based on Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course.

Proof: By the fundamental theorem of calculus,
\begin{align*}
f(y) - f(x) &= \int_0^1 \frac{d}{dt} f(x + t(y - x))\, dt \\
&= \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle\, dt.
\end{align*}
Then,
\begin{align*}
|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| &= \Bigl| \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle\, dt \Bigr| \\
&\le \int_0^1 |\langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle|\, dt \\
&\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|\, \|y - x\|\, dt \quad \text{(Cauchy--Schwarz)} \\
&\le \int_0^1 L\|x + t(y - x) - x\|\, \|y - x\|\, dt \quad \text{(Lipschitz)} \\
&= L\|y - x\|^2 \int_0^1 t\, dt \\
&= \frac{L}{2}\|y - x\|^2. \qquad \square
\end{align*}
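As an illustration (an added example, not from the original notes), consider the quadratic $f(x) = \frac{1}{2}\langle Ax, x \rangle$ with $A$ symmetric, so that $\nabla f(x) = Ax$ is $L$-Lipschitz with $L = \|A\|$ (the spectral norm). A direct computation gives
\[ f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \frac{1}{2}\langle A(y - x), y - x \rangle, \]
whose absolute value is at most $\frac{\|A\|}{2}\|y - x\|^2 = \frac{L}{2}\|y - x\|^2$; for $A = LI$ the bound holds with equality.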

Theorem 4 Let $f$ be convex and continuously differentiable. Then $x^*$ is a global minimizer of $f$ if and only if $\nabla f(x^*) = 0$.

Proof: (←) This direction follows immediately from Lemma 1: $f(x^*) + \langle \nabla f(x^*), y - x^* \rangle \le f(y)$ for all $y$. If $\nabla f(x^*) = 0$, then $f(x^*) \le f(y)$ for all $y$, so $x^*$ is a global minimizer.

(→) For a proof by contradiction, suppose $\nabla f(x^*) \neq 0$ and let $d = -\nabla f(x^*)$. Then $\langle d, \nabla f(x^*) \rangle < 0$. Recall the mean value theorem: for all $x, y \in \mathbb{R}^n$, $f(y) = f(x) + \langle \nabla f(x + \bar{t}(y - x)), y - x \rangle$ for some $\bar{t} \in (0, 1)$. Since $\nabla f$ is continuous, there is some $T > 0$ such that $\langle \nabla f(x^* + td), d \rangle < 0$ for all $0 \le t \le T$. For $t \in (0, T]$, the mean value theorem gives $f(x^* + td) = f(x^*) + \langle \nabla f(x^* + \bar{t}td), td \rangle$ for some $\bar{t} \in (0, 1)$, and $\langle \nabla f(x^* + \bar{t}td), td \rangle < 0$. Hence $f(x^* + td) < f(x^*)$, so $x^*$ is not a global minimizer, a contradiction. Therefore, $\nabla f(x^*) = 0$. □
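As an added example (not from the original notes): for the convex quadratic $f(x) = \frac{1}{2}\langle Ax, x \rangle - \langle b, x \rangle$ with $A$ symmetric positive semidefinite, Theorem 4 says that $x^*$ is a global minimizer if and only if $\nabla f(x^*) = Ax^* - b = 0$, i.e., exactly when $x^*$ solves the linear system $Ax = b$.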

Theorem 5 Let $f$ be convex and continuously differentiable, and let $\nabla f$ be $L$-Lipschitz continuous. Then
\[ \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \]
This is called the "co-coercivity condition."

Proof: Fix $y \in \mathbb{R}^n$ and define $g(x) = f(x) - \langle \nabla f(y), x \rangle$. Then $g$ is convex and $\nabla g(y) = \nabla f(y) - \nabla f(y) = 0$, so $y$ minimizes $g$ by Theorem 4. In particular, $g(y) \le g\bigl(x - \frac{1}{L}\nabla g(x)\bigr)$ for all $x$. Since $\nabla g$ differs from $\nabla f$ by a constant, it is also $L$-Lipschitz, and we can apply Theorem 3:

\begin{align*}
g\Bigl(x - \frac{1}{L}\nabla g(x)\Bigr) &\le g(x) + \Bigl\langle \nabla g(x), -\frac{1}{L}\nabla g(x) \Bigr\rangle + \frac{L}{2}\Bigl\| -\frac{1}{L}\nabla g(x) \Bigr\|^2 \\
&= g(x) - \frac{1}{L}\|\nabla g(x)\|^2 + \frac{1}{2L}\|\nabla g(x)\|^2 \\
&= g(x) - \frac{1}{2L}\|\nabla g(x)\|^2.
\end{align*}
Using the definition $g(x) = f(x) - \langle \nabla f(y), x \rangle$ and the inequality $g(y) \le g(x) - \frac{1}{2L}\|\nabla g(x)\|^2$, we obtain
\[ f(y) - \langle \nabla f(y), y \rangle - f(x) + \langle \nabla f(y), x \rangle \le -\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2. \]
Interchanging $x$ and $y$,
\[ f(x) - \langle \nabla f(x), x \rangle - f(y) + \langle \nabla f(x), y \rangle \le -\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2. \]
Adding the two inequalities gives
\[ \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \qquad \square \]
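To see the co-coercivity condition in a concrete case (an added example, not from the original notes), take $f(x) = \frac{1}{2}\langle Ax, x \rangle$ with $A$ symmetric positive semidefinite and $L = \lambda_{\max}(A)$. Writing $z = x - y$, the condition reads
\[ \frac{1}{L}\|Az\|^2 = \frac{1}{L}\langle A^2 z, z \rangle \le \langle Az, z \rangle, \]
which holds because $A^2 \preceq \lambda_{\max}(A)\, A$ for positive semidefinite $A$.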

Corollary 6 $I - \frac{2}{L}\nabla f$ is non-expansive.

Proof:
\begin{align*}
\Bigl\| \Bigl(x - \frac{2}{L}\nabla f(x)\Bigr) - \Bigl(y - \frac{2}{L}\nabla f(y)\Bigr) \Bigr\|^2 &= \Bigl\| (x - y) - \frac{2}{L}(\nabla f(x) - \nabla f(y)) \Bigr\|^2 \\
&= \|x - y\|^2 + \frac{4}{L^2}\|\nabla f(x) - \nabla f(y)\|^2 - \frac{4}{L}\langle x - y, \nabla f(x) - \nabla f(y) \rangle \\
&= \|x - y\|^2 + \frac{4}{L}\Bigl( \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 - \langle x - y, \nabla f(x) - \nabla f(y) \rangle \Bigr) \\
&\le \|x - y\|^2 \qquad \text{(by Theorem 5)} \qquad \square
\end{align*}

A Krasnoselskii–Mann (KM) iteration
\begin{align*}
x^{k+1} &= \frac{1}{2}\Bigl(I - \frac{2}{L}\nabla f\Bigr)(x^k) + \frac{1}{2}x^k \\
&= x^k - \frac{1}{L}\nabla f(x^k)
\end{align*}
performs gradient descent with step size $\frac{1}{L}$. If we apply the KM algorithm iteratively, the sequence $x^k$ converges to a fixed point $x^*$ (provided one exists) such that $x^* = (I - \frac{2}{L}\nabla f)(x^*)$, which implies $\nabla f(x^*) = 0$. More generally, gradient descent converges to a minimizer when the step size is chosen between $0$ and $\frac{2}{L}$.
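As a small illustration of this iteration (an added sketch; the quadratic test function, dimensions, and variable names below are assumptions made for the example, not part of the notes), the following Python snippet runs gradient descent with step size $1/L$ on $f(x) = \frac{1}{2}\langle Ax, x \rangle - \langle b, x \rangle$, where $L = \lambda_{\max}(A)$, and checks that the gradient norm goes to zero.

import numpy as np

# Illustrative convex quadratic f(x) = 0.5 <Ax, x> - <b, x> (assumed example).
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)          # symmetric positive definite
b = rng.normal(size=5)

grad_f = lambda x: A @ x - b     # gradient of f
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f (largest eigenvalue of A)

# KM / gradient-descent iteration: x_{k+1} = x_k - (1/L) * grad_f(x_k)
x = np.zeros(5)
for _ in range(2000):
    x = x - grad_f(x) / L

print(np.linalg.norm(grad_f(x)))  # close to 0: x approximates the minimizer of f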
