

ORIE 6300 Mathematical Programming I November 2, 2016

Recitation 11
Lecturer: Calvin Wylie    Scribe: Woo-Hyung Cho

1 Smooth Convex Optimization
Recall that $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ and $t \in [0, 1]$, $f(tx + (1-t)y) \le tf(x) + (1-t)f(y)$.

Lemma 1 If $f$ is convex and continuously differentiable, then $f(x) + \langle \nabla f(x), y - x \rangle \le f(y)$ for all $x, y$.

Proof: Let $x, y \in \mathbb{R}^n$ and $t \in [0, 1)$. Let $x_t = tx + (1-t)y$.

If $f$ is convex, then $f(x_t) \le tf(x) + (1-t)f(y)$. Since $t \neq 1$, we can divide by $1 - t$:
\begin{align*}
f(y) &\ge \frac{1}{1-t}\bigl(f(x_t) - tf(x)\bigr) \\
&= f(x) + \frac{1}{1-t}\bigl(f(x_t) - f(x)\bigr) \\
&= f(x) + \frac{1}{1-t}\bigl(f(x + (1-t)(y-x)) - f(x)\bigr).
\end{align*}
Letting $t \to 1$ gives $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$.
The proof of the reverse direction is left to the reader as an exercise. □
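As a concrete illustration (an added example, not from the original notes), take the convex function $f(x) = \|x\|^2$ with $\nabla f(x) = 2x$. The inequality of Lemma 1 becomes
\[ \|x\|^2 + \langle 2x, y - x \rangle \le \|y\|^2, \]
which rearranges to $0 \le \|y - x\|^2$ and therefore always holds.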

Lemma 2 If $f$ is convex and continuously differentiable, then $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$ for all $x, y$, i.e., $\nabla f$ is monotone.

Proof: We show that the inequality of Lemma 1 implies monotonicity; the reverse implication also holds, but we prove only this direction.


Let $x, y \in \mathbb{R}^n$. By Lemma 1,
\begin{align}
f(x) + \langle \nabla f(x), y - x \rangle &\le f(y) \tag{1} \\
f(y) + \langle \nabla f(y), x - y \rangle &\le f(x) \tag{2}
\end{align}
Adding (1) and (2) gives $\langle \nabla f(x), y - x \rangle + \langle \nabla f(y), x - y \rangle \le 0$, and hence $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$. □

Theorem 3 If $f$ is continuously differentiable (but not necessarily convex) and $\nabla f$ is $L$-Lipschitz, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, then
\[ |f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2}\|y - x\|^2. \]
In other words, for fixed $x$, $\varphi_1(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ is an upper bound on $f(y)$, and likewise $\varphi_2(y) = f(x) + \langle \nabla f(x), y - x \rangle - \frac{L}{2}\|y - x\|^2$ is a lower bound.
¹Based on Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course.

Proof: By the fundamental theorem of calculus,
\begin{align*}
f(y) - f(x) &= \int_0^1 \frac{d}{dt} f(x + t(y - x))\, dt \\
&= \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle\, dt.
\end{align*}
Then,
\begin{align*}
|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| &= \Bigl| \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle\, dt \Bigr| \\
&\le \int_0^1 |\langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle|\, dt \\
&\le \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|\, \|y - x\|\, dt \quad \text{(Cauchy--Schwarz)} \\
&\le \int_0^1 L\|x + t(y - x) - x\|\, \|y - x\|\, dt \quad \text{(Lipschitz)} \\
&= L\|y - x\|^2 \int_0^1 t\, dt \\
&= \frac{L}{2}\|y - x\|^2. \qquad \square
\end{align*}
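As an illustration (an added example, not from the original notes), consider the quadratic $f(x) = \frac{1}{2}\langle Ax, x \rangle$ with $A$ symmetric, so that $\nabla f(x) = Ax$ is $L$-Lipschitz with $L = \|A\|$ (the spectral norm). A direct computation gives
\[ f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \frac{1}{2}\langle A(y - x), y - x \rangle, \]
whose absolute value is at most $\frac{\|A\|}{2}\|y - x\|^2 = \frac{L}{2}\|y - x\|^2$; for $A = LI$ the bound holds with equality.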

Theorem 4 Let $f$ be convex and continuously differentiable. Then $x^*$ is a global minimizer of $f$ if and only if $\nabla f(x^*) = 0$.

Proof: (←) This direction follows immediately from Lemma 1: $f(x^*) + \langle \nabla f(x^*), y - x^* \rangle \le f(y)$ for all $y$. If $\nabla f(x^*) = 0$, then $f(x^*) \le f(y)$ for all $y$, so $x^*$ is a global minimizer.

(→) For a proof by contradiction, suppose $\nabla f(x^*) \neq 0$ and let $d = -\nabla f(x^*)$. Then $\langle d, \nabla f(x^*) \rangle < 0$. Recall the mean value theorem: for all $x, y \in \mathbb{R}^n$, $f(y) = f(x) + \langle \nabla f(x + \bar{t}(y - x)), y - x \rangle$ for some $\bar{t} \in (0, 1)$. Since $\nabla f$ is continuous, there is some $T > 0$ such that $\langle \nabla f(x^* + td), d \rangle < 0$ for all $0 \le t \le T$. For $t \in (0, T]$, the mean value theorem gives $f(x^* + td) = f(x^*) + \langle \nabla f(x^* + \bar{t}td), td \rangle$ for some $\bar{t} \in (0, 1)$, and $\langle \nabla f(x^* + \bar{t}td), td \rangle < 0$. Hence $f(x^* + td) < f(x^*)$, so $x^*$ is not a global minimizer, a contradiction. Therefore, $\nabla f(x^*) = 0$. □
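As an added example (not from the original notes): for the convex quadratic $f(x) = \frac{1}{2}\langle Ax, x \rangle - \langle b, x \rangle$ with $A$ symmetric positive semidefinite, Theorem 4 says that $x^*$ is a global minimizer if and only if $\nabla f(x^*) = Ax^* - b = 0$, i.e., exactly when $x^*$ solves the linear system $Ax = b$.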

Theorem 5 Let $f$ be convex and continuously differentiable, and let $\nabla f$ be $L$-Lipschitz continuous. Then
\[ \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \]
This is called the "co-coercivity condition."

Proof: Fix $y \in \mathbb{R}^n$ and define $g(x) = f(x) - \langle \nabla f(y), x \rangle$. Then $g$ is convex and $\nabla g(y) = \nabla f(y) - \nabla f(y) = 0$, so $y$ minimizes $g$ by Theorem 4. In particular, $g(y) \le g\bigl(x - \frac{1}{L}\nabla g(x)\bigr)$ for all $x$. Since $\nabla g$ differs from $\nabla f$ by a constant, it is also $L$-Lipschitz, and we can apply Theorem 3:

\begin{align*}
g\Bigl(x - \frac{1}{L}\nabla g(x)\Bigr) &\le g(x) + \Bigl\langle \nabla g(x), -\frac{1}{L}\nabla g(x) \Bigr\rangle + \frac{L}{2}\Bigl\| -\frac{1}{L}\nabla g(x) \Bigr\|^2 \\
&= g(x) - \frac{1}{L}\|\nabla g(x)\|^2 + \frac{1}{2L}\|\nabla g(x)\|^2 \\
&= g(x) - \frac{1}{2L}\|\nabla g(x)\|^2.
\end{align*}
Using the definition $g(x) = f(x) - \langle \nabla f(y), x \rangle$ and the inequality $g(y) \le g(x) - \frac{1}{2L}\|\nabla g(x)\|^2$, we obtain
\[ f(y) - \langle \nabla f(y), y \rangle - f(x) + \langle \nabla f(y), x \rangle \le -\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2. \]
Interchanging $x$ and $y$,
\[ f(x) - \langle \nabla f(x), x \rangle - f(y) + \langle \nabla f(x), y \rangle \le -\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|^2. \]
Adding the two inequalities gives
\[ \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle. \qquad \square \]
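To see the co-coercivity condition in a concrete case (an added example, not from the original notes), take $f(x) = \frac{1}{2}\langle Ax, x \rangle$ with $A$ symmetric positive semidefinite and $L = \lambda_{\max}(A)$. Writing $z = x - y$, the condition reads
\[ \frac{1}{L}\|Az\|^2 = \frac{1}{L}\langle A^2 z, z \rangle \le \langle Az, z \rangle, \]
which holds because $A^2 \preceq \lambda_{\max}(A)\, A$ for positive semidefinite $A$.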

Corollary 6 $I - \frac{2}{L}\nabla f$ is non-expansive.

Proof:
\begin{align*}
\Bigl\| \Bigl(x - \frac{2}{L}\nabla f(x)\Bigr) - \Bigl(y - \frac{2}{L}\nabla f(y)\Bigr) \Bigr\|^2 &= \Bigl\| (x - y) - \frac{2}{L}(\nabla f(x) - \nabla f(y)) \Bigr\|^2 \\
&= \|x - y\|^2 + \frac{4}{L^2}\|\nabla f(x) - \nabla f(y)\|^2 - \frac{4}{L}\langle x - y, \nabla f(x) - \nabla f(y) \rangle \\
&= \|x - y\|^2 + \frac{4}{L}\Bigl( \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 - \langle x - y, \nabla f(x) - \nabla f(y) \rangle \Bigr) \\
&\le \|x - y\|^2 \qquad \text{(by Theorem 5)} \qquad \square
\end{align*}

A Krasnoselskii–Mann (KM) iteration
\begin{align*}
x^{k+1} &= \frac{1}{2}\Bigl(I - \frac{2}{L}\nabla f\Bigr)(x^k) + \frac{1}{2}x^k \\
&= x^k - \frac{1}{L}\nabla f(x^k)
\end{align*}
performs gradient descent with step size $\frac{1}{L}$. If we apply the KM algorithm iteratively, the sequence $x^k$ converges to a fixed point $x^*$ (provided one exists) such that $x^* = (I - \frac{2}{L}\nabla f)(x^*)$, which implies $\nabla f(x^*) = 0$. More generally, gradient descent converges to a minimizer when the step size is chosen between $0$ and $\frac{2}{L}$.
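As a small illustration of this iteration (an added sketch; the quadratic test function, dimensions, and variable names below are assumptions made for the example, not part of the notes), the following Python snippet runs gradient descent with step size $1/L$ on $f(x) = \frac{1}{2}\langle Ax, x \rangle - \langle b, x \rangle$, where $L = \lambda_{\max}(A)$, and checks that the gradient norm goes to zero.

import numpy as np

# Illustrative convex quadratic f(x) = 0.5 <Ax, x> - <b, x> (assumed example).
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)          # symmetric positive definite
b = rng.normal(size=5)

grad_f = lambda x: A @ x - b     # gradient of f
L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of grad f (largest eigenvalue of A)

# KM / gradient-descent iteration: x_{k+1} = x_k - (1/L) * grad_f(x_k)
x = np.zeros(5)
for _ in range(2000):
    x = x - grad_f(x) / L

print(np.linalg.norm(grad_f(x)))  # close to 0: x approximates the minimizer of f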
