Notes ch0
PRELIMINARY
Yutong He
1 Norm
Definition 1.1 (Norm). A real-valued function ‖·‖ defined on a linear space E is called a norm if it satisfies:
1. (Positive definiteness) ‖x‖ ≥ 0 for all x ∈ E, with ‖x‖ = 0 if and only if x = 0;
2. (Absolute homogeneity) ‖αx‖ = |α| ‖x‖ for all α ∈ ℝ and x ∈ E;
3. (Triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ E.
Specifically, we have the following commonly used definitions of the ℓ₁-norm and ℓ₂-norm for a vector x = (x₁, x₂, · · · , xₙ)ᵀ ∈ ℝⁿ. When p = 1, the ℓ₁-norm is given by:

‖x‖₁ = |x₁| + |x₂| + · · · + |xₙ|.

When p = 2, the ℓ₂-norm is the same as the Euclidean norm:

‖x‖₂ = √(x₁² + x₂² + · · · + xₙ²).
For the ℓ₂-norm we have Cauchy's inequality:

|⟨x, y⟩| ≤ ‖x‖₂ · ‖y‖₂,

where ⟨x, y⟩ = xᵀy denotes the inner product of x and y, and the equality holds if and only if x and y are linearly correlated (i.e., ∃ α, β ∈ ℝ s.t. α² + β² > 0 and αx + βy = 0).
Definition 1.5 (Norm induced by a positive definite matrix A). Given a positive definite matrix A ∈ ℝⁿˣⁿ, we define:

‖x‖_A = √(xᵀAx).
Definition 1.6 (Frobenius norm / F-norm). The Frobenius norm (or F-norm) of a matrix A ∈ ℝᵐˣⁿ is defined as:

‖A‖_F = √(tr(AᵀA)) = √(Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ Aᵢ,ⱼ²).
The Frobenius norm has properties similar to those of the ℓ₂-norm. For example, we have the following Cauchy's inequality for the Frobenius norm.
Proposition 1.7 (Cauchy's inequality). ∀ A, B ∈ ℝᵐˣⁿ, we have

|⟨A, B⟩| ≤ ‖A‖_F · ‖B‖_F,

where ⟨A, B⟩ = tr(AᵀB) = tr(ABᵀ) denotes the Frobenius inner product of A and B, and the equality holds if and only if A and B are linearly correlated.
Definition 1.8 (Spectral norm). The spectral norm of a matrix A ∈ ℝᵐˣⁿ is defined as:

‖A‖₂ = max_{x ∈ ℝⁿ, ‖x‖₂ = 1} ‖Ax‖₂.
For a given matrix A ∈ ℝᵐˣⁿ, it can be shown that ‖A‖₂ equals the largest singular value of A, and ‖A‖₂² equals the largest eigenvalue of AᵀA (and of AAᵀ).
By definition, we also have the following inequality:

‖Ax‖₂ ≤ ‖A‖₂ · ‖x‖₂, ∀ x ∈ ℝⁿ.
Definition 1.9 (Nuclear norm). The nuclear norm of a matrix A ∈ ℝᵐˣⁿ is defined as:

‖A‖∗ = σ₁ + σ₂ + · · · + σᵣ,

where σ₁ ≥ σ₂ ≥ · · · ≥ σᵣ > 0 are the nonzero singular values of A and r = rank(A).
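As a quick numeric sanity check of the definitions above, here is a minimal NumPy sketch; the vector x and matrices A, P are arbitrary random examples introduced only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    A = rng.standard_normal((4, 5))

    # l1- and l2-norms of a vector
    assert np.isclose(np.linalg.norm(x, 1), np.abs(x).sum())
    assert np.isclose(np.linalg.norm(x, 2), np.sqrt((x ** 2).sum()))

    # norm induced by a positive definite P: positivity and triangle inequality
    B = rng.standard_normal((5, 5))
    P = B @ B.T + np.eye(5)
    nP = lambda v: np.sqrt(v @ P @ v)
    u = rng.standard_normal(5)
    assert nP(x) > 0 and nP(x + u) <= nP(x) + nP(u) + 1e-12

    # Frobenius norm: sqrt(tr(A^T A)) equals the entrywise formula
    assert np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro'))

    # spectral norm = largest singular value; its square = largest eigenvalue of A^T A
    s = np.linalg.svd(A, compute_uv=False)
    assert np.isclose(np.linalg.norm(A, 2), s[0])
    assert np.isclose(np.linalg.norm(A, 2) ** 2, np.linalg.eigvalsh(A.T @ A).max())

    # nuclear norm = sum of singular values, and ||Ax||_2 <= ||A||_2 * ||x||_2
    assert np.isclose(np.linalg.norm(A, 'nuc'), s.sum())
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A, 2) * np.linalg.norm(x) + 1e-12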
2 Gradient
2.1 Gradient and Hessian matrix
Definition 2.1 (Gradient). Suppose f : ℝⁿ → ℝ is differentiable; its gradient ∇f : ℝⁿ → ℝⁿ is defined at point x = (x₁, x₂, · · · , xₙ)ᵀ as:

∇f(x) = (∂f(x)/∂x₁, ∂f(x)/∂x₂, · · · , ∂f(x)/∂xₙ)ᵀ.
Definition 2.2 (Hessian matrix). Suppose f : ℝⁿ → ℝ is twice continuously differentiable; its Hessian matrix ∇²f : ℝⁿ → ℝⁿˣⁿ is defined at point x = (x₁, x₂, · · · , xₙ)ᵀ as:

    ∇²f(x) = [ ∂²f(x)/∂x₁²     ∂²f(x)/∂x₁∂x₂   · · ·   ∂²f(x)/∂x₁∂xₙ ]
             [ ∂²f(x)/∂x₂∂x₁   ∂²f(x)/∂x₂²     · · ·   ∂²f(x)/∂x₂∂xₙ ]
             [      ⋮               ⋮            ⋱          ⋮        ]
             [ ∂²f(x)/∂xₙ∂x₁   ∂²f(x)/∂xₙ∂x₂   · · ·   ∂²f(x)/∂xₙ²   ]
1. (The least squares problem) Let f(x) := (1/2)‖Ax − b‖₂²; then ∇f(x) = Aᵀ(Ax − b) and ∇²f(x) = AᵀA (verified numerically after example 2 below).
2. (The logistic regression problem) Let f(x) := (1/M) Σᵢ₌₁ᴹ ln(1 + exp(−bᵢaᵢᵀx)); then

∇f(x) = (1/M) Σᵢ₌₁ᴹ [−bᵢ exp(−bᵢaᵢᵀx) / (1 + exp(−bᵢaᵢᵀx))] aᵢ,

and

[∇²f(x)]ⱼ,ₖ = (1/M) Σᵢ₌₁ᴹ bᵢ² exp(−bᵢaᵢᵀx) aᵢ,ⱼ aᵢ,ₖ / [1 + exp(−bᵢaᵢᵀx)]².
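The gradient formulas in both examples can be validated against central finite differences. A minimal NumPy sketch; the data A, b, x below are random placeholders, not values from the text:

    import numpy as np

    rng = np.random.default_rng(1)
    M, n = 20, 5
    A = rng.standard_normal((M, n))           # rows are the a_i^T
    b = rng.choice([-1.0, 1.0], size=M)       # labels b_i
    x = rng.standard_normal(n)

    def f(x):                                 # logistic regression objective
        return np.mean(np.log1p(np.exp(-b * (A @ x))))

    def grad(x):                              # its gradient from example 2
        w = np.exp(-b * (A @ x))
        return A.T @ (-b * w / (1.0 + w)) / M

    # compare against central finite differences
    eps = 1e-6
    fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
    assert np.allclose(grad(x), fd, atol=1e-6)

    # least squares: grad f(x) = A^T (Ax - b)
    g_ls = A.T @ (A @ x - b)
    fd_ls = np.array([(0.5 * np.linalg.norm(A @ (x + eps * e) - b) ** 2
                       - 0.5 * np.linalg.norm(A @ (x - eps * e) - b) ** 2) / (2 * eps)
                      for e in np.eye(n)])
    assert np.allclose(g_ls, fd_ls, atol=1e-6)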
For a continuously differentiable f : ℝⁿ → ℝ, the mean value theorem gives

f(x + y) = f(x) + ⟨∇f(x + θ₁y), y⟩

for some θ₁ ∈ (0, 1). If f is further twice continuously differentiable, it holds that

∇f(x + y) = ∇f(x) + ∫₀¹ ∇²f(x + ty)y dt,

and

f(x + y) = f(x) + ⟨∇f(x), y⟩ + (1/2)yᵀ∇²f(x + θ₂y)y,

for some θ₂ ∈ (0, 1).
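The integral identity can be checked numerically. A minimal sketch, assuming the toy function f(x) = Σⱼ exp(xⱼ) (chosen for illustration; ∇f(x) = exp(x) elementwise and ∇²f(x) = diag(exp(x))), with a midpoint-rule quadrature:

    import numpy as np

    x = np.array([0.3, -0.7])
    y = np.array([0.5, 1.1])

    # toy function: f(x) = sum(exp(x)), grad = exp(x), Hessian = diag(exp(x))
    t = np.linspace(0.0, 1.0, 10001)
    mid = 0.5 * (t[:-1] + t[1:])              # quadrature midpoints
    # midpoint-rule approximation of  integral_0^1 Hessian(x + t*y) @ y dt
    integral = sum(np.exp(x + s * y) * y for s in mid) / len(mid)

    # matches grad f(x + y) - grad f(x) up to quadrature error
    assert np.allclose(np.exp(x + y) - np.exp(x), integral, atol=1e-7)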
2.2 L-smoothness
A differentiable function f : ℝⁿ → ℝ is called L-smooth if its gradient is L-Lipschitz continuous:

‖∇f(y) − ∇f(x)‖₂ ≤ L‖y − x‖₂, ∀ x, y ∈ ℝⁿ.
Theorem 2.5 (L-smooth property). Suppose f : ℝⁿ → ℝ is L-smooth; then it holds for ∀ x, y ∈ ℝⁿ that

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂².
Proof. By the Taylor expansion, the Cauchy–Schwarz inequality, and L-smoothness,

f(y) − f(x) − ⟨∇f(x), y − x⟩ = ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt ≤ ∫₀¹ Lt‖y − x‖₂² dt = (L/2)‖y − x‖₂²,

thus the claimed inequality holds.
[Figure: the graph of f stays below the quadratic upper bound f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂², which touches it at (x, f(x)).]
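Theorem 2.5 can be tested on the least squares objective of Section 2.1, whose smoothness constant is L = λ_max(AᵀA). A minimal NumPy sketch with random data:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((10, 4))
    b = rng.standard_normal(10)
    L = np.linalg.eigvalsh(A.T @ A).max()     # smoothness constant of 0.5*||Ax-b||^2

    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    grad = lambda x: A.T @ (A @ x - b)

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
        assert f(y) <= bound + 1e-9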
3 Convexity
3.1 Convex set
The projection onto a closed convex set X ⊆ ℝⁿ exists and is unique; i.e., for any y ∈ ℝⁿ there is a unique x⋆ ∈ X attaining inf_{x∈X} ‖x − y‖₂.
Proof. We first prove the existence of the point x⋆. For a given y ∈ ℝⁿ, let d := inf_{x∈X} ‖x − y‖₂; then there exists a sequence {xₖ}ₖ₌₁∞ ⊆ X such that ‖xₖ − y‖₂² ≤ d² + 1/k. For any integers 0 < m < n, by the convexity of X we have (xₘ + xₙ)/2 ∈ X. By the definition of d, we have ‖(xₘ + xₙ)/2 − y‖₂ ≥ d. Consequently, by the parallelogram identity,

(1/4)‖xₘ − xₙ‖₂² = (1/2)‖xₘ − y‖₂² + (1/2)‖xₙ − y‖₂² − ‖(xₘ + xₙ)/2 − y‖₂² ≤ (1/2)(1/m + 1/n),

so {xₖ} is a Cauchy sequence; its limit x⋆ belongs to X by closedness and satisfies ‖x⋆ − y‖₂ = d. Next we prove uniqueness. Suppose x₁⋆ ≠ x₂⋆ both attain the distance d; then (x₁⋆ + x₂⋆)/2 ∈ X and

‖(x₁⋆ + x₂⋆)/2 − y‖₂² = (1/2)‖x₁⋆ − y‖₂² + (1/2)‖x₂⋆ − y‖₂² − (1/4)‖x₁⋆ − x₂⋆‖₂² < d²,

a contradiction with the definition of d.
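A minimal numeric illustration of this result, assuming X is the Euclidean unit ball (an illustrative choice, for which the projection has the closed form y / max(1, ‖y‖₂)); the sketch checks x⋆ against randomly sampled feasible points:

    import numpy as np

    rng = np.random.default_rng(2)
    y = 3.0 * rng.standard_normal(3)

    # projection of y onto the closed convex unit ball {x : ||x||_2 <= 1}
    x_star = y / max(1.0, np.linalg.norm(y))

    # x_star is at least as close to y as any other feasible point
    for _ in range(1000):
        z = rng.standard_normal(3)
        z /= max(1.0, np.linalg.norm(z))      # an arbitrary point of the ball
        assert np.linalg.norm(x_star - y) <= np.linalg.norm(z - y) + 1e-12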
Theorem 3.4 (Convex property). Suppose f : X → ℝ is differentiable; then f is convex if and only if

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩, ∀ x, y ∈ X. (1)
Proof. Step 1: we prove that (1) holds if f is convex. By definition we have for any θ ∈ [0, 1] that f(θy + (1 − θ)x) ≤ θf(y) + (1 − θ)f(x). Thus,

f(y) − f(x) ≥ [f(x + θ(y − x)) − f(x)]/θ → ⟨∇f(x), y − x⟩, as θ → 0⁺,

which is exactly (1).
Step 2: we prove that f is convex if (1) holds. By (1) we have for ∀ x, y ∈ X and θ ∈ [0, 1] that

f(y) + f(x) ≥ (f(x) + ⟨∇f(x), y − x⟩) + (f(y) + ⟨∇f(y), x − y⟩)
            = f(x) + f(y) − ⟨∇f(y) − ∇f(x), y − x⟩,

i.e., ⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0, so ∇f is monotone. Hence φ(θ) := f(x + θ(y − x)) has a nondecreasing derivative φ′(θ) = ⟨∇f(x + θ(y − x)), y − x⟩, which makes φ convex on [0, 1], and

f(θy + (1 − θ)x) = φ(θ) ≤ (1 − θ)φ(0) + θφ(1) = (1 − θ)f(x) + θf(y).
[Figure: a differentiable convex f lies above its tangent line at (x, f(x)); in particular f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.]
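Inequality (1) can be spot-checked numerically. A minimal sketch with the convex function f(x) = Σⱼ ln(1 + exp(xⱼ)) (chosen for illustration; its gradient is the elementwise sigmoid):

    import numpy as np

    rng = np.random.default_rng(3)

    # f(x) = sum_j log(1 + exp(x_j)) is convex and differentiable on R^n
    f = lambda x: np.log1p(np.exp(x)).sum()
    grad = lambda x: 1.0 / (1.0 + np.exp(-x))  # elementwise sigmoid

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9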
It can be proved that f : X → ℝ is µ-strongly convex if and only if for ∀ x, y ∈ X and θ ∈ [0, 1], it holds that

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (µ/2)θ(1 − θ)‖x − y‖₂².
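A minimal numeric check of this characterization, assuming the quadratic f(x) = (1/2)xᵀQx with Q positive definite (an illustrative choice), which is µ-strongly convex for µ = λ_min(Q):

    import numpy as np

    rng = np.random.default_rng(5)
    Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive definite
    mu = np.linalg.eigvalsh(Q).min()          # strong convexity constant

    f = lambda x: 0.5 * x @ Q @ x

    for _ in range(1000):
        x, y, th = rng.standard_normal(2), rng.standard_normal(2), rng.random()
        lhs = f(th * x + (1 - th) * y)
        rhs = (th * f(x) + (1 - th) * f(y)
               - 0.5 * mu * th * (1 - th) * np.linalg.norm(x - y) ** 2)
        assert lhs <= rhs + 1e-9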
For differentiable f, there is also a first-order characterization: f is µ-strongly convex if and only if

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖₂², ∀ x, y ∈ X. (3)
Proof. Let g(x) := f(x) − (µ/2)‖x‖₂²; then f is µ-strongly convex if and only if g is convex. Note that (3) is equivalent to

g(y) + (µ/2)‖y‖₂² ≥ g(x) + (µ/2)‖x‖₂² + ⟨∇g(x) + µx, y − x⟩ + (µ/2)‖y − x‖₂²
⟺ g(y) ≥ g(x) + ⟨∇g(x), y − x⟩,

which, by Theorem 3.4, holds if and only if g is convex.
Moreover, if f : ℝⁿ → ℝ is differentiable and µ-strongly convex, then

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2µ))‖∇f(y) − ∇f(x)‖₂², ∀ x, y ∈ ℝⁿ,

and

⟨∇f(y) − ∇f(x), y − x⟩ ≤ (1/µ)‖∇f(y) − ∇f(x)‖₂², ∀ x, y ∈ ℝⁿ.
Proof. Fix x ∈ ℝⁿ and let φ(y) := f(y) − f(x) − ⟨∇f(x), y − x⟩; then φ is µ-strongly convex with minimum φ(x) = 0. Thus,

0 = min_v φ(v) ≥ min_v { φ(y) + ⟨∇φ(y), v − y⟩ + (µ/2)‖v − y‖₂² }
  = φ(y) − (1/µ)‖∇φ(y)‖₂² + (µ/2)‖(1/µ)∇φ(y)‖₂² = φ(y) − (1/(2µ))‖∇φ(y)‖₂²
  = f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2µ))‖∇f(y) − ∇f(x)‖₂²,

which is equivalent to

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2µ))‖∇f(y) − ∇f(x)‖₂².

Summing the two copies of the above inequality with x and y interchanged, we obtain

⟨∇f(y) − ∇f(x), y − x⟩ ≤ (1/µ)‖∇f(y) − ∇f(x)‖₂².
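Both inequalities can be spot-checked on a strongly convex quadratic (an illustrative choice; µ = λ_min(Q) below):

    import numpy as np

    rng = np.random.default_rng(6)
    B = rng.standard_normal((4, 4))
    Q = B @ B.T + 0.5 * np.eye(4)             # positive definite Hessian
    mu = np.linalg.eigvalsh(Q).min()          # strong convexity constant

    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        gd = grad(y) - grad(x)
        assert f(y) <= (f(x) + grad(x) @ (y - x)
                        + np.dot(gd, gd) / (2 * mu) + 1e-9)
        assert gd @ (y - x) <= np.dot(gd, gd) / mu + 1e-9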
L-smooth convex functions are widely considered in optimization; they have the fundamental properties below.
[Figure: for an L-smooth convex f, f(y) lies between the tangent line at (x, f(x)) and the quadratic upper bound of Theorem 2.5.]
Suppose f : ℝⁿ → ℝ is differentiable. Then f is L-smooth and convex if and only if any one of the following holds:

0 ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/2)‖y − x‖₂², ∀ x, y ∈ ℝⁿ, (5)

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ (1/(2L))‖∇f(y) − ∇f(x)‖₂², ∀ x, y ∈ ℝⁿ, (6)

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (1/L)‖∇f(y) − ∇f(x)‖₂², ∀ x, y ∈ ℝⁿ, (7)

0 ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ L‖y − x‖₂², ∀ x, y ∈ ℝⁿ. (8)
Proof. Step 1: we show (5) holds when f is L-smooth and convex. In fact, (5) is a direct result of Theorem 3.4 and Theorem 2.5.
Step 2: we show (5) implies (6). Fix x ∈ ℝⁿ and let φ(y) := f(y) − f(x) − ⟨∇f(x), y − x⟩. By (5) we know φ(x) = 0 is the minimum of φ. Substituting y by y − (1/L)∇φ(y) and applying (5), we obtain

0 ≤ φ(y − (1/L)∇φ(y)) = f(y − (1/L)∇φ(y)) − f(x) − ⟨∇f(x), y − (1/L)∇φ(y) − x⟩
  ≤ f(y) − (1/L)⟨∇f(y), ∇φ(y)⟩ + (L/2)‖(1/L)∇φ(y)‖₂² − f(x) − ⟨∇f(x), y − (1/L)∇φ(y) − x⟩
  = f(y) − f(x) − ⟨∇f(x), y − x⟩ − (1/(2L))‖∇f(y) − ∇f(x)‖₂².
Step 3: we show (6) implies (7). Adding (6) to its copy with x and y interchanged gives

(1/L)‖∇f(y) − ∇f(x)‖₂² = (1/(2L))‖∇f(y) − ∇f(x)‖₂² + (1/(2L))‖∇f(x) − ∇f(y)‖₂²
  ≤ (f(y) − f(x) − ⟨∇f(x), y − x⟩) + (f(x) − f(y) − ⟨∇f(y), x − y⟩)
  = ⟨∇f(y) − ∇f(x), y − x⟩.
Step 4: we show (7) implies f is L-smooth and convex. The convexity can be justified by directly applying Theorem 3.4. By Cauchy's inequality, we have

(1/L)‖∇f(y) − ∇f(x)‖₂² ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ ‖∇f(y) − ∇f(x)‖₂ · ‖y − x‖₂,

which implies

‖∇f(y) − ∇f(x)‖₂ ≤ L‖y − x‖₂.
Step 5: we show (5) implies (8). This can be achieved by directly adding two copies of (5) with x and y interchanged.
Step 6: we show (8) implies (5). By Theorem 3.4, it remains to show the second inequality in (5). Indeed,

f(y) − f(x) − ⟨∇f(x), y − x⟩ = ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt
  ≤ ∫₀¹ (L/t)‖t(y − x)‖₂² dt
  = (L/2)‖y − x‖₂².
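Properties (5)–(8) can all be verified numerically on the least squares objective, which is L-smooth and convex with L = λ_max(AᵀA). A minimal NumPy sketch with random data:

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((10, 4))
    b = rng.standard_normal(10)
    L = np.linalg.eigvalsh(A.T @ A).max()

    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
    grad = lambda x: A.T @ (A @ x - b)

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        gap = f(y) - f(x) - grad(x) @ (y - x)
        gd = grad(y) - grad(x)
        d2 = np.linalg.norm(y - x) ** 2
        assert -1e-9 <= gap <= 0.5 * L * d2 + 1e-9          # (5)
        assert gap >= np.dot(gd, gd) / (2 * L) - 1e-9       # (6)
        assert gd @ (y - x) >= np.dot(gd, gd) / L - 1e-9    # (7)
        assert -1e-9 <= gd @ (y - x) <= L * d2 + 1e-9       # (8)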
We also have the following property for L-smooth and µ-strongly convex functions. By Theorem 2.5 and Theorem 3.7, we easily obtain the fact that µ ≤ L. Moreover,

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (µL/(µ+L))‖y − x‖₂² + (1/(µ+L))‖∇f(y) − ∇f(x)‖₂².
Proof. If µ = L, we have

(µL/(µ+L))‖y − x‖₂² + (1/(µ+L))‖∇f(y) − ∇f(x)‖₂²
= (µ/2)‖y − x‖₂² + (1/(2L))‖∇f(y) − ∇f(x)‖₂²
≤ (1/2)⟨∇f(y) − ∇f(x), y − x⟩ + (1/2)⟨∇f(y) − ∇f(x), y − x⟩
= ⟨∇f(y) − ∇f(x), y − x⟩,

where the inequality bounds the first term by strong monotonicity (⟨∇f(y) − ∇f(x), y − x⟩ ≥ µ‖y − x‖₂², from (3)) and the second term by (7).
If µ < L, then g(x) := f(x) − (µ/2)‖x‖₂² is an (L − µ)-smooth convex function, implying by (7) that

⟨∇g(y) − ∇g(x), y − x⟩ ≥ (1/(L − µ))‖∇g(y) − ∇g(x)‖₂²,

which, after substituting ∇g(x) = ∇f(x) − µx and rearranging, is equivalent to

⟨∇f(y) − ∇f(x), y − x⟩ ≥ (µL/(µ+L))‖y − x‖₂² + (1/(µ+L))‖∇f(y) − ∇f(x)‖₂².
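A minimal numeric check of this coupled bound on a quadratic f(x) = (1/2)xᵀQx (an illustrative choice), for which µ = λ_min(Q) and L = λ_max(Q):

    import numpy as np

    rng = np.random.default_rng(8)
    B = rng.standard_normal((4, 4))
    Q = B @ B.T + 0.1 * np.eye(4)
    evals = np.linalg.eigvalsh(Q)
    mu, L = evals.min(), evals.max()          # f is mu-strongly convex and L-smooth

    grad = lambda x: Q @ x                    # gradient of f(x) = 0.5 * x^T Q x

    for _ in range(1000):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        gd = grad(y) - grad(x)
        lhs = gd @ (y - x)
        rhs = (mu * L / (mu + L) * np.linalg.norm(y - x) ** 2
               + np.dot(gd, gd) / (mu + L))
        assert lhs >= rhs - 1e-9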