
CHAPTER 0.

PRELIMINARY

Yutong He

1 Norm
Definition 1.1 (Norm). A real-valued function k · k defined on a linear space E is
called a norm if it satisfies:

1. (positive definiteness) ∀x ∈ E, we have kxk ≥ 0, and kxk = 0 if and only if x = 0.

2. (absolute homogeneity) ∀x ∈ E, α ∈ R, we have kαxk = |α| · kxk.

3. (triangle inequality) ∀x, y ∈ E, we have kx + yk ≤ kxk + kyk.

1.1 Vector Norm


We consider vector norms defined on the vector space E = Rn .

Definition 1.2 (`p -norm). The `p -norm (p ≥ 1) is defined as:

kxkp = (|x1 |p + |x2 |p + · · · + |xn |p )1/p ,

for all x = (x1 , x2 , · · · , xn )> ∈ Rn .

Specifically, we have the following commonly used definitions of the `1 -norm and `2 -norm for
a vector x = (x1 , x2 , · · · , xn )> ∈ Rn . When p = 1, the `1 -norm is given by:

kxk1 = |x1 | + |x2 | + · · · + |xn |.

When p = 2, the `2 -norm is the same as the Euclidean norm:

kxk2 = √(x1^2 + x2^2 + · · · + xn^2).

We have the following useful Cauchy’s inequality for `2 norm:

Proposition 1.3 (Cauchy’s inequality). ∀x, y ∈ Rn , we have

|hx, yi| ≤ kxk2 · kyk2 ,

where hx, yi = x> y denotes the inner product of x and y, and the equality holds if
and only if x and y are linearly dependent (i.e., ∃ α, β ∈ R, s.t. α^2 + β^2 > 0 and
αx + βy = 0).

Definition 1.4 (`∞ -norm). The `∞ -norm is defined as:

kxk∞ = max{|x1 |, |x2 |, · · · , |xn |},

for all x = (x1 , x2 , · · · , xn )> ∈ Rn .

Definition 1.5 (Norm induced by a positive definite matrix A). Given a positive
definite matrix A ∈ Rn×n , we define:

kxkA = √(x> Ax),

for all x = (x1 , x2 , · · · , xn )> ∈ Rn .
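The definitions above are easy to check numerically. The following is a minimal sketch using NumPy (the vector x and the positive definite matrix A are arbitrary examples, not taken from the text):

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.abs(x).sum()          # l1-norm: |3| + |-4| = 7
l2 = np.sqrt((x ** 2).sum())  # l2-norm: sqrt(9 + 16) = 5
linf = np.abs(x).max()        # l-infinity norm: max(3, 4) = 4

# Norm induced by a positive definite A: ||x||_A = sqrt(x^T A x).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])    # diagonal, eigenvalues 2 and 1 > 0
xA = np.sqrt(x @ A @ x)       # sqrt(2*9 + 1*16) = sqrt(34)
```

Note the square root in kxkA : without it, absolute homogeneity would fail, so x> Ax alone is not a norm.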

1.2 Matrix norm


We consider matrix norms defined on the matrix space E = Rm×n .

Definition 1.6 (Frobenius norm / F-norm). The Frobenius norm (or F-norm) of
matrix A ∈ Rm×n is defined as:
kAkF = √(tr(A> A)) = ( Σ_{i=1}^{m} Σ_{j=1}^{n} A_{i,j}^2 )^{1/2} ,

where A_{i,j} represents the (i, j)-th element of matrix A.

The Frobenius norm has properties similar to those of the `2 -norm. For example, we have the
following Cauchy’s inequality for the Frobenius norm.

Proposition 1.7 (Cauchy’s inequality). ∀A, B ∈ Rm×n , we have

|hA, Bi| ≤ kAkF · kBkF ,

where hA, Bi = tr(A> B) = tr(AB > ) denotes the Frobenius inner product of A and
B, and the equality holds if and only if A and B are linearly dependent.

Definition 1.8 (Spectral norm). The spectral norm of matrix A ∈ Rm×n is defined
as:
kAk2 = max_{x∈Rn , kxk2 =1} kAxk2 .

For a given matrix A ∈ Rm×n , it can be shown that kAk2 equals the largest singular
value of A, and kAk22 equals the largest eigenvalue of A> A (and AA> ).
By definition, we also have the following inequality:

kAxk2 ≤ kAk2 kxk2 ,

for all A ∈ Rm×n and x ∈ Rn .

Definition 1.9 (Nuclear norm). The nuclear norm of matrix A ∈ Rm×n is defined
as:
kAk∗ = σ1 + σ2 + · · · + σr ,

where r = rank(A) and σ1 ≥ σ2 ≥ · · · ≥ σr are the non-zero singular values of A.

We have the following inequalities for the matrix norms:

kABkF ≤ kAk2 kBkF ,

|hA, Bi| ≤ kAk2 kBk∗ .
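These matrix norms, and the two inequalities above, can be checked numerically. A minimal sketch with NumPy (the random matrices are arbitrary examples; `np.linalg.norm` computes the Frobenius, spectral, and nuclear norms directly):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 5))

fro = np.linalg.norm(A, 'fro')   # sqrt(tr(A^T A))
spec = np.linalg.norm(A, 2)      # largest singular value
nuc = np.linalg.norm(A, 'nuc')   # sum of singular values

# Consistency with the singular values of A (returned in descending order).
s = np.linalg.svd(A, compute_uv=False)

# ||AC||_F <= ||A||_2 ||C||_F
ok1 = np.linalg.norm(A @ C, 'fro') <= spec * np.linalg.norm(C, 'fro') + 1e-12

# |<A, B>| <= ||A||_2 ||B||_*
ok2 = abs(np.trace(A.T @ B)) <= spec * np.linalg.norm(B, 'nuc') + 1e-12
```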

2 Gradient
2.1 Gradient and Hessian matrix

Definition 2.1 (Gradient). Suppose f : Rn → R is continuously differentiable, its
gradient ∇f : Rn → Rn is defined at point x = (x1 , x2 , · · · , xn )> as:

∇f (x) = ( ∂f (x)/∂x1 , ∂f (x)/∂x2 , · · · , ∂f (x)/∂xn )> .

Definition 2.2 (Hessian matrix). Suppose f : Rn → R is twice continuously differentiable,
its Hessian matrix ∇2 f : Rn → Rn×n is defined at point x = (x1 , x2 , · · · , xn )>
as the n × n matrix with entries

[∇2 f (x)]j,k = ∂^2 f (x) / (∂xj ∂xk ), for j, k = 1, 2, · · · , n.

As an example, we consider gradients and Hessian matrices of two typical functions.

1. (The least squares problem) Let f (x) := (1/2) kAx − bk22 ; we have ∇f (x) = A> (Ax − b)
and ∇2 f (x) = A> A.
2. (The logistic regression problem) Let f (x) := (1/M ) Σ_{i=1}^{M} ln(1 + exp(−bi ai> x)); we have

∇f (x) = (1/M ) Σ_{i=1}^{M} [ −bi exp(−bi ai> x) / (1 + exp(−bi ai> x)) ] ai ,

and

[∇2 f (x)]j,k = (1/M ) Σ_{i=1}^{M} [ bi^2 exp(−bi ai> x) ai,j ai,k / (1 + exp(−bi ai> x))^2 ].
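The closed-form gradient of the least squares objective can be validated against central finite differences; a minimal sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
grad = A.T @ (A @ x - b)   # closed-form gradient from the text
hess = A.T @ A             # closed-form Hessian from the text

# Independent check: central finite differences along each coordinate.
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
err = np.abs(grad - num_grad).max()
```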

Like univariate functions, we have Taylor expansions in the multivariate case:

Proposition 2.3 (Taylor expansion). Suppose f : Rn → R is continuously differentiable,
∀x, y ∈ Rn , we have

f (x + y) = f (x) + h∇f (x + θ1 y), yi,

for some θ1 ∈ (0, 1). If f is further twice continuously differentiable, it holds that

∇f (x + y) = ∇f (x) + ∫_0^1 ∇2 f (x + ty) y dt,

and

f (x + y) = f (x) + h∇f (x), yi + (1/2) y> ∇2 f (x + θ2 y) y,

for some θ2 ∈ (0, 1).

2.2 L-smoothness

Definition 2.4 (L-smoothness). Function f : Rn → R is said to be L-smooth if


∀x, y ∈ Rn ,
k∇f (x) − ∇f (y)k2 ≤ Lkx − yk2 ,

where L > 0 is the Lipschitz constant of ∇f .

Theorem 2.5 (L-smooth property). Suppose f : Rn → R is L-smooth, it holds for
∀x, y ∈ Rn that

f (y) ≤ f (x) + h∇f (x), y − xi + (L/2) ky − xk22 .

Proof. Let g(t) := f (x + t(y − x)), we have

g′(t) = h∇f (x + t(y − x)), y − xi,

thus

f (y) − f (x) − h∇f (x), y − xi
= g(1) − g(0) − g′(0)
= ∫_0^1 (g′(t) − g′(0)) dt
= ∫_0^1 h∇f (x + t(y − x)) − ∇f (x), y − xi dt
≤ ∫_0^1 k∇f (x + t(y − x)) − ∇f (x)k2 · ky − xk2 dt
≤ ∫_0^1 Lt ky − xk22 dt
= (L/2) ky − xk22 .

Figure 1: The L-smooth property: f (y) is upper-bounded by the quadratic
f (x) + h∇f (x), y − xi + (L/2) ky − xk22 anchored at (x, f (x)).
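For the least squares objective f (x) = (1/2) kAx − bk22 , the gradient is Lipschitz with L = kA> Ak2 , so Theorem 2.5 can be checked numerically; a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)

f = lambda v: 0.5 * np.linalg.norm(A @ v - b) ** 2
grad = lambda v: A.T @ (A @ v - b)
L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f

# Check the quadratic upper bound of Theorem 2.5 at random point pairs.
violations = 0
for _ in range(100):
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)
    bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    if f(y) > bound + 1e-9:
        violations += 1
```

For this quadratic f the bound is in fact tight in the direction of the top eigenvector of A> A, which is why L cannot be taken any smaller.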

3 Convexity
3.1 Convex set

Definition 3.1 (Convex set). A set X ⊆ Rn is called convex, if for ∀x, y ∈ X , it


holds that
θx + (1 − θ)y ∈ X , ∀θ ∈ [0, 1].

Theorem 3.2 (Projection onto closed convex sets). Suppose X ⊆ Rn is a closed


convex set, then for ∀y ∈ Rn , there exists a unique x? ∈ X such that kx? − yk2 ≤
kx − yk2 for any x ∈ X . The point x? is called the projection of y onto X , denoted
by PX (y).

Proof. We first prove the existence of point x? . For a given y ∈ Rn , let d := inf x∈X kx − yk2 ,
then there exists a sequence {xk }_{k=1}^∞ ⊆ X such that kxk − yk22 ≤ d^2 + 1/k. For any integers
0 < m < n, by the convexity of X we have (xm + xn )/2 ∈ X . By the definition of d, we
have k(xm + xn )/2 − yk2 ≥ d. Consequently,

hxm − y, xn − yi = 2 k(xm + xn )/2 − yk22 − (kxm − yk22 + kxn − yk22 )/2 ≥ d^2 − 1/m,
⇒ kxm − xn k22 = kxm − yk22 + kxn − yk22 − 2 hxm − y, xn − yi ≤ 4/m.

Thus {xk }_{k=1}^∞ is a Cauchy sequence, which implies the existence of a point x? ∈ Rn such that
x? = limk→∞ xk . By the closedness of X we know x? ∈ X , and by the continuity of the `2 -norm we
know kx? − yk2 = d. Next we show the uniqueness of x? . Otherwise ∃ x?1 , x?2 ∈ X (x?1 ≠ x?2 )
such that kx?1 − yk2 = kx?2 − yk2 = d; we thus have

k(x?1 + x?2 )/2 − yk22 = (1/2) kx?1 − yk22 + (1/2) kx?2 − yk22 − (1/4) kx?1 − x?2 k22 < d^2 ,

a contradiction with the definition of d.
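For the closed Euclidean ball, the projection PX (y) has the closed form y · min{1, r/kyk2 }, which makes the theorem easy to probe numerically; a sketch (the ball and the sampled feasible points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def project_ball(y, radius=1.0):
    """Projection onto the closed Euclidean ball {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

y = np.array([3.0, 4.0])   # ||y||_2 = 5, outside the unit ball
p = project_ball(y)        # should be y / 5 = (0.6, 0.8)

# p is at least as close to y as any other feasible point (sampled check).
d_star = np.linalg.norm(p - y)
samples = rng.standard_normal((1000, 2))
samples /= np.maximum(1.0, np.linalg.norm(samples, axis=1, keepdims=True))
farther = np.linalg.norm(samples - y, axis=1) >= d_star - 1e-12
```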

3.2 Convex function

Definition 3.3 (Convex function). Function f : X → R is said to be convex if


X ⊆ Rn is a convex set and ∀x, y ∈ X , it holds that

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y), ∀ θ ∈ [0, 1].

Here we list some common convex functions:

1. vector norm: f (x) = kxk,

2. quadratic function: f (x) = (1/2) x> Ax where A is positive semi-definite,

3. linear function: f (x) = ha, xi for some a ∈ Rn ,

4. non-negative combination of convex functions: f (x) = α1 f1 (x) + α2 f2 (x) + · · · + αm fm (x)
where f1 , f2 , · · · , fm are convex and α1 , α2 , · · · , αm are non-negative.

Theorem 3.4 (Convex property). Suppose f : X → R is differentiable, then f is
convex if and only if

f (y) ≥ f (x) + h∇f (x), y − xi, ∀ x, y ∈ X . (1)

Similarly, f is convex if and only if

h∇f (y) − ∇f (x), y − xi ≥ 0, ∀ x, y ∈ X . (2)

Proof. Step 1: we prove that (1) holds if f is convex. By definition we have for any θ ∈ [0, 1]
that f (θy + (1 − θ)x) ≤ θf (y) + (1 − θ)f (x). Thus,

f (y) − f (x) = lim_{θ→0+} [θf (y) + (1 − θ)f (x) − f (x)] / θ
≥ lim_{θ→0+} [f (θy + (1 − θ)x) − f (x)] / θ
= lim_{θ→0+} [f (x + θ(y − x)) − f (x)] / θ
= h∇f (x), y − xi.

Step 2: we prove that f is convex if (1) holds. By (1) we have for ∀x, y ∈ X and θ ∈ [0, 1]
that

θf (x) + (1 − θ)f (y) − f (θx + (1 − θ)y)
= θ(f (x) − f (θx + (1 − θ)y)) + (1 − θ)(f (y) − f (θx + (1 − θ)y))
≥ θ h∇f (θx + (1 − θ)y), (1 − θ)(x − y)i + (1 − θ) h∇f (θx + (1 − θ)y), θ(y − x)i
= 0.

Step 3: we prove that (1) implies (2). By (1) we have

f (y) + f (x) ≥ (f (x) + h∇f (x), y − xi) + (f (y) + h∇f (y), x − yi)
= f (x) + f (y) − h∇f (y) − ∇f (x), y − xi,

which is equivalent to (2).


Step 4: we prove that (2) implies (1). This follows from

f (y) − f (x) = ∫_0^1 h∇f (x + t(y − x)), y − xi dt
= ∫_0^1 ( h∇f (x), y − xi + (1/t) h∇f (x + t(y − x)) − ∇f (x), t(y − x)i ) dt
≥ ∫_0^1 h∇f (x), y − xi dt
= h∇f (x), y − xi.

Figure 2: The convex property: f (y) lies above the tangent line f (x) + h∇f (x), y − xi
at the point (x, f (x)).
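Inequality (1) can be spot-checked for a concrete convex function, e.g. the logistic regression objective from Section 2; a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(4)
M, n = 20, 3
a = rng.standard_normal((M, n))            # rows are the a_i
b = rng.choice([-1.0, 1.0], size=M)        # labels b_i

f = lambda x: np.mean(np.log1p(np.exp(-b * (a @ x))))

def grad(x):
    z = -b * (a @ x)
    s = np.exp(z) / (1 + np.exp(z))        # sigmoid of z
    return (a * (-b * s)[:, None]).mean(axis=0)

# Check f(y) >= f(x) + <grad f(x), y - x> at random point pairs.
violations = 0
for _ in range(200):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    if f(y) < f(x) + grad(x) @ (y - x) - 1e-9:
        violations += 1
```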

The following Jensen’s inequality is an important property of convex functions.

Proposition 3.5 (Jensen’s inequality). Let f : X → R be convex, then for any


x1 , x2 , · · · , xm ∈ X and non-negative θ1 , θ2 , · · · , θm satisfying θ1 + θ2 + · · · + θm = 1,
it holds that

f (θ1 x1 + θ2 x2 + · · · + θm xm ) ≤ θ1 f (x1 ) + θ2 f (x2 ) + · · · + θm f (xm ).

Definition 3.6 (µ-strongly convex function). Function f : X → R is said to be
µ-strongly convex if

g(x) := f (x) − (µ/2) kxk22

is a convex function.

It can be proved that f : X → R is µ-strongly convex if and only if for ∀x, y ∈ X and
θ ∈ [0, 1], it holds that

f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y) − (µ/2) θ(1 − θ) kx − yk22 .

Theorem 3.7 (µ-strongly convex property). Function f : X → R is µ-strongly
convex if and only if

f (y) ≥ f (x) + h∇f (x), y − xi + (µ/2) ky − xk22 , ∀ x, y ∈ X , (3)

if and only if

h∇f (y) − ∇f (x), y − xi ≥ µ ky − xk22 , ∀ x, y ∈ X . (4)

Proof. Let g(x) := f (x) − (µ/2) kxk22 , then f is µ-strongly convex if and only if g is convex.
Note that (3) is equivalent to

g(y) + (µ/2) kyk22 ≥ g(x) + (µ/2) kxk22 + h∇g(x) + µx, y − xi + (µ/2) ky − xk22 ,
⇔ g(y) ≥ g(x) + h∇g(x), y − xi,

and that (4) is equivalent to

h∇g(y) + µy − ∇g(x) − µx, y − xi ≥ µ ky − xk22
⇔ h∇g(y) − ∇g(x), y − xi ≥ 0,

so it suffices to apply Theorem 3.4.
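For a quadratic f (x) = (1/2) x> Qx with Q positive definite, f is µ-strongly convex with µ = λmin (Q), so (3) and (4) can be checked directly; a sketch with an arbitrary random Q:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
Q = B.T @ B + 0.5 * np.eye(4)   # symmetric positive definite
mu = np.linalg.eigvalsh(Q)[0]   # smallest eigenvalue = strong convexity modulus

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

# Check the lower bound (3) and the monotonicity bound (4) at random pairs.
violations = 0
for _ in range(100):
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)
    lower = f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2
    mono = (grad(y) - grad(x)) @ (y - x) >= mu * np.linalg.norm(y - x) ** 2 - 1e-9
    if f(y) < lower - 1e-9 or not mono:
        violations += 1
```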

Theorem 3.8 (µ-strongly convex property). Let f : Rn → R be µ-strongly convex,
then the following inequalities hold:

f (y) ≤ f (x) + h∇f (x), y − xi + (1/(2µ)) k∇f (y) − ∇f (x)k22 , ∀ x, y ∈ Rn ,

h∇f (y) − ∇f (x), y − xi ≤ (1/µ) k∇f (y) − ∇f (x)k22 , ∀ x, y ∈ Rn .

Proof. Fix x ∈ Rn , and let φ(y) := f (y) − f (x) − h∇f (x), y − xi, then φ is µ-strongly
convex with minimum φ(x) = 0. Thus,

0 = min_v φ(v) ≥ min_v { φ(y) + h∇φ(y), v − yi + (µ/2) kv − yk22 }
= φ(y) − (1/µ) k∇φ(y)k22 + (µ/2) k(1/µ) ∇φ(y)k22 = φ(y) − (1/(2µ)) k∇φ(y)k22
= f (y) − f (x) − h∇f (x), y − xi − (1/(2µ)) k∇f (y) − ∇f (x)k22 ,

which is equivalent to

f (y) ≤ f (x) + h∇f (x), y − xi + (1/(2µ)) k∇f (y) − ∇f (x)k22 .

Summing two copies of the above inequality with x and y interchanged, we obtain

h∇f (y) − ∇f (x), y − xi ≤ (1/µ) k∇f (y) − ∇f (x)k22 .

L-smooth convex functions are widely considered in optimization; they have the fundamental
properties below.

Figure 3: The µ-strongly convex property: f (y) lies above the quadratic lower bound
f (x) + h∇f (x), y − xi + (µ/2) ky − xk22 , which in turn lies above the tangent line
f (x) + h∇f (x), y − xi.

Theorem 3.9 (L-smooth convex property). Let f : Rn → R be differentiable, then


f is convex and L-smooth if and only if any one of the following conditions holds:

0 ≤ f (y) − f (x) − h∇f (x), y − xi ≤ (L/2) ky − xk22 , ∀ x, y ∈ Rn , (5)

f (y) − f (x) − h∇f (x), y − xi ≥ (1/(2L)) k∇f (y) − ∇f (x)k22 , ∀ x, y ∈ Rn , (6)

h∇f (y) − ∇f (x), y − xi ≥ (1/L) k∇f (y) − ∇f (x)k22 , ∀ x, y ∈ Rn , (7)

0 ≤ h∇f (y) − ∇f (x), y − xi ≤ L ky − xk22 , ∀ x, y ∈ Rn . (8)

Proof. Step 1: we show (5) holds when f is L-smooth and convex. In fact, (5) is a direct
result of Theorem 3.4 and Theorem 2.5.
Step 2: we show (5) implies (6). Fix x ∈ Rn and let φ(y) := f (y) − f (x) − h∇f (x), y − xi.
By (5) we know φ(x) = 0 is the minimum of φ. Substituting y by y − (1/L) ∇φ(y) and applying
(5), we obtain

0 ≤ φ(y − (1/L) ∇φ(y)) = f (y − (1/L) ∇φ(y)) − f (x) − h∇f (x), y − (1/L) ∇φ(y) − xi
≤ f (y) − (1/L) h∇f (y), ∇φ(y)i + (L/2) k(1/L) ∇φ(y)k22 − f (x) − h∇f (x), y − (1/L) ∇φ(y) − xi
= f (y) − f (x) − h∇f (x), y − xi − (1/(2L)) k∇f (y) − ∇f (x)k22 .

Step 3: we show (6) implies (7).

(1/L) k∇f (y) − ∇f (x)k22 = (1/(2L)) k∇f (y) − ∇f (x)k22 + (1/(2L)) k∇f (x) − ∇f (y)k22
≤ (f (y) − f (x) − h∇f (x), y − xi) + (f (x) − f (y) − h∇f (y), x − yi)
= h∇f (y) − ∇f (x), y − xi.

Step 4: we show (7) implies f is L-smooth and convex. The convexity can be justified by
directly applying Theorem 3.4. By Cauchy’s inequality, we have

(1/L) k∇f (y) − ∇f (x)k22 ≤ h∇f (y) − ∇f (x), y − xi ≤ k∇f (y) − ∇f (x)k2 · ky − xk2 ,

which implies

k∇f (y) − ∇f (x)k2 ≤ L ky − xk2 .

Step 5: we show (5) implies (8). This can be achieved by directly adding two copies of (5)
with x and y interchanged.
Step 6: we show (8) implies (5). By Theorem 3.4, it remains to show the second inequality
in (5).
f (y) − f (x) − h∇f (x), y − xi = ∫_0^1 h∇f (x + t(y − x)) − ∇f (x), y − xi dt
≤ ∫_0^1 (L/t) kt(y − x)k22 dt
= (L/2) ky − xk22 .
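Inequality (7) can be verified exactly for the least squares objective, whose gradient difference is ∇f (y) − ∇f (x) = A> A(y − x); a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 4))
H = A.T @ A                    # Hessian of f(x) = 0.5 ||Ax - b||_2^2 (any b)
L = np.linalg.eigvalsh(H)[-1]  # largest eigenvalue = smoothness constant

# Check the co-coercivity inequality (7) at random difference vectors.
ok = True
for _ in range(100):
    d = rng.standard_normal(4)     # d plays the role of y - x
    gdiff = H @ d                  # grad f(y) - grad f(x)
    if gdiff @ d < gdiff @ gdiff / L - 1e-9:
        ok = False
```

The inequality reduces here to d> H d ≥ (1/L) d> H^2 d, which holds since every eigenvalue of H is at most L.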

We also have the following property for L-smooth and µ-strongly convex functions. By
Theorem 2.5 and Theorem 3.7 we can easily see that µ ≤ L.

Theorem 3.10 (L-smooth µ-strongly convex property). Let f : Rn → R be L-smooth
and µ-strongly convex, then for ∀ x, y ∈ Rn , the following inequality holds:

h∇f (y) − ∇f (x), y − xi ≥ (µL/(µ + L)) ky − xk22 + (1/(µ + L)) k∇f (y) − ∇f (x)k22 .

Proof. If µ = L, we have

(µL/(µ + L)) ky − xk22 + (1/(µ + L)) k∇f (y) − ∇f (x)k22
= (µ/2) ky − xk22 + (1/(2L)) k∇f (y) − ∇f (x)k22
≤ (1/2) h∇f (y) − ∇f (x), y − xi + (1/2) h∇f (y) − ∇f (x), y − xi
= h∇f (y) − ∇f (x), y − xi,

where the two terms are bounded using (4) and (7), respectively.

If µ < L, then g(x) := f (x) − (µ/2) kxk22 is an (L − µ)-smooth convex function, implying

h∇g(y) − ∇g(x), y − xi ≥ (1/(L − µ)) k∇g(y) − ∇g(x)k22 ,

which is equivalent to

h∇f (y) − ∇f (x), y − xi ≥ (µL/(µ + L)) ky − xk22 + (1/(µ + L)) k∇f (y) − ∇f (x)k22 .
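Theorem 3.10 can likewise be checked on a strongly convex quadratic, where µ and L are the extreme eigenvalues of the Hessian; a sketch with an arbitrary random Hessian:

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
H = B.T @ B + 0.1 * np.eye(4)   # Hessian of a strongly convex quadratic
eigs = np.linalg.eigvalsh(H)    # ascending order
mu, L = eigs[0], eigs[-1]

# Check <grad f(y) - grad f(x), y - x> >= muL/(mu+L)||y-x||^2
#                                         + 1/(mu+L)||grad f(y) - grad f(x)||^2.
ok = True
for _ in range(100):
    d = rng.standard_normal(4)  # d plays the role of y - x
    gdiff = H @ d               # grad f(y) - grad f(x)
    lhs = gdiff @ d
    rhs = mu * L / (mu + L) * (d @ d) + gdiff @ gdiff / (mu + L)
    if lhs < rhs - 1e-9:
        ok = False
```

Eigenvalue by eigenvalue, the inequality reduces to (λ − µ)(L − λ) ≥ 0, which holds for every eigenvalue λ ∈ [µ, L] of H.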

