Lecture02a Optimization Annotated PDF
Optimization
Minor changes by Martin Jaggi 2019, 2018, 2017; © Martin Jaggi and Mohammad Emtiyaz Khan 2016.
Last updated on: September 19, 2019
Learning / Estimation / Fitting
Given a cost function L(w), we wish to find w⋆ which minimizes the cost:

    min_w L(w)
    subject to w ∈ ℝ^D

Grid Search. The simplest brute-force method evaluates L over a grid of candidate values of w and keeps the best one; this works when we have few parameters and the cost is easy to compute.
For a large number of parameters D, however, grid search has too many "for-loops", resulting in an exponential computational complexity: if we decide to use 10 possible values for each dimension of w ∈ ℝ^D, then we have to check 10^D points. This is clearly impossible for most practical machine learning models, which can often have D ≈ millions of parameters. Choosing a good range of values for each dimension is another problem.
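To see the exponential blow-up concretely, here is a minimal grid-search sketch (the quadratic toy cost and the search range [−5, 5] per dimension are illustrative assumptions, not from the lecture):

```python
import itertools

import numpy as np

def grid_search(cost, num_values=10, D=2):
    """Brute-force grid search: evaluates the cost on num_values**D grid points."""
    grid_1d = np.linspace(-5.0, 5.0, num_values)  # illustrative range per dimension
    best_w, best_cost = None, float("inf")
    # itertools.product realizes the D nested "for-loops": num_values**D iterations.
    for w in itertools.product(grid_1d, repeat=D):
        c = cost(np.array(w))
        if c < best_cost:
            best_w, best_cost = np.array(w), c
    return best_w, best_cost

# Toy quadratic cost with minimum at (1, 2); the grid spacing limits the accuracy.
w_star, _ = grid_search(lambda w: np.sum((w - np.array([1.0, 2.0])) ** 2))
```

Already at D = 20, the same loop would need 10^20 cost evaluations.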
Optimization Landscapes
[Figure: an optimization landscape L(w) over w ∈ ℝ^D, with the minimizer w⋆ marked.]
A local or global minimum is said to be strict if the corresponding inequality holds strictly.
[Figure: surface plots of two cost functions over a two-dimensional w: a smooth bowl-shaped cost, and the MAE cost, which has non-differentiable ridges.]
Definition of the gradient:

    ∇L(w) := (∂L(w)/∂w_1, ..., ∂L(w)/∂w_D)ᵀ

This is a vector, ∇L(w) ∈ ℝ^D.
Gradient Descent

To minimize the function, we iteratively take a step in the (opposite) direction of the gradient, for t = 1, 2, ..., T:

    w^(t+1) := w^(t) − γ ∇L(w^(t)),

where γ > 0 is the step size.

Example: for the one-parameter model L(w_0) = (1/2N) Σ_n (y_n − w_0)², the gradient is w_0 − ȳ, and the update becomes

    w_0^(t+1) := (1 − γ) w_0^(t) + γ ȳ,

where ȳ := Σ_n y_n / N. When is this sequence guaranteed to converge?
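The gradient descent update, applied to the one-parameter example above, can be sketched as follows (the toy data and the step size γ = 0.3 are illustrative assumptions):

```python
import numpy as np

def gradient_descent(grad, w0, gamma, num_steps):
    """Iterates w^(t+1) := w^(t) - gamma * grad(w^(t))."""
    w = w0
    for _ in range(num_steps):
        w = w - gamma * grad(w)
    return w

# One-parameter example: L(w0) = (1/2N) * sum_n (y_n - w0)^2 has gradient w0 - y_bar,
# so the update is exactly w0^(t+1) = (1 - gamma) * w0^(t) + gamma * y_bar.
y = np.array([1.0, 2.0, 3.0, 6.0])
y_bar = y.mean()
w_final = gradient_descent(lambda w: w - y_bar, w0=0.0, gamma=0.3, num_steps=100)
# For 0 < gamma < 2 the iterates converge to y_bar.
```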
Gradient Descent for Linear MSE

For linear regression with w = (w_1, ..., w_D):

    y = (y_1, y_2, ..., y_N)ᵀ

    X = ( x_11 x_12 ... x_1D )   ← data point 1: x_1ᵀ
        ( x_21 x_22 ... x_2D )
        (  ...           ...  )
        ( x_N1 x_N2 ... x_ND )
We define the error vector e:

    e = y − Xw,

so that the cost can be written as

    L(w) := (1/2N) Σ_{n=1}^N (y_n − x_nᵀw)² = (1/2N) eᵀe,

and then the gradient is given by

    ∇L(w) = −(1/N) Xᵀe.
Homework: computational cost. What is the complexity (# of operations) of computing the gradient? Computing e = y − Xw takes O(N·D) operations (dominated by the product Xw; subtracting y is only O(N)), and the product Xᵀe takes another O(N·D). In total, computing the gradient costs O(N·D) (order notation).
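As a sketch, the O(N·D) gradient computation is two matrix-vector products in numpy (the random toy data below is an assumption for illustration):

```python
import numpy as np

def mse_gradient(y, X, w):
    """Gradient of L(w) = (1/2N) * ||y - Xw||^2, i.e. -(1/N) X^T e with e = y - Xw."""
    N = len(y)
    e = y - X @ w          # O(N*D), dominated by the product Xw
    return -(X.T @ e) / N  # another O(N*D)

# Sanity check against a central finite difference in the first coordinate.
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)
L = lambda v: 0.5 / N * np.sum((y - X @ v) ** 2)
h = 1e-6
e1 = np.eye(D)[0]
approx = (L(w + h * e1) - L(w - h * e1)) / (2 * h)  # should match mse_gradient(y, X, w)[0]
```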
Variant with offset. Recall the alternative trick when also incorporating an offset term for the regression: we add an artificial constant feature 1 to each data point,

    ỹ = (y_1, y_2, ..., y_N)ᵀ

    X̃ = ( 1 x_11 x_12 ... x_1D )
        ( 1 x_21 x_22 ... x_2D )
        ( ...              ...  )
        ( 1 x_N1 x_N2 ... x_ND )

with weights w = (w_0, w_1, ..., w_D). Exercise: compute ∇L(w) for this variant.
Stochastic Gradient Descent

In machine learning, most cost functions are a sum (average) over the training examples:

    L(w) = (1/N) Σ_{n=1}^N L_n(w),

where L_n(w) is the cost of the n-th example; for linear MSE, L_n(w) = (y_n − x_nᵀw)²/2. The stochastic gradient descent (SGD) algorithm is given by the following update rule, at step t: sample one training data point n uniformly at random, and set

    w^(t+1) := w^(t) − γ ∇L_n(w^(t)).
Theoretical Motivation. Idea: a cheap but unbiased estimate of the gradient! In expectation over the random choice of n, we have

    E[∇L_n(w)] = ∇L(w),

so the SGD step direction is correct on average.

Mini-batch SGD. In an intermediate variant, we compute the gradient over a random subset B ⊆ {1, ..., N} of the examples,

    g := (1/|B|) Σ_{n∈B} ∇L_n(w^(t)),

and use the update rule

    w^(t+1) := w^(t) − γ g.

For B = {n} we recover plain SGD, while larger batches reduce the variance of g, and the terms of the sum can be computed independently (e.g. on several threads in parallel). Note that in the extreme case B := [N], we obtain (batch) gradient descent, i.e. g = ∇L.
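A minimal mini-batch SGD sketch for linear MSE (the batch size, step size and sampling with replacement are illustrative choices):

```python
import numpy as np

def sgd_linear_mse(y, X, gamma=0.1, batch_size=1, num_steps=1000, seed=0):
    """Mini-batch SGD for L(w) = (1/2N)||y - Xw||^2:
    w^(t+1) := w^(t) - gamma * g, with g = (1/|B|) sum_{n in B} grad L_n(w^(t))."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        B = rng.integers(0, N, size=batch_size)  # |B| points, uniformly at random
        e = y[B] - X[B] @ w                      # per-example errors, O(|B| * D)
        g = -(X[B].T @ e) / batch_size           # unbiased estimate of grad L(w)
        w = w - gamma * g
    return w

# batch_size=1 gives plain SGD; batch_size=N approaches full-batch gradient descent.
```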
Cost comparison. For the full gradient of the linear MSE,

    ∇L(w) = −(1/N) Xᵀ(y − Xw),

each step costs O(N·D). For one stochastic gradient, L_n(w) = (1/2)(y_n − x_nᵀw)² gives

    ∇L_n(w) = −e_n x_n,   with scalar error e_n = y_n − x_nᵀw ∈ ℝ,

so each SGD step only costs O(D).
Non-Smooth Optimization

An alternative characterization of convexity, for differentiable functions, is given by

    L(u) ≥ L(w) + ∇L(w)ᵀ(u − w)   ∀ u, w,

i.e. the function always lies above its linearization.

Subgradients

A vector g ∈ ℝ^D such that

    L(u) ≥ L(w) + gᵀ(u − w)   ∀ u

is called a subgradient of L at w. If L is convex and differentiable at w, then the only subgradient at w is g = ∇L(w).
Subgradient Descent

Identical to the gradient descent algorithm, but using a subgradient instead of the gradient (e.g. for non-smooth costs such as the hinge loss). Update rule:

    w^(t+1) := w^(t) − γ g,

where g is a subgradient of L at w^(t).

Toy example: h : ℝ → ℝ, h(e) := |e|. A subgradient of h at e is −1 if e < 0, +1 if e > 0, and any value in [−1, 1] if e = 0.
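For the toy example h(e) = |e|, one valid subgradient choice and the resulting descent can be sketched as follows (picking 0 at the kink, and the fixed step size, are illustrative choices):

```python
import numpy as np

def abs_subgradient(e):
    """A subgradient of h(e) = |e|: -1 for e < 0, +1 for e > 0; at the kink e = 0
    we pick 0, one valid element of the subgradient set [-1, 1]."""
    return np.sign(e)

def subgradient_descent(subgrad, w0, gamma, num_steps):
    """Identical to gradient descent, with a subgradient in place of the gradient."""
    w = w0
    for _ in range(num_steps):
        w = w - gamma * subgrad(w)
    return w

# Minimizing |w| with a fixed step size: the iterates end up oscillating within
# one step length of the optimum 0; a decreasing step size would converge exactly.
w_end = subgradient_descent(abs_subgradient, w0=1.05, gamma=0.1, num_steps=100)
```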
2. Recall the definition of the mean absolute error:

    L(w) = MAE(w) := (1/N) Σ_{n=1}^N |y_n − f(x_n)|.

Exercise: compute the set of subgradients of this cost at a given w.
Stochastic Subgradient Descent

Stochastic SubGradient Descent (still commonly abbreviated SGD): the same update, with g now being a subgradient of the possibly non-smooth, randomly selected L_n at the current iterate w^(t).

Computational cost: as in the differentiable case, one subgradient of L_n costs O(D), so each SGD step costs O(D).
Constrained Optimization

Sometimes, optimization problems come posed with additional constraints:

    min_w L(w),
    subject to w ∈ C.

[Figure: the cost L(w) minimized over a constraint set C ⊂ ℝ^D.]

Two ways to deal with the constraint:

A) Projected Gradient Descent
B) Transform it into an unconstrained problem
Convex Sets

A set C is convex iff the line segment between any two points of C lies in C, i.e., if for any u, v ∈ C and any θ with 0 ≤ θ ≤ 1, we have

    θu + (1 − θ)v ∈ C.

[Figure 2.2 from Boyd & Vandenberghe, Convex Optimization: some simple convex and nonconvex sets. Left: the hexagon, which includes its boundary (shown darker), is convex. Middle: the kidney-shaped set is not convex, since the line segment between the two points in the set shown as dots is not contained in the set. Right: the square contains some of its boundary points but not others, and is not convex.]

Roughly speaking, a set is convex if every point in the set can be seen by every other point, along an unobstructed straight path between them, where unobstructed means lying in the set. Every affine set is also convex, since it contains the entire line between any two distinct points in it, and therefore also the line segment between them.
Projected Gradient Descent

Idea: add a projection onto C after every step:

    P_C(w′) := arg min_{v∈C} ‖v − w′‖.

Update rule:

    w^(t+1) := P_C[ w^(t) − γ ∇L(w^(t)) ].

[Figure: a gradient step takes w outside of C ⊂ ℝ^D; the projection P_C maps it back onto C.]
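As an illustration, here is projected gradient descent for the special case where C is a Euclidean ball of radius r, for which the projection has a closed form (the quadratic toy cost is an assumption):

```python
import numpy as np

def project_l2_ball(w, r=1.0):
    """P_C(w') = argmin_{v in C} ||v - w'|| for C = {v : ||v||_2 <= r}:
    rescale onto the sphere whenever w' lies outside the ball."""
    norm = np.linalg.norm(w)
    return w if norm <= r else (r / norm) * w

def projected_gradient_descent(grad, project, w0, gamma, num_steps):
    """w^(t+1) := P_C[ w^(t) - gamma * grad(w^(t)) ]."""
    w = w0
    for _ in range(num_steps):
        w = project(w - gamma * grad(w))
    return w

# Minimize ||w - (2, 0)||^2 over the unit ball: the unconstrained optimum (2, 0)
# is infeasible, and the iterates settle on the boundary point (1, 0).
target = np.array([2.0, 0.0])
w_c = projected_gradient_descent(lambda w: 2 * (w - target), project_l2_ball,
                                 w0=np.zeros(2), gamma=0.1, num_steps=200)
```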
B) Penalty functions. Alternatively, we add the constraint as a penalty term to the cost and solve an unconstrained problem; e.g., for a linear constraint Aw = b,

    min_{w∈ℝ^D} L(w) + (λ/2) ‖Aw − b‖₂²,

where λ > 0 controls the trade-off between the cost and the constraint violation.

• Linearized Penalty Functions (see Lagrange Multipliers)
Implementation Issues

For gradient methods:

Stopping criteria: stop when ‖∇L(w)‖ is (approximately) zero; then we are (approximately) at a critical point.

Optimality: if the second-order derivative is positive (positive semi-definite, to be precise), then it is a (possibly local) minimum. If the function is also convex, then this condition implies that we are at a global optimum. See the supplementary section on Optimality Conditions.

Step size: the right choice of γ depends on the problem.

Line-search methods: for some objectives L, we can set the step size automatically using a line-search method. More details on "backtracking" methods can be found in Chapter 1 of Bertsekas' book on "Nonlinear Programming".

Feature normalization: it is typically advised to normalize your input features; in other words, we pre-condition the optimization problem. Without this, step-size selection is more difficult, since different "directions" might converge at different speeds.
Non-Convex Optimization
Computational Complexity

The computational cost is expressed using the big-O notation. Here is a definition taken from Wikipedia: let f and g be two functions defined on some subset of the real numbers. We write f(x) = O(g(x)) as x → ∞ if and only if there exist a positive real number c and a real number x₀ such that |f(x)| ≤ c |g(x)| for all x > x₀.
Optimality Conditions

For a convex optimization problem, the first-order necessary condition says that at an optimum the gradient is equal to zero:

    ∇L(w⋆) = 0.   (1)
SGD Theory

As we have seen above, when N is large, choosing a random training example (x_n, y_n) and taking an SGD step is advantageous. For convergence, the step size γ^(t) is typically required to decrease with t. One way to obtain such sequences is γ^(t) := 1/(t + 1)^r where r ∈ (0.5, 1).
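Such a schedule is a one-liner (the particular exponent r = 0.75 is an arbitrary choice within the stated range):

```python
def sgd_step_size(t, r=0.75):
    """Decreasing SGD step size gamma^(t) := 1 / (t + 1)^r, with r in (0.5, 1)."""
    assert 0.5 < r < 1.0
    return 1.0 / (t + 1) ** r

# For r in (0.5, 1) this satisfies the classic Robbins-Monro conditions:
# sum_t gamma^(t) diverges while sum_t (gamma^(t))^2 is finite.
```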
More Optimization Theory
If you want, you can gain a deeper understanding of several optimization
methods relevant for machine learning from this survey:
Convex Optimization: Algorithms and Complexity
- by Sébastien Bubeck
Exercises
1. Chain-rule