
ML Methods:

Optimization Point of View

E. Le Pennec

MAP553 - Foundation of Machine Learning - Fall 2021

1
Outline

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

2
Outline Supervised Learning

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

3
Supervised Learning Supervised Learning

Supervised Learning Framework


Input measurement X ∈ X
Output measurement Y ∈ Y.
(X , Y ) ∼ P with P unknown.
Training data : Dn = {(X 1 , Y1 ), . . . , (X n , Yn )} (i.i.d. ∼ P)
Often
X ∈ Rd and Y ∈ {−1, 1} (classification)
or X ∈ Rd and Y ∈ R (regression).
A predictor is a function in F = {f : X → Y meas.}
Goal
Construct a good predictor f̂ from the training data.

Need to specify the meaning of good.


Classification and regression are almost the same problem!
4
Loss and Probabilistic Framework Supervised Learning

Loss function for a generic predictor


Loss function: ℓ(Y , f (X )) measures the goodness of the prediction of Y by f (X )
Examples:
Prediction loss: $\ell(Y, f(X)) = \mathbf{1}_{Y \ne f(X)}$
Quadratic loss: $\ell(Y, f(X)) = |Y - f(X)|^2$

Risk function
Risk measured as the average loss for a new couple:
R(f ) = E(X,Y )∼P [ℓ(Y , f (X))]
Examples:
Prediction loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbb{P}(Y \ne f(X))$
Quadratic loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbb{E}\big[|Y - f(X)|^2\big]$

Beware: As f̂ depends on Dn, R(f̂) is a random variable!


5
Best Solution Supervised Learning

The best solution f⋆ (which is independent of Dn) is
$f^\star = \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}[\ell(Y, f(X))] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y, f(X))]\big]$

Bayes Predictor (explicit solution)

In binary classification with the 0−1 loss:
$f^\star(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y=+1|X) \ge \mathbb{P}(Y=-1|X) \iff \mathbb{P}(Y=+1|X) \ge 1/2 \\ -1 & \text{otherwise} \end{cases}$

In regression with the quadratic loss


f ∗ (X ) = E [Y |X ]

Issue: The solution requires knowing E[Y|X] for all values of X!

6
Goal Supervised Learning

Machine Learning
Learn a rule to construct a predictor f̂ ∈ F from the training data Dn s.t. the
risk R(f̂) is small on average or with high probability with respect to Dn.

In practice, the rule should be an algorithm!


Canonical example: Empirical Risk Minimizer
One restricts f to a subset of functions S = {fθ, θ ∈ Θ}
One replaces the minimization of the average loss by the minimization of the
empirical loss
$\hat f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$

Examples:
Linear regression
Linear discrimination with
$S = \{x \mapsto \operatorname{sign}(x^\top\beta + \beta^{(0)})\ :\ \beta \in \mathbb{R}^d,\ \beta^{(0)} \in \mathbb{R}\}$
7
Example: TwoClass Dataset Supervised Learning

Synthetic Dataset
Two features/covariates.
Two classes.
Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer
Numerical experiments with R and the caret package.

8
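A minimal sketch of such a numerical experiment, here in Python with scikit-learn rather than R/caret; the synthetic dataset and all parameter values are illustrative assumptions, not the actual TwoClass data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the TwoClass data: two features, two classes.
    X, y = make_classification(n_samples=208, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)   # linear decision boundary
    print("test accuracy:", clf.score(X_te, y_te))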
Example: Linear Discrimination Supervised Learning

9
Example: More Complex Model Supervised Learning

10
Under-fitting / Over-fitting Issue Supervised Learning

Model Complexity Dilemma

Source: A. Ng
Which is best: a simple or a complex model?
Too simple to be good? Too complex to be learned?

11
Under-fitting / Over-fitting Issue Supervised Learning

Under-fitting / Over-fitting

Source: Unknown
Under-fitting: simple models are too simple.
Over-fitting: complex models are too specific to the training set.

12
Bias-Variance Dilemma Supervised Learning

General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance)
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$

Approx. error can be large if the model S is not suitable.


Estimation error can be large if the model is complex.
Agnostic approach
No assumption (so far) on the law of (X , Y ).

13
Under-fitting / Over-fitting Issue Supervised Learning

Different behavior for different model complexity


Low complexity models are easily learned but the approximation error (bias) may
be large (under-fitting).
High complexity models may contain a good ideal target but the estimation error
(variance) can be large (over-fitting).

Source: Unknown
Bias-variance trade-off ⇐⇒ avoid overfitting and underfitting

Rk: Better to think in terms of methods (including feature engineering and the
specific algorithm) rather than only of models.
14
Theoretical Analysis Supervised Learning

Statistical Learning Analysis


Error decomposition:
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$
Bound on the approximation term: approximation theory.
Probabilistic bound on the estimation term: probability theory!
Goal: Agnostic bounds, i.e. bounds that do not require assumptions on P!
(Statistical Learning?)

Often need mild assumptions on P. . . (Nonparametric Statistics?)

15
Binary Classification Loss Issue Supervised Learning

Empirical Risk Minimizer


$\hat f = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Classification loss: $\ell_{0/1}(y, f(x)) = \mathbf{1}_{y \ne f(x)}$


Not convex and not smooth!
16
Probabilistic Point of View Supervised Learning

Ideal Solution and Estimation

The best solution f⋆ (which is independent of Dn) is
$f^\star = \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}[\ell(Y, f(X))] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y, f(X))]\big]$

Bayes Predictor (explicit solution)

In binary classification with the 0−1 loss:
$f^\star(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y=+1|X) \ge \mathbb{P}(Y=-1|X) \\ -1 & \text{otherwise} \end{cases}$

Source: A. Fermin

Issue: The solution requires knowing E[Y|X] for all values of X!


Solution: Replace it by an estimate.
17
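A minimal sketch of the plug-in strategy, assuming Python/scikit-learn (the estimator and data are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    clf = LogisticRegression().fit(X, y)
    eta_hat = clf.predict_proba(X)[:, 1]       # estimate of P(Y = 1 | X)
    y_hat = np.where(eta_hat >= 0.5, 1, 0)     # plug-in Bayes rule at threshold 1/2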
Optimization Point of View Supervised Learning

Loss Convexification

Minimizer of the risk


$\hat f = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Issue: Classification loss is not convex or smooth.


Solution: Replace it by a convex majorant.
18
Probabilistic and Optimization Framework Supervised Learning

How to find a good function f with a small risk


$R(f) = \mathbb{E}[\ell(Y, f(X))]$ ?
Canonical approach: $\hat f_S = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$

Problems
How to choose S?
How to compute the minimization?

A Probabilistic Point of View


Solution: For X, estimate the law of Y|X and plug this estimate into the Bayes
classifier: (Generalized) Linear Models, Kernel methods, k-NN, Naive Bayes, Trees,
Bagging. . .

An Optimization Point of View


Solution: If necessary, replace the loss ℓ by an upper bound ℓ′ and minimize the
empirical loss: SVR, SVM, Neural Networks, Trees, Boosting. . .
19
Outline Optimization Point of View

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

20
Probabilistic and Optimization Framework Optimization Point of View

How to find a good function f with a small risk


$R(f) = \mathbb{E}[\ell(Y, f(X))]$ ?
Canonical approach: $\hat f_S = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$

Problems
How to choose S?
How to compute the minimization?

A Probabilistic Point of View


Solution: For X, estimate the law of Y|X and plug this estimate into the Bayes
classifier: (Generalized) Linear Models, Kernel methods, k-NN, Naive Bayes, Trees,
Bagging. . .

An Optimization Point of View


Solution: If necessary, replace the loss ℓ by an upper bound ℓ′ and minimize the
empirical loss: SVR, SVM, Neural Networks, Trees, Boosting. . .
21
Three Classical Methods in a Nutshell Optimization Point of View

Penalized Logistic Regression


Let fθ(X) = X⊤β + β(0) with θ = (β, β(0)).
Find $\hat\theta = \operatorname*{arg\,min}_{\theta} \frac{1}{n}\sum_{i=1}^n \log\left(1 + e^{-Y_i f_\theta(X_i)}\right) + \lambda\|\beta\|_1$
Classify using sign(fθ̂)

Support Vector Machine


Let fθ(X) = X⊤β + β(0) with θ = (β, β(0)).
Find $\hat\theta = \operatorname*{arg\,min}_{\theta} \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i f_\theta(X_i),\ 0) + \lambda\|\beta\|_2^2$
Classify using sign(fθ̂)

22
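The two linear methods above can be sketched with scikit-learn (an assumption of these notes; there the penalty strength is parametrized by C, roughly 1/(nλ)):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    # logistic loss + lambda * ||beta||_1
    logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
    # hinge loss + lambda * ||beta||_2^2
    svm = LinearSVC(loss="hinge", C=1.0).fit(X, y)
    print(np.sign(logit.decision_function(X[:5])))   # classify using sign(f_theta)
    print(np.sign(svm.decision_function(X[:5])))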
Three Classical Methods in a Nutshell Optimization Point of View

Deep Learning
Let fθ(X) with f a feed-forward neural network outputting two values with a
softmax layer as the last layer.
Optimize by gradient descent the cross-entropy $-\frac{1}{n}\sum_{i=1}^n \log\left(f_\theta(X_i)(Y_i)\right)$
Classify using the most probable class (for a two-output softmax, this amounts to a sign rule)

Those three methods rely on a similar heuristic: the optimization point of view!

23
Empirical Risk Minimization Optimization Point of View

The best solution f⋆ is the one minimizing the risk:
$f^\star = \operatorname*{arg\,min}_{f\in\mathcal F} R(f) = \operatorname*{arg\,min}_{f\in\mathcal F} \mathbb{E}[\ell(Y, f(X))]$

Empirical Risk Minimization


One restricts f to a subset of functions S = {fθ, θ ∈ Θ}
One replaces the minimization of the average loss by the minimization of the
average empirical loss
$\hat f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$

Intractable for the ℓ0/1 loss!

24
Convexification Strategy Optimization Point of View

Risk Convexification
Replace the loss ℓ(Y, fθ(X)) by a convex upper bound ℓ′(Y, fθ(X)) (surrogate
loss).
Minimize the average of the surrogate empirical loss
$\tilde f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell'(Y_i, f_\theta(X_i))$

Use f̂ = sign(f̃)

Much easier optimization.


Instantiation
Logistic (Revisited)
Support Vector Machine
(Deep) Neural Network
Boosting
25
Classification Loss and Convexification Optimization Point of View

Convexification
Replace the loss ℓ0/1(Y, f(X)) by
ℓ′(Y, f(X)) = l(Y f(X))
with l a convex function.
Further mild assumptions: l is decreasing, differentiable at 0 and l′(0) < 0.

26
Classification Loss and Convexification Optimization Point of View

Classical convexification
Logistic loss: ℓ′ (Y , f (X )) = log(1 + e −Yf (X ) ) (Logistic / NN)
Hinge loss: ℓ′ (Y , f (X )) = (1 − Yf (X ))+ (SVM)
Exponential loss: ℓ′ (Y , f (X )) = e −Yf (X ) (Boosting. . . )

26
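A small numerical check of these majorants as functions of the margin m = Y f(X) (a sketch; the grid of margins is arbitrary):

    import numpy as np

    m = np.linspace(-2, 2, 401)               # margins m = y * f(x)
    zero_one = (m <= 0).astype(float)         # 0/1 loss as a function of the margin
    logistic = np.log2(1 + np.exp(-m))        # logistic loss (base-2 version)
    hinge = np.maximum(0.0, 1 - m)            # hinge loss
    exponential = np.exp(-m)                  # exponential loss
    # each surrogate is a convex upper bound of the 0/1 loss
    assert np.all(logistic >= zero_one)
    assert np.all(hinge >= zero_one) and np.all(exponential >= zero_one)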
Properties Optimization Point of View

The Target is the Bayes Classifier


The minimizer of
$\mathbb{E}\big[\ell'(Y, f(X))\big] = \mathbb{E}[l(Y f(X))]$
is the Bayes classifier f⋆ = sign(2η(X) − 1), where η(X) = P(Y = +1|X)

Control of the Excess Risk


There exists a convex function Ψ such that
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big) \le \mathbb{E}\big[\ell'(Y, f(X))\big] - \mathbb{E}\big[\ell'(Y, f^\star(X))\big]$

Theoretical guarantee!

27
Proof Optimization Point of View

By definition,
$\mathbb{E}[l(Y f)|X] = \eta(X)\, l(f) + (1 - \eta(X))\, l(-f)$
Let H(f, η) = η l(f) + (1 − η) l(−f); the optimal f̃ satisfies
$0 \in \partial_f H(\tilde f, \eta) = \eta\,\partial l(\tilde f) - (1 - \eta)\,\partial l(-\tilde f).$
With a slight abuse of notation, we denote by ∂l(f̃) and ∂l(−f̃) the two subgradients such
that
$\eta\,\partial l(\tilde f) - (1 - \eta)\,\partial l(-\tilde f) = 0$
Now we discuss the sign of f̃:
If f̃ > 0, ∂l(−f̃) < ∂l(f̃) by convexity and thus η > (1 − η), i.e. 2η − 1 > 0.
Conversely, if f̃ < 0 then 2η − 1 < 0.
Thus sign(f̃) = sign(2η − 1), i.e. the minimizer of E[l(Y f)|X] is f⋆(X) = sign(2η(X) − 1)

28
Proof Optimization Point of View

We define H(η) = inf_f H(f, η) = inf_f (η l(f) + (1 − η) l(−f)). By construction, H
is a concave function satisfying H(1/2 + x) = H(1/2 − x).
Furthermore, one verifies that the minimum over the wrong-sign classifiers satisfies
$\inf_{f,\ f(2\eta-1)<0} H(f, \eta) = l(0).$
Indeed, using the convexity of l,
$\inf_{f,\ f(2\eta-1)<0} H(f, \eta) = \inf_{f,\ f(2\eta-1)<0} \big(\eta\, l(f) + (1-\eta)\, l(-f)\big) \ge \inf_{f,\ f(2\eta-1)<0} \big(\eta (l(0) + l'(0) f) + (1-\eta)(l(0) - l'(0) f)\big) \ge l(0) + \inf_{f,\ f(2\eta-1)<0} l'(0)\, f\, (2\eta-1) = l(0)$
29
Proof Optimization Point of View

Furthermore,
$\mathbb{E}\big[\ell'(Y, f(X))\big] = \mathbb{E}_X[H(f(X), \eta(X))]$
$\mathbb{E}\big[\ell'(Y, f^\star(X))\big] = \mathbb{E}_X[H(\eta(X))]$
We then define
$\Psi(\theta) = l(0) - H\Big(\frac{1+\theta}{2}\Big)$
which is thus a convex function satisfying Ψ(0) = 0 and Ψ(θ) > 0 for θ > 0.

30
Proof Optimization Point of View

Recall that
$\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big] = \mathbb{E}_X\big[|2\eta(X)-1|\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big]$
Using the Jensen inequality, we derive
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big) \le \mathbb{E}_X\Big[\Psi\big(|2\eta(X)-1|\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big)\Big]$
31
Proof Optimization Point of View

Using Ψ(0) = 0 and the symmetry of H,
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big)$
$\le \mathbb{E}_X\Big[\Big(l(0) - H\Big(\frac{1+|2\eta(X)-1|}{2}\Big)\Big)\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\Big]$
$\le \mathbb{E}_X\big[(l(0) - H(\eta(X)))\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big]$
$\le \mathbb{E}_X\big[(l(0) - H(\eta(X)))\,\mathbf{1}_{f(X)(2\eta(X)-1)<0}\big]$
Using the property of the wrong-sign classifiers,
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big)$
$\le \mathbb{E}_X\big[(H(f(X), \eta(X)) - H(f^\star(X), \eta(X)))\,\mathbf{1}_{f(X)(2\eta(X)-1)<0}\big]$
$\le \mathbb{E}_X\big[H(f(X), \eta(X)) - H(f^\star(X), \eta(X))\big]$
$\le \mathbb{E}\big[\ell'(Y, f(X))\big] - \mathbb{E}\big[\ell'(Y, f^\star(X))\big]$
32
Logistic Revisited Optimization Point of View

Ideal solution:
$\hat f = \operatorname*{argmin}_{f\in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Logistic regression
Use f(X) = X⊤β + β(0).
Use the logistic loss ℓ(y, f) = log₂(1 + e^{−yf}), i.e. the negative log-likelihood (up to a factor).

Different vision than the statistician's but same algorithm!

33
Logistic Revisited Optimization Point of View

34
Outline SVM

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

35
Ideal Separable Case SVM

Linear classifier: sign(X⊤β + β(0))

Separable case: ∃(β, β(0)), ∀i, Yi(Xi⊤β + β(0)) > 0!
How to choose (β, β(0)) so that the separation is maximal?

Source: M. Mohri et al.

Strict separation: ∃(β, β(0)), ∀i, Yi(Xi⊤β + β(0)) ≥ 1
Distance between the hyperplanes X⊤β + β(0) = 1 and X⊤β + β(0) = −1: 2/∥β∥
Maximizing this distance is equivalent to minimizing ½∥β∥².
36
Ideal Separable Case SVM

Separable SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 \quad\text{with}\quad \forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1$

Source: M. Mohri et al.

Quadratic Programming setting.
Efficient solvers available. . .

36
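A sketch of this on separable toy data, assuming scikit-learn (a large C emulates the hard-margin problem; the data and constants are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
    y = np.r_[-np.ones(20), np.ones(20)]
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # ~ hard-margin SVM
    beta = clf.coef_.ravel()
    print("margin width 2/||beta||:", 2 / np.linalg.norm(beta))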
Non Separable Case SVM

What about the non separable case?

SVM relaxation
Relax the constraints
∀i, Yi(Xi⊤β + β(0)) ≥ 1 to ∀i, Yi(Xi⊤β + β(0)) ≥ 1 − si
with the slack variables si ≥ 0

Source: M. Mohri et al.

Keep those slack variables as small as possible by minimizing
$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i$
where C > 0 is the goodness-of-fit strength
37
Non Separable Case SVM

SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$

Source: M. Mohri et al.

Hinge Loss reformulation:
$\min_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \underbrace{\max(0,\ 1 - Y_i(X_i^\top\beta + \beta^{(0)}))}_{\text{Hinge Loss}}$

Constrained convex optimization algorithms vs gradient descent algorithms.
37
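A sketch of the gradient-descent route on the hinge-loss reformulation (subgradient descent on the equivalent penalized objective; the step size, λ and the data are illustrative assumptions):

    import numpy as np

    def svm_subgradient(X, y, lam=0.1, lr=0.01, n_iter=1000):
        # Subgradient descent on (1/n) sum_i max(0, 1 - y_i (x_i . beta + beta0))
        #                          + (lam/2) ||beta||^2
        n, d = X.shape
        beta, beta0 = np.zeros(d), 0.0
        for _ in range(n_iter):
            margins = y * (X @ beta + beta0)
            active = margins < 1              # points where the hinge is active
            g_beta = -(X[active] * y[active, None]).sum(axis=0) / n + lam * beta
            g_beta0 = -y[active].sum() / n
            beta -= lr * g_beta
            beta0 -= lr * g_beta0
        return beta, beta0

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    beta, beta0 = svm_subgradient(X, y)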


SVM as a Penalized Convex Relaxation SVM

Convex relaxation:
$\operatorname*{argmin}_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0)$
$= \operatorname*{argmin}_{\beta,\beta^{(0)}} \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0) + \frac{1}{Cn}\,\frac{1}{2}\|\beta\|^2$
Prop: $\ell_{0/1}(Y_i, \operatorname{sign}(X_i^\top\beta + \beta^{(0)})) \le \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0)$

Penalized convex relaxation (Tikhonov!)

$\frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, \operatorname{sign}(X_i^\top\beta + \beta^{(0)})) \le \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0) + \frac{1}{Cn}\,\frac{1}{2}\|\beta\|^2$

38
SVM SVM

39
The Kernel Trick SVM

Non-linear separation: just replace X by a non-linear Φ(X). . .

Knowing ϕ(Xi)⊤ϕ(Xj) is sufficient to compute the SVM solution.
Kernel trick
Computing k(X, X′) = ϕ(X)⊤ϕ(X′) may be easier than computing ϕ(X),
ϕ(X′) and then the scalar product!
ϕ can be specified through its positive definite kernel k.

Source: Unknown
Examples: Polynomial kernel $k(X, X') = (1 + X^\top X')^d$, Gaussian kernel
$k(X, X') = e^{-\|X - X'\|^2/2}$, . . .
RKHS setting!
Can be used in (logistic) regression and more. . .
40
SVM SVM

41
SVM SVM

42
Constrained Minimization SVM

Constrained Minimization
Goal:
$\min_x f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p \\ g_i(x) \le 0, & i = 1,\dots,q \end{cases}$
or rather with argmin!

Different Settings
f, hj, gi differentiable
f convex, hj affine and gi convex.

Feasibility
x is feasible if hj(x) = 0 and gi(x) ≤ 0.
Rk: The set of feasible points may be empty
43
Lagrangian SVM

Constrained Minimization
Goal:
$p^* = \min_x f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p \\ g_i(x) \le 0, & i = 1,\dots,q \end{cases}$

Lagrangian
Def:
$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$
with λ ∈ Rp and µ ∈ (R+)q.
The λj and µi are called the dual (or Lagrange) variables.
Prop:
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x, \lambda, \mu) = \begin{cases} f(x) & \text{if } x \text{ is feasible}\\ +\infty & \text{otherwise}\end{cases}$
$\min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x, \lambda, \mu) = p^*$
44
Lagrangian Dual SVM

Lagrangian
Def:
$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$
with λ ∈ Rp and µ ∈ (R+)q.

Lagrangian Dual
Lagrangian dual function:
$Q(\lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$
Prop:
$Q(\lambda, \mu) \le f(x)$, for all feasible x
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} Q(\lambda, \mu) \le \min_{x\ \text{feasible}} f(x)$

45
Duality SVM

Primal
Primal:
$p^* = \min_{x\in\mathcal{X}} f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p\\ g_i(x) \le 0, & i = 1,\dots,q\end{cases}$

Dual
Dual:
$q^* = \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} Q(\lambda, \mu) = \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x \mathcal{L}(x, \lambda, \mu)$

Duality
Always weak duality:
$q^* \le p^*$
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x\ \mathcal{L}(x,\lambda,\mu)\ \le\ \min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x,\lambda,\mu)$
Not always strong duality q* = p*.
46
Strong Duality SVM

Strong Duality
Strong duality:
$q^* = p^*$
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x\ \mathcal{L}(x,\lambda,\mu)\ =\ \min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x,\lambda,\mu)$
Allows one to compute the solution of one problem from the other.
Requires some assumptions!

Strong Duality under Convexity and Slater's Condition

f convex, hj affine and gi convex.
Slater's condition: there exists a feasible point such that hj(x) = 0 for all j and
gi(x) < 0 for all i.
Sufficient to prove strong duality.
Rk: If the gi are affine, it suffices to have hj(x) = 0 for all j and gi(x) ≤ 0 for all
i.
47
KKT SVM

Karush-Kuhn-Tucker Conditions
Stationarity:
$\nabla_x \mathcal{L}(x^*, \lambda, \mu) = \nabla f(x^*) + \sum_j \lambda_j \nabla h_j(x^*) + \sum_i \mu_i \nabla g_i(x^*) = 0$
Primal admissibility:
$h_j(x^*) = 0 \quad\text{and}\quad g_i(x^*) \le 0$
Dual admissibility:
$\mu_i \ge 0$
Complementary slackness:
$\mu_i\, g_i(x^*) = 0$

KKT Theorem
If f is convex, the hj affine and the gi convex, all are differentiable, and strong
duality holds, then x* is a solution of the primal problem if and only if the KKT
conditions hold.
48
SVM and Lagrangian SVM

SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$

SVM Lagrangian
Lagrangian:
$\mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu) = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i + \sum_i \alpha_i\big(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})\big) - \sum_i \mu_i s_i$

49
SVM and KKT SVM

KKT Optimality Conditions

Stationarity:
$\nabla_\beta \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = \beta - \sum_i \alpha_i Y_i X_i = 0$
$\nabla_{\beta^{(0)}} \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = -\sum_i \alpha_i Y_i = 0$
$\nabla_{s_i} \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = C - \alpha_i - \mu_i = 0$
Primal and dual admissibility:
$(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})) \le 0,\quad s_i \ge 0,\quad \alpha_i \ge 0,\quad \mu_i \ge 0$
Complementary slackness:
$\alpha_i\big(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})\big) = 0 \quad\text{and}\quad \mu_i s_i = 0$

Consequence
$\beta^* = \sum_i \alpha_i Y_i X_i$ and $0 \le \alpha_i \le C$.
If αi ≠ 0, Xi is called a support vector and either
si = 0 and Yi(Xi⊤β + β(0)) = 1 (margin hyperplane),
or αi = C (outliers).
$\beta^{(0)*} = Y_i - X_i^\top\beta^*$ for any support vector with 0 < αi < C.
50
SVM Dual SVM

SVM Lagrangian Dual

Lagrangian Dual:
$Q(\alpha, \mu) = \min_{\beta,\beta^{(0)},s} \mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu)$
Prop:
if $\sum_i \alpha_i Y_i \ne 0$ or $\exists i,\ \alpha_i + \mu_i \ne C$, then $Q(\alpha, \mu) = -\infty$
if $\sum_i \alpha_i Y_i = 0$ and $\forall i,\ \alpha_i + \mu_i = C$, then
$Q(\alpha, \mu) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j Y_iY_j\, X_i^\top X_j$

SVM Dual problem

The dual problem is a Quadratic Programming problem:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha, \mu) \iff \max_{0\le\alpha\le C,\ \sum_i \alpha_i Y_i = 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j Y_iY_j\, X_i^\top X_j$
Involves the Xi only through their scalar products.
51
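These dual quantities can be inspected numerically; a sketch assuming scikit-learn, whose SVC stores αᵢYᵢ for the support vectors in dual_coef_:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    alpha_y = clf.dual_coef_.ravel()          # alpha_i * y_i, support vectors only
    print("number of support vectors:", len(clf.support_))
    print("sum_i alpha_i y_i = 0:", np.isclose(alpha_y.sum(), 0.0))
    beta = alpha_y @ X[clf.support_]          # beta* = sum_i alpha_i y_i X_i
    print("matches primal coef:", np.allclose(beta, clf.coef_.ravel()))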
Mercer Theorem SVM

Mercer Representation Theorem

For any loss ℓ and any increasing function Φ, the minimizer in β of
$\sum_{i=1}^n \ell(Y_i, X_i^\top\beta + \beta^{(0)}) + \Phi(\|\beta\|^2)$
is a linear combination of the input points: $\beta^* = \sum_{i=1}^n \alpha_i' X_i$.
Minimization problem in α′:
$\sum_{i=1}^n \ell\Big(Y_i,\ \sum_j \alpha_j' X_j^\top X_i + \beta^{(0)}\Big) + \Phi(\|\beta\|^2)$
involving only the scalar products of the data.
The optimal predictor also requires only scalar products:
$\hat f(X) = X^\top\beta^* + \beta^{(0),*} = \sum_i \alpha_i'\, X_i^\top X + \beta^{(0),*}$
Transforms a problem in dimension dim(X) into a problem in dimension n.
Direct minimization in β can be more efficient. . .
52
Feature Map SVM

Feature Engineering
Art of creating new features from the existing ones X.
Example: add monomials (X(j))², X(j)X(j′), . . .
Adding features increases the dimension.

Feature Map
Application ϕ : X → H with H a Hilbert space.

Source: Unknown
Linear decision boundary in H: ϕ(X)⊤β + β(0) = 0 is no longer a hyperplane
in X.

Heuristic: Increasing the dimension can make the data almost linearly separable.


53
Polynomial Mapping SVM

Polynomial Mapping of order 2

$\phi: \mathbb{R}^2 \to \mathbb{R}^6, \qquad \phi(X) = \Big((X^{(1)})^2,\ (X^{(2)})^2,\ \sqrt{2}X^{(1)}X^{(2)},\ \sqrt{2}X^{(1)},\ \sqrt{2}X^{(2)},\ 1\Big)$

Source: M. Mohri et al.

Allows one to solve the XOR classification problem with the hyperplane X(1)X(2) = 0.
Polynomial Mapping and Scalar Product
Prop:
$\phi(X)^\top\phi(X') = (1 + X^\top X')^2$
54
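A quick numerical verification of this identity (a sketch; the two test points are arbitrary):

    import numpy as np

    def phi(x):
        # order-2 polynomial feature map from R^2 to R^6
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

    x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
    print(np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2))   # True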
SVM Primal and Dual SVM

Primal, Lagrangian and Dual

Primal:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(\phi(X_i)^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$
Lagrangian:
$\mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu) = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i + \sum_i \alpha_i\big(1 - s_i - Y_i(\phi(X_i)^\top\beta + \beta^{(0)})\big) - \sum_i \mu_i s_i$
Dual:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha,\mu) \iff \max_{0\le\alpha\le C,\ \sum_i\alpha_iY_i=0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j Y_iY_j\, \phi(X_i)^\top\phi(X_j)$
Optimal predictor: $\phi(X)^\top\beta^* + \beta^{(0),*} = \sum_i \alpha_i Y_i\, \phi(X)^\top\phi(X_i) + \beta^{(0),*}$

Only need to be able to compute ϕ(X)⊤ϕ(X′) to obtain the solution.
55


From Map to Kernel SVM

Many algorithms (e.g. SVM) require only to be able to compute the scalar
product ϕ(X )⊤ ϕ(X ′ ).

Kernel
Any application
k :X ×X →R
is called a kernel over X .

Kernel Trick
Computing directly the kernel k(X, X′) = ϕ(X)⊤ϕ(X′) may be easier than
computing ϕ(X), ϕ(X′) and then the scalar product.

Here k is defined from ϕ.


Under some assumption on k, ϕ can be implicitly defined from k!
56
PDS Kernel SVM

Positive Definite Symmetric Kernels

A kernel k is PDS if and only if
k is symmetric, i.e.
$k(X, X') = k(X', X)$
for any N ∈ N and any (X1, . . . , XN) ∈ X^N, the matrix
$K = [k(X_i, X_j)]_{1\le i,j\le N}$
is positive semi-definite, i.e. for all $u \in \mathbb{R}^N$,
$u^\top K u = \sum_{1\le i,j\le N} u^{(i)} u^{(j)} k(X_i, X_j) \ge 0$
or equivalently all the eigenvalues of K are non-negative.

The matrix K is called the Gram matrix associated to (X1, . . . , XN).

57
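A sketch of this definition in practice: build a Gram matrix and check symmetry and positive semi-definiteness (the kernel, γ and the random points are illustrative):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))
    K = rbf_kernel(X, X, gamma=0.5)          # Gram matrix of a PDS kernel
    print("symmetric:", np.allclose(K, K.T))
    print("PSD:", np.all(np.linalg.eigvalsh(K) >= -1e-10))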
Reproducing Kernel Hilbert Space SVM

Mercer Theorem
For any PDS kernel k : X × X → R, there exists a Hilbert space H ⊂ R^X with a
scalar product ⟨·, ·⟩H such that
there exists a mapping ϕ : X → H satisfying
$k(X, X') = \langle \phi(X), \phi(X')\rangle_{\mathcal{H}}$
the reproducing property holds, i.e. for any h ∈ H and any X ∈ X,
$h(X) = \langle h, k(X, \cdot)\rangle_{\mathcal{H}}.$

By def., H is a reproducing kernel Hilbert space (RKHS).

H is called the feature space associated to k and ϕ the feature mapping.
No uniqueness in general.
Rk: if k(X, X′) = ϕ′(X)⊤ϕ′(X′) with ϕ′ : X → Rp then H can be chosen as
{X ↦ ϕ′(X)⊤β, β ∈ Rp} with ∥X ↦ ϕ′(X)⊤β∥²_H = ∥β∥²₂.
For the canonical mapping ϕ(X) = k(X, ·) and the vanilla kernel, ϕ(X)(X′) = X⊤X′.
58
Kernel Construction Machinery SVM

Separable Kernel
For any function Ψ : X → R, k(X, X′) = Ψ(X)Ψ(X′) is PDS.
Kernel Stability
For any PDS kernels k1 and k2, k1 + k2 and k1 k2 are PDS kernels.
For any sequence of PDS kernels kn converging pointwise to a kernel k, k is a
PDS kernel.
For any PDS kernel k such that |k| ≤ r and any power series $\sum_n a_n z^n$ with an ≥ 0
and a convergence radius larger than r, $\sum_n a_n k^n$ is a PDS kernel.
For any PDS kernel k, the renormalized kernel
$k'(X, X') = \frac{k(X, X')}{\sqrt{k(X, X)\, k(X', X')}}$
is a PDS kernel.
Cauchy-Schwarz for k PDS: $k(X, X')^2 \le k(X, X)\, k(X', X')$
59
Classical Kernels SVM

PDS Kernels
Vanilla kernel:
$k(X, X') = X^\top X'$
Polynomial kernel:
$k(X, X') = (1 + X^\top X')^k$
Gaussian RBF kernel:
$k(X, X') = \exp\big(-\gamma\|X - X'\|^2\big)$
Tanh kernel:
$k(X, X') = \tanh(a X^\top X' + b)$

The most classical is the Gaussian RBF kernel. . .

Lots of freedom to construct kernels for non-classical data.

60
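These classical kernels are available as Gram-matrix routines, e.g. in scikit-learn (a sketch; parameter values are illustrative):

    import numpy as np
    from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

    X = np.random.default_rng(0).normal(size=(5, 3))
    K_lin = linear_kernel(X)                                       # X_i . X_j
    K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)  # (1 + X_i . X_j)^3
    K_rbf = rbf_kernel(X, gamma=0.5)             # exp(-gamma ||X_i - X_j||^2)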
Representer Theorem SVM

Representer Theorem
Let k be a PDS kernel and H its corresponding RKHS;
for any increasing function Φ and any function L : Rn → R, the optimization
problem
$\operatorname*{argmin}_{h\in\mathcal{H}}\ L(h(X_1), \dots, h(X_n)) + \Phi(\|h\|)$
admits only solutions of the form
$\sum_{i=1}^n \alpha_i' k(X_i, \cdot).$

Examples:
(kernelized) SVM
(kernelized) Penalized Logistic Regression (Ridge)
(kernelized) Penalized Regression (Ridge)

61
Kernelized SVM SVM

Primal
Constrained Optimization:
$\min_{f\in\mathcal{H},\,\beta^{(0)},\,s}\ \|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(f(X_i) + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$
Hinge loss:
$\min_{f\in\mathcal{H},\,\beta^{(0)}}\ \|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^n \max(0,\ 1 - Y_i(f(X_i) + \beta^{(0)}))$
Representer:
$\min_{\alpha',\,\beta^{(0)}}\ \sum_{i,j}\alpha_i'\alpha_j'\, k(X_i, X_j) + C\sum_{i=1}^n \max\Big(0,\ 1 - Y_i\Big(\sum_j \alpha_j' k(X_j, X_i) + \beta^{(0)}\Big)\Big)$

Dual
Dual:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha,\mu) \iff \max_{0\le\alpha\le C,\ \sum_i\alpha_iY_i=0}\ \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jY_iY_j\, k(X_i, X_j)$
62
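A sketch of a kernelized SVM where only the Gram matrix is supplied, assuming scikit-learn's precomputed-kernel interface (the data and γ are illustrative):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    K = rbf_kernel(X, X, gamma=2.0)                   # Gram matrix k(X_i, X_j)
    clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
    # predicting a new point only needs its kernel values against the training set
    x_new = np.array([[0.5, 0.0]])
    print(clf.predict(rbf_kernel(x_new, X, gamma=2.0)))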
SVM SVM

63
SVM SVM

64
Outline Penalization

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

65
Bias-Variance Dilemma Penalization

General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance)
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$

Approx. error can be large if the model S is not suitable.


Estimation error can be large if the model is complex.

66
Under-fitting / Over-fitting Issue Penalization

Different behavior for different model complexity


Low complexity models are easily learned but the approximation error (bias) may
be large (under-fitting).
High complexity models may contain a good ideal target but the estimation error
(variance) can be large (over-fitting).

Source: Unknown
Bias-variance trade-off ⇐⇒ avoid overfitting and underfitting

Rk: Better to think in terms of methods (including feature engineering and the
specific algorithm) rather than only of models.
67
Simplified Models Penalization

Bias-Variance Issue
The most complex models may not be the best ones due to the variability of the
estimate.

Source: Tibshirani et al.

Naive idea: can we simplify our model without losing too much?
by using only a subset of the variables?
by forcing the coefficients to be small?
Can we do better than exploring all possibilities?
68
Linear Models Penalization

Setting: Gen. linear model = prediction of Y by h(x⊤β).

Model coefficients
Model entirely specified by β.
Coefficientwise:
β(i) = 0 means that the ith covariate is not used.
β(i) ∼ 0 means that the ith covariate has a low influence. . .

If some covariates are useless, better to use a simpler model. . .

Submodels
Simplify the model through a constraint on β!
Examples:
Support: Impose that β(i) = 0 for i ∉ I.
Support size: Impose that $\|\beta\|_0 = \sum_{i=1}^d \mathbf{1}_{\beta^{(i)}\ne 0} < C$
Norm: Impose that ∥β∥p < C with 1 ≤ p (often p = 2 or p = 1)

69
Norms and Sparsity Penalization

Sparsity
β is sparse if its number of non-zero coefficients (ℓ0) is small. . .
Easy interpretation in terms of dimension/complexity.

Norm Constraint and Sparsity

Sparsest solution obtained by definition with the ℓ0 norm.

Source: Tibshirani et al.

No induced sparsity with the ℓ2 norm. . .
Sparsity with the ℓ1 norm (can even be proved to be the same as with the ℓ0
norm under some assumptions).
Geometric explanation.
70
Constraint and Penalization Penalization

Constrained Optimization
Choose a constant C.
Compute β as
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d,\ \|\beta\|_p \le C}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta))$

Lagrangian Reformulation
Choose λ and compute β as
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta)) + \lambda\|\beta\|_p^{p'}$
with p′ = p, except if p = 0 where p′ = 1.

Easier calibration. . . but no explicit model S.

Rk: ∥β∥p is not scaling invariant if p ≠ 0. . .

Initial rescaling issue.
71
Penalization Penalization

Penalized Linear Model


Minimization of
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta)) + \operatorname{pen}(\beta)$
where pen(β) is a (sparsity promoting) penalty
Variable selection if β is sparse.

Classical Penalties
AIC: pen(β) = λ∥β∥0 (non convex / sparsity)
Ridge: pen(β) = λ∥β∥2² (convex / no sparsity)
Lasso: pen(β) = λ∥β∥1 (convex / sparsity)
Elastic net: pen(β) = λ1∥β∥1 + λ2∥β∥2² (convex / sparsity)

Easy optimization if pen (and the loss) is convex. . .

Need to specify λ to define a ML method!
72
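A sketch of these penalties with scikit-learn (an assumption of these notes; there λ is called alpha and the loss is least squares):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [2.0, -1.0, 0.5]                 # sparse ground truth
    y = X @ beta_true + 0.1 * rng.normal(size=100)
    for model in (Ridge(alpha=1.0), Lasso(alpha=0.1),
                  ElasticNet(alpha=0.1, l1_ratio=0.5)):
        model.fit(X, y)
        # only the l1-based penalties induce sparsity
        print(type(model).__name__, "non-zero coefs:", int(np.sum(model.coef_ != 0)))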
Penalized Gen. Linear Models Penalization

Classical Examples
Penalized Least Squares
Penalized Logistic Regression
Penalized Maximum Likelihood
SVM
Tree pruning

Sometimes used even if the parametrization is not linear. . .

73
Penalization Penalization

Penalized ℓ0/1 loss (Structural Risk Minimization)

Minimization of
$\operatorname*{argmin}_{f_m,\ m\in\mathcal{M},\ f_m\in S_m}\ \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f_m(X_i)) + \operatorname{pen}(m)$
where pen(m) is a complexity driven penalty. . .

No easy optimization here!

Classical Penalties
Finite class: $\operatorname{pen}(m) = \lambda\sqrt{\frac{\log|\mathcal{M}|}{n}}$
Finite VC dimension: $\operatorname{pen}(m) = \lambda\sqrt{\frac{d_{VC}(S_m)\log\left(\frac{en}{d_{VC}(S_m)}\right)}{n}}$

Need to specify λ!
74
Convexified Loss Penalization Penalization

Penalized convexified ℓ loss

Minimization of
$\operatorname*{argmin}_{f_m,\ m\in\mathcal{M},\ f_m\in S_m}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_m(X_i)) + \operatorname{pen}(m)$
where pen(m) is a complexity driven penalty. . .

Easy optimization here!

Reuse the previous pen(m)!
Need to specify λ!
SVM case:
dVC ∼ ∥β∥² which advocates for a penalty in λ∥β∥. . .
A penalty in λ′∥β∥² is more convenient numerically and there is a correspondence
between the two problems. . .

75
Penalization and Cross-Validation Penalization

Practical Selection Methodology

Choose a penalty family penλ.
Compute a CV error for the penalty penλ for all λ ∈ Λ.
Determine λ̂, the λ minimizing the CV error.
Compute the final model with the penalty pen_λ̂.

CV selects a ML method (penalized estimation with a penalty pen_λ̂), not a
single predictor, hence the need for a final reestimation.
Why not use CV directly on a grid of models?
The grid size scales exponentially with the dimension!
If the penalized minimization is easy, it is much cheaper to compute the CV error
for all λ ∈ Λ. . .
CV performs best when the set of candidates is not too big (or is structured. . . )
76
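A sketch of this methodology for the Lasso, assuming scikit-learn (the grid, data and 5-fold choice are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso, LassoCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
    # CV error over a grid of penalty levels (lambda is called alpha here)
    cv = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5).fit(X, y)
    print("selected penalty:", cv.alpha_)
    # final reestimation on the full training set with the selected penalty
    # (LassoCV already performs this refit internally; made explicit here)
    final = Lasso(alpha=cv.alpha_).fit(X, y)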
Outline Cross Validation and Weights

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

77
Unbalanced and Rebalanced Dataset Cross Validation and Weights

Unbalanced Class
Setting: One of the classes is much more present than the other.

Source: University of Granada

Issue: The classifier is too attracted by the majority class!

Rebalanced Dataset
Setting: Class proportions are different in the training and testing sets (stratified
sampling).
Issue: Training errors are no longer estimates of the testing errors.
78
Resampling Strategies Cross Validation and Weights

Resampling
Modify the training dataset so that the classes are more balanced.
Two flavors:

Source: Oracle
Sub-sampling, which wastes data,
Over-sampling, which needs to create new examples.
Issue: The training data is no longer representative of the testing data.
Hard to do it right!
79
Resampling Effect Cross Validation and Weights

Testing:
Testing class prob.: πt(k)
Testing error target:
$\mathbb{E}_{\pi_t}[\ell(Y, f(X))] = \sum_k \pi_t(k)\, \mathbb{E}[\ell(Y, f(X))\,|\,Y = k]$
Training:
Training class prob.: πtr(k)
Training error target:
$\mathbb{E}_{\pi_{tr}}[\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\, \mathbb{E}[\ell(Y, f(X))\,|\,Y = k]$

Implicit Testing Error Using the Training One

Amounts to using a weighted loss:
$\mathbb{E}_{\pi_{tr}}[\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k] = \sum_k \pi_t(k)\,\mathbb{E}\Big[\frac{\pi_{tr}(k)}{\pi_t(k)}\,\ell(Y, f(X))\,\Big|\,Y=k\Big] = \mathbb{E}_{\pi_t}\Big[\frac{\pi_{tr}(Y)}{\pi_t(Y)}\,\ell(Y, f(X))\Big]$
Put more weight on the less probable classes!
80
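A sketch of undoing a rebalanced training set through explicit weights (the proportions and data are illustrative assumptions; most scikit-learn estimators accept sample_weight):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    pi_t = {0: 0.9, 1: 0.1}     # hypothetical testing class proportions
    pi_tr = {0: 0.5, 1: 0.5}    # rebalanced training proportions
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 100)                        # balanced training labels
    X = rng.normal(size=(200, 3)) + y[:, None]        # class 1 is shifted
    w = np.array([pi_t[k] / pi_tr[k] for k in y])     # weight pi_t(Y)/pi_tr(Y)
    clf = LogisticRegression().fit(X, y, sample_weight=w)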
Weighted Loss Cross Validation and Weights

In unbalanced situations, the cost of misprediction is often not the same for all
classes (e.g. medical diagnosis, credit lending. . . ).
Much better to use this explicitly than to do blind resampling!

Weighted Loss
Weighted loss:
$\ell(Y, f(X)) \longrightarrow C(Y)\,\ell(Y, f(X))$
Weighted error target:
$\mathbb{E}[C(Y)\,\ell(Y, f(X))]$

Rk: Strong link with ℓ as C is independent of f.

Often allows one to reuse algorithms constructed for ℓ.
C may also depend on X. . .

81
Weighted Loss, ℓ0/1 loss and Bayes Classifier Cross Validation and Weights

The Bayes classifier is now:
$f^\star = \operatorname*{argmin}_f\ \mathbb{E}[C(Y)\,\ell(Y, f(X))] = \operatorname*{argmin}_f\ \mathbb{E}_X\big[\mathbb{E}_{Y|X}[C(Y)\,\ell(Y, f(X))]\big]$

Bayes Predictor
For the ℓ0/1 loss,
$f^\star(X) = \operatorname*{argmax}_k\ C(k)\, \mathbb{P}(Y = k|X)$
Same effect as a threshold modification in the binary setting!

Allows one to put more emphasis on some classes than others.

82
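A sketch of both routes with a hypothetical cost C(0)=1, C(1)=5 (the data, costs and estimator are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    y = (rng.random(300) < 0.2).astype(int)           # unbalanced classes
    X = rng.normal(size=(300, 2)) + 1.5 * y[:, None]
    # cost-sensitive fit: errors on class 1 cost 5 times more
    clf = LogisticRegression(class_weight={0: 1.0, 1: 5.0}).fit(X, y)
    # equivalent plug-in rule: argmax_k C(k) P(Y = k | X), a modified threshold
    eta = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    y_hat = (5.0 * eta >= 1.0 * (1 - eta)).astype(int)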
Linking Weights and Proportions Cross Validation and Weights

Cost and Proportions

Testing error target:
$\mathbb{E}_{\pi_t}[C_t(Y)\,\ell(Y, f(X))] = \sum_k \pi_t(k)\,C_t(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k]$
Training error target:
$\mathbb{E}_{\pi_{tr}}[C_{tr}(Y)\,\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\,C_{tr}(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k]$
They coincide if
$\pi_t(k)\,C_t(k) = \pi_{tr}(k)\,C_{tr}(k)$

Lots of flexibility in the choice of Ct, Ctr or πtr!

83
Combining Weights and Resampling Cross Validation and Weights

Weighted Loss and Resampling

Weighted loss: choice of a weight Ct ≠ 1.
Resampling: use a πtr ≠ πt.

Stratified sampling may be used to reduce the size of a dataset without losing a
low probability class!

Combining Weights and Resampling

Weighted loss: use Ctr = Ct as πtr = πt.
Resampling: use an implicit Ct(k) = πtr(k)/πt(k).
Combined: use Ctr(k) = Ct(k) πt(k)/πtr(k)

Most ML methods allow such weights!

84
References Cross Validation and Weights

T. Hastie, R. Tibshirani, and J. Friedman.


The Elements of Statistical Learning.
Springer Series in Statistics, 2009
M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning.
MIT Press, 2012
A. Géron.
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd ed.)
O’Reilly, 2019
Ch. Giraud.
Introduction to High-Dimensional Statistics.
CRC Press, 2014
S. Bubeck.
Convex Optimization: Algorithms and Complexity.
Now Publisher, 2015
85
Licence and Contributors

Creative Commons Attribution-ShareAlike (CC BY-SA 4.0)


You are free to:
Share: copy and redistribute the material in any medium or format
Adapt: remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in
any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the
original.
No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything
the license permits.

Contributors
Main contributor: E. Le Pennec
Contributors: S. Boucheron, A. Dieuleveut, A.K. Fermin, S. Gadat, S. Gaiffas,
A. Guilloux, Ch. Keribin, E. Matzner, M. Sangnier, E. Scornet.
86
