
ML Methods:

Optimization Point of View

E. Le Pennec

MAP553 - Foundation of Machine Learning - Fall 2021

1
Outline

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

2
Outline Supervised Learning

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

3
Supervised Learning Supervised Learning

Supervised Learning Framework


Input measurement X ∈ X
Output measurement Y ∈ Y.
(X , Y ) ∼ P with P unknown.
Training data : Dn = {(X 1 , Y1 ), . . . , (X n , Yn )} (i.i.d. ∼ P)
Often
X ∈ Rd and Y ∈ {−1, 1} (classification)
or X ∈ Rd and Y ∈ R (regression).
A predictor is a function in F = {f : X → Y meas.}
Goal
Construct a good predictor f̂ from the training data.

Need to specify the meaning of good.


Classification and regression are almost the same problem!
4
Loss and Probabilistic Framework Supervised Learning

Loss function for a generic predictor


Loss function: ℓ(Y , f (X )) measures the goodness of the prediction of Y by f (X )
Examples:
Prediction loss: $\ell(Y, f(X)) = \mathbf{1}_{Y \ne f(X)}$
Quadratic loss: $\ell(Y, f(X)) = |Y - f(X)|^2$

Risk function
Risk measured as the average loss for a new couple:
R(f ) = E(X,Y )∼P [ℓ(Y , f (X))]
Examples:
Prediction loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbb{P}(Y \ne f(X))$
Quadratic loss: $\mathbb{E}[\ell(Y, f(X))] = \mathbb{E}\big[|Y - f(X)|^2\big]$

Beware: As f̂ depends on Dn, R(f̂) is a random variable!


5
Best Solution Supervised Learning

The best solution f⋆ (which is independent of Dn) is
$f^\star = \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}[\ell(Y, f(X))] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y, f(X))]\big]$

Bayes Predictor (explicit solution)

In binary classification with the 0−1 loss:
$f^\star(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y=+1|X) \ge \mathbb{P}(Y=-1|X) \iff \mathbb{P}(Y=+1|X) \ge 1/2 \\ -1 & \text{otherwise} \end{cases}$

In regression with the quadratic loss


f ∗ (X ) = E [Y |X ]

Issue: The solution requires knowing E[Y|X] for all values of X!

6
Goal Supervised Learning

Machine Learning
Learn a rule to construct a predictor f̂ ∈ F from the training data Dn s.t. the
risk R(f̂) is small on average or with high probability with respect to Dn.

In practice, the rule should be an algorithm!


Canonical example: Empirical Risk Minimizer
One restricts f to a subset of functions S = {fθ, θ ∈ Θ}
One replaces the minimization of the average loss by the minimization of the
empirical loss
$\hat f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$

Examples:
Linear regression
Linear discrimination with
$S = \{x \mapsto \operatorname{sign}(x^\top\beta + \beta^{(0)})\ :\ \beta \in \mathbb{R}^d,\ \beta^{(0)} \in \mathbb{R}\}$
7
Example: TwoClass Dataset Supervised Learning

Synthetic Dataset
Two features/covariates.
Two classes.
Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer
Numerical experiments with R and the caret package.

8
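A minimal sketch of such a numerical experiment, here in Python with scikit-learn rather than R/caret; the synthetic dataset and all parameter values are illustrative assumptions, not the actual TwoClass data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the TwoClass data: two features, two classes.
    X, y = make_classification(n_samples=208, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)   # linear decision boundary
    print("test accuracy:", clf.score(X_te, y_te))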
Example: Linear Discrimination Supervised Learning

9
Example: More Complex Model Supervised Learning

10
Under-fitting / Over-fitting Issue Supervised Learning

Model Complexity Dilemma

Source: A. Ng
Which is best: a simple or a complex model?
Too simple to be good? Too complex to be learned?

11
Under-fitting / Over-fitting Issue Supervised Learning

Under-fitting / Over-fitting

Source: Unknown
Under-fitting: simple models are too simple.
Over-fitting: complex models are too specific to the training set.

12
Bias-Variance Dilemma Supervised Learning

General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance)
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$

Approx. error can be large if the model S is not suitable.


Estimation error can be large if the model is complex.
Agnostic approach
No assumption (so far) on the law of (X , Y ).

13
Under-fitting / Over-fitting Issue Supervised Learning

Different behavior for different model complexity


Low complexity models are easily learned but the approximation error (bias) may
be large (under-fitting).
High complexity models may contain a good ideal target but the estimation error
(variance) can be large (over-fitting).

Source: Unknown
Bias-variance trade-off ⇐⇒ avoid overfitting and underfitting

Rk: Better to think in terms of methods (including feature engineering and the
specific algorithm) rather than only of models.
14
Theoretical Analysis Supervised Learning

Statistical Learning Analysis


Error decomposition:
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$
Bound on the approximation term: approximation theory.
Probabilistic bound on the estimation term: probability theory!
Goal: Agnostic bounds, i.e. bounds that do not require assumptions on P!
(Statistical Learning?)

Often need mild assumptions on P. . . (Nonparametric Statistics?)

15
Binary Classification Loss Issue Supervised Learning

Empirical Risk Minimizer


$\hat f = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Classification loss: $\ell_{0/1}(y, f(x)) = \mathbf{1}_{y \ne f(x)}$


Not convex and not smooth!
16
Probabilistic Point of View Supervised Learning

Ideal Solution and Estimation

The best solution f⋆ (which is independent of Dn) is
$f^\star = \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}[\ell(Y, f(X))] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathbb{E}_X\big[\mathbb{E}_{Y|X}[\ell(Y, f(X))]\big]$

Bayes Predictor (explicit solution)

In binary classification with the 0−1 loss:
$f^\star(X) = \begin{cases} +1 & \text{if } \mathbb{P}(Y=+1|X) \ge \mathbb{P}(Y=-1|X) \\ -1 & \text{otherwise} \end{cases}$

Source: A. Fermin

Issue: The solution requires knowing E[Y|X] for all values of X!


Solution: Replace it by an estimate.
17
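A minimal sketch of the plug-in strategy, assuming Python/scikit-learn (the estimator and data are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    clf = LogisticRegression().fit(X, y)
    eta_hat = clf.predict_proba(X)[:, 1]       # estimate of P(Y = 1 | X)
    y_hat = np.where(eta_hat >= 0.5, 1, 0)     # plug-in Bayes rule at threshold 1/2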
Optimization Point of View Supervised Learning

Loss Convexification

Minimizer of the risk


$\hat f = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Issue: Classification loss is not convex or smooth.


Solution: Replace it by a convex majorant.
18
Probabilistic and Optimization Framework Supervised Learning

How to find a good function f with a small risk


$R(f) = \mathbb{E}[\ell(Y, f(X))]$ ?
Canonical approach: $\hat f_S = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$

Problems
How to choose S?
How to compute the minimization?

A Probabilistic Point of View


Solution: For X, estimate the law of Y|X and plug this estimate into the Bayes
classifier: (Generalized) Linear Models, Kernel methods, k-NN, Naive Bayes, Trees,
Bagging. . .

An Optimization Point of View


Solution: If necessary, replace the loss ℓ by an upper bound ℓ′ and minimize the
empirical loss: SVR, SVM, Neural Networks, Trees, Boosting. . .
19
Outline Optimization Point of View

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

20
Probabilistic and Optimization Framework Optimization Point of View

How to find a good function f with a small risk


$R(f) = \mathbb{E}[\ell(Y, f(X))]$ ?
Canonical approach: $\hat f_S = \operatorname*{argmin}_{f \in S} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f(X_i))$

Problems
How to choose S?
How to compute the minimization?

A Probabilistic Point of View


Solution: For X, estimate the law of Y|X and plug this estimate into the Bayes
classifier: (Generalized) Linear Models, Kernel methods, k-NN, Naive Bayes, Trees,
Bagging. . .

An Optimization Point of View


Solution: If necessary, replace the loss ℓ by an upper bound ℓ′ and minimize the
empirical loss: SVR, SVM, Neural Networks, Trees, Boosting. . .
21
Three Classical Methods in a Nutshell Optimization Point of View

Penalized Logistic Regression


Let fθ(X) = X⊤β + β(0) with θ = (β, β(0)).
Find $\hat\theta = \operatorname*{arg\,min}_{\theta} \frac{1}{n}\sum_{i=1}^n \log\left(1 + e^{-Y_i f_\theta(X_i)}\right) + \lambda\|\beta\|_1$
Classify using sign(fθ̂)

Support Vector Machine


Let fθ(X) = X⊤β + β(0) with θ = (β, β(0)).
Find $\hat\theta = \operatorname*{arg\,min}_{\theta} \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i f_\theta(X_i),\ 0) + \lambda\|\beta\|_2^2$
Classify using sign(fθ̂)

22
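The two linear methods above can be sketched with scikit-learn (an assumption of these notes; there the penalty strength is parametrized by C, roughly 1/(nλ)):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    # logistic loss + lambda * ||beta||_1
    logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
    # hinge loss + lambda * ||beta||_2^2
    svm = LinearSVC(loss="hinge", C=1.0).fit(X, y)
    print(np.sign(logit.decision_function(X[:5])))   # classify using sign(f_theta)
    print(np.sign(svm.decision_function(X[:5])))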
Three Classical Methods in a Nutshell Optimization Point of View

Deep Learning
Let fθ(X) with f a feed-forward neural network outputting two values with a
softmax layer as the last layer.
Optimize by gradient descent the cross-entropy $-\frac{1}{n}\sum_{i=1}^n \log\left(f_\theta(X_i)(Y_i)\right)$
Classify using the most probable class (for a two-output softmax, this amounts to a sign rule)

Those three methods rely on a similar heuristic: the optimization point of view!

23
Empirical Risk Minimization Optimization Point of View

The best solution f⋆ is the one minimizing the risk:
$f^\star = \operatorname*{arg\,min}_{f\in\mathcal F} R(f) = \operatorname*{arg\,min}_{f\in\mathcal F} \mathbb{E}[\ell(Y, f(X))]$

Empirical Risk Minimization


One restricts f to a subset of functions S = {fθ, θ ∈ Θ}
One replaces the minimization of the average loss by the minimization of the
average empirical loss
$\hat f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$

Intractable for the ℓ0/1 loss!

24
Convexification Strategy Optimization Point of View

Risk Convexification
Replace the loss ℓ(Y, fθ(X)) by a convex upper bound ℓ′(Y, fθ(X)) (surrogate
loss).
Minimize the average of the surrogate empirical loss
$\tilde f = f_{\hat\theta} = \operatorname*{argmin}_{f_\theta,\,\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \ell'(Y_i, f_\theta(X_i))$

Use f̂ = sign(f̃)

Much easier optimization.


Instantiation
Logistic (Revisited)
Support Vector Machine
(Deep) Neural Network
Boosting
25
Classification Loss and Convexification Optimization Point of View

Convexification
Replace the loss ℓ0/1(Y, f(X)) by
ℓ′(Y, f(X)) = l(Y f(X))
with l a convex function.
Further mild assumptions: l is decreasing, differentiable at 0 and l′(0) < 0.

26
Classification Loss and Convexification Optimization Point of View

Classical convexification
Logistic loss: ℓ′ (Y , f (X )) = log(1 + e −Yf (X ) ) (Logistic / NN)
Hinge loss: ℓ′ (Y , f (X )) = (1 − Yf (X ))+ (SVM)
Exponential loss: ℓ′ (Y , f (X )) = e −Yf (X ) (Boosting. . . )

26
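A small numerical check of these majorants as functions of the margin m = Y f(X) (a sketch; the grid of margins is arbitrary):

    import numpy as np

    m = np.linspace(-2, 2, 401)               # margins m = y * f(x)
    zero_one = (m <= 0).astype(float)         # 0/1 loss as a function of the margin
    logistic = np.log2(1 + np.exp(-m))        # logistic loss (base-2 version)
    hinge = np.maximum(0.0, 1 - m)            # hinge loss
    exponential = np.exp(-m)                  # exponential loss
    # each surrogate is a convex upper bound of the 0/1 loss
    assert np.all(logistic >= zero_one)
    assert np.all(hinge >= zero_one) and np.all(exponential >= zero_one)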
Properties Optimization Point of View

The Target is the Bayes Classifier


The minimizer of
$\mathbb{E}\big[\ell'(Y, f(X))\big] = \mathbb{E}[l(Y f(X))]$
is the Bayes classifier f⋆ = sign(2η(X) − 1), where η(X) = P(Y = +1|X)

Control of the Excess Risk


There exists a convex function Ψ such that
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big) \le \mathbb{E}\big[\ell'(Y, f(X))\big] - \mathbb{E}\big[\ell'(Y, f^\star(X))\big]$

Theoretical guarantee!

27
Proof Optimization Point of View

By definition,
$\mathbb{E}[l(Y f)|X] = \eta(X)\, l(f) + (1 - \eta(X))\, l(-f)$
Let H(f, η) = η l(f) + (1 − η) l(−f); the optimal f̃ satisfies
$0 \in \partial_f H(\tilde f, \eta) = \eta\,\partial l(\tilde f) - (1 - \eta)\,\partial l(-\tilde f).$
With a slight abuse of notation, we denote by ∂l(f̃) and ∂l(−f̃) the two subgradients such
that
$\eta\,\partial l(\tilde f) - (1 - \eta)\,\partial l(-\tilde f) = 0$
Now we discuss the sign of f̃:
If f̃ > 0, ∂l(−f̃) < ∂l(f̃) by convexity and thus η > (1 − η), i.e. 2η − 1 > 0.
Conversely, if f̃ < 0 then 2η − 1 < 0.
Thus sign(f̃) = sign(2η − 1), i.e. the minimizer of E[l(Y f)|X] is f⋆(X) = sign(2η(X) − 1)

28
Proof Optimization Point of View

We define H(η) = inf_f H(f, η) = inf_f (η l(f) + (1 − η) l(−f)). By construction, H
is a concave function satisfying H(1/2 + x) = H(1/2 − x).
Furthermore, one verifies that the minimum over the wrong-sign classifiers satisfies
$\inf_{f,\ f(2\eta-1)<0} H(f, \eta) = l(0).$
Indeed, using the convexity of l,
$\inf_{f,\ f(2\eta-1)<0} H(f, \eta) = \inf_{f,\ f(2\eta-1)<0} \big(\eta\, l(f) + (1-\eta)\, l(-f)\big) \ge \inf_{f,\ f(2\eta-1)<0} \big(\eta (l(0) + l'(0) f) + (1-\eta)(l(0) - l'(0) f)\big) \ge l(0) + \inf_{f,\ f(2\eta-1)<0} l'(0)\, f\, (2\eta-1) = l(0)$
29
Proof Optimization Point of View

Furthermore,
$\mathbb{E}\big[\ell'(Y, f(X))\big] = \mathbb{E}_X[H(f(X), \eta(X))]$
$\mathbb{E}\big[\ell'(Y, f^\star(X))\big] = \mathbb{E}_X[H(\eta(X))]$
We then define
$\Psi(\theta) = l(0) - H\Big(\frac{1+\theta}{2}\Big)$
which is thus a convex function satisfying Ψ(0) = 0 and Ψ(θ) > 0 for θ > 0.

30
Proof Optimization Point of View

Recall that
$\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big] = \mathbb{E}_X\big[|2\eta(X)-1|\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big]$
Using the Jensen inequality, we derive
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big) \le \mathbb{E}_X\Big[\Psi\big(|2\eta(X)-1|\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big)\Big]$
31
Proof Optimization Point of View

Using Ψ(0) = 0 and the symmetry of H,
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big)$
$\le \mathbb{E}_X\Big[\Big(l(0) - H\Big(\frac{1+|2\eta(X)-1|}{2}\Big)\Big)\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\Big]$
$\le \mathbb{E}_X\big[(l(0) - H(\eta(X)))\,\mathbf{1}_{f^\star(X)\ne\operatorname{sign}(f(X))}\big]$
$\le \mathbb{E}_X\big[(l(0) - H(\eta(X)))\,\mathbf{1}_{f(X)(2\eta(X)-1)<0}\big]$
Using the property of the wrong-sign classifiers,
$\Psi\Big(\mathbb{E}\big[\ell_{0/1}(Y, \operatorname{sign}(f(X)))\big] - \mathbb{E}\big[\ell_{0/1}(Y, f^\star(X))\big]\Big)$
$\le \mathbb{E}_X\big[(H(f(X), \eta(X)) - H(f^\star(X), \eta(X)))\,\mathbf{1}_{f(X)(2\eta(X)-1)<0}\big]$
$\le \mathbb{E}_X\big[H(f(X), \eta(X)) - H(f^\star(X), \eta(X))\big]$
$\le \mathbb{E}\big[\ell'(Y, f(X))\big] - \mathbb{E}\big[\ell'(Y, f^\star(X))\big]$
32
Logistic Revisited Optimization Point of View

Ideal solution:
$\hat f = \operatorname*{argmin}_{f\in S} \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f(X_i))$

Logistic regression
Use f(X) = X⊤β + β(0).
Use the logistic loss ℓ(y, f) = log₂(1 + e^{−yf}), i.e. the negative log-likelihood (up to a factor).

Different vision than the statistician's but same algorithm!

33
Logistic Revisited Optimization Point of View

34
Outline SVM

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

35
Ideal Separable Case SVM

Linear classifier: sign(X⊤β + β(0))

Separable case: ∃(β, β(0)), ∀i, Yi(Xi⊤β + β(0)) > 0!
How to choose (β, β(0)) so that the separation is maximal?

Source: M. Mohri et al.

Strict separation: ∃(β, β(0)), ∀i, Yi(Xi⊤β + β(0)) ≥ 1
Distance between the hyperplanes X⊤β + β(0) = 1 and X⊤β + β(0) = −1: 2/∥β∥
Maximizing this distance is equivalent to minimizing ½∥β∥².
36
Ideal Separable Case SVM

Separable SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 \quad\text{with}\quad \forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1$

Source: M. Mohri et al.

Quadratic Programming setting.
Efficient solvers available. . .

36
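A sketch of this on separable toy data, assuming scikit-learn (a large C emulates the hard-margin problem; the data and constants are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
    y = np.r_[-np.ones(20), np.ones(20)]
    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # ~ hard-margin SVM
    beta = clf.coef_.ravel()
    print("margin width 2/||beta||:", 2 / np.linalg.norm(beta))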
Non Separable Case SVM

What about the non separable case?

SVM relaxation
Relax the constraints
∀i, Yi(Xi⊤β + β(0)) ≥ 1 to ∀i, Yi(Xi⊤β + β(0)) ≥ 1 − si
with the slack variables si ≥ 0

Source: M. Mohri et al.

Keep those slack variables as small as possible by minimizing
$\frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i$
where C > 0 is the goodness-of-fit strength
37
Non Separable Case SVM

SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$

Source: M. Mohri et al.

Hinge Loss reformulation:
$\min_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \underbrace{\max(0,\ 1 - Y_i(X_i^\top\beta + \beta^{(0)}))}_{\text{Hinge Loss}}$

Constrained convex optimization algorithms vs gradient descent algorithms.
37
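A sketch of the gradient-descent route on the hinge-loss reformulation (subgradient descent on the equivalent penalized objective; the step size, λ and the data are illustrative assumptions):

    import numpy as np

    def svm_subgradient(X, y, lam=0.1, lr=0.01, n_iter=1000):
        # Subgradient descent on (1/n) sum_i max(0, 1 - y_i (x_i . beta + beta0))
        #                          + (lam/2) ||beta||^2
        n, d = X.shape
        beta, beta0 = np.zeros(d), 0.0
        for _ in range(n_iter):
            margins = y * (X @ beta + beta0)
            active = margins < 1              # points where the hinge is active
            g_beta = -(X[active] * y[active, None]).sum(axis=0) / n + lam * beta
            g_beta0 = -y[active].sum() / n
            beta -= lr * g_beta
            beta0 -= lr * g_beta0
        return beta, beta0

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
    beta, beta0 = svm_subgradient(X, y)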


SVM as a Penalized Convex Relaxation SVM

Convex relaxation:
$\operatorname*{argmin}_{\beta,\beta^{(0)}} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0)$
$= \operatorname*{argmin}_{\beta,\beta^{(0)}} \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0) + \frac{1}{Cn}\,\frac{1}{2}\|\beta\|^2$
Prop: $\ell_{0/1}(Y_i, \operatorname{sign}(X_i^\top\beta + \beta^{(0)})) \le \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0)$

Penalized convex relaxation (Tikhonov!)

$\frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, \operatorname{sign}(X_i^\top\beta + \beta^{(0)})) \le \frac{1}{n}\sum_{i=1}^n \max(1 - Y_i(X_i^\top\beta + \beta^{(0)}),\ 0) + \frac{1}{Cn}\,\frac{1}{2}\|\beta\|^2$

38
SVM SVM

39
The Kernel Trick SVM

Non-linear separation: just replace X by a non-linear Φ(X). . .

Knowing ϕ(Xi)⊤ϕ(Xj) is sufficient to compute the SVM solution.
Kernel trick
Computing k(X, X′) = ϕ(X)⊤ϕ(X′) may be easier than computing ϕ(X),
ϕ(X′) and then the scalar product!
ϕ can be specified through its positive definite kernel k.

Source: Unknown
Examples: Polynomial kernel $k(X, X') = (1 + X^\top X')^d$, Gaussian kernel
$k(X, X') = e^{-\|X - X'\|^2/2}$, . . .
RKHS setting!
Can be used in (logistic) regression and more. . .
40
SVM SVM

41
SVM SVM

42
Constrained Minimization SVM

Constrained Minimization
Goal:
$\min_x f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p \\ g_i(x) \le 0, & i = 1,\dots,q \end{cases}$
or rather with argmin!

Different Settings
f, hj, gi differentiable
f convex, hj affine and gi convex.

Feasibility
x is feasible if hj(x) = 0 and gi(x) ≤ 0.
Rk: The set of feasible points may be empty
43
Lagrangian SVM

Constrained Minimization
Goal:
$p^* = \min_x f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p \\ g_i(x) \le 0, & i = 1,\dots,q \end{cases}$

Lagrangian
Def:
$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$
with λ ∈ Rp and µ ∈ (R+)q.
The λj and µi are called the dual (or Lagrange) variables.
Prop:
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x, \lambda, \mu) = \begin{cases} f(x) & \text{if } x \text{ is feasible}\\ +\infty & \text{otherwise}\end{cases}$
$\min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x, \lambda, \mu) = p^*$
44
Lagrangian Dual SVM

Lagrangian
Def:
$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_{j=1}^p \lambda_j h_j(x) + \sum_{i=1}^q \mu_i g_i(x)$
with λ ∈ Rp and µ ∈ (R+)q.

Lagrangian Dual
Lagrangian dual function:
$Q(\lambda, \mu) = \min_x \mathcal{L}(x, \lambda, \mu)$
Prop:
$Q(\lambda, \mu) \le f(x)$, for all feasible x
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} Q(\lambda, \mu) \le \min_{x\ \text{feasible}} f(x)$

45
Duality SVM

Primal
Primal:
$p^* = \min_{x\in\mathcal{X}} f(x) \quad\text{with}\quad \begin{cases} h_j(x) = 0, & j = 1,\dots,p\\ g_i(x) \le 0, & i = 1,\dots,q\end{cases}$

Dual
Dual:
$q^* = \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} Q(\lambda, \mu) = \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x \mathcal{L}(x, \lambda, \mu)$

Duality
Always weak duality:
$q^* \le p^*$
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x\ \mathcal{L}(x,\lambda,\mu)\ \le\ \min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x,\lambda,\mu)$
Not always strong duality q* = p*.
46
Strong Duality SVM

Strong Duality
Strong duality:
$q^* = p^*$
$\max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q}\ \min_x\ \mathcal{L}(x,\lambda,\mu)\ =\ \min_x\ \max_{\lambda\in\mathbb{R}^p,\ \mu\in(\mathbb{R}^+)^q} \mathcal{L}(x,\lambda,\mu)$
Allows one to compute the solution of one problem from the other.
Requires some assumptions!

Strong Duality under Convexity and Slater's Condition

f convex, hj affine and gi convex.
Slater's condition: there exists a feasible point such that hj(x) = 0 for all j and
gi(x) < 0 for all i.
Sufficient to prove strong duality.
Rk: If the gi are affine, it suffices to have hj(x) = 0 for all j and gi(x) ≤ 0 for all
i.
47
KKT SVM

Karush-Kuhn-Tucker Conditions
Stationarity:
$\nabla_x \mathcal{L}(x^*, \lambda, \mu) = \nabla f(x^*) + \sum_j \lambda_j \nabla h_j(x^*) + \sum_i \mu_i \nabla g_i(x^*) = 0$
Primal admissibility:
$h_j(x^*) = 0 \quad\text{and}\quad g_i(x^*) \le 0$
Dual admissibility:
$\mu_i \ge 0$
Complementary slackness:
$\mu_i\, g_i(x^*) = 0$

KKT Theorem
If f is convex, the hj affine and the gi convex, all are differentiable, and strong
duality holds, then x* is a solution of the primal problem if and only if the KKT
conditions hold.
48
SVM and Lagrangian SVM

SVM
Constrained optimization formulation:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(X_i^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$

SVM Lagrangian
Lagrangian:
$\mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu) = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i + \sum_i \alpha_i\big(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})\big) - \sum_i \mu_i s_i$

49
SVM and KKT SVM

KKT Optimality Conditions

Stationarity:
$\nabla_\beta \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = \beta - \sum_i \alpha_i Y_i X_i = 0$
$\nabla_{\beta^{(0)}} \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = -\sum_i \alpha_i Y_i = 0$
$\nabla_{s_i} \mathcal{L}(\beta,\beta^{(0)},s,\alpha,\mu) = C - \alpha_i - \mu_i = 0$
Primal and dual admissibility:
$(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})) \le 0,\quad s_i \ge 0,\quad \alpha_i \ge 0,\quad \mu_i \ge 0$
Complementary slackness:
$\alpha_i\big(1 - s_i - Y_i(X_i^\top\beta + \beta^{(0)})\big) = 0 \quad\text{and}\quad \mu_i s_i = 0$

Consequence
$\beta^* = \sum_i \alpha_i Y_i X_i$ and $0 \le \alpha_i \le C$.
If αi ≠ 0, Xi is called a support vector and either
si = 0 and Yi(Xi⊤β + β(0)) = 1 (margin hyperplane),
or αi = C (outliers).
$\beta^{(0)*} = Y_i - X_i^\top\beta^*$ for any support vector with 0 < αi < C.
50
SVM Dual SVM

SVM Lagrangian Dual

Lagrangian Dual:
$Q(\alpha, \mu) = \min_{\beta,\beta^{(0)},s} \mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu)$
Prop:
if $\sum_i \alpha_i Y_i \ne 0$ or $\exists i,\ \alpha_i + \mu_i \ne C$, then $Q(\alpha, \mu) = -\infty$
if $\sum_i \alpha_i Y_i = 0$ and $\forall i,\ \alpha_i + \mu_i = C$, then
$Q(\alpha, \mu) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j Y_iY_j\, X_i^\top X_j$

SVM Dual problem

The dual problem is a Quadratic Programming problem:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha, \mu) \iff \max_{0\le\alpha\le C,\ \sum_i \alpha_i Y_i = 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j Y_iY_j\, X_i^\top X_j$
Involves the Xi only through their scalar products.
51
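These dual quantities can be inspected numerically; a sketch assuming scikit-learn, whose SVC stores αᵢYᵢ for the support vectors in dual_coef_:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    alpha_y = clf.dual_coef_.ravel()          # alpha_i * y_i, support vectors only
    print("number of support vectors:", len(clf.support_))
    print("sum_i alpha_i y_i = 0:", np.isclose(alpha_y.sum(), 0.0))
    beta = alpha_y @ X[clf.support_]          # beta* = sum_i alpha_i y_i X_i
    print("matches primal coef:", np.allclose(beta, clf.coef_.ravel()))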
Mercer Theorem SVM

Mercer Representation Theorem

For any loss ℓ and any increasing function Φ, the minimizer in β of
$\sum_{i=1}^n \ell(Y_i, X_i^\top\beta + \beta^{(0)}) + \Phi(\|\beta\|^2)$
is a linear combination of the input points: $\beta^* = \sum_{i=1}^n \alpha_i' X_i$.
Minimization problem in α′:
$\sum_{i=1}^n \ell\Big(Y_i,\ \sum_j \alpha_j' X_j^\top X_i + \beta^{(0)}\Big) + \Phi(\|\beta\|^2)$
involving only the scalar products of the data.
The optimal predictor also requires only scalar products:
$\hat f(X) = X^\top\beta^* + \beta^{(0),*} = \sum_i \alpha_i'\, X_i^\top X + \beta^{(0),*}$
Transforms a problem in dimension dim(X) into a problem in dimension n.
Direct minimization in β can be more efficient. . .
52
Feature Map SVM

Feature Engineering
Art of creating new features from the existing ones X.
Example: add monomials (X(j))², X(j)X(j′), . . .
Adding features increases the dimension.

Feature Map
Application ϕ : X → H with H a Hilbert space.

Source: Unknown
Linear decision boundary in H: ϕ(X)⊤β + β(0) = 0 is no longer a hyperplane
in X.

Heuristic: Increasing the dimension can make the data almost linearly separable.


53
Polynomial Mapping SVM

Polynomial Mapping of order 2

$\phi: \mathbb{R}^2 \to \mathbb{R}^6, \qquad \phi(X) = \Big((X^{(1)})^2,\ (X^{(2)})^2,\ \sqrt{2}X^{(1)}X^{(2)},\ \sqrt{2}X^{(1)},\ \sqrt{2}X^{(2)},\ 1\Big)$

Source: M. Mohri et al.

Allows one to solve the XOR classification problem with the hyperplane X(1)X(2) = 0.
Polynomial Mapping and Scalar Product
Prop:
$\phi(X)^\top\phi(X') = (1 + X^\top X')^2$
54
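A quick numerical verification of this identity (a sketch; the two test points are arbitrary):

    import numpy as np

    def phi(x):
        # order-2 polynomial feature map from R^2 to R^6
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

    x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
    print(np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2))   # True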
SVM Primal and Dual SVM

Primal, Lagrangian and Dual

Primal:
$\min_{\beta,\beta^{(0)},s} \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(\phi(X_i)^\top\beta + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$
Lagrangian:
$\mathcal{L}(\beta, \beta^{(0)}, s, \alpha, \mu) = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n s_i + \sum_i \alpha_i\big(1 - s_i - Y_i(\phi(X_i)^\top\beta + \beta^{(0)})\big) - \sum_i \mu_i s_i$
Dual:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha,\mu) \iff \max_{0\le\alpha\le C,\ \sum_i\alpha_iY_i=0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j Y_iY_j\, \phi(X_i)^\top\phi(X_j)$
Optimal predictor: $\phi(X)^\top\beta^* + \beta^{(0),*} = \sum_i \alpha_i Y_i\, \phi(X)^\top\phi(X_i) + \beta^{(0),*}$

Only need to be able to compute ϕ(X)⊤ϕ(X′) to obtain the solution.
55


From Map to Kernel SVM

Many algorithms (e.g. SVM) require only to be able to compute the scalar
product ϕ(X )⊤ ϕ(X ′ ).

Kernel
Any application
k :X ×X →R
is called a kernel over X .

Kernel Trick
Computing directly the kernel k(X, X′) = ϕ(X)⊤ϕ(X′) may be easier than
computing ϕ(X), ϕ(X′) and then the scalar product.

Here k is defined from ϕ.


Under some assumption on k, ϕ can be implicitly defined from k!
56
PDS Kernel SVM

Positive Definite Symmetric Kernels

A kernel k is PDS if and only if
k is symmetric, i.e.
$k(X, X') = k(X', X)$
for any N ∈ N and any (X1, . . . , XN) ∈ X^N, the matrix
$K = [k(X_i, X_j)]_{1\le i,j\le N}$
is positive semi-definite, i.e. for all $u \in \mathbb{R}^N$,
$u^\top K u = \sum_{1\le i,j\le N} u^{(i)} u^{(j)} k(X_i, X_j) \ge 0$
or equivalently all the eigenvalues of K are non-negative.

The matrix K is called the Gram matrix associated to (X1, . . . , XN).

57
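A sketch of this definition in practice: build a Gram matrix and check symmetry and positive semi-definiteness (the kernel, γ and the random points are illustrative):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))
    K = rbf_kernel(X, X, gamma=0.5)          # Gram matrix of a PDS kernel
    print("symmetric:", np.allclose(K, K.T))
    print("PSD:", np.all(np.linalg.eigvalsh(K) >= -1e-10))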
Reproducing Kernel Hilbert Space SVM

Mercer Theorem
For any PDS kernel k : X × X → R, there exists a Hilbert space H ⊂ R^X with a
scalar product ⟨·, ·⟩H such that
there exists a mapping ϕ : X → H satisfying
$k(X, X') = \langle \phi(X), \phi(X')\rangle_{\mathcal{H}}$
the reproducing property holds, i.e. for any h ∈ H and any X ∈ X,
$h(X) = \langle h, k(X, \cdot)\rangle_{\mathcal{H}}.$

By def., H is a reproducing kernel Hilbert space (RKHS).

H is called the feature space associated to k and ϕ the feature mapping.
No uniqueness in general.
Rk: if k(X, X′) = ϕ′(X)⊤ϕ′(X′) with ϕ′ : X → Rp then H can be chosen as
{X ↦ ϕ′(X)⊤β, β ∈ Rp} with ∥X ↦ ϕ′(X)⊤β∥²_H = ∥β∥²₂.
For the canonical mapping ϕ(X) = k(X, ·) and the vanilla kernel, ϕ(X)(X′) = X⊤X′.
58
Kernel Construction Machinery SVM

Separable Kernel
For any function Ψ : X → R, k(X, X′) = Ψ(X)Ψ(X′) is PDS.
Kernel Stability
For any PDS kernels k1 and k2, k1 + k2 and k1 k2 are PDS kernels.
For any sequence of PDS kernels kn converging pointwise to a kernel k, k is a
PDS kernel.
For any PDS kernel k such that |k| ≤ r and any power series $\sum_n a_n z^n$ with an ≥ 0
and a convergence radius larger than r, $\sum_n a_n k^n$ is a PDS kernel.
For any PDS kernel k, the renormalized kernel
$k'(X, X') = \frac{k(X, X')}{\sqrt{k(X, X)\, k(X', X')}}$
is a PDS kernel.
Cauchy-Schwarz for k PDS: $k(X, X')^2 \le k(X, X)\, k(X', X')$
59
Classical Kernels SVM

PDS Kernels
Vanilla kernel:
$k(X, X') = X^\top X'$
Polynomial kernel:
$k(X, X') = (1 + X^\top X')^k$
Gaussian RBF kernel:
$k(X, X') = \exp\big(-\gamma\|X - X'\|^2\big)$
Tanh kernel:
$k(X, X') = \tanh(a X^\top X' + b)$

The most classical is the Gaussian RBF kernel. . .

Lots of freedom to construct kernels for non-classical data.

60
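These classical kernels are available as Gram-matrix routines, e.g. in scikit-learn (a sketch; parameter values are illustrative):

    import numpy as np
    from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

    X = np.random.default_rng(0).normal(size=(5, 3))
    K_lin = linear_kernel(X)                                       # X_i . X_j
    K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)  # (1 + X_i . X_j)^3
    K_rbf = rbf_kernel(X, gamma=0.5)             # exp(-gamma ||X_i - X_j||^2)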
Representer Theorem SVM

Representer Theorem
Let k be a PDS kernel and H its corresponding RKHS;
for any increasing function Φ and any function L : Rn → R, the optimization
problem
$\operatorname*{argmin}_{h\in\mathcal{H}}\ L(h(X_1), \dots, h(X_n)) + \Phi(\|h\|)$
admits only solutions of the form
$\sum_{i=1}^n \alpha_i' k(X_i, \cdot).$

Examples:
(kernelized) SVM
(kernelized) Penalized Logistic Regression (Ridge)
(kernelized) Penalized Regression (Ridge)

61
Kernelized SVM SVM

Primal
Constrained Optimization:
$\min_{f\in\mathcal{H},\,\beta^{(0)},\,s}\ \|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^n s_i \quad\text{with}\quad \begin{cases}\forall i,\ Y_i(f(X_i) + \beta^{(0)}) \ge 1 - s_i\\ \forall i,\ s_i \ge 0\end{cases}$
Hinge loss:
$\min_{f\in\mathcal{H},\,\beta^{(0)}}\ \|f\|_{\mathcal{H}}^2 + C\sum_{i=1}^n \max(0,\ 1 - Y_i(f(X_i) + \beta^{(0)}))$
Representer:
$\min_{\alpha',\,\beta^{(0)}}\ \sum_{i,j}\alpha_i'\alpha_j'\, k(X_i, X_j) + C\sum_{i=1}^n \max\Big(0,\ 1 - Y_i\Big(\sum_j \alpha_j' k(X_j, X_i) + \beta^{(0)}\Big)\Big)$

Dual
Dual:
$\max_{\alpha\ge0,\ \mu\ge0} Q(\alpha,\mu) \iff \max_{0\le\alpha\le C,\ \sum_i\alpha_iY_i=0}\ \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jY_iY_j\, k(X_i, X_j)$
62
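A sketch of a kernelized SVM where only the Gram matrix is supplied, assuming scikit-learn's precomputed-kernel interface (the data and γ are illustrative):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    K = rbf_kernel(X, X, gamma=2.0)                   # Gram matrix k(X_i, X_j)
    clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
    # predicting a new point only needs its kernel values against the training set
    x_new = np.array([[0.5, 0.0]])
    print(clf.predict(rbf_kernel(x_new, X, gamma=2.0)))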
SVM SVM

63
SVM SVM

64
Outline Penalization

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

65
Bias-Variance Dilemma Penalization

General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance)
$R(\hat f_S) - R(f^\star) = \underbrace{R(f^\star_S) - R(f^\star)}_{\text{Approximation error}} + \underbrace{R(\hat f_S) - R(f^\star_S)}_{\text{Estimation error}}$

Approx. error can be large if the model S is not suitable.


Estimation error can be large if the model is complex.

66
Under-fitting / Over-fitting Issue Penalization

Different behavior for different model complexity


Low complexity models are easily learned but the approximation error (bias) may
be large (under-fitting).
High complexity models may contain a good ideal target but the estimation error
(variance) can be large (over-fitting).

Source: Unknown
Bias-variance trade-off ⇐⇒ avoid overfitting and underfitting

Rk: Better to think in terms of methods (including feature engineering and the
specific algorithm) rather than only of models.
67
Simplified Models Penalization

Bias-Variance Issue
The most complex models may not be the best ones due to the variability of the
estimate.

Source: Tibshirani et al.

Naive idea: can we simplify our model without losing too much?
by using only a subset of the variables?
by forcing the coefficients to be small?
Can we do better than exploring all possibilities?
68
Linear Models Penalization

Setting: Gen. linear model = prediction of Y by h(x⊤β).

Model coefficients
Model entirely specified by β.
Coefficientwise:
β(i) = 0 means that the ith covariate is not used.
β(i) ∼ 0 means that the ith covariate has a low influence. . .

If some covariates are useless, better to use a simpler model. . .

Submodels
Simplify the model through a constraint on β!
Examples:
Support: Impose that β(i) = 0 for i ∉ I.
Support size: Impose that $\|\beta\|_0 = \sum_{i=1}^d \mathbf{1}_{\beta^{(i)}\ne 0} < C$
Norm: Impose that ∥β∥p < C with 1 ≤ p (often p = 2 or p = 1)

69
Norms and Sparsity Penalization

Sparsity
β is sparse if its number of non-zero coefficients (ℓ0) is small. . .
Easy interpretation in terms of dimension/complexity.

Norm Constraint and Sparsity

Sparsest solution obtained by definition with the ℓ0 norm.

Source: Tibshirani et al.

No induced sparsity with the ℓ2 norm. . .
Sparsity with the ℓ1 norm (can even be proved to be the same as with the ℓ0
norm under some assumptions).
Geometric explanation.
70
Constraint and Penalization Penalization

Constrained Optimization
Choose a constant C.
Compute β as
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d,\ \|\beta\|_p \le C}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta))$

Lagrangian Reformulation
Choose λ and compute β as
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta)) + \lambda\|\beta\|_p^{p'}$
with p′ = p, except if p = 0 where p′ = 1.

Easier calibration. . . but no explicit model S.

Rk: ∥β∥p is not scaling invariant if p ≠ 0. . .

Initial rescaling issue.
71
Penalization Penalization

Penalized Linear Model


Minimization of
$\operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, h(x_i^\top\beta)) + \operatorname{pen}(\beta)$
where pen(β) is a (sparsity promoting) penalty
Variable selection if β is sparse.

Classical Penalties
AIC: pen(β) = λ∥β∥0 (non convex / sparsity)
Ridge: pen(β) = λ∥β∥2² (convex / no sparsity)
Lasso: pen(β) = λ∥β∥1 (convex / sparsity)
Elastic net: pen(β) = λ1∥β∥1 + λ2∥β∥2² (convex / sparsity)

Easy optimization if pen (and the loss) is convex. . .

Need to specify λ to define a ML method!
72
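A sketch of these penalties with scikit-learn (an assumption of these notes; there λ is called alpha and the loss is least squares):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [2.0, -1.0, 0.5]                 # sparse ground truth
    y = X @ beta_true + 0.1 * rng.normal(size=100)
    for model in (Ridge(alpha=1.0), Lasso(alpha=0.1),
                  ElasticNet(alpha=0.1, l1_ratio=0.5)):
        model.fit(X, y)
        # only the l1-based penalties induce sparsity
        print(type(model).__name__, "non-zero coefs:", int(np.sum(model.coef_ != 0)))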
Penalized Gen. Linear Models Penalization

Classical Examples
Penalized Least Squares
Penalized Logistic Regression
Penalized Maximum Likelihood
SVM
Tree pruning

Sometimes used even if the parametrization is not linear. . .

73
Penalization Penalization

Penalized ℓ0/1 loss (Structural Risk Minimization)

Minimization of
$\operatorname*{argmin}_{f_m,\ m\in\mathcal{M},\ f_m\in S_m}\ \frac{1}{n}\sum_{i=1}^n \ell_{0/1}(Y_i, f_m(X_i)) + \operatorname{pen}(m)$
where pen(m) is a complexity driven penalty. . .

No easy optimization here!

Classical Penalties
Finite class: $\operatorname{pen}(m) = \lambda\sqrt{\frac{\log|\mathcal{M}|}{n}}$
Finite VC dimension: $\operatorname{pen}(m) = \lambda\sqrt{\frac{d_{VC}(S_m)\log\left(\frac{en}{d_{VC}(S_m)}\right)}{n}}$

Need to specify λ!
74
Convexified Loss Penalization Penalization

Penalized convexified ℓ loss

Minimization of
$\operatorname*{argmin}_{f_m,\ m\in\mathcal{M},\ f_m\in S_m}\ \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_m(X_i)) + \operatorname{pen}(m)$
where pen(m) is a complexity driven penalty. . .

Easy optimization here!

Reuse the previous pen(m)!
Need to specify λ!
SVM case:
dVC ∼ ∥β∥² which advocates for a penalty in λ∥β∥. . .
A penalty in λ′∥β∥² is more convenient numerically and there is a correspondence
between the two problems. . .

75
Penalization and Cross-Validation Penalization

Practical Selection Methodology

Choose a penalty family penλ.
Compute a CV error for the penalty penλ for all λ ∈ Λ.
Determine λ̂, the λ minimizing the CV error.
Compute the final model with the penalty pen_λ̂.

CV selects a ML method (penalized estimation with a penalty pen_λ̂), not a
single predictor, hence the need for a final reestimation.
Why not use CV directly on a grid of models?
The grid size scales exponentially with the dimension!
If the penalized minimization is easy, it is much cheaper to compute the CV error
for all λ ∈ Λ. . .
CV performs best when the set of candidates is not too big (or is structured. . . )
76
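A sketch of this methodology for the Lasso, assuming scikit-learn (the grid, data and 5-fold choice are illustrative):

    import numpy as np
    from sklearn.linear_model import Lasso, LassoCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
    # CV error over a grid of penalty levels (lambda is called alpha here)
    cv = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5).fit(X, y)
    print("selected penalty:", cv.alpha_)
    # final reestimation on the full training set with the selected penalty
    # (LassoCV already performs this refit internally; made explicit here)
    final = Lasso(alpha=cv.alpha_).fit(X, y)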
Outline Cross Validation and Weights

1 Supervised Learning

2 Optimization Point of View

3 SVM

4 Penalization

5 Cross Validation and Weights

77
Unbalanced and Rebalanced Dataset Cross Validation and Weights

Unbalanced Class
Setting: One of the classes is much more present than the other.

Source: University of Granada

Issue: The classifier is too attracted by the majority class!

Rebalanced Dataset
Setting: Class proportions are different in the training and testing sets (stratified
sampling).
Issue: Training errors are no longer estimates of the testing errors.
78
Resampling Strategies Cross Validation and Weights

Resampling
Modify the training dataset so that the classes are more balanced.
Two flavors:

Source: Oracle
Sub-sampling, which wastes data,
Over-sampling, which needs to create new examples.
Issue: The training data is no longer representative of the testing data.
Hard to do it right!
79
Resampling Effect Cross Validation and Weights

Testing:
Testing class prob.: πt(k)
Testing error target:
$\mathbb{E}_{\pi_t}[\ell(Y, f(X))] = \sum_k \pi_t(k)\, \mathbb{E}[\ell(Y, f(X))\,|\,Y = k]$
Training:
Training class prob.: πtr(k)
Training error target:
$\mathbb{E}_{\pi_{tr}}[\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\, \mathbb{E}[\ell(Y, f(X))\,|\,Y = k]$

Implicit Testing Error Using the Training One

Amounts to using a weighted loss:
$\mathbb{E}_{\pi_{tr}}[\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k] = \sum_k \pi_t(k)\,\mathbb{E}\Big[\frac{\pi_{tr}(k)}{\pi_t(k)}\,\ell(Y, f(X))\,\Big|\,Y=k\Big] = \mathbb{E}_{\pi_t}\Big[\frac{\pi_{tr}(Y)}{\pi_t(Y)}\,\ell(Y, f(X))\Big]$
Put more weight on the less probable classes!
80
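A sketch of undoing a rebalanced training set through explicit weights (the proportions and data are illustrative assumptions; most scikit-learn estimators accept sample_weight):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    pi_t = {0: 0.9, 1: 0.1}     # hypothetical testing class proportions
    pi_tr = {0: 0.5, 1: 0.5}    # rebalanced training proportions
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1], 100)                        # balanced training labels
    X = rng.normal(size=(200, 3)) + y[:, None]        # class 1 is shifted
    w = np.array([pi_t[k] / pi_tr[k] for k in y])     # weight pi_t(Y)/pi_tr(Y)
    clf = LogisticRegression().fit(X, y, sample_weight=w)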
Weighted Loss Cross Validation and Weights

In unbalanced situations, the cost of misprediction is often not the same for all
classes (e.g. medical diagnosis, credit lending. . . ).
Much better to use this explicitly than to do blind resampling!

Weighted Loss
Weighted loss:
$\ell(Y, f(X)) \longrightarrow C(Y)\,\ell(Y, f(X))$
Weighted error target:
$\mathbb{E}[C(Y)\,\ell(Y, f(X))]$

Rk: Strong link with ℓ as C is independent of f.

Often allows one to reuse algorithms constructed for ℓ.
C may also depend on X. . .

81
Weighted Loss, ℓ0/1 loss and Bayes Classifier Cross Validation and Weights

The Bayes classifier is now:
$f^\star = \operatorname*{argmin}_f\ \mathbb{E}[C(Y)\,\ell(Y, f(X))] = \operatorname*{argmin}_f\ \mathbb{E}_X\big[\mathbb{E}_{Y|X}[C(Y)\,\ell(Y, f(X))]\big]$

Bayes Predictor
For the ℓ0/1 loss,
$f^\star(X) = \operatorname*{argmax}_k\ C(k)\, \mathbb{P}(Y = k|X)$
Same effect as a threshold modification in the binary setting!

Allows one to put more emphasis on some classes than others.

82
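A sketch of both routes with a hypothetical cost C(0)=1, C(1)=5 (the data, costs and estimator are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    y = (rng.random(300) < 0.2).astype(int)           # unbalanced classes
    X = rng.normal(size=(300, 2)) + 1.5 * y[:, None]
    # cost-sensitive fit: errors on class 1 cost 5 times more
    clf = LogisticRegression(class_weight={0: 1.0, 1: 5.0}).fit(X, y)
    # equivalent plug-in rule: argmax_k C(k) P(Y = k | X), a modified threshold
    eta = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    y_hat = (5.0 * eta >= 1.0 * (1 - eta)).astype(int)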
Linking Weights and Proportions Cross Validation and Weights

Cost and Proportions

Testing error target:
$\mathbb{E}_{\pi_t}[C_t(Y)\,\ell(Y, f(X))] = \sum_k \pi_t(k)\,C_t(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k]$
Training error target:
$\mathbb{E}_{\pi_{tr}}[C_{tr}(Y)\,\ell(Y, f(X))] = \sum_k \pi_{tr}(k)\,C_{tr}(k)\,\mathbb{E}[\ell(Y, f(X))\,|\,Y=k]$
They coincide if
$\pi_t(k)\,C_t(k) = \pi_{tr}(k)\,C_{tr}(k)$

Lots of flexibility in the choice of Ct, Ctr or πtr!

83
Combining Weights and Resampling Cross Validation and Weights

Weighted Loss and Resampling

Weighted loss: choice of a weight Ct ≠ 1.
Resampling: use a πtr ≠ πt.

Stratified sampling may be used to reduce the size of a dataset without losing a
low probability class!

Combining Weights and Resampling

Weighted loss: use Ctr = Ct as πtr = πt.
Resampling: use an implicit Ct(k) = πtr(k)/πt(k).
Combined: use Ctr(k) = Ct(k) πt(k)/πtr(k)

Most ML methods allow such weights!

84
References Cross Validation and Weights

T. Hastie, R. Tibshirani, and J. Friedman.


The Elements of Statistical Learning.
Springer Series in Statistics, 2009
M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning.
MIT Press, 2012
A. Géron.
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd ed.)
O’Reilly, 2019
Ch. Giraud.
Introduction to High-Dimensional Statistics.
CRC Press, 2014
S. Bubeck.
Convex Optimization: Algorithms and Complexity.
Now Publisher, 2015
85
Licence and Contributors

Creative Commons Attribution-ShareAlike (CC BY-SA 4.0)


You are free to:
Share: copy and redistribute the material in any medium or format
Adapt: remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in
any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the
original.
No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything
the license permits.

Contributors
Main contributor: E. Le Pennec
Contributors: S. Boucheron, A. Dieuleveut, A.K. Fermin, S. Gadat, S. Gaiffas,
A. Guilloux, Ch. Keribin, E. Matzner, M. Sangnier, E. Scornet.
86
