ML Opt
E. Le Pennec
Outline
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Outline
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Supervised Learning
Risk function
Risk measured as the average loss on a new pair (X, Y):
R(f) = E_{(X,Y)∼P}[ℓ(Y, f(X))]
Examples:
Prediction loss: E[ℓ(Y, f(X))] = P(Y ≠ f(X))
Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|²]
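In practice, the risk is estimated by an empirical average on data. A minimal base-R sketch of both losses on simulated data (everything here is hypothetical, for illustration only):

    set.seed(42)
    n <- 200
    x <- runif(n)
    y_class <- ifelse(x + rnorm(n, sd = 0.3) > 0.5, 1, -1)  # binary labels
    y_num   <- 2 * x + rnorm(n, sd = 0.3)                   # numeric response

    f_class <- function(x) ifelse(x > 0.5, 1, -1)  # a candidate classifier
    f_reg   <- function(x) 2 * x                   # a candidate regressor

    mean(y_class != f_class(x))   # empirical prediction risk: P(Y != f(X))
    mean((y_num - f_reg(x))^2)    # empirical quadratic risk: E|Y - f(X)|^2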
Goal
Machine Learning
Learn a rule to construct a predictor f̂ ∈ F from the training data Dn such that
the risk R(f̂) is small on average or with high probability with respect to Dn.
Examples:
Linear regression
Linear discrimination with
S = {x ↦ sign(x⊤β + β⁽⁰⁾) : β ∈ R^d, β⁽⁰⁾ ∈ R}
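Such a linear rule is a one-liner in R; a sketch with arbitrary, hypothetical coefficients (β, β⁽⁰⁾):

    # Linear discrimination: predict sign(x' beta + beta0)
    predict_linear <- function(X, beta, beta0) {
      ifelse(as.matrix(X) %*% beta + beta0 > 0, 1, -1)
    }
    X <- matrix(c(0.2, 0.7, 0.5, 0.1), ncol = 2)  # two points, two features
    predict_linear(X, beta = c(1, -1), beta0 = 0)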
Example: TwoClass Dataset
Synthetic Dataset
Two features/covariates.
Two classes.
Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer
Numerical experiments with R and the caret package.
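A sketch of how such an experiment can be reproduced (assuming the AppliedPredictiveModeling package exposes the dataset as twoClassData with objects predictors and classes; names worth double-checking):

    library(AppliedPredictiveModeling)  # data from Kuhn & Johnson's book
    library(caret)

    data(twoClassData)        # loads 'predictors' (2 features) and 'classes'
    ctrl <- trainControl(method = "cv", number = 10)
    fit  <- train(x = predictors, y = classes,
                  method = "glm", trControl = ctrl)  # logistic baseline
    fit$results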
Example: Linear Discrimination
Example: More Complex Model
Under-fitting / Over-fitting Issue
Source: A. Ng
Which is better: a simple or a complex model?
Too simple to be good? Too complex to be learned?
Under-fitting / Over-fitting Issue
Under-fitting / Over-fitting
Source: Unknown
Under-fitting: the model is too simple to capture the underlying structure.
Over-fitting: the model is too complex and captures specifics of the training set.
Bias-Variance Dilemma
General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance):
R(f̂_S) − R(f⋆) = [R(f⋆_S) − R(f⋆)] + [R(f̂_S) − R(f⋆_S)]
Under-fitting / Over-fitting Issue
Source: Unknown
Bias-variance trade-off ⇐⇒ avoiding both overfitting and underfitting.
Rk: Better to think in terms of method (including feature engineering and the
specific algorithm) rather than only of model.
Theoretical Analysis
Binary Classification Loss Issue
Source: A. Fermin
Bayes classifier: f⋆(X) = +1 if η(X) = P(Y = 1|X) ≥ 1/2 and −1 otherwise.
The ℓ0/1 loss is not convex: direct empirical risk minimization is intractable.
Loss Convexification
Problems
How to choose S?
How to compute the minimization?
Outline
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Probabilistic and Optimization Framework
Problems
How to choose S?
How to compute the minimization?
Three Classical Methods in a Nutshell
Deep Learning
Let fθ(X) with f a feed-forward neural network outputting two values, with a
softmax layer as the last layer.
Optimize by gradient descent the cross-entropy −(1/n) ∑_{i=1}^n log fθ(Xᵢ)(Yᵢ)
Classify using sign(fθ̂)
Those three methods rely on a similar heuristic: the optimization point of view!
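As a toy version of this recipe, a base-R gradient descent on the cross-entropy of a plain logistic model (a minimal sketch on simulated data, not the slides' network):

    set.seed(1)
    n <- 100
    X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))   # intercept + 2 features
    beta_true <- c(0, 2, -2)
    Y <- rbinom(n, 1, plogis(X %*% beta_true))      # labels in {0, 1}

    sigmoid <- function(z) 1 / (1 + exp(-z))
    beta <- rep(0, 3)
    for (t in 1:500) {
      p <- sigmoid(X %*% beta)
      grad <- t(X) %*% (p - Y) / n   # gradient of the average cross-entropy
      beta <- beta - 0.5 * grad      # fixed step size
    }
    beta                             # should approach beta_true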
Empirical Risk Minimization
Convexification Strategy
Risk Convexification
Replace the loss ℓ(Y, fθ(X)) by a convex upper bound ℓ′(Y, fθ(X)) (surrogate
loss).
Minimize the average of the surrogate empirical loss:
f̃ = fθ̂ = argmin_{fθ, θ∈Θ} (1/n) ∑_{i=1}^n ℓ′(Yᵢ, fθ(Xᵢ))
Use f̂ = sign(f̃)
Convexification
Replace the loss ℓ0/1(Y, f(X)) by
ℓ′(Y, f(X)) = l(Y f(X))
with l a convex function.
Further mild assumptions: l is decreasing, differentiable at 0 and l′(0) < 0.
Classification Loss and Convexification
Classical convexifications:
Logistic loss: ℓ′(Y, f(X)) = log(1 + e^{−Y f(X)}) (Logistic / NN)
Hinge loss: ℓ′(Y, f(X)) = (1 − Y f(X))₊ (SVM)
Exponential loss: ℓ′(Y, f(X)) = e^{−Y f(X)} (Boosting...)
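These surrogates are easy to compare as functions of the margin Y f(X); a base-R sketch:

    margin <- seq(-2, 2, by = 0.01)          # m = Y * f(X)
    zero_one    <- as.numeric(margin < 0)    # 0/1 loss in terms of the margin
    logistic    <- log(1 + exp(-margin))
    hinge       <- pmax(1 - margin, 0)
    exponential <- exp(-margin)

    matplot(margin, cbind(zero_one, logistic, hinge, exponential),
            type = "l", lty = 1, xlab = "margin Y f(X)", ylab = "loss")
    legend("topright", c("0/1", "logistic", "hinge", "exponential"),
           col = 1:4, lty = 1)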
Properties
Prop: Ψ(E[ℓ0/1(Y, sign(f(X)))] − E[ℓ0/1(Y, f⋆(X))])
≤ E[ℓ′(Y, f(X))] − E[ℓ′(Y, f⋆(X))]
Theoretical guarantee!
Proof
By definition,
E[l(Y f)|X] = η(X) l(f) + (1 − η(X)) l(−f)
Let H(f, η) = η l(f) + (1 − η) l(−f); the optimal value f̃ satisfies
∂H(f̃, η) = η ∂l(f̃) − (1 − η) ∂l(−f̃) ∋ 0.
With a slight abuse of notation, we denote by ∂l(f̃) and ∂l(−f̃) the two subgradients such
that
η ∂l(f̃) − (1 − η) ∂l(−f̃) = 0
Now we discuss the sign of f̃:
If f̃ > 0, ∂l(−f̃) < ∂l(f̃) and thus η > (1 − η), i.e. 2η − 1 > 0.
Conversely, if f̃ < 0 then 2η − 1 < 0.
Thus sign(f̃) = sign(2η − 1), i.e. the minimizer of E[l(Y f)|X] is f⋆(X) = sign(2η(X) − 1).
Proof (continued)
Proof (continued)
Furthermore,
E[ℓ′(Y, f(X))] = E_X[H(f(X), η(X))]
E[ℓ′(Y, f⋆(X))] = E_X[H(η(X))] with H(η) = min_f H(f, η)
We then define
Ψ(θ) = l(0) − H((1 + θ)/2)
which is thus a convex function satisfying Ψ(0) = 0 and Ψ(θ) > 0 for θ > 0.
Proof (continued)
Recall that
E[ℓ0/1(Y, sign(f(X)))] − E[ℓ0/1(Y, f⋆(X))] = E_X[|2η(X) − 1| 1_{f⋆(X) ≠ sign(f(X))}]
Using Jensen's inequality, we derive
Ψ(E[ℓ0/1(Y, sign(f(X)))] − E[ℓ0/1(Y, f⋆(X))])
≤ E_X[Ψ(|2η(X) − 1| 1_{f⋆(X) ≠ sign(f(X))})]
Proof (continued)
Logistic Revisited
Ideal solution:
f̂ = argmin_{f∈S} (1/n) ∑_{i=1}^n ℓ0/1(Yᵢ, f(Xᵢ))
Logistic regression
Use f(X) = X⊤β + β⁽⁰⁾.
Use the logistic loss ℓ(y, f) = log₂(1 + e^{−yf}), i.e. the negative log-likelihood.
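In R this is the classical glm fit; a sketch on simulated data with two features:

    set.seed(2)
    df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    df$y <- factor(ifelse(df$x1 - df$x2 + rnorm(100) > 0, "pos", "neg"))

    fit <- glm(y ~ x1 + x2, family = binomial, data = df)  # logistic loss
    coef(fit)                                              # (beta0, beta)
    pred <- ifelse(predict(fit, type = "response") > 0.5, "pos", "neg")
    mean(pred != df$y)                                     # training 0/1 error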
Logistic Revisited
Outline
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Ideal Separable Case
Separable SVM
Constrained optimization formulation:
min_{β,β⁽⁰⁾} (1/2)∥β∥² with ∀i, Yᵢ(Xᵢ⊤β + β⁽⁰⁾) ≥ 1
Non Separable Case
SVM relaxation
Relax the constraints
∀i, Yᵢ(Xᵢ⊤β + β⁽⁰⁾) ≥ 1 to ∀i, Yᵢ(Xᵢ⊤β + β⁽⁰⁾) ≥ 1 − sᵢ
with the slack variables sᵢ ≥ 0
SVM
Constrained optimization formulation:
min_{β,β⁽⁰⁾,s} (1/2)∥β∥² + C ∑_{i=1}^n sᵢ with ∀i, Yᵢ(Xᵢ⊤β + β⁽⁰⁾) ≥ 1 − sᵢ and ∀i, sᵢ ≥ 0
Convex relaxation:
argmin_{β,β⁽⁰⁾} (1/2)∥β∥² + C ∑_{i=1}^n max(1 − Yᵢ(Xᵢ⊤β + β⁽⁰⁾), 0)
= argmin_{β,β⁽⁰⁾} (1/n) ∑_{i=1}^n max(1 − Yᵢ(Xᵢ⊤β + β⁽⁰⁾), 0) + (1/(Cn)) (1/2)∥β∥²
Prop: ℓ0/1(Yᵢ, sign(Xᵢ⊤β + β⁽⁰⁾)) ≤ max(1 − Yᵢ(Xᵢ⊤β + β⁽⁰⁾), 0)
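A sketch of fitting this soft-margin linear SVM in R (assuming the e1071 package; its cost argument plays the role of C):

    library(e1071)
    set.seed(3)
    df <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
    df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(60, sd = 0.5) > 0, 1, -1))

    fit <- svm(y ~ x1 + x2, data = df, kernel = "linear", cost = 1,
               scale = FALSE)
    table(fitted(fit), df$y)  # training confusion matrix
    # Smaller cost C allows more slack: larger margin, simpler boundary.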
SVM
The Kernel Trick
Source: Unknown
Examples: Polynomial kernel k(X, X′) = (1 + X⊤X′)^d, Gaussian kernel
k(X, X′) = e^{−∥X−X′∥²/2},...
RKHS setting!
Can be used in (logistic) regression and more...
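Both kernels are one-liners in base R; a minimal sketch:

    poly_kernel  <- function(x, y, d = 2) (1 + sum(x * y))^d
    gauss_kernel <- function(x, y) exp(-sum((x - y)^2) / 2)

    x <- c(1, 2); y <- c(0, 1)
    poly_kernel(x, y)   # (1 + x'y)^2 = (1 + 2)^2 = 9
    gauss_kernel(x, y)  # exp(-((1)^2 + (1)^2)/2) = exp(-1)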
SVM
Constrained Minimization
Constrained Minimization
Goal:
min_x f(x) with hⱼ(x) = 0, j = 1,...,p and gᵢ(x) ≤ 0, i = 1,...,q
or rather with argmin!
Different settings:
f, hⱼ, gᵢ differentiable
f convex, hⱼ affine and gᵢ convex.
Feasibility
x is feasible if hⱼ(x) = 0 and gᵢ(x) ≤ 0.
Rk: The set of feasible points may be empty.
Lagrangian
Constrained Minimization
Goal:
p⋆ = min_x f(x) with hⱼ(x) = 0, j = 1,...,p and gᵢ(x) ≤ 0, i = 1,...,q
Lagrangian
Def:
L(x, λ, μ) = f(x) + ∑_{j=1}^p λⱼ hⱼ(x) + ∑_{i=1}^q μᵢ gᵢ(x)
with λ ∈ R^p and μ ∈ (R₊)^q.
The λⱼ and μᵢ are called the dual (or Lagrange) variables.
Prop:
max_{λ∈R^p, μ∈(R₊)^q} L(x, λ, μ) = f(x) if x is feasible, +∞ otherwise
min_x max_{λ∈R^p, μ∈(R₊)^q} L(x, λ, μ) = p⋆
Lagrangian Dual
Lagrangian
Def:
L(x, λ, μ) = f(x) + ∑_{j=1}^p λⱼ hⱼ(x) + ∑_{i=1}^q μᵢ gᵢ(x)
with λ ∈ R^p and μ ∈ (R₊)^q.
Lagrangian Dual
Lagrangian dual function:
Q(λ, μ) = min_x L(x, λ, μ)
Prop:
Q(λ, μ) ≤ f(x), for all feasible x
max_{λ∈R^p, μ∈(R₊)^q} Q(λ, μ) ≤ min_{x feasible} f(x)
Duality
Primal
Primal:
p⋆ = min_{x∈X} f(x) with hⱼ(x) = 0, j = 1,...,p and gᵢ(x) ≤ 0, i = 1,...,q
Dual
Dual:
q⋆ = max_{λ∈R^p, μ∈(R₊)^q} Q(λ, μ) = max_{λ∈R^p, μ∈(R₊)^q} min_x L(x, λ, μ)
Duality
Always weak duality:
q⋆ ≤ p⋆
max_{λ∈R^p, μ∈(R₊)^q} min_x L(x, λ, μ) ≤ min_x max_{λ∈R^p, μ∈(R₊)^q} L(x, λ, μ)
Not always strong duality: q⋆ = p⋆.
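A one-dimensional worked example, not from the slides, may help fix ideas: minimize f(x) = x² under the single constraint g(x) = 1 − x ≤ 0.

    % Primal: p* = min { x^2 : 1 - x <= 0 } = 1, attained at x = 1.
    \begin{align*}
    L(x, \mu) &= x^2 + \mu (1 - x), \qquad \mu \ge 0 \\
    Q(\mu) &= \min_x L(x, \mu) = \mu - \frac{\mu^2}{4}
      \quad \text{(minimum at } x = \mu/2 \text{)} \\
    q^* &= \max_{\mu \ge 0} Q(\mu) = Q(2) = 1 = p^*
    \end{align*}
    % Strong duality holds (convex problem, Slater's condition); the KKT
    % conditions of the next slide are satisfied at x* = 1, mu* = 2.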
Strong Duality
Strong Duality
Strong duality:
q⋆ = p⋆
max_{λ∈R^p, μ∈(R₊)^q} min_x L(x, λ, μ) = min_x max_{λ∈R^p, μ∈(R₊)^q} L(x, λ, μ)
Allows computing the solution of one problem from the other.
Requires some assumptions!
Karush-Kuhn-Tucker Conditions
Stationarity:
∇_x L(x⋆, λ, μ) = ∇f(x⋆) + ∑ⱼ λⱼ ∇hⱼ(x⋆) + ∑ᵢ μᵢ ∇gᵢ(x⋆) = 0
Primal admissibility:
hⱼ(x⋆) = 0 and gᵢ(x⋆) ≤ 0
Dual admissibility:
μᵢ ≥ 0
Complementary slackness:
μᵢ gᵢ(x⋆) = 0
KKT Theorem
If f is convex, the hⱼ are affine, the gᵢ are convex, all are differentiable and
strong duality holds, then x⋆ is a solution of the primal problem if and only if
the KKT conditions hold.
SVM and Lagrangian
SVM
Constrained optimization formulation:
min_{β,β⁽⁰⁾,s} (1/2)∥β∥² + C ∑_{i=1}^n sᵢ with ∀i, Yᵢ(Xᵢ⊤β + β⁽⁰⁾) ≥ 1 − sᵢ and ∀i, sᵢ ≥ 0
SVM Lagrangian
Lagrangian:
L(β, β⁽⁰⁾, s, α, μ) = (1/2)∥β∥² + C ∑_{i=1}^n sᵢ
+ ∑ᵢ αᵢ(1 − sᵢ − Yᵢ(Xᵢ⊤β + β⁽⁰⁾)) − ∑ᵢ μᵢ sᵢ
SVM and KKT
Consequence
β⋆ = ∑ᵢ αᵢ Yᵢ Xᵢ and 0 ≤ αᵢ ≤ C.
If αᵢ ≠ 0, Xᵢ is called a support vector and either
sᵢ = 0 and Yᵢ(Xᵢ⊤β + β⁽⁰⁾) = 1 (margin hyperplane),
or αᵢ = C (outliers).
β⁽⁰⁾⋆ = Yᵢ − Xᵢ⊤β⋆ for any support vector with 0 < αᵢ < C.
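e1071 exposes these quantities on the fitted object; a sketch (slot names as documented in e1071, but worth double-checking):

    library(e1071)
    set.seed(4)
    df <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
    df$y <- factor(ifelse(df$x1 + df$x2 > 0, 1, -1))

    fit <- svm(y ~ ., data = df, kernel = "linear", cost = 1, scale = FALSE)
    nrow(fit$SV)    # number of support vectors (alpha_i != 0)
    head(fit$coefs) # alpha_i * Y_i for each support vector
    fit$rho         # intercept term (e1071 stores -beta0 here)
    beta <- t(fit$SV) %*% fit$coefs  # beta* as a combination of support vectors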
SVM Dual
The optimal β⋆ is a linear combination of the input points: β⋆ = ∑_{i=1}^n αᵢ′ Xᵢ.
Minimization problem in α′:
∑_{i=1}^n ℓ(Yᵢ, ∑ⱼ αⱼ′ Xᵢ⊤Xⱼ + β⁽⁰⁾) + Φ(∥β∥²)
involving only the scalar products of the data.
The optimal predictor also requires only scalar products:
f̂(X) = X⊤β⋆ + β⁽⁰⁾,⋆ = ∑ᵢ αᵢ′ Xᵢ⊤X + β⁽⁰⁾,⋆
Transforms a problem in dimension dim(X) into a problem in dimension n.
Direct minimization in β can be more efficient...
Feature Map
Feature Engineering
Art of creating new features from the existing ones X.
Example: add monomials (X⁽ʲ⁾)², X⁽ʲ⁾X⁽ʲ′⁾...
Adding features increases the dimension.
Feature Map
Map ϕ : X → H with H a Hilbert space.
Source: Unknown
Linear decision boundary in H: ϕ(X)⊤β + β⁽⁰⁾ = 0 is not a hyperplane
anymore in X.
Many algorithms (e.g. SVM) only require the ability to compute the scalar
product ϕ(X)⊤ϕ(X′).
Kernel
Any map
k : X × X → R
is called a kernel over X.
Kernel Trick
Computing the kernel k(X, X′) = ϕ(X)⊤ϕ(X′) directly may be easier than
computing ϕ(X), ϕ(X′) and then the scalar product.
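The trick can be checked numerically: in dimension 2 the degree-2 polynomial kernel corresponds to an explicit 6-dimensional feature map. A base-R sketch:

    # Explicit feature map for k(x, y) = (1 + x'y)^2 in dimension 2
    phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                         x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

    x <- c(1, 2); y <- c(3, -1)
    sum(phi(x) * phi(y))  # scalar product in the 6-dimensional feature space
    (1 + sum(x * y))^2    # same value via the kernel, no feature map needed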
Reproducing Kernel Hilbert Space
Mercer Theorem
For any PDS kernel k : X × X → R, there exists a Hilbert space H ⊂ R^X with a
scalar product ⟨·,·⟩_H such that
there exists a mapping ϕ : X → H satisfying
k(X, X′) = ⟨ϕ(X), ϕ(X′)⟩_H
the reproducing property holds, i.e. for any h ∈ H and any X ∈ X,
h(X) = ⟨h, k(X, ·)⟩_H.
Separable Kernel
For any function Ψ : X → R, k(X, X′) = Ψ(X)Ψ(X′) is PDS.
Kernel Stability
For any PDS kernels k₁ and k₂, k₁ + k₂ and k₁k₂ are PDS kernels.
For any sequence of PDS kernels kₙ converging pointwise to a kernel k, k is a
PDS kernel.
For any PDS kernel k such that |k| ≤ r and any power series ∑ₙ aₙ zⁿ with aₙ ≥ 0
and convergence radius larger than r, ∑ₙ aₙ kⁿ is a PDS kernel.
PDS Kernels
Vanilla kernel:
k(X, X′) = X⊤X′
Polynomial kernel:
k(X, X′) = (1 + X⊤X′)^d
Gaussian RBF kernel:
k(X, X′) = exp(−γ∥X − X′∥²)
Tanh kernel:
k(X, X′) = tanh(aX⊤X′ + b)
Representer Theorem
Representer Theorem
Let k be a PDS kernel and H its corresponding RKHS;
for any increasing function Φ and any function L : Rⁿ → R, the optimization
problem
argmin_{h∈H} L(h(X₁), ..., h(Xₙ)) + Φ(∥h∥)
admits only solutions of the form
∑_{i=1}^n αᵢ′ k(Xᵢ, ·).
Examples:
(kernelized) SVM
(kernelized) Penalized Logistic Regression (Ridge)
(kernelized) Penalized Regression (Ridge)
Kernelized SVM
Primal
Constrained optimization:
min_{f∈H, β⁽⁰⁾, s} ∥f∥²_H + C ∑_{i=1}^n sᵢ with ∀i, Yᵢ(f(Xᵢ) + β⁽⁰⁾) ≥ 1 − sᵢ and ∀i, sᵢ ≥ 0
Hinge loss:
min_{f∈H, β⁽⁰⁾} ∥f∥²_H + C ∑_{i=1}^n max(0, 1 − Yᵢ(f(Xᵢ) + β⁽⁰⁾))
Representer:
min_{α′, β⁽⁰⁾} ∑_{i,j} αᵢ′ αⱼ′ k(Xᵢ, Xⱼ) + C ∑_{i=1}^n max(0, 1 − Yᵢ(∑ⱼ αⱼ′ k(Xⱼ, Xᵢ) + β⁽⁰⁾))
Dual
Dual:
max_{α≥0, μ≥0} Q(α, μ) ⇔ max_{0≤α≤C} ∑ᵢ αᵢ − (1/2) ∑_{i,j} αᵢ αⱼ Yᵢ Yⱼ k(Xᵢ, Xⱼ)
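In R only the kernel argument changes; a sketch with e1071 and the Gaussian RBF kernel (gamma as in the kernel definition above):

    library(e1071)
    set.seed(5)
    df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    df$y <- factor(ifelse(df$x1^2 + df$x2^2 < 1, 1, -1))  # non-linear boundary

    fit <- svm(y ~ ., data = df, kernel = "radial", gamma = 1, cost = 1)
    mean(fitted(fit) != df$y)  # training error; a linear kernel would do worse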
SVM
Outline
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Bias-Variance Dilemma
General setting:
F = {measurable functions X → Y}
Best solution: f⋆ = argmin_{f∈F} R(f)
Class S ⊂ F of functions
Ideal target in S: f⋆_S = argmin_{f∈S} R(f)
Estimate in S: f̂_S obtained with some procedure
Approximation error and estimation error (Bias/Variance):
R(f̂_S) − R(f⋆) = [R(f⋆_S) − R(f⋆)] + [R(f̂_S) − R(f⋆_S)]
Under-fitting / Over-fitting Issue
Source: Unknown
Bias-variance trade-off ⇐⇒ avoiding both overfitting and underfitting.
Rk: Better to think in terms of method (including feature engineering and the
specific algorithm) rather than only of model.
Simplified Models
Bias-Variance Issue
The most complex models may not be the best ones due to the variability of the
estimate.
Norms and Sparsity
Sparsity
β is sparse if its number of non-zero coefficients (ℓ0 norm) is small...
Easy interpretation in terms of dimension/complexity.
Constrained Optimization
Choose a constant C.
Compute β as
argmin_{β∈R^d, ∥β∥ₚ≤C} (1/n) ∑_{i=1}^n ℓ(Yᵢ, h(xᵢ⊤β))
Lagrangian Reformulation
Choose λ and compute β as
argmin_{β∈R^d} (1/n) ∑_{i=1}^n ℓ(Yᵢ, h(xᵢ⊤β)) + λ∥β∥ₚ^{p′}
with p′ = p, except if p = 0 where p′ = 1.
Classical Penalties
AIC: pen(β) = λ∥β∥₀ (non convex / sparsity)
Ridge: pen(β) = λ∥β∥₂² (convex / no sparsity)
Lasso: pen(β) = λ∥β∥₁ (convex / sparsity)
Elastic net: pen(β) = λ₁∥β∥₁ + λ₂∥β∥₂² (convex / sparsity)
Classical Examples
Penalized Least Squares
Penalized Logistic Regression
Penalized Maximum Likelihood
SVM
Tree pruning
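Ridge, lasso and elastic net are available in R through the glmnet package (its alpha argument mixes the ℓ1 and ℓ2² penalties); a sketch:

    library(glmnet)
    set.seed(6)
    n <- 100; d <- 20
    X <- matrix(rnorm(n * d), n, d)
    beta <- c(3, -2, rep(0, d - 2))        # sparse ground truth
    y <- X %*% beta + rnorm(n)

    fit_ridge <- glmnet(X, y, alpha = 0)   # ridge: lambda * ||beta||_2^2
    fit_lasso <- glmnet(X, y, alpha = 1)   # lasso: lambda * ||beta||_1
    coef(fit_lasso, s = 0.1)               # sparse coefficients at lambda = 0.1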
Penalization
Need to specify λ!
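λ is usually chosen by cross-validation; a sketch with cv.glmnet, continuing the previous example:

    cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # 10-fold CV on a lambda grid
    cv$lambda.min               # lambda minimizing the CV error
    cv$lambda.1se               # largest lambda within 1 SE of the minimum
    coef(cv, s = "lambda.1se")  # the usual, more parsimonious choice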
Convexified Loss Penalization
Penalization and Cross-Validation
1 Supervised Learning
2 Optimization Point of View
3 SVM
4 Penalization
Unbalanced and Rebalanced Dataset
Unbalanced Classes
Setting: One of the classes is much more present than the other.
Rebalanced Dataset
Setting: Class proportions differ between the training and testing sets (stratified
sampling).
Issue: Training errors are no longer estimates of testing errors.
Resampling Strategies
Resampling
Modify the training dataset so that the classes are more balanced.
Two flavors:
Source: Oracle
Sub-sampling, which wastes data,
Over-sampling, which needs to create new examples.
Issue: The training data is no longer representative of the testing data.
Hard to do it right!
Resampling Effect
Testing: class prob. πt(k), error target Eπt[ℓ(Y, f(X))] = ∑ₖ πt(k) E[ℓ(Y, f(X))|Y = k]
Training: class prob. πtr(k), error target Eπtr[ℓ(Y, f(X))] = ∑ₖ πtr(k) E[ℓ(Y, f(X))|Y = k]
In unbalanced situations, the cost of misprediction is often not the same for all
classes (e.g. medical diagnosis, credit lending...).
Much better to use this explicitly than to do blind resampling!
Weighted Loss
Weighted loss:
ℓ(Y, f(X)) → C(Y) ℓ(Y, f(X))
Weighted error target:
E[C(Y) ℓ(Y, f(X))]
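Most R fitters accept such weights directly; a sketch with class weights in e1071::svm and observation weights in glm (the weight values are illustrative):

    library(e1071)
    set.seed(7)
    df <- data.frame(x = rnorm(200))
    df$y <- factor(ifelse(df$x + rnorm(200) > 1.5, "rare", "common"))

    # Class weights C(k): errors on the rare class cost 10 times more
    fit_svm <- svm(y ~ x, data = df, kernel = "linear",
                   class.weights = c(common = 1, rare = 10))

    # Same idea with observation weights in logistic regression
    w <- ifelse(df$y == "rare", 10, 1)
    fit_glm <- glm(y ~ x, family = binomial, data = df, weights = w)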
Weighted Loss, ℓ0/1 Loss and Bayes Classifier
Bayes Predictor
For the ℓ0/1 loss,
f⋆(X) = argmax_k C(k) P(Y = k|X)
Same effect as a threshold modification in the binary setting!
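In the binary case this amounts to thresholding the estimated posterior at C(−1)/(C(1) + C(−1)) instead of 1/2; a sketch reusing the simulated data above, with an unweighted fit:

    fit0 <- glm(y ~ x, family = binomial, data = df)  # unweighted fit
    eta <- predict(fit0, type = "response")           # estimated P(Y = "rare"|X)
    C_rare <- 10; C_common <- 1
    pred <- ifelse(eta > C_common / (C_common + C_rare), "rare", "common")
    table(pred, df$y)  # threshold 1/11 instead of 1/2 favors the rare class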
Linking Weights and Proportions
Combining Weights and Resampling
Stratified sampling may be used to reduce the size of a dataset without losing a
low-probability class!
References
Contributors
Main contributor: E. Le Pennec
Contributors: S. Boucheron, A. Dieuleveut, A.K. Fermin, S. Gadat, S. Gaiffas,
A. Guilloux, Ch. Keribin, E. Matzner, M. Sangnier, E. Scornet.