Class 03
Class 03
Class 03
Andrea Caponnetto
samples...
f(x)
x
suppose this is the “true” solution...
f(x)
x
... but suppose ERM gives this solution!
f(x)
x
Regularization
n
1 �
V (f (xi), yi)
n i=1
which in general – for arbitrary hypothesis space H – is
ill-posed. Ivanov regularizes by finding the function that
minimizes
n
1 �
V (f (xi), yi)
n i=1
while satisfying
�f �2
H ≤ A,
with � · �, the norm in the normed function space H
Function space
1. �f � ≥ 0 and �f � = 0 iff f = 0;
2. �f + g� ≤ �f � + �g�;
3. �αf � = |α| �f �.
�f � = max |f (t)|.
a≤t≤b
Minimize I(x)
i=1
The vectors
Ft[f ] = f (t)
|Ft[f ]| = |f (t)| ≤ M �f �H ∀f ∈ H
where � · �H is the norm in the Hilbert space of functions.
4
f(x)
0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
RKHS
K(t, x) := Kt(x)
.
Positive definite kernels
Definition
Theorem
Sketch of proof
• K is pd because
n n
cj Ktj ||2
� � �
cicj K(ti, tj ) = cicj �Kti , Ktj �K = || K ≥ 0.
i,j=1 i,j=1
Sketch of proof (cont.)
• Linear kernel
K(x, x�) = x · x�
• Gaussian kernel
�x−x� �2
K(x, x�) = e σ2 , σ>0
• Polynomial kernel
n
1 �
fS = arg min V (f (xi), yi) + λ�f �2
K
f ∈H n i=1
Smoothness
We want to separate two classes using lines and see how the magnitude
of the slope corresponds to a measure of complexity.
We will look at three examples and see that each example requires
more complicated functions, functions with greater slopes, to separate
the positive examples from negative examples.
A linear example (cont.)
1.5 1.5
1 1
0.5 0.5
f(X)
f(x)
0 0
−0.5 −0.5
−1 −1
−1.5 −1.5
−2 −2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2
x x
1.5
0.5
f(x)
−0.5
−1
−1.5
−2
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
n
1 �
fS = arg min V (f (xi), yi) + λ�f �2
K
f ∈H n i=1
V = (y − f (x))2
• Support vector machines for classification (SVMC)
V = |1 − yf (x)|+
where
(k)+ ≡ max(k, 0).
The Square Loss
5
L2 loss
0
−3 −2 −1 0 1 2 3
y−f(x)
The Hinge Loss
3.5
2.5
Hinge Loss
1.5
0.5
−3 −2 −1 0 1 2
3
y * f(x)
Existence and uniqueness of minimum
Both the squared loss and the hinge loss are convex.
V = Θ(−f (x)y),
where Θ(·) is the Heaviside step function, is not convex.
IS [f ] + λ�f �2
K,
can be represented by the expression
n
�
fS (x) = ciK(xi, x),
i=1
for some n-tuple (c1, . . . , cn) ∈ IRn.
H0 = span({Kxi }i=1,...,n)
⊥
From the reproducing property of H, ∀f ∈ H0
� � �
�f, ciKxi �K = ci�f, Kxi �K = cif (xi) = 0.
i i i
Since by orthogonality
IS [f0] + λ�f0�2
K ≤ IS [f 0 + f ⊥ ] + λ�f + f ⊥ �2 .
0 0 0 K
min g(c)
c∈IRn
for a suitable function g(·).