Perceptron and SVM, Module 5, Part 2 (October 2023)


Term Paper, Homework

Deadline for Term Paper Phase I submission: October 31, 2023, 11:30 pm
• Only Fashion-MNIST may be used as the numerical dataset (no MNIST)

Once groups are formed, it is mandatory to create a Slack sub-channel for
each group: identify a group leader, have them create the channel, and add
the instructor and TAs.

Deadline for creating a Slack subchannel: October 31, 2023

Homework 5 is released and is due: November 1, 2023

• This is an intense HW; get started!
• Also, start on your term paper project :)

1
Perceptron Analysis: Linearly Separable Case

<latexit sha1_base64="Ak/gyvoP6uJcC1BzlHM7mG9Txcs=">AAAEPnicbVNNb9NAEHUTPkr4auHIZUQNalAU4giVUqlSBRzgUpUq/ZDqNFqv1/Eq612zXreJLP8yLvwGbhy5cAAhrhyZdVKatuzBGr/xzL73Zhykgmem0/m6UKtfu37j5uKtxu07d+/dX1p+sJ+pXFO2R5VQ+jAgGRNcsj3DjWCHqWYkCQQ7CEZvbP7ghOmMK9kzk5T1EzKUPOKUGIQGy7WeLxWXIZMGfMPGJoiKXsyUZgmsvhaKjsB/CtvqhI9UFLXAe7XWbW6U8D4CEzMwmnDJ5RBCYgj4Prh+QkxMiSjelpt+sVq9Ys9xOfBak4HXbM1DXYS6zZaLZSJUJruQ3MbkdtMvXeDZP25WJ9FiAhlLiSYos8RrT7mJISF6yCW2GpIkIS4EEyCQS25AKp1AjOp1KohkM5LY7bQ8frYBvthn2sxjvrYIbHqu1VR1d4PNjtuyomWl/IwQNqUsNVrJczOokuj5kGVAIsN0OTUm0oQWu8fdspgyxMgFprXSGYS5toVnHVpwGjNtiV6gNi5nxHzBYNeFSGkgQpzrGZduGy9rNM6H2jsjLPMkYBpUBAmuFRkhu4SEzNpk00paa2FODpLGJ5HodU5jNPxjziRldhqByrF7aGvdStLzOU1taAyWVjrtTnXgauDNghVndnYGS1/8UNE8QcZUkCw78jqp6RdEG05xxA0/t/OmIzJkRxhKkrCsX1TrX8ITRMLKjEih4gqdryhIkmWTJMAvrU/Z5ZwF/5c7yk203i+4THODyqcXRbkAo8D+SxByzajBZQw5oZojV6AxbiXFoWfWBO+y5KvBfrftrbVffOiubK3P7Fh0HjmPnVXHc146W847Z8fZc2jtU+1b7UftZ/1z/Xv9V/339NPawqzmoXPh1P/8BcrLY88=</latexit>

Theorem (Block and Novikoff, 1962): If the training data
$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ is linearly separable
with margin $\gamma$ by a unit-norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$,
then perceptron training converges after at most $R^2/\gamma^2$ errors during
training, where $\|x\| \le R$ for all $x$.

That is, the number of mistakes made by the online perceptron on any such
sequence is bounded by $R^2/\gamma^2$.

Proof details:
1) Visual: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
2) Chapter 9, Section 9.1.2 of Understanding Machine Learning, Shalev-Shwartz and Ben-David
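To make the mistake bound concrete, here is a minimal sketch of the online perceptron that the theorem analyzes (with b = 0), in NumPy; the synthetic data, variable names, and epoch cap are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Online perceptron with b = 0. X: (N, d) with ||x_i|| <= R, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        errors_this_pass = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:          # mistake: apply the perceptron update
                w += yi * xi
                mistakes += 1
                errors_this_pass += 1
        if errors_this_pass == 0:           # one clean pass: training has converged
            break
    return w, mistakes

# Toy separable data; the theorem bounds `mistakes` by R^2 / gamma^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, 2.0]) > 0, 1, -1)   # separable by w* proportional to (1, 2)
w_hat, n_mistakes = perceptron(X, y)
print(n_mistakes, w_hat)
```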
2
Perceptron Analysis: Linearly Separable Case

<latexit sha1_base64="Ak/gyvoP6uJcC1BzlHM7mG9Txcs=">AAAEPnicbVNNb9NAEHUTPkr4auHIZUQNalAU4giVUqlSBRzgUpUq/ZDqNFqv1/Eq612zXreJLP8yLvwGbhy5cAAhrhyZdVKatuzBGr/xzL73Zhykgmem0/m6UKtfu37j5uKtxu07d+/dX1p+sJ+pXFO2R5VQ+jAgGRNcsj3DjWCHqWYkCQQ7CEZvbP7ghOmMK9kzk5T1EzKUPOKUGIQGy7WeLxWXIZMGfMPGJoiKXsyUZgmsvhaKjsB/CtvqhI9UFLXAe7XWbW6U8D4CEzMwmnDJ5RBCYgj4Prh+QkxMiSjelpt+sVq9Ys9xOfBak4HXbM1DXYS6zZaLZSJUJruQ3MbkdtMvXeDZP25WJ9FiAhlLiSYos8RrT7mJISF6yCW2GpIkIS4EEyCQS25AKp1AjOp1KohkM5LY7bQ8frYBvthn2sxjvrYIbHqu1VR1d4PNjtuyomWl/IwQNqUsNVrJczOokuj5kGVAIsN0OTUm0oQWu8fdspgyxMgFprXSGYS5toVnHVpwGjNtiV6gNi5nxHzBYNeFSGkgQpzrGZduGy9rNM6H2jsjLPMkYBpUBAmuFRkhu4SEzNpk00paa2FODpLGJ5HodU5jNPxjziRldhqByrF7aGvdStLzOU1taAyWVjrtTnXgauDNghVndnYGS1/8UNE8QcZUkCw78jqp6RdEG05xxA0/t/OmIzJkRxhKkrCsX1TrX8ITRMLKjEih4gqdryhIkmWTJMAvrU/Z5ZwF/5c7yk203i+4THODyqcXRbkAo8D+SxByzajBZQw5oZojV6AxbiXFoWfWBO+y5KvBfrftrbVffOiubK3P7Fh0HjmPnVXHc146W847Z8fZc2jtU+1b7UftZ/1z/Xv9V/339NPawqzmoXPh1P/8BcrLY88=</latexit>

Theorem (Block and Novikoff, 1962): If the training data
$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ is linearly separable
with margin $\gamma$ by a unit-norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$,
then perceptron training converges after at most $R^2/\gamma^2$ errors during
training, where $\|x\| \le R$ for all $x$.

That is, the number of mistakes made by the online perceptron on any such
sequence is bounded by $R^2/\gamma^2$.

■ How does the guarantee on the number of mistakes (and therefore on the
  number of updates) depend on $\gamma$?

3
Beyond Linearly Separable Case
■ The perceptron algorithm is super cool!
  No assumption about the data distribution!
■ The data could even be generated by an oblivious adversary; it need not be i.i.d.
  The perceptron makes a fixed number of mistakes, and then it is done forever!
■ Even if you see infinite data
4
Beyond Linearly Separable Case
■ Perceptron is useless in practice!
  Real-world data is not linearly separable
  If the data is not separable, the algorithm cycles forever, and this is hard to detect

5
Beyond Linearly Separable Case
■ Perceptron is useless in practice!
  Real-world data is not linearly separable
  If the data is not separable, the algorithm cycles forever, and this is hard to detect
  Even if the data is separable, the perceptron may not give good generalization
  accuracy (small margin)

6
Beyond Linearly Separable Case
■ There are a number of problems with this algorithm, summarized in Ripley (1996):
  When the data are separable, there are many solutions, and which one is found
  depends on the starting values.
  The "finite" number of steps can be very large. The smaller the gap, the longer
  the time to find it.
  When the data are not separable, the algorithm will not converge, and cycles
  develop. The cycles can be long and therefore hard to detect.

From Hastie et al., Chapter 4, page 131.


7
Support Vector Machines
Lalitha Sankar
Arizona State University

Ref: Hastie et al., Chapters 4 (Section 4.5) and 12


8
Linear classifiers – Which line is better?

How does $w^T x + b = 0$ look?


<latexit sha1_base64="X3wNDWk/QFxBRuq4sWthy81NBN8=">AAACBHicbVDLSsNAFJ3UV62vqMtuBltBEEpS8LERC266rNAXtLFMJpN26CQTZibWErpw46+4caGIWz/CnX/jtM1CWw9cOJxzL/fe40aMSmVZ30ZmZXVtfSO7mdva3tndM/cPmpLHApMG5oyLtoskYTQkDUUVI+1IEBS4jLTc4c3Ub90TISkP62ocESdA/ZD6FCOlpZ6Zr/IR9DiRsDi6qz+cuvAKWkXIOB9e98yCVbJmgMvETkkBpKj1zK+ux3EckFBhhqTs2FaknAQJRTEjk1w3liRCeIj6pKNpiAIinWT2xAQea8WDPhe6QgVn6u+JBAVSjgNXdwZIDeSiNxX/8zqx8i+dhIZRrEiI54v8mEHF4TQR6FFBsGJjTRAWVN8K8QAJhJXOLadDsBdfXibNcsk+K1m35ULlPI0jC/LgCJwAG1yACqiCGmgADB7BM3gFb8aT8WK8Gx/z1oyRzhyCPzA+fwByy5X/</latexit>

9
Pick the one with the largest margin!

[Figure: linearly separable data, separating hyperplane $x^T w + b = 0$, margin $2\gamma$]

10
Hyperplanes: Properties

Consider the hyperplane $w^T x + b = 0$ and let $L$ denote the set of points
on it: $L = \{x : w^T x + b = 0\}$. In two dimensions, this is a line.

1. For any two points $x_1$ and $x_2$ in $L$, $w^T(x_1 - x_2) = 0$.
   Thus $w/\|w\|$ is the vector normal to the surface of $L$.

2. For any point $x_0$ in $L$, $w^T x_0 = -b$.

3. The signed distance of any point $x$ to $L$ is given by
   $\frac{1}{\|w\|} w^T(x - x_0) = \frac{1}{\|w\|}(w^T x + b)$.
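A small numeric check of property 3, assuming NumPy; the specific w, b, and x below are made up for illustration.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hyperplane w^T x + b = 0
b = -5.0
x = np.array([2.0, 1.0])

# Signed distance from property 3: (w^T x + b) / ||w||
signed_dist = (w @ x + b) / np.linalg.norm(w)

# Cross-check: step back along the unit normal; the result lands on the hyperplane.
x_proj = x - signed_dist * w / np.linalg.norm(w)
print(signed_dist)                      # 1.0
print(np.isclose(w @ x_proj + b, 0.0))  # True
```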

11
Pick the one with the largest margin!

[Figure: hyperplane $x^T w + b = 0$, its normal vector $w$, and a point $x_0$ off the plane]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?
12
Pick the one with the largest margin!

[Figure: hyperplane $x^T w + b = 0$, a point $x_0$, and its projection $\tilde{x}_0$ onto the plane]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left| (x_0 - \tilde{x}_0)^T \frac{w}{\|w\|_2} \right|$
13
Pick the one with the largest margin!

[Figure: hyperplane $x^T w + b = 0$, a point $x_0$, and its projection $\tilde{x}_0$ onto the plane]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left| (x_0 - \tilde{x}_0)^T \frac{w}{\|w\|_2} \right|
 = \frac{1}{\|w\|_2} \left| x_0^T w - \tilde{x}_0^T w \right|$
14
Pick the one with the largest margin!

[Figure: hyperplane $x^T w + b = 0$, a point $x_0$, and its projection $\tilde{x}_0$ onto the plane]

Distance from $x_0$ to the hyperplane defined by $x^T w + b = 0$?

If $\tilde{x}_0$ is the projection of $x_0$ onto the hyperplane, then
$\|x_0 - \tilde{x}_0\|_2 = \left| (x_0 - \tilde{x}_0)^T \frac{w}{\|w\|_2} \right|
 = \frac{1}{\|w\|_2} \left| x_0^T w - \tilde{x}_0^T w \right|
 = \frac{1}{\|w\|_2} \left| x_0^T w + b \right|$

Signed distance (distance with sign): $\frac{x_0^T w + b}{\|w\|_2}$

15
Realizable (Linearly Separable) Setting

[Figure: separable data, hyperplane $x^T w + b = 0$, margin $2\gamma$]

Distance of $x_0$ from the hyperplane $x^T w + b = 0$:
$\frac{1}{\|w\|_2}(x_0^T w + b)$

Optimal Hyperplane
$\max_{w,b} \ \gamma$
subject to $\frac{1}{\|w\|_2}\, y_i(x_i^T w + b) \ge \gamma \quad \forall i$

Pick the one with the largest margin!

16
Pick the one with the largest margin!

[Figure: separable data, hyperplane $x^T w + b = 0$, margin $2\gamma$]

Distance of $x_0$ from the hyperplane $x^T w + b = 0$:
$\frac{1}{\|w\|_2}(x_0^T w + b)$

Optimal Hyperplane
$\max_{w,b} \ \gamma$
subject to $\frac{1}{\|w\|_2}\, y_i(x_i^T w + b) \ge \gamma \quad \forall i$

Optimal Hyperplane (reparameterized)
$\min_{w,b} \ \|w\|_2^2$
subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

17
Pick the one with the largest margin!

[Figure: separable data, hyperplane $x^T w + b = 0$, margin $2\gamma$]

■ Solve efficiently by many methods, e.g., quadratic programming (QP; see the sketch below)
■ Well-studied solution algorithms
  Stochastic gradient descent

Optimal Hyperplane (reparameterized)
$\min_{w,b} \ \|w\|_2^2$
subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

18
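As a sketch of the QP route mentioned above, the reparameterized hard-margin problem can be handed to an off-the-shelf convex solver; this example assumes cvxpy and a toy separable dataset, neither of which is prescribed by the lecture.

```python
import cvxpy as cp
import numpy as np

# Toy separable data: X is (n, d), y in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, size=(20, 2)), rng.normal(-2, 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

# min ||w||_2^2  subject to  y_i (x_i^T w + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)              # maximum-margin hyperplane
print(2 / np.linalg.norm(w.value))   # margin width 2 / ||w||_2
```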
Pick the one with the largest margin!

[Figure: separable data, hyperplane $x^T w + b = 0$, margin $2\gamma$]

■ Solution? We will derive it neatly when studying kernel methods.
■ Works for both kernel methods and otherwise.
■ Geometric interpretation: Hastie et al., Sections 4.5-4.5.2

Optimal Hyperplane (reparameterized)
$\min_{w,b} \ \|w\|_2^2$
subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

19
SVM as penalization method

■ Original quadratic program with linear constraints:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

■ For a convex optimization objective with convex constraints,
  an equivalent form exists (recall ridge/LASSO)

■ An equivalent form minimizes the (positive part of the) difference
  $1 - y_i(x_i^T w + b)$,

■ subject to a bounded $\ell_2$-norm on $w$
20
Hard SVM: Equivalent Form Hinge Loss

■ Original quadratic program with linear constraints:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

■ Obtain the following equivalent form in the realizable setting
  (recall the max-margin constraint we started with!):

  $\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2$

■ Every constraint is equally important (same Lagrange weight): hard SVM, no slack
■ We thus obtain a new loss, the hinge loss:
  $\ell_i = \max\{0,\, 1 - y_i(x_i^T w + b)\}$

21
SVM for the Non-Separable Setting

■ If the data is linearly separable:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 \quad \forall i$

[Figure: separating hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$ on each side]
22
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

23
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ Ignore points on the other side of the hyperplane.
■ Is it still separable?
■ Does the margin change?
■ How do we resolve the case where some samples are noisy?
24
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ If the data is not linearly separable, some points don't satisfy the margin constraint.
25
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$
26
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$

■ What are "support vectors"?
27
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$

■ Support vectors: the set of points closest to the separating hyperplane that
  influence the position and orientation of the plane!
28
What if the data is still not linearly separable?

[Figure: non-separable data around the hyperplane $x^T w + b = 0$ with margin boundaries at distance $1/\|w\|_2$]

■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$

Misclassifications occur when $\xi_i > 1$
29
SVM as penalization method

■ Original quadratic program with linear constraints:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$
  ($\xi_i$: slack variables)

• Note that this is a relaxation; it does not minimize the number of
  misclassifications (support vectors) but the sum of distances from the plane!
• Generally solved using the dual problem
• The solution has to ensure non-negative slack variables
  (enforced by bounding the Lagrangian)
30
SVM: Solution

■ Original quadratic program with linear constraints:

  $\min_{w,b} \ \|w\|_2^2$
  subject to $y_i(x_i^T w + b) \ge 1 - \xi_i \quad \forall i$
  $\xi_i \ge 0, \quad \sum_{j=1}^n \xi_j \le \nu$

■ Solved as a constrained convex optimization:

  $\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^N \xi_i$
  s.t. $1 - y_i(w^T x_i + b) \le \xi_i$
  $\xi_i \ge 0, \quad \xi = [\xi_1\ \xi_2\ \dots\ \xi_n]^T$

■ The Lagrangian constant C trades off margin width and misclassifications
■ C is generally determined by cross-validation (CV); see the sketch below
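A sketch of fitting the C-form soft-margin SVM and picking C by cross-validation; scikit-learn, the synthetic dataset, and the grid of C values are illustrative assumptions, not prescribed by the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# LinearSVC minimizes hinge loss plus regularization; C weights the slack term.
search = GridSearchCV(
    LinearSVC(loss="hinge", max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```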
31
Machine Learning Problems

■ Have a bunch of i.i.d. data of the form:
  $\{(x_i, y_i)\}_{i=1}^n, \quad x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}$

■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$,
  where each $\ell_i(w)$ is convex.

  Hinge loss: $\ell_i(w) = \max\{0,\, 1 - y_i x_i^T w\}$
  Logistic loss: $\ell_i(w) = \log(1 + \exp(-y_i x_i^T w))$
  Squared error loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How do we solve for $w$? The last few lectures!
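The three convex losses above, written out for a single sample in NumPy; the function names are purely illustrative.

```python
import numpy as np

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * (x @ w))

def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * (x @ w)))

def squared_error_loss(w, x, y):
    return (y - x @ w) ** 2
```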


32
What is the Perceptron Optimizing?

Perceptron update rule:

$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k(b_k + x_k^T w_k) < 0\}$

SVM objective:

$\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = ?$
$\nabla_b \ell_i(w, b) = ?$
33
What is the Perceptron Optimizing?

Perceptron update rule:

$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k(b_k + x_k^T w_k) < 0\}$

SVM objective:

$\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -y_i x_i + \frac{2\lambda}{n} w, & \text{if } y_i(b + x_i^T w) < 1 \\ \frac{2\lambda}{n} w, & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i, & \text{if } y_i(b + x_i^T w) < 1 \\ 0, & \text{otherwise} \end{cases}$

34
What is the Perceptron Optimizing?
Perceptron update rule:

$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k(b_k + x_k^T w_k) < 0\}$

SVM objective:

$\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -y_i x_i + \frac{2\lambda}{n} w, & \text{if } y_i(b + x_i^T w) < 1 \\ \frac{2\lambda}{n} w, & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i, & \text{if } y_i(b + x_i^T w) < 1 \\ 0, & \text{otherwise} \end{cases}$

Recall SGD? $w_{t+1} = w_t - \eta \nabla_w f(w)$
35
What is the Perceptron Optimizing?
Perceptron update rule:

$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k(b_k + x_k^T w_k) < 0\}$

SVM objective:

$\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -y_i x_i + \frac{2\lambda}{n} w, & \text{if } y_i(b + x_i^T w) < 1 \\ \frac{2\lambda}{n} w, & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i, & \text{if } y_i(b + x_i^T w) < 1 \\ 0, & \text{otherwise} \end{cases}$

Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!
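A sketch of one SGD step on the per-sample SVM loss above; with λ = 0 and η = 1 the update reduces to the perceptron-style step w ← w + y x, b ← b + y whenever the hinge's margin condition fires. The function name and defaults below are illustrative.

```python
def sgd_svm_step(w, b, x, y, lam=0.0, eta=1.0, n=1):
    """One SGD step on l_i(w, b) = max(0, 1 - y (x^T w + b)) + (lam / n) ||w||^2."""
    if y * (x @ w + b) < 1:                        # margin violated: hinge subgradient is -y x
        w = w - eta * (-y * x + (2 * lam / n) * w)
        b = b - eta * (-y)
    else:                                          # only the regularizer contributes
        w = w - eta * (2 * lam / n) * w
    return w, b
```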
36
What is the Perceptron Optimizing?
Perceptron update rule:

$\begin{bmatrix} w_{k+1} \\ b_{k+1} \end{bmatrix} = \begin{bmatrix} w_k \\ b_k \end{bmatrix} + y_k \begin{bmatrix} x_k \\ 1 \end{bmatrix} \mathbf{1}\{y_k(b_k + x_k^T w_k) < 0\}$

SVM objective:

$\sum_{i=1}^n \max\{0,\, 1 - y_i(b + x_i^T w)\} + \lambda \|w\|_2^2 = \sum_{i=1}^n \ell_i(w, b)$

$\nabla_w \ell_i(w, b) = \begin{cases} -y_i x_i + \frac{2\lambda}{n} w, & \text{if } y_i(b + x_i^T w) < 1 \\ \frac{2\lambda}{n} w, & \text{otherwise} \end{cases}$

$\nabla_b \ell_i(w, b) = \begin{cases} -y_i, & \text{if } y_i(b + x_i^T w) < 1 \\ 0, & \text{otherwise} \end{cases}$

Perceptron is just SGD on the SVM objective with $\lambda = 0$, $\eta = 1$!

"Links between Perceptrons, MLPs and SVMs":
https://ronan.collobert.com/pub/matos/2004_links_icml.pdf
37
SVMs vs logistic regression

■ Different loss functions: why expect the same solutions?

■ SVM: learns separating hyperplanes
■ LR: learns to make soft predictions using a linear model (hyperplanes)

■ LR: large (negative) margin approximation: ??

38
SVMs vs logistic regression

■ Different loss functions: why expect the same solutions?
■ SVM: learns separating hyperplanes
■ LR: learns to make soft predictions

■ SVM and LR will match for separable data

■ Logistic Regression:
  ■ The prediction can be interpreted as a risk/likelihood
  ■ However, calibration is needed to precisely map predictions to probabilities;
    most complex models are not calibrated

39
SVMs vs logistic regression

■ Logistic Regression:
  ■ The prediction can be interpreted as a risk/likelihood
  ■ However, calibration is needed to precisely map predictions to probabilities;
    most complex models are not calibrated

■ Calibration: a model is calibrated (in probabilities) when a prediction of a
  class with confidence p (the predicted output) is correct 100p percent of the time
  ■ i.e., ensuring the predicted value in any bin (over [0, 1]) matches the true
    positive fraction for the same samples
  ■ Isotonic regression and the non-parametric bootstrap are used to calibrate
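A sketch of calibrating an SVM's scores with isotonic regression; scikit-learn's CalibratedClassifierCV and the synthetic dataset below are illustrative assumptions, not part of the lecture.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap the uncalibrated margin classifier; isotonic regression maps its scores to probabilities.
clf = CalibratedClassifierCV(LinearSVC(max_iter=10000), method="isotonic", cv=5)
clf.fit(X_tr, y_tr)

prob_pos = clf.predict_proba(X_te)[:, 1]
frac_true, mean_pred = calibration_curve(y_te, prob_pos, n_bins=10)
print(list(zip(mean_pred.round(2), frac_true.round(2))))  # predicted vs. empirical frequency per bin
```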
40
What about multiple classes?

41
SVMs vs logistic regression

■ Multiclass setting:
  Softmax naturally generalizes logistic regression
  SVMs? See https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
  (a one-vs-rest sketch follows below)
■ What about good old least squares?
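One common multiclass strategy for SVMs is one-vs-rest, as discussed in the Stanford IR-book link above; the sketch below uses scikit-learn and the Iris data purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# One binary hinge-loss classifier per class; predict the class with the largest margin.
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```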

42
What’s Next?
■ Nearest neighbor-based classification
■ Kernel methods
  Generalize the SVM to a linear sum of non-linear functions
■ Unsupervised learning: PCA and k-means clustering
■ Bootstrap method:
  Another way to cross-validate
  Crucial when statistics are estimated
■ P-values and hypothesis testing
■ Bayesian methods
43
