Perceptron & SVM: Module 5, Part 2 (October 2023)
Deadline for Term Paper Phase I submission: October 31, 2023, 11:30 pm
Only Fashion-MNIST may be used as the numerical dataset (not MNIST)
1
Perceptron Analysis: Linearly Separable Case
<latexit sha1_base64="Ak/gyvoP6uJcC1BzlHM7mG9Txcs=">AAAEPnicbVNNb9NAEHUTPkr4auHIZUQNalAU4giVUqlSBRzgUpUq/ZDqNFqv1/Eq612zXreJLP8yLvwGbhy5cAAhrhyZdVKatuzBGr/xzL73Zhykgmem0/m6UKtfu37j5uKtxu07d+/dX1p+sJ+pXFO2R5VQ+jAgGRNcsj3DjWCHqWYkCQQ7CEZvbP7ghOmMK9kzk5T1EzKUPOKUGIQGy7WeLxWXIZMGfMPGJoiKXsyUZgmsvhaKjsB/CtvqhI9UFLXAe7XWbW6U8D4CEzMwmnDJ5RBCYgj4Prh+QkxMiSjelpt+sVq9Ys9xOfBak4HXbM1DXYS6zZaLZSJUJruQ3MbkdtMvXeDZP25WJ9FiAhlLiSYos8RrT7mJISF6yCW2GpIkIS4EEyCQS25AKp1AjOp1KohkM5LY7bQ8frYBvthn2sxjvrYIbHqu1VR1d4PNjtuyomWl/IwQNqUsNVrJczOokuj5kGVAIsN0OTUm0oQWu8fdspgyxMgFprXSGYS5toVnHVpwGjNtiV6gNi5nxHzBYNeFSGkgQpzrGZduGy9rNM6H2jsjLPMkYBpUBAmuFRkhu4SEzNpk00paa2FODpLGJ5HodU5jNPxjziRldhqByrF7aGvdStLzOU1taAyWVjrtTnXgauDNghVndnYGS1/8UNE8QcZUkCw78jqp6RdEG05xxA0/t/OmIzJkRxhKkrCsX1TrX8ITRMLKjEih4gqdryhIkmWTJMAvrU/Z5ZwF/5c7yk203i+4THODyqcXRbkAo8D+SxByzajBZQw5oZojV6AxbiXFoWfWBO+y5KvBfrftrbVffOiubK3P7Fh0HjmPnVXHc146W847Z8fZc2jtU+1b7UftZ/1z/Xv9V/339NPawqzmoXPh1P/8BcrLY88=</latexit>
Proof details:
1) Visual: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
2) Chapter 9, Section 9.1.2, Understanding Machine Learning, Shalev-Shwartz and Ben-David
2
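To make the algorithm concrete, here is a minimal NumPy sketch of the mistake-driven Perceptron (the function name and toy interface are illustrative, not from the slides):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """Perceptron on X (n x d) with labels y in {-1, +1}.
        Returns (w, b, number of mistakes made)."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        mistakes = 0
        for _ in range(max_epochs):
            clean_pass = True
            for i in range(n):
                if y[i] * (X[i] @ w + b) <= 0:  # mistake (ties count, so training can leave w = 0)
                    w += y[i] * X[i]            # w <- w + y_i x_i
                    b += y[i]                   # b <- b + y_i
                    mistakes += 1
                    clean_pass = False
            if clean_pass:                      # separable case: mistakes <= (R/γ)^2
                break
        return w, b, mistakes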
Beyond Linearly Separable Case
■ Perceptron algorithm is super cool!
  No assumption about the data distribution!
  ■ Could be generated by an oblivious adversary; no need to be i.i.d.
  Makes a fixed number of mistakes, and it's done forever!
  ■ Even if you see infinite data
4
Beyond Linearly Separable Case
■ Perceptron is useless in practice!
  Real-world data is not linearly separable
  If the data is not separable, it cycles forever, and the cycles are hard to detect
  Even if separable, it may not give good generalization accuracy (small margin)
6
Beyond Linearly Separable Case
■ There are a number of problems with this algorithm, summarized in Ripley (1996):
  When the data are separable, there are many solutions, and which one is found depends on the starting values.
  The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
  When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
9
Pick the one with the largest margin!
[Figure: two classes separated by the hyperplane x^T w + b = 0, with margin 2γ]
10
Hyperplanes: Properties
For any x_0 on the hyperplane (x_0^T w + b = 0):

  (1/||w||_2) w^T (x − x_0) = (1/||w||_2) (x^T w + b)
11
Pick the one with the largest margin!
Distance from x_0 to the hyperplane defined by x^T w + b = 0?

[Figure: point x_0 and its offset from the hyperplane x^T w + b = 0, with normal vector w]
12
Pick the one with the largest margin!
Distance from x_0 to the hyperplane defined by x^T w + b = 0?
If x̃_0 is the projection of x_0 onto the hyperplane, then

  ||x_0 − x̃_0||_2 = |(x_0 − x̃_0)^T w / ||w||_2|
                  = (1/||w||_2) |x_0^T w − x̃_0^T w|
                  = (1/||w||_2) |x_0^T w + b|    (since x̃_0 lies on the hyperplane, x̃_0^T w = −b)

Signed distance (distance with sign): (x_0^T w + b) / ||w||_2
15
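A quick NumPy sanity check of this formula (the numbers are toy values, purely illustrative):

    import numpy as np

    w, b = np.array([3.0, 4.0]), -5.0
    x0 = np.array([4.0, 3.0])

    # Signed distance from x0 to the hyperplane {x : x^T w + b = 0}
    signed = (x0 @ w + b) / np.linalg.norm(w)

    # Cross-check: project x0 onto the hyperplane and measure the gap
    x0_proj = x0 - signed * w / np.linalg.norm(w)
    assert np.isclose(x0_proj @ w + b, 0.0)                       # x0_proj lies on the hyperplane
    assert np.isclose(abs(signed), np.linalg.norm(x0 - x0_proj))  # |signed| is the distance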
Realizable (Linearly Separable) Setting
Distance of x_0 from the hyperplane x^T w + b = 0:  (1/||w||_2) (x_0^T w + b)

Optimal Hyperplane:

  max_{w,b} γ
  subject to  (1/||w||_2) y_i (x_i^T w + b) ≥ γ   ∀i

Equivalently:

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
17
Pick the one with the largest margin!
■ Solve efficiently by many methods, e.g., quadratic programming (QP)
■ Well-studied solution algorithms, e.g., stochastic gradient descent

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
18
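For instance, the constrained problem above can be handed directly to a QP solver; a minimal sketch with the cvxpy library (the toy data are assumed separable and are not from the slides):

    import cvxpy as cp
    import numpy as np

    # Toy separable data, labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # Hard-margin SVM: min ||w||_2^2  s.t.  y_i (x_i^T w + b) >= 1
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                      [cp.multiply(y, X @ w + b) >= 1])
    prob.solve()
    print(w.value, b.value)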
Pick the one with the largest margin!
■ Solution? We will derive it neatly when studying kernel methods.
■ Works for both kernel methods and otherwise.
■ Geometric interpretation: Hastie et al., Sections 4.5-4.5.2

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
19
SVM as penalization method
  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i
20
Hard SVM: Equivalent Form via Hinge Loss
Recall the constraint: y_i (x_i^T w + b) ≥ 1  ∀i
■ Obtain the following equivalent form in the realizable setting (recall the max-margin constraint we started with!):

  min_{w,b}  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2
21
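As a sketch, this penalized objective is a few lines of NumPy (lam plays the role of λ; the names are placeholders):

    import numpy as np

    def svm_objective(w, b, X, y, lam):
        """sum_i max(0, 1 - y_i (b + x_i^T w)) + lam * ||w||_2^2"""
        hinge = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
        return hinge + lam * (w @ w)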
SVM for the Non-Separable Setting
  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: hyperplane with margin 1/||w||_2 on each side]
22
What if the data is still not linearly separable?

[Figure: hyperplane x^T w + b = 0 with margin 1/||w||_2 on each side]
23
What if the data is still not linearly separable?
■ Is it still separable?
■ Does the margin change?
■ How do we resolve the case where some samples are noisy?

[Figure: hyperplane with margin 1/||w||_2 on each side]
24
What if the data is still not linearly separable?
■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1 − ξ_i   ∀i
              ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

[Figure: hyperplane x^T w + b = 0 with margin 1/||w||_2 on each side]
■ What are “support vectors”?
26
SVM as penalization method
  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1 − ξ_i   ∀i
              ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

ξ_i: slack variables

Equivalent penalized form:

  min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^n ξ_i
  s.t.  1 − y_i (w^T x_i + b) ≤ ξ_i,
        ξ_i ≥ 0,   ξ = [ξ_1 ξ_2 ... ξ_n]^T

SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

  ∇_w ℓ_i(w, b) = ?
  ∇_b ℓ_i(w, b) = ?
33
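This C-penalized form is what off-the-shelf solvers fit; a hedged scikit-learn sketch (toy data; note LinearSVC uses a squared hinge by default):

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy data with one noisy point; C trades margin width against total slack
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [0.5, 0.5]])
    y = np.array([1, 1, -1, -1, -1])

    clf = LinearSVC(C=1.0)   # soft-margin linear SVM
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)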
What is the Perceptron Optimizing?
SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

Per-example (sub)gradients (the hinge is non-differentiable exactly at y_i (b + x_i^T w) = 1):

  ∇_w ℓ_i(w, b) = −y_i x_i + (2λ/n) w   if y_i (b + x_i^T w) < 1
                = (2λ/n) w              otherwise

  ∇_b ℓ_i(w, b) = −y_i   if y_i (b + x_i^T w) < 1
                = 0      otherwise
34
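The case analysis above translates directly to code; a minimal NumPy sketch (lam and n follow the slide's convention of splitting λ||w||_2^2 across n examples):

    import numpy as np

    def subgrad(w, b, x_i, y_i, lam, n):
        """(Sub)gradient of l_i(w,b) = max(0, 1 - y_i (b + x_i^T w)) + (lam/n)||w||_2^2."""
        if y_i * (b + x_i @ w) < 1:                # hinge active
            return -y_i * x_i + (2 * lam / n) * w, -y_i
        return (2 * lam / n) * w, 0.0              # hinge inactive: regularizer only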
What is the Perceptron Optimizing?
Perceptron update rule:

  [w_{k+1}; b_{k+1}] = [w_k; b_k] + y_k [x_k; 1] · 1{y_k (b_k + x_k^T w_k) < 0}

SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

with the per-example (sub)gradients from the previous slide.

Recall SGD: w_{t+1} = w_t − η ∇_w ℓ_i(w_t, b_t)

■ Perceptron is just SGD on the SVM objective with λ = 0, η = 1 (and the hinge threshold at 0 instead of 1)!
36
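A tiny sketch of the equivalence (illustrative): one SGD step with η = 1 on the unregularized loss max{0, −y_i (b + x_i^T w)} is exactly the Perceptron update.

    import numpy as np

    def sgd_perceptron_step(w, b, x_i, y_i, eta=1.0):
        """One SGD step on l_i(w,b) = max(0, -y_i (b + x_i^T w)) with lambda = 0.
        With eta = 1 this reproduces the Perceptron update."""
        if y_i * (b + x_i @ w) < 0:   # mistake: subgradient is (-y_i x_i, -y_i)
            w = w + eta * y_i * x_i
            b = b + eta * y_i
        return w, b                   # otherwise zero subgradient: no change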
What is the Perceptron Optimizing?
Further reading: Collobert and Bengio, “Links between Perceptrons, MLPs and SVMs” (ICML 2004):
https://ronan.collobert.com/pub/matos/2004_links_icml.pdf
37
SVMs vs logistic regression
38
SVMs vs logistic regression
■ Logistic Regression:
  ■ Prediction can be interpreted as a risk/likelihood
  ■ However, calibration is needed to precisely map scores to probabilities; most complex models are not calibrated
39
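As an illustration of post-hoc calibration, scikit-learn can wrap an SVM with Platt scaling; a minimal sketch on synthetic data:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # SVM scores are not probabilities; sigmoid (Platt) calibration maps them to [0, 1]
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]))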
SVMs vs logistic regression
■ Multiclass setting:
  Softmax naturally generalizes logistic regression
  SVMs? See https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
■ What about good old least squares?
42
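One standard recipe (discussed in the linked chapter) is one-vs-rest: train one binary SVM per class and predict by the highest score. A minimal sketch on synthetic data:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = rng.integers(0, 3, size=300)        # three synthetic classes

    clf = OneVsRestClassifier(LinearSVC())  # one binary SVM per class
    clf.fit(X, y)
    print(clf.predict(X[:5]))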
What’s Next?
■ Nearest neighbor-based classification
■ Kernel methods
  Generalize SVM to a linear sum of non-linear functions
■ Unsupervised learning: PCA and k-means clustering
■ Bootstrap method:
  Another way to cross-validate
  Crucial when statistics are estimated
■ P-values and hypothesis testing
  Bayesian methods
43