Perceptron & SVM: Module 5, Part 2 (October 2023)
Deadline for Term Paper Phase I submission: October 31, 2023, 11:30 pm
Only Fashion-MNIST may be used as the numerical dataset (not MNIST)
1
Perceptron Analysis: Linearly Separable Case
<latexit sha1_base64="Ak/gyvoP6uJcC1BzlHM7mG9Txcs=">AAAEPnicbVNNb9NAEHUTPkr4auHIZUQNalAU4giVUqlSBRzgUpUq/ZDqNFqv1/Eq612zXreJLP8yLvwGbhy5cAAhrhyZdVKatuzBGr/xzL73Zhykgmem0/m6UKtfu37j5uKtxu07d+/dX1p+sJ+pXFO2R5VQ+jAgGRNcsj3DjWCHqWYkCQQ7CEZvbP7ghOmMK9kzk5T1EzKUPOKUGIQGy7WeLxWXIZMGfMPGJoiKXsyUZgmsvhaKjsB/CtvqhI9UFLXAe7XWbW6U8D4CEzMwmnDJ5RBCYgj4Prh+QkxMiSjelpt+sVq9Ys9xOfBak4HXbM1DXYS6zZaLZSJUJruQ3MbkdtMvXeDZP25WJ9FiAhlLiSYos8RrT7mJISF6yCW2GpIkIS4EEyCQS25AKp1AjOp1KohkM5LY7bQ8frYBvthn2sxjvrYIbHqu1VR1d4PNjtuyomWl/IwQNqUsNVrJczOokuj5kGVAIsN0OTUm0oQWu8fdspgyxMgFprXSGYS5toVnHVpwGjNtiV6gNi5nxHzBYNeFSGkgQpzrGZduGy9rNM6H2jsjLPMkYBpUBAmuFRkhu4SEzNpk00paa2FODpLGJ5HodU5jNPxjziRldhqByrF7aGvdStLzOU1taAyWVjrtTnXgauDNghVndnYGS1/8UNE8QcZUkCw78jqp6RdEG05xxA0/t/OmIzJkRxhKkrCsX1TrX8ITRMLKjEih4gqdryhIkmWTJMAvrU/Z5ZwF/5c7yk203i+4THODyqcXRbkAo8D+SxByzajBZQw5oZojV6AxbiXFoWfWBO+y5KvBfrftrbVffOiubK3P7Fh0HjmPnVXHc146W847Z8fZc2jtU+1b7UftZ/1z/Xv9V/339NPawqzmoXPh1P/8BcrLY88=</latexit>
Proof details:
1) Visual: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html
2) Chapter 9, Section 9.1.2, Understanding Machine Learning, Shalev-Shwartz and Ben-David
2
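To make the algorithm concrete, here is a minimal NumPy sketch of the mistake-driven Perceptron (the function name and toy interface are illustrative, not from the slides):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """Perceptron on X (n x d) with labels y in {-1, +1}.
        Returns (w, b, number of mistakes made)."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        mistakes = 0
        for _ in range(max_epochs):
            clean_pass = True
            for i in range(n):
                if y[i] * (X[i] @ w + b) <= 0:  # mistake (ties count, so training can leave w = 0)
                    w += y[i] * X[i]            # w <- w + y_i x_i
                    b += y[i]                   # b <- b + y_i
                    mistakes += 1
                    clean_pass = False
            if clean_pass:                      # separable case: mistakes <= (R/γ)^2
                break
        return w, b, mistakes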
Beyond Linearly Separable Case
■ Perceptron algorithm is super cool!
  No assumption about the data distribution!
  ■ Could be generated by an oblivious adversary; no need to be i.i.d.
  Makes a fixed number of mistakes, and it's done forever!
  ■ Even if you see infinite data
4
Beyond Linearly Separable Case
■ Perceptron is useless in practice!
  Real-world data is not linearly separable
  If the data is not separable, it cycles forever, and the cycles are hard to detect
  Even if separable, it may not give good generalization accuracy (small margin)
6
Beyond Linearly Separable Case
■ There are a number of problems with this algorithm, summarized in Ripley (1996):
  When the data are separable, there are many solutions, and which one is found depends on the starting values.
  The “finite” number of steps can be very large. The smaller the gap, the longer the time to find it.
  When the data are not separable, the algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect.
9
Pick the one with the largest margin!
[Figure: two classes separated by the hyperplane x^T w + b = 0, with margin 2γ]
10
Hyperplanes: Properties
For any x_0 on the hyperplane (x_0^T w + b = 0):

  (1/||w||_2) w^T (x − x_0) = (1/||w||_2) (x^T w + b)
11
Pick the one with the largest margin!
Distance from x_0 to the hyperplane defined by x^T w + b = 0?

[Figure: point x_0 and its offset from the hyperplane x^T w + b = 0, with normal vector w]
12
Pick the one with the largest margin!
Distance from x_0 to the hyperplane defined by x^T w + b = 0?
If x̃_0 is the projection of x_0 onto the hyperplane, then

  ||x_0 − x̃_0||_2 = |(x_0 − x̃_0)^T w / ||w||_2|
                  = (1/||w||_2) |x_0^T w − x̃_0^T w|
                  = (1/||w||_2) |x_0^T w + b|    (since x̃_0 lies on the hyperplane, x̃_0^T w = −b)

Signed distance (distance with sign): (x_0^T w + b) / ||w||_2
15
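A quick NumPy sanity check of this formula (the numbers are toy values, purely illustrative):

    import numpy as np

    w, b = np.array([3.0, 4.0]), -5.0
    x0 = np.array([4.0, 3.0])

    # Signed distance from x0 to the hyperplane {x : x^T w + b = 0}
    signed = (x0 @ w + b) / np.linalg.norm(w)

    # Cross-check: project x0 onto the hyperplane and measure the gap
    x0_proj = x0 - signed * w / np.linalg.norm(w)
    assert np.isclose(x0_proj @ w + b, 0.0)                       # x0_proj lies on the hyperplane
    assert np.isclose(abs(signed), np.linalg.norm(x0 - x0_proj))  # |signed| is the distance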
Realizable (Linearly Separable) Setting
Distance of x_0 from the hyperplane x^T w + b = 0:  (1/||w||_2) (x_0^T w + b)

Optimal Hyperplane:

  max_{w,b} γ
  subject to  (1/||w||_2) y_i (x_i^T w + b) ≥ γ   ∀i

Equivalently:

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
17
Pick the one with the largest margin!
■ Solve efficiently by many methods, e.g., quadratic programming (QP)
■ Well-studied solution algorithms, e.g., stochastic gradient descent

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
18
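For instance, the constrained problem above can be handed directly to a QP solver; a minimal sketch with the cvxpy library (the toy data are assumed separable and are not from the slides):

    import cvxpy as cp
    import numpy as np

    # Toy separable data, labels in {-1, +1}
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()

    # Hard-margin SVM: min ||w||_2^2  s.t.  y_i (x_i^T w + b) >= 1
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                      [cp.multiply(y, X @ w + b) >= 1])
    prob.solve()
    print(w.value, b.value)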
Pick the one with the largest margin!
■ Solution? We will derive it neatly when studying kernel methods.
■ Works for both kernel methods and otherwise.
■ Geometric interpretation: Hastie et al., Sections 4.5-4.5.2

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: separating hyperplane with margin 2γ]
19
SVM as penalization method
  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1   ∀i
20
Hard SVM: Equivalent Form via Hinge Loss
Recall the constraint: y_i (x_i^T w + b) ≥ 1  ∀i
■ Obtain the following equivalent form in the realizable setting (recall the max-margin constraint we started with!):

  min_{w,b}  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2
21
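As a sketch, this penalized objective is a few lines of NumPy (lam plays the role of λ; the names are placeholders):

    import numpy as np

    def svm_objective(w, b, X, y, lam):
        """sum_i max(0, 1 - y_i (b + x_i^T w)) + lam * ||w||_2^2"""
        hinge = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
        return hinge + lam * (w @ w)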
SVM for the Non-Separable Setting
  y_i (x_i^T w + b) ≥ 1   ∀i

[Figure: hyperplane with margin 1/||w||_2 on each side]
22
What if the data is still not linearly separable?

[Figure: hyperplane x^T w + b = 0 with margin 1/||w||_2 on each side]
23
What if the data is still not linearly separable?
■ Is it still separable?
■ Does the margin change?
■ How do we resolve the case where some samples are noisy?

[Figure: hyperplane with margin 1/||w||_2 on each side]
24
What if the data is still not linearly separable?
■ If the data is not linearly separable, some points don't satisfy the margin constraint:

  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1 − ξ_i   ∀i
              ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

[Figure: hyperplane x^T w + b = 0 with margin 1/||w||_2 on each side]
■ What are “support vectors”?
26
SVM as penalization method
  min_{w,b} ||w||_2^2
  subject to  y_i (x_i^T w + b) ≥ 1 − ξ_i   ∀i
              ξ_i ≥ 0,   Σ_{j=1}^n ξ_j ≤ ν

ξ_i: slack variables

Equivalent penalized form:

  min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^n ξ_i
  s.t.  1 − y_i (w^T x_i + b) ≤ ξ_i,
        ξ_i ≥ 0,   ξ = [ξ_1 ξ_2 ... ξ_n]^T

SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

  ∇_w ℓ_i(w, b) = ?
  ∇_b ℓ_i(w, b) = ?
33
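This C-penalized form is what off-the-shelf solvers fit; a hedged scikit-learn sketch (toy data; note LinearSVC uses a squared hinge by default):

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy data with one noisy point; C trades margin width against total slack
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [0.5, 0.5]])
    y = np.array([1, 1, -1, -1, -1])

    clf = LinearSVC(C=1.0)   # soft-margin linear SVM
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)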
What is the Perceptron Optimizing?
SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

Per-example (sub)gradients (the hinge is non-differentiable exactly at y_i (b + x_i^T w) = 1):

  ∇_w ℓ_i(w, b) = −y_i x_i + (2λ/n) w   if y_i (b + x_i^T w) < 1
                = (2λ/n) w              otherwise

  ∇_b ℓ_i(w, b) = −y_i   if y_i (b + x_i^T w) < 1
                = 0      otherwise
34
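The case analysis above translates directly to code; a minimal NumPy sketch (lam and n follow the slide's convention of splitting λ||w||_2^2 across n examples):

    import numpy as np

    def subgrad(w, b, x_i, y_i, lam, n):
        """(Sub)gradient of l_i(w,b) = max(0, 1 - y_i (b + x_i^T w)) + (lam/n)||w||_2^2."""
        if y_i * (b + x_i @ w) < 1:                # hinge active
            return -y_i * x_i + (2 * lam / n) * w, -y_i
        return (2 * lam / n) * w, 0.0              # hinge inactive: regularizer only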
What is the Perceptron Optimizing?
Perceptron update rule:

  [w_{k+1}; b_{k+1}] = [w_k; b_k] + y_k [x_k; 1] · 1{y_k (b_k + x_k^T w_k) < 0}

SVM objective:

  Σ_{i=1}^n max{0, 1 − y_i (b + x_i^T w)} + λ||w||_2^2 = Σ_{i=1}^n ℓ_i(w, b)

with the per-example (sub)gradients from the previous slide.

Recall SGD: w_{t+1} = w_t − η ∇_w ℓ_i(w_t, b_t)

■ Perceptron is just SGD on the SVM objective with λ = 0, η = 1 (and the hinge threshold at 0 instead of 1)!
36
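A tiny sketch of the equivalence (illustrative): one SGD step with η = 1 on the unregularized loss max{0, −y_i (b + x_i^T w)} is exactly the Perceptron update.

    import numpy as np

    def sgd_perceptron_step(w, b, x_i, y_i, eta=1.0):
        """One SGD step on l_i(w,b) = max(0, -y_i (b + x_i^T w)) with lambda = 0.
        With eta = 1 this reproduces the Perceptron update."""
        if y_i * (b + x_i @ w) < 0:   # mistake: subgradient is (-y_i x_i, -y_i)
            w = w + eta * y_i * x_i
            b = b + eta * y_i
        return w, b                   # otherwise zero subgradient: no change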
What is the Perceptron Optimizing?
Further reading: Collobert and Bengio, “Links between Perceptrons, MLPs and SVMs” (ICML 2004):
https://ronan.collobert.com/pub/matos/2004_links_icml.pdf
37
SVMs vs logistic regression
38
SVMs vs logistic regression
■ Logistic Regression:
  ■ Prediction can be interpreted as a risk/likelihood
  ■ However, calibration is needed to precisely map scores to probabilities; most complex models are not calibrated
39
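As an illustration of post-hoc calibration, scikit-learn can wrap an SVM with Platt scaling; a minimal sketch on synthetic data:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # SVM scores are not probabilities; sigmoid (Platt) calibration maps them to [0, 1]
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]))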
SVMs vs logistic regression
■ Multiclass setting:
  Softmax naturally generalizes logistic regression
  SVMs? See https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
■ What about good old least squares?
42
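One standard recipe (discussed in the linked chapter) is one-vs-rest: train one binary SVM per class and predict by the highest score. A minimal sketch on synthetic data:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = rng.integers(0, 3, size=300)        # three synthetic classes

    clf = OneVsRestClassifier(LinearSVC())  # one binary SVM per class
    clf.fit(X, y)
    print(clf.predict(X[:5]))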
What’s Next?
■ Nearest neighbor-based classification
■ Kernel methods
  Generalize SVM to a linear sum of non-linear functions
■ Unsupervised learning: PCA and k-means clustering
■ Bootstrap method:
  Another way to cross-validate
  Crucial when statistics are estimated
■ P-values and hypothesis testing
  Bayesian methods
43