10 Support Vector Machine


Topics to be covered...

Introduction to SVM
Concept of maximum margin hyperplane
Linear SVM
1 Calculation of MMH
2 Learning a linear SVM
3 Classifying a test sample using linear SVM
4 Classifying multi-class data

Non-linear SVM

Concept of non-linear data


Soft-margin SVM
Kernel Trick

09/24/2021 1 / 131
Introduction to SVM

09/24/2021 2 / 131
Introduction
A classification technique that has received considerable attention is the support vector machine, popularly abbreviated as SVM.
This technique has its roots in statistical learning theory (Vladimir Vapnik, 1992).
For a classification task, it searches for an optimal hyperplane (i.e., a decision boundary; see Fig. 1 on the next slide) separating the tuples of one class from another.
SVM works well with high-dimensional data and thus avoids the curse of dimensionality.
Although SVM training is extremely slow, the resulting classifier is highly accurate. Furthermore, classifying an unknown sample is very fast.
SVM is less prone to overfitting than many other methods, and it yields a compact classification model.

09/24/2021 3 / 131
Introduction

Figure 1: Decision boundary in SVM.

09/24/2021 4 / 131
Introduction

In this lecture, we shall discuss the following.

1 Maximum margin hyperplane : a key concept in SVM.


2 Linear SVM : a classification technique when training data are
linearly separable.
3 Non-linear SVM : a classification technique when training data are
linearly non-separable.

09/24/2021 5 / 131
Maximum Margin Hyperplane

09/24/2021 6 / 131
Maximum Margin Hyperplane

In our subsequent discussion, we shall assume a simple situation: a training data set D = {t1, t2, ..., tn} of n tuples, each belonging to one of two classes (+ or -) and described by two attributes, say A1 and A2.

09/24/2021 7 / 131
Maximum Margin Hyperplane

Figure 2: A 2-D data set linearly separable by hyperplanes.

09/24/2021 8 / 131
Maximum Margin Hyperplane contd...

Figure 2 shows a plot of the data in 2-D. Another simplistic assumption here is that the data are linearly separable; that is, we can find a hyperplane (in this case, a straight line) such that all +'s reside on one side whereas all -'s reside on the other side of the hyperplane.

From Fig. 2, it can be seen that an infinite number of separating lines can be drawn. Therefore, the following two questions arise:
1 Are all hyperplanes equivalent as far as the classification of the data is concerned?
2 If not, which hyperplane is the best?

09/24/2021 9 / 131
Maximum Margin Hyperplane contd...

We may note that, as far as the classification error on the training data is concerned, all of these hyperplanes have zero error.

However, there is no guarantee that all hyperplanes perform equally well on unseen (i.e., test) data.

09/24/2021 10 / 131
Maximum Margin Hyperplane contd...

Thus, a good classifier must choose one of the infinitely many hyperplanes, so that it performs well not only on the training data but also on test data.

To illustrate how different choices of hyperplane influence the classification error, consider two arbitrary hyperplanes H1 and H2, as shown in Fig. 3.

09/24/2021 11 / 131
Maximum Margin Hyperplane

Figure 3: Hyperplanes with decision boundaries and their margins.


09/24/2021 12 / 131
Maximum Margin Hyperplane contd...

In Fig. 3, the two hyperplanes H1 and H2 have their own boundaries, called decision boundaries (denoted b11 and b12 for H1, and b21 and b22 for H2).

A decision boundary is a boundary parallel to the hyperplane that touches the closest tuples of a class on one side of the hyperplane.

The distance between the two decision boundaries of a hyperplane is called its margin. So, if the data are classified using hyperplane H1, the margin is larger than if hyperplane H2 is used.

The margin of the hyperplane is related to the classification error. In other words, the larger the margin, the lower the classification error.

09/24/2021 13 / 131
Maximum Margin Hyperplane contd...

Intuitively, a classifier whose hyperplane has a small margin is more susceptible to model overfitting and tends to classify unseen data with weak confidence.

Thus, during the training or learning phase, the approach is to search for the hyperplane with the maximum margin.

Such a hyperplane is called the maximum margin hyperplane, abbreviated MMH.

We may note that the shortest distance from a hyperplane to one of its decision boundaries is equal to the shortest distance from the hyperplane to the decision boundary on its other side.

In other words, the hyperplane lies midway between its two decision boundaries.

09/24/2021 14 / 131
Linear SVM

09/24/2021 15 / 131
Linear SVM

An SVM which is used to classify data that are linearly separable is called a linear SVM.

In other words, a linear SVM searches for a hyperplane with the maximum margin.

This is why a linear SVM is often termed a maximal margin classifier (MMC).

09/24/2021 16 / 131
Finding MMH for a Linear SVM

In the following, we shall discuss the mathematics of finding the MMH given a training set.

In our discussion, we shall consider a binary classification problem consisting of n training data.

Each tuple is denoted by (Xi, Yi), where Xi = (xi1, xi2, ..., xim) corresponds to the attribute set of the ith tuple (data in an m-dimensional space) and Yi ∈ {+, -} denotes its class label.

Note that the choice of which class is labeled as + or - is arbitrary.
09/24/2021 17 / 131
Finding MMH for a Linear SVM

Thus, given {(Xi, Yi)}, i = 1, 2, ..., n, we are to obtain a hyperplane which separates all the Xi into two sides of it (of course, with the maximum gap).

Before going to the general equation of a hyperplane in n dimensions, let us first consider a hyperplane in the 2-D plane.

09/24/2021 18 / 131
Equation of a hyperplane in 2-D

Let us consider a 2-D training tuple with attributes A1 and A2 as


X =(x1,x2), where x1 and x2 are values of attributes A1 and A2,
respectively for X .

Equation of a plane in 2-D space can be written as

w0 + w1 x1 + w2 x2 = 0 [e.g., ax + by + c = 0]
where w0, w1, and w2 are some constants defining the slope and
intercept of the line.

Any point lying above such a hyperplane satisfies


w0 + w1 x1 + w2 x2 > 0 (1)

09/24/2021 19 / 131
Equation of a hyperplane in 2-D

Similarly, any point lying below the hyperplane satisfies

w0 + w1 x1 + w2 x2 < 0 (2)
An SVM hyperplane is an n-dimensional generalization of a
straight line in 2-D.

It can be visualized as a plane surface in 3-D, but it is not easy


to visualize when dimensionality is greater than 3!

09/24/2021 20 / 131
Equation of a hyperplane

In fact, the Euclidean equation of a hyperplane in Rm is

w1 x1 + w2 x2 + ... + wm xm = b    (3)

where the wi's are real numbers and b is a real constant (called the intercept, which can be positive or negative).

09/24/2021 21 / 131
Finding a hyperplane

In matrix form, a hyperplane can thus be represented as

W.X + b = 0    (4)

where W = [w1, w2, ..., wm], X = [x1, x2, ..., xm] and b is a real constant.

Here, W and b are parameters of the classifier model to be


evaluated given a training set D.

09/24/2021 22 / 131
Finding a hyperplane

Let us consider a two-dimensional training set consisting of two classes + and -, as shown in Fig. 4.

Suppose b1 and b2 are two decision boundaries above and below a hyperplane, respectively.

Consider any two points X+ and X- as shown in Fig. 4.

For X+ located above the decision boundary, the equation can be written as

W.X+ + b = K, where K > 0    (5)

09/24/2021 23 / 131
Finding a hyperplane

Figure 4: Computation of the MMH

09/24/2021 24 / 131
Finding a hyperplane

Similarly, for any point X- located below the decision boundary, the equation is

W.X- + b = K', where K' < 0    (6)

Thus, if we label all +'s as class + and all -'s as class -, then we can predict the class label Y for any test tuple X as:

Y = +  if W.X + b > 0
Y = -  if W.X + b < 0

09/24/2021 25 / 131
Hyperplane and Classification

Note that W.X + b = 0, the equation representing the hyperplane, can be interpreted as follows.

Here, W represents the orientation of the hyperplane and b its intercept from the origin.

If both W and b are scaled (up or down) by dividing by a non-zero constant, we get the same hyperplane.

This means there can be an infinite number of solutions using various scaling factors, all of them geometrically representing the same hyperplane.

09/24/2021 26 / 131
Hyperplane and Classification

To avoid such confusion, we can make W and b unique by adding the constraint that W'.X + b' = ±1 for data points on the boundary of each class.

It may be noted that W'.X + b' = ±1 represents two hyperplanes parallel to each other.

For clarity of notation, we write this as W.X + b = ±1.

Having this understanding, we are now in a position to calculate the margin of a hyperplane.

09/24/2021 27 / 131
Calculating Margin of a Hyperplane
Suppose x1 and x2 (refer to Figure 3) are two points on the decision boundaries b1 and b2, respectively. Thus,

W.x1 + b = 1    (7)

W.x2 + b = -1    (8)

Subtracting, W.(x1 - x2) = 2    (9)

This represents a dot (.) product of the two vectors W and (x1 - x2). Hence, the margin is

d = 2 / ||W||    (10)

where ||W|| = sqrt(w1² + w2² + ... + wm²) in an m-dimensional space.

09/24/2021 28 / 131
Calculating Margin of a Hyperplane

We now calculate the margin more formally, as follows. Consider two parallel hyperplanes H1 and H2, as shown in Fig. 5. Let the equations of the hyperplanes be

H1 : w1 x1 + w2 x2 - b1 = 0    (11)

H2 : w1 x1 + w2 x2 - b2 = 0    (12)

To obtain the perpendicular distance d between H1 and H2, we draw a right-angled triangle ABC, as shown in Fig. 5.

09/24/2021 29 / 131
Calculating Margin of a Hyperplane

Figure 5: Detail of margin calculation. (The figure shows the X1-X2 plane with the two parallel hyperplanes H1: w1 x1 + w2 x2 - b1 = 0 and H2: w1 x1 + w2 x2 - b2 = 0, the points A = (b1/w1, 0) and B = (b2/w1, 0) on the X1 axis, and the right-angled triangle ABC whose side AC is the perpendicular distance d between H1 and H2.)

09/24/2021 30 / 131
Calculating Margin of a Hyperplane

Being parallel, the slope of H1 (and H2) is tanθ = -w1/w2.

In triangle ABC, AB is the hypotenuse and AC is the perpendicular distance between H1 and H2.

Thus, sin(180° - θ) = AC/AB, or AC = AB · sinθ.

AB = b2/w1 - b1/w1 = |b2 - b1| / w1, and sinθ = w1 / sqrt(w1² + w2²) (since tanθ = -w1/w2).

Hence, AC = |b2 - b1| / sqrt(w1² + w2²).

09/24/2021 31 / 131
Calculating Margin of a Hyperplane

This can be generalized to find the distance between the two parallel margins of any hyperplane in n-dimensional space as

d = |b2 - b1| / sqrt(w1² + w2² + ... + wn²) = |b2 - b1| / ||W||

where ||W|| = sqrt(w1² + w2² + ... + wn²).

In the SVM literature, this margin is famously written as µ(W, b).
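To make the distance formula concrete, here is a tiny numpy check (illustrative values only, not from the slides):

import numpy as np

W = np.array([3.0, 4.0])      # assumed weight vector, ||W|| = 5
b1, b2 = 1.0, 11.0            # assumed offsets of the two parallel hyperplanes

margin = abs(b2 - b1) / np.linalg.norm(W)   # d = |b2 - b1| / ||W||
print(margin)                               # 10 / 5 = 2.0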

09/24/2021 32 / 131
Calculating Margin of a Hyperplane

The training phase of SVM involves estimating the parameters W and b of the hyperplane from the given training data. The parameters must be chosen in such a way that the following two inequalities are satisfied:

W.xi + b ≥ 1 if yi = 1    (13)

W.xi + b ≤ -1 if yi = -1    (14)

These conditions impose the requirement that all training tuples from class Y = + must be located on or above the hyperplane W.x + b = 1, while those from class Y = - must be located on or below the hyperplane W.x + b = -1 (also see Fig. 4).

09/24/2021 33 / 131
Learning for a Linear SVM

Both the inequalities can be summarized as

yi (W.xi + b) ≥ 1, i = 1, 2, ..., n    (15)

Note that any tuples that lie on the hyperplanes H1 and H2 are called support vectors.

Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification.

In the following, we discuss the approach of finding the MMH and the support vectors.

The above problem turns out to be an optimization problem, that is, to maximize the margin µ(W, b) = 2/||W||.

09/24/2021 34 / 131
Searching for MMH

Maximizing the margin is, however, equivalent to minimizing the following objective function:

µ'(W, b) = ||W||² / 2    (16)

In a nutshell, the learning task in SVM can be formulated as the following constrained optimization problem:

minimize µ'(W, b)
subject to yi (W.xi + b) ≥ 1, i = 1, 2, ..., n    (17)

09/24/2021 35 / 131
Searching for MMH

The above-stated constrained optimization problem is popularly known as a convex optimization problem, where the objective function is quadratic and the constraints are linear in the parameters W and b.

The well-known technique to solve a convex optimization problem is the standard Lagrange multiplier method.

First, we shall learn the Lagrange multiplier method, and then come back to solving our own SVM problem.

09/24/2021 36 / 131
Lagrange Multiplier Method

The Lagrange multiplier method follows two different steps depending on the type of constraints.

1 Equality-constraint optimization problem: In this case, the problem is of the form:

minimize f(x1, x2, ..., xd)
subject to gi(x) = 0, i = 1, 2, ..., p

2 Inequality-constraint optimization problem: In this case, the problem is of the form:

minimize f(x1, x2, ..., xd)
subject to hi(x) ≤ 0, i = 1, 2, ..., p

09/24/2021 37 / 131
Lagrange Multiplier Method
Equality constraint optimization problem solving
The following steps are involved in this case:

1 Define the Lagrangian as follows:

L(X, λ) = f(X) + Σ_{i=1}^{p} λi · gi(X)    (18)

where the λi's are dummy variables called Lagrange multipliers.

2 Set the first-order derivatives of the Lagrangian with respect to X and the Lagrange multipliers λi to zero. That is,

δL/δxi = 0, i = 1, 2, ..., d
δL/δλi = 0, i = 1, 2, ..., p

3 Solve the (d + p) equations to find the optimal values of X = [x1, x2, ..., xd] and the λi's.
09/24/2021 38 / 131
Lagrange Multiplier Method

Example: Equality constraint optimization problem

Suppose we want to minimize f(x, y) = x + 2y subject to x² + y² - 4 = 0.

1 The Lagrangian is L(x, y, λ) = x + 2y + λ(x² + y² - 4).

2 Setting the first-order derivatives to zero:

δL/δx = 1 + 2λx = 0
δL/δy = 2 + 2λy = 0
δL/δλ = x² + y² - 4 = 0

3 Solving the above three equations for x, y and λ, we get x = ∓2/√5, y = ∓4/√5 and λ = ±√5/4.
09/24/2021 39 / 131
Lagrange Multiplier Method
Example: Equality constraint optimization problem (contd.)

When λ = √5/4, we get x = -2/√5, y = -4/√5, and f(x, y) = -10/√5.

Similarly, when λ = -√5/4, we get x = 2/√5, y = 4/√5, and f(x, y) = 10/√5.

Thus, the function f(x, y) has its minimum value at x = -2/√5, y = -4/√5.
09/24/2021 40 / 131
Lagrange Multiplier Method

Inequality constraint optimization problem solving

The method for solving this problem is quite similar to the Lagrange multiplier method described above.

It starts with the Lagrangian

L = f(x) + Σ_{i=1}^{p} λi · hi(x)    (19)

In addition, it introduces additional constraints, called the Karush-Kuhn-Tucker (KKT) constraints, which are stated on the next slide.

09/24/2021 41 / 131
Lagrange Multiplier Method

Inequality constraint optimization problem solving

δL/δxi = 0, i = 1, 2, ..., d

λi ≥ 0, i = 1, 2, ..., p

hi(x) ≤ 0, i = 1, 2, ..., p

λi · hi(x) = 0, i = 1, 2, ..., p

Solving the above equations, we can find the optimal value of f(x).
09/24/2021 42 / 131
Lagrange Multiplier Method

Example: Inequality constraint optimization problem


Consider the following problem:

Minimize f(x, y) = (x - 1)² + (y - 3)²
subject to x + y ≤ 2, y ≥ x

The Lagrangian for this problem is

L = (x - 1)² + (y - 3)² + λ1(x + y - 2) + λ2(x - y)

subject to the KKT constraints, which are as follows:

09/24/2021 43 / 131
Lagrange Multiplier Method

Example: Inequality constraint optimization problem

δL/δx = 2(x - 1) + λ1 + λ2 = 0
δL/δy = 2(y - 3) + λ1 - λ2 = 0
λ1(x + y - 2) = 0
λ2(x - y) = 0
λ1 ≥ 0, λ2 ≥ 0
x + y ≤ 2, y ≥ x

09/24/2021 44 / 131
Lagrange Multiplier Method

Example: Inequality constraint optimization problem


To solve the KKT constraints, we have to check the following cases:

Case 1: λ1 = 0, λ2 = 0
2(x - 1) = 0, 2(y - 3) = 0 ⇒ x = 1, y = 3
Since x + y = 4 violates x + y ≤ 2, this is not a feasible solution.

Case 2: λ1 = 0, λ2 ≠ 0
x - y = 0, 2(x - 1) + λ2 = 0, 2(y - 3) - λ2 = 0
⇒ x = 2, y = 2 and λ2 = -2
Since λ2 = -2 violates λ2 ≥ 0, this is not a feasible solution.

09/24/2021 45 / 131
Lagrange Multiplier Method

Example: Inequality constraint optimization problem

Case 3: λ1 ≠ 0, λ2 = 0
x + y = 2, 2(x - 1) + λ1 = 0, 2(y - 3) + λ1 = 0
⇒ x = 0, y = 2 and λ1 = 2; this is a feasible solution.

Case 4: λ1 ≠ 0, λ2 ≠ 0
x + y = 2, x - y = 0, 2(x - 1) + λ1 + λ2 = 0, 2(y - 3) + λ1 - λ2 = 0
⇒ x = 1, y = 1, λ1 = 2 and λ2 = -2; this is not a feasible solution.

Hence, the constrained minimum is at x = 0, y = 2.
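As a sanity check, the same problem can be solved numerically; the sketch below (an illustrative aside, assuming SciPy's SLSQP solver) should recover the feasible solution x = 0, y = 2 of Case 3:

import numpy as np
from scipy.optimize import minimize

f = lambda v: (v[0] - 1) ** 2 + (v[1] - 3) ** 2            # objective
cons = [
    {"type": "ineq", "fun": lambda v: 2 - (v[0] + v[1])},  # x + y <= 2
    {"type": "ineq", "fun": lambda v: v[1] - v[0]},        # y >= x
]
res = minimize(f, x0=np.array([0.0, 0.0]), constraints=cons, method="SLSQP")
print(res.x)   # close to [0, 2], matching Case 3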
09/24/2021 46 / 131
LMM to Solve Linear SVM

The optimization problem for the linear SVM is an inequality-constraint optimization problem.

The Lagrangian for this optimization problem can be written as

L = ||W||²/2 - Σ_{i=1}^{n} λi (yi (W.xi + b) - 1)    (20)

where the parameters λi are the Lagrange multipliers, and W = [w1, w2, ..., wm] and b are the model parameters.

09/24/2021 47 / 131
LMM to Solve Linear SVM
The KKT constraints are:

δL/δW = 0 ⇒ W = Σ_{i=1}^{n} λi yi xi

δL/δb = 0 ⇒ Σ_{i=1}^{n} λi yi = 0

λi ≥ 0, i = 1, 2, ..., n

λi [yi (W.xi + b) - 1] = 0, i = 1, 2, ..., n

yi (W.xi + b) ≥ 1, i = 1, 2, ..., n

Solving the KKT constraints is computationally expensive; it is typically done with a linear/quadratic programming technique (or another numerical method).
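In practice, an off-the-shelf QP-based solver does this work; below is a minimal sketch (assuming scikit-learn and a toy data set, not part of the slides) that exposes the quantities just discussed:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])  # toy, separable
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin

print(clf.support_vectors_)        # tuples with lambda_i > 0 (the support vectors)
print(clf.dual_coef_)              # the products lambda_i * y_i
print(clf.coef_, clf.intercept_)   # W = sum_i lambda_i y_i x_i, and b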
09/24/2021 48 / 131
LMM to Solve Linear SVM

We first solve the above set of equations to find all the feasible solutions.

Then, we can determine the optimum value of µ(W, b).

Note:
1 The Lagrange multiplier λi must be zero unless the training instance xi satisfies the equation yi (W.xi + b) = 1. Thus, the training tuples with λi > 0 lie on the hyperplane margins and hence are support vectors.
2 The training instances that do not lie on the hyperplane margins have λi = 0.

09/24/2021 49 / 131
Classifying a test sample using Linear SVM

For given training data, using the SVM principle we obtain the MMH in the form of W, b and the λi's. This is the machine (i.e., the SVM).

Now, let us see how this MMH can be used to classify a test tuple, say X. This can be done as follows:

δ(X) = W.X + b = Σ_{i=1}^{n} λi yi (xi.X) + b    (21)

Note that

W = Σ_{i=1}^{n} λi yi xi

09/24/2021 50 / 131
Classifying a test sample using Linear SVM

This is famously called the "Representer Theorem", which states that the solution W can always be represented as a linear combination of the training data.

Thus, δ(X) = W.X + b = Σ_{i=1}^{n} λi yi (xi.X) + b

09/24/2021 51 / 131
Classifying a test sample using Linear SVM

The above involves dot products xi.X, where xi is a support vector (this is so because λi = 0 for all training tuples except the support vectors). To classify X, we check the sign of δ(X).

If it is positive, then X falls on or above the MMH and the SVM predicts that X belongs to class +. On the other hand, if the sign is negative, then X falls below the MMH and the class prediction is -.

Note:
1 Once the SVM is trained with the training data, the complexity of the classifier is characterized by the number of support vectors.
2 The dimensionality of the data is not an issue in SVM, unlike in other classifiers.

09/24/2021 52 / 131
Illustration : Linear SVM

Consider the case of a binary classification starting with a training data set of 8 tuples, as shown in Table 1.

Using quadratic programming, we can solve the KKT constraints to obtain the Lagrange multiplier λi for each training tuple, as shown in Table 1.

Note that only the first two tuples are support vectors in this case.

Let W = (w1, w2) and b denote the parameters to be determined. We can solve for w1 and w2 as follows:

w1 = Σ_i λi yi xi1 = 65.52 × 1 × 0.38 + 65.52 × (-1) × 0.49 = -6.64    (22)

w2 = Σ_i λi yi xi2 = 65.52 × 1 × 0.47 + 65.52 × (-1) × 0.61 = -9.32    (23)
09/24/2021 53 / 131
Illustration : Linear SVM

Table 1: Training Data

A1 A2 y λi
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0

09/24/2021 54 / 131
Illustration : Linear SVM

Figure 6: Linear SVM example.

09/24/2021 55 / 131
Illustration : Linear SVM

The parameter b can be calculated from each support vector as follows:

b1 = 1 - W.x1 // for support vector x1 (class +)
   = 1 - (-6.64) × 0.38 - (-9.32) × 0.47 // using the dot product
   = 7.93
b2 = -1 - W.x2 // for support vector x2 (class -)
   = -1 - (-6.64) × 0.49 - (-9.32) × 0.61 // using the dot product
   = 7.93

Averaging these values of b1 and b2, we get b = 7.93.

09/24/2021 56 / 131
Illustration : Linear SVM

Thus, the MMH is -6.64x1 - 9.32x2 + 7.93 = 0 (also see Fig. 6).

Suppose the test tuple is X = (0.5, 0.5). Then

δ(X) = W.X + b
     = -6.64 × 0.5 - 9.32 × 0.5 + 7.93
     = -0.05, which is negative.

This implies that the test tuple falls below the MMH, and the SVM classifies X into class -.
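The arithmetic above can be re-checked with a few lines of numpy, assuming the rounded values of W and b reported on the slides:

import numpy as np

W = np.array([-6.64, -9.32])       # from Eqns. (22) and (23)
b = 7.93
X_test = np.array([0.5, 0.5])

delta = W @ X_test + b             # decision value delta(X) = W.X + b
label = "+" if delta > 0 else "-"  # the sign of delta(X) gives the class
print(round(delta, 2), label)      # -0.05, '-'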

09/24/2021 57 / 131
Classification of Multiple-class Data

In the discussion of the linear SVM, we limited ourselves to binary classification (i.e., classification with two classes only).

Note that the linear SVM discussed can handle any n-dimensional data, n ≥ 2. Now we discuss a more generalized linear SVM to classify n-dimensional data belonging to two or more classes.

There are two possibilities: all classes are pairwise linearly separable, or the classes overlap, that is, they are not linearly separable.

If the classes are pairwise linearly separable, then we can extend the principle of the linear SVM to each pair. There are two strategies:
1 One versus one (OVO) strategy
2 One versus all (OVA) strategy

09/24/2021 58 / 131
Multi-Class Classification: OVO Strategy

In the OVO strategy, we find an MMH for each pair of classes.

Thus, if there are n classes, there are nC2 pairs and hence that many possible classifiers (of course, some of which may be redundant).

See Fig. 7 for 3 classes (namely +, - and ×). Here, Hxy denotes the MMH between class labels x and y.

Similarly, you can think of 4 classes (namely +, -, ×, ÷) and more.

09/24/2021 59 / 131
Multi-Class Classification: OVO Strategy

Figure 7: 3-pairwise linearly separable classes.

09/24/2021 60 / 131
Multi-Class Classification: OVO Strategy

With the OVO strategy, we test each of the classifiers in turn and obtain δij(X), the output of the MMH between the ith and jth classes for test data X.

If there is a class i for which δij(X), for all j (j ≠ i), gives the same sign, then we can unambiguously say that X is in class i.

09/24/2021 61 / 131
Multi-Class Classification: OVA Strategy

The OVO strategy is not useful for data with a large number of classes, as the number of pairwise classifiers grows quadratically with the number of classes.

As an alternative to the OVO strategy, the OVA (one versus all) strategy has been proposed.

In this approach, we choose any class, say Ci, and consider all tuples of the other classes as belonging to a single class.

09/24/2021 62 / 131
Multi-Class Classification: OVA Strategy

The problem is therefore transformed into a binary classification problem, and using the linear SVM discussed above we can find the hyperplane. Let the hyperplane between Ci and the remaining classes be MMHi.

The process is repeated for each Ci ∈ {C1, C2, ..., Ck}, giving an MMHi for each class.

Thus, with the OVA strategy we get k classifiers.

09/24/2021 63 / 131
Multi-Class Classification: OVA Strategy

The unseen data X is then tested with each classifier so obtained.

Let δj(X) be the test result with MMHj that has the maximum magnitude among the test values. That is,

δj(X) = max∀i {δi(X)}    (24)

Thus, X is classified into class Cj.
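For illustration, here is a minimal sketch of the OVO and OVA strategies (assuming scikit-learn's wrappers around a linear SVM and a toy three-class data set; not part of the slides):

import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [0, 5], [1, 6], [0, 6]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # three classes

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # nC2 = 3 pairwise classifiers
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)  # k = 3 one-vs-all classifiers

x_test = np.array([[0.5, 0.5]])
print(ovo.predict(x_test), ova.predict(x_test))
# With OVA, the class with the maximum decision value delta_j(X) wins:
print(ova.decision_function(x_test))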

09/24/2021 64 / 131
Multi-Class Classification: OVA Strategy

Note:
The linear SVM used to classify multi-class data fails if the classes are not all linearly separable.

If one class is linearly separable from the remaining classes and the test data belongs to that particular class, only then does it classify accurately.

09/24/2021 65 / 131
Multi-Class Classification: OVA Strategy

Further, it is possible to have some tuples which cannot be classified by any of the linear SVMs (also see Fig. 8).

There are also some tuples which cannot be classified unambiguously by any of the hyperplanes.

All these tuples may be due to noise, errors, or data that are simply not linearly separable.

09/24/2021 66 / 131
Multi-Class Classification: OVA Strategy

Figure 8: Unclassifiable region in OVA strategy.

09/24/2021 67 / 131
Non-Linear SVM

09/24/2021 68 / 131
Non-Linear SVM

SVM classification for non-separable data

Figure 9 shows 2-D views of data that are linearly separable and data that are not.

In general, if the data are linearly separable then a separating hyperplane exists; otherwise there is no such hyperplane.

09/24/2021 69 / 131
Linear and Non-Linear Separable Data

Figure 9: Two types of training data.

09/24/2021 70 / 131
SVM classification for non separable data

Such linearly non-separable data can be classified using two approaches:
1 Linear SVM with soft margin
2 Non-linear SVM

In the following, we discuss the extension of the linear SVM to classify linearly non-separable data.

We discuss the non-linear SVM in detail later.

09/24/2021 71 / 131
Linear SVM for Linearly Not Separable Data

If the number of training instances violating linear separability is small, then we can still use a linear SVM classifier to classify them.

The rationale behind this approach can be better understood from Fig. 10.

09/24/2021 72 / 131
Linear SVM for Linearly Not Separable Data

Figure 10: Problem with linear SVM for linearly not separable data.

09/24/2021 73 / 131
Linear SVM for Linearly Not Separable Data

Suppose X1 and X2 are two instances.

We see that the hyperplane H1 classifies both X1 and X2 wrongly.

We may also note that, taking X1 and X2 into account, we could draw another hyperplane, namely H2, which classifies all training data correctly.

However, H1 is preferable to H2, as H1 has a larger margin than H2 and is thus less susceptible to overfitting.

09/24/2021 74 / 131
Linear SVM for Linearly Not Separable Data

In other words, a linear SVM can be refitted to learn a hyperplane that is tolerant of a small number of non-separable training instances.

This approach of refitting is called the soft margin approach (hence the SVM is called a soft margin SVM); it introduces slack variables for the inseparable cases.

More specifically, the soft margin SVM retains a linear hyperplane (i.e., linear decision boundaries) even in situations where the classes are not linearly separable.

The concept of the soft margin SVM is presented in the following slides.

09/24/2021 75 / 131
Soft Margin SVM

Recall that for the linear SVM, we are to determine a maximum margin hyperplane W.X + b = 0 with the following optimization:

minimize ||W||² / 2    (25)
subject to yi (W.xi + b) ≥ 1, i = 1, 2, ..., n

In the soft margin SVM, we consider a similar optimization except that the inequalities are relaxed, so that the formulation also covers linearly non-separable data.

09/24/2021 76 / 131
Soft Margin SVM

To do this, the soft margin SVM introduces slack variables ξi, positive values, into the constraints of the optimization problem.

Thus, for the soft margin we rewrite the optimization problem as follows:

minimize ||W||² / 2
subject to W.xi + b ≥ 1 - ξi, if yi = +1
           W.xi + b ≤ -1 + ξi, if yi = -1    (26)

where ξi ≥ 0 for all i.

Thus, in the soft margin SVM, we are to calculate W, b and the ξi's as the solution of learning the SVM.
09/24/2021 77 / 131
Soft Margin SVM : Interpretation of ξ
Let us find an interpretation of ξ, the slack variable in the soft margin SVM. For this, consider the data distribution shown in Fig. 11.

Figure 11: Interpretation of slack variable ξ.

09/24/2021 78 / 131
Soft Margin SVM : Interpretation of ξ

The tuple X is one of the instances that violates the constraints to be satisfied for the linear SVM.

Thus, W.X + b = -1 + ξ represents a hyperplane that is parallel to the decision boundary for class - and passes through X.

It can be shown that the distance between these two hyperplanes is d = ξ / ||W||.

In other words, ξ provides an estimate of the error of the decision boundary on the training example X.

09/24/2021 79 / 131
Soft Margin SVM : Interpretation of ξ

In principle, Eqns. 25 and 26 can be chosen as the optimization problem to train a soft margin SVM.

However, the soft margin SVM should impose a constraint on the amount of slack, i.e., on the number of non-separable instances it takes into account.

This is because the SVM might otherwise be trained with a decision boundary of very wide margin at the cost of misclassifying many of the training data.

09/24/2021 80 / 131
Soft Margin SVM : Interpretation of ξ
This is explained in Fig. 12. Here, if we increase the margin further, then P and Q will be misclassified.
Thus, there is a trade-off between the width of the margin and the training error.

Figure 12: MMH with wide margin and large training error.

09/24/2021 81 / 131
Soft Margin SVM : Interpretation of ξ

To avoid this problem, it is therefore necessary to modify the objective function so that it penalizes large values of the slack variables.

The modified objective function can be written as

f(W) = ||W||² / 2 + c · Σ_{i=1}^{n} (ξi)^φ    (27)

where c and φ are user-specified parameters representing the penalty of misclassifying the training data.

Usually, φ = 1. A larger value of c implies a higher penalty.
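A minimal sketch (assuming scikit-learn as the solver and synthetic overlapping data; not part of the slides) of how the penalty parameter c (called C there) trades margin width against slack:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)      # overlapping classes, so slack is needed

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more slack (wider margin, more support vectors);
    # large C penalizes slack heavily (narrower margin, fewer violations).
    print(C, len(clf.support_))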

09/24/2021 82 / 131
Solving for Soft Margin SVM
We can follow the Lagrange multiplier method to solve this inequality-constraint optimization problem, which can be reworked as follows:

L = ||W||²/2 + c·Σ_{i=1}^{n} ξi - Σ_{i=1}^{n} λi (yi (W.xi + b) - 1 + ξi) - Σ_{i=1}^{n} µi ξi    (28)

Here, the λi's and µi's are Lagrange multipliers. The inequality constraints are:

ξi ≥ 0, λi ≥ 0, µi ≥ 0

λi { yi (W.xi + b) - 1 + ξi } = 0

09/24/2021 83 / 131
Solving for Soft Margin SVM

The KKT constraints (in terms of the first-order derivatives of L with respect to the different parameters) are:

δL/δwj = wj - Σ_{i=1}^{n} λi yi xij = 0  ⇒  wj = Σ_{i=1}^{n} λi yi xij, for each j

δL/δb = -Σ_{i=1}^{n} λi yi = 0  ⇒  Σ_{i=1}^{n} λi yi = 0

09/24/2021 84 / 131
Solving for Soft Margin SVM

δL/δξi = c - λi - µi = 0  ⇒  λi + µi = c, and µi ξi = 0 for all i = 1, 2, ..., n    (29)

The above set of equations can be solved for the values of W = [w1, w2, ..., wm], b, and the λi's, µi's and ξi's.

Note:
1 λi ≠ 0 for support vectors and for instances with ξi > 0, and
2 µi = 0 for those training data which are misclassified, that is, with ξi > 0.

09/24/2021 85 / 131
Non-Linear SVM

A linear SVM is undoubtedly a good classifier if it is trained on linearly separable data.

A linear SVM can also be used for non-linearly separable data, provided the number of such instances is small.

However, in real-life applications the amount of overlapping data is often so high that the soft margin SVM cannot cope and yield an accurate classifier.

As an alternative, there is a need to compute a decision boundary which is not linear (i.e., not a hyperplane but a hypersurface).

09/24/2021 86 / 131
Non-Linear SVM

To understand this, see Figure 13.

Note that a linear hyperplane is expressed as a linear equation in terms of the n-dimensional components, whereas a non-linear hypersurface is a non-linear expression.

Figure 13: 2-D view of a few class separabilities.

09/24/2021 87 / 131
Non-Linear SVM

A hyperplane is expressed as

Linear: w1x1 + w2x2 + w3x3 + c = 0    (30)

whereas a non-linear hypersurface is expressed as

Non-linear: w1x1² + w2x2² + w3x1x2 + w4x3² + w5x1x3 + c = 0    (31)

The task therefore turns into finding a non-linear decision boundary, that is, a non-linear hypersurface in the input space containing the linearly non-separable data.

This task is in fact neither hard nor complex, and fortunately it can be accomplished by extending the formulation of the linear SVM we have already learned.

09/24/2021 88 / 131
Non-Linear SVM

This can be achieved in two major steps:

1 Transform the original (non-linear) input data into a higher-dimensional space (where a linear representation of the data is possible).
Note that this is feasible because SVM's performance is decided by the support vectors (i.e., a subset of the training data), not by the dimensionality of the data.

2 Search for linear decision boundaries to separate the transformed higher-dimensional data.

This can be done along the same lines as for the linear SVM.

In a nutshell, to obtain a non-linear SVM, the trick is to transform the non-linear data into higher-dimensional, linearly separable data.

This transformation is popularly called a non-linear mapping or attribute transformation. The rest is the same as for the linear SVM.
09/24/2021 89 / 131
Concept of Non-Linear Mapping
In order to understand the concept of the non-linear transformation of the original input data into a higher-dimensional space, let us consider a non-linear second-order polynomial in a 3-D input space:

X(x1, x2, x3) = w1x1 + w2x2 + w3x3 + w4x1² + w5x1x2 + w6x1x3 + c

The 3-D input vector X(x1, x2, x3) can be mapped into a 6-D space Z(z1, z2, z3, z4, z5, z6) using the following mappings:

z1 = φ1(x) = x1
z2 = φ2(x) = x2
z3 = φ3(x) = x3
z4 = φ4(x) = x1²
z5 = φ5(x) = x1·x2
z6 = φ6(x) = x1·x3
09/24/2021 90 / 131
Concept of Non-Linear Mapping

The transformed data in the 6-D space will then satisfy a linear form:

Z: w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + c

Thus, if the Z space is populated from the input attributes x1, x2, x3 (and hence the values of Z), then we can classify the data using linear decision boundaries.

09/24/2021 91 / 131
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
The figure below shows an example of a 2-D data set consisting of class label +1 (shown as +) and class label -1 (shown as -).

Figure 14: Non-linear mapping to Linear SVM.

09/24/2021 92 / 131
Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM


We see that all instances of class -1 can be separated from the instances of class +1 by a circle. The following equation for the decision boundary can be used:

X(x1, x2) = +1 if sqrt((x1 - 0.5)² + (x2 - 0.5)²) > 0.2
X(x1, x2) = -1 otherwise    (32)

09/24/2021 93 / 131
Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM

The decision boundary can be written as

sqrt((x1 - 0.5)² + (x2 - 0.5)²) = 0.2,  or equivalently  x1² - x1 + x2² - x2 = -0.46

A non-linear transformation in the 2-D space is proposed as follows:

Z(z1, z2) : φ1(x) = x1² - x1,  φ2(x) = x2² - x2    (33)
09/24/2021 94 / 131
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
When plotted, the Z space takes the view shown in Fig. 15, where the data are separable by a linear boundary, namely
Z : z1 + z2 = -0.46

Figure 15: Non-linear mapping to Linear SVM.
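A small numerical sketch (not from the slides) of this mapping: points labelled by the rule of Eqn. (32) become linearly separable by z1 + z2 = -0.46 in the transformed space.

import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(np.sqrt((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2) > 0.2, 1, -1)

Z = np.column_stack([X[:, 0] ** 2 - X[:, 0], X[:, 1] ** 2 - X[:, 1]])  # phi(X)
pred = np.where(Z[:, 0] + Z[:, 1] > -0.46, 1, -1)   # linear boundary in Z space
print(np.mean(pred == y))    # 1.0: every point is separated correctly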

09/24/2021 95 / 131
Non-Linear to Linear Transformation: Issues

The non-linear mapping, and hence the linear decision boundary concept, looks pretty simple. But there are many potential problems in doing so.

1 Mapping: How do we choose the non-linear mapping to a higher-dimensional space?

In fact, the φ-transformation works fine for small examples, but it fails for realistically sized problems.

2 Cost of mapping: For N-dimensional input instances there exist NH = (N + d - 1)! / (d! (N - 1)!) different monomials comprising a feature space of dimensionality NH. Here, d is the maximum degree of a monomial.

09/24/2021 96 / 131
Non-Linear to Linear Transformation: Issues

...
3 Dimensionality problem: It may suffer from the curse-of-dimensionality problem often associated with high-dimensional data.

More specifically, in the calculation of W.X or Xi.X (in δ(X); see Eqn. 21), we need n multiplications and n additions (in their dot products) for each of the n-dimensional input instances and support vectors, respectively.
As the number of input instances as well as support vectors can be enormously large, this is computationally expensive.

4 Computational cost: Solving the constrained quadratic optimization problem in the high-dimensional feature space is also a computationally expensive task.

09/24/2021 97 / 131
Non-Linear to Linear Transformation: Issues

Fortunately, mathematicians have cleverly proposed an elegant solution to the above problems.

Their solution consists of the following:

1 Dual formulation of the optimization problem
2 Kernel trick

In the next few slides, we shall learn about these two topics.

09/24/2021 98 / 131
Dual Formulation of Optimization Problem

We have already learned the Lagrangian formulation to find the maximum margin hyperplane of a linear SVM classifier.

Such a formulation is called the primal form of the constrained optimization problem.

The primal form of the optimization problem is reproduced here:

Minimize ||W||² / 2
Subject to yi (W.xi + b) ≥ 1, i = 1, 2, ..., n    (34)

09/24/2021 99 / 131
Dual Formulation of Optimization Problem

The Lagrangian of the above inequality-constraint optimization problem (according to the Lagrange multiplier method) is given by

Lp = ||W||²/2 - Σ_{i=1}^{n} λi (yi (W.xi + b) - 1)    (35)

where the λi's are called Lagrange multipliers.

This Lp is called the primal form of the Lagrangian optimization problem.

09/24/2021 100 / 131
Dual Formulation of Optimization Problem

The dual form of the same problem can be derived as follows.

To minimize the Lagrangian, we take the derivatives of Lp with respect to W and b and set them to zero:

δLp/δW = 0 ⇒ W = Σ_{i=1}^{n} λi yi xi    (36)

δLp/δb = 0 ⇒ Σ_{i=1}^{n} λi yi = 0    (37)

09/24/2021 101 / 131
Dual Formulation of Optimization Problem

Substituting these two equations into Lp, we get the Lagrangian L as

L = Σ_{i=1}^{n} λi + (1/2) Σ_{i,j} λi λj yi yj (xi.xj) - Σ_{i=1}^{n} λi yi (Σ_{j=1}^{n} λj yj xj).xi
  = Σ_{i=1}^{n} λi - (1/2) Σ_{i,j} λi λj yi yj (xi.xj)

This form is called the dual form of the Lagrangian and is distinguishably written as

LD = Σ_{i=1}^{n} λi - (1/2) Σ_{i,j} λi λj yi yj (xi.xj)    (38)

09/24/2021 102 / 131
Dual Formulation of Optimization Problem

There are key differences between the primal (Lp) and dual (LD) forms of the Lagrangian optimization problem, as follows.

1 Lp involves a large number of parameters, namely W, b and the λi's. On the other hand, LD involves only the λi's, that is, the Lagrange multipliers.
2 Lp is a minimization problem, as its quadratic term is positive. However, the quadratic term in LD has a negative sign; hence it turns out to be a maximization problem.
3 Lp involves the calculation of W.x, whereas LD involves the calculation of xi.xj. This is, in fact, advantageous, and we will realize why when we learn kernel-based calculation.

09/24/2021 104 / 131
Dual Formulation of Optimization Problem

...
4 The SVM classifier in primal form is δP(X) = W.X + b with W = Σ_{i=1}^{n} λi yi xi, whereas the dual version of the classifier is

δD(X) = Σ_{i=1}^{m} λi yi (xi.X) + b

where xi is the ith support vector and there are m support vectors. Thus, both Lp and LD are equivalent.

09/24/2021 105 / 131
Kernel Trick

We have already covered the idea that training data which are not linearly separable can be transformed into a higher-dimensional feature space such that, in the transformed space, a hyperplane can be found that separates the transformed data, and hence the original data (also see Fig. 16).

Clearly, the data on the left of the figure are not linearly separable.

Yet if we map them to a 3-D space using φ, then it is possible to have a decision boundary, and hence a hyperplane, in the 3-D space.

09/24/2021 106 / 131
Kernel Trick

Figure 16: SVM classifier in transformed feature space

09/24/2021 107 / 131
Kernel Trick

Example: SVM classifier for non-linear data

Suppose there is a set of data in R² (i.e., in 2-D space), and φ is the mapping from X ∈ R² to Z(z1, z2, z3) ∈ R³, in the 3-D space:

R² ⇒ X(x1, x2)
R³ ⇒ Z(z1, z2, z3)

09/24/2021 108 / 131
Kernel Trick

Example: SVM classifier for non-linear data

φ(X) ⇒ Z :  z1 = x1²,  z2 = √2·x1x2,  z3 = x2²

The hyperplane in R³ is of the form

w1x1² + w2·√2·x1x2 + w3x2² = 0

which is the equation of an ellipse in 2-D.

09/24/2021 109 / 131
Kernel Trick

Example: SVM classifier for non-linear data

After the transformation φ mentioned above, we have a decision boundary of the form

w1z1 + w2z2 + w3z3 = 0

This is clearly a linear form in the 3-D space. In other words, the boundary W.x + b = 0 in R² has a mapped equivalent W'.z + b' = 0 in R³.

This means that data which are not linearly separable in 2-D are separable in 3-D; that is, non-linear data can be classified by a linear SVM classifier.

09/24/2021 110 / 131
Kernel Trick

The above can be generalized as follows.

Classifier:
δ(x) = Σ_{i=1}^{n} λi yi (xi.x) + b
δ(z) = Σ_{i=1}^{n} λi yi φ(xi).φ(x) + b

Learning:
Maximize Σ_{i=1}^{n} λi - (1/2) Σ_{i,j} λi λj yi yj (xi.xj)
Maximize Σ_{i=1}^{n} λi - (1/2) Σ_{i,j} λi λj yi yj φ(xi).φ(xj)

Subject to: λi ≥ 0, Σ_i λi yi = 0

09/24/2021 111 / 131
Kernel Trick
Now, the question is how to choose φ, the mapping function X ⇒ Z, so that the linear SVM can be applied directly.

A breakthrough solution to this problem comes in the form of a method known as the kernel trick.

We discuss the kernel trick in the following.

We know that the dot (.) product is often regarded as a measure of similarity between two input vectors.

For example, if X and Y are two vectors, then X.Y = |X||Y| cosθ.

Here, the similarity between X and Y is measured as the cosine similarity.

If θ = 0 (i.e., cosθ = 1), then they are most similar; if they are orthogonal, they are dissimilar.

09/24/2021 112 / 131
Kernel Trick

Analogously, if Xi and Xj are two tuples, then Xi.Xj is regarded as a measure of similarity between Xi and Xj.

Again, φ(Xi) and φ(Xj) are the transformed features of Xi and Xj, respectively, in the transformed space; thus, φ(Xi).φ(Xj) should also be regarded as a similarity measure between φ(Xi) and φ(Xj) in the transformed space.

This is indeed an important revelation and is the basic idea behind the kernel trick.

Now the question naturally arises: if both measure similarity, what is the correlation between them (i.e., between Xi.Xj and φ(Xi).φ(Xj))?

Let us try to find the answer to this question through an example.

09/24/2021 113 / 131
Kernel Trick

Example: Correlation between Xi.Xj and φ(Xi).φ(Xj)

Without any loss of generality, let us consider the situation stated below:

φ : R² ⇒ R³ :  x1² ⇒ z1,  x2² ⇒ z2,  √2·x1x2 ⇒ z3    (40)

Suppose Xi = [xi1, xi2] and Xj = [xj1, xj2] are any two vectors in R².

Then φ(Xi) = [xi1², √2·xi1·xi2, xi2²] and φ(Xj) = [xj1², √2·xj1·xj2, xj2²] are the transformed versions of Xi and Xj, but in R³.

09/24/2021 114 / 131
Kernel Trick

Example: Correlation between Xi.Xj and φ(Xi).φ(Xj)

Now,

φ(Xi).φ(Xj) = [xi1², √2·xi1·xi2, xi2²] . [xj1², √2·xj1·xj2, xj2²]
            = xi1²·xj1² + 2·xi1·xi2·xj1·xj2 + xi2²·xj2²
            = (xi1·xj1 + xi2·xj2)²
            = ([xi1, xi2] . [xj1, xj2])²
            = (Xi.Xj)²

09/24/2021 115 / 131
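A quick numerical check (not from the slides) that the identity derived above holds for the mapping of Eqn. (40):

import numpy as np

def phi(x):
    # the assumed mapping R^2 -> R^3 from Eqn. (40)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Xi = np.array([0.3, -1.2])
Xj = np.array([2.0, 0.7])

lhs = phi(Xi) @ phi(Xj)      # dot product in the transformed space
rhs = (Xi @ Xj) ** 2         # squared dot product in the original space
print(np.isclose(lhs, rhs))  # True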
Kernel Trick : Correlation between Xi .Xj and
φ(Xi ).φ(Xj )
With reference to the above example, we can conclude that φ(Xi).φ(Xj) is correlated with Xi.Xj.

In fact, the same can be proved in general, for any feature vectors and their transformed feature vectors. A formal proof of this is beyond the scope of this course.

More specifically, there is a correlation between the dot products of the original data and of the transformed data.

Based on the above discussion, we can write the following implications:

Xi.Xj ⇒ φ(Xi).φ(Xj) ⇒ K(Xi, Xj)    (41)

Here, K(Xi, Xj) denotes a function, more popularly called a kernel function.

09/24/2021 116 / 131
Kernel Trick : Significance
This kernel function K(Xi, Xj) physically expresses the similarity in the transformed space (i.e., a non-linear similarity measure) using the original attributes Xi, Xj. In other words, K, the similarity function, computes the similarity whether the data are in the original attribute space or in the transformed attribute space.

Implicit transformation
The first and foremost significance is that we do not require any φ transformation of the original input data at all. This is evident from the following rewriting of our SVM classification problem:

Classifier: δ(X) = Σ_{i=1}^{n} λi Yi K(Xi, X) + b    (42)

Learning: maximize Σ_{i=1}^{n} λi - (1/2) Σ_{i,j} λi λj Yi Yj K(Xi, Xj)    (43)

Subject to λi ≥ 0 and Σ_i λi Yi = 0    (44)

09/24/2021 117 / 131
Kernel Trick : Significance

Computational efficiency:
Another important benefit is easy and efficient computability.

We know that in the SVM classifier discussed, we need several repeated rounds of computation of dot products, both in the learning phase and in the classification phase.

On the other hand, using the kernel trick, we can do this once and with fewer dot products.

This is explained in the following.

09/24/2021 118 / 131
Kernel Trick : Significance

We define a matrix called the design matrix (X), which contains all the data, and the Gram matrix (K), which contains all the dot products, as follows:

Design matrix: X = [X1; X2; ...; Xn], an n × N matrix whose rows are the training tuples    (45)

where n denotes the number of training data in an N-dimensional data space.

Note that X contains all input data in the original attribute space.

09/24/2021 119 / 131
Kernel Trick : Significance

Gram matrix: K =
[ X1ᵀ.X1  X1ᵀ.X2  ...  X1ᵀ.Xn ]
[ X2ᵀ.X1  X2ᵀ.X2  ...  X2ᵀ.Xn ]
[   ...     ...   ...    ...  ]
[ Xnᵀ.X1  Xnᵀ.X2  ...  Xnᵀ.Xn ]   (an n × n matrix)    (46)

Note that K contains all dot products among all training data, and since Xiᵀ.Xj = Xjᵀ.Xi, we in fact need to compute only half of the matrix.

More elegantly, all dot products are obtained by a single matrix operation.

09/24/2021 120 / 131
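The Gram matrix of Eqn. (46) is a single matrix operation in numpy; the design matrix below is an assumed toy example:

import numpy as np

X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0],
              [2.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])    # design matrix, shape (n, N) = (4, 3)

K = X @ X.T                        # Gram matrix, shape (n, n): K[i, j] = Xi . Xj
print(K.shape, np.allclose(K, K.T))   # (4, 4) True -- symmetric, as expected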
Kernel Trick : Significance

In a nutshell, we have the following.

Instead of mapping our data via φ and computing the dot products, we can accomplish everything in one operation.

The classifier can be learnt and applied without explicitly computing φ(X).

The complexity of learning depends on n (typically it is O(n³)), not on N, the dimensionality of the data space.

All that is required is the kernel K(Xi, Xj).

Next, we discuss the aspect of deciding on kernel functions.

09/24/2021 121 / 131
Kernel Functions

Before learning the popular kernel functions adopted in SVM classification, we give a precise and formal definition of a kernel K(Xi, Xj).

Definition 10.1:
A kernel function K(Xi, Xj) is a real-valued function defined on the input space such that there is another function φ : X → Z with K(Xi, Xj) = φ(Xi).φ(Xj). Symbolically, we write φ : Rᵐ → Rⁿ : K(Xi, Xj) = φ(Xi).φ(Xj), where n > m.

09/24/2021 122 / 131
Kernel Functions : Example

Table 2: Some standard kernel functions

Kernel name                    | Functional form                       | Remark
Linear kernel                  | K(X, Y) = XᵀY                         | The simplest kernel, used in the linear SVM
Polynomial kernel of degree p  | K(X, Y) = (XᵀY + 1)ᵖ, p > 0           | Produces large dot products; the power p is specified a priori by the user
Gaussian (RBF) kernel          | K(X, Y) = e^(-||X - Y||² / 2σ²)       | A non-linear kernel, called the Gaussian radial basis function kernel
Laplacian kernel               | K(X, Y) = e^(-λ||X - Y||)             | Follows the Laplacian mapping
Sigmoid kernel                 | K(X, Y) = tanh(β0·XᵀY + β1)           | Followed when statistical test data is known
Mahalanobis kernel             | K(X, Y) = e^(-(X - Y)ᵀ A (X - Y))     | Followed when statistical test data is known
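For illustration, two of the kernels in Table 2 written as plain Python functions (a sketch, not part of the slides):

import numpy as np

def linear_kernel(x, y):
    return x @ y                                             # K(X, Y) = X^T Y

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))  # Gaussian RBF

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(linear_kernel(x, y), rbf_kernel(x, y, sigma=1.0))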

09/24/2021 123 / 131
Kernel Functions : Example

Different kernel functions have different parameters. Those parameters are called magic parameters and are to be decided a priori.

Further, which kernel to use also depends on the pattern of the data as well as the judgment of the user.

In general, polynomial kernels result in large dot products, and the Gaussian RBF kernel produces more support vectors than other kernels.

An intuitive idea of the kernels used in non-linear data handling is shown in Fig. 17.

09/24/2021 124 / 131
Kernel Functions : Example

(a) Polynomial kernel (b) Sigmoid kernel

(c) Laplacian kernel (d) Gaussian RBF kernel


Figure 17: Visual interpretation of a few kernel functions.

09/24/2021 125 / 131
Mercer's Theorem: Properties of Kernel Functions

Other than the standard kernel functions, we can define our own kernels as well as combine two or more kernels into another kernel.

It is interesting to note the following properties. If K1 and K2 are two kernels, then

1 K1 + c
2 aK1
3 aK1 + bK2
4 K1·K2

are also kernels. Here, a, b and c ∈ R⁺.
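A rough numerical illustration (an assumption, not from the slides) of these closure properties, checking that the resulting Gram matrices stay symmetric positive semi-definite:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

K1 = X @ X.T                                   # linear-kernel Gram matrix
K2 = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # RBF

def is_psd(K, tol=1e-9):
    return np.all(np.linalg.eigvalsh(K) >= -tol)

# K1 + c, a*K1, a*K1 + b*K2, and K1.K2 (elementwise product of Gram matrices)
for K in (K1 + 0.5, 2.0 * K1, 2.0 * K1 + 3.0 * K2, K1 * K2):
    print(is_psd(K))   # all True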

09/24/2021 126 / 131
Mercer's Theorem: Properties of Kernel Functions

Another requirement for a kernel function used in a non-linear SVM is that there must exist a corresponding transformation such that the kernel function computed for a pair of vectors is equivalent to the dot product between the vectors in the transformed space.

This requirement can be formally stated in the form of Mercer's theorem.

09/24/2021 127 / 131
Mercer's Theorem: Properties of Kernel Functions

Theorem 10.1: Mercer's Theorem

A kernel function K can be expressed as K(X, Y) = φ(X).φ(Y) if and only if, for any function g(x) such that ∫ g(x)² dx is finite,

∫∫ K(X, Y) g(x) g(y) dx dy ≥ 0

The kernels which satisfy Mercer's theorem are called Mercer kernels.

Additional properties of Mercer kernels are:

09/24/2021 128 / 131
Mercer's Theorem: Properties of Kernel Functions

Symmetric: K(X, Y) = K(Y, X)

Positive semi-definite: αᵀ.K.α ≥ 0 for all α ∈ Rⁿ, where K is the n × n Gram matrix.

It can be proved that all the kernels listed in Table 2 satisfy (i) the kernel properties and (ii) Mercer's theorem, and hence (iii) are Mercer kernels.

Practice: Prove that K(X, Z) = X − XᵀZ is not a valid kernel.

09/24/2021 129 / 131
Characteristics of SVM

1 The SVM learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available to find the global minimum of the objective function. Other methods, such as rule-based classifiers and ANN classifiers, often find only locally optimal solutions.
2 SVM is well suited to classifying both linear and non-linear training data efficiently.
3 SVM can be applied to categorical data by introducing suitable similarity measures.
4 The computational complexity is influenced by the number of training data, not the dimensionality of the data. In fact, learning is computationally heavy and hence slow, but classification of a test sample is extremely fast and accurate.

09/24/2021 130 / 131
