10 Support Vector Machine
Introduction to SVM
Concept of maximum margin hyperplane
Linear SVM
1 Calculation of MMH
2 Learning a linear SVM
3 Classifying a test sample using linear SVM
4 Classifying multi-class data
Non-linear SVM
Introduction to SVM
Introduction
A classification technique that has received considerable attention is the support vector machine, popularly abbreviated as SVM.
This technique has its roots in statistical learning theory (Vladimir Vapnik, 1992).
For the task of classification, it searches for the optimal hyperplane (i.e., decision boundary; see Fig. 1 in the next slide) separating the tuples of one class from those of another.
SVM works well with high-dimensional data and thus avoids the dimensionality problem.
Although SVM-based classification (i.e., training) is extremely slow, the result is highly accurate. Further, testing an unknown sample is very fast.
SVM is less prone to overfitting than other methods. It also facilitates a compact model for classification.
Maximum Margin Hyperplane
Linear SVM
Finding MMH for a Linear SVM
Let the training data consist of n tuples (X1, Y1), (X2, Y2), ..., (Xn, Yn), where Xi = [xi1, xi2, ..., xim] (data in m-dimensional space) and Yi ∈ {+, −} denotes its class label.
Equation of a hyperplane in 2-D
w0 + w1 x1 + w2 x2 = 0 [e.g., ax + by + c = 0]
where w0, w1, and w2 are some constants defining the slope and
intercept of the line.
Equation of a hyperplane in 2-D
Any point lying above the line satisfies w0 + w1 x1 + w2 x2 > 0, while any point lying below it satisfies
w0 + w1 x1 + w2 x2 < 0 (2)
An SVM hyperplane is an n-dimensional generalization of a straight line in 2-D.
Equation of a hyperplane
w1 x1 + w2 x2 + ... + wm xm = b (3)
Finding a hyperplane
W.X + b = 0 (4)
where W = [w1, w2, ..., wm], X = [x1, x2, ..., xm], and b is a real constant.
Finding a hyperplane
For any point X⁺ lying above the hyperplane,
W.X⁺ + b = K, where K > 0 (5)
Finding a hyperplane
Similarly, for any point X⁻ lying below the hyperplane,
W.X⁻ + b = K′, where K′ < 0 (6)
Thus, if we label all +'s as class label + and all −'s as class label −, then we can predict the class label Y for any test data X as:
Y = + if W.X + b > 0
Y = − if W.X + b < 0
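The decision rule above can be stated directly in code. The following is a minimal sketch; the weight vector W and bias b are hypothetical values chosen only for illustration.

```python
import numpy as np

def predict(W, b, X):
    """Predict the class label of a test sample X using a learned
    hyperplane W.X + b = 0: '+' if W.X + b > 0, '-' otherwise."""
    return '+' if np.dot(W, X) + b > 0 else '-'

# Hypothetical weights for the 2-D hyperplane x1 + x2 - 1 = 0.
W, b = np.array([1.0, 1.0]), -1.0
print(predict(W, b, np.array([0.9, 0.8])))  # above the hyperplane -> '+'
print(predict(W, b, np.array([0.1, 0.2])))  # below the hyperplane -> '-'
```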
Calculating Margin of a Hyperplane
Suppose x1 and x2 (refer to Figure 3) are two points on the decision boundaries b1 and b2, respectively. Thus,
W.x1 + b = 1 (7)
W.x2 + b = −1 (8)
or
W.(x1 − x2) = 2 (9)
This represents a dot (.) product of the two vectors W and x1 − x2. The margin d, measured perpendicular to the hyperplane, is therefore
d = 2 / ||W|| (10)
where ||W|| = √(w1^2 + w2^2 + ... + wm^2) in an m-dimensional space.
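A quick numeric check of Eqn. (10): for any weight vector W (the one below is hypothetical), the margin between the two canonical hyperplanes W.x + b = ±1 is 2/||W||.

```python
import numpy as np

def margin(W):
    """Margin d = 2 / ||W|| between the hyperplanes W.x + b = +1 and -1."""
    return 2.0 / np.linalg.norm(W)

# Hypothetical weight vector in 2-D: ||W|| = 5, so d = 0.4.
print(margin(np.array([3.0, 4.0])))
```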
Calculating Margin of a Hyperplane
H1 : w1 x1 + w2 x2 − b1 = 0 (11)
H2 : w1 x1 + w2 x2 − b2 = 0 (12)
To compute the perpendicular distance d between H1 and H2, we draw a right-angled triangle ABC as shown in Fig. 5.
Calculating Margin of a Hyperplane
[Fig. 5: The parallel hyperplanes H1: w1 x1 + w2 x2 − b1 = 0 and H2: w1 x1 + w2 x2 − b2 = 0 in the X1-X2 plane, crossing the X1-axis at A = (b1/w1, 0) and B = (b2/w1, 0); the right-angled triangle ABC has hypotenuse AB and perpendicular side AC = d between the two hyperplanes.]
Calculating Margin of a Hyperplane
Being parallel, the slope of H1 (and H2) is tanθ = −w1/w2.
In triangle ABC, AB is the hypotenuse and AC is the perpendicular distance between H1 and H2.
Thus, sin(180 − θ) = AC/AB, or AC = AB · sinθ.
AB = b2/w1 − b1/w1 = |b2 − b1| / w1, and sinθ = w1 / √(w1^2 + w2^2) (since tanθ = −w1/w2).
Hence, AC = |b2 − b1| / √(w1^2 + w2^2).
Calculating Margin of a Hyperplane
d = |b2 − b1| / √(w1^2 + w2^2 + ... + wm^2) = |b2 − b1| / ||W||
where ||W|| = √(w1^2 + w2^2 + ... + wm^2).
In SVM literature, this margin is famously written as µ(W, b).
Calculating Margin of a Hyperplane
W .xi + b ≥ 1 if yi = 1 (13)
W .xi + b ≤ −1 if yi = −1 (14)
These conditions impose the requirement that all training tuples from class Y = + must be located on or above the hyperplane W.x + b = 1, while those from class Y = − must be located on or below the hyperplane W.x + b = −1 (see also Fig. 4).
Learning for a Linear SVM
yi(W.xi + b) ≥ 1, ∀i = 1, 2, ..., n (15)
Note that any tuples that lie on the hyperplanes H1 and H2 are called support vectors.
Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification.
In the following, we discuss the approach of finding the MMH and the support vectors.
The above problem turns out to be an optimization problem, that is, to maximize the margin µ(W, b) = 2/||W||.
Searching for MMH
Maximizing the margin is equivalent to minimizing
µ′(W, b) = ||W||^2 / 2 (16)
subject to yi(W.xi + b) ≥ 1, i = 1, 2, ..., n (17)
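The constrained problem (16)-(17) can be handed to a generic solver to see the idea at work. The sketch below uses SciPy's SLSQP on a tiny hypothetical 2-D data set; it is not how production SVM solvers work (they solve the dual with specialized QP methods), but it mirrors the formulation above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy set, labels in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                      # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * np.dot(w, w)          # mu'(W, b) = ||W||^2 / 2

constraints = [{'type': 'ineq',        # y_i (W.x_i + b) - 1 >= 0
                'fun': lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method='SLSQP')
W, b = res.x[:2], res.x[2]
print("W =", W, " b =", b, " margin =", 2 / np.linalg.norm(W))
```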
Lagrange Multiplier Method
A constrained optimization problem asks us to minimize an objective f(x)
subject to gi(x) = 0, i = 1, 2, ..., p (equality constraints), or
subject to hi(x) ≤ 0, i = 1, 2, ..., p (inequality constraints).
Lagrange Multiplier Method
Equality constraint optimization problem solving
The following steps are involved in this case:
1. Define the Lagrangian as follows:
L(X, λ) = f(X) + Σ_{i=1}^{p} λi.gi(X) (18)
where the λi's are dummy variables called Lagrangian multipliers.
2. Set the first-order derivatives of the Lagrangian with respect to x and the Lagrangian multipliers λi to zero. That is,
δL/δxi = 0, i = 1, 2, ..., d
δL/δλi = 0, i = 1, 2, ..., p
3. Solve the (d + p) equations to find the optimal values of X = [x1, x2, ..., xd] and the λi's.
Lagrange Multiplier Method
Example: Equality constraint optimization problem
Minimize f(x, y) = x + 2y subject to x^2 + y^2 = 4.
1. The Lagrangian is L = x + 2y + λ(x^2 + y^2 − 4).
2. Setting the first-order derivatives to zero:
δL/δx = 1 + 2λx = 0
δL/δy = 2 + 2λy = 0
δL/δλ = x^2 + y^2 − 4 = 0
3. Solving the above three equations for x, y and λ, we get x = ∓2/√5, y = ∓4/√5 and λ = ±√5/4.
Lagrange Multiplier Method
Example: Equality constraint optimization problem (contd.)
When λ = √5/4, x = −2/√5 and y = −4/√5, and we get f(x, y) = −10/√5.
Similarly, when λ = −√5/4, x = 2/√5 and y = 4/√5, and we get f(x, y) = 10/√5.
Thus, the function f(x, y) has its minimum value at x = −2/√5, y = −4/√5.
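The same three equations can be solved symbolically. A minimal SymPy sketch of the worked example above (objective x + 2y and constraint x^2 + y^2 = 4, as reconstructed from the given derivatives and solutions):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + 2 * y                        # objective of the example
g = x**2 + y**2 - 4                  # equality constraint g(x, y) = 0
L = f + lam * g                      # Lagrangian (Eq. 18)

# Steps 2 and 3: set first-order derivatives to zero and solve.
stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                      [x, y, lam], dict=True)
for s in stationary:
    print(s, " f =", sp.simplify(f.subs(s)))
```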
Lagrange Multiplier Method
Inequality constraint optimization problem solving
In this case, the solution must satisfy the following Karush-Kuhn-Tucker (KKT) conditions:
δL/δxi = 0, i = 1, 2, ..., d
λi ≥ 0, i = 1, 2, ..., p
hi(x) ≤ 0, i = 1, 2, ..., p
λi.hi(x) = 0, i = 1, 2, ..., p
Example: Inequality constraint optimization problem
Minimize f(x, y) = (x − 1)^2 + (y − 3)^2
subject to x + y ≤ 2 and y ≥ x.
The Lagrangian for this problem is
L = (x − 1)^2 + (y − 3)^2 + λ1(x + y − 2) + λ2(x − y).
Lagrange Multiplier Method
δL/δx = 2(x − 1) + λ1 + λ2 = 0
δL/δy = 2(y − 3) + λ1 − λ2 = 0
λ1(x + y − 2) = 0
λ2(x − y) = 0
λ1 ≥ 0, λ2 ≥ 0
(x + y) ≤ 2, y ≥ x
Lagrange Multiplier Method
Case 1: λ1 = 0, λ2 = 0
2(x − 1) = 0, 2(y − 3) = 0 ⇒ x = 1, y = 3
Since x + y = 4 violates x + y ≤ 2, this is not a feasible solution.
Case 2: λ1 = 0, λ2 ≠ 0
2(x − y) = 0, 2(x − 1) + λ2 = 0, 2(y − 3) − λ2 = 0
⇒ x = 2, y = 2 and λ2 = −2
Since λ2 = −2 violates λ2 ≥ 0, this is not a feasible solution.
Lagrange Multiplier Method
Case 3: λ1 ≠ 0, λ2 = 0
x + y = 2, 2(x − 1) + λ1 = 0, 2(y − 3) + λ1 = 0
⇒ x = 0, y = 2 and λ1 = 2
All the KKT conditions are satisfied, so this is the feasible solution: the minimum is at x = 0, y = 2 with f(x, y) = 2.
Case 4: λ1 ≠ 0, λ2 ≠ 0
x + y = 2 and x = y ⇒ x = 1, y = 1, giving λ1 = 2 and λ2 = −2.
Since λ2 = −2 violates λ2 ≥ 0, this is not a feasible solution.
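A numerical cross-check of the KKT analysis above, using SciPy's SLSQP on the same problem; it should return a point near x = 0, y = 2 with f = 2.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda p: (p[0] - 1) ** 2 + (p[1] - 3) ** 2
cons = [{'type': 'ineq', 'fun': lambda p: 2 - p[0] - p[1]},   # x + y <= 2
        {'type': 'ineq', 'fun': lambda p: p[1] - p[0]}]       # y >= x

res = minimize(f, x0=[0.0, 0.0], constraints=cons, method='SLSQP')
print(res.x, f(res.x))    # expected optimum near x = 0, y = 2, f = 2
```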
LMM to Solve Linear SVM
The Lagrangian for the optimization problem (16)-(17) is
L = ||W||^2 / 2 − Σ_{i=1}^{n} λi (yi(W.xi + b) − 1) (20)
where the parameters λi's are the Lagrangian multipliers, and W = [w1, w2, ..., wm] and b are the model parameters.
LMM to Solve Linear SVM
The KKT constraints are:
δL/δW = 0 ⇒ W = Σ_{i=1}^{n} λi.yi.xi
δL/δb = 0 ⇒ Σ_{i=1}^{n} λi.yi = 0
λi ≥ 0, i = 1, 2, ..., n
yi(W.xi + b) ≥ 1, i = 1, 2, ..., n
We first solve the above set of equations to find all the feasible solutions.
Note:
1. The Lagrangian multiplier λi must be zero unless the training instance xi satisfies the equation yi(W.xi + b) = 1. Thus, the training tuples with λi > 0 lie on the hyperplane margins and hence are support vectors.
2. The training instances that do not lie on the hyperplane margins have λi = 0.
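In practice, a library solver does this for us. The sketch below fits scikit-learn's SVC with a linear kernel on hypothetical data; support_vectors_ are the tuples with λi > 0, and dual_coef_ stores the products λi.yi. A large C is used to approximate the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable 2-D data.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)       # large C ~ hard margin

print("support vectors:\n", clf.support_vectors_)  # tuples with lambda_i > 0
print("lambda_i * y_i  :", clf.dual_coef_)          # signed multipliers
print("W =", clf.coef_, " b =", clf.intercept_)     # W = sum(lambda_i y_i x_i)
```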
Classifying a test sample using Linear SVM
Now, let us see how this MMH can be used to classify a test tuple, say X. This can be done as follows:
δ(X) = W.X + b = Σ_{i=1}^{n} λi.yi.(xi.X) + b (21)
Note that
W = Σ_{i=1}^{n} λi.yi.xi
Classifying a test sample using Linear SVM
1 Once the SVM is trained with training data, the complexity of the
classifier is characterized by the number of support vectors.
2 The dimensionality of the data is not an issue in SVM, unlike in other classifiers.
Illustration : Linear SVM
The training data and the Lagrangian multipliers λi obtained after learning are shown below:

A1    A2    y    λi
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0
Illustration : Linear SVM
For the given test data, δ(X) = W.X + b ≤ 0. This implies that the test data falls on or below the MMH, and the SVM classifies X as belonging to class label −.
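The illustration can be reproduced approximately with scikit-learn on the eight tuples of the table above (the exact λ values reported may differ slightly from those quoted, depending on the solver and its tolerance). The test tuple below is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# The eight training tuples (A1, A2, y) from the table above.
X = np.array([[0.38, 0.47], [0.49, 0.61], [0.92, 0.41], [0.74, 0.89],
              [0.18, 0.58], [0.41, 0.35], [0.93, 0.81], [0.21, 0.10]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)     # large C ~ hard margin
print(clf.support_)        # indices of tuples with lambda_i > 0 (expected: 0 and 1)
print(clf.dual_coef_)      # lambda_i * y_i of the support vectors
print(clf.coef_, clf.intercept_)

# Classify a hypothetical test tuple using delta(X) = W.X + b.
X_test = np.array([[0.5, 0.5]])
print(clf.decision_function(X_test), clf.predict(X_test))
```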
Classification of Multiple-class Data
Note that the linear SVM discussed so far can handle any n-dimensional data, n ≥ 2. Now, we discuss a more generalized linear SVM to classify n-dimensional data belonging to two or more classes.
If the classes are pairwise linearly separable, then we can extend the principle of linear SVM to each pair. There are two strategies:
1 One versus one (OVO) strategy
2 One versus all (OVA) strategy
Multi-Class Classification: OVO Strategy
In the OVO strategy, a binary SVM classifier is learned for each pair of classes. Thus, if there are n classes, there are nC2 pairs, and hence that many classifiers are possible (of course, some of which may be redundant).
Also, see Fig. 7 for 3 classes (namely +, − and ×). Here, Hxy denotes the MMH between class labels x and y.
Multi-Class Classification: OVO Strategy
With the OVO strategy, we test each of the classifiers in turn and obtain δij(X), the output of the MMH between the ith and jth classes for the test data X.
If there is a class i for which δij(X) gives the same sign for all j (j ≠ i), then unambiguously we can say that X is in class i.
Multi-Class Classification: OVA Strategy
The OVO strategy is not useful for data with a large number of classes, as the number of pairwise classifiers grows quadratically with the number of classes.
In the OVA approach, we choose any one class, say Ci, and consider all tuples of the other classes as belonging to a single class.
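Both strategies are available off the shelf. The sketch below wraps a linear SVM in scikit-learn's OneVsOneClassifier and OneVsRestClassifier on hypothetical 3-class data; with 3 classes, OVO builds 3C2 = 3 pairwise classifiers and OVA builds 3 classifiers, one per class.

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.datasets import make_blobs

# Hypothetical 3-class data.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # one SVM per class pair
ova = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # one SVM per class

print(len(ovo.estimators_), "OVO classifiers")   # 3 classes -> 3 pairs
print(len(ova.estimators_), "OVA classifiers")   # 3 classes -> 3 classifiers
print(ovo.predict(X[:5]), ova.predict(X[:5]))
```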
Multi-Class Classification: OVA Strategy
Note:
The linear SVM used to classify multi-class data fails if all the classes are not linearly separable.
Multi-Class Classification: OVA Strategy
Such tuples may be present due to noise or errors, or because the data are simply not linearly separable.
Non-Linear SVM
Linear SVM for Linearly Not Separable Data
Figure 10: Problem with linear SVM for linearly not separable data.
Linear SVM for Linearly Not Separable Data
Also, we may note that with X1 and X2, we could draw another hyperplane, namely H2, which could classify all the training data correctly.
Soft Margin SVM
Recall the linear SVM formulation:
minimize ||W||^2 / 2 (25)
subject to yi(W.xi + b) ≥ 1, i = 1, 2, ..., n
In soft margin SVM, we consider a similar optimization problem, except that the inequalities are relaxed so that it also handles the case of linearly non-separable data.
Soft Margin SVM
The relaxed constraints are:
W.xi + b ≥ 1 − ξi if yi = +1
W.xi + b ≤ −1 + ξi if yi = −1 (26)
where ξi ≥ 0, ∀i.
Thus, in soft margin SVM, we are to calculate W, b and the ξi's as a solution to learning the SVM.
Soft Margin SVM : Interpretation of ξ
Let us find an interpretation of ξ, the slack variable in soft margin
SVM. For this, consider the data distribution shown in Fig. 11.
Soft Margin SVM : Interpretation of ξ
It can be shown that the distance between the hyperplanes is d = ξ / ||W||.
In other words, ξ provides an estimate of the error of the decision boundary on the training example X.
Soft Margin SVM : Interpretation of ξ
This is explained in Fig. 12. Here, if we increase the margin further, then P and Q will be misclassified.
Thus, there is a trade-off between the width of the margin and the training error.
Figure 12: MMH with wide margin and large training error.
Soft Margin SVM : Interpretation of ξ
The objective function is therefore modified to
f(W) = ||W||^2 / 2 + c.Σ_{i=1}^{n} (ξi)^φ (27)
where c and φ are user-specified parameters representing the penalty of misclassifying the training data.
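The effect of the penalty can be seen by varying C in scikit-learn's SVC, which implements the soft margin objective with φ = 1 (a linear penalty on the ξi). The data below are hypothetical overlapping blobs; a small C tolerates training errors and yields a wider margin, while a large C does the opposite.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical overlapping 2-class data (not linearly separable).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):           # C plays the role of the penalty c above
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin = {2 / np.linalg.norm(w):.3f}, "
          f"training accuracy = {clf.score(X, y):.3f}")
```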
Solving for Soft Margin SVM
We can follow the Lagrange multiplier method to solve the
inequality constraint optimization problem, which can be reworked
as follows:
L = ||W||^2 / 2 + c.Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} λi (yi(W.xi + b) − 1 + ξi) − Σ_{i=1}^{n} µi.ξi (28)
Here, the λi's and µi's are Lagrange multipliers, subject to the constraints:
ξi ≥ 0, λi ≥ 0, µi ≥ 0
λi { yi(W.xi + b) − 1 + ξi } = 0
Solving for Soft Margin SVM
δL/δwj = 0 ⇒ wj = Σ_{i=1}^{n} λi.yi.xij, ∀j = 1, 2, ..., m
δL/δb = −Σ_{i=1}^{n} λi.yi = 0, i.e.,
Σ_{i=1}^{n} λi.yi = 0
Solving for Soft Margin SVM
δL/δξi = c − λi − µi = 0, so that
λi + µi = c and µi.ξi = 0, ∀i = 1, 2, ..., n (29)
Note:
1. λi ≠ 0 for support vectors or for instances having ξi > 0, and
2. µi = 0 for those training data which are misclassified, that is, ξi > 0.
Non-Linear SVM
Non-Linear SVM
A hyperplane is expressed as
linear : w1 x1 + w2 x2 + w3 x3 + c = 0 (30)
The task of learning such a non-linear decision boundary is indeed neither hard nor complex, and fortunately it can be accomplished by extending the formulation of the linear SVM we have already learned.
Non-Linear SVM
For example, the three input attributes x1, x2, x3 can be mapped into a 6-D feature space as:
z1 = φ1(x) = x1
z2 = φ2(x) = x2
z3 = φ3(x) = x3
z4 = φ4(x) = x1^2
z5 = φ5(x) = x1.x2
z6 = φ6(x)
Concept of Non-Linear Mapping
The transformed decision boundary in the 6-D space is then linear:
Z : w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + c = 0
Thus, if the Z space has input data for its attributes x1, x2, x3 (and hence the Z values), then we can classify them using a linear decision boundary.
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
The figure below shows an example of a 2-D data set consisting of class label +1 (shown as +) and class label −1 (shown as −).
Concept of Non-Linear Mapping
X(x1, x2) = +1 if √((x1 − 0.5)^2 + (x2 − 0.5)^2) > 0.2
X(x1, x2) = −1 otherwise (32)
Concept of Non-Linear Mapping
The decision boundary is therefore
√((x1 − 0.5)^2 + (x2 − 0.5)^2) = 0.2
or x1^2 − x1 + x2^2 − x2 = −0.46
A non-linear transformation in 2-D space is proposed as follows:
Z(z1, z2) : φ1(x) = x1^2 − x1, φ2(x) = x2^2 − x2 (33)
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
The Z space, when plotted, takes the view shown in Fig. 15, where the data are separable with a linear boundary, namely
Z : z1 + z2 = −0.46
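A sketch of this example: generate points in the unit square, label them with Eqn. (32), apply the transformation of Eqn. (33), and fit a linear SVM in the Z space. The recovered boundary should lie close to z1 + z2 = −0.46.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))                  # points in the unit square
r = np.hypot(X[:, 0] - 0.5, X[:, 1] - 0.5)
y = np.where(r > 0.2, 1, -1)                              # labelling rule of Eq. (32)

# Non-linear transformation of Eq. (33): z1 = x1^2 - x1, z2 = x2^2 - x2.
Z = np.column_stack([X[:, 0] ** 2 - X[:, 0], X[:, 1] ** 2 - X[:, 1]])

clf = SVC(kernel='linear', C=1000.0).fit(Z, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("training accuracy:", clf.score(Z, y))              # separable in Z => close to 1.0
print("boundary: %.2f*z1 + %.2f*z2 + %.2f = 0" % (w[0], w[1], b))
print("z1 + z2 =", np.round(-b / w.mean(), 2))            # roughly -0.46
```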
Non-Linear to Linear Transformation: Issues
...
3 Dimensionality problem: It may suffer from the curse of dimensionality often associated with high-dimensional data.
Recall the linear SVM learning problem:
Minimize ||W||^2 / 2 subject to yi(W.xi + b) ≥ 1, i = 1, 2, ..., n
Its Lagrangian (the primal form) is
Lp = ||W||^2 / 2 − Σ_{i=1}^{n} λi (yi(W.xi + b) − 1) (35)
where the λi's are called Lagrange multipliers. Setting the derivatives to zero gives
δLp/δW = 0 ⇒ W = Σ_{i=1}^{n} λi.yi.xi (36)
δLp/δb = 0 ⇒ Σ_{i=1}^{n} λi.yi = 0 (37)
Substituting (36) and (37) into Lp yields the dual form:
LD = Σ_{i=1}^{n} λi + (1/2) Σ_{i,j} λi.λj.yi.yj.(xi.xj) − Σ_{i=1}^{n} λi.yi (Σ_{j=1}^{n} λj.yj.xj).xi
   = Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.(xi.xj)
There are key differences between the primal (Lp) and dual (LD) forms of the Lagrangian optimization problem, as follows:
...
4. The SVM classifier with the primal form is δP(x) = W.x + b, whereas with the dual form it is δD(x) = Σ_{i=1}^{m} λi.yi.(xi.x) + b, where xi is the ith support vector and there are m support vectors. Thus, both Lp and LD are equivalent.
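The dual form LD can also be handed to a general-purpose solver. A minimal sketch on a small hypothetical data set: maximize LD subject to λi ≥ 0 and Σ λi.yi = 0, then recover W from Eqn. (36) and b from any support vector.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable toy data, labels in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

H = (y[:, None] * X) @ (y[:, None] * X).T        # H_ij = y_i y_j (x_i . x_j)

def neg_dual(lam):                               # minimize -L_D
    return -(lam.sum() - 0.5 * lam @ H @ lam)

res = minimize(neg_dual, x0=np.zeros(n),
               bounds=[(0, None)] * n,                                    # lambda_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda lam: lam @ y}],  # sum lambda_i y_i = 0
               method='SLSQP')
lam = res.x
W = ((lam * y)[:, None] * X).sum(axis=0)         # W = sum lambda_i y_i x_i  (Eq. 36)
sv = np.argmax(lam)                              # index of a support vector (lambda_i > 0)
b = y[sv] - W @ X[sv]
print("lambda =", lam.round(3), "\nW =", W, " b =", b)
```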
We have already covered the idea that training data which are not linearly separable can be transformed into a higher-dimensional feature space, such that in the higher-dimensional transformed space a hyperplane can be found to separate the transformed data, and hence the original data (see also Fig. 16).
Clearly, the data on the left in the figure are not linearly separable.
R^2 ⇒ X(x1, x2)
R^3 ⇒ Z(z1, z2, z3)
φ(X) ⇒ Z
z1 = x1^2, z2 = √2.x1.x2, z3 = x2^2
The hyperplane in R^2 is of the form
w1.x1^2 + w2.√2.x1.x2 + w3.x2^2 = 0
which is the equation of an ellipse in 2-D. In R^3 it becomes
w1z1 + w2z2 + w3z3 = 0
This is clearly a linear form in 3-D space. In other words, W.x + b = 0 in R^2 has a mapped equivalent W′.z + b′ = 0 in R^3.
This means that data which are not linearly separable in 2-D are separable in 3-D; that is, non-linear data can be classified by a linear SVM classifier.
Classifier:
In the original attribute space: δ(x) = Σ_{i=1}^{n} λi.yi.(xi.x) + b
In the transformed space: δ(z) = Σ_{i=1}^{n} λi.yi.φ(xi).φ(x) + b
Learning:
In the original attribute space: maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.(xi.xj)
In the transformed space: maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.yi.yj.φ(xi).φ(xj)
Subject to: λi ≥ 0, Σ_i λi.yi = 0
Kernel Trick
Now, the question here is how to choose φ, the mapping function X ⇒ Z, so that the linear SVM can be applied directly.
For the quadratic mapping φ(x) = (x1^2, √2.x1.x2, x2^2) above, note that for any two tuples Xi and Xj:
φ(Xi).φ(Xj) = xi1^2.xj1^2 + 2.xi1.xi2.xj1.xj2 + xi2^2.xj2^2
            = (xi1.xj1 + xi2.xj2)^2
            = (Xi.Xj)^2 (41)
Here, K(Xi, Xj) = (Xi.Xj)^2 denotes a function more popularly called a kernel function.
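A quick numerical check of Eqn. (41): computing the kernel on the original 2-D tuples gives exactly the dot product of their images under the quadratic mapping φ. The two test vectors below are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic mapping (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(xi, xj):
    """Degree-2 polynomial kernel K(Xi, Xj) = (Xi . Xj)^2 of Eq. (41)."""
    return np.dot(xi, xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(xi), phi(xj)))   # dot product in the transformed space
print(K(xi, xj))                  # same value, computed without the mapping
```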
Kernel Trick : Significance
This kernel function K(Xi, Xj) physically implies the similarity in the transformed space (i.e., a non-linear similarity measure), computed using the original attributes Xi, Xj. In other words, K, the similarity function, computes the similarity of the data whether they lie in the original attribute space or in the transformed attribute space.
Implicit transformation
The first and foremost significance is that we do not require any φ transformation of the original input data at all. This is evident from the following rewriting of our SVM classification problem:
Classifier : δ(X) = Σ_{i=1}^{n} λi.Yi.K(Xi, X) + b (42)
Learning : maximize Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λi.λj.Yi.Yj.K(Xi, Xj) (43)
Subject to λi ≥ 0 and Σ_{i=1}^{n} λi.Yi = 0 (44)
Kernel Trick : Significance
Computational efficiency:
Another important significance is easy and efficient computability. Transforming every tuple with φ and then taking dot products would be costly; using the kernel trick, on the other hand, we can do it once and with fewer dot products.
All the required dot products can be arranged in the Gram matrix:
K = [ X1^T.X1   X1^T.X2   ...   X1^T.Xn
      X2^T.X1   X2^T.X2   ...   X2^T.Xn
      ...
      Xn^T.X1   Xn^T.X2   ...   Xn^T.Xn ]  (n × n) (46)
Note that K contains all dot products among all training data, and since Xi^T.Xj = Xj^T.Xi, we in fact need to compute only half of the matrix.
More elegantly, all the dot products amount to a mere matrix operation, and that too only one operation.
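Computing the Gram matrix of Eqn. (46) is indeed a single matrix operation, as the following sketch with hypothetical data shows.

```python
import numpy as np

# Hypothetical training matrix: n = 4 tuples, m = 3 attributes.
X = np.array([[1.0, 2.0, 0.0],
              [0.5, 1.0, 1.0],
              [2.0, 0.0, 1.5],
              [1.0, 1.0, 1.0]])

K = X @ X.T                       # Gram matrix of Eq. (46): K_ij = X_i^T X_j
print(K)
print(np.allclose(K, K.T))        # symmetric, so only half really needs computing
```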
Definition 10.1:
A kernel function K(Xi, Xj) is a real-valued function for which there is another function φ : X → Z such that K(Xi, Xj) = φ(Xi).φ(Xj). Symbolically, we write
φ : R^m → R^n : K(Xi, Xj) = φ(Xi).φ(Xj), where n > m.
Table 2: Popular kernel functions
Polynomial kernel of degree p : K(X, Y) = (X^T.Y + 1)^p, p > 0. It produces large dot products; the power p is specified a priori by the user.
Gaussian (RBF) kernel : K(X, Y) = e^(−||X − Y||^2 / 2σ^2). It is a non-linear kernel called the Gaussian radial basis function kernel.
Sigmoid kernel : K(X, Y) = tanh(β0.X^T.Y + β1). It is followed when statistical test data is known.
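Straightforward NumPy implementations of the three kernels in Table 2 (the parameter values below are arbitrary placeholders):

```python
import numpy as np

def polynomial_kernel(X, Y, p=2):
    """K(X, Y) = (X^T Y + 1)^p."""
    return (np.dot(X, Y) + 1.0) ** p

def rbf_kernel(X, Y, sigma=1.0):
    """K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((X - Y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(X, Y, beta0=0.5, beta1=-1.0):
    """K(X, Y) = tanh(beta0 X^T Y + beta1)."""
    return np.tanh(beta0 * np.dot(X, Y) + beta1)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```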
If K1 and K2 are kernel functions, then the following are also kernels:
1 K1 + c
2 a.K1
3 a.K1 + b.K2
4 K1.K2
Here, a, b and c ∈ R+.
The kernels which satisfy Mercer's theorem are called Mercer kernels.
Additional properties of Mercer kernels are:
Symmetric: K(X, Y) = K(Y, X)
It can be proved that all the kernels listed in Table 2 satisfy (i) the kernel properties, (ii) Mercer's theorem, and hence (iii) are Mercer kernels.
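These closure properties can be sanity-checked numerically: for each combined kernel, the Gram matrix on a random sample should remain symmetric and positive semi-definite (the Mercer condition). A small sketch with hypothetical data and arbitrary constants a, b, c:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                       # hypothetical tuples

K1 = X @ X.T                                      # linear kernel Gram matrix
K2 = (X @ X.T + 1.0) ** 2                         # polynomial kernel Gram matrix
a, b, c = 2.0, 3.0, 1.0
combos = {"K1 + c": K1 + c, "a*K1": a * K1,
          "a*K1 + b*K2": a * K1 + b * K2, "K1*K2": K1 * K2}

for name, K in combos.items():
    sym = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -1e-9)  # Mercer: Gram matrix is PSD
    print(f"{name:12s} symmetric={sym} positive-semidefinite={psd}")
```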