Kernel Method

Overview of Kernel
Methods (Part 2)
Steve Vincent
November 17, 2004
Overview
Kernel methods offer a modular framework.

In a first step, a dataset is processed into a kernel matrix.
Data can be of various types, and also heterogeneous
types.
In a second step, a variety of kernel algorithms can be
used to analyze the data, using only the information
contained in the kernel matrix
What will be covered

today
PCA vs. Kernel PCA
Algorithm
Example
Comparison
Text Related Kernel Methods
Bag of Words
Semantic Kernels
String Kernels
Tree Kernels
PCA algorithm
1.
2.
3.
4.
Subtract the mean from all the data points

Compute the covariance matrix S= c x x T
n 1 n n
Diagonalize S to get its eigenvalues and
eigenvectors
Retain c eigenvectors corresponding to the c
N
largest eigenvalues such that c
j / j 1 j
j
1
equals the desired variance to be captured
Project the data points on the eigenvectors
5.
1
a n E xk
N
x
i 1
Kernel PCA algorithm (1)

1.
2.
3.
4.
5.
Given N data points in d dimensions let X={x1|

x2|..|xN} where each column represents one
data point
Subtract the mean from all the data points
Choose an appropriate kernel k
Form the NxN Gram matrix K|ij=[k(xi,xj)}
Form the modified Gram
matrix
T
1NxN
1NxN
~
K I
K I
N
N
where 1NxN is an NxN matrix with all entries

equal to 1
5
Kernel PCA algorithm (2)

6.
7.
8.
Diagonalize K to get its eigenvalues n and its

eigenvectors an
Normalize an
a n / n
Retain c eigenvectorsC corresponding

to c
N
largest eigenvalues such
j / that
j
j 1
9.
j 1
equals desired variance to be captured

Project the data points on the eigenvectors
k ( x1 , x)
1NxN
1NxN
T
ya I
... K

N
N
k ( x N , x)
Data Mining Problem
Data Source
Computational Intelligence and Learning Cluster

Challenge 2000,
http://www.wi.leidenuniv.nl/~putten/library/cc2000
/index.html
Supplied by Sentient Machine Research,
http://www.smr.nl
Problem Definition:
Given data which incorporates both socio-economic and

various insurance policy ownership attributes, can we derive
models which help in determining factors or attributes which
may influence or signify individuals who purchase a caravan
insurance policy.
Data Selection
5,822 records for training

4,000 records for evaluation
86 attributes
Attributes 1 through 43:
Attributes 44 through 85:
Socio-demographic data derived from zip code

areas
Product ownership for customers
Attribute 86
Purchased caravan insurance
Data Transformation and

Reduction
Principal Components Analysis (PCA)
[From MatLab]
#
Attributes
%
Variance
#
Attributes
%
Variance
25
73.29
55
98.20
30
80.66
60
98.86
35
86.53
65
99.35
40
91.25
70
99.65
45
94.94
75
99.83
50
97.26
80
99.96
Relative Performance
PCA run time: 6.138

Kernel PCA run time:
Used Radial Basis Function Kernel

u v
K (u , v) exp
5.668
Matlab Code for PCA and Kernel PCA

algorithm can be supplied if needed
10
Modeling, Test and

Evaluation
Manually
Reduced Dataset:
* Legend: a no, b yes
Nave Bayes Overall 82.79% Correctly Classified
a
b
3155
544
a 14.71% False Positive
132
98
b 42.61% Correctly Classified
PCA Reduced Dataset:
a
b
3507
255
207
31
Kernel PCA Reduced Dataset:
a
b
3238 541
175
74
11
Overall Results
KPCA and PCA had similar time

performance
KPCA is much gives results closer
to manually reduced dataset
Future Work:
Examine other Kernels

Vary the parameters for the Kernels
Use other Data Mining Algorithms
12
Bag of words kernels (1)
Document seen as a vector d, indexed by all

the elements of a (controlled) dictionary. The
entry is equal to the number of occurrences.
A training corpus is therefore represented by
a Term-Document matrix,
noted D=[d1 d2 dm-1 dm]
From this basic representation, we will apply
a sequence of successive embeddings,
resulting in a global (valid) kernel with all
desired properties
13
BOW kernels (2)
Properties:
All order information is lost (syntactical relationships,

local context, )
Feature space has dimension N (size of the dictionary)
Similarity is basically defined by:

k(d1,d2)=d1d2= d1t.d2
or, normalized (cosine similarity):
k (d1 , d 2 )
k (d1 , d 2 )
k (d1 , d1 ).k (d 2 , d 2 )
Efficiency provided by sparsity (and sparse dotproduct algorithm): O(|d1|+|d2|)

14
Latent concept Kernels
Basic idea :
terms
terms
terms
terms
terms
documents
Size d
K(d1,d2)=?
Size t
2
Size k <<t
Concepts space
15
Semantic Kernels (1)
k(d1,d2)=(d1)SS(d2)
where S is the semantic matrix

S can be defined as S=RP where
R is a diagonal matrix giving the term weightings or

relevances
P is the proximity matrix defining the semantic spreading
between the different terms of the corpus
The measure for the inverse document frequency

for a term t is given by:
l=# of documents
w(t ) ln
df (t )
df(t)=# of
documents
containing term t
The matrix R is diagonal with entries:

Rtt=w(t)
16
The associated kernel is:
~
k (d1 , d 2 ) (d1 ) RR ' (d 2 )'
For the proximity matrix (P) the

associated kernel is:
~
k (d1 , d 2 ) (d1 ) PP' (d 2 )' (d1 ) i Qij (d 2 ) j
i, j
Where Qij encodes the amount of

semantic relation between terms i and j.
17
Most natural method of

incorporating semantic information
is be inferring the relatedness of
terms from an external source of
domain knowledge
Example: WordNet Ontology
Semantic Distance
Path length of hierarchical tree

Information content
18
Latent Semantic Kernels

(LSK)/ Latent Semantic
Singular Value Decomposition (SVD):
Indexing (LSI)
D
'
V
'
where is a diagonal matrix of the same
dimensions as D, and U and V are unitary matrices

whose columns are the eigenvectors of DD and DD
respectively
LSI projects the documents into the space spanned
by the first k columns of U, suing the new kdimensional vectors for subsequent processing
d (d )U k
where Uk is the matrix containing the first k columns

of U
19
Latent Semantic Kernels

(LSK)/ Latent Semantic
New
kernel becomes
that of Kernel PCA
Indexing
(LSI)
~
k (d1 , d 2 ) (d1 )U kU k (d 2 )
LSK is implemented by projecting onto the

features:
( d )U k
1 / 2
i
(v )
i
k ( d j , d )
j 1
where k is the base kernel,
and i vi are i 1
eigenvalue, eigenvector pairs of the kernel matrix
Can represent the LSKs with the proximity matrix
P U kU k
20
String and Sequence
An alphabet is a finite set of ||

symbols.
A string s=s1s|s| is any finite sequence of

symbols from including the empty
sequence.
We denote n the set of all finite strings of
length n
String matching: implies contiguity

Sequence matching : only implies order
21
p-spectrum kernel (1)
Features of s = p-spectrum of s =
histogram of all (contiguous)
substrings of length p
Feature space indexed by all
elements of p
u(s)=number of occurrences of u in s
The associated kernel is defined as
k p ( s, t )
up
p
u
( s ) (t )
p
22
p-spectrum kernel
example
Example: 3-spectrum kernel
s=statistics and t=computation

The two strings contain the following
substrings of length 3:
sta,tat, ati, tis, ist, sti, tic,

ics
com, omp, mpu, put, uta, tat,
ati, tio, ion
Common substrings of tat and ati,

so the inner product k(s,t)=2
23
p-spectrum Kernels
Recursion
k-suffix kernel is defined by
k ks ( s, t )
p-spectrum kernel can be evaluated using the equation:
k p ( s, t )
1 if s s1u , t t1u , for u k

0
otherwise
| s| p 1 |t | p 1
s
k
p (s(s(i : i p), t ( j : j p)
i 1
j 1
in O(p |s| |t|) operations
The evaluation of one row of the table for the p-suffix kernel
corresponds to performing a search in the string t for the psuffix of a prefix in s.
24
All-subsequences kernels
Feature mapping defined by all

contiguous or non-contiguous
subsequences of a string
Feature space indexed by all elements of
*={}U U 2U 3U
u(s)=number of occurrences of u as a
(non-contiguous) subsequence of s
Explicit computation rapidly infeasible
(exponential in |s| even with sparse rep.)
25
Recursive implementation
Consider the addition of one extra symbol a to

s: common subsequences of (sa,t) are either in
s or must end with symbol a (in both sa and t).
Mathematically,
k (s, ) 1
k (sa, t ) k (s, t )
k (s, t (1 : j 1))
j : t j a
This gives a complexity of O(|s||t|2)

26
Fixed-length subsequence
kernels
Feature space indexed by all elements of p

u(s)=number of occurrences of the p-gram
u as a (non-contiguous) subsequence of s
Recursive implementation (will create a series of p
tables)
k0 (s, ) 1
k p (s, ) 0
for p 0
k p (sa, t ) k p (s, t )
j : t j a
p 1
(s, t (1 : j 1))
Complexity: O(p|s||t|) , but we have the k-length subseq.

kernels (k<=p) for free easy to compute k(s,t)=alkl(s,t)
27
Gap-weighted subsequence
kernels (1)
Feature space indexed by all elements of p

u(s)=sum of weights of occurrences of the pgram u as a (non-contiguous) subsequence of s,
the weight being length penalizing: length(u)) [NB:
length includes both matching symbols and
gaps]
Example (1)
The string gon occurs as a subsequence of the

strings gone, going and galleon, but we
consider the first occurrence as more important since
it is contiguous, while the final occurrence is the
weakest of all three
28
kernels (2)
Example(2)
D1 : ATCGTAGACTGTC
D2 : GACTATGC
(D1)CAT = 28+and(D2)CAT = 4
k(D1,D2)CAT=212+214
Naturally built as a dot product valid kernel

For alphabet of size 80, there are 512,000
trigrams
For alphabet of size 26, there are 12 x 106 5grams
29
kernels (3)
Hard to perform explicit expansion

and dot-product
Efficient recursive formulation
(dynamic programming type), whose
complexity is O(k |D1| |D2|)
30
Word Sequence Kernels (1)
Here words are considered as symbols
Meaningful symbols more relevant matching

Linguistic preprocessing can be applied to improve performance
Shorter sequence sizes improved computation time
But increased sparsity (documents are more : orthogonal)
Motivation : the noisy stemming hypothesis (important Ngrams approximate stems), confirmed experimentally in a
categorization task
31
Word Sequence Kernels (2)
Link between Word Sequence Kernels and other

methods:
For k=1, WSK is equivalent to basic Bag Of Words

approach
For =1, close relation to polynomial kernel of degree k,
WSK takes order into account
Extension of WSK:
Symbol dependant decay factors (way to introduce IDF

concept, dependence on the POS, stop words)
Different decay factors for gaps and matches (e.g. noun<adj
when gap; noun>adj when match)
Soft matching of symbols (e.g. based on thesaurus, or on
dictionary if we want cross-lingual kernels)
32
Tree Kernels
Application: categorization [one doc=one tree], parsing

(disambiguation) [one doc = multiple trees]
Tree kernels constitute a particular case of more general
kernels defined on discrete structure (convolution kernels).
Intuitively, the philosophy is
to split the structured objects in parts,

to define a kernel on the atoms and a way to recursively
combine kernel over parts to get the kernel over the whole.
Feature space definition: one feature for each possible proper

subtree in the training data; feature value = number of
occurrences
A subtree is defined as any part of the tree which includes
more than one node, with the restriction there is no partial
rule production allowed.
33
Trees in Text : example

S
Example :
NP
S
VP
John
VP
Mary
VP
NP
V
VP
N
N
loves
loves
A Parse Tree
Mary
a few among
the many subtrees
of this tree!
loves
Mary
VP
V
N
34
Tree Kernels : algorithm
Kernel = dot product in this high dimensional feature

space
Once again, there is an efficient recursive algorithm (in
polynomial time, not exponential!)
Basically, it compares the production of all possible
pairs of nodes (n1,n2) (n1T1, n2 T2); if the production
is the same, the number of common subtrees routed at
both n1 and n2 is computed recursively, considering the
number of common subtrees routed at the common
children
Formally, let kco-rooted(n1,n2)=number of common
k (T1 , T2 ) at both n1
kcoand
(n1 , n2 )
rootedn2
subtrees rooted
n1T1 n2 T2
35
All sub-tree kernel
Kco-rooted(n1,n2)=0 if n1 or n2 is a leaf
Kco-rooted(n1,n2)=0 if n1 and n2 have different

production or, if labeled, different label
Else Kco-rooted(n1,n2)= (1 kco rooted (ch(n1 , i ),ch(n2 , i )))
children i
Production is left intentionally ambiguous to

both include unlabelled tree and labeled tree
Complexity s O(|T1|.|T2|)
36
References
J. Shawe-Tayor and N. Cristianini, Kernel

Methods for Pattern Analysis, 2004 (Chapter
10 and 11)
J. Tian, PCA/Kernel PCA for Image
Denoising, September 16, 2004
T. Gartner, A Survey of Kernels for Structured
Data, ACM SIGKDD Explorations Newsletter,
July 2003
N. Cristianini, Latent Semantic Kernels,
Proceedings of ICML-01, 18th International
Conference on Machine Learning, 2001
37

Kernel Method

Uploaded by

Copyright:

Available Formats

Kernel Method

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kernel Method

Uploaded by

Copyright:

Available Formats

Overview of Kernel

Kernel methods offer a modular framework.

What will be covered

PCA vs. Kernel PCA

Text Related Kernel Methods

Subtract the mean from all the data points

Kernel PCA algorithm (1)

Given N data points in d dimensions let X={x1|

where 1NxN is an NxN matrix with all entries

Kernel PCA algorithm (2)

Diagonalize K to get its eigenvalues n and its

Retain c eigenvectorsC corresponding

equals desired variance to be captured

Data Mining Problem

Computational Intelligence and Learning Cluster

Given data which incorporates both socio-economic and

5,822 records for training

Attributes 1 through 43:

Attributes 44 through 85:

Socio-demographic data derived from zip code

Purchased caravan insurance

Data Transformation and

Principal Components Analysis (PCA)

PCA run time: 6.138

Used Radial Basis Function Kernel

Matlab Code for PCA and Kernel PCA

Modeling, Test and

KPCA and PCA had similar time

Examine other Kernels

Bag of words kernels (1)

Document seen as a vector d, indexed by all

BOW kernels (2)

All order information is lost (syntactical relationships,

Similarity is basically defined by:

Efficiency provided by sparsity (and sparse dotproduct algorithm): O(|d1|+|d2|)

Latent concept Kernels

Semantic Kernels (1)

where S is the semantic matrix

R is a diagonal matrix giving the term weightings or

The measure for the inverse document frequency

The matrix R is diagonal with entries:

Semantic Kernels (2)

The associated kernel is:

For the proximity matrix (P) the

Where Qij encodes the amount of

Semantic Kernels (3)

Most natural method of

Example: WordNet Ontology

Path length of hierarchical tree

Latent Semantic Kernels

dimensions as D, and U and V are unitary matrices

where Uk is the matrix containing the first k columns

Latent Semantic Kernels

LSK is implemented by projecting onto the

String and Sequence

An alphabet is a finite set of ||

A string s=s1s|s| is any finite sequence of

String matching: implies contiguity