Kernel Method


Overview of Kernel Methods (Part 2)
Steve Vincent
November 17, 2004

Overview

Kernel methods offer a modular framework.

In a first step, a dataset is processed into a kernel matrix. Data can be of various types, and also of heterogeneous types.
In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.

What will be covered today

PCA vs. Kernel PCA
  Algorithm
  Example
  Comparison

Text-related Kernel Methods
  Bag of Words
  Semantic Kernels
  String Kernels
  Tree Kernels

PCA algorithm

1. Subtract the mean from all the data points.
2. Compute the covariance matrix $S = \frac{1}{N}\sum_{n} x_n x_n^T$.
3. Diagonalize S to get its eigenvalues $\lambda_j$ and eigenvectors $e_j$.
4. Retain the c eigenvectors corresponding to the c largest eigenvalues, such that
   $\sum_{j=1}^{c} \lambda_j \,/\, \sum_{j=1}^{N} \lambda_j$
   equals the desired variance to be captured.
5. Project the data points on the eigenvectors:
   $a_n = E^T \bigl(x_n - \tfrac{1}{N}\sum_{i=1}^{N} x_i\bigr)$, where $E = [e_1 | \ldots | e_c]$.
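The steps above map directly onto a few lines of NumPy. The sketch below is only illustrative (the function and variable names are mine, not from the slides); it follows the slide's convention of one data point per column.

```python
import numpy as np

def pca(X, target_variance=0.95):
    """X: d x N matrix with one data point per column (slide convention)."""
    N = X.shape[1]
    # 1. Subtract the mean from all data points
    Xc = X - X.mean(axis=1, keepdims=True)
    # 2. Covariance matrix S = (1/N) sum_n x_n x_n^T
    S = Xc @ Xc.T / N
    # 3. Diagonalize S (symmetric), sort eigenvalues in decreasing order
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Keep the c leading eigenvectors reaching the desired captured variance
    captured = np.cumsum(eigvals) / eigvals.sum()
    c = int(np.searchsorted(captured, target_variance)) + 1
    E = eigvecs[:, :c]
    # 5. Project the (centered) data points on the eigenvectors
    return E.T @ Xc, E, eigvals[:c]
```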

Kernel PCA algorithm (1)

1. Given N data points in d dimensions, let $X = \{x_1 | x_2 | \ldots | x_N\}$, where each column represents one data point.
2. Subtract the mean from all the data points.
3. Choose an appropriate kernel k.
4. Form the N x N Gram matrix $K_{ij} = k(x_i, x_j)$.
5. Form the modified Gram matrix
   $\tilde{K} = \bigl(I - \tfrac{1}{N}\mathbf{1}_{N \times N}\bigr)\, K \,\bigl(I - \tfrac{1}{N}\mathbf{1}_{N \times N}\bigr)$
   where $\mathbf{1}_{N \times N}$ is an N x N matrix with all entries equal to 1.

Kernel PCA algorithm (2)

6. Diagonalize $\tilde{K}$ to get its eigenvalues $\lambda_n$ and its eigenvectors $a_n$.
7. Normalize the eigenvectors: $a_n \leftarrow a_n / \sqrt{\lambda_n}$.
8. Retain the c eigenvectors corresponding to the c largest eigenvalues, such that
   $\sum_{j=1}^{c} \lambda_j \,/\, \sum_{j=1}^{N} \lambda_j$
   equals the desired variance to be captured.
9. Project a data point x on the eigenvectors:
   $y = a^T \bigl(I - \tfrac{1}{N}\mathbf{1}_{N \times N}\bigr) \bigl[k(x_1, x), \ldots, k(x_N, x)\bigr]^T$
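A minimal NumPy sketch of steps 4 to 9, assuming a generic kernel function; the names are hypothetical, and eigenvalues are clipped to avoid division by zero on near-null components.

```python
import numpy as np

def kernel_pca(X, kernel, target_variance=0.95):
    """X: d x N data matrix (one point per column); kernel: a function k(x, y)."""
    N = X.shape[1]
    # 4. Gram matrix K_ij = k(x_i, x_j)
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(N)] for i in range(N)])
    # 5. Modified (centered) Gram matrix K~ = (I - 1/N) K (I - 1/N)
    J = np.eye(N) - np.ones((N, N)) / N
    K_tilde = J @ K @ J
    # 6. Diagonalize K~, sort eigenvalues in decreasing order
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = np.maximum(eigvals[order], 1e-12), eigvecs[:, order]
    # 8. Keep the c leading components reaching the desired captured variance
    captured = np.cumsum(eigvals) / eigvals.sum()
    c = int(np.searchsorted(captured, target_variance)) + 1
    # 7. Normalize the eigenvectors: a_n <- a_n / sqrt(lambda_n)
    A = eigvecs[:, :c] / np.sqrt(eigvals[:c])
    # 9. Kernel principal components of the training points themselves
    return K_tilde @ A, A

# Radial Basis Function kernel, as used in the experiment later in the slides
rbf = lambda u, v, sigma=1.0: np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))
```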

Data Mining Problem

Data Source:
  Computational Intelligence and Learning Cluster Challenge 2000,
  http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html
  Supplied by Sentient Machine Research, http://www.smr.nl

Problem Definition:
  Given data which incorporates both socio-economic attributes and various insurance
  policy ownership attributes, can we derive models which help in determining the
  factors or attributes that may influence or signify individuals who purchase a
  caravan insurance policy?

Data Selection

5,822 records for training
4,000 records for evaluation
86 attributes

Attributes 1 through 43: socio-demographic data derived from zip code areas
Attributes 44 through 85: product ownership for customers
Attribute 86: purchased caravan insurance

Data Transformation and Reduction

Principal Components Analysis (PCA) [from MATLAB]

# Attributes    % Variance
     25            73.29
     30            80.66
     35            86.53
     40            91.25
     45            94.94
     50            97.26
     55            98.20
     60            98.86
     65            99.35
     70            99.65
     75            99.83
     80            99.96

Relative Performance

PCA run time: 6.138
Kernel PCA run time: 5.668

Used the Radial Basis Function kernel
  $K(u, v) = \exp\bigl(-\|u - v\|^2 / (2\sigma^2)\bigr)$

MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed.

Modeling, Test and Evaluation

Legend: a = no, b = yes

Manually Reduced Dataset (Naïve Bayes, 82.79% correctly classified overall):
           a      b
   a    3155    544    a: 14.71% false positive
   b     132     98    b: 42.61% correctly classified

PCA Reduced Dataset (Naïve Bayes, 88.45% correctly classified overall):
           a      b
   a    3507    255    a: 6.77% false positive
   b     207     31    b: 13.03% correctly classified

Kernel PCA Reduced Dataset (Naïve Bayes, 82.22% correctly classified overall):
           a      b
   a    3238    541    a: 14.3% false positive
   b     175     74    b: 29.7% correctly classified

Overall Results

KPCA and PCA had similar time performance.
KPCA gives results much closer to those of the manually reduced dataset.

Future Work:
  Examine other kernels
  Vary the parameters for the kernels
  Use other data mining algorithms

Bag of words kernels (1)

A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is equal to the number of occurrences of the corresponding term.
A training corpus is therefore represented by a term-document matrix, noted $D = [d_1\; d_2\; \ldots\; d_{m-1}\; d_m]$.
From this basic representation, we will apply a sequence of successive embeddings, resulting in a global (valid) kernel with all desired properties.

BOW kernels (2)

Properties:
  All order information is lost (syntactical relationships, local context, ...)
  The feature space has dimension N (the size of the dictionary)

Similarity is basically defined by
  $k(d_1, d_2) = \langle d_1, d_2 \rangle = d_1^T d_2$
or, normalized (cosine similarity),
  $\hat{k}(d_1, d_2) = \frac{k(d_1, d_2)}{\sqrt{k(d_1, d_1)\, k(d_2, d_2)}}$

Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d_1| + |d_2|)
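A small sketch of the bag-of-words kernel with sparse count vectors and cosine normalization; the naive whitespace tokenizer and the toy documents are placeholders, not from the slides.

```python
from collections import Counter
from math import sqrt

def bow(document):
    """Sparse bag-of-words vector: term -> number of occurrences."""
    return Counter(document.lower().split())

def k_bow(d1, d2):
    """Sparse dot product: only terms present in both documents contribute."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(count * d2[term] for term, count in d1.items())

def k_cosine(d1, d2):
    """Cosine-normalized bag-of-words kernel."""
    return k_bow(d1, d2) / sqrt(k_bow(d1, d1) * k_bow(d2, d2))

d1, d2 = bow("the cat sat on the mat"), bow("the cat ate the fish")
print(k_bow(d1, d2), round(k_cosine(d1, d2), 3))
```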

Latent concept Kernels

Basic idea: instead of comparing two documents directly in the term space (size t), both terms and documents are mapped into a common space of latent concepts of size k << t, and K(d1, d2) is evaluated in that concept space.

[Diagram: term-document matrix (d documents, t terms) factored through a concept space of size k << t]

Semantic Kernels (1)

$k(d_1, d_2) = \phi(d_1)\, S S^T \phi(d_2)^T$, where S is the semantic matrix.

S can be defined as S = RP, where
  R is a diagonal matrix giving the term weightings or relevances
  P is the proximity matrix defining the semantic spreading between the different terms of the corpus

The inverse document frequency measure for a term t is given by
  $w(t) = \ln \frac{\ell}{df(t)}$
where $\ell$ is the number of documents and df(t) is the number of documents containing term t.

The matrix R is diagonal with entries $R_{tt} = w(t)$.

Semantic Kernels (2)

For the term-weighting matrix R alone, the associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, R R^T \phi(d_2)^T$

For the proximity matrix P, the associated kernel is
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, P P^T \phi(d_2)^T = \sum_{i,j} \phi(d_1)_i\, Q_{ij}\, \phi(d_2)_j$
where $Q = P P^T$ encodes the amount of semantic relation between terms i and j.
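A toy sketch of such a semantic kernel, assuming R is the diagonal of IDF weights from the previous slide and P defaults to the identity; the corpus and helper names are invented for illustration only.

```python
import numpy as np

def idf_weights(doc_term_counts):
    """Diagonal of R: w(t) = ln(l / df(t)) over a corpus of l documents (rows)."""
    l = doc_term_counts.shape[0]
    df = np.count_nonzero(doc_term_counts, axis=0)
    return np.log(l / df)

def semantic_kernel(phi1, phi2, R_diag, P=None):
    """k(d1, d2) = phi(d1) S S^T phi(d2)^T with S = R P; P defaults to the identity."""
    S = np.diag(R_diag) if P is None else np.diag(R_diag) @ P
    return float(phi1 @ S @ S.T @ phi2)

# Toy corpus: rows = documents, columns = raw term counts
D = np.array([[2, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 2, 1, 1]], dtype=float)
R = idf_weights(D)
print(semantic_kernel(D[0], D[1], R))
```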

Semantic Kernels (3)

The most natural method of incorporating semantic information is to infer the relatedness of terms from an external source of domain knowledge.

Example: the WordNet ontology
  Semantic distance based on:
    path length in the hierarchical tree
    information content

Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

Singular Value Decomposition (SVD):
  $D = U \Sigma V^T$
where $\Sigma$ is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of $D D^T$ and $D^T D$ respectively.

LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors for subsequent processing:
  $\tilde{\phi}(d) = \phi(d)\, U_k$
where $U_k$ is the matrix containing the first k columns of U.

Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)

The new kernel becomes that of Kernel PCA:
  $\tilde{k}(d_1, d_2) = \phi(d_1)\, U_k U_k^T \phi(d_2)^T$

LSK is implemented by projecting onto the features
  $\tilde{\phi}(d) = \phi(d)\, U_k = \Bigl( \lambda_i^{-1/2} \sum_{j=1}^{m} (v_i)_j\, k(d_j, d) \Bigr)_{i=1}^{k}$
where k is the base kernel and $(\lambda_i, v_i)$ are the eigenvalue-eigenvector pairs of the kernel matrix.

The LSK can be represented with the proximity matrix
  $P = U_k U_k^T$
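A compact sketch of LSI/LSK through an SVD of a toy term-by-document matrix; the names and the data are illustrative only.

```python
import numpy as np

def lsi_features(D, k):
    """D: term-by-document matrix (documents as columns). Returns one k-dim row per document."""
    U, sigma, Vt = np.linalg.svd(D, full_matrices=False)   # D = U Sigma V^T
    Uk = U[:, :k]
    # Each document phi(d) is a column of D; project it onto the first k columns of U
    return D.T @ Uk

def lsk_matrix(D, k):
    """Latent semantic kernel matrix: k~(d_i, d_j) = phi(d_i) U_k U_k^T phi(d_j)^T."""
    Z = lsi_features(D, k)
    return Z @ Z.T

# Toy term-by-document matrix: 5 terms, 4 documents
D = np.array([[1, 0, 2, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [2, 0, 0, 1]], dtype=float)
print(lsk_matrix(D, k=2))
```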

String and Sequence

An alphabet $\Sigma$ is a finite set of $|\Sigma|$ symbols.
A string $s = s_1 \ldots s_{|s|}$ is any finite sequence of symbols from $\Sigma$, including the empty sequence.
We denote by $\Sigma^n$ the set of all finite strings of length n.

String matching: implies contiguity.
Sequence matching: only implies order.

p-spectrum kernel (1)

Features of s = p-spectrum of s = histogram of all (contiguous) substrings of length p.
The feature space is indexed by all elements of $\Sigma^p$.
$\phi_u^p(s)$ = number of occurrences of u in s.
The associated kernel is defined as
  $k_p(s, t) = \sum_{u \in \Sigma^p} \phi_u^p(s)\, \phi_u^p(t)$

p-spectrum kernel example

Example: 3-spectrum kernel with s = "statistics" and t = "computation".

The two strings contain the following substrings of length 3:
  s: sta, tat, ati, tis, ist, sti, tic, ics
  t: com, omp, mpu, put, uta, tat, ati, tio, ion

The common substrings are "tat" and "ati", so the inner product is k(s, t) = 2.
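A direct (non-recursive) sketch of the p-spectrum kernel, reproducing the example above; it is fine for short strings, while the recursion on the next slide is the efficient route.

```python
from collections import Counter

def p_spectrum(s, p):
    """Histogram of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def k_p_spectrum(s, t, p):
    """p-spectrum kernel: inner product of the two p-gram histograms."""
    phi_s, phi_t = p_spectrum(s, p), p_spectrum(t, p)
    return sum(count * phi_t[u] for u, count in phi_s.items())

print(k_p_spectrum("statistics", "computation", 3))   # 2: "tat" and "ati" are shared
```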

p-spectrum Kernels Recursion

The k-suffix kernel is defined by
  $k_k^S(s, t) = \begin{cases} 1 & \text{if } s = s_1 u,\ t = t_1 u \text{ for some } u \in \Sigma^k \\ 0 & \text{otherwise} \end{cases}$

The p-spectrum kernel can then be evaluated using
  $k_p(s, t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} k_p^S\bigl(s(i : i+p),\ t(j : j+p)\bigr)$
in O(p |s| |t|) operations.

The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix of s.

All-subsequences kernels

Feature mapping defined by all contiguous or non-contiguous subsequences of a string.
The feature space is indexed by all elements of $\Sigma^* = \{\varepsilon\} \cup \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \ldots$
$\phi_u(s)$ = number of occurrences of u as a (non-contiguous) subsequence of s.
Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation).

Recursive implementation

Consider the addition of one extra symbol a to s: common subsequences of (sa, t) are either common subsequences of (s, t) or must end with the symbol a (in both sa and t). Mathematically,
  $k(s, \varepsilon) = 1$
  $k(sa, t) = k(s, t) + \sum_{j : t_j = a} k\bigl(s,\ t(1 : j-1)\bigr)$

This gives a complexity of $O(|s|\,|t|^2)$.
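The recursion translates into a small dynamic program over prefixes; a sketch with invented names, counting the empty subsequence exactly as the slide's base case does.

```python
def all_subsequences_kernel(s, t):
    """K[i][j] = number of common (possibly non-contiguous) subsequences of
    s[:i] and t[:j], including the empty subsequence; O(|s| |t|^2)."""
    K = [[1] * (len(t) + 1) for _ in range(len(s) + 1)]   # k(s, eps) = k(eps, t) = 1
    for i in range(1, len(s) + 1):
        a = s[i - 1]
        for j in range(1, len(t) + 1):
            # k(s'a, t) = k(s', t) + sum over occurrences of a within t[:j]
            K[i][j] = K[i - 1][j] + sum(K[i - 1][jp - 1]
                                        for jp in range(1, j + 1) if t[jp - 1] == a)
    return K[len(s)][len(t)]

print(all_subsequences_kernel("gat", "cata"))   # 5: eps, 'a' (x2), 't', 'at'
```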

Fixed-length subsequence kernels

The feature space is indexed by all elements of $\Sigma^p$.
$\phi_u(s)$ = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s.

Recursive implementation (will create a series of p tables):
  $k_0(s, t) = 1$
  $k_p(s, \varepsilon) = 0$ for $p > 0$
  $k_p(sa, t) = k_p(s, t) + \sum_{j : t_j = a} k_{p-1}\bigl(s,\ t(1 : j-1)\bigr)$

Complexity: O(p |s| |t|), but we get the k-length subsequence kernels (k <= p) for free, so it is easy to compute $k(s, t) = \sum_{l} a_l\, k_l(s, t)$.
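The same dynamic program, now keeping one table per subsequence length; a naive sketch under the base cases above (the slide's O(p|s||t|) bound needs a cumulative-sum refinement not shown here).

```python
def fixed_length_subseq_kernel(s, t, p):
    """k_p(s, t): number of length-p subsequences shared by s and t,
    counted with multiplicity over index tuples; O(p |s| |t|^2) in this naive form."""
    # K[q][i][j] = k_q(s[:i], t[:j]); k_0 = 1 everywhere, k_q(., eps) = 0 for q > 0
    K = [[[1 if q == 0 else 0 for _ in range(len(t) + 1)]
          for _ in range(len(s) + 1)] for q in range(p + 1)]
    for q in range(1, p + 1):
        for i in range(1, len(s) + 1):
            a = s[i - 1]
            for j in range(1, len(t) + 1):
                K[q][i][j] = K[q][i - 1][j] + sum(K[q - 1][i - 1][jp - 1]
                                                  for jp in range(1, j + 1) if t[jp - 1] == a)
    return K[p][len(s)][len(t)]

print(fixed_length_subseq_kernel("gat", "cata", 2))   # 1: only "at" is shared
```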

Gap-weighted subsequence kernels (1)

The feature space is indexed by all elements of $\Sigma^p$.
$\phi_u(s)$ = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: $\lambda^{\text{length of the occurrence}}$ [NB: this length includes both matching symbols and gaps].

Example (1):
  The string "gon" occurs as a subsequence of the strings "gone", "going" and "galleon", but we consider the first occurrence as more important, since it is contiguous, while the final occurrence is the weakest of all three.

Gap-weighted subsequence kernels (2)

Example (2):
  D1: ATCGTAGACTGTC
  D2: GACTATGC
  $\phi_{CAT}(D_1) = 2\lambda^8 + 2\lambda^{10}$ and $\phi_{CAT}(D_2) = \lambda^4$
  so $k(D_1, D_2)_{CAT} = 2\lambda^{12} + 2\lambda^{14}$

Naturally built as a dot product, hence a valid kernel.
  For an alphabet of size 80, there are 512,000 trigrams
  For an alphabet of size 26, there are about 12 x 10^6 5-grams
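To make the example concrete, here is a brute-force sketch of the gap-weighted feature value for a single subsequence u (only usable for short strings; the efficient recursion mentioned on the next slide is what one would use in practice).

```python
from itertools import combinations

def gap_weighted_feature(s, u, lam):
    """phi_u(s): sum of lam**span over all occurrences of u as a (non-contiguous)
    subsequence of s, where span = last index - first index + 1 (symbols plus gaps)."""
    return sum(lam ** (idx[-1] - idx[0] + 1)
               for idx in combinations(range(len(s)), len(u))
               if all(s[i] == c for i, c in zip(idx, u)))

lam = 0.5
d1, d2 = "ATCGTAGACTGTC", "GACTATGC"
# phi_CAT(D1) = 2*lam**8 + 2*lam**10, phi_CAT(D2) = lam**4, product = 2*lam**12 + 2*lam**14
print(gap_weighted_feature(d1, "CAT", lam) * gap_weighted_feature(d2, "CAT", lam))
```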

Gap-weighted subsequence kernels (3)

Explicit expansion and dot product are hard to perform.
There is an efficient recursive formulation (of dynamic programming type), whose complexity is O(k |D1| |D2|).

Word Sequence Kernels (1)

Here, words are considered as the symbols.
  Meaningful symbols give more relevant matching
  Linguistic preprocessing can be applied to improve performance
  Shorter sequence sizes give improved computation time
  But increased sparsity (documents are more "orthogonal")

Motivation: the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task.

Word Sequence Kernels (2)

Link between Word Sequence Kernels and other methods:
  For k = 1, WSK is equivalent to the basic Bag of Words approach
  For $\lambda = 1$, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account

Extensions of WSK:
  Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the part of speech, stop words)
  Different decay factors for gaps and matches (e.g. noun < adj when in a gap; noun > adj when matched)
  Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels)

Tree Kernels

Applications: categorization [one doc = one tree], parsing (disambiguation) [one doc = multiple trees].
Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels).
Intuitively, the philosophy is
  to split the structured objects into parts,
  to define a kernel on the atoms and a way to recursively combine the kernels over parts to get the kernel over the whole.

Feature space definition: one feature for each possible proper subtree in the training data; the feature value is the number of occurrences.
A subtree is defined as any part of the tree which includes more than one node, with the restriction that no partial rule productions are allowed.

Trees in Text: example

[Figure: a parse tree for the sentence "John loves Mary", shown together with a few among the many subtrees of this tree]

Tree Kernels: algorithm

The kernel is the dot product in this high-dimensional feature space.
Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!).
Basically, it compares the productions of all possible pairs of nodes (n1, n2) with n1 in T1 and n2 in T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, considering the number of common subtrees rooted at the common children.
Formally, let $k_{\text{co-rooted}}(n_1, n_2)$ = number of common subtrees rooted at both n1 and n2; then
  $k(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} k_{\text{co-rooted}}(n_1, n_2)$

All sub-tree kernel

$k_{\text{co-rooted}}(n_1, n_2) = 0$ if n1 or n2 is a leaf
$k_{\text{co-rooted}}(n_1, n_2) = 0$ if n1 and n2 have different productions or, if labeled, different labels
Otherwise, $k_{\text{co-rooted}}(n_1, n_2) = \prod_{\text{children } i} \bigl(1 + k_{\text{co-rooted}}(\text{ch}(n_1, i), \text{ch}(n_2, i))\bigr)$

"Production" is left intentionally ambiguous, so as to cover both unlabelled and labeled trees.
Complexity is O(|T1| |T2|).
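A small sketch of this recursion under an assumed tree encoding (nested tuples whose first element is the node label, with bare strings as leaves); the representation and helper names are mine, not from the slides.

```python
def all_subtree_kernel(t1, t2):
    """Tree kernel K(T1, T2) = sum over node pairs of k_co_rooted(n1, n2)."""
    def nodes(t):                         # all internal nodes (leaves are bare strings)
        return [] if isinstance(t, str) else [t] + [n for c in t[1:] for n in nodes(c)]

    def production(n):                    # node label plus the sequence of child labels
        return (n[0],) + tuple(c if isinstance(c, str) else c[0] for c in n[1:])

    def k_co_rooted(n1, n2):
        if production(n1) != production(n2):
            return 0
        result = 1
        for c1, c2 in zip(n1[1:], n2[1:]):
            if not isinstance(c1, str):   # leaf children cannot be expanded further
                result *= 1 + k_co_rooted(c1, c2)
        return result

    return sum(k_co_rooted(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

# Parse tree of "John loves Mary" compared with itself
T = ("S", ("NP", ("N", "John")), ("VP", ("V", "loves"), ("NP", ("N", "Mary"))))
print(all_subtree_kernel(T, T))
```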

References

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11)
J. Tian, PCA/Kernel PCA for Image Denoising, September 16, 2004
T. Gartner, A Survey of Kernels for Structured Data, ACM SIGKDD Explorations Newsletter, July 2003
N. Cristianini, Latent Semantic Kernels, Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001
