Machine Learning

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

Pattern Recognition 41 (2008) 176 190

www.elsevier.com/locate/pr

A survey of kernel and spectral methods for clustering


Maurizio Filippone a, , Francesco Camastra b , Francesco Masulli a , Stefano Rovetta a
a Department of Computer and Information Science, University of Genova, and CNISM, Via Dodecaneso 35, I 16146 Genova, Italy
b Department of Applied Science, University of Naples Parthenope, Via A. De Gasperi 5, I 80133 Napoli, Italy

Received 19 October 2006; received in revised form 30 April 2007; accepted 29 May 2007

Abstract
Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the
partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present
a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The
presented kernel clustering methods are the kernel version of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral
clustering arise from concepts in spectral graph theory and the clustering problem is congured as a graph cut problem where an appropriate
objective function has to be optimized. An explicit proof of the fact that these two paradigms have the same objective is reported since it has
been proven that these two seemingly different approaches have the same mathematical foundation. Besides, fuzzy kernel clustering methods
are presented as extensions of kernel K-means clustering algorithm.
2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Partitional clustering; Mercer kernels; Kernel clustering; Kernel fuzzy clustering; Spectral clustering

1. Introduction
Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures. Clustering methods [1,2] have been addressed in many contexts and disciplines
such as data mining, document retrieval, image segmentation
and pattern classication. The aim of clustering methods is to
group patterns on the basis of a similarity (or dissimilarity)
criteria where groups (or clusters) are set of similar patterns.
Crucial aspects in clustering are pattern representation and the
similarity measure. Each pattern is usually represented by a
set of features of the system under study. It is very important
to notice that a good choice of representation of patterns can
lead to improvements in clustering performance. Whether it is
possible to choose an appropriate set of features depends on the
system under study. Once a representation is xed it is possible

Corresponding author. Tel.: +39 010 3536610; fax: +39 010 3536699.

E-mail addresses: [email protected] (M. Filippone),


[email protected]
(F. Camastra), [email protected] (F. Masulli), [email protected]
(S. Rovetta).

to choose an appropriate similarity measure among patterns.


The most popular dissimilarity measure for metric representations is the distance, for instance the Euclidean one [3].
Clustering techniques can be roughly divided into two categories:
hierarchical;
partitioning.
Hierarchical clustering techniques [1,4,5] are able to nd structures which can be further divided in substructures and so on recursively. The result is a hierarchical structure of groups known
as dendrogram.
Partitioning clustering methods try to obtain a single partition of data without any other sub-partition like hierarchical
algorithms do and are often based on the optimization of an
appropriate objective function. The result is the creation of
separations hypersurfaces among clusters. For instance we can
consider two nonlinear clusters as in Fig. 1. Standard partitioning methods (e.g., K-means, fuzzy c-means, SOM and neural
gas) using two centroids are not able to separate in the desired
way the two rings. The use of many centroids could solve this
problem providing a complex description of a simple data set.

0031-3203/$30.00 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2007.05.018

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

6
4

x2

2
0
2
4
6
6

0
x1

Fig. 1. A data set composed of two rings of points.

For this reason several modications and new approaches have


been introduced to cope with this problem.
Among the large amount of modications we can mention
the fuzzy c-varieties [6], but the main drawback is that some a
priori information on the shape of clusters must be included.
Recently, some clustering methods that produce nonlinear
separating hypersurfaces among clusters have been proposed.
These algorithms can be divided in two big families: kernel and
spectral clustering methods.
Regarding kernel clustering methods, several clustering
methods have been modied incorporating kernels (e.g.,
K-means, fuzzy c-means, SOM and neural gas). The use of
kernels allows to map implicitly data into an high dimensional
space called feature space; computing a linear partitioning in
this feature space results in a nonlinear partitioning in the input
space.
Spectral clustering methods arise from concepts in spectral
graph theory. The basic idea is to construct a weighted graph
from the initial data set where each node represents a pattern
and each weighted edge simply takes into account the similarity between two patterns. In this framework the clustering
problem can be seen as a graph cut problem, which can be
tackled by means of the spectral graph theory. The core of this
theory is the eigenvalue decomposition of the Laplacian matrix of the weighted graph obtained from data. In fact, there is
a close relationship between the second smallest eigenvalue of
the Laplacian and the graph cut [7,8].
The aim of this paper is to present a survey of kernel and spectral clustering methods. Moreover, an explicit proof of the fact
that these two approaches have the same mathematical foundation is reported. In particular it has been shown by Dhillon
et al. that kernel K-means and spectral clustering with the ratio association as the objective function are perfectly equivalent [911]. The core of both approaches lies in their ability
to construct an adjacency structure between data avoiding to
deal with a prexed shape of clusters. These approaches have
a slight similarity with hierarchical methods in the use of an
adjacency structure with the main difference in the philosophy
of the grouping procedure.
A comparison of some spectral clustering methods has been
recently proposed in Ref. [12], while there are some theoretical

177

results on the capabilities and convergence properties of spectral


methods for clustering [1316]. Recently, kernel methods have
been applied to fuzzy c-varieties also [17] with the aim of
nding varieties in feature space and there are some interesting
clustering methods using kernels such as Refs. [18,19].
Since the choice of the kernel and of the similarity measure is
crucial in these methods, many techniques have been proposed
in order to learn automatically the shape of kernels from data
as in Refs. [2023].
Regarding the applications, most of these algorithms (e.g.,
Refs. [17,21,24]) have been applied to standard benchmarks
such as Ionosphere [25], Breast Cancer [26] and Iris [27].1
Kernel fuzzy c-means proposed in Refs. [2830] has been applied in image segmentation problems while in Ref. [31] it has
been applied in handwritten digits recognition. There are applications of kernel clustering methods in face recognition using
kernel SOM [32], in speech recognition [33] and in prediction of crop yield from climate and plantation data [34]. Spectral methods have been applied in clustering of articial data
[35,36], in image segmentation [23,37,38], in bioinformatics
[39], and in co-clustering problems of words and documents
[40] and genes and conditions [41]. A semi-supervised spectral
approach to bioinformatics and handwritten character recognition have been proposed in Ref. [42]. The protein sequence
clustering problem has been faced using spectral techniques in
Ref. [43] and kernel methods in Ref. [44].
In the next section we briey introduce the concepts of linear
partitioning methods by recalling some basic crisp and fuzzy
algorithms. Then the paper is organized as follows: Section 3
shows the kernelized version of the algorithms presented in
Section 2, in Section 4 we discuss spectral clustering, while
in Section 5 we report the equivalence between spectral and
kernel clustering methods. In the last section conclusions are
drawn.
2. Partitioning methods
In this section we briey recall some basic facts about partitioning clustering methods and we will report the clustering
methods for which a kernel version has been proposed. Let
X={x1 , . . . , xn } be a data set composed by n patterns for which
every xi Rd . The codebook (or set of centroids) V is dened
as the set V = {v1 , . . . , vc }, typically with c>n. Each element
vi Rd is called codevector (or centroid or prototype).2
The Voronoi region Ri of the codevector vi is the set of
vectors in Rd for which vi is the nearest vector:
Ri = {z Rd |i = arg min z vj 2 }.
j

(1)

It is possible to prove that each Voronoi region is convex [45]


and the boundaries of the regions are linear segments.
1 These data sets can be found at: (ftp://ftp.ics.uci.edu/pub/machinelearning-databases/).
2 Among many terms to denote such objects, we will use codevectors
as in vector quantization theory.

178

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

Fig. 2. An example of Voronoi tessellation where each black point is a


codevector.

The denition of the Voronoi set i of the codevector vi is


straightforward. It is the subset of X for which the codevector
vi is the nearest vector:
i = {x X|i = arg min x vj 2 },
j

(2)

that is, the set of vectors belonging to Ri . A partition on Rd


induced by all Voronoi regions is called Voronoi tessellation or
Dirichlet tessellation (Fig. 2).
2.1. Batch K-means
A simple algorithm able to construct a Voronoi tessellation
of the input space was proposed in 1957 by Lloyd [46] and
it is known as batch K-means. Starting from the nite data
set X this algorithm moves iteratively the k codevectors to the
arithmetic mean of their Voronoi sets {i }i=1,...,k . Theoretically
speaking, a necessary condition for a codebook V to minimize
the empirical quantization error:
E(X) =

k
1  
x vi 2 ,
2n
x
i=1

(3)

is that each codevector vi fullls the centroid condition [47].


In the case of a nite data set X and with Euclidean distance,
the centroid condition reduces to
1 
x.
(4)
vi =
|i | x
i

Batch K-means is formed by the following steps:

Batch K-means can be viewed as an expectationmaximization


[48] algorithm, ensuring the convergence after a nite number
of steps.
This approach presents many disadvantages [3]. Local minima of E(X) make the method dependent on initialization, and
the average is sensitive to outliers. Moreover, the number of
clusters to nd must be provided, and this can be done only
using some a priori information or additional validity criterion. Finally, K-means can deal only with clusters with spherically symmetrical point distribution, since Euclidean distances
of patterns from centroids are computed leading to a spherical invariance. Different distances lead to different invariance
properties as in the case of Mahalanobis distance which produces invariance on ellipsoids [3].
The term batch means that at each step the algorithm takes
into account the whole data set to update the codevectors. When
the cardinality n of the data set X is very high (e.g., several
hundreds of thousands) the batch procedure is computationally expensive. For this reason an on-line update has been introduced leading to the on-line K-means algorithm [45,49]. At
each step, this method simply randomly picks an input pattern
and updates its nearest codevector, ensuring that the scheduling of the updating coefcient is adequate to allow convergence
and consistency.
2.2. Self-organizing maps (SOM)
A self-organizing map (SOM) [50] also known as selforganizing feature map (SOFM) represents data by means of
codevectors organized on a grid with xed topology. Codevectors move to adapt to the input distribution, but adaptation
is propagated along the grid also to neighboring codevectors,
according to a given propagation or neighborhood function.
This effectively constrains the evolution of codevectors. Grid
topologies may differ, but in this paper we consider a twodimensional, square-mesh topology [51,52]. The distance on
the grid is used to determine how strongly a codevector is
adapted when the unit aij is the winner. The metric used on
a rectangular grid is the Manhattan distance, for which the
distance between two elements r = (r1 , r2 ) and s = (s1 , s2 ) is
drs = |r1 s1 | + |r2 s2 |.

(5)

The SOM algorithm is the following:

(1) choose the number k of clusters;


(2) initialize the codebook V with vectors randomly picked
from X;
(3) compute the Voronoi set i associated to the codevector vi ;
(4) move each codevector to the mean of its Voronoi set using
Eq. (4); and
(5) return to step 3 if any codevector has changed otherwise
return the codebook.

(1) Initialize the codebook V randomly picking from X.


(2) Initialize the set C of connections to form the rectangular
grid of dimension n1 n2 .
(3) Initialize t = 0.
(4) Randomly pick an input x from X.
(5) Determine the winner

At the end of the algorithm a codebook is found and a Voronoi


tessellation of the input space is provided. It is guaranteed that
after each iteration the quantization error does not increase.

(6) Adapt each codevector:

s(x) = arg min x vj .


vj V

vj = (t)h(drs )(x vj ),

(6)

(7)

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

where h is a decreasing function of d as for instance:




d2
h(drs ) = exp 2rs
.
(8)
2 (t)
(7) Increment t.
(8) If t < tmax go to step 4.
(t) and (t) are decreasing functions of t, for example [53]:

(t) = i

f
i

t/tmax
,

 t/tmax
f
(t) = i
,
i

(9)

where i , f and i , f are the initial and nal values for the
functions (t) and (t).
A nal note on the use of SOM for clustering. The method
was originally devised as a tool for embedding multidimensional data into typically two dimensional spaces, for data
visualization. Since then, it has also been frequently used as a
clustering method, which was originally not considered appropriate because of the constraints imposed by the topology. However, the topology itself serves an important purpose, namely,
that of limiting the exibility of the mapping in the rst training
cycles, and gradually increasing it (while decreasing the magnitude of updates to ensure convergence) as more cycles were
performed. The strategy is similar to that of other algorithms,
including these described in the following, in the capacity control of the method which has the effect of avoiding local minima. This accounts for the fast convergence often reported in
experimental works.
2.3. Neural gas
Another technique that tries to minimize the distortion error
is the neural gas algorithm [54], based on a soft adaptation rule.
This technique resembles the SOM in the sense that not only the
winner codevector is adapted. It is different in that codevectors
are not constrained to be on a grid, and the adaptation of the
codevectors near the winner is controlled by a criterion based
on distance ranks. Each time a pattern x is presented, all the
codevectors vj are ranked according to their distance to x (the
closest obtains the lowest rank). Denoting with j the rank of
the distance between x and the codevector vj , the update rule is
vj = (t)h (j )(x vj ),

(10)

with (t) [0, 1] gradually lowered as t increases and h (j )


a function decreasing with j with a characteristic decay ;
usually h (j ) = exp(j /). The Neural Gas algorithm is the
following:
(1)
(2)
(3)
(4)

Initialize the codebook V randomly picking from X.


Initialize the time parameter t = 0.
Randomly pick an input x from X.
Order all elements vj of V according to their distance to
x, obtaining the j .
(5) Adapt the codevectors according to Eq. (10).
(6) Increase the time parameter t = t + 1.
(7) If t < tmax go to step 3.

179

2.4. Fuzzy clustering methods


Bezdek [6] introduced the concept of hard and fuzzy partition in order to extend the notion of membership of pattern
to clusters. The motivation of this extension is related to the
fact that a pattern often cannot be thought of as belonging to a
single cluster only. In many cases, a description in which the
membership of a pattern is shared among clusters is necessary.
Denition 2.1. Let Acn denote the vector space of c n real
matrices over R. Considering X, Acn and c N such that
2 c < n, the Fuzzy c-partition space for X is the set:


c



Mfc = U Acn  uih [0, 1] i, h;
uih = 1 h;

i=1

n

0<
uih < n i .
(11)
h=1

The matrix U is the so called membership matrix since each


element uih is the fuzzy membership of the hth pattern to the
ith cluster. The denition of Mfc simply tells that the sum
of the memberships of a pattern to all clusters is one (probabilistic constraint) and that a cluster cannot be empty or contain all patterns. This denition generalizes the notion of hard
c-partitions in Ref. [6].
The mathematical tool used in all these methods for working
out the solution procedure is the Lagrange multipliers technique. In particular a minimization of the intraclusters distance
functional with a probabilistic constraint on the memberships of a pattern to all clusters has to be achieved. Since
all the functionals involved in these methods depend on both
memberships and codevectors, the optimization is iterative and
follows the so called Picard iterations method [6] where each
iteration is composed of two steps. In the rst step a subset of
variables (memberships) is kept xed and the optimization is
performed with respect to the remaining variables (codevectors) while in the second one the role of the xed and moving
variables is swapped. The optimization algorithm stops when
variables change less than a xed threshold.
2.4.1. Fuzzy c-means
The fuzzy c-means algorithm [6] identies clusters as fuzzy
sets. It minimizes the functional:
J (U, V ) =

c
n 


(uih )m xh vi 2 ,

(12)

h=1 i=1

with respect to the membership matrix U and the codebook V


with the probabilistic constraints:
c


uih = 1

i = 1, . . . , n.

(13)

i=1

The parameter m controls the fuzziness of the memberships


and usually it is set to two; for high values of m the algorithm
tends to set all the memberships equals while for m tending to
one we obtain the K-means algorithm where the memberships

180

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

are crisp. The minimization of Eq. (12) is done introducing a


Lagrangian function for each pattern for which the constraint
is in Eq. (13).
 c

c


Lh =
(uih )m xh vi 2 + h
uih 1 .
(14)
i=1

i=1

Then the derivatives of the sum of the Lagrangian are computed


with respect to the uih and vi and are set to zero. This yields
the iteration scheme of these equations:
u1
ih =


c 

xh vi  2/(m1)
j =1

xh vj 

n
(uih )m xh
vi =
h=1
n
m .
h=1 (uih )

(15)

(16)

At each iteration it is possible to evaluate the amount of change


of the memberships and codevectors and the algorithm can be
stopped when these quantities reach a predened threshold. At
the end a soft partitioning of the input space is obtained.
2.5. Possibilistic clustering methods
As a further modication of the K-means algorithm, the possibilistic approach [55,56] relaxes the probabilistic constraint
on the membership of a pattern to all clusters. In this way a
pattern can have a low membership to all clusters in the case
of outliers, whereas for instance, in the situation of overlapped
clusters, it can have high membership to more than one cluster. In this framework the membership represents a degree of
typicality not depending on the membership values of the same
pattern to other clusters. Again the optimization procedure is
the Picard iteration method, since the functional depends both
on memberships and codevectors.
2.5.1. Possibilistic c-means
There are two formulations of the possibilistic c-means, that
we will call PCM-I [55] and PCM-II [56]. The rst one aims
to minimize the following functional with respect to the membership matrix U and the codebook V = {v1 , . . . , vc }:
J (U, V ) =

c
n 


(uih )m xh vi 2

h=1 i=1

c


i

i=1

n


(1 uih )m ,

(17)

h=1

while the second one addresses the functional:


J (U, V ) =

c
n 


uih xh vi 2

i=1

i

n

h=1

(uih ln(uih ) uih ) .

xh vi 2
uih = exp
i


.

(20)

The constraint on the memberships uih [0, 1] is automatically satised given the form assumed by Eqs. (19) and (20).
The updates of the centroids for PCM-I and PCM-II are,
respectively:

n
(uih )m xh
vi =
h=1
(21)
n
m ,
h=1 (uih )

n
uih xh
.
(22)
vi =
h=1
n
h=1 uih
The parameter i regulates the trade-off between the two terms
in Eq. (17) and Eq. (18) and it is related to the width of the
clusters. The authors suggest to estimate i for PCM-I using
this formula:

n
(uih )m xh vi 2
,
(23)
i = h=1
n
m
h=1 (uih )
which is a weighted mean of the intracluster distance of the ith
cluster and the constant is typically set at one. The parameter i can be estimated with scale estimation techniques as
developed in the robust clustering literature for M-estimators
[57,58].The value of i can be updated at each step of the algorithm or can be xed for all iterations. The former approach
can lead to instabilities since the derivation of the equations has
been obtained considering i xed. In the latter case a good estimation of i can be done only starting from an approximate
solution. For this reason often the possibilistic c-means is run
as a rening step of a fuzzy c-means.
3. Kernel clustering methods
In machine learning, the use of the kernel functions [59]
has been introduced by Aizerman et al. [60] in 1964. In 1995
Cortes and Vapnik introduced support vector machines (SVMs)
[61] which perform better than other classication algorithms
in several problems. The success of SVM has brought to extend
the use of kernels to other learning algorithms (e.g., Kernel
PCA [62]). The choice of the kernel is crucial to incorporate a
priori knowledge on the application, for which it is possible to
design ad hoc kernels.
3.1. Mercer kernels
We recall the denition of Mercer kernels [63,64], considering, for the sake of simplicity, vectors in Rd instead of Cd .

h=1 i=1
c


The minimization of Eqs. (17) and (18) with respect to the uih
leads, respectively, to the following equations:

1/(m1) 1

xh vi 2
uih = 1 +
,
(19)
i

(18)

Denition 3.1. Let X = {x1 , . . . , xn } be a nonempty set where


xi Rd . A function K : X X R is called a positive

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

denite kernel (or Mercer kernel) if and only if K is symmetric


(i.e. K(xi , xj ) = K(xj , xi )) and the following equation holds:
n 
n


ci cj K(xi , xj ) 0

n 2,

(24)

i=1 j =1

where cr R r = 1, . . . , n.

(25)

where
: X F performs a mapping from the input space
X to a high dimensional feature space F. One of the most
relevant aspects in applications is that it is possible to compute
Euclidean distances in F without knowing explicitly
. This
can be done using the so called distance kernel trick [62,65]:

(xi )
(xj )2
= (
(xi )
(xj )) (
(xi )
(xj ))
=
(xi )
(xi ) +
(xj )
(xj ) 2
(xi )
(xj )
= K(xi , xi ) + K(xj , xj ) 2K(xi , xj )

(26)

in which the computation of distances of vectors in feature


space is just a function of the input vectors. In fact, every
algorithm in which input vectors appear only in dot products
with other input vectors can be kernelized [66]. In order to
simplify the notation we introduce the so called Gram matrix
K where each element kij is the scalar product
(xi )
(xi ).
Thus, Eq. (26) can be rewritten as

(xi )
(xj )2 = kii + kjj 2kij .

(27)

Examples of Mercer kernels are the following [67]:


Linear:
K (l) (xi , xj ) = xi xj .

(28)

Polynomial of degree p:
K (p) (xi , xj ) = (1 + xi xj )p ,
Gaussian:
K

(g)

p N.

xi xj 2
(xi , xj ) = exp
22

(29)


,

 R.

elements of a set X and possibly, for some of these representations, the clusters can be easily identied.
In literature there are some applications of kernels in clustering. These methods can be broadly divided in three categories,
which are based, respectively, on:
kernelization of the metric [29,68,69];
clustering in feature space [31,7073];
description via support vectors [24,74].

Each Mercer kernel can be expressed as follows:


K(xi , xj ) =
(xi )
(xj ),

181

(30)

Methods based on kernelization of the metric look for centroids


in input space and the distances between patterns and centroids
is computed by means of kernels:

(xh )
(vi )2 = K(xh , xh ) + K(vi , vi ) 2K(xh , vi ).
(32)
Clustering in feature space is made by mapping each pattern
using the function
and then computing centroids in feature
space. Calling vi
the centroids in feature space, we will see
in the next sections that it is possible to compute the distances

(xh ) vi
2 by means of the kernel trick.
The description via support vectors makes use of One Class
SVM to nd a minimum enclosing sphere in feature space able
to enclose almost all data in feature space excluding outliers.
The computed hypersphere corresponds to nonlinear surfaces
in input space enclosing groups of patterns. The support vector
clustering algorithm allows to assign labels to patterns in input
space enclosed by the same surface. In the next subsections we
will outline these three approaches.
3.2. Kernel K-means
Given the data set X, we map our data in some feature space
F, by means of a nonlinear map
and we consider k centers
in feature space (vi
F with i = 1, . . . , k) [62,75]. We call
the set V
= (v1
, . . . , vk
) Feature Space Codebook since in
our representation the centers in the feature space play the same
role of the codevectors in the input space. In analogy with the
codevectors in the input space, we dene for each center vi
its
Voronoi region and Voronoi set in feature space. The Voronoi
region in feature space (Ri
) of the center vi
is the set of all
vectors in F for which vi
is the closest vector
Ri
= {x
F|i = arg min x
vj
}.
j

(33)

It is important to stress that the use of the linear kernel in Eq.


(26) simply leads to the computation of the Euclidean norm in
the input space. Indeed:

The Voronoi set in feature space 

i of the center vi is the set

of all vectors x in X such that vi is the closest vector to their


images
(x) in the feature space:

xi xj 2 = xi xi + xj xj 2xi xj

i = {x X|i = arg min 


(x) vj }.

= K (xi , xi ) + K (xj , xj ) 2K (xi , xj )


2

(31)
=
(xi )
(xj ) ,
(l)

(l)

(l)

shows that choosing the kernel K (l) implies


= I (where I
is the identity function). Following this consideration we can
think that kernels can offer a more general way to represent the

(34)

The set of the Voronoi regions in feature space dene a Voronoi


Tessellation of the feature space. The Kernel K-means algorithm
has the following steps:
(1) Project the data set X into a feature space F, by means of
a nonlinear mapping
.

182

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

(2) Initialize the codebook V


= (v1
, . . . , vk
) with vi
F.
(3) Compute for each center vi
the set 

i .

(4) Update the codevectors vi in F


vi
=

|

i | x

(x).

(35)
3.3. Kernel SOM

(5) Go to step 3 until any vi


changes.
(6) Return the feature space codebook.
This algorithm minimizes the quantization error in feature
space.
Since we do not know explicitly
it is not possible to compute directly Eq. (35). Nevertheless, it is always possible to
compute distances between patterns and codevectors by using
the kernel trick, allowing to obtain the Voronoi sets in feature
space 

i . Indeed, writing each centroid in feature space as a


combination of data vectors in feature space we have
vj

n


during the annealing and at the end of the procedure the memberships have become crisp; therefore a tessellation of the feature space is found. This linear partitioning in F, back to the
input space, forms a nonlinear partitioning of the input space.

j h
(xh ),

(36)

The kernel version of the SOM algorithm [70,71] is based


on the distance kernel trick. The method tries to adapt the grid
of codevectors vj
in feature space. In kernel SOM we start
writing each codevector as a combination of points in feature
space:
vj

n


j h
(xh ),

(40)

h=1

where the coefcients ih are initialized once the grid is created.


Computing the winner by writing Eq. (6) in feature space leads
to:
s(
(xi )) = arg min 
(xi ) vj
,

(41)

vj
V

h=1

where j h is one if xh 

j and zero otherwise. Now the


quantity

2
n





j h
(xh )
(37)

(xi ) vj
2 =
(xi )

that can be written, using the kernel trick:




 

s(
(xi )) = arg min kii 2
j h kih
j r j s krs .

can be expanded by using the scalar product and the kernel


trick in Eq. (26):
2

n






j h
(xh ) = kii 2
j h kih

(xi )

To update the codevectors we rewrite Eq. (7):

vj
V

(42)

h=1

h=1

 
r

j r j s krs .

(38)

This allows to compute the closest feature space codevector for


each pattern and to update the coefcients j h . It is possible to
repeat these two operations until any j h changes to obtain a
Voronoi tessellation of the feature space.
An on-line version of the kernel K-means algorithm can be
found in Ref. [62]. A further version of K-means in feature
space has been proposed by Girolami [75]. In his formulation
the number of clusters is denoted by c and a fuzzy membership
matrix U is introduced. Each element uih denotes the fuzzy
membership of the point xh to the Voronoi set 

i . This algorithm tries to minimize the following functional with respect


to U :
J
(U, V
) =

c
n 


uih 
(xh ) vi
2 .

(39)

h=1 i=1

The minimization technique used by Girolami is deterministic


annealing [76] which is a stochastic method for optimization.
A parameter controls the fuzziness of the membership during
the optimization and can be thought proportional to the temperature of a physical system. This parameter is gradually lowered

vj
 = vj
+ (t)h(drs )(
(x) vj
).

(43)

Using Eq. (40):


n

h=1

j h
(xh ) =

n


j h
(xh ) + (t)h(drs )

h=1


(x)

n



j h
(xh ) .

(44)

h=1

Thus the rule for the update of j h is



(1 (t)h(drs )) j h
if i  = j,
j h =
(1 (t)h(drs )) j h + (t)h(drs ) otherwise.

(45)

3.4. Kernel neural gas


The neural gas algorithm provides a soft update rule for the
codevectors in input space. The kernel version of neural gas
[72] applies the soft rule for the update to the codevectors in
feature space. Rewriting Eq. (10) in feature space for the update
of the codevectors we have
vj
= h (j )(
(x) vj
).

(46)

Here j is the rank of the distance 


(x) vj
. Again it is
possible to write vj
as a linear combination of
(xi ) as in
Eq. (40), allowing to compute such distances by means of the

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

183

kernel trick. As in the kernel SOM technique, the updating rule


for the centroids becomes an updating rule for the coefcients
of such combination.

when j = 0 and 0 < j < C, the image of xj lies on the


surface of the hypersphere. These points are called support
vectors.

3.5. One class SVM

Moreover, it is possible to write the Wolfe dual form [77], whose


optimization leads to this quadratic programming problem with
respect to the j :


JW =
kjj j
kij i j .
(51)

This approach provides a support vector description in feature space [74,77,78]. The idea is to use kernels to project data
into a feature space and then to nd the sphere enclosing almost all data, namely not including outliers. Formally, a radius
R and the center v of the smallest enclosing sphere in feature
space are dened. The constraint is thus:

(xj ) v2 R 2 + j

j ,

(47)

where the non negative slack variables j have been added.


The Lagrangian for this problem is dened [79]:

(R 2 + j 
(xj ) v2 ) j
L = R2

j j + C

j ,

(48)

and j 0 are Lagrange multipliers, C is a conwhere j 0

stant and C j j is a penalty term. Computing the partial


derivative of L with respect to R, v and j and setting them to
zero leads to the following equations:


j = 1, v =
j
(xj ), j = C j .
(49)
j

The KarushKuhnTucker (KKT) complementary conditions


[79] result in
j j = 0,

(R 2 + j 
(xj ) v2 ) j = 0.

(50)

Following simple considerations regarding all these conditions


it is possible to see that:
when j > 0, the image of xj lies outside the hypersphere.
These points are called bounded support vectors; and

The distance from the image of a point xj and the center v of


the enclosing sphere can be computed as follows:


dj = 
(xj ) v2 = kjj 2
r kj r +
r s krs .
r

(52)
In Fig. 3 it is possible to see the ability of this algorithm to nd
the smallest enclosing sphere without outliers.
3.5.1. Support vector clustering
Once boundaries in input space are found, a labeling procedure is necessary in order to complete clustering. In Ref. [74]
the cluster assignment procedure follows a simple geometric
idea. Any path connecting a pair of points belonging to different clusters must exit from the enclosing sphere in feature
space. Denoting with Y the image in feature space of one of
such paths and with y the elements of Y , it will result that
R(y) > R for some y. Thus it is possible to dene an adjacency
structure in this form:

1 if R(y) < R y Y,
(53)
0 otherwise.
Clusters are simply the connected components of the graph
with the adjacency matrix just dened. In the implementation in
Ref. [77] the check is made sampling the line segment Y in 20
equidistant points.There are some modications on this labeling algorithm (e.g., Refs. [80,81]) that improve performances.
An improved version of SVC algorithm with application in
handwritten digits recognition can be found in Ref. [82].

x2

x2

0
2

4
6
5

0
x1

0
x1

Fig. 3. One class SVM applied to two data sets with outliers. The gray line shows the projection in input space of the smallest enclosing sphere in feature
space. In: (a) a linear kernel and (b) a Gaussian kernel have been used.

184

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

3.5.2. Camastra and Verri algorithm


A technique combining K-means and One Class SVM can
be found in Ref. [24]. The algorithm uses a K-means-like strategy, i.e., moves repeatedly all centers vi
in the feature space,
computing One Class SVM on their Voronoi sets 

i , until no
center changes anymore. Moreover, in order to introduce robustness against outliers, the authors have proposed to compute

One Class SVM on 

i () of each center vi . The set i () is


dened as

We obtain for the memberships:

i () = {xj i and 


(xj ) vi  < }.

3.6.2. Kernel fuzzy c-means in feature space


Here we derive the fuzzy c-means in feature space, which is a
clustering method which allows to nd a soft linear partitioning
of the feature space. This partitioning, back to the input space,
results in a soft nonlinear partitioning of data. The functional to
optimize [31,73] with the probabilistic constraint in Eq. (13) is

(54)

i () is the Voronoi set in the feature space of the center vi


without outliers, that is the images of data points whose distance
from the center is larger than . The parameter  can be set up
using model selection techniques [48] (e.g., cross-validation).
In summary, the algorithm has the following steps:

(1) Project the data set X into a feature space F, by means of


a nonlinear mapping
.
(2) Initialize the codebook V
= (v1
, . . . , vk
) with vi
F.

(3) Compute 

i () for each center vi .

(4) Apply One Class SVM to each i () and assign the center
obtained to vi
.
(5) Go to step 2 until any vi
changes.
(6) Return the feature space codebook.
3.6. Kernel fuzzy clustering methods

u1
ih =


c 

1 K(xh , vi ) 1/(m1)
1 K(xh , vj )

j =1

and for the codevectors:

n
(uih )m K(xh , vi )xh
vi =
h=1
.
n
m
h=1 (uih ) K(xh , vi )

J
(U, V
) =

c
n 


(uih )m 
(xh ) vi
2 .

(57)

(58)

(59)

h=1 i=1

It is possible to rewrite the norm in Eq. (59) explicitly by using:

n
n

(uih )m
(xh )

n
vi = h=1
=
a
(uih )m
(xh ),
(60)
i
m
h=1 (uih )
h=1

which is the kernel version of Eq. (16). For simplicity of notation we used:
ai1 =

n


(uir )m .

(61)

r=1

Here we show some kernelized versions of fuzzy c-means algorithms, showing in particular fuzzy and possibilistic c-means.
Now it is possible to write the kernel version of Eq. (15):

1/(m1)

c

khh 2ai nr=1 (uir )m khr + ai2 nr=1 ns=1 (uir )m (uis )m krs
1
uih =
.
(62)


n
m
m
m
2 n
k

2a
(u
)
k
+
a
(u
)
(u
)
k
hh
j
j
r
hr
j
r
j
s
rs
r=1
r=1
s=1
j
j =1
Eq. (62)gives the rule for the update of the membership uih .
In the rst subsection we show the method of the kernelization
of the metric while in the second one the fuzzy c-means in
feature space is shown. The third subsection is devoted to the
kernelized version of the possibilistic c-means.
3.6.1. Kernel fuzzy c-means with kernelization of the metric
The basic idea is to minimize the functional [29,68,69]:
J
(U, V ) =

c
n 


(uih )m 
(xh )
(vi )2 ,

(55)

h=1 i=1

with the probabilistic constraint over the memberships


(Eq. (13)). The procedure for the optimization of J
(U, V )
is again the Picard iteration technique. Minimization of the
functional in Eq. (55) has been proposed only in the case of
a Gaussian kernel K (g) . The reason is that the derivative of
J
(U, V ) with respect to the vi using a Gaussian kernel is
particularly simple since it allows to use the kernel trick:
jK(xh , vi ) (xh vi )
=
K(xh , vi ).
jvi
2

(56)

3.6.3. Possibilistic c-means with the kernelization of the metric


The formulation of the possibilistic c-means PCM-I with
the kernelization of the metric used in Ref. [69] involves the
minimization of the following functional:
J
(U, V ) =

c
n 

h=1 i=1
c


(uih )m 
(xh )
(vi )2

i

i=1

n


(1 uih )m .

(63)

h=1

Minimization leads to
1/(m1)


(xh )
(vi )2
=
1
+
,
u1
ih
i
that can be rewritten, considering a Gaussian kernel, as


1 K(xh , vi ) 1/(m1)
u1
=
1
+
2
.
ih
i

(64)

(65)

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

185

The update of the codevectors follows:

optimize the normalized cut [37]:

n
(uih )m K(xh , vi )xh
vi =
h=1
.
n
m
h=1 (uih ) K(xh , vi )

= cut(S, S)

N cut(S, S)

(66)

4. Spectral clustering

vi S,vj V

Spectral clustering methods [39] have a strong connection


with graph theory [7,83]. A comparison of some spectral clustering methods has been recently proposed in Ref. [12]. Let
X = {x1 , . . . , xn } be the set of patterns to cluster. Starting
from X, we can build a complete, weighted undirected graph
G(V , A) having a set of nodes V = {v1 , . . . , vn } corresponding
to the n patterns and edges dened through the n n adjacency
(also afnity) matrix A. The adjacency matrix for a weighted
graph is given by the matrix whose element aij represents the
weight of the edge connecting nodes i and j . Being an undirected graph, the property aij = aj i holds. Adjacency between
two patterns can be dened as follows:


h(xi , xj )

if i  = j,

otherwise.

(67)

The function h measures the similarity between patterns and


typically a Gaussian function is used:

d(xi , xj )
h(xi , xj ) = exp
,
22


(68)

where d measures the dissimilarity between patterns and 


controls the rapidity of decay of h. This particular choice has
the property that A has only some terms signicantly different
from 0, i.e., it is sparse.
The degree matrix D is the diagonal matrix whose elements
are the degrees of the nodes of G.
dii =

n


aij .

(69)

j =1

In this framework the clustering problem can be seen as a graph


cut problem [7] where one wants to separate a set of nodes
S V from the complementary set S = V \S. The graph cut
problem can be formulated in several ways depending on the
choice of the function to optimize. One of the most popular
functions to optimize is the cut [7]:
=
cut(S, S)

aij .


1
1
, (71)
+
V)
assoc(S, V ) assoc(S,

where the association assoc(S, V ) is also known as the volume


of S:


assoc(S, V ) =
aij vol(S) =
dii .
(72)

The computation of the i is straightforward.

aij =

(70)

vi S,vj S

It is easy to verify that the minimization of this objective function favors partitions containing isolated nodes. To achieve a
better balance in the cardinality of S and S it is suggested to

vi S

There are other denitions of functions to optimize (e.g., the


conductance [13], the normalized association [37], ratio cut
[10]).
The complexity in optimizing these objective functions is
very high (e.g., the optimization of the normalized cut is a NPhard problem [37,84]) and for this reason it has been proposed
to relax it by using spectral concepts of graph analysis. This
relaxation can be formulated by introducing the Laplacian matrix [7]:
L = D A,

(73)

which can be seen as a linear operator on G. In addition to this


denition of Laplacian there are alternative denitions:
Normalized Laplacian LN = D 1/2 LD 1/2 .
Generalized Laplacian LG = D 1 L.
Relaxed Laplacian L = L D.
Each denition is justied by special properties desirable in
a given context. The spectral decomposition of the Laplacian
matrix can give useful information about the properties of the
graph. In particular it can be seen that the second smallest eigenvalue of L is related to the graph cut [8] and the corresponding
eigenvector can cluster together similar patterns [7,37,85].
Spectral approach to clustering has a strong connection with
Laplacian eigenmaps [86]. The dimensionality reduction problem aims to nd a proper low dimensional representation of a
data set in a high dimensional space. In Ref. [86], each node
in the graph, which represents a pattern, is connected just with
nodes corresponding to neighboring patterns and the spectral
decomposition of the Laplacian of the obtained graph permits
to nd a low dimensional representation of X. The authors
point out the close connection with spectral clustering and local
linear embedding [87] providing theoretical and experimental
validations.
4.1. Shi and Malik algorithm
The algorithm proposed by Shi and Malik [37] applies
the concepts of spectral clustering to image segmentation
problems. In this framework each node is a pixel and the
denition of adjacency between them is suitable for image segmentation purposes. In particular, if xi is the position of the ith
pixel and fi a feature vector which takes into account several
of its attributes (e.g., intensity, color and texture information),

186

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

they dene the adjacency as



fi fj 2
aij = exp
221


2
x

x

i
j

exp
222

(3) Compute a normalized version of A, dening this Laplacian:

L = D 1/2 AD 1/2 .

if xi xj  < R,

(74)

otherwise.

Here R has an inuence on how many neighboring pixels can


be connected with a pixel, controlling the sparsity of the adjacency and Laplacian matrices. They provide a proof that the
can be done solving the eigenvalue
minimization of N cut(S, S)
problem for the normalized Laplacian LN . In summary, the
algorithm is composed of these steps:
(1) Construct the graph G starting from the data set X calculating the adjacency between patterns using Eq. (74).
(2) Compute the degree matrix D.
(3) Construct the matrix LN = D 1/2 LD 1/2 .
(4) Compute the eigenvector e2 associated to the second smallest eigenvalue 2 .
(5) Use D 1/2 e2 to segment G.
In the ideal case of two non connected subgraphs, D 1/2 e2
assumes just two values; this allows to cluster together the
components of D 1/2 e2 with the same value. In a real case
the splitting point must be chosen to cluster the components of
D 1/2 e2 and the authors suggest to use the median value, zero
or the value for which the clustering gives the minimum N cut.
The successive partitioning can be made recursively on the obtained sub-graphs or it is possible to use more than one eigenvector. An interesting approach for clustering simultaneously
the data set in more than two clusters can be found in Ref. [88].
4.2. Ng, Jordan and Weiss algorithm
The algorithm that has been proposed by Ng et al. [35] uses
the adjacency matrix A as Laplacian. This denition allows to
consider the eigenvector associated with the largest eigenvalues
as the good one for clustering. This has a computational
advantage since the principal eigenvectors can be computed for
sparse matrices efciently using the power iteration technique.
The idea is the same as in other spectral clustering methods,
i.e., one nds a new representation of patterns on the rst k
eigenvectors of the Laplacian of the graph.
The algorithm is composed of these steps:
(1) Compute the afnity matrix A Rnn :

aij =

xi xj 2
exp
22
0

(2) Construct the matrix D.


if i  = j,
otherwise.

(75)

(76)

(4) Find the k eigenvectors {e1 , . . . , ek } of L associated to the


largest eigenvalues {1 , . . . , k }.
(5) Form the matrix Z by stacking the k eigenvectors in
columns.
(6) Compute the matrix Y by normalizing each of the Zs rows
to have unit length:
zij
yij =
k

2
r=1 zir

(77)

In this way all the original points are mapped into a unit
hypersphere.
(7) In this new representation of the original n points apply
a clustering algorithm that attempts to minimize distortion
such as K-means.
As a criterion to choose  they suggest to use the value
that guarantees the minimum distortion when the clustering
stage is performed on Y . They tested this algorithm on articial
data sets showing the capability of the algorithm to separate
nonlinear structures. Here we show the steps of the algorithm
when applied to the data set in Fig. 1. Once the singular value
decomposition of L is computed, we can see the matrices Z and
Y in Fig. 4 (here obtained with  = 0.4). Once Y is computed,
it is easy to cluster the two groups of points obtaining the result
shown in Fig. 5.
4.3. Other methods
An interesting view of spectral clustering is provided by
Meila et al. [23] who describe it in the framework of Markov
random walks [23] leading to a different interpretation of the
graph cut problem. It is known, from the theory of Markov random walks, that if we construct the stochastic matrix P =D 1 A,
each element pij represents the probability of moving from
node i to node j . In their work they provide an explicit connection between the spectral decomposition of L and P showing
that both have the same solution with eigenvalues of P equal
to 1 i where i are the eigenvalues of L. Moreover, they
propose a method to learn a function of the features able to produce a correct segmentation starting from a segmented image.
An interesting study on spectral clustering has been conducted by Kannan et al. [13]. The authors exploit the objective
function with respect to some articial data sets showing that
there is no objective function able to properly cluster every
data set. In other words there always exists some data set for
which the optimization of a particular objective function has
some drawback. For this reason they propose a bi-criteria objective function. These two objectives are, respectively, based
on the conductance and the ratio between the auto-association
of a subset of nodes S and its volume. Again the relaxation of
this problem is achieved by the decomposition of the Laplacian
of the graph associated to the data set.

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

187

0.6
0.4

0.05

0.2
0.00
y2

z2

0.0

0.05

0.4

0.10
0.8
0.10

0.08

0.06
z1

0.04

0.02

0.80

0.75

0.70
y1

0.65

0.60

Fig. 4. (a) The matrix Z obtained with the rst two eigenvectors of the matrix L. (b) The matrix Y obtained by normalizing the rows of Z clustered by
K-means algorithm with two centroids.

kernel K-means [11]. We introduce a weight matrix W having


weights wk on the diagonal. Recalling that we denote with i
the ith cluster we have that the functional to minimize is the
following:

6
4

x2

2
0

J
(W, V
) =

c 

i=1 xk i

where

vi

0
x1

Fig. 5. The result of the Ng and Jordan algorithm on the ring data set.

xk i wk
(xk )

xk i wk

wk 
(xk ) vi
2 ,

xk i wk
(xk )

si

where we have introduced:



si =
wk .

(78)

(79)

(80)

xk i

5. A unied view of spectral and kernel clustering methods


Recently, a possible connection between unsupervised kernel algorithms and spectral methods has been studied to nd
whether these two seemingly different approaches can be described under a more general framework. The hint for this unifying theory lies the adjacency structure constructed by both
these approaches. In the spectral approach there is an adjacency
between patterns which is the analogous of the kernel functions
in kernel methods.
A direct connection between Kernel PCA and spectral methods has been shown [89,90]. More recently a unifying view
of kernel K-means and spectral clustering methods [911] has
been pointed out. In this section we show explicitly the equivalence between them highlighting that these two approaches
have the same foundation and in particular that both can be
viewed as a matrix trace maximization problem.
5.1. Kernel clustering methods objective
To show the direct equivalence between kernel and spectral
clustering methods we introduce the weighted version of the

Now lets dene the matrix Z having


 1/2
if xk i ,
s
zki = i
0
otherwise.

(81)

Since the columns of Z are mutually orthogonal it is easy to


verify that
si1 = (Z T Z)ii ,

(82)

and that only the diagonal elements are not null.


Now we denote with F the matrix whose columns are the

(xk ). It is easy to verify that the matrix F W yields a matrix whose columns are the wk
(xk ). Moreover, the expression
F W ZZ T gives a matrix having n columns which are the nearest centroids in feature space of the
(xk ).
Thus, substituting Eq. (79) in Eq. (78) we obtain the following matrix expression for J
(W, V
):
J
(W, V
) =

n


wk Fk (F W ZZ T )k 2 .

(83)

k=1

Here the dot has to be considered as a selection of the kth


column of the matrices. Introducing the matrix Y = W 1/2 Z,

188

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

which is orthonormal (Y T Y = I ), the objective function can be


rewritten as

n


J (W, V ) =

of the minimization of the normalized cut which is one of the


most used objective functions, the functional to minimize is
J (S1 , . . . , Sc ) = tr(Y T D 1/2 AD 1/2 Y ).

wk Fk (F W 1/2 Y Y T W 1/2 )k 2

k=1

(91)

where the norm  F is the Frobenius norm [91]. Using the


fact that AF = tr(AAT ) and the properties of the trace, it
is possible to see that the minimization of the last equation is
equivalent to the maximization of the following [9,10]:

Thus the correspondence with the objective in the kernel K-means imposes to choose Y = D 1/2 Z, W = D and
K = D 1 AD 1 . It is worth noting that for an arbitrary A it is
not guaranteed that D 1 AD 1 is denite positive. In this case
the kernel K-means will not necessarily converge. To cope
with this problem in Ref. [9] the authors propose to enforce
positive deniteness by means of a diagonal shift [94]:

J
(W, V
) = tr(Y T W 1/2 F T F W 1/2 Y ).

K = D 1 + D 1 AD 1 ,

= F W 1/2 F W 1/2 Y Y T 2F ,

(84)

(85)

5.2. Spectral clustering methods objective


Recalling that the denition of association between two sets
of edges S and T of a weighted graph is the following:

aij ,
(86)
assoc(S, T ) =
iS,j T

it is possible to dene many objective functions to optimize in


order to perform clustering. Here, for the sake of simplicity,
we consider just the ratio association problem, where one has
to maximize:
J (S1 , . . . , Sc ) =

c

assoc(Si , Si )

|Si |

i=1

(87)

where |Si | is the size of the ith partition. Now we introduce


the indicator vector zi whose kth value is zero if xk
/ i and
one otherwise. Rewriting the last equation in a matrix form we
obtain the following:
J (S1 , . . . , Sc ) =

c

zT Azi
i

i=1

ziT zi

(88)

Normalizing the zi letting:


yi =

zi
,
T
(zi zi )1/2

(89)

we obtain
J (S1 , . . . , Sc ) =

c


yiT Ayi = tr(Y T AY ).

(90)

i=1

(92)

where  is a positive coefcient large enough to guarantee the


positive deniteness of K. Since the mathematical foundation
of these methods is the same, it is possible to choose which
algorithm to use for clustering choosing, for instance, the approach with the less computational complexity for the particular application.
6. Conclusions
Clustering is a classical problem in pattern recognition. Recently spectral and kernel methods for clustering have provided
new ideas and interpretations to the solution of this problem.
In this paper spectral and kernel methods for clustering have
been reviewed paying attention to fuzzy kernel methods for
clustering and to the connection between kernel and spectral
approaches. Unlike classical partitioning clustering algorithms
they are able to produce nonlinear separating hypersurfaces
among data since they construct an adjacency structure from
data. These methods have been successfully tested on several
benchmarks, but we can nd few applications to real world
problem due to the high computational cost. Therefore an extensive validation on real world applications remains a big challenge for kernel and spectral clustering methods.
Acknowledgments
This work was partially supported by the Italian Ministry
of Education, University, and Research (PRIN 2004 Project
2004062740) and by the University of Genova.
References

5.3. A unied view of the two approaches


Comparing Eqs. (90) and (80) it is possible to see the perfect
equivalence between kernel K-means and the spectral approach
to clustering when one wants to maximize the ratio association. To this end, indeed, it is enough to set the weights in the
weighted kernel K-means equal to one obtaining the classical
kernel K-means. It is possible to obtain more general results
when one wants to optimize other objective functions in the
spectral approach, such as the ratio cut [92], the normalized cut
and the KernighanLin [93] objective. For instance, in the case

[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM
Comput. Surveys 31 (3) (1999) 264323.
[2] R. Xu, D.I.I. Wunsch, Survey of clustering algorithms, IEEE Trans.
Neural Networks 16 (3) (2005) 645678.
[3] R.O. Duda, P.E. Hart, Pattern Classication and Scene Analysis, Wiley,
New York, 1973.
[4] P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy: The Principles and
Practice of Numerical Classication, W.H. Freeman, San Francisco, 1973.
[5] J.H. Ward, Hierarchical grouping to optimize an objective function, J.
Am. Stat. Assoc. 58 (1963) 236244.
[6] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function
Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190


[7] F.R.K. Chung, Spectral Graph Theory, CBMS Regional Conference
Series in Mathematics, vol. 92, American Mathematical Society,
Providence, RI, February 1997.
[8] M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Math. J. 23
(98) (1973) 298305.
[9] I.S. Dhillon, Y. Guan, B. Kulis, Weighted graph cuts without eigenvectors:
a multilevel approach, IEEE Trans. Pattern Anal. Mach. Intell. (2007),
to appear.
[10] I.S. Dhillon, Y. Guan, B. Kulis, A unied view of kernel k-means,
spectral clustering and graph partitioning, Technical Report TR-04-25,
UTCS, 2005.
[11] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and
normalized cuts, in: KDD 04: Proceedings of the Tenth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
ACM Press, New York, NY, USA, 2004, pp. 551556.
[12] D. Verma, M. Meila, A comparison of spectral clustering algorithms,
Technical Report, Department of CSE University of Washington Seattle,
WA, 981952350, 2005.
[13] R. Kannan, S. Vempala, A. Vetta, On clusterings: good, bad, and spectral,
in: Proceedings of the 41st Annual Symposium on the Foundation
of Computer Science, IEEE Computer Society, Silver Spring, MD,
November 2000, pp. 367380.
[14] U. von Luxburg, M. Belkin, O. Bousquet, Consistency of spectral
clustering, Technical Report 134, Max Planck Institute for Biological
Cybernetics, 2004.
[15] U. von Luxburg, O. Bousquet, M. Belkin, Limits of spectral clustering,
in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information
Processing Systems (NIPS), vol. 17, MIT Press, Cambridge, MA, 2005.
[16] H. Zha, X. He, C.H.Q. Ding, M. Gu, H.D. Simon, Spectral relaxation
for k-means clustering, in: NIPS, 2001, pp. 10571064.
[17] J. Leski, Fuzzy c-varieties/elliptotypes clustering in reproducing kernel
hilbert space, Fuzzy Sets and Systems 141 (2) (2004) 259280.
[18] A.S. Have, M.A. Girolami, J. Larsen, Clustering via kernel
decomposition, IEEE Trans. Neural Networks 17 (1) (2006) 256264.
[19] D. Horn, Clustering via Hilbert space, Phys. A Stat. Mech. Appl. 302
(2001) 7079.
[20] F.R. Bach, M.I. Jordan, Learning spectral clustering, Technical
Report UCB/CSD-03-1249, EECS Department, University of California,
Berkeley, 2003.
[21] N. Cristianini, J.S. Taylor, A. Elisseeff, J.S. Kandola, On kernel-target
alignment, in: NIPS, 2001, pp. 367373.
[22] I. Fischer, I. Poland, New methods for spectral clustering, Technical
Report IDSIA-12-04, IDSIA, 2004.
[23] M. Meila, J. Shi, Learning segmentation by random walks, in: NIPS,
2000, pp. 873879.
[24] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Trans.
Pattern Anal. Mach. Intell. 27 (5) (2005) 801804.
[25] V.G. Sigillito, S.P. Wing, L.V. Hutton, K.B. Baker, Classication of radar
returns from the ionosphere using neural networks, Johns Hopkins APL
Tech. Digest 10 (1989) 262266.
[26] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern
separation for medical diagnosis applied to breast cytology, Proc. Natl.
Acad. Sci. USA 87 (1990) 91939196.
[27] R.A. Fisher, The use of multiple measurements in taxonomic problems,
Ann. Eugenics 7 (1936) 179188.
[28] S.-C. Chen, D.-Q. Zhang, Robust image segmentation using FCM with
spatial constraints based on new kernel-induced distance measure, IEEE
Trans. Systems Man Cybernet. B 34 (4) (2004) 19071916.
[29] D.-Q. Zhang, S.-C. Chen, A novel kernelized fuzzy c-means algorithm
with application in medical image segmentation, Artif. Intell. Med. 32
(1) (2004) 3750.
[30] D.-Q. Zhang, S.-C. Chen, Z.-S. Pan, K.-R. Tan, Kernel-based fuzzy
clustering incorporating spatial constraints for image segmentation, in:
International Conference on Machine Learning and Cybernetics, vol. 4,
2003, pp. 21892192.
[31] T. Graepel, K. Obermayer, Fuzzy topographic kernel clustering, in:
W. Brauer (Ed.), Proceedings of the Fifth GI Workshop Fuzzy Neuro
Systems 98, 1998, pp. 9097.

189

[32] X. Tan, S. Chen, Z.H. Zhou, F. Zhang, Robust face recognition from a
single training image per person with kernel-based som-face, in: ISNN
(1), 2004, pp. 858863.
[33] D.S. Satish, C.C. Sekhar, Kernel based clustering and vector quantization
for speech recognition, in: Proceedings of the 2004 14th IEEE Signal
Processing Society Workshop, 2004, pp. 315324.
[34] A. Majid Awan, Mohd Noor Md. Sap, An intelligent system based on
kernel methods for crop yield prediction, in: W.K. Ng, M. Kitsuregawa,
J. Li, K. Chang (Eds.), PAKDD, Lecture Notes in Computer Science,
vol. 3918, 2006, pp. 841846.
[35] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and
an algorithm, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.),
Advances in Neural Information Processing Systems, vol. 14, MIT Press,
Cambridge, MA, 2002.
[36] A. Rahimi, B. Recht, Clustering with normalized cuts is clustering with
a hyperplane, Statist. Learn. Comput. Vis. (2004).
[37] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans.
Pattern Anal. Mach. Intell. 22 (8) (2000) 888905.
[38] A.N. Srivastava, Mixture density Mercer kernels: a method to learn
kernels directly from data, in: SDM, 2004.
[39] N. Cristianini, J.S. Taylor, J.S. Kandola, Spectral kernel methods for
clustering, in: NIPS, 2001, pp. 649655.
[40] I.S. Dhillon, Co-clustering documents and words using bipartite spectral
graph partitioning, in: KDD 01: Proceedings of the Seventh ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, ACM Press, New York, NY, USA, 2001, pp. 269274.
[41] Y. Kluger, R. Basri, J.T. Chang, M. Gerstein, Spectral biclustering of
microarray data: coclustering genes and conditions, Genome Res. 13 (4)
(2003) 703716.
[42] B. Kulis, S. Basu, I.S. Dhillon, R. Mooney, Semi-supervised graph
clustering: a kernel approach, in: ICML 05: Proceedings of the 22nd
International Conference on Machine Learning, ACM Press, New York,
NY, USA, 2005, pp. 457464.
[43] A. Paccanaro, C. Chennubhotla, J.A. Casbon, M.A.S. Saqi, Spectral
clustering of protein sequences, in: International Joint Conference on
Neural Networks, vol. 4, 2003, pp. 30833088.
[44] J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, W.S. Noble, Semisupervised protein classication using cluster kernels, Bioinformatics 21
(15) (2005) 32413247.
[45] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design,
IEEE Trans. Commun. 1 (1980) 8495.
[46] S. Lloyd, Least squares quantization in pcm, IEEE Trans. Inf. Theory
28 (1982) 129137.
[47] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression,
Kluwer, Boston, 1992.
[48] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford
University Press, Oxford, UK, 1996.
[49] J.B. Macqueen, Some methods of classication and analysis of
multivariate observations, in: Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, 1967,
pp. 281297.
[50] T. Kohonen, The self-organizing map, in: Proceedings of the IEEE, vol.
78, 1990, pp. 14641480.
[51] T. Kohonen, Self-organized formation of topologically correct feature
maps, Biol. Cybernet. 43 (1) (1982) 5969.
[52] T. Kohonen, Self-Organizing Maps, Springer, New York, Inc., Secaucus,
NJ, USA, 2001.
[53] H.J. Ritter, T.M. Martinetz, K.J. Schulten, Neuronale Netze, AddisonWesley, Mnchen, Germany, 1991.
[54] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, Neural gas network for
vector quantization and its application to time-series prediction, IEEE
Trans. Neural Networks 4 (4) (1993) 558569.
[55] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering,
IEEE Trans. Fuzzy Systems 1 (2) (1993) 98110.
[56] R. Krishnapuram, J.M. Keller, The possibilistic c-means algorithm:
insights and recommendations, IEEE Trans. Fuzzy Systems 4 (3) (1996)
385393.
[57] P.J. Huber, Robust Statistics, Wiley, New York, 1981.

190

M. Filippone et al. / Pattern Recognition 41 (2008) 176 190

[58] O. Nasraoui, R. Krishnapuram, An improved possibilistic c-means


algorithm with nite rejection and robust scale estimation, in: North
American Fuzzy Information Processing Society Conference, Berkeley,
CA, June 1996.
[59] J. Mercer, Functions of positive and negative type and their connection
with the Theory of integral equations, Proc. R. Soc. London 209 (1909)
415446.
[60] M. Aizerman, E. Braverman, L. Rozonoer, Theoretical foundations of
the potential function method in pattern recognition learning, Automat.
Remote Control 25 (1964) 821837.
[61] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (1995)
273297.
[62] B. Schlkopf, A.J. Smola, K.R. Mller, Nonlinear component analysis as
a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 12991319.
[63] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68
(3) (1950) 337404.
[64] S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman
Scientic & Technical, Harlow, England, 1988.
[65] K.R. Mller, S. Mika, G. Rtsch, K. Tsuda, B. Schlkopf, An introduction
to kernel-based learning algorithms, IEEE Trans. Neural Networks 12
(2) (2001) 181202.
[66] B. Schlkopf, A.J. Smola, Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond, MIT Press,
Cambridge, MA, USA, 2001.
[67] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New
York, Inc., New York, NY, USA, 1995.
[68] Z.D. Wu, W.X. Xie, J.P. Yu, Fuzzy c-means clustering algorithm based
on kernel method, Comput. Intell. Multimedia Appl. (2003).
[69] D.-Q. Zhang, S.-C. Chen, Kernel based fuzzy and possibilistic c-means
clustering, in: Proceedings of the International Conference Articial
Neural Network, Turkey, 2003, pp. 122125.
[70] R. Inokuchi, S. Miyamoto, LVQ clustering and SOM using a kernel
function, in: Proceedings of IEEE International Conference on Fuzzy
Systems, vol. 3, 2004, pp. 14971500.
[71] D. Macdonald, C. Fyfe, The kernel self-organising map, in: Fourth
International Conference on Knowledge-Based Intelligent Engineering
Systems and Allied Technologies 2000, vol. 1, 2000, pp. 317320.
[72] A.K. Qinand, P.N. Suganthan, Kernel neural gas algorithms with
application to cluster analysis, ICPR 17th International Conference
on Pattern Recognition (ICPR04), vol. 4, 2004, pp. 617620.
[73] D.-Q. Zhang, S.-C. Chen, Fuzzy clustering using kernel method, in: The
2002 International Conference on Control and Automation, 2002, ICCA,
2002, pp. 162163.
[74] A.B. Hur, D. Horn, H.T. Siegelmann, V. Vapnik, Support vector
clustering, J. Mach. Learn. Res. 2 (2001) 125137.
[75] M. Girolami, Mercer kernel based clustering in feature space, IEEE
Trans. Neural Networks 13 (3) (2002) 780784.
[76] K. Rose, Deterministic annealing for clustering, compression,
classication, regression, and related optimization problems, Proc. IEEE
86 (11) (1998) 22102239.

[77] A.B. Hur, D. Horn, H.T. Siegelmann, V. Vapnik, A support vector method
for clustering, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), NIPS,
2000, pp. 367373.
[78] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern
Recognition Lett. 20 (1113) (1999) 11911199.
[79] C.J.C. Burges, A tutorial on support vector machines for pattern
recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121167.
[80] D. Lee, An improved cluster labeling method for support vector
clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005)
461464.
[81] J. Yang, V. Estirill-Castro, S.K. Chalup, Support vector clustering through
proximity graph modelling, in: Proceedings of the Ninth International
Conference on Neural Information Processing, vol. 2, 2002, pp. 898903.
[82] J.-H. Chiang, P.-Y. Hao, A new kernel-based fuzzy clustering approach:
support vector clustering with cell growing, IEEE Trans. Fuzzy Systems
11 (4) (2003) 518527.
[83] W.E. Donath, A.J. Hoffman, Lower bounds for the partitioning of graphs,
IBM J. Res. Dev. 17 (1973) 420425.
[84] D. Wagner, F. Wagner, Between min cut and graph bisection, in:
Mathematical Foundations of Computer Science, 1993, pp. 744750.
[85] M. Brand, K. Huang, A unifying theorem for spectral embedding and
clustering, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of the Ninth
International Workshop on Articial Intelligence and Statistics, 2003.
[86] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction
and data representation, Neural Comput. 15 (6) (2003) 13731396.
[87] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally
linear embedding, Science 290 (5500) (2000) 23232326.
[88] S.X. Yu, J. Shi, Multiclass spectral clustering, in: ICCV 03: Proceedings
of the Ninth IEEE International Conference on Computer Vision, IEEE
Computer Society, Silver Spring, MD, Washington, DC, USA, 2003.
[89] Y. Bengio, O. Delalleau, N. Le Roux, J.F. Paiement, P. Vincent, M.
Ouimet, Learning eigenfunctions links spectral embedding and kernel
PCA, Neural Comput. 16 (10) (2004) 21972219.
[90] Y. Bengio, P. Vincent, J.F. Paiement, Spectral clustering and kernel PCA
are learning eigenfunctions, Technical Report 2003s-19, CIRANO, 2003.
[91] G.H. Golub, C.F.V. Loan, Matrix Computations, Johns Hopkins Studies
in Mathematical Sciences, The Johns Hopkins University Press,
Baltimore, MD, October 1996.
[92] P.K. Chan, M. Schlag, J.Y. Zien, Spectral k-way ratio-cut partitioning
and clustering, in: Proceeding of the 1993 Symposium on Research
on Integrated Systems, MIT Press, Cambridge, MA, USA, 1993,
pp. 123142.
[93] B.W. Kernighan, S. Lin, An efcient heuristic procedure for partitioning
graphs, Bell System Tech. J. 49 (1) (1970) 291307.
[94] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster
preserving embedding of nonmetric proximity data, IEEE Trans. Pattern
Anal. Mach. Intell. 25 (12) (2003) 15401551.

About the AuthorMAURIZIO FILIPPONE received a Laurea degree in Physics in 2004 from the University of Genova and now he is pursuing a Ph.D. in
Computer Science at University of Genova. His research interests are focused on Machine Learning techniques and their applications to Bioinformatics and
to time series analysis and forecasting.
About the AuthorFRANCESCO CAMASTRA received Laurea degree in Physics in 1985 from University of Milan and the Ph.D. degree in Computer
Science from University of Genova in 2004. He was the recipient of Eduardo Caianiello Award 2005 and since February 2006 he has been Assistant Professor
at University of Napoli Parthenope. His research interests are Machine Learning and Kernel Methods.
About the AuthorFRANCESCO MASULLI received the Laurea degree in Physics in 1976 from the University of Genova. He is currently an Associate
Professor in Computer Science at the University of Genova. From 1983 to 2001 he has been an Assistant Professor at the University of Genova and from
2001 to 2005 an Associate Professor at University of Pisa. He authored or co-authored more than 120 scientic papers in Machine Learning, Neural Networks,
Fuzzy Systems, Pattern Recognition and Bioinformatics.
About the AuthorSTEFANO ROVETTA received a Laurea degree in Electronic Engineering in 1992 and a Ph.D. degree in Models, Methods and Tools
for Electronic and Electromagnetic Systems in 1996, both from the University of Genova. He is currently an Assistant Professor in Computer Science at the
University of Genova. His current research interests include Machine Learning and Clustering techniques for high-dimensional biomedical data analysis and
document analysis.

You might also like