A survey of kernel and spectral methods for clustering
Maurizio Filippone, Francesco Camastra, Francesco Masulli, Stefano Rovetta
Pattern Recognition
www.elsevier.com/locate/pr
Received 19 October 2006; received in revised form 30 April 2007; accepted 29 May 2007
Abstract
Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the
partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present
a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The
presented kernel clustering methods are kernel versions of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral
clustering arises from concepts in spectral graph theory: the clustering problem is configured as a graph cut problem, where an appropriate
objective function has to be optimized. Since it has been proven that these two seemingly different approaches share the same mathematical
foundation, an explicit proof of the fact that they have the same objective is reported. Besides, fuzzy kernel clustering methods
are presented as extensions of the kernel K-means clustering algorithm.
© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
Keywords: Partitional clustering; Mercer kernels; Kernel clustering; Kernel fuzzy clustering; Spectral clustering
1. Introduction
Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures. Clustering methods [1,2] have been addressed in many contexts and disciplines
such as data mining, document retrieval, image segmentation
and pattern classification. The aim of clustering methods is to group patterns on the basis of a similarity (or dissimilarity) criterion, where groups (or clusters) are sets of similar patterns. Crucial aspects in clustering are pattern representation and the similarity measure. Each pattern is usually represented by a set of features of the system under study. It is very important to notice that a good choice of representation of patterns can lead to improvements in clustering performance. Whether it is possible to choose an appropriate set of features depends on the system under study. Once a representation is fixed, it is possible to choose an appropriate similarity measure among patterns.
Fig. 1. The ring data set: two nonlinearly separable clusters in the (x1, x2) plane.
Many partitioning methods, such as K-means, look for a set of codevectors (or centroids) v_1, ..., v_k minimizing the empirical quantization error

E(V) = \frac{1}{2n} \sum_{i=1}^{k} \sum_{x \in \pi_i} \|x - v_i\|^2,   (3)

where \pi_i denotes the Voronoi set of the codevector v_i, i.e., the set of patterns that are closer to v_i than to any other codevector.
Both the learning rate \epsilon(t) and the width \sigma(t) of the neighborhood function are usually decreased exponentially during training:

\epsilon(t) = \epsilon_i (\epsilon_f / \epsilon_i)^{t/t_{max}},   \sigma(t) = \sigma_i (\sigma_f / \sigma_i)^{t/t_{max}},   (9)

where \epsilon_i, \epsilon_f and \sigma_i, \sigma_f are the initial and final values of the functions \epsilon(t) and \sigma(t).
A final note on the use of the SOM for clustering. The method was originally devised as a tool for embedding multidimensional data into typically two-dimensional spaces, for data visualization. Since then, it has also been frequently used as a clustering method, which was originally not considered appropriate because of the constraints imposed by the topology. However, the topology itself serves an important purpose, namely, that of limiting the flexibility of the mapping in the first training cycles, and gradually increasing it (while decreasing the magnitude of the updates to ensure convergence) as more cycles are performed. The strategy is similar to that of other algorithms, including those described in the following, in the capacity control of the method, which has the effect of avoiding local minima. This accounts for the fast convergence often reported in experimental works.
2.3. Neural gas
Another technique that tries to minimize the distortion error
is the neural gas algorithm [54], based on a soft adaptation rule.
This technique resembles the SOM in the sense that not only the
winner codevector is adapted. It is different in that codevectors
are not constrained to be on a grid, and the adaptation of the
codevectors near the winner is controlled by a criterion based
on distance ranks. Each time a pattern x is presented, all the codevectors v_j are ranked according to their distance from x (the closest obtains the lowest rank). Denoting by \rho_j the rank of the distance between x and the codevector v_j, the update rule is

\Delta v_j = \epsilon(t)\, h_{\lambda}(\rho_j)\, (x - v_j),   (10)

where h_{\lambda} is a function decreasing with the rank (typically h_{\lambda}(\rho) = \exp(-\rho/\lambda)).
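As a small illustration only (not code from the paper), one training run of the rank-based update above can be sketched in Python/NumPy as follows; the function name, parameter values and the exponential decay of \epsilon(t) and \lambda(t), in the spirit of Eq. (9), are all assumptions.

```python
import numpy as np

def neural_gas(X, n_codevectors=10, t_max=5000, eps_i=0.5, eps_f=0.005,
               lam_i=10.0, lam_f=0.01, seed=0):
    """Minimal neural gas sketch: rank-based soft updates of the codevectors."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), n_codevectors, replace=False)].copy()
    for t in range(t_max):
        # exponential decay of learning rate and neighborhood range (cf. Eq. (9))
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)
        lam = lam_i * (lam_f / lam_i) ** (t / t_max)
        x = X[rng.integers(len(X))]                 # present one pattern
        dists = np.linalg.norm(V - x, axis=1)
        ranks = np.argsort(np.argsort(dists))       # 0 = closest codevector
        h = np.exp(-ranks / lam)                    # rank-based neighborhood
        V += eps * h[:, None] * (x - V)             # update rule (cf. Eq. (10))
    return V
```

Each codevector is pulled toward the presented pattern by an amount that decays with its distance rank, so that no grid topology is needed.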
The fuzzy c-means algorithm [6] is a fuzzy relaxation of K-means that minimizes the functional

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \|x_h - v_i\|^2,   (12)

where the memberships u_{ih} \in [0, 1] are subject to the probabilistic constraint

\sum_{i=1}^{c} u_{ih} = 1,   h = 1, ..., n,   (13)

and m > 1 controls the fuzziness of the memberships. The minimization of Eq. (12) under the constraint (13) leads to the iterative updates

u_{ih} = \left[ \sum_{j=1}^{c} \left( \frac{\|x_h - v_i\|}{\|x_h - v_j\|} \right)^{2/(m-1)} \right]^{-1},   (15)

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m x_h}{\sum_{h=1}^{n} (u_{ih})^m}.   (16)
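For concreteness, a minimal sketch (not taken from the paper) of the alternating updates (15) and (16) in Python/NumPy; the function name, the initialization and the convergence test are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Sketch of fuzzy c-means: alternate membership (15) and centroid (16) updates."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # enforce constraint (13)
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # centroid update, Eq. (16)
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # c x n distances
        d = np.fmax(d, 1e-12)                            # avoid division by zero
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)                  # membership update, Eq. (15)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```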
The possibilistic c-means (PCM) [55,56] drops the probabilistic constraint (13) and minimizes one of the following two functionals (PCM-I and PCM-II, respectively):

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \|x_h - v_i\|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (1 - u_{ih})^m,   (17)

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} u_{ih} \|x_h - v_i\|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (u_{ih} \ln u_{ih} - u_{ih}).   (18)

The minimization of Eqs. (17) and (18) with respect to the u_{ih} leads, respectively, to the following equations:

u_{ih} = \left[ 1 + \left( \frac{\|x_h - v_i\|^2}{\eta_i} \right)^{1/(m-1)} \right]^{-1},   (19)

u_{ih} = \exp\left( - \frac{\|x_h - v_i\|^2}{\eta_i} \right).   (20)
The constraint on the memberships u_{ih} \in [0, 1] is automatically satisfied given the form assumed by Eqs. (19) and (20). The updates of the centroids for PCM-I and PCM-II are, respectively:
v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m x_h}{\sum_{h=1}^{n} (u_{ih})^m},   (21)

v_i = \frac{\sum_{h=1}^{n} u_{ih} x_h}{\sum_{h=1}^{n} u_{ih}}.   (22)
The parameter \eta_i regulates the trade-off between the two terms in Eqs. (17) and (18) and is related to the width of the clusters. The authors suggest estimating \eta_i for PCM-I using the formula

\eta_i = \gamma \frac{\sum_{h=1}^{n} (u_{ih})^m \|x_h - v_i\|^2}{\sum_{h=1}^{n} (u_{ih})^m},   (23)

which is a weighted mean of the intracluster distances of the ith cluster; the constant \gamma is typically set to one. The parameter \eta_i can also be estimated with scale estimation techniques developed in the robust clustering literature for M-estimators [57,58]. The value of \eta_i can be updated at each step of the algorithm or kept fixed for all iterations. The former approach can lead to instabilities, since the update equations have been derived considering \eta_i fixed. In the latter case a good estimate of \eta_i can be obtained only starting from an approximate solution. For this reason the possibilistic c-means is often run as a refining step after a fuzzy c-means.
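As a worked illustration (not the authors' implementation), a PCM-I refinement step based on Eqs. (19), (21) and (23), initialized from a fuzzy c-means solution as recommended above, might look as follows; all names are illustrative and \eta_i is estimated once and then kept fixed.

```python
import numpy as np

def pcm_refine(X, V, U_init, m=2.0, gamma=1.0, n_iter=50):
    """Sketch of possibilistic c-means (PCM-I) refinement, starting from an FCM solution."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)    # c x n squared distances
    Um = U_init ** m
    eta = gamma * (Um * d2).sum(axis=1) / Um.sum(axis=1)       # Eq. (23), kept fixed
    U = U_init.copy()
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # Eq. (19)
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)           # Eq. (21)
    return U, V, eta
```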
3. Kernel clustering methods
In machine learning, the use of kernel functions [59] was introduced by Aizerman et al. [60] in 1964. In 1995 Cortes and Vapnik introduced support vector machines (SVMs) [61], which perform better than other classification algorithms in several problems. The success of SVMs has led to extending the use of kernels to other learning algorithms (e.g., kernel PCA [62]). The choice of the kernel is crucial to incorporate a priori knowledge on the application, for which it is possible to design ad hoc kernels.
3.1. Mercer kernels
We recall the definition of Mercer kernels [63,64], considering, for the sake of simplicity, vectors in R^d instead of C^d.
A Mercer kernel is a symmetric function K : X \times X \to R such that

\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \geq 0   for all n \geq 2, x_1, ..., x_n \in X and c_r \in R, r = 1, ..., n.   (24)

Every Mercer kernel can be written as a dot product in a feature space:

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),   (25)
where \Phi : X \to F performs a mapping from the input space X to a high dimensional feature space F. One of the most relevant aspects in applications is that it is possible to compute Euclidean distances in F without knowing \Phi explicitly. This can be done using the so-called distance kernel trick [62,65]:
(xi )
(xj )2
= (
(xi )
(xj )) (
(xi )
(xj ))
=
(xi )
(xi ) +
(xj )
(xj ) 2
(xi )
(xj )
= K(xi , xi ) + K(xj , xj ) 2K(xi , xj )
(26)
(27)
(28)
Two widely used Mercer kernels are the following.

Polynomial of degree p:
K^{(p)}(x_i, x_j) = (1 + x_i \cdot x_j)^p,   p \in N.   (29)

Gaussian:
K^{(g)}(x_i, x_j) = \exp\left( - \frac{\|x_i - x_j\|^2}{2\sigma^2} \right),   \sigma \in R.   (30)
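As a small illustration with assumed function names (not code from the paper), the distance kernel trick of Eqs. (26)-(28), combined with the Gaussian kernel of Eq. (30), can be written as:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian Mercer kernel, cf. Eq. (30)."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def feature_space_distance2(xi, xj, kernel):
    """Squared Euclidean distance between Phi(xi) and Phi(xj), cf. Eqs. (26)-(28)."""
    return kernel(xi, xi) + kernel(xj, xj) - 2.0 * kernel(xi, xj)

# The mapping Phi induced by the Gaussian kernel is never computed explicitly.
x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
d2 = feature_space_distance2(x1, x2, gaussian_kernel)
```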
A kernel implicitly defines a new representation of the elements of a set X and possibly, for some of these representations, the clusters can be easily identified. In the literature there are several applications of kernels to clustering. These methods can be broadly divided into three categories, which are based, respectively, on:
- kernelization of the metric [29,68,69];
- clustering in feature space [31,70-73];
- description via support vectors [24,74].
3.2. Kernel K-means

The K-means algorithm can be performed implicitly in feature space. In analogy with the input space identity

\|x_i - x_j\|^2 = x_i \cdot x_i + x_j \cdot x_j - 2\, x_i \cdot x_j,   (34)

distances between mapped patterns are obtained through the kernel trick of Eq. (28). The centroid of the ith cluster \pi_i in feature space is the mean of the mapped patterns assigned to it,

v_i^{\Phi} = \frac{1}{|\pi_i|} \sum_{x \in \pi_i} \Phi(x),   (35)

and, although v_i^{\Phi} cannot be computed explicitly, its distance from any \Phi(x) can be expressed in terms of kernel evaluations only.
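To make this concrete, here is a hedged sketch (assumed names, not the paper's code) of the squared distance between \Phi(x_h) and the centroid of Eq. (35), computed from the Gram matrix alone:

```python
import numpy as np

def distance2_to_cluster_mean(K, h, members):
    """||Phi(x_h) - v_i||^2 where v_i is the mean of Phi(x) over the cluster (Eq. (35)).

    K is the n x n Gram matrix, h the index of the pattern, members the indices
    of the patterns currently assigned to cluster i.
    """
    members = np.asarray(members)
    first = K[h, h]
    second = 2.0 * K[h, members].mean()
    third = K[np.ix_(members, members)].mean()
    return first - second + third

# A kernel K-means sweep then simply reassigns every pattern to the cluster
# whose feature space mean is closest according to this quantity.
```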
A soft assignment of the patterns to the centroids can also be adopted, minimizing the feature space distortion

\sum_{h=1}^{n} \sum_{i=1}^{c} u_{ih} \|\Phi(x_h) - v_i^{\Phi}\|^2   (39)

while the fuzziness of the memberships u_{ih} is progressively reduced during the annealing; at the end of the procedure the memberships become crisp, and therefore a tessellation of the feature space is found. This linear partitioning in F corresponds, back in the input space, to a nonlinear partitioning of the input space.

3.3. Kernel SOM

The SOM can be formulated in feature space as well. Each codevector is represented as a linear combination of the mapped patterns,

v_j^{\Phi} = \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h),   (40)

where, in the crisp case, \gamma_{jh} is one if x_h belongs to the Voronoi set of v_j^{\Phi} and zero otherwise. Distances between \Phi(x) and the codevectors, and hence the winner, can be computed through the kernel trick, since the squared norms of the codevectors reduce to terms of the form \sum_{r} \sum_{s} \gamma_{jr} \gamma_{js} k_{rs}. The SOM update of the winner and of its neighbors,

v_j^{\Phi} = v_j^{\Phi} + \epsilon(t)\, h(d_{rs})\, (\Phi(x) - v_j^{\Phi}),   (43)

where h(d_{rs}) is the neighborhood function evaluated on the grid distance between the updated unit and the winner, cannot be carried out explicitly, but it translates into an update of the coefficients:

\sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) = \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) + \epsilon(t)\, h(d_{rs}) \left( \Phi(x) - \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) \right),   (44)

so that only the coefficients \gamma_{jh} need to be stored and updated.
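A minimal sketch of this coefficient-based bookkeeping is given below; it is an illustration under assumptions (a 1-D map, a Gaussian neighborhood, fixed learning rate), not the original implementation, and all names are invented.

```python
import numpy as np

def kernel_som(K, n_units=10, t_max=2000, eps=0.3, sigma=1.0, seed=0):
    """Kernel SOM sketch: codevectors v_j = sum_h gamma[j, h] Phi(x_h), cf. Eq. (40)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    gamma = rng.random((n_units, n))
    gamma /= gamma.sum(axis=1, keepdims=True)
    grid = np.arange(n_units)                        # 1-D grid of units
    for t in range(t_max):
        h = rng.integers(n)                          # present pattern x_h
        # ||Phi(x_h) - v_j||^2 via the kernel trick
        d2 = K[h, h] - 2.0 * gamma @ K[:, h] + np.einsum('jr,rs,js->j', gamma, K, gamma)
        winner = np.argmin(d2)
        neigh = np.exp(-((grid - winner) ** 2) / (2.0 * sigma ** 2))
        # v_j <- v_j + eps*neigh*(Phi(x_h) - v_j), rewritten on the coefficients (Eq. (44))
        gamma *= (1.0 - eps * neigh)[:, None]
        gamma[:, h] += eps * neigh
    return gamma
```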
3.5. One class SVM

This approach provides a support vector description of the data in feature space [74,77,78]. The idea is to use kernels to project data into a feature space and then to find the smallest sphere enclosing almost all data, namely not including outliers. Formally, one looks for the radius R and the center v of the smallest enclosing sphere in feature space, allowing violations through slack variables \xi_j \geq 0. The constraint is

\|\Phi(x_j) - v\|^2 \leq R^2 + \xi_j   for all j.   (47)

The problem is solved by introducing a Lagrangian (48) in which the slack variables are penalized through a term C \sum_j \xi_j and nonnegative multipliers \beta_j are associated with the constraints (47). The Karush-Kuhn-Tucker conditions include, in particular,

(R^2 + \xi_j - \|\Phi(x_j) - v\|^2)\, \beta_j = 0,   (50)

so that only the points lying on or outside the sphere (the support vectors and the outliers) have nonzero multipliers. In Fig. 3 it is possible to see the ability of this algorithm to find the smallest enclosing sphere without outliers.
3.5.1. Support vector clustering

Once the boundaries in input space are found, a labeling procedure is necessary in order to complete the clustering. In Ref. [74] the cluster assignment procedure follows a simple geometric idea: any path connecting a pair of points belonging to different clusters must exit from the enclosing sphere in feature space. Denoting by Y the image in feature space of one such path and by y the elements of Y, it results that R(y) > R for some y. Thus it is possible to define an adjacency structure of the form

A_{ij} = 1 if R(y) < R for all y \in Y, and A_{ij} = 0 otherwise.   (53)
Clusters are simply the connected components of the graph with the adjacency matrix just defined. In the implementation in Ref. [77] the check is made by sampling the line segment Y at 20 equidistant points. There are some modifications of this labeling algorithm (e.g., Refs. [80,81]) that improve its performance. An improved version of the SVC algorithm, with an application to handwritten digit recognition, can be found in Ref. [82].
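A hedged sketch of this labeling step follows; `radius` stands for the trained R(.) function of the one class SVM, the 20-point sampling follows Ref. [77], and all names are assumptions rather than the authors' code.

```python
import numpy as np

def svc_labels(X, radius, R, n_samples=20):
    """Connected-components labeling for support vector clustering.

    Two points are adjacent if every sampled point of the segment joining them
    stays inside the sphere, i.e. radius(y) < R for all samples y (Eq. (53)).
    """
    n = len(X)
    ts = np.linspace(0.0, 1.0, n_samples)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            segment = X[i] + ts[:, None] * (X[j] - X[i])
            adj[i, j] = adj[j, i] = all(radius(y) < R for y in segment)
    # connected components by a simple graph traversal
    labels, current = -np.ones(n, dtype=int), 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        queue = [start]
        labels[start] = current
        while queue:
            k = queue.pop()
            for nb in np.flatnonzero(adj[k]):
                if labels[nb] < 0:
                    labels[nb] = current
                    queue.append(nb)
        current += 1
    return labels
```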
Fig. 3. One class SVM applied to two data sets with outliers. The gray line shows the projection in input space of the smallest enclosing sphere in feature space. In (a) a linear kernel and in (b) a Gaussian kernel have been used.
A related kernel clustering algorithm alternates the computation of the Voronoi sets in feature space and of the centers, until no center changes anymore. Moreover, in order to introduce robustness against outliers, the authors propose to compute the centers by means of a one class SVM applied to each Voronoi set:

(3) Compute the Voronoi sets \pi_i(\Phi) in feature space.
(4) Apply the one class SVM to each \pi_i(\Phi) and assign the center obtained to v_i^{\Phi}.
(5) Go to step (2) until no v_i^{\Phi} changes.
(6) Return the feature space codebook.
3.6. Kernel fuzzy clustering methods

Here we show some kernelized versions of the fuzzy c-means algorithms, in particular the fuzzy and the possibilistic c-means. In the first subsection we describe the method based on the kernelization of the metric, in the second one the fuzzy c-means in feature space; the third subsection is devoted to the kernelized version of the possibilistic c-means.

3.6.1. Kernel fuzzy c-means with kernelization of the metric

The basic idea is to minimize the functional [29,68,69]

J^{\Phi}(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \|\Phi(x_h) - \Phi(v_i)\|^2,   (55)

where the centroids v_i live in the input space and the distances are computed through the kernel trick,

\|\Phi(x_h) - \Phi(v_i)\|^2 = K(x_h, x_h) + K(v_i, v_i) - 2 K(x_h, v_i),   (56)

which for a Gaussian kernel reduces to 2(1 - K(x_h, v_i)). The minimization with respect to the memberships and the centroids leads, for the Gaussian kernel, to the updates

u_{ih} = \left[ \sum_{j=1}^{c} \left( \frac{1 - K(x_h, v_i)}{1 - K(x_h, v_j)} \right)^{1/(m-1)} \right]^{-1},   (57)

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)\, x_h}{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)}.   (58)
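For illustration, a minimal sketch of the alternating updates (57)-(58) with a Gaussian kernel; function and variable names are assumptions, not the authors' code.

```python
import numpy as np

def kernel_fcm_metric(X, c=3, m=2.0, sigma=1.0, n_iter=100, seed=0):
    """Sketch of fuzzy c-means with kernelization of the metric (Gaussian kernel)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)      # c x n
        Kxv = np.exp(-d2 / (2.0 * sigma ** 2))                       # K(x_h, v_i)
        one_minus = np.fmax(1.0 - Kxv, 1e-12)
        ratio = (one_minus[:, None, :] / one_minus[None, :, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)                                  # Eq. (57)
        W = (U ** m) * Kxv                                           # (u_ih)^m K(x_h, v_i)
        V = (W @ X) / W.sum(axis=1, keepdims=True)                   # Eq. (58)
    return U, V
```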
3.6.2. Fuzzy c-means in feature space

The fuzzy c-means can also be formulated directly in feature space, minimizing the functional

J^{\Phi}(U, V^{\Phi}) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \|\Phi(x_h) - v_i^{\Phi}\|^2.   (59)

Setting to zero the derivatives with respect to the centroids v_i^{\Phi} gives

v_i^{\Phi} = \frac{\sum_{h=1}^{n} (u_{ih})^m \Phi(x_h)}{\sum_{h=1}^{n} (u_{ih})^m} = a_i \sum_{h=1}^{n} (u_{ih})^m \Phi(x_h),   (60)

which is the kernel version of Eq. (16). For simplicity of notation we used

a_i^{-1} = \sum_{r=1}^{n} (u_{ir})^m.   (61)

Now it is possible to write the kernel version of Eq. (15). Substituting Eq. (60) into the distances \|\Phi(x_h) - v_i^{\Phi}\|^2, and denoting by k_{rs} = K(x_r, x_s) the entries of the Gram matrix, we obtain

u_{ih} = \left[ \sum_{j=1}^{c} \left( \frac{k_{hh} - 2 a_i \sum_{r=1}^{n} (u_{ir})^m k_{hr} + a_i^2 \sum_{r=1}^{n} \sum_{s=1}^{n} (u_{ir})^m (u_{is})^m k_{rs}}{k_{hh} - 2 a_j \sum_{r=1}^{n} (u_{jr})^m k_{hr} + a_j^2 \sum_{r=1}^{n} \sum_{s=1}^{n} (u_{jr})^m (u_{js})^m k_{rs}} \right)^{1/(m-1)} \right]^{-1}.   (62)

Eq. (62) gives the rule for the update of the memberships u_{ih}; note that it involves only the entries of the Gram matrix, so that \Phi never needs to be computed explicitly.
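A compact sketch of the Gram-matrix-only membership update of Eq. (62) is given below; the function name and the numerical safeguards are assumptions.

```python
import numpy as np

def kernel_fcm_feature_space_memberships(K, U, m=2.0):
    """One membership update in feature space, cf. Eq. (62): only the Gram matrix K is needed."""
    Um = U ** m                                         # c x n
    a = 1.0 / Um.sum(axis=1)                            # Eq. (61)
    khh = np.diag(K)                                    # n
    cross = Um @ K                                      # c x n: sum_r (u_ir)^m k_rh
    quad = np.einsum('ir,rs,is->i', Um, K, Um)          # c: sum_rs (u_ir)^m (u_is)^m k_rs
    # squared distances ||Phi(x_h) - v_i^Phi||^2 expressed through kernels
    d2 = khh[None, :] - 2.0 * a[:, None] * cross + (a ** 2 * quad)[:, None]
    d2 = np.fmax(d2, 1e-12)
    ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                      # new memberships u_ih
```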
3.6.3. Kernel possibilistic c-means

The possibilistic c-means can be kernelized in the same way through the kernelization of the metric, minimizing the functional

J^{\Phi}(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \|\Phi(x_h) - \Phi(v_i)\|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (1 - u_{ih})^m.   (63)
Minimization leads to

u_{ih} = \left[ 1 + \left( \frac{\|\Phi(x_h) - \Phi(v_i)\|^2}{\eta_i} \right)^{1/(m-1)} \right]^{-1},   (64)

that can be rewritten, considering a Gaussian kernel, as

u_{ih} = \left[ 1 + \left( \frac{2 (1 - K(x_h, v_i))}{\eta_i} \right)^{1/(m-1)} \right]^{-1}.   (65)
The update of the centroids is the same as in the fuzzy case with kernelization of the metric:

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)\, x_h}{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)}.   (66)
4. Spectral clustering
Spectral clustering methods are based on concepts of spectral graph theory. Given a data set, a weighted undirected graph G = (V, E) is built, where each node corresponds to a pattern and the edge weights are collected in an affinity matrix A with entries

a_{ij} = h(x_i, x_j) if i \neq j, and a_{ij} = 0 otherwise,   (67)

where h is a similarity function between patterns. The degree of node i is

d_i = \sum_{j=1}^{n} a_{ij}.   (69)

The cut between a subset of nodes S \subset V and its complement \bar{S} is defined as

cut(S, \bar{S}) = \sum_{v_i \in S, v_j \in \bar{S}} a_{ij}.   (70)

It is easy to verify that the minimization of this objective function favors partitions containing isolated nodes. To achieve a better balance in the cardinality of S and \bar{S} it is suggested to minimize the normalized cut

Ncut(S, \bar{S}) = cut(S, \bar{S}) \left( \frac{1}{assoc(S, V)} + \frac{1}{assoc(\bar{S}, V)} \right),   (71)

where assoc(S, V) = \sum_{v_i \in S, v_j \in V} a_{ij}.
A common choice for the similarity function, used for instance by Shi and Malik [37] in image segmentation, is a Gaussian of the distance restricted to nearby points:

a_{ij} = \exp\left( - \frac{\|x_i - x_j\|^2}{2\sigma^2} \right) if \|x_i - x_j\| < R, and a_{ij} = 0 otherwise.   (74)

The minimization of the normalized cut is then relaxed and reduced to an eigenvalue problem involving the Laplacian of the graph. The algorithm proposed by Ng, Jordan and Weiss [35] can be summarized in the following steps:
(1) Form the affinity matrix A with entries
a_{ij} = \exp\left( - \frac{\|x_i - x_j\|^2}{2\sigma^2} \right) if i \neq j, and a_{ii} = 0.   (75)
(2) Build the diagonal matrix D of the node degrees and compute
L = D^{-1/2} A D^{-1/2}.   (76)
(3) Find the eigenvectors z_1, ..., z_c of L associated with the c largest eigenvalues.
(4) Form the matrix Z \in R^{n \times c} having these eigenvectors as columns.
(5) Obtain the matrix Y from Z by normalizing each of its rows to unit length:
y_{ij} = z_{ij} \Big/ \left( \sum_{r=1}^{c} z_{ir}^2 \right)^{1/2}.   (77)
(6) Treat each row of Y as a point in R^c.
In this way all the original points are mapped into a unit
hypersphere.
(7) In this new representation of the original n points apply
a clustering algorithm that attempts to minimize distortion
such as K-means.
As a criterion to choose \sigma they suggest to use the value that guarantees the minimum distortion when the clustering stage is performed on Y. They tested this algorithm on artificial data sets, showing the capability of the algorithm to separate nonlinear structures. Here we show the steps of the algorithm when applied to the data set in Fig. 1. Once the singular value decomposition of L is computed, we can see the matrices Z and Y in Fig. 4 (here obtained with \sigma = 0.4). Once Y is computed, it is easy to cluster the two groups of points, obtaining the result shown in Fig. 5.
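The whole pipeline can be sketched as follows; this is an illustrative implementation with assumed names, using a tiny K-means written inline rather than any specific library routine.

```python
import numpy as np

def ng_jordan_weiss(X, c=2, sigma=0.4, n_kmeans_iter=50, seed=0):
    """Sketch of the Ng-Jordan-Weiss spectral clustering algorithm (Eqs. (75)-(77))."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                              # affinity, Eq. (75)
    d = A.sum(axis=1)
    Dmh = 1.0 / np.sqrt(np.fmax(d, 1e-12))
    L = Dmh[:, None] * A * Dmh[None, :]                   # L = D^{-1/2} A D^{-1/2}, Eq. (76)
    vals, vecs = np.linalg.eigh(L)
    Z = vecs[:, np.argsort(vals)[::-1][:c]]               # c largest eigenvectors
    Y = Z / np.fmax(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)   # Eq. (77)
    # plain K-means on the rows of Y
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(n, c, replace=False)]
    for _ in range(n_kmeans_iter):
        labels = np.argmin(((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers = np.array([Y[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                            for i in range(c)])
    return labels
```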
4.3. Other methods
An interesting view of spectral clustering is provided by
Meila et al. [23] who describe it in the framework of Markov
random walks [23] leading to a different interpretation of the
graph cut problem. It is known, from the theory of Markov random walks, that if we construct the stochastic matrix P = D^{-1} A, each element p_{ij} represents the probability of moving from node i to node j. In their work they provide an explicit connection between the spectral decompositions of L and P, showing that both have the same solution, with the eigenvalues of P equal to 1 - \lambda_i, where \lambda_i are the eigenvalues of L. Moreover, they propose a method to learn a function of the features able to produce a correct segmentation starting from a segmented image.
An interesting study on spectral clustering has been conducted by Kannan et al. [13]. The authors analyze several objective functions on artificial data sets, showing that there is no objective function able to properly cluster every data set: there always exists some data set for which the optimization of a particular objective function has some drawback. For this reason they propose a bi-criteria objective function. The two criteria are, respectively, based on the conductance and on the ratio between the auto-association of a subset of nodes S and its volume. Again the relaxation of this problem is achieved through the decomposition of the Laplacian of the graph associated with the data set.
Fig. 4. (a) The matrix Z obtained with the first two eigenvectors of the matrix L. (b) The matrix Y obtained by normalizing the rows of Z, clustered by the K-means algorithm with two centroids.
Fig. 5. The result of the Ng and Jordan algorithm on the ring data set.

5. A unified view of kernel and spectral clustering methods

The connection between kernel and spectral clustering can be made explicit through the weighted kernel K-means objective [9]

J(W, V) = \sum_{i=1}^{c} \sum_{x_k \in \pi_i} w_k \|\Phi(x_k) - v_i\|^2,   (78)

where the w_k are nonnegative weights and the weighted centroids are

v_i = \frac{\sum_{x_k \in \pi_i} w_k \Phi(x_k)}{s_i},   (79)

with

s_i = \sum_{x_k \in \pi_i} w_k.   (80)

Let F be the matrix whose kth column is \Phi(x_k), W = diag(w_1, ..., w_n), and Z \in R^{n \times c} the normalized membership matrix with entries

z_{ki} = s_i^{-1/2} if x_k \in \pi_i, and z_{ki} = 0 otherwise.   (81)
(xk ). It is easy to verify that the matrix F W yields a matrix whose columns are the wk
(xk ). Moreover, the expression
F W ZZ T gives a matrix having n columns which are the nearest centroids in feature space of the
(xk ).
Thus, substituting Eq. (79) in Eq. (78), we obtain the following matrix expression for J(W, V):

J(W, V) = \sum_{k=1}^{n} w_k \|F_k - (F W Z Z^T)_k\|^2,   (83)

where F_k and (F W Z Z^T)_k denote the kth columns of the respective matrices.
Expanding Eq. (83) and introducing the matrix Y = W^{1/2} Z, whose columns are orthonormal, the minimization of the weighted kernel K-means objective turns out to be equivalent to the maximization of the trace

J(W, V) = tr(Y^T W^{1/2} F^T F W^{1/2} Y),   (84)

in which F^T F is the kernel matrix. The normalized cut objective of Eq. (71) can be cast in the same form; the correspondence with the objective of the kernel K-means then imposes to choose Y = D^{1/2} Z, W = D and K = D^{-1} A D^{-1}. It is worth noting that for an arbitrary A it is not guaranteed that D^{-1} A D^{-1} is positive definite; in this case the kernel K-means will not necessarily converge. To cope with this problem, in Ref. [9] the authors propose to enforce positive definiteness by means of a diagonal shift [94]:

K = \sigma D^{-1} + D^{-1} A D^{-1},   (85)

where \sigma is a coefficient chosen large enough to make K positive definite. Other graph cut objectives admit the same treatment. For instance, the ratio association

J(S_1, ..., S_c) = \sum_{i=1}^{c} \frac{assoc(S_i, S_i)}{|S_i|}   (87)

can be rewritten in terms of the cluster indicator vectors z_i as

J(S_1, ..., S_c) = \sum_{i=1}^{c} \frac{z_i^T A z_i}{z_i^T z_i};   (88)

normalizing the indicator vectors,

\tilde{z}_i = \frac{z_i}{(z_i^T z_i)^{1/2}},   (89)

we obtain

J(S_1, ..., S_c) = \sum_{i=1}^{c} \tilde{z}_i^T A \tilde{z}_i,   (90)

which is a trace maximization of the same kind as Eq. (84), obtained with uniform weights and K = A.
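To illustrate the correspondence, here is a sketch (assumed names, not the authors' code) of weighted kernel K-means run with W = D and the shifted kernel K = \sigma D^{-1} + D^{-1} A D^{-1} of Eq. (85), built from a graph affinity matrix:

```python
import numpy as np

def weighted_kernel_kmeans(K, w, c=2, n_iter=50, seed=0):
    """Weighted kernel K-means sketch: minimizes Eq. (78) using only K and the weights w."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(c, size=n)
    for _ in range(n_iter):
        d2 = np.zeros((n, c))
        for i in range(c):
            idx = np.flatnonzero(labels == i)
            if idx.size == 0:
                d2[:, i] = np.inf
                continue
            wi = w[idx]
            s = wi.sum()                                   # s_i, Eq. (80)
            second = (K[:, idx] * wi).sum(axis=1) / s
            third = wi @ K[np.ix_(idx, idx)] @ wi / s ** 2
            d2[:, i] = np.diag(K) - 2.0 * second + third   # ||Phi(x_k) - v_i||^2
        labels = np.argmin(d2, axis=1)
    return labels

def spectral_via_kernel_kmeans(A, c=2, shift=1.0):
    """Build K = shift*D^{-1} + D^{-1} A D^{-1} (Eq. (85)) and cluster with W = D."""
    d = A.sum(axis=1)
    Dinv = 1.0 / np.fmax(d, 1e-12)
    K = np.diag(shift * Dinv) + Dinv[:, None] * A * Dinv[None, :]
    return weighted_kernel_kmeans(K, w=d, c=c)
```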
References

[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999) 264-323.
[2] R. Xu, D.I.I. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Networks 16 (3) (2005) 645-678.
[3] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[4] P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman, San Francisco, 1973.
[5] J.H. Ward, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc. 58 (1963) 236-244.
[6] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[32] X. Tan, S. Chen, Z.H. Zhou, F. Zhang, Robust face recognition from a single training image per person with kernel-based SOM-face, in: ISNN (1), 2004, pp. 858-863.
[33] D.S. Satish, C.C. Sekhar, Kernel based clustering and vector quantization for speech recognition, in: Proceedings of the 2004 14th IEEE Signal Processing Society Workshop, 2004, pp. 315-324.
[34] A. Majid Awan, Mohd Noor Md. Sap, An intelligent system based on kernel methods for crop yield prediction, in: W.K. Ng, M. Kitsuregawa, J. Li, K. Chang (Eds.), PAKDD, Lecture Notes in Computer Science, vol. 3918, 2006, pp. 841-846.
[35] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002.
[36] A. Rahimi, B. Recht, Clustering with normalized cuts is clustering with a hyperplane, Statist. Learn. Comput. Vis. (2004).
[37] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888-905.
[38] A.N. Srivastava, Mixture density Mercer kernels: a method to learn kernels directly from data, in: SDM, 2004.
[39] N. Cristianini, J.S. Taylor, J.S. Kandola, Spectral kernel methods for clustering, in: NIPS, 2001, pp. 649-655.
[40] I.S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2001, pp. 269-274.
[41] Y. Kluger, R. Basri, J.T. Chang, M. Gerstein, Spectral biclustering of microarray data: coclustering genes and conditions, Genome Res. 13 (4) (2003) 703-716.
[42] B. Kulis, S. Basu, I.S. Dhillon, R. Mooney, Semi-supervised graph clustering: a kernel approach, in: ICML '05: Proceedings of the 22nd International Conference on Machine Learning, ACM Press, New York, NY, USA, 2005, pp. 457-464.
[43] A. Paccanaro, C. Chennubhotla, J.A. Casbon, M.A.S. Saqi, Spectral clustering of protein sequences, in: International Joint Conference on Neural Networks, vol. 4, 2003, pp. 3083-3088.
[44] J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, W.S. Noble, Semi-supervised protein classification using cluster kernels, Bioinformatics 21 (15) (2005) 3241-3247.
[45] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. 1 (1980) 84-95.
[46] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (1982) 129-137.
[47] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Kluwer, Boston, 1992.
[48] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1996.
[49] J.B. Macqueen, Some methods of classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[50] T. Kohonen, The self-organizing map, in: Proceedings of the IEEE, vol. 78, 1990, pp. 1464-1480.
[51] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43 (1) (1982) 59-69.
[52] T. Kohonen, Self-Organizing Maps, Springer, New York, Inc., Secaucus, NJ, USA, 2001.
[53] H.J. Ritter, T.M. Martinetz, K.J. Schulten, Neuronale Netze, Addison-Wesley, München, Germany, 1991.
[54] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, Neural gas network for vector quantization and its application to time-series prediction, IEEE Trans. Neural Networks 4 (4) (1993) 558-569.
[55] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems 1 (2) (1993) 98-110.
[56] R. Krishnapuram, J.M. Keller, The possibilistic c-means algorithm: insights and recommendations, IEEE Trans. Fuzzy Systems 4 (3) (1996) 385-393.
[57] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[77] A.B. Hur, D. Horn, H.T. Siegelmann, V. Vapnik, A support vector method for clustering, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), NIPS, 2000, pp. 367-373.
[78] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognition Lett. 20 (11-13) (1999) 1191-1199.
[79] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121-167.
[80] D. Lee, An improved cluster labeling method for support vector clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 461-464.
[81] J. Yang, V. Estirill-Castro, S.K. Chalup, Support vector clustering through proximity graph modelling, in: Proceedings of the Ninth International Conference on Neural Information Processing, vol. 2, 2002, pp. 898-903.
[82] J.-H. Chiang, P.-Y. Hao, A new kernel-based fuzzy clustering approach: support vector clustering with cell growing, IEEE Trans. Fuzzy Systems 11 (4) (2003) 518-527.
[83] W.E. Donath, A.J. Hoffman, Lower bounds for the partitioning of graphs, IBM J. Res. Dev. 17 (1973) 420-425.
[84] D. Wagner, F. Wagner, Between min cut and graph bisection, in: Mathematical Foundations of Computer Science, 1993, pp. 744-750.
[85] M. Brand, K. Huang, A unifying theorem for spectral embedding and clustering, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[86] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373-1396.
[87] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323-2326.
[88] S.X. Yu, J. Shi, Multiclass spectral clustering, in: ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, IEEE Computer Society, Silver Spring, MD, Washington, DC, USA, 2003.
[89] Y. Bengio, O. Delalleau, N. Le Roux, J.F. Paiement, P. Vincent, M. Ouimet, Learning eigenfunctions links spectral embedding and kernel PCA, Neural Comput. 16 (10) (2004) 2197-2219.
[90] Y. Bengio, P. Vincent, J.F. Paiement, Spectral clustering and kernel PCA are learning eigenfunctions, Technical Report 2003s-19, CIRANO, 2003.
[91] G.H. Golub, C.F.V. Loan, Matrix Computations, Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press, Baltimore, MD, October 1996.
[92] P.K. Chan, M. Schlag, J.Y. Zien, Spectral k-way ratio-cut partitioning and clustering, in: Proceeding of the 1993 Symposium on Research on Integrated Systems, MIT Press, Cambridge, MA, USA, 1993, pp. 123-142.
[93] B.W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Tech. J. 49 (1) (1970) 291-307.
[94] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540-1551.
About the Author: MAURIZIO FILIPPONE received a Laurea degree in Physics in 2004 from the University of Genova and is now pursuing a Ph.D. in Computer Science at the University of Genova. His research interests are focused on Machine Learning techniques and their applications to Bioinformatics and to time series analysis and forecasting.

About the Author: FRANCESCO CAMASTRA received a Laurea degree in Physics in 1985 from the University of Milan and the Ph.D. degree in Computer Science from the University of Genova in 2004. He was the recipient of the Eduardo Caianiello Award 2005 and since February 2006 he has been Assistant Professor at the University of Napoli Parthenope. His research interests are Machine Learning and Kernel Methods.

About the Author: FRANCESCO MASULLI received the Laurea degree in Physics in 1976 from the University of Genova. He is currently an Associate Professor in Computer Science at the University of Genova. From 1983 to 2001 he was an Assistant Professor at the University of Genova and from 2001 to 2005 an Associate Professor at the University of Pisa. He has authored or co-authored more than 120 scientific papers in Machine Learning, Neural Networks, Fuzzy Systems, Pattern Recognition and Bioinformatics.

About the Author: STEFANO ROVETTA received a Laurea degree in Electronic Engineering in 1992 and a Ph.D. degree in Models, Methods and Tools for Electronic and Electromagnetic Systems in 1996, both from the University of Genova. He is currently an Assistant Professor in Computer Science at the University of Genova. His current research interests include Machine Learning and Clustering techniques for high-dimensional biomedical data analysis and document analysis.