An Adaptive Kernel Method for
Semi-supervised Clustering
Bojun Yan and Carlotta Domeniconi
Department of Information and Software Engineering
George Mason University
Fairfax, Virginia 22030, USA
[email protected],
[email protected]
Abstract. Semi-supervised clustering uses limited background
knowledge to aid unsupervised clustering algorithms. Recently, a kernel method for semi-supervised clustering has been introduced, which
has been shown to outperform previous semi-supervised clustering approaches. However, the setting of the kernel’s parameter is left to manual
tuning, and the chosen value can largely affect the quality of the results.
Thus, the selection of the kernel’s parameters remains a critical and open
problem when only limited supervision, provided in terms of pairwise
constraints, is available. In this paper, we derive a new optimization
criterion to automatically determine the optimal parameter of an RBF
kernel, directly from the data and the given constraints. Our approach
integrates the constraints into the clustering objective function, and optimizes the parameter of a Gaussian kernel iteratively during the clustering
process. Our experimental comparisons and results with simulated and
real data clearly demonstrate the effectiveness and advantages of the
proposed algorithm.
1 Introduction
As a recent emerging technique, semi-supervised clustering has attracted significant research interest. Compared to traditional clustering algorithms, which
only use unlabeled data, semi-supervised clustering employs both unlabeled and
supervised data to obtain a partitioning that conforms more closely with the
user’s preferences. Several recent papers have discussed this problem [16,8,1,18,2,12].
In semi-supervised clustering, limited supervision is provided as input. The
supervision can have the form of labeled data or pairwise constraints. In many
applications it is natural to assume that pairwise constraints are available [1,
16]. For example, in protein interaction and gene expression data [13], pairwise
constraints can be derived from the background domain knowledge. Similarly,
in information and image retrieval, it is easy for the user to provide feedback
concerning a qualitative measure of similarity or dissimilarity between pairs of
objects. Thus, in these cases, although class labels may be unknown, a user can
still specify whether pairs of points belong to the same cluster or to different
ones. Furthermore, a set of classified points implies an equivalent set of pairwise
constraints, but not vice versa.
Recently, a kernel method for semi-supervised clustering has been introduced
[12]. This technique extends semi-supervised clustering to a kernel space, thus
enabling the discovery of clusters with non-linear boundaries in input space.
While a powerful technique, the applicability of a kernel-based semi-supervised
clustering approach is limited in practice, due to the critical setting of the kernel’s
parameters. In fact, the chosen parameter values can largely affect the quality
of the results. While solutions have been proposed in supervised learning to
estimate the optimal kernel’s parameters, the problem presents open challenges
when no labeled data are provided, and all we have available is a set of pairwise
constraints.
In this paper, we derive a new optimization criterion to automatically estimate
the optimal parameter of a Gaussian kernel, directly from the data and the given
constraints. Our approach integrates the constraints into the clustering objective
function, and optimizes the parameter of a Gaussian kernel iteratively during
the clustering process. As a result, our technique is able to automatically embed,
during the clustering process, the optimal non-linear similarity within the feature
space. This makes our adaptive technique capable of discovering clusters with
non-linear boundaries in input space with high accuracy, as demonstrated in our
experiments. Our proposed method enables the practical utilization of powerful
kernel-based semi-supervised clustering approaches by providing a mechanism
to automatically set the involved critical parameters.
The rest of the paper is organized as follows. Section 2 provides the necessary
background on kernel-based clustering and semi-supervised clustering. Section
3 motivates our approach, and discusses the details of our algorithm. Section
4 describes our experimental settings and results. Section 5 discusses the related work, and finally we provide conclusions and future research directions in
Section 6.
2 Background
This section introduces the necessary background on kernel-based clustering and
semi-supervised clustering.
2.1 Kernel KMeans
Let $X$ be a dataset of $N$ samples and $D$ dimensions, $X = \{x_i\}_{i=1}^{N} \subseteq \mathbb{R}^D$. Let $\phi: \mathbb{R}^D \rightarrow \mathbb{R}^{D'}$ be a non-linear mapping function, which maps data from the input ($D$ dimensional) space to a feature space ($D'$ dimensional), with $D' > D$. The Kernel KMeans algorithm generates a partition $\{\pi_c\}_{c=1}^{k}$ of $X$ ($\pi_c$ represents the $c$th cluster) so that the objective function $\sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|\phi(x_i) - m_c^{\phi}\|^2$ is minimized, where $m_c^{\phi} = \frac{1}{|\pi_c|} \sum_{x_i \in \pi_c} \phi(x_i)$ represents the centroid of cluster $\pi_c$ in feature space. The key issue of Kernel-KMeans is the computation of distances
in feature space. The distance of a point $x_i$ from $m_c^{\phi}$ in feature space can be expressed as $\|\phi(x_i) - m_c^{\phi}\|^2 = A_{ii} + B_{cc} - D_{ic}$, where $A_{ii} = \phi(x_i)\cdot\phi(x_i)$, $D_{ic} = \frac{2}{|\pi_c|} \sum_{x_j \in \pi_c} \phi(x_i)\cdot\phi(x_j)$, and $B_{cc} = \frac{1}{|\pi_c|^2} \sum_{x_j, x_{j'} \in \pi_c} \phi(x_j)\cdot\phi(x_{j'})$.
Following the standard SVM method, we can represent the dot product of points in kernel space using an appropriate Mercer kernel $K(x_i, x_j) = \phi(x_i)\cdot\phi(x_j)$ [15]. Since data points always appear in the form of dot products, the terms for distance computation can be rewritten using the kernel trick: $A_{ii} = K(x_i, x_i)$, $D_{ic} = \frac{2}{|\pi_c|} \sum_{x_j \in \pi_c} K(x_i, x_j)$, and $B_{cc} = \frac{1}{|\pi_c|^2} \sum_{x_j, x_{j'} \in \pi_c} K(x_j, x_{j'})$. We note that $A_{ii}$ is common to every cluster, thus we can avoid calculating it, while $B_{cc}$ must be calculated once in each iteration.
2.2 HMRF Model and Kernel-Based Semi-supervised Clustering
In semi-supervised clustering, we are given a set of pairwise constraints: must-link $ML = \{(x_i, x_j)\}$ and cannot-link $CL = \{(x_i, x_j)\}$. The goal is to partition the data into $k$ clusters so that a given measure of distortion between each point and the corresponding cluster representative is minimized and, at the same time, the smallest number of constraint violations is achieved. Basu
et al. (2004) [2] proposed a framework for semi-supervised clustering based on
Hidden Markov Random Fields (HMRFs). Considering the squared Euclidean
distance as a measure of cluster distortion, and the generalized Potts potential
as constraint violation potential, the semi-supervised clustering objective can be
expressed as [2]:
$$J_{obj}(\{\pi_c\}_{c=1}^{k}) = \sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|x_i - m_c\|^2 + \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} w_{ij} + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} \bar{w}_{ij}$$

where $m_c$ is the centroid of cluster $\pi_c$, $ML$ is the set of must-link constraints, $CL$ is the set of cannot-link constraints, $w_{ij}$ and $\bar{w}_{ij}$ are the penalty costs for violating a must-link and a cannot-link constraint respectively, and $l_i$ represents the cluster label of $x_i$.
Kulis et al. (2005) [12] extended this framework to kernel-based semi-supervised clustering. Instead of adding a penalty term for a must-link violation, a reward is given for the satisfaction of the constraint. This is achieved by
subtracting the corresponding penalty term from the objective:
$$J_{obj}(\{\pi_c\}_{c=1}^{k}) = \sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|\phi(x_i) - m_c^{\phi}\|^2 - \sum_{(x_i,x_j) \in ML,\, l_i = l_j} w_{ij} + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} \bar{w}_{ij}$$
The algorithm derived in [12] (called SS-Kernel-KMeans), when combined with
the Gaussian kernel, is shown to outperform the HMRF-KMeans approach [2],
and SS-Kernel-KMeans combined with a linear kernel. However, the setting of
the kernel’s parameter is left to manual tuning, and the chosen value can largely
affect the quality of the results. Thus, the selection of the kernel’s parameters remains
a critical and open problem when only limited supervision is available. This motivates the approach discussed in the next section.
3 Adaptive Kernel-Based Semi-supervised Clustering
In kernel-based learning algorithms it is important that the kernel function in use
conforms with the learning target. For classification, the distribution of data in
feature space should be correlated to the label distribution. Similarly, in semi-supervised clustering, one wishes to learn a kernel that maps pairs of points
subject to a must-link constraint close to each other in feature space, and maps
points subject to a cannot-link constraint far apart in feature space.
The authors in [9] introduce the concept of kernel alignment to measure the
correlation between the groups of data in feature space and the labeling to be
learned. In [17], a Fisher discriminant rule is used to estimate the optimal spread
parameter of a Gaussian kernel. The selection of the kernel’s parameters is indeed a
critical problem. For example, empirical results in the literature have shown that
the value of the spread parameter σ of a Gaussian kernel can strongly affect the
generalization performance of an SVM. Values of σ which are too small or too
large lead to poor generalization capabilities. When σ → 0, the kernel matrix
becomes the identity matrix. In this case, the resulting optimization problem
gives Lagrangians which are all 1s, and therefore every point becomes a support
vector. On the other hand, when σ → ∞, the kernel matrix has entries all equal
to 1, and thus every point in feature space is maximally similar to every other point. In
both cases, the machine will generalize very poorly.
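The two limiting cases can be seen directly on a small random sample; the snippet below (a toy illustration of ours, not from the paper) prints the Gaussian kernel matrix for a very small and a very large σ:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian (RBF) kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))

# sigma -> 0: off-diagonal entries vanish and K approaches the identity matrix
print(np.round(gaussian_kernel_matrix(X, sigma=1e-3), 3))

# sigma -> infinity: all entries approach 1, so every point looks maximally similar
print(np.round(gaussian_kernel_matrix(X, sigma=1e3), 3))
```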
The problem of setting the kernel’s parameters, and more generally of finding a proper mapping in feature space, is even more difficult when no labeled data are provided, and all we have available is a set of pairwise constraints. In this paper
we utilize the given constraints to derive an optimization criterion to automatically estimate the optimal kernel’s parameters. Our approach integrates the
constraints into the clustering objective function, and optimizes the kernel’s parameters iteratively while discovering the clustering structure. Specifically, we
steer the search for optimal parameter values by measuring the amount of must-link and cannot-link constraint violations in feature space. Following the method proposed in [2, 4], we scale the penalty terms by the distance in feature space between the points that violate the constraints. That is, for violation of a must-link
constraint (xi , xj ), the larger the distance between the two points xi and xj in
feature space, the larger the penalty; for violation of a cannot-link constraint
(xi , xj ), the smaller the distance between the two points xi and xj in feature
space, the larger the penalty. According to these rules, we can formulate the
penalty terms as follows:
$$P_{ML}(x_i, x_j) = w_{ij}\, \|\phi(x_i) - \phi(x_j)\|^2\, \mathbf{1}[l_i \neq l_j] \qquad (1)$$

$$P_{CL}(x_i, x_j) = \bar{w}_{ij}\, \big( (D_{max}^{\phi})^2 - \|\phi(x_i) - \phi(x_j)\|^2 \big)\, \mathbf{1}[l_i = l_j] \qquad (2)$$

$D_{max}^{\phi}$ is the maximum distance between any pair of points in feature space; it ensures that the penalty for violated cannot-link constraints is non-negative.
By combining these two penalty terms with the objective function of Kernel
KMeans, we obtain the objective function for our adaptive semi-supervised kernel KMeans (Adaptive-SS-Kernel-KMeans) approach:
$$J_{obj} = \sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|\phi(x_i) - m_c^{\phi}\|^2 + \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} w_{ij}\, \|\phi(x_i) - \phi(x_j)\|^2 + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} \bar{w}_{ij}\, \big( (D_{max}^{\phi})^2 - \|\phi(x_i) - \phi(x_j)\|^2 \big) \qquad (3)$$
Suppose $x'$ and $x''$ are the two farthest points in feature space. We use the equality $\sum_{c=1}^{k} \sum_{x_i \in \pi_c} \|x_i - m_c\|^2 = \sum_{c=1}^{k} \sum_{x_i, x_j \in \pi_c} \frac{\|x_i - x_j\|^2}{2|\pi_c|}$ to re-formulate Equation (3) as follows:

$$J_{obj} = \sum_{c=1}^{k} \sum_{x_i, x_j \in \pi_c} \frac{\|\phi(x_i) - \phi(x_j)\|^2}{2|\pi_c|} + \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} w_{ij}\, \|\phi(x_i) - \phi(x_j)\|^2 + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} \bar{w}_{ij}\, \big( \|\phi(x') - \phi(x'')\|^2 - \|\phi(x_i) - \phi(x_j)\|^2 \big)$$
By expanding the distance computation in feature space φ(xi ) − φ(xj )2 , and
using the kernel trick K(xi , xj ) = φ(xi ) · φ(xj ), we obtain:
$$J_{obj} = \sum_{c=1}^{k} \sum_{x_i, x_j \in \pi_c} \frac{K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)}{2|\pi_c|} + \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} w_{ij}\, \big( K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) \big) + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} \bar{w}_{ij}\, \big( K(x', x') + K(x'', x'') - 2K(x', x'') - K(x_i, x_i) - K(x_j, x_j) + 2K(x_i, x_j) \big) \qquad (4)$$
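To make Equation (4) concrete, here is a small sketch (ours, not the authors' implementation; unit violation costs and our own variable names are assumed) that evaluates the objective from a precomputed kernel matrix, the current assignment, and the constraint sets:

```python
import numpy as np

def objective_eq4(K, labels, ML, CL, far_pair, w=1.0, w_bar=1.0):
    """Evaluate the objective of Equation (4).

    K        : (N, N) precomputed kernel matrix
    labels   : length-N array of cluster assignments
    ML, CL   : lists of index pairs (i, j) for must-link / cannot-link constraints
    far_pair : indices (p, q) of the two farthest points in feature space (x', x'')
    """
    labels = np.asarray(labels)

    def d2(i, j):
        # ||phi(x_i) - phi(x_j)||^2 via the kernel trick
        return K[i, i] + K[j, j] - 2.0 * K[i, j]

    J = 0.0
    for c in np.unique(labels):                      # within-cluster spread
        members = np.flatnonzero(labels == c)
        for i in members:
            for j in members:
                J += d2(i, j) / (2.0 * len(members))
    for i, j in ML:                                  # violated must-link constraints
        if labels[i] != labels[j]:
            J += w * d2(i, j)
    p, q = far_pair
    for i, j in CL:                                  # violated cannot-link constraints
        if labels[i] == labels[j]:
            J += w_bar * (d2(p, q) - d2(i, j))
    return J
```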
Let us consider the Gaussian kernel function: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$. (From now on we utilize the Gaussian kernel to derive our algorithm, since it has excellent learning properties. Other kernel functions can be
used as well.) We want to minimize Jobj with respect to the kernel parameter
σ. As observed earlier, when σ → ∞, all points in feature space are maximally
similar to each other, and the objective function (4) is trivially minimized. To
avoid this degenerate case, we add the following constraint:
$$\sum_{x_i \in X} \|\phi(x_i) - \phi(x_r)\|^2 \geq Const \qquad (5)$$
where xr is a point randomly selected from X. By incorporating constraint (5)
into the objective function, and applying the kernel trick for distance computation in feature space, we finally obtain:
$$J_{kernel-obj} = \sum_{c=1}^{k} \sum_{x_i, x_j \in \pi_c} \frac{1 - K(x_i, x_j)}{|\pi_c|} + \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} 2 w_{ij}\, (1 - K(x_i, x_j)) + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} 2 \bar{w}_{ij}\, \big( K(x_i, x_j) - K(x', x'') \big) - \Big( \sum_{x_i \in X} 2\, (1 - K(x_i, x_r)) - Const \Big)$$

Given $\frac{\partial K(x_i, x_j)}{\partial \sigma} = \exp\!\big( \frac{-\|x_i - x_j\|^2}{2\sigma^2} \big) \frac{\|x_i - x_j\|^2}{\sigma^3}$, we compute $\frac{\partial J_{kernel-obj}}{\partial \sigma}$:

$$\frac{\partial J_{kernel-obj}}{\partial \sigma} = - \sum_{c=1}^{k} \sum_{x_i, x_j \in \pi_c} \frac{1}{|\pi_c|} \exp\!\Big( \frac{-\|x_i - x_j\|^2}{2\sigma^2} \Big) \frac{\|x_i - x_j\|^2}{\sigma^3} - \sum_{(x_i,x_j) \in ML,\, l_i \neq l_j} 2 w_{ij} \exp\!\Big( \frac{-\|x_i - x_j\|^2}{2\sigma^2} \Big) \frac{\|x_i - x_j\|^2}{\sigma^3} + \sum_{(x_i,x_j) \in CL,\, l_i = l_j} 2 \bar{w}_{ij} \Big( \exp\!\Big( \frac{-\|x_i - x_j\|^2}{2\sigma^2} \Big) \frac{\|x_i - x_j\|^2}{\sigma^3} - \exp\!\Big( \frac{-\|x' - x''\|^2}{2\sigma^2} \Big) \frac{\|x' - x''\|^2}{\sigma^3} \Big) + \sum_{x_i \in X} 2 \exp\!\Big( \frac{-\|x_i - x_r\|^2}{2\sigma^2} \Big) \frac{\|x_i - x_r\|^2}{\sigma^3} \qquad (6)$$
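As an illustration of how Equation (6) can be evaluated, the following sketch (ours, not the authors' implementation; unit costs and our own names are assumed) computes $\partial J_{kernel-obj}/\partial\sigma$ for a given partition, constraint sets, farthest pair $(x', x'')$, and reference point $x_r$:

```python
import numpy as np

def grad_sigma(X, labels, ML, CL, sigma, far_pair, r, w=1.0, w_bar=1.0):
    """Gradient of J_kernel-obj with respect to sigma, following Equation (6).

    X        : (N, D) data matrix
    labels   : length-N array of current cluster assignments
    ML, CL   : lists of index pairs (i, j) for must-link / cannot-link constraints
    far_pair : indices (p, q) of the two farthest points in feature space (x', x'')
    r        : index of the randomly selected reference point x_r
    """
    labels = np.asarray(labels)

    def dK_dsigma(i, j):
        # d/dsigma of exp(-||x_i - x_j||^2 / (2 sigma^2))
        d2 = np.sum((X[i] - X[j]) ** 2)
        return np.exp(-d2 / (2.0 * sigma ** 2)) * d2 / sigma ** 3

    grad = 0.0
    for c in np.unique(labels):                      # within-cluster term
        members = np.flatnonzero(labels == c)
        for i in members:
            for j in members:
                grad -= dK_dsigma(i, j) / len(members)
    for i, j in ML:                                  # violated must-link constraints
        if labels[i] != labels[j]:
            grad -= 2.0 * w * dK_dsigma(i, j)
    p, q = far_pair
    for i, j in CL:                                  # violated cannot-link constraints
        if labels[i] == labels[j]:
            grad += 2.0 * w_bar * (dK_dsigma(i, j) - dK_dsigma(p, q))
    for i in range(len(X)):                          # term coming from constraint (5)
        grad += 2.0 * dK_dsigma(i, r)
    return grad

# Gradient descent update on sigma, with step length rho:
# sigma_new = sigma_old - rho * grad_sigma(X, labels, ML, CL, sigma_old, far_pair, r)
```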
In the following we derive an EM-based strategy to optimize $J_{kernel-obj}$, using gradient descent to update the kernel parameter σ.
3.1 EM-Based Strategy
To minimize the objective function Jkernel−obj , we use an EM-based strategy.
We initialize the clusters utilizing the mechanism proposed in [12]: we take the
transitive closure of the constraints to form neighborhoods, and then perform a
farthest-first traversal on these neighborhoods to get the k initial clusters. We
ensure that the same set of constraints is given to the competing algorithm in
our experiments.
E-step: The algorithm assigns data points to clusters so that the objective
function Jkernel−obj is minimized. Since the objective function integrates the
given must-link and cannot-link constraints, it is minimized by assigning each
point to the cluster with the closest centroid (first term of $J_{kernel-obj}$) while incurring a minimal penalty for constraint violations (second and third terms of $J_{kernel-obj}$). The fourth term of $J_{kernel-obj}$ is constant during the assignment
of data points in each iteration. When updating the cluster assignment of a
given point, the assignment for the other points is kept fixed [3,19]. During each
iteration, data points are re-ordered randomly. The process is repeated until no
change in point assignment occurs.
M-step: The algorithm re-computes the cluster representatives. In practice,
since we map the data into kernel space and do not have access to the coordinates of
cluster representatives, we re-compute the term Bcc (as discussed in Section 2.1),
which will be used to re-assign points to clusters in the E-step. Constraints are
not used in this step. Therefore, only the first term of Jkernel−obj is minimized.
We note that all the steps so far are executed with respect to the current feature space. We now optimize the feature space by optimizing the kernel parameter σ. To this end, we apply the gradient descent rule to update the parameter σ of the Gaussian kernel: $\sigma^{(new)} = \sigma^{(old)} - \rho \frac{\partial J_{kernel-obj}}{\partial \sigma}$, where $\rho$ is a scalar step length parameter optimized via a line-search method. The expression for $\frac{\partial J_{kernel-obj}}{\partial \sigma}$ is given in Equation (6).
A description of the algorithm (Adaptive-SS-Kernel-KMeans) is provided in
Figure 1.
Algorithm: Adaptive-SS-Kernel-KMeans

Input:
– Set of data points $X = \{x_i\}_{i=1}^{N}$
– Set of must-link constraints $ML$
– Set of cannot-link constraints $CL$
– Number of clusters $k$
– Constraint violation costs $w_{ij}$ and $\bar{w}_{ij}$

Output:
– Partition of $X$ into $k$ clusters

Method:
1. Initialize clusters $\{\pi_c^{(0)}\}_{c=1}^{k}$ using the given constraints; set $t = 0$.
2. Repeat Step 3 – Step 6 until convergence.
3. E-step: Assign each data point $x_i$ to a cluster $\pi_c^{(t)}$ so that $J_{kernel-obj}$ is minimized.
4. M-step(1): Re-compute $B_{cc}^{(t)}$, for $c = 1, 2, \ldots, k$.
5. M-step(2): Optimize the kernel parameter σ using gradient descent according to the rule: $\sigma^{(new)} = \sigma^{(old)} - \rho \frac{\partial J_{kernel-obj}}{\partial \sigma}$.
6. Increment $t$ by 1.

Fig. 1. Adaptive-SS-Kernel-KMeans
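To show how the pieces of Figure 1 fit together, below is a compact, simplified sketch of ours (not the authors' code). It assumes a Gaussian kernel and unit violation costs, uses a random initialization instead of the constraint-based one, omits the farthest-pair and $Const$ terms of $J_{kernel-obj}$, and replaces the analytic gradient of Equation (6) with a finite-difference approximation; M-step(1) is implicit because the objective is recomputed from the kernel matrix. It is meant only to illustrate the alternation between cluster assignment and σ updates.

```python
import numpy as np

def rbf_kernel(X, sigma):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def objective(K, labels, ML, CL, w=1.0, w_bar=1.0):
    """Simplified J_kernel-obj: within-cluster spread plus constraint-violation penalties."""
    J = 0.0
    for c in np.unique(labels):
        m = np.flatnonzero(labels == c)
        J += np.sum(1.0 - K[np.ix_(m, m)]) / len(m)
    for i, j in ML:
        if labels[i] != labels[j]:
            J += 2.0 * w * (1.0 - K[i, j])
    for i, j in CL:
        if labels[i] == labels[j]:
            J += 2.0 * w_bar * K[i, j]        # farthest-pair reference term omitted
    return J

def adaptive_ss_kernel_kmeans(X, k, ML, CL, sigma=1.0, n_iter=20, rho=0.05, eps=1e-4):
    rng = np.random.default_rng(0)
    labels = rng.integers(0, k, size=len(X))  # Fig. 1 instead uses constraint-based init
    for _ in range(n_iter):
        K = rbf_kernel(X, sigma)
        for i in range(len(X)):               # E-step: greedy point-by-point reassignment
            trials = [labels.copy() for _ in range(k)]
            for c in range(k):
                trials[c][i] = c
            labels = min(trials, key=lambda t: objective(K, t, ML, CL))
        # M-step(2): gradient descent on sigma (finite differences stand in for Eq. (6))
        g = (objective(rbf_kernel(X, sigma + eps), labels, ML, CL)
             - objective(rbf_kernel(X, sigma - eps), labels, ML, CL)) / (2.0 * eps)
        sigma = max(sigma - rho * g, 1e-3)
    return labels, sigma
```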
4 Experimental Evaluation

4.1 Datasets
We performed experiments on one simulated dataset and four real datasets. (1)
The simulated dataset contains two clusters in two dimensions distributed as
concentric circles (See Figure 2(a)). Each cluster contains 200 points. (2) Digits:
This dataset is the pendigits handwritten character recognition dataset from the
UCI repository¹ [5]. 10% of the data were chosen randomly from the three classes
{3, 8, 9} as done in [12]. This results in 317 points and 16 dimensions. (3) Spectf:
This dataset is also from the UCI repository [5]. It describes the diagnosis of
cardiac Single Photon Emission Computed Tomography (SPECT) images. Each
patient is classified into one of two categories: normal or abnormal. 267 SPECT
image sets (patients) were processed to extract features that summarize the
original SPECT images. As a result, 44 continuous features were created for each
patient. (4) Vowel: This dataset concerns the recognition of eleven steady state
vowels of British English, using a specified training set of LPC-derived log area ratios². Three classes corresponding to the vowels "i", "I", and "E" were chosen,
for a total of 126 points and 10 dimensions; (5) Segmentation: This dataset
is from the UCI repository [5]. It has 210 points and 19 dimensions. The instances
were drawn randomly from a database of 7 outdoor images. The images were
hand-segmented to create a classification for every pixel.
4.2 Evaluation Criterion
To evaluate the clustering results, we use the Rand Statistic index [14,18,16]. The
Rand Statistic is an external cluster validity measure that estimates the quality
of the clustering results with respect to the underlying classes of the data. Let
P1 be the partition of the data X after applying a clustering algorithm, and
P2 be the underlying class structure of the data. We refer to a pair of points
(xu , xv ) ∈ X × X from the data using the following terms:
– SS: if both points belong to the same cluster of P1 and to the same group
of the underlying class structure P2 .
– SD: if the two points belong to the same cluster of P1 and to different groups
of P2 .
– DS: if the two points belong to different clusters of P1 and to the same group
of P2 .
– DD: if both points belong to different clusters of P1 and to different groups
of P2 .
Assume now that $N_{SS}$, $N_{SD}$, $N_{DS}$ and $N_{DD}$ are the numbers of SS, SD, DS and DD pairs respectively; then $N_{SS} + N_{SD} + N_{DS} + N_{DD} = N_{Pair}$, which is the maximum number of pairs in the data set³. The Rand Statistic index measures
the degree of similarity between P1 and P2 as follows:
$$\mathrm{RandStatistic} = (N_{SS} + N_{DD}) / N_{Pair} \qquad (7)$$
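A direct implementation of this index (a small sketch of ours) simply counts the pairs on which the two partitions agree:

```python
from itertools import combinations

def rand_statistic(cluster_labels, class_labels):
    """Rand Statistic: fraction of point pairs on which the clustering P1
    (cluster_labels) and the class structure P2 (class_labels) agree,
    i.e. (N_SS + N_DD) / N_Pair."""
    n_agree = n_pairs = 0
    for u, v in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[u] == cluster_labels[v]
        same_class = class_labels[u] == class_labels[v]
        n_agree += int(same_cluster == same_class)   # counts both SS and DD pairs
        n_pairs += 1
    return n_agree / n_pairs
```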
4.3 Results and Discussion
To evaluate the effectiveness of our proposed method Adaptive-SS-Kernel-KMeans, we perform comparisons with SS-Kernel-KMeans [12]. As shown in
¹ http://www.ics.uci.edu/~mlearn/MLRepository.html
² http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
³ $N_{Pair} = N(N-1)/2$, where $N$ is the total number of points in the data set.
[12], SS-Kernel-KMeans combined with a Gaussian kernel outperforms HMRF-KMeans and SS-Kernel-KMeans with a linear kernel. Therefore, the technique SS-Kernel-KMeans with a Gaussian kernel was the natural choice for our empirical comparisons. SS-Kernel-KMeans requires as input a predefined value for the
Gaussian kernel parameter σ. In the absence of labeled data, parameters cannot be
cross-validated; thus, we estimate the expected accuracy of SS-Kernel-KMeans
by averaging the resulting clustering quality over multiple runs for different
values of σ. Specifically, in our experiments, we test the SS-Kernel-KMeans
algorithm with the following values of $\sigma^2$: 0.1, 1, 10, 100, 1000, 10000. We report the
average Rand Statistic achieved over the six σ values, as well as the average
over the best three performances achieved, in order to show the advantage of
our technique also in this latter case. The violation costs $w_{ij}$ and $\bar{w}_{ij}$ are set
to 1 in our experiments since we assume no a-priori knowledge on such costs.
As the value of k, we provide the actual number of classes in the data to both
algorithms.
Figures 2-4 show the learning curves using 20 runs of 2-fold cross-validation
for each data set (30% for training and 70% for testing). These plots show the
improvement in clustering quality on the test set as a function of an increasing
amount of pairwise constraints. To study the effect of constraints in clustering,
30% of the data was randomly drawn as the training set at any particular fold,
and the constraints were generated using only the training set. The clustering
algorithm was run on the whole data set, but we calculated the Rand Statistic
only on the test set. Each point on the learning curve is an average of results
over 20 runs.
The results shown in Figures 2-4 clearly demonstrate the effectiveness of our
proposed technique Adaptive-SS-Kernel-KMeans. For all five datasets, the clustering quality achieved by our adaptive approach significantly outperforms the
results provided by SS-Kernel-KMeans, averaged over the σ values tested. In
most cases (TwoConcentric, Vowel, Digits, and Segmentation), the Adaptive-SS-Kernel-KMeans technique also outperforms the average top three performances
of SS-Kernel-KMeans. For the Spectf data the two approaches show a similar
trend. These results show that our adaptive technique is capable of estimating
the optimal kernel parameter value from the given constraints. In particular,
for the TwoConcentric data (see Figure 2(b)), the Adaptive-SS-Kernel-KMeans
technique effectively uses the increased amount of constraints to learn a perfect
separation of the two clusters. For the Digits, Spectf, and Segmentation data,
the Adaptive-SS-Kernel-KMeans technique provides a clustering quality that is
significantly higher than the one given by SS-Kernel-KMeans, even when a small
amount of constraints is available. This behavior is very desirable since in practice only a limited amount of supervision might be available. We also emphasize
that the cluster initialization mechanism employed in the EM-based strategy
mitigates the sensitivity of the result at convergence to the starting point of
the search.
[Figure: (a) scatter plot of the TwoConcentric data; (b) Rand Statistic vs. number of constraints for Adaptive-SS-Kernel-KMeans, SS-Kernel-KMeans-BestThree, and SS-Kernel-KMeans-Expectation.]

Fig. 2. (a) TwoConcentric data (b) Clustering result on TwoConcentric data
[Figure: Rand Statistic vs. number of constraints for Adaptive-SS-Kernel-KMeans, SS-Kernel-KMeans-BestThree, and SS-Kernel-KMeans-Expectation.]

Fig. 3. (a) Clustering result on Vowel data (b) Clustering result on Digits data
[Figure: Rand Statistic vs. number of constraints for Adaptive-SS-Kernel-KMeans, SS-Kernel-KMeans-BestThree, and SS-Kernel-KMeans-Expectation.]

Fig. 4. (a) Clustering result on Spectf data (b) Clustering result on Segmentation data
5 Related Work
In the context of supervised learning, the work in [7] considers the problem of
automatically tuning multiple parameters for a support vector machine. This is
achieved by minimizing the estimated generalization error by means of a gradient descent approach over the set of parameters. In [17], a Fisher
discriminant rule is used to estimate the optimal spread parameter of a Gaussian
kernel. The authors in [10] propose a new criterion to address the selection of
kernel’s parameters within a kernel Fisher discriminant analysis framework for
face recognition. A new formulation is derived to optimize the parameters of
a Gaussian kernel based on a gradient descent algorithm. This research makes
use of labeled data to address classification problems. In contrast, our approach
optimizes kernel’s parameters based on unlabeled data and pairwise constraints,
and aims at solving clustering problems.
In the context of semi-supervised clustering, the authors in [8] use a gradient
descent approach combined with a weighted Jensen-Shannon divergence for EM
clustering. The authors in [1] propose a method based on Relevant Component Analysis (RCA) that uses must-link constraints to learn a Mahalanobis
distance. [18] utilizes both must-link and cannot-link constraints to formulate a
convex optimization problem which is local-minima-free. [13] proposes a unified
Markov network with constraints. [2] introduces a more general HMRF framework that works with different clustering distortion measures, including Bregman divergences and directional similarity measures. All these techniques use
the given constraints and an underlying (linear) distance metric for clustering
points in input space. [12] extends the semi-supervised clustering framework to a
non-linear kernel space. However, the setting of the kernel’s parameter is left to
manual tuning, and the chosen value can largely affect the results. The selection
of kernel’s parameters is a critical and open problem, which has been the driving
force behind the work presented in this paper.
6 Conclusion and Future Work
We proposed a new adaptive semi-supervised Kernel-KMeans algorithm. Our
approach integrates the given constraints with the kernel function, and is able
to automatically embed, during the clustering process, the optimal non-linear
similarity within the feature space. As a result, the proposed algorithm is capable
of discovering clusters with non-linear boundaries in input space with high accuracy. Our technique enables the practical utilization of powerful kernel-based
semi-supervised clustering approaches by providing a mechanism to automatically set the involved critical parameters. In our future work we will consider
active learning as a methodology to generate constraints which are most informative. We will also consider other kernel functions (e.g., polynomial) in our
future experiments, as well as combinations of different types of kernels.
Acknowledgements
This work was in part supported by NSF CAREER Award IIS-0447814.
References
1. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions
using equivalence relations. International Conference on Machine Learning, pages
11-18, 2003.
2. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised
clustering. International Conference on Knowledge Discovery and Data Mining,
2004.
3. Besag, J.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological), 1986.
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and Metric Learning in
semi-supervised clustering. International Conference on Machine Learning, 2004.
5. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases.
http://www.ics.uci.edu/˜mlearn/MLRepository.html, 1998.
6. Boykov, Y., Veksler, O., Zabih, R.: Markov random fields with efficient approximations. IEEE Conference on Computer Vision and Pattern Recognition, 1998.
7. Chapelle, O., Vapnik, V.: Choosing Multiple Parameters for Support Vector Machines. Machine Learning, Vol. 46, No. 1, pp. 131-159, 2002.
8. Cohn, D., Caruana, R., McCallum, A.: Semi-supervised clustering with user feedback. TR2003-1892, Cornell University, 2003.
9. Cristianini, N., Shawe-Taylor, J., Elisseeff, A.: On Kernel-Target Alignment, Neural
Information Processing Systems (NIPS), 2001.
10. Huang, J., Yuen, P.C., Chen, W.S., Lai, J. H.: Kernel Subspace LDA with optimized
Kernel Parameters on Face Recognition. The sixth IEEE International Conference
on Automatic Face and Gesture Recognition, 2004.
11. Kleinberg, J., Tardos, E.: Approximation algorithms for classification problems
with pairwise relationships: metric labeling and Markov random fields. The 40th
IEEE Symposium on Foundation of Computer Science, 1999.
12. Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering: a
kernel approach. International Conference on Machine Learning, 2005.
13. Segal, E., Wang, H., Koller, D.: Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 2003.
14. Theodoridis, S., Koutroubas, K.: Pattern Recognition. Academic Press, 1999.
15. Vapnik., V.: The Nature of Statistical Learning Theory, Wiley, New York, USA,
1995.
16. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained K-Means clustering
with background knowledge. International Conference on Machine Learning, pages
577-584, 2001.
17. Wang, W., Xu, Z., Lu, W., Zhang, X.: Determination of the spread parameter in
the Gaussian Kernel for classification and regression. Neurocomputing, Vol. 55,
No. 3, 645, 2002.
18. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with
application to clustering with side-information. Advances in Neural Information
Processing Systems 15, 2003.
19. Zhang, Y., Brady, M., Smith, S.: Hidden Markov random field model and segmentation of brain MR images. IEEE Transactions on Medical Imaging, 2001.