Bisecting K-Means
Vipin Kumar
[email protected]

ABSTRACT
This paper presents the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means. (We used both a standard K-means algorithm and a bisecting K-means algorithm.) Our results indicate that the bisecting K-means technique is better than the standard K-means approach and (somewhat surprisingly) as good or better than the hierarchical approaches that we tested.
Keywords
K-means, hierarchical clustering, document clustering.
Table 1: Summary description of document sets.

Data Set   Source    Documents   Classes   Words
re0        Reuters   1504        13        11465
re1        Reuters   1657        25        3758
wap        WebAce    1560        20        8460
tr31       TREC      927         7         10128
tr45       TREC      690         10        8261
fbis       TREC      2463        17        2000
la1        TREC      3204        6         31472
la2        TREC      3075        6         31472
1. INTRODUCTION
Hierarchical clustering is often portrayed as the higher-quality clustering approach, but it is limited by its quadratic time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents, but are thought to produce inferior clusters.

K-means and agglomerative hierarchical approaches are sometimes combined so as to get the best of both worlds. For example, in the document domain, Scatter/Gather [1], a document browsing system based on clustering, uses a hybrid approach involving both K-means and agglomerative hierarchical clustering: K-means is used for its run-time efficiency and agglomerative hierarchical clustering for its quality.

However, during the course of our experiments we discovered that a simple and efficient variant of K-means, bisecting K-means, can produce clusters of documents that are better than those produced by regular K-means and as good or better than those produced by agglomerative hierarchical clustering techniques. We have also been able to find what we think is a reasonable explanation for this behavior.

We refer the reader to [2] for a review of cluster analysis and to [4] for a review of information retrieval. For a more complete version of this paper, please see [6]. The data sets that we used are described more fully in [6] and are summarized in Table 1.
3. Bisecting K-means
The bisecting K-means algorithm starts with a single cluster of all the documents and works in the following manner:

1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic K-means algorithm (the bisecting step).
3. Repeat step 2, the bisecting step, for a fixed number of times and take the split that produces the clustering with the highest overall similarity. (For each cluster, its similarity is the average pairwise document similarity, and we seek to maximize that sum over all clusters.)
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
We found little difference between the possible methods for selecting a cluster to split and chose to split the largest remaining cluster.
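To make the procedure concrete, here is a minimal NumPy sketch of the algorithm described above, assuming documents are given as rows of a tf-idf matrix with no all-zero rows; the function names (bisect_cluster, bisecting_kmeans) and parameters such as n_trials are our own illustrative choices, not from the paper.

```python
import numpy as np

def bisect_cluster(X, n_trials=5, n_iter=20, rng=None):
    """Split the rows of X (unit-length document vectors) into 2 sub-clusters
    with basic 2-means, keeping the best of n_trials random restarts."""
    rng = rng or np.random.default_rng(0)
    best_labels, best_sim = None, -np.inf
    for _ in range(n_trials):
        # seed with two distinct random documents
        centroids = X[rng.choice(len(X), size=2, replace=False)].copy()
        for _ in range(n_iter):
            # for unit vectors, the dot product is the cosine similarity
            labels = (X @ centroids.T).argmax(axis=1)
            for k in (0, 1):
                if np.any(labels == k):          # guard against empty clusters
                    centroids[k] = X[labels == k].mean(axis=0)
        # overall similarity = sum over clusters of average pairwise cosine
        # similarity; for unit vectors this equals the squared centroid norm
        sim = sum(np.linalg.norm(X[labels == k].mean(axis=0)) ** 2
                  for k in (0, 1) if np.any(labels == k))
        if sim > best_sim:
            best_sim, best_labels = sim, labels
    return best_labels

def bisecting_kmeans(X, k, n_trials=5):
    """Repeatedly bisect the largest remaining cluster until k clusters exist.
    Returns a list of index arrays, one per cluster."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine normalization
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()                      # split the largest cluster
        labels = bisect_cluster(X[largest], n_trials=n_trials)
        clusters += [largest[labels == 0], largest[labels == 1]]
    return clusters
```

For example, bisecting_kmeans(tfidf_matrix, k=16) would partition the documents into 16 clusters, each pass touching only the documents of the cluster being split.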
4. Agglomerative Hierarchical Clustering
We used three agglomerative hierarchical clustering techniques, which differ in how they define the similarity of two clusters.

Intra-Cluster Similarity Technique: This hierarchical technique looks at the similarity of the documents within a cluster; the choice of which pair of clusters to merge is made by determining which pair of clusters will lead to the smallest decrease in similarity.

Centroid Similarity Technique: This hierarchical technique defines the similarity of two clusters to be the cosine similarity between the centroids of the two clusters.

UPGMA: This is the UPGMA scheme as described in [2]. It defines the cluster similarity as follows, where $d_1$ and $d_2$ are documents in $cluster_1$ and $cluster_2$, respectively:

$$\mathrm{similarity}(cluster_1, cluster_2) = \frac{\sum_{d_1 \in cluster_1,\; d_2 \in cluster_2} \cos(d_1, d_2)}{\mathrm{size}(cluster_1) \times \mathrm{size}(cluster_2)}$$
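As a concrete illustration, the latter two similarity definitions can each be written in a few lines of NumPy; this is a sketch assuming clusters are given as matrices of unit-length document vectors (one row per document), and the function names are our own.

```python
import numpy as np

def centroid_similarity(A, B):
    """Cosine similarity between the centroids of two clusters."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    return (ca @ cb) / (np.linalg.norm(ca) * np.linalg.norm(cb))

def upgma_similarity(A, B):
    """UPGMA similarity: the sum of all pairwise cosine similarities between
    the two clusters, divided by size(cluster1) * size(cluster2)."""
    return (A @ B.T).sum() / (len(A) * len(B))
```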
For K-means we used a standard K-means algorithm and a variant of K-means, bisecting K-means. Our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested. In addition, the run time of bisecting K-means is very attractive when compared to that of agglomerative hierarchical clustering techniques: O(n) versus O(n²).

Table 2: Comparison of the F-measure

Data Set   Bisecting K-means   UPGMA
re0        0.5863              0.5859
re1        0.7067              0.6855
wap        0.6750              0.6434
tr31       0.8869              0.8693
tr45       0.8080              0.8528
fbis       0.6814              0.6717
la1        0.7856              0.6963
la2        0.7882              0.7168
5. Results
In this paper we only compare the clustering algorithms when used to create hierarchical clusterings of documents, and only report results for the hierarchical algorithms and bisecting K-means. (The standard K-means and flat clustering results can be found in [6].) Figure 1 shows an example of our entropy results, which are more fully reported in [6]. Table 2 shows the comparison between F values for bisecting K-means and UPGMA, the best hierarchical technique.

We state the two main results succinctly. First, bisecting K-means is better than regular K-means and UPGMA in most cases; even in the cases where other schemes are better, bisecting K-means is only slightly worse. Second, regular K-means, although worse than bisecting K-means, is generally better than UPGMA. (Results not shown.)
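For reference, the F values in Table 2 can be computed with the standard clustering F-measure (class-wise precision and recall against clusters, best score per class, weighted by class size, cf. [5]); the sketch below assumes a flat partition, whereas for a hierarchical clustering the maximum is taken over the clusters at all levels of the hierarchy.

```python
import numpy as np

def clustering_f_measure(classes, clusters):
    """Overall F-measure of a clustering: for each class, take the best
    F score achieved by any cluster, then average weighted by class size.
    Both arguments are integer label arrays of the same length."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    total = 0.0
    for c in np.unique(classes):
        in_class = classes == c
        best_f = 0.0
        for j in np.unique(clusters):
            in_cluster = clusters == j
            n_ij = np.count_nonzero(in_class & in_cluster)
            if n_ij == 0:
                continue
            precision = n_ij / np.count_nonzero(in_cluster)
            recall = n_ij / np.count_nonzero(in_class)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += np.count_nonzero(in_class) / len(classes) * best_f
    return total
```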
7. Conclusions
This paper presented the results of an experimental study of some common document clustering techniques. In particular, we compared the two main approaches to document clustering, agglomerative hierarchical clustering and K-means.

REFERENCES
[1] Cutting, D., Karger, D., Pedersen, J. and Tukey, J. W., Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR '92, 318-329 (1992).
[2] Dubes, R. C. and Jain, A. K., Algorithms for Clustering Data, Prentice Hall (1988).
[3] Guha, S., Rastogi, R. and Shim, K., ROCK: A Robust Clustering Algorithm for Categorical Attributes, ICDE 15 (1999).
[4] Kowalski, G., Information Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers (1997).
[5] Larsen, B. and Aone, C., Fast and Effective Text Mining Using Linear-time Document Clustering, KDD-99, San Diego, California (1999).
[6] Steinbach, M., Karypis, G. and Kumar, V., A Comparison of Document Clustering Techniques, University of Minnesota Technical Report #00-034 (2000). http://www.cs.umn.edu/tech_reports/