
International Journal of Engineering Research in Computer Science and Engineering (IJERCSE), Vol 3, Issue 3, March 2016

Enhanced Efficient K-Means Clustering Algorithm

[1] A. Avinash Goud, [2] K. Abdul Basith, [3] Prasad B
[1] II/IV, [2][3] Associate Professor
[1][2][3] Department of CSE, Marri Laxman Reddy Institute of Technology and Management (MLRITM), Hyderabad

Abstract: This paper presents a novel algorithm for performing k-means clustering. It organizes all the patterns in a k-d tree structure so that all the patterns closest to a given prototype can be found efficiently. The main intuition behind the approach is as follows. At the root level, all the prototypes are potential candidates for the closest prototype. For the children of the root node, however, we may be able to prune the candidate set by using simple geometrical constraints. This approach can be applied recursively until the size of the candidate set is one for each node. Experimental results demonstrate that the scheme can improve the computational speed of the direct k-means algorithm by one to two orders of magnitude in the total number of distance calculations and in the overall time of computation.

Keywords: Clustering, K-Means, Tree Traversal Algorithm, Pruning Algorithm

I. INTRODUCTION

Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data. Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters, such that patterns in the same cluster are alike and patterns belonging to two different clusters are different. Clustering has been a widely studied problem in a variety of application domains, including neural networks, AI, and statistics. Several algorithms have been proposed in the literature for clustering: ISODATA [8, 3], CLARA [8], CLARANS [10], focusing techniques [5], P-CLUSTER [7], DBSCAN [4], Ejcluster [6], BIRCH [14], and GRIDCLUS [12]. Among these algorithms, k-means is preferred as the clustering technique because of its advantages: it is effective and procedurally simple, it is easy to implement since it is distance based, and it is order independent.

The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, a direct implementation of the k-means method requires time proportional to the product of the number of patterns and the number of clusters per iteration, which is computationally very expensive, especially for large datasets. We propose a novel algorithm for implementing the k-means method. Our algorithm produces the same or comparable (due to round-off errors) clustering results as the direct k-means algorithm, and it has significantly better performance than the direct k-means algorithm in most cases.

II. K-MEANS CLUSTERING

The number of clusters k is assumed to be fixed in k-means clustering.
Let the k prototypes (w_1, w_2, ..., w_k) be initialized to one of the n input patterns (i_1, i_2, ..., i_n). Therefore,

$$w_j = i_l, \quad j \in \{1, 2, \dots, k\}, \; l \in \{1, 2, \dots, n\}$$

Algorithm 1: Direct k-means clustering

function Direct-k-means()
    Initialize k prototypes (w_1, w_2, ..., w_k) such that w_j = i_l, j ∈ {1, ..., k}, l ∈ {1, ..., n}
    Each cluster C_j is associated with prototype w_j
    Repeat
        for each input vector i_l, where l ∈ {1, ..., n}, do
            Assign i_l to the cluster C_j with the nearest prototype w_j
            (i.e., |i_l - w_j| ≤ |i_l - w_j'| for all j' ∈ {1, ..., k})
        for each cluster C_j, where j ∈ {1, ..., k}, do
            Update the prototype w_j to be the centroid of all samples currently in C_j,
            so that w_j = Σ_{i_l ∈ C_j} i_l / |C_j|
        Compute the error function E
    Until E does not change significantly or cluster membership no longer changes

Algorithm 1 shows a high-level description of the direct k-means clustering algorithm. C_j is the j-th cluster, whose value is a disjoint subset of the input patterns. The quality of the clustering is determined by the following error function:

$$E = \sum_{j=1}^{k} \sum_{i_l \in C_j} |i_l - w_j|^2$$

The appropriate choice of k is problem and domain dependent, and a user generally tries several values of k. Assuming that there are n patterns, each of dimension d, the computational cost of a direct k-means algorithm per iteration (of the repeat loop) can be decomposed into three parts:
1. The time required for the first for loop of the algorithm is O(nkd).
2. The time required for calculating the centroids (the second for loop of the algorithm) is O(nd).
3. The time required for calculating the error function is O(nd).

The number of iterations required can vary from a few to several thousand, depending on the number of patterns, the number of clusters, and the input data distribution. Thus, a direct implementation of the k-means method can be computationally very intensive. This is especially true for typical data mining applications with a large number of pattern vectors.

There are two main approaches described in the literature which can be used to reduce the overall computational requirements of the k-means clustering method, especially for the distance calculations:
1. Use the information from the previous iteration to reduce the number of distance calculations. P-CLUSTER is a k-means-based clustering algorithm which exploits the fact that only a few assignments of patterns to clusters change after the first few iterations [7]. It uses a heuristic which determines, by simple checks, whether the closest prototype of a pattern q has changed; if the assignment has not changed, no further distance calculations are required. It also uses the fact that the movement of the cluster centroids between consecutive iterations is small (especially after the first few iterations).
2. Organize the prototype vectors in a suitable data structure so that finding the closest prototype for a given pattern becomes more efficient [11, 13]. This reduces to the nearest-neighbor problem for a given pattern in the prototype space. The number of distance calculations using this approach is proportional to n * f(k, d) per iteration. For many applications, such as vector quantization, the prototype vectors are fixed. This allows for the construction of optimal data structures for finding the closest vector for a given input test pattern [11]. However, these optimizations are not applicable to the k-means algorithm, as the prototype vectors change dynamically. Further, it is not clear how these optimizations can be used to reduce the time for calculating the error function (which becomes a substantial component after the reduction in the number of distance calculations).
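For concreteness, the following is a minimal Python/NumPy sketch of the direct k-means procedure of Algorithm 1; the function name, the random choice of initial prototypes, and the convergence tolerance are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def direct_kmeans(patterns, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal sketch of Algorithm 1 (direct k-means) on an (n, d) pattern array."""
    patterns = np.asarray(patterns, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = patterns.shape
    # Initialize the k prototypes to k of the n input patterns.
    prototypes = patterns[rng.choice(n, size=k, replace=False)].copy()
    prev_error = np.inf
    for _ in range(max_iter):
        # First for loop: assign every pattern to its nearest prototype, O(nkd).
        dists = np.linalg.norm(patterns[:, None, :] - prototypes[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Second for loop: move each prototype to the centroid of its cluster, O(nd).
        for j in range(k):
            members = patterns[assignment == j]
            if len(members) > 0:
                prototypes[j] = members.mean(axis=0)
        # Error function: sum of squared distances to the assigned prototypes, O(nd).
        error = float(np.sum((patterns - prototypes[assignment]) ** 2))
        if abs(prev_error - error) < tol:
            break
        prev_error = error
    return prototypes, assignment, error
```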
III. ENHANCED K-MEANS ALGORITHM

The main intuition behind this approach is as follows. At the root level, all the prototypes are potential candidates for the closest prototype. For the children of the root node, however, we may be able to prune the candidate set by using simple geometrical constraints. Clearly, each child node will potentially have a different candidate set. Further, a given prototype may belong to the candidate sets of several child nodes. This approach can be applied recursively until the size of the candidate set is one for each node. At that stage, all the patterns in the subspace represented by the subtree have the sole remaining candidate as their closest prototype.

Using this approach, we expect the number of distance calculations for the first loop (in Algorithm 1) to be proportional to n * F(k, d), where F(k, d) is much smaller than f(k, d). This is because, in most cases, the distance calculations have to be performed only with internal nodes (each representing many patterns) and not with the patterns themselves. This approach can also be used to significantly reduce the time required for calculating the prototypes for the next iteration (the second for loop in Algorithm 1); we also expect that time requirement to be proportional to n * F(k, d).

The improvements obtained using this approach depend crucially on good pruning methods for obtaining the candidate sets for the next level. We propose the following strategy:
- For each candidate w_i, find the minimum and maximum distances to any point in the subspace.
- Find the minimum of the maximum distances; call it MinMax.
- Prune out all candidates whose minimum distance is greater than MinMax.

This strategy guarantees that no candidate is pruned if it can potentially be closer than any other candidate prototype to some point of the given subspace.

The proposed algorithm is based on organizing the pattern vectors so that all the patterns closest to a given prototype can be found efficiently. In the first phase of the algorithm, we build a k-d tree to organize the pattern vectors. The root of this tree represents all the patterns, while the children of the root represent subsets of the patterns completely contained in subspaces (boxes). The nodes at lower levels represent smaller boxes. For each node of the tree, we keep the following information:
1. The number of points, m.
2. The linear sum of the points, $LS = \sum_{i=1}^{m} p_i$.
3. The square sum of the points, $SS = \sum_{i=1}^{m} p_i^2$.

Let the number of dimensions be d and the depth of the k-d tree be D. The extra time and space required for maintaining this information at each node is proportional to O(nd). Computing the medians at the D levels takes O(nd) time [2]; these medians are needed for splitting the internal nodes of the tree. Therefore, the total time requirement for building the tree, such that each internal node at a given level represents the same number of elements, is O(n(d + D)).
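Below is a minimal Python/NumPy sketch of a k-d tree node that carries the per-node statistics described above (point count m, linear sum LS, and square sum SS). It splits along the longest box dimension at its midpoint, one of the splitting choices discussed next, and uses the bounding box of the points as the node's box; the class and function names, the default leaf size, and these simplifications are illustrative assumptions.

```python
import numpy as np

class KDNode:
    """k-d tree node storing its bounding box and the statistics m, LS and SS."""
    def __init__(self, points, leaf_size=64):
        self.m = len(points)                      # number of points in the node
        self.LS = points.sum(axis=0)              # linear sum of the points
        self.SS = float(np.sum(points ** 2))      # square sum of the points
        self.lower = points.min(axis=0)           # box bounds along each dimension
        self.upper = points.max(axis=0)
        self.points = points                      # retained only at the leaves
        self.children = []
        extent = self.upper - self.lower
        if self.m > leaf_size and np.any(extent > 0):
            # Split along the longest box dimension at its midpoint
            # (the midpoint-based choice described in the text).
            dim = int(np.argmax(extent))
            mid = 0.5 * (self.lower[dim] + self.upper[dim])
            left = points[:, dim] <= mid
            if left.any() and (~left).any():
                self.children = [KDNode(points[left], leaf_size),
                                 KDNode(points[~left], leaf_size)]
                self.points = None                # interior node: points live in children

def build_kdtree(patterns, leaf_size=64):
    """Phase one: organize the (n, d) pattern array in a k-d tree."""
    return KDNode(np.asarray(patterns, dtype=float), leaf_size)
```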
For building the k-d tree, there are several competing choices which affect the overall structure:
1. Choice of the dimension used for performing the split. One option is to choose a common dimension across all the nodes at the same level of the tree; the dimensions are then chosen in a round-robin fashion for the different levels as we go down the tree. The second option is to split along the dimension with the longest length.
2. Choice of the splitting point along the chosen dimension. We tried two approaches, based on choosing either the central splitting point or the median splitting point. The former divides the splitting dimension into two equal parts (by width), while the latter divides the dimension such that there are equal numbers of patterns on either side. We refer to these as the midpoint-based and median-based approaches, respectively. Clearly, the cost of the median-based approach is slightly higher, as it requires calculation of the median.
3. We have empirically investigated the effect of these choices on the overall performance of our algorithm. The results show that splitting along the longest dimension with a midpoint-based splitting point is preferable [1].

In the second phase of the k-means algorithm, the initial prototypes are derived. Just as in the direct k-means algorithm, these initial prototypes are generated randomly or drawn randomly from the dataset. In the third phase, the algorithm performs a number of iterations (as in the direct algorithm) until a termination condition is met. For each cluster i we maintain the number of points Cn_i, the linear sum of the points CLS_i, and the square sum of the points CSS_i.

Algorithm 2: Tree traversal algorithm

function TraverseTree(node, p, l, d)
    Alive = Pruning(node, p, l, d)
    if |Alive| = 1 then
        /* All the points in node belong to the alive cluster */
        Update the centroid's statistics based on the information stored in the node
        return
    if node is a leaf then
        for each point in node do
            Find the nearest prototype p_i
            Assign the point to p_i
            Update the centroid's statistics
        return
    for each child of node do
        TraverseTree(child, Alive, |Alive|, d)

In each iteration, we traverse the k-d tree using a depth-first strategy (Algorithm 2) as follows. We start from the root node with all k candidate prototypes. At each node of the tree, we apply a pruning function to the candidate prototypes. If the number of candidate prototypes becomes equal to one, the traversal below that internal node is not pursued: all the points belonging to that node have the surviving candidate as their closest prototype, and the cluster statistics are updated based on the number of points, linear sum, and square sum stored at that internal node. If more than one candidate prototype remains at a leaf node, one iteration of the direct k-means algorithm is performed on the candidate prototypes and the points of the leaf node.

Algorithm 3: Pruning algorithm

function Pruning(subtree, p, l, d)
    Alive = p
    for each prototype p_i ∈ p do
        Compute the minimum (min_i) and maximum (max_i) distances to any point in the box representing the subtree
    Find the minimum of the max_i over all candidates; call it MinMax
    for each prototype p_i ∈ p do
        if min_i > MinMax then Alive = Alive - {p_i}
    return Alive

An example of the pruning achieved by our algorithm is shown in Figure 1.
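The following Python sketch shows the tree traversal of Algorithm 2, assuming the KDNode class sketched earlier; the per-cluster statistics are kept as a list of dictionaries, and the pruning function is passed in as a callable (a matching sketch of it is given after the distance formulas below). These names and data layouts are illustrative assumptions.

```python
import numpy as np

def traverse_tree(node, candidates, prototypes, stats, prune):
    """Sketch of Algorithm 2: update per-cluster statistics (Cn, CLS, CSS)."""
    alive = prune(node, candidates, prototypes)
    if len(alive) == 1:
        # Every point in this box has the sole survivor as its closest prototype,
        # so that cluster's statistics are updated from the node's stored m, LS, SS.
        j = alive[0]
        stats[j]["Cn"] += node.m
        stats[j]["CLS"] = stats[j]["CLS"] + node.LS
        stats[j]["CSS"] += node.SS
        return
    if not node.children:
        # Leaf with several candidates left: one direct assignment pass over its points.
        d = np.linalg.norm(node.points[:, None, :] - prototypes[alive][None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for local, j in enumerate(alive):
            members = node.points[nearest == local]
            stats[j]["Cn"] += len(members)
            stats[j]["CLS"] = stats[j]["CLS"] + members.sum(axis=0)
            stats[j]["CSS"] += float(np.sum(members ** 2))
        return
    for child in node.children:
        traverse_tree(child, alive, prototypes, stats, prune)
```

In this sketch, the statistics would be reset at the start of each iteration (e.g., stats = [{"Cn": 0, "CLS": np.zeros(d), "CSS": 0.0} for _ in range(k)]) and the traversal started from the root with the full list of k candidate indices.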
Our approach is conservative and may miss some pruning opportunities. For example, the candidate shown with a square around it in Figure 1 could be pruned with a more complex pruning strategy. However, our approach is relatively inexpensive and can be shown to require time proportional to k. Choosing a more expensive pruning algorithm may decrease the overall number of distance calculations, but possibly at the expense of a higher overall computation time due to an offsetting increase in the cost of pruning.

Figure 1: Example of the pruning achieved by our algorithm. X represents the candidate set and d is the MinMax distance. All the candidates which are circled get pruned. The candidate with a square around it is not pruned by the algorithm.

At the end of each iteration, the new set of centroids is derived and the error function is computed as follows:
1. The new centroid for cluster i is $CLS_i / Cn_i$.
2. The error function is

$$E = \sum_{i=1}^{k} \left( CSS_i - \frac{(CLS_i)^2}{Cn_i} \right)$$

The leaf size is an important parameter for tuning the overall performance of our algorithm. A small leaf size results in a larger cost for constructing the tree and increases the overall cost of pruning, as the pruning may have to be continued to lower levels. However, a small leaf size decreases the overall cost of the distance calculations for finding the closest prototype.

Calculating the minimum and maximum distances: The pruning algorithm requires the calculation of the minimum as well as the maximum distance from a given prototype to any point of a given box. It can easily be shown that the maximum distance is attained at one of the corners of the box. Let furthest_i be that corner for prototype i (p_i). The coordinates of furthest_i = (furthest_i1, furthest_i2, ..., furthest_id) can be computed as follows:

$$furthest_{ij} = \begin{cases} B_j^u & \text{if } |B_j^l - p_{ij}| \le |B_j^u - p_{ij}| \\ B_j^l & \text{otherwise} \end{cases} \qquad (1)$$

where B_j^l and B_j^u are the lower and upper coordinates of the box along dimension j. The maximum distance can then be computed as

$$dist = \sqrt{\sum_{j=1}^{d} (p_{ij} - furthest_{ij})^2}$$

A naive approach for calculating the maximum and minimum distances would perform the above calculations for each node (box) of the tree independently, which requires O(d) time per prototype. However, the box of a child node is exactly the same as that of its parent except along the one dimension used for splitting at the parent. This can be exploited to reduce the time to a constant: using the maximum distance of the prototype to the parent node, the maximum squared distance for the child node can be expressed in terms of its parent's. The computational cost of this approach is O(1) for each candidate prototype, so the overall computational requirement for a node with k candidate prototypes is O(k). The value of the minimum distance can be obtained similarly [1].
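The sketch below implements the box distance computations and the pruning step of Algorithm 3 for the KDNode sketched earlier; it uses the direct O(d) corner computation of equation (1) rather than the O(1) incremental update described above, and the function names are illustrative assumptions. It can be passed as the prune argument of the traversal sketch.

```python
import numpy as np

def max_dist_to_box(p, lower, upper):
    """Distance from prototype p to its furthest corner of the box (equation (1))."""
    furthest = np.where(np.abs(lower - p) <= np.abs(upper - p), upper, lower)
    return float(np.linalg.norm(p - furthest))

def min_dist_to_box(p, lower, upper):
    """Distance from prototype p to the closest point of the box (0 if p lies inside)."""
    closest = np.clip(p, lower, upper)
    return float(np.linalg.norm(p - closest))

def prune(node, candidates, prototypes):
    """Sketch of Algorithm 3: keep only the candidates whose minimum distance
    to the box does not exceed the MinMax distance."""
    mins = [min_dist_to_box(prototypes[i], node.lower, node.upper) for i in candidates]
    maxs = [max_dist_to_box(prototypes[i], node.lower, node.upper) for i in candidates]
    min_max = min(maxs)
    return [i for i, mn in zip(candidates, mins) if mn <= min_max]
```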
IV. RESULTS

We have evaluated our algorithm on several datasets and compared our results with the direct k-means algorithm in terms of the number of distance calculations performed and the total execution time. A direct comparison with other algorithms (such as P-CLUSTER [7] and [13]) is not feasible because their datasets and software are not available; however, we present some qualitative comparisons. All the experimental results reported are on an IBM RS/6000 running AIX version 4. The clock speed of the processor is 66 MHz and the memory size is 128 MByte.

For each dataset and number of clusters, we compute the factors FRD and FRT of reduction in distance calculations and in overall execution time over the direct algorithm, respectively, as well as the average number of distance calculations per pattern, ADC. The number of distance calculations for the direct algorithm is (k + 1)n per iteration. All time measurements are in seconds.

The main aim of this paper is to study the computational aspects of the k-means method. We used several datasets, all of which were generated synthetically; this was done to study the scaling properties of our algorithm for different values of n and k. Table 1 gives a description of all the datasets. The datasets used are as follows:
1. We used three datasets (DS1, DS2 and DS3) [14].
2. For the datasets R1 through R12, we generated k points randomly in a cube of appropriate dimensionality. Around the i-th point we generated $\frac{2ni}{k(k+1)}$ points using a uniform distribution. This results in clusters with non-uniform numbers of points.

We experimented with leaf sizes of 4, 16, 64 and 256. For most of our datasets, we found that a leaf size of 64 resulted in optimal or near-optimal performance. Further, the overall performance was not sensitive to the leaf size except when the leaf size was very small. Tables 2 and 3 present the performance of our algorithm for different numbers of clusters and iterations, assuming a leaf size of 64. For each combination used, we present the factor of reduction in overall time (FRT) and the time of the direct k-means algorithm, as well as the factor of reduction in distance calculations (FRD) and the average number of distance calculations per pattern (ADC).

Table 1: Description of the datasets. The range along each dimension is the same unless explicitly stated.
Table 2: The overall results for 10 iterations.
Table 3: The overall results for 50 iterations and 64 clusters.

These results show that our algorithm can improve the overall performance of k-means clustering by one to two orders of magnitude. The average number of distance calculations required is very small, varying from 0.17 to 11.17 depending on the dataset and the number of clusters required. The results presented in [7] show that their methods yield a factor of 4 to 5 improvement in overall computational time; our improvements are substantially better. However, we note that the datasets used are different, and a direct comparison may not be accurate.
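As an illustration of the synthetic R datasets described above, the following is a minimal sketch of such a generator; the unit-cube range, the spread around each center, and the function name are illustrative assumptions rather than parameters reported in the paper.

```python
import numpy as np

def generate_r_dataset(n, k, d, spread=0.05, seed=0):
    """Generate k random centers in a d-dimensional cube and, around the i-th
    center, roughly 2*n*i/(k*(k+1)) uniformly distributed points (i = 1..k),
    giving clusters of non-uniform size that total about n points."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 1.0, size=(k, d))
    clusters = []
    for i in range(1, k + 1):
        m = round(2 * n * i / (k * (k + 1)))
        clusters.append(centers[i - 1] + rng.uniform(-spread, spread, size=(m, d)))
    return np.vstack(clusters), centers
```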
V. CONCLUSIONS

In this paper, we presented a novel algorithm for performing k-means clustering. Our experimental results demonstrated that our scheme can improve the direct k-means algorithm by one to two orders of magnitude in the total number of distance calculations and in the overall time of computation. Several improvements to the basic strategy presented in this paper are possible. One approach would be to restructure the tree every few iterations to further reduce the value of F(k, d); the intuition here is that the earlier iterations provide partial clustering information, which can potentially be used to construct the tree so that the pruning is more effective. Another possibility is to add the optimizations related to the incremental approaches presented in [7]. These optimizations appear to be orthogonal to ours and can be used to further reduce the number of distance calculations.

REFERENCES

[1] K. Alsabti, S. Ranka, and V. Singh. An Efficient K-Means Clustering Algorithm. http://www.cise.ufl.edu/ranka/, 1997.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill Book Company, 1990.
[3] R. C. Dubes and A. K. Jain. Algorithms for Clustering Data. Prentice Hall, 1988.
[4] M. Ester, H. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of the 2nd Int'l Conf. on Knowledge Discovery and Data Mining, August 1996.
[5] M. Ester, H. Kriegel, and X. Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. Proc. of the Fourth Int'l Symposium on Large Spatial Databases, 1995.
[6] J. Garcia, J. Fdez-Valdivia, F. Cortijo, and R. Molina. Dynamic Approach for Clustering Data. Signal Processing, 44(2), 1994.
[7] D. Judd, P. McKinley, and A. Jain. Large-Scale Parallel Data Clustering. Proc. Int'l Conference on Pattern Recognition, August 1996.
[8] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[9] K. Mehrotra, C. Mohan, and S. Ranka. Elements of Artificial Neural Networks. MIT Press, 1996.
[10] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. of the 20th Int'l Conf. on Very Large Databases, Santiago, Chile, pages 144–155, 1994.
[11] V. Ramasubramanian and K. Paliwal. Fast K-Dimensional Tree Algorithms for Nearest Neighbor Search with Application to Vector Quantization Encoding. IEEE Transactions on Signal Processing, 40(3), March 1992.
[12] E. Schikuta. Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets. Proc. 13th Int'l Conference on Pattern Recognition, 2, 1996.
[13] J. White, V. Faber, and J. Saltzman. United States Patent No. 5,467,110, Nov. 1995.
[14] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data, Montreal, Canada, pages 103–114, June 1996.