
Journal of Zhejiang University-SCIENCE A (Applied Physics & Engineering), 2010 11(12):921-926

A new algorithm based on metaheuristics for data clustering*

Tsutomu SHOHDOHJI^1, Fumihiko YANO^2, Yoshiaki TOYODA^3
(^1 Department of Computer and Information Engineering, Faculty of Engineering, Nippon Institute of Technology, Gakuendai 4-1, Miyashiro-Machi, Saitama 345-8501, Japan)
(^2 Division of Integrated Sciences, J. F. Oberlin University, Tokiwa 3758, Machida, Tokyo 194-0294, Japan)
(^3 Aoyama Gakuin University, Fuchinobe 5-10-1, Sagamihara, Kanagawa 252-5258, Japan)
E-mail: [email protected]; [email protected]; [email protected]
Received Oct. 28, 2010; Revision accepted Oct. 29, 2010; Crosschecked Oct. 29, 2010

Abstract: This paper presents a new algorithm for clustering a large amount of data. We improved the ant colony clustering algorithm, which uses the swarm intelligence of ants, and tried to overcome the weaknesses of classical cluster analysis methods. The proposed algorithm makes the agents' operation more efficient and adds a new function, "cluster condensation", which reduces cluster size by uniting similar objects and absorbing them into the condensed cluster. Compared with classical cluster analysis methods, this procedure reduces the number of steps required to complete the clustering to 1% or less, and it also reduces the dispersion of the results. Moreover, cluster condensation allows clustering to proceed even in a small field. In addition, because clusters condense, the number of objects on the field decreases, so new objects can be added to the space that has become empty. In other words, the majority of the data is first put on standby and is then clustered by gradually adding parts of the standby data to the data being clustered. The method can therefore handle a large amount of data. Numerical experiments confirmed that, in principle, the proposed algorithm can be applied to an unrestricted volume of data.

Key words: Metaheuristics, Ant colony clustering, Data clustering, Swarm intelligence
doi:10.1631/jzus.A1001030    Document code: A    CLC number: TP301

1 Introduction

Classification of an enormous volume of data enables us to determine structure and relativity within the data and to identify useful information. Cluster analysis is one of the most popular data classification techniques. However, it has several drawbacks: it tends to revert to localized best-fit solutions, it requires prior specification of the number of clusters, and a natural classification of the data is difficult.

Lumer and Faieta (1994) proposed a technique called "ant colony clustering (ACC)", which is modeled on the intelligence of an ant swarm. Swarm intelligence is the result of information gathering by a very large number of individuals that perform simple processing and influence each other. Swarm intelligence is also characterized by the ability to achieve optimum processing results from a large crowd operating as a unified whole.

* Project (No. 18510132) supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research
© Zhejiang University and Springer-Verlag Berlin Heidelberg 2010
ACC can potentially address the shortcomings of existing cluster analysis techniques through the use of swarm intelligence. The purpose of this study is to modify the ACC algorithm to enhance efficiency and enable more accurate clustering.

2 Ant colony clustering

2.1 Outline of the ACC algorithm

The ACC algorithm is a clustering algorithm that imitates the burial behavior of ants, who collect their companions' corpses together in one location. Fig. 1 (Bonabeau et al., 1999) shows how a mass of ant corpses changes with the passage of time.

Fig. 1 Clustering process of ant corpses according to the passage of time (Bonabeau et al., 1999)

In the ACC algorithm, an ant corresponds to an agent and an ant corpse corresponds to an object, so the burial behavior of artificial ants is represented as data clustering. The artificial ants (agents) classify the objects by changing the place to which the corpses (data) are carried. A lattice space of a certain area is prepared as the clustering field. An agent picks up an object and moves it if there are no similar objects in the surroundings of its site; conversely, the agent puts the object down if similar objects are nearby. The agents themselves move at random in the search space (Shohdohji et al., 2007).

The clustering procedure in the ACC algorithm is as follows (a minimal sketch of this loop is given after the steps):

Step 1: Initialization. Arrange the objects and the agents on the field.
Step 2: State confirmation of site. Confirm the state of the sites surrounding each agent.
Step 3: Selection of action. The action (either pick up or put down) is determined according to the situation.
Step 4: Movement. Move at random to a site that can be reached.
Step 5: End of procedure determination. All agents execute Steps 2 through 4. The procedure ends when the number of steps specified beforehand has been executed.
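To make the loop above concrete, the following is a minimal, illustrative sketch in Python. It is not the authors' implementation: the simple neighbour-count probabilities stand in for the ACC probability functions (which are given later in Table 1), and the toroidal grid, field size, and the caller-supplied similar predicate are our simplifying assumptions for this example only.

```python
import random

def neighbour_objects(grid, pos, field_size):
    """Objects on the eight sites surrounding pos (toroidal field)."""
    x, y = pos
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx, dy) != (0, 0):
                obj = grid.get(((x + dx) % field_size, (y + dy) % field_size))
                if obj is not None:
                    out.append(obj)
    return out

def acc_clustering(objects, similar, field_size=50, n_agents=10, n_steps=100_000):
    """Schematic ACC loop (Steps 1-5 above): agents wander the grid, tending to
    pick up isolated objects and to drop carried objects near similar ones.
    `similar(a, b) -> bool` is a caller-supplied similarity test."""
    # Step 1: initialization, scatter objects and agents over the grid.
    cells = [(x, y) for x in range(field_size) for y in range(field_size)]
    random.shuffle(cells)
    grid = {cells.pop(): obj for obj in objects}
    agents = [{"pos": random.choice(cells), "load": None} for _ in range(n_agents)]

    for _ in range(n_steps):
        for agent in agents:
            pos = agent["pos"]
            here = grid.get(pos)
            nearby = neighbour_objects(grid, pos, field_size)
            # Steps 2-3: inspect the surroundings, then pick up or put down.
            if agent["load"] is None and here is not None:
                # Few similar neighbours -> high probability of picking up.
                sim = sum(similar(here, o) for o in nearby)
                if random.random() < 1.0 - sim / 8.0:
                    agent["load"] = grid.pop(pos)
            elif agent["load"] is not None and here is None:
                # Many similar neighbours -> high probability of putting down.
                sim = sum(similar(agent["load"], o) for o in nearby)
                if random.random() < sim / 8.0:
                    grid[pos] = agent["load"]
                    agent["load"] = None
            # Step 4: move at random to one of the eight adjacent sites.
            dx, dy = random.choice([(-1, -1), (-1, 0), (-1, 1), (0, -1),
                                    (0, 1), (1, -1), (1, 0), (1, 1)])
            agent["pos"] = ((pos[0] + dx) % field_size, (pos[1] + dy) % field_size)
    return grid  # Step 5: stop after the prescribed number of steps.
```

In practice, similar could be, for example, an equality test on class labels or a thresholded distance between attribute vectors.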
2.2 Deficiencies in the ACC algorithm

The ACC algorithm forms clusters by moving objects around the search field. Consequently, a number of problems connected with the notion of physical distance on the search field remain unsolved.

1. Number of necessary steps. The ACC algorithm requires an enormous number of steps to complete clustering. If hundreds of data items are involved, the number of steps required can reach several million, depending on the parameter settings. This is thought to be largely attributable to the agent's locomotion strategy: because an agent moves to one of the eight adjacent sites at random with equal probability, it frequently shuttles back and forth within a narrow range, which is an inefficient way to reach remote locations. Moreover, it is considered a highly unnatural operation for clustering, despite having been observed as natural behavior in the ant world.

2. Clustering accuracy. Clustering accuracy is governed by the distance between clusters and by cluster size, and both can adversely affect it. The initial formation of clusters is thought to influence the final clustering configuration: in a place with many similar objects, there is a high probability that an agent will put down the object it carries, and a cluster with a significantly large initial size therefore often dominates. These factors can lead to the following phenomena: (1) objects of different kinds are mixed together; (2) objects are surrounded by a certain cluster and cannot escape. These phenomena are attributable to the distance between clusters: clusters of different kinds that lie close together are difficult to tell apart, while clusters of the same kind that lie far apart are hard to merge into one.

3. Restriction of the search domain by field area. When the clustering field is limited, a limitation is imposed on the search domain. It is necessary both to collect objects from remote locations and to prepare a field with sufficient area, wide relative to the quantity of data, so that there is time for clusters to be integrated. Furthermore, data in excess of the field area cannot be processed. The distance problem is not solved merely by preparing a wide field, and doing so greatly decreases the efficiency of clustering.

3 Our proposed algorithm

Algorithms describing phenomena such as the division of labor among agents and the distribution of global information are regarded as improvements to the current ACC technique. The algorithm proposed in this study contains a number of improvements that are designed to preserve the essential characteristics of swarm intelligence.

3.1 More efficient ant movement

The efficiency improvement of ant movement makes the agents' movement more natural and thereby improves the efficiency of clustering. Based on observations of actual ants, we added the following improvements to agent movement. Table 1 shows the differences in agent movement between the ACC algorithm (Lumer and Faieta, 1994) and our proposed algorithm; a sketch of the modified movement and probability rules is given after the table.

1. Movement in three forward directions. To reproduce the natural movement of an ant, we set the probability of going straight ahead to 50% and the probabilities of moving diagonally to the left and to the right to 25% each; that is, the artificial ant gives priority to straight advancement. These probabilities are based on observations of actual ants' movements. The range within which the ant can recognize objects is the 3×3 sites ahead of it, and when the agent puts down an object, its traveling direction is reversed, which gives it visual confirmation of the road already traveled behind it. Moreover, advancing requires the object to be placed ahead, where there are already many similar objects. To avoid this situation, the ant is allowed to move only in the three forward directions.

2. Object priority movement. In the natural world, ants do not move completely at random; rather, they continuously search out the direction of travel with their feelers. Ants are thought to exhibit this searching behavior irrespective of whether there are obstructions in the traveling direction or whether what they are searching for actually exists. We therefore assumed that an object within the ant's view has been confirmed beforehand, so that the ant moves toward that object's location by priority. We believe that this improves the efficiency of object searching.

3. Unification of the probability equations. In the previous method, the decision of where to put down an object was made after, and separately from, the decision of whether to pick it up, each with its own probability. Under this scheme, when an agent (artificial ant) picks up an object, the surrounding objects may be judged dissimilar, yet when it later puts the object down in the same place they may be judged similar, which is very unnatural. In addition, fixed criteria are imposed by the many parameters and thresholds such as k_1 and k_2, which can weaken the swarm intelligence characteristic. To overcome these problems, the two expressions used to calculate the degree of similarity of the data were integrated into one expression, and some further improvements were added.

Table 1 Comparison of the two algorithms in agent movement
ACC algorithm (Lumer and Faieta, 1994):
  Direction of movement: the surrounding 8 sites
  Ant's view: the surrounding 5×5 sites
  Probability of picking up an object: P_p = [k_1/(k_1 + f)]^2
  Probability of putting down an object: P_d = 2f if f < k_2, and 1 otherwise
  where f = (1/s^2) Σ_{O_j} [1 − d(O_i, O_j)/α]
Our proposed algorithm:
  Direction of movement: the 3 forward sites
  Ant's view: the 3×3 sites ahead
  Unified probability: P_j = 1 − (1/s^2) Σ_{n=1}^{s^2} P_n, with P_n = d(O_i, O_j)^2/α^2
In the probability equations, s is the range (s×s) of the view site, d is the degree of similarity (distance) between objects, O_i is the object that the agent is carrying, O_j is a compared object, and k_1, k_2, and α are parameters.
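The extraction-damaged Table 1 makes the exact unified expression hard to recover, so the sketch below should be read as our illustrative interpretation only: a three-forward-direction move with the 50/25/25 probabilities, and a single value P_j = 1 − (1/s^2)·ΣP_n with P_n = (d(O_i, O_j)/α)^2, which is one plausible reading of the garbled formula. The function names and the clamping are ours.

```python
import random
import math

# The eight grid directions in counter-clockwise order.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def next_step(heading_idx):
    """Three-forward-direction movement: straight ahead with probability 0.5,
    diagonally forward-left or forward-right with probability 0.25 each."""
    r = random.random()
    if r < 0.5:
        turn = 0       # keep going straight
    elif r < 0.75:
        turn = 1       # forward-left (one step counter-clockwise)
    else:
        turn = -1      # forward-right (one step clockwise)
    new_idx = (heading_idx + turn) % 8
    return new_idx, DIRS[new_idx]

def unified_probability(carried, neighbours, alpha=0.5, s=3):
    """One reading of the unified rule in Table 1:
    P_j = 1 - (1/s^2) * sum_n P_n with P_n = (d(O_i, O_j)/alpha)^2,
    where d is the Euclidean distance between attribute vectors.
    Empty view sites simply contribute nothing to the sum here (a simplification)."""
    total = 0.0
    for other in neighbours:
        d = math.dist(carried, other)
        total += (d / alpha) ** 2
    p = 1.0 - total / (s * s)
    return min(1.0, max(0.0, p))  # clamp to a valid probability
```

One natural way to use such a unified value, not spelled out in the paper, is that a loaded agent puts its object down with probability P_j while an empty-handed agent picks an object up with probability 1 − P_j.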
3.2 Condensation of clusters

Condensation of clusters unites objects under specific conditions, so that the ratio of the field area occupied by a cluster decreases. In the ACC algorithm, the distance between objects on the field has no meaning; only the distinction of whether objects belong to the same kind of cluster matters. In the proposed algorithm, an object group deemed to be the same kind of cluster is compactly condensed. As a result, two or more objects (i.e., a cluster) are treated as one object, so whole clusters can be moved. This improves both clustering efficiency and precision. Moreover, the small cluster size virtually eliminates the problem of distance between clusters, so there is the further advantage that an unnecessarily wide field relative to the number of objects need not be prepared. The process of cluster condensation is described below.

1. Conditions. The cluster condensation proposed in this study is a function that performs additional processing after the agent has decided where to put down the object. First, the algorithm compares each object in the agent's view site with the carried object and determines the degree of similarity. When all objects in the view site are judged to be similar, the carried object is condensed into the cluster. If the objects in the agent's view form one cluster, the carried object will resemble all of them; this approach is based on the principle that the object carried by the agent should then belong to that cluster.

2. Condensation processing. The actual processing of cluster condensation is an operation that unites the "object with the highest degree of similarity in the view site" and the "object that the agent carries". The attribute value of the cluster rendered as an object by cluster condensation (hereafter, the condensation object) is obtained from Eq. (1):

x(O_i + O_j) = [x(O_i)·n(O_i) + x(O_j)·n(O_j)] / [n(O_i) + n(O_j)],   (1)

where x is the attribute value of the object and n is the number of objects united thus far.
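As an illustration of Eq. (1), the sketch below merges two (possibly already condensed) objects by taking the count-weighted average of their attribute vectors. The CondensedObject class and its field names are ours, introduced only for this example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CondensedObject:
    """A (possibly condensed) object: an attribute vector plus the number
    of original objects it already represents (n = 1 for a raw object)."""
    x: List[float]
    n: int = 1

def condense(oi: CondensedObject, oj: CondensedObject) -> CondensedObject:
    """Unite two objects according to Eq. (1): the new attribute value is the
    average of the two attribute values, weighted by their object counts."""
    total = oi.n + oj.n
    merged = [(a * oi.n + b * oj.n) / total for a, b in zip(oi.x, oj.x)]
    return CondensedObject(x=merged, n=total)

# Example: condensing a raw object into a cluster that already unites 4 objects.
cluster = CondensedObject(x=[0.40, 0.60], n=4)
new_obj = CondensedObject(x=[0.50, 0.50], n=1)
print(condense(cluster, new_obj))   # -> x ≈ [0.42, 0.58], n = 5
```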
3.3 Improved version of cluster condensation

The improved version of the cluster condensation algorithm contains partial modifications to the cluster condensation procedure described above. Objects that have been judged similar a fixed number of times, using the same technique for determining the degree of similarity as in the improved algorithm, are condensed with each other. We believe that this enables faster cluster condensation. The procedure is described below.

1. Comparison of objects in the view site. The comparison of an object in the view site (O_j) with the carried object (O_i) is performed as in the previous algorithm. The similarity-judgment count of objects O_j and O_i is incremented by one whenever they are judged to be similar. This operation is executed for all objects that exist in the view site.

2. Condensation processing. If the similarity count reaches the fixed number when the objects are compared, objects O_j and O_i are united. Condensation processing at this point is the same as in the standard cluster condensation.

3. Relocation processing. An object is relocated if condensation processing has not been completed when the comparison of objects ends, and if it is judged to resemble one of the objects that already exist. The improved algorithm differs greatly from the previous algorithms in that the object is not put down in place but is instead placed at random on a vacant site within the field. This is designed to minimize the influence of the initial placement on where the cluster forms and where the object is positioned.

3.4 Addition of objects

The addition of objects involves adding new datasets in turn as additional clustering targets, as cluster condensation causes the number of objects existing on the field to decrease. Suppose that 10 000 datasets are to be classified. The datasets are not all classified at the same time, but rather are classified incrementally by adding groups of 500-1000 datasets to the field (a minimal sketch of this standby-queue scheme is given at the end of this subsection). The restriction on the quantity of data that can be processed constitutes a major flaw in the current ACC algorithm; the addition of this function effectively lifts that restriction. This function also reduces the processing load by enabling classification with a narrow field and a small number of agents. To boost classification accuracy, it is useful to classify a large amount of data using a small number of agents. Despite the improved efficiency of classification, however, the level of accuracy is still in question, since the number of objects being transported increases in direct proportion to the number of agents; this can be illustrated by imagining the extreme example of classifying 100 objects with 100 agents. As objects can be added at any time, this algorithm can also be adapted to a dynamically changing database with parallel distributed processing.
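The following sketch illustrates the standby-queue idea of Section 3.4: most of the data waits outside the field, and a new batch is released whenever condensation has freed enough space. The batch size, the free-space threshold, and the field, run_steps, free_ratio and add_objects names in the commented driving loop are hypothetical placeholders for this illustration, not part of the paper.

```python
from collections import deque
from typing import Iterable, List, Sequence

def standby_batches(all_data: Sequence, batch_size: int = 500) -> Iterable[List]:
    """Split the data into batches that are released onto the field one at a time.
    The caller pulls the next batch only when cluster condensation has freed
    enough space on the field (Section 3.4)."""
    standby = deque(all_data)
    while standby:
        yield [standby.popleft() for _ in range(min(batch_size, len(standby)))]

# Sketch of a driving loop (the field object and its methods are hypothetical):
#
#   batches = standby_batches(dataset, batch_size=500)
#   field.add_objects(next(batches))
#   while True:
#       field.run_steps(10_000)           # ACC with cluster condensation
#       if field.free_ratio() > 0.5:      # condensation has freed space
#           try:
#               field.add_objects(next(batches))
#           except StopIteration:
#               break
```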
4 Numerical experiment

4.1 Numerical experiment

We verified the efficiency of the proposed algorithm under the conditions listed in Table 2. Fig. 2 shows the attribute value data used for the numerical experiment; the data were given an arbitrary variance so that they form two clusters. Four algorithms were used in the experiment: the ACC algorithm, efficiency improvement of movement (EIM), cluster condensation (CC), and the improved version of cluster condensation (CC2). We compared the number of steps needed to complete clustering, as recorded for each algorithm.

Table 2 Parameters used in the numerical experiment
Parameter                    Value
Number of experiments        10
Number of ants               30
Number of objects            400
Field size (grid)            100×100
Threshold of ACC algorithm   k_1=0.1, k_2=0.1, α=0.5
Threshold of CC2 algorithm   α=0.1

Fig. 2 Scatter diagram of data used for the numerical experiment (attribute X versus attribute Y)

4.2 Results and discussion of the numerical experiment

Table 3 shows the results of the numerical experiment. The number of steps declines dramatically when the movement is made more efficient. In the ACC algorithm, clusters of the same kind split into multiple forms at an early stage, and it is then difficult to progress toward the optimal clustering. If the movement of the agents is inefficient, it takes a long time to reach the optimal solution and it is difficult to combine clusters that are separated by an interval. From the above, it is understood that the improvement of the agents' operation contributes greatly to the efficiency of clustering.

Table 3 Number of steps after clustering for each algorithm (×10^3)
Algorithm   Maximum   Minimum   Average
ACC         7420      1860      4410
EIM         1016      430       657
CC          223       128       177
CC2         39.7      35.6      37.1
ACC: ant colony clustering; EIM: efficiency improvement of movement; CC: cluster condensation; CC2: improved version of cluster condensation

As a result, it can be seen that the improved efficiency of movement contributes significantly to the rate of clustering. In addition, the algorithm has been further improved by adding the cluster condensation function. The problem of objects from different types of clusters being left inside a cluster is not resolved solely through more efficient agent movement. The improved algorithm classifies data faster than the ACC algorithm; however, it was repeatedly observed that clusters of the same type, when formed at separated positions, did not merge into a single cluster at the final stage of the new algorithm. These problems are almost completely resolved by cluster condensation, since the cluster size decreases as the cluster condenses. Furthermore, clustering was completed even more quickly with CC2. Since CC2 provides virtually the same level of accuracy, it can be used to generate condensation decisions more quickly than the ACC algorithm. The dispersion of the measurements has also been largely eliminated.

4.3 Summary of the numerical experiment

Classification speed and accuracy were found to be significantly better following the improvements in agent movement and cluster condensation. However, several new problems were identified. For instance, the number of objects on the field decreases too much when the condensation of clusters reaches its limit; because a cluster is then formed from only a few objects, clusters become difficult to distinguish. In some cases, a phenomenon occurs whereby no cluster exists on the field because the agents have picked up all the objects that should form a single cluster. This can be attributed to an excessive drop in the number of objects and is thought to be avoidable by ensuring that the number of objects remains above a fixed minimum.
5 Conclusions

Algorithm performance increased markedly through the introduction into the ACC algorithm of more efficient movement, improved cluster condensation, and the addition-of-objects function. Despite the performance improvement, we have identified three key issues that still need to be addressed:
1. The reliance on monitoring to determine the timing of clustering completion.
2. The necessity of preparing a field of a suitable scale beforehand.
3. The lack of a definition of the range considered to be a cluster.

Giving each agent a random value, and methods such as selectively deciding actions from the agent's current experience, could also be used to make the best use of swarm intelligence. Incorporating concepts such as the genetic algorithm and the immune algorithm would undoubtedly produce an even more efficient algorithm. Similarly, classifying data while sharing information involves parallel distributed processing using two or more clients, i.e., ACC swarm clustering, which could potentially double the benefits of swarm intelligence via distribution of the processing load. In the near future, we hope to resolve the remaining problems and further refine the algorithm into a general-purpose algorithm suitable for application to real-life problems.

Acknowledgments

The authors thank Mr. Natsuki SAMURA (Panasonic System Networks Software Co., Ltd., Japan) for his helpful comments and suggestions.

References

Bonabeau, E., Dorigo, M., Theraulaz, G., 1999. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, USA.
Lumer, E.D., Faieta, B., 1994. Diversity and Adaptation in Populations of Clustering Ants. Proceedings of the 3rd International Conference on the Simulation of Adaptive Behavior, p.501-508.
Shohdohji, T., Samura, N., Yano, F., Toyoda, Y., 2007. An Improvement of Ant Colony Clustering Algorithm Based on Ant Behavior. Proceedings of the 37th International Conference on Computers and Industrial Engineering, p.13-21.