A Dynamic K-Means Clustering For Data Mining
Md. Zakir Hossain1, Md. Nasim Akhtar2, R.B. Ahmad3, Mostafijur Rahman4
1,2Department of Computer Science and Engineering, Dhaka University of Engineering and Technology, Bangladesh
3Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin (UniSZA), Malaysia
4Department of Software Engineering, Daffodil International University (DIU), Bangladesh
Corresponding Author:
Md. Zakir Hossain,
Department of Computer Science and Engineering,
Dhaka University of Engineering and Technology, Gazipur, Bangladesh.
Email: [email protected]
1. INTRODUCTION
Data mining is a relatively new interdisciplinary field of computer science. It is the process of automatically discovering patterns in large databases [1]. The need for data mining has grown steadily over the past ten to fifteen years: competition in the marketplace is intense, and information that can be retrieved quickly and efficiently plays an important role in planning decisions and delivers great value to industry and society alike. In the real world, data are available in such large volumes that retrieving useful information from them is difficult, so it is of practical importance to recover the structure of the data within a given time budget. Data mining provides a way of eliminating unnecessary noise from data. It helps to extract the necessary information from a large dataset and to present it in the proper form when it is needed for a specific task. It is very helpful for analyzing market trends, searching for new technology, controlling production based on customer demand, and so on. In a word, data mining is the harvesting of knowledge from a large amount of data. Using data mining, we can predict the type or behavior of any pattern.
Cluster evaluation of data is an important task in knowledge discovery and data mining. Cluster formation is the process of creating groups of data from a large dataset based on the similarities among the data. The clustering process can be carried out in a supervised, semi-supervised, or unsupervised manner [2]. Clustering algorithms are powerful meta-learning tools for analyzing the data produced by modern applications. The purpose of clustering is to classify data into groups according to their similarities, traits, and behavior [3].
Many clustering algorithms have been proposed for the classification of data. Most of these algorithms assume that the number of clusters in a large dataset is fixed. The problem with this assumption is that if the assumed number of clusters is too small, there is a higher chance of placing dissimilar items in the same group; on the other hand, if the number of clusters is too large, there is a higher chance of placing similar data in different groups [4]. In addition, in real situations it is difficult to know the number of clusters in advance.
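To make the fixed-K assumption concrete, the following sketch (our illustration, not part of any cited algorithm) runs standard K-Means with scikit-learn; the number of clusters must be supplied as n_clusters before any data are examined:

```python
# Illustrative only: standard K-Means requires K to be fixed in advance.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                                       # 150 samples, 4 features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)   # K chosen by the user
labels = kmeans.fit_predict(X)                             # one cluster index per sample
print(len(set(labels)), kmeans.inertia_)                   # inertia_ = within-cluster SSE
```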
In this paper, we develop a dynamic K-Means clustering algorithm. The algorithm first calculates a threshold value from the data set and then groups the data without fixing the number of clusters (K). The proposed algorithm analyzes the data set with respect to this threshold value and finally clusters it. The threshold value is the key to the proposed method: it determines whether a data point belongs to an existing group or starts a new group.
3. RELATED WORK
In this section, we give a brief discussion of existing K-Means algorithms. In [2], a modified K-Means algorithm is proposed that selects the initial cluster centers so as to reduce the sensitivity to initialization. The algorithm divides the whole data space into segments and calculates the frequency of data points in each segment; the segment with the maximum frequency is used to select a centroid. In this method, the number K is defined by the user, as in the traditional K-Means algorithm, and the space is divided into k*k segments, 'k' vertically as well as 'k' horizontally.
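The segment-frequency idea can be illustrated with a hedged sketch for two-dimensional data; the grid construction, tie handling, and cell-mean seeding below are our assumptions, not details taken from [2]:

```python
import numpy as np

def grid_frequency_seeds(X, k):
    """Sketch of grid/frequency seeding: split the 2-D data space into a k x k grid,
    count the points per cell, and use the means of the k most populated cells
    as initial centroids (assumes at least k non-empty cells)."""
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=k)
    top_cells = np.argsort(counts.ravel())[::-1][:k]      # most populated cells first
    seeds = []
    for idx in top_cells:
        i, j = np.unravel_index(idx, counts.shape)
        in_cell = ((X[:, 0] >= xedges[i]) & (X[:, 0] <= xedges[i + 1]) &
                   (X[:, 1] >= yedges[j]) & (X[:, 1] <= yedges[j + 1]))
        seeds.append(X[in_cell].mean(axis=0))             # centroid of that cell
    return np.array(seeds)
```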
In [10], an improved K-Means algorithm is proposed. In this algorithm, some information about the data structure is stored in each iteration and reused in the next iteration. The method thereby avoids repeatedly calculating the distance between every data point and every cluster center, which saves running time.
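The reuse of per-point information can be sketched as follows (an assumed variant of the idea in [10]: each point remembers its assigned center and the distance to it, and the full distance computation is skipped when the point has not moved farther from that center):

```python
import numpy as np

def kmeans_with_caching(X, centers, iters=20):
    """Sketch: cache each point's distance to its assigned center and skip
    recomputing distances to all centers when that distance has not grown."""
    n, k = len(X), len(centers)
    labels = np.zeros(n, dtype=int)
    cached = np.full(n, np.inf)                       # distance to assigned center
    for it in range(iters):
        for p in range(n):
            if it > 0:
                d_old = np.linalg.norm(X[p] - centers[labels[p]])
                if d_old <= cached[p]:                # not farther than before: keep assignment
                    cached[p] = d_old
                    continue
            d_all = np.linalg.norm(centers - X[p], axis=1)
            labels[p] = int(np.argmin(d_all))
            cached[p] = d_all[labels[p]]
        for c in range(k):                            # standard center update
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```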
In [11], an optimized K-Means clustering method named k*-means is proposed based on three optimization principles. First, a hierarchical optimization principle initializes k* cluster centers (k* > k) to
reduce the risk of random seed selection. Second, a cluster pruning strategy is proposed to improve the efficiency of K-Means. Finally, an optimized update rule is implemented to speed up the K-Means iteration.
4. PROPOSED METHOD
Our proposed method dynamically clusters all data from a large data set without specifying the value of K, where K is the number of clusters. Standard K-Means first requires a value of K to be selected and then starts clustering based on that value; since selecting K in advance is difficult, the quality of the K-Means clustering result can become poor. Our proposed method clusters a large data set based on a threshold value, which improves the quality of the clustering result. The threshold T is computed from the pairwise distances of the data set as in (1), and the minimum pairwise distance is given in (2).
T = \frac{1}{N^{2}} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \mathrm{dist}(x_i, x_j) \qquad (1)

d_{\min} = \min_{i=0,\; j=0}^{N-1} \mathrm{dist}(x_i, x_j) \qquad (2)
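A minimal sketch of the proposed procedure is given below. Two points are assumptions on our part rather than statements from the text: the threshold T of (1) is taken as the mean pairwise distance, and a data point joins its nearest existing group only when it lies within T of that group's centroid, otherwise it starts a new group.

```python
import numpy as np
from sklearn.datasets import load_iris

def dynamic_kmeans(X):
    """Sketch of threshold-driven clustering: no K is supplied; groups are
    opened as needed whenever a point is farther than T from every centroid."""
    N = len(X)
    pair = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    T = pair.sum() / (N * N)                     # Eq. (1): mean pairwise distance (assumed reading)
    centroids = [X[0].copy()]                    # first point seeds the first group
    members = [[0]]
    for i in range(1, N):
        d = [np.linalg.norm(X[i] - c) for c in centroids]
        nearest = int(np.argmin(d))
        if d[nearest] <= T:                      # close enough: join existing group
            members[nearest].append(i)
            centroids[nearest] = X[members[nearest]].mean(axis=0)   # update running centroid
        else:                                    # too far: open a new group
            centroids.append(X[i].copy())
            members.append([i])
    labels = np.empty(N, dtype=int)
    for g, idx in enumerate(members):
        labels[idx] = g
    return labels, np.array(centroids)

labels, centers = dynamic_kmeans(load_iris().data)
print("number of clusters found:", len(centers))
```

The number of groups therefore follows from the data and the threshold rather than from a user-supplied K.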
Figure 1(a) shows the comparison between the K-Means algorithm and the proposed algorithm based on the sum of inter-cluster distance. We apply the proposed algorithm to the iris setosa data; it creates six data groups dynamically based on similarity, so the sum of inter-cluster distance increases, whereas for the K-Means algorithm the sum of inter-cluster distance decreases, as shown in Figure 1(a). Figure 1(b) shows the comparison between the K-Means algorithm and the proposed algorithm based on the sum of squared error. When the proposed algorithm is applied to the iris data set, the sum of squared error decreases, while for the K-Means algorithm it increases, as shown in Figure 1(b). The results for our generated data sets are given in Table 2.
Figure 1. (a) Sum of inter-cluster distance for the iris data set; (b) sum of squared error for the iris data set
Table 2 shows the results for our generated data sets and compares the proposed method with K-Means clustering based on the sum of inter-cluster distance and the sum of squared error. We generated several data sets with values in the range 0 to 100, containing 100, 200, 300, 400, 500, and 1000 instances respectively.
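The two comparison measures can be stated concretely. The sketch below uses assumed definitions, since the text does not give formulas for them: the sum of inter-cluster distance is taken over all pairs of centroids, and the sum of squared error is the within-cluster sum of squared distances to the assigned centroid; a synthetic data set in the described 0-100 range is also generated.

```python
import numpy as np

def sum_inter_cluster_distance(centroids):
    """Sum of pairwise distances between centroids (larger = better separated)."""
    c = np.asarray(centroids)
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
    return d[np.triu_indices(len(c), k=1)].sum()

def sum_of_squared_error(X, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid (smaller = tighter)."""
    return sum(np.sum((X[labels == g] - c) ** 2) for g, c in enumerate(centroids))

# Synthetic data as described: values in 0..100, here 500 instances (assumed 2-D).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 2))
labels = (X[:, 0] > 50).astype(int)                 # toy split just to exercise the metrics
centroids = [X[labels == g].mean(axis=0) for g in (0, 1)]
print(sum_inter_cluster_distance(centroids), sum_of_squared_error(X, labels, centroids))
```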
Figure 2(a) shows the comparison between the K-Means algorithm and the proposed algorithm based on the sum of inter-cluster distance for our generated data sets. Figure 2(a) shows that as the number of data points increases, the sum of inter-cluster distance increases for our proposed method, so the data points are grouped efficiently.
In the K-Means algorithm the sum of inter-cluster distance decreases. Figure 2(b) shows the comparison between the K-Means algorithm and the proposed algorithm based on the sum of squared error: the sum of squared error decreases for our proposed method, while in K-Means it increases, so the cluster quality is poorer.
Figure 2. (a) Sum of inter-cluster distance for our generated data set; (b) sum of squared error for our generated data set
6. CONCLUSION
In this paper, we propose a new K-Means algorithm that removes the main difficulties of the existing K-Means algorithm. The proposed method dynamically forms the clusters for a given data set. We compare the proposed method with the existing K-Means algorithm, and the results show that it outperforms the existing method on the well-known iris data set.
REFERENCES
[1] S. Sharma, J. Agrawal, S. Agarwal, S. Sharma, “Machine Learning Techniques for Data Mining: A Survey”, in
IEEE International Conference on Computational Intelligence and Computing Research, 2013.
[2] R. V. Singh, M.P.S. Bhatia, “Data Clustering with Modified K-means Algorithm”, in IEEE-International
Conference on Recent Trends in Information Technology (ICRTIT 2011), June 2011.
[3] V. W. Ajin, L.D. Kumar, “Big data and clustering algorithms”, in International Conference on Research Advances
in Integrated Navigation Systems (RAINS), May 2016.
[4] A. Shafeeq. B. M, Hareesha. K. S, “Dynamic Clustering of Data with Modified K-Means Algorithm”, in
International Conference on Information and Computer Networks (ICICN 2012), IPCSIT vol. 27 IACSIT Press,
Singapore, 2012.
[5] L. Guoli, W. Tingting, Y.Limei, “The improved research on k-means clustering algorithm in initial values”, in
International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), Shengyang, China,
2013.
[6] S. Jigui, L. Jie, Z. Lianyu, “Clustering algorithms Research”, in Journal of Software, 2008; 19(1): 48-61.
[7] D. Neha, B. M. Vidyavathi, “A Survey on Applications of Data Mining using Clustering Techniques”, in
International Journal of Computer Applications, vol. 126, no. 2, 2015.
[8] M.Fahim, A. M. Salem, F. A. Torkey, “An efficient enhanced k-means clustering algorithm”, in Journal of
Zhejiang University Science A, 2006; 10: 1626-1633.
[9] K.A. Abdul Nazeer, M.P. Sebastian, “Improving the Accuracy and Efficiency of the k-means Clustering
Algorithm”, in Proceeding of the World Congress on Engineering, vol 1, London, July 2009.
[10] L. ShiNa, G. Xumin, “Research on k-means Clustering Algorithm an Improved k-means Clustering Algorithm”, in
Third International Symposium on Intelligent Information Technology and Security Informatics.
[11] J. Qi, Y. Yu, L. Wang, J. Liu, “K*-Means: An Effective and Efficient K-means Clustering Algorithm”, in IEEE
International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking
(SocialCom), Sustainable Computing and Communications (SustainCom), 2016.
BIOGRAPHIES OF AUTHORS
Md. Zakir Hossain received the B.Sc. Engineering degree from the Department of Computer Science and Engineering, Dhaka University of Engineering and Technology (DUET), Gazipur, Bangladesh, in 2015, and is currently pursuing the M.Sc. Engineering degree in the same department. His research interests include Data Mining, Big Data, AI, Machine Learning, Cloud Computing, Software Engineering, Computer Networks, and IoT. He has presented papers at conferences both at home and abroad.
Md. Nasim Akhtar received the M.Eng and Ph.D degrees from National Technical University of
Ukraine, Kiev, Ukraine and Moscow State Academy of Fine Chemical Technology, Russia,
in 1998 and 2010, respectively. Currently, he is a Professor in the Department of Computer
Science and Engineering, Dhaka University of Engineering and Technology (DUET), Gazipur,
Bangladesh. His research interests include Distributed Data Warehouse Systems on Large Clusters, Digital Image Processing and Watermarking, Peer-to-Peer Networking, Cloud Computing, and Operating Systems. He has presented papers at conferences both at home and abroad and has published articles and papers in various journals.
Mostafijur Rahman completed his B.Sc. in Computer Science from the National University of Bangladesh (2003). He pursued his M.Sc. (2009) and Ph.D. (2017) in Computer Engineering at UNIMAP, Malaysia. He worked as a Lecturer from 2009 to September 2017 in the School of Computer and Communication Engineering at UNIMAP. Currently he is serving as an Assistant Professor in the Department of Software Engineering at Daffodil International University (DIU), Bangladesh. His research interests include Software Testing, Multimedia and Creativity in Medical Science, Computer Security, Cloud Computing, Algorithm Optimization, Parallel and Distributed Systems, and Device Drivers for GNU/Linux-based embedded OS.