
A BIRCH-Based Clustering Method for Large Time Series Databases

2012, Lecture Notes in Computer Science

This paper presents a novel approach to time series clustering based on the BIRCH algorithm. Our BIRCH-based approach clusters time series data using a multi-resolution transform as the feature extraction technique. The approach hinges on the use of a cluster feature (CF) tree, which helps to resolve the dilemma associated with the choice of initial centers and significantly improves execution time and clustering quality. Our BIRCH-based approach not only takes full advantage of BIRCH's capacity for handling large databases but can also be viewed as a flexible clustering framework in which any selected clustering algorithm can be applied in Phase 3. Experimental results show that the proposed approach performs better than k-Means in terms of clustering quality and running time, and better than I-k-Means in terms of clustering quality with nearly the same running time.

Vo Le Quy Nhon and Duong Tuan Anh
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Vietnam
[email protected]

Keywords: time series clustering, cluster feature, DWT, BIRCH-based.

1 Introduction

Time series data arise in many application areas, including science, engineering, business, finance, economics, medicine, and government. As a result, there has been an explosion of research effort devoted to time series data mining in the last decade, in particular in behavior-related data analytics [8]. Besides similarity search, another crucial task in time series data mining that has received increasing attention lately is time series clustering. Given a set of unlabeled time series, it is often desirable to determine groups of similar time series such that time series belonging to the same group are more “similar” to each other than to time series from different groups.
Although there has been much research on clustering in general, most classic machine learning and data mining algorithms do not work well for time series because of the unique characteristics of such data. In particular, the high dimensionality, very high feature correlation, and large amount of noise in time series data present a difficult challenge. Although a few time series clustering algorithms have been proposed, most of them do not scale well to large datasets and work only in a batch fashion.

L. Cao et al. (Eds.): PAKDD 2011 Workshops, LNAI 7104, pp. 148–159, 2012. © Springer-Verlag Berlin Heidelberg 2012

This paper proposes a novel approach to time series clustering based on the BIRCH algorithm [13]. We adopt the BIRCH method in the context of time series data because of its three inherent benefits. First, BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). Second, BIRCH can find a good clustering with a single scan of the data and improve the quality further with a few additional scans. Third, BIRCH scales well to very large datasets. To deal with the characteristics of time series data, we propose a BIRCH-based clustering approach that first uses a multi-resolution transform to perform feature extraction on all time series in the database and then applies the BIRCH algorithm to cluster the transformed time series data. Our BIRCH-based approach hinges on the use of a cluster feature (CF) tree, which helps to resolve the dilemma associated with the choice of initial centers and thus significantly improves execution time and clustering quality.
Our BIRCH-based approach not only takes full advantage of BIRCH's capacity for handling very large databases but can also be viewed as a flexible clustering framework in which any selected clustering algorithm can be applied in Phase 3. In particular, when an anytime clustering algorithm is used in Phase 3, our BIRCH-based approach is itself an anytime algorithm. Experimental results show that the proposed approach performs better than k-Means in terms of clustering quality and running time, and better than I-k-Means in terms of clustering quality with nearly the same running time.

The rest of the paper is organized as follows. In Section 2 we review related work and introduce the necessary background on multi-resolution transforms and the BIRCH clustering algorithm. Section 3 describes our proposed approach for time series clustering. Section 4 presents our experimental evaluation, which compares our method to classic k-Means and I-k-Means on real datasets. Section 5 gives some conclusions and suggestions for future work.

2 Background and Related Work

2.1 Dimensionality Reduction Using Multi-resolution Transforms

The discrete wavelet transform (DWT) is a typical time series dimensionality reduction method with the multi-resolution property. This property is critical to our proposed framework. The Haar wavelet, proposed by Haar, is the simplest and most popular wavelet. The Haar wavelet transform can be seen as a sequence of averaging and differencing operations on a time sequence. The transform averages each pair of adjacent values of the time series at a given resolution to form a smoothed, lower-dimensional signal; the resulting coefficients at that resolution are simply the differences between the values and their averages [1].

The MPAA time series representation, proposed by Lin et al., 2005 [7], is another case of dimensionality reduction with the multi-resolution property.
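Before turning to MPAA in detail, the Haar averaging-and-differencing procedure described above can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: the function names `haar_step` and `haar_transform` are hypothetical, the series length is assumed to be a power of two, and the unnormalized form of the transform is assumed (each detail coefficient is the first value of a pair minus the pair's average, with no scaling factor).

```python
def haar_step(signal):
    """One resolution level of the unnormalized Haar transform.

    Returns (averages, details): the pairwise averages form the smoothed,
    half-length signal; each detail coefficient is the first value of a
    pair minus that pair's average.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    averages = []
    details = []
    for i in range(0, len(signal), 2):
        avg = (signal[i] + signal[i + 1]) / 2
        averages.append(avg)
        details.append(signal[i] - avg)
    return averages, details


def haar_transform(signal):
    """Full Haar decomposition (assumes a power-of-two length).

    Output layout: [overall average, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("signal length must be a power of two")
    current = list(signal)
    coeffs = []
    while len(current) > 1:
        current, details = haar_step(current)
        # Coarser-level details go in front of the finer ones.
        coeffs = details + coeffs
    return current + coeffs


# Example: [9, 7, 3, 5] -> averages [8, 4] with details [1, -1],
# then [8, 4] -> average [6] with detail [2].
print(haar_transform([9, 7, 3, 5]))  # [6.0, 2.0, 1.0, -1.0]
```

With this layout, keeping only a prefix of the coefficient vector gives a coarser approximation of the series, which is the multi-resolution property the framework relies on.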
MPAA divides a time series Xi of length n into a series of lower-dimensional signals with different