www.sciedu.ca/air
Artificial Intelligence Research
2014, Vol. 3, No. 1
Published by Sciedu Press

ORIGINAL RESEARCH

A statistical approach for clustering in streaming data

Niloofar Mozafari∗, Sattar Hashemi, Ali Hamzeh
School of Electrical Computer Engineering, Shiraz University, Shiraz, Iran.

Received: June 30, 2013   Accepted: November 26, 2013   Online Published: January 9, 2014
DOI: 10.5430/air.v3n1p38   URL: http://dx.doi.org/10.5430/air.v3n1p38
Abstract
Recently, data streams have been extensively explored due to their emergence in a large number of applications such as sensor networks, web click streams and network flows. The vast majority of research in the context of data stream mining is devoted to supervised learning, whereas in real-world practice the labels of the data are rarely available to the learning algorithms. Hence clustering, as the most important form of unsupervised learning, has been the focus of a great deal of research in the data stream community. Clustering paradigms basically place similar objects together and separate dissimilar ones into different clusters.
In this paper, we propose a Statistical framework for data Stream Clustering, abbreviated StatisStreamClust, that makes use of two components to find clusters in a data stream. The first component is designed to detect concept change, where the underlying data distribution changes from time to time. Upon detection of concept change by the first component, the second component is triggered to update the whole clustering model. StatisStreamClust brings great benefits to data stream clustering, including insensitivity to the number of clusters and dimensions, reasonable complexity together with desirable performance, and no need to determine a window size a priori. To explore the advantages of our approach, a large number of experiments with different settings and specifications are conducted. The obtained results are very promising.
Key Words: Data stream, Trend, Concept change, Precision, Recall, F1 measure, Mean delay time
1 Introduction
Recently, data streams have been extensively investigated due to the large number of applications such as sensor networks, web click streams and network flows.[1–6] A data stream is an ordered sequence of data with huge volume arriving at high throughput that must be analyzed in a single pass.[7–9] The vast majority of research in the context of data stream mining is devoted to supervised learning, whereas in real-world practice the labels of the data are rarely available to the learning algorithms. Clustering, as the most important form of unsupervised learning, has therefore attracted a great deal of attention in the data stream community. It places similar objects together and separates dissimilar ones into different clusters.[10]

An intuitive approach for clustering data streams is to recluster them periodically: at predetermined time points, the clusters of the data stream are updated with an existing clustering algorithm. This approach has two major difficulties. First, due to the huge volume of streaming data, reclustering is very costly. Second, it is unclear when the data must be reclustered. If reclustering is performed frequently, most of the reclustering tasks are unnecessary; on the other hand, if the reclustering tasks are spaced too far apart, some cluster information may be lost. We therefore need a solution that performs clustering whenever it is necessary, i.e., whenever the nature of the streaming data changes.

In this paper, we propose StatisStreamClust (a Statistical framework for data Stream Clustering) that partitions the input instances whenever needed and involves two components to find clusters in a data stream. The first component detects concept change, where the underlying data distribution changes from time to time. Then, after a change is detected, the underlying clustering model is updated. Briefly, the contributions of our approach are as follows. First, it does not require a sliding window on the data stream, whose size is a well-known challenging issue; second, it works well on multi-dimensional data streams and is not sensitive to the number of dimensions or the number of clusters. To explore the advantages of our approach, a large number of experiments with different settings and specifications are conducted. The obtained results suggest that StatisStreamClust is the method of choice, offering reasonable complexity and desirable performance.

∗ Correspondence: Dr. Niloofar Mozafari; Email: [email protected]; Address: Department of Computer Science and Engineering and Information Technology, School of Electrical Computer Engineering, Zand Avenue, Shiraz University, Iran.

ISSN 1927-6974   E-ISSN 1927-6982
The rest of this paper is organized as follows. Section 2 reviews related work; Section 3 explains the proposed algorithm for data stream clustering; experimental results are given in Section 4; and Section 5 concludes the paper and outlines future work.
2 Related work
The problem of concept change detection in unlabeled time-evolving data is formulated as follows: we are given a series of unlabeled data points D = z_1, z_2, ..., z_n. D can be divided into s segments D_i, i = 1, 2, ..., s, that follow different distributions. A change in the distribution of the data causes difficulties in data stream learning problems. For example, performing clustering on the entire data stream while ignoring concept change decreases the quality of the clusters. Hence, several methods have been developed to cope with concept change. Some of these methods handle concept change through clustering, and the others focus exclusively on concept change detection. We classify them into two groups: 1) model-driven concept change detection, where the main aim of the system is clustering and the clusters are updated whenever the accuracy of the clustering decreases; 2) data-driven concept change detection, where the methods focus exclusively on detecting concept change from the nature of the data. In the following, we review both types of systems.
2.1 Model-driven concept change detection
The main goal of these methods is clustering; they handle concept change in the data stream by updating the clusters whenever the accuracy of the clustering decreases. This idea was first introduced by Aggarwal et al.,[11] who presented a framework for data stream clustering named CluStream. That framework is separated into two components: 1) an online component that partitions the data stream into clusters and adopts a very efficient process to store appropriate summary statistics for each cluster; 2) an offline component that is applied whenever the user requires it. This component uses the summary statistics stored by the online component, together with other user input, to provide the user with a quick understanding of the clusters. Although CluStream has the interpretable ability to track evolving clusters, it is not designed to handle outliers.
In,[12] a method for clustering data streams with arbitrary shapes and the ability to handle outliers is proposed. That method also follows the two online and offline processes in order to partition the data stream while considering concept change. Nasraoui et al. proposed a density-based method for clustering data streams in one step.[8] That algorithm adopts a sliding window on the data stream and partitions the instances of the first window with a density-based clustering. It stores a representation for each cluster that involves the center of the cluster and the weights of its instances. The weight of each instance in a cluster is determined according to both the distance of the instance to the center of the cluster and the time at which the instance entered the cluster. The instances in each window are assigned to the appropriate cluster according to the representation of each cluster, and the parameters of the clusters are then updated. In that algorithm, outliers are also removed using statistical tests.
Another view of clustering numerical data streams under concept change, namely evolutionary clustering, is proposed in.[13, 14] Evolutionary clustering optimizes two potentially conflicting criteria: 1) the clustering in the current window should be similar to the clustering in the previous window (ignoring concept change); 2) the clustering in the current window should partition the instances accurately (accounting for concept change). Chakrabarti et al.[13] formulated the concept of evolutionary clustering and extended k-means[15] and agglomerative hierarchical clustering[45] to the evolutionary setting. The idea of temporal smoothness is proposed in.[14] Temporal smoothness implies that the current clustering should not deviate dramatically from the most recent clustering. Based on this idea, two frameworks for evolutionary spectral clustering[16] are proposed, in which temporal smoothness is incorporated into the overall clustering quality. Although those methods perform well for data stream clustering under concept change, they cannot handle sudden concept change because they treat it as noise.
Dai et al. proposed a framework for clustering data streams under concept change that, unlike previous works, partitions the data streams themselves, rather than their data points, into clusters.[17] Clustering whole data streams rather than data points has broad applications, for example in sensor networks and stock markets. That framework involves two phases: 1) an online maintenance phase that uses an efficient algorithm to store a summary of the data stream at multiple resolutions; 2) an offline phase that employs an adaptive clustering algorithm to retrieve approximations of the desired sub-streams, based on the clustering queries specified by the user, from the summary stored in the online phase. In,[18] an online clustering method without an offline phase was proposed for clustering data streams rather than their data points; it continuously reports clusters within a given distance threshold. Beringer et al. also presented an algorithm for clustering parallel data streams.[19] That method summarizes the data streams using the Discrete Fourier Transform and reports clustering results by applying a sliding window on the streams. Another work[20] incrementally constructs a hierarchy of clusters from a divisive point of view to provide whole clustering of time series; that method performs clustering whenever a number of data points of each time series have been received. Yeh et al. proposed an algorithm for clustering data streams rather than their data points that dynamically reports cluster evolutions with efficient cluster split and merge processes triggered by events.[22]
In,[21] an algorithm for mining evolving user profiles on the web is proposed. For each cluster, information such as birth, persistence, atavism and death is defined to model the evolving web-usage profiles, and the results of clustering are analyzed in order to track the evolving user profiles. In 2009, Chen et al. presented a framework that is able to detect concept change in categorical domains and show the trend of the evolving clusters.[23]
2.2 Data-driven concept change detection

These methods focus exclusively on concept change detection according to the definition of concept change and the nature of the data. In general, change is defined as moving from one state to another.[24] There are several important works on change detection, some of which detect changes with statistical hypothesis testing and multiple-testing procedures.[25] In the statistical literature, there are also works on change point detection.[26] However, most of the statistical tests are parametric and need the whole data set to run.[27, 28] These methods are not applicable in the data stream setting, because they require storing all data in memory to run their tests.[28] Popular approaches to concept change detection use three techniques: (1) a sliding window, which is adopted to select data points for building a model;[29, 30] (2) instance weighting, which assumes that recent data points in the window are more important than the others;[31, 32] (3) ensemble learning, in which multiple models are created with different window sizes, parameter values or weighting functions, and the prediction is based on the majority vote of the different models.[33–36] Both the sliding window and instance weighting families suffer from some issues. First, they are parametric methods: the sliding window techniques require determining the window size, and the instance weighting methods need a proper weighting function. Second, when there is no concept change in the data stream for a long period of time, neither sliding windows nor instance weighting works well, because they discard, or give low weights to, the older instances.[37] The ensemble methods try to overcome the problems that sliding windows and instance weighting face by deciding according to the reaction of multiple models with different window sizes, parameter values or weighting functions. However, these techniques need to determine the number of models in the ensemble.

Another family of concept change detection methods is based on density estimation. For example, Aggarwal's method[38] uses velocity density estimation, which is based on heuristics instead of classic statistical change detectors. To provide an intuitive understanding of the rate of change of the density at a spatial location over a window, the concept of velocity density, determined from a forward time slice density estimate and a reverse time slice density estimate, is defined. Aggarwal's method visualizes the rate of flow of the data stream, and the user decides whether a concept change has occurred or not. As other major works in this family, we can mention Kifer's[39] and Dasu's[40] works, which try to detect changes by comparing two probability distributions from two different windows.[39, 40] For example, in[39] a change detection method based on the KS test determines whether the two probability density estimates obtained from two consecutive windows are similar or not. That approach places two windows over the data stream, namely a reference window and a current window, and determines whether the two samples in the windows were created by the same distribution function. Although it assumes the instances are generated independently, it makes no assumption on the type of distribution function that generated them. This approach to change detection is, however, impractical for high-dimensional data streams. Dasu et al. proposed a method for change detection related to Kulldorff's test, which is practical for multi-dimensional data streams.[40] However, it relies on a discretization of the data space and thus suffers from the curse of dimensionality.

Another major work is proposed by Ho et al.[24] In Ho's method, upon arrival of a new data point, a hypothesis test takes place to determine whether a concept change has occurred or not. This hypothesis test is driven by a family of martingales[24] based on Doob's Maximal Inequality.[41] Although Ho's method detects change points accurately, it can only detect some types of changes, as detailed in.[42]
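As an aside, the two-window scheme of[39] is easy to prototype: SciPy's two-sample Kolmogorov-Smirnov test can compare a reference window against the current window and flag a change when the null hypothesis of a common distribution is rejected. The window sizes, significance level and synthetic data below are illustrative choices, not values taken from the works cited above.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_window_change_points(stream, ref_size=200, cur_size=200, alpha=0.01):
    """Slide a current window along the stream and compare it against a
    reference window with the two-sample Kolmogorov-Smirnov test.
    Returns the start indices where 'same distribution' is rejected."""
    changes = []
    ref = stream[:ref_size]                     # reference window
    for start in range(ref_size, len(stream) - cur_size, cur_size):
        cur = stream[start:start + cur_size]    # current window
        stat, p = ks_2samp(ref, cur)
        if p < alpha:                           # distributions differ
            changes.append(start)
            ref = cur                           # restart from the new concept
    return changes

rng = np.random.default_rng(0)
# 1000 points from N(0,1) followed by 1000 points from N(3,1)
stream = np.concatenate([rng.normal(0, 1, 1000), rng.normal(3, 1, 1000)])
print(ks_window_change_points(stream))  # flags a change near index 1000
```

Because the reference window is reset after each detection, the sketch keeps only one concept in memory at a time, which mirrors the two-window idea rather than the full streaming estimators of.[39, 40]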
3 StatisStreamClust
The proposed method partitions instances into k clusters after the arrival of n instances. Whenever an instance is received, it is assigned to the proper cluster according to its distance to the means of the clusters. In other words, the distance of the newly received instance to the mean of each existing cluster is calculated; the instance is assigned to the cluster with the shortest distance, and the mean of that cluster is updated. Suppose that the mean of the n received instances in cluster k is m_{k,n}. The instance x_{n+1} is assigned to the proper cluster using:
x_{n+1} \in cl_k, \quad k = \arg\min_k \operatorname{dist}(x_{n+1}, m_{k,n})    (1)

where cl_k is the k-th cluster and dist(x_{n+1}, m_{k,n}) is the distance of instance x_{n+1} to the mean of cluster k. After assigning instance x_{n+1} to cluster k, the new mean of the cluster is calculated using the following formula:

m'_{k,n+1} = \frac{n \cdot m_{k,n} + x_{n+1}}{n+1}    (2)

After that, a hypothesis test takes place to determine whether a concept change has occurred or not. The test relies on the exchangeability condition and is defined as follows:

H_0: there is no concept change in the data stream
H_1: there is a concept change in the data stream

In order to perform the hypothesis test, StatisStreamClust investigates the status of the newly received instance in comparison with the other instances in cluster k. If there is a change in the trend of the instances of any cluster, a concept change has occurred and the clustering model must be updated. To do that, the distance of every instance in cluster k to m'_{k,n} is calculated, and the trend of the instances is computed using Formula (3):

p\text{-}value = \frac{\#\{i : SM_{k,i} > SM_{k,n}\} + \theta_n \, \#\{i : SM_{k,i} = SM_{k,n}\}}{n}    (3)

where θ_n is chosen randomly from [0, 1]. SM_{k,i} is the distance of the i-th instance of cluster k to the cluster mean; it determines how much a data point differs from the others. SM is high when a data point is far from the mean of the data points.

A drift of the p-values toward higher values can be deemed as the data points running away from their mean; in contrast, data close to their mean conveys p-values approaching smaller values. In order to decide whether H_0 must be accepted or not, a martingale[24] is defined based on the sequence of p-values:

M_i^{(\epsilon)} = \epsilon \, p_i^{\epsilon - 1} M_{i-1}^{(\epsilon)}    (4)
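To make the mechanics concrete, the following is a minimal single-cluster sketch of the randomized p-value (3) and the power martingale (4). The values ε = 0.92 and λ = 10, and the one-dimensional Gaussian data, are illustrative assumptions, not parameters prescribed by the paper; the full method maintains one such test per cluster.

```python
import random

def strangeness(points, mean):
    """SM values: distance of each point to the cluster mean (1-D here)."""
    return [abs(p - mean) for p in points]

def randomized_p_value(sm_values, sm_new):
    """Formula (3): rank the strangeness of the newest point among all
    strangeness values, breaking ties with a random theta in [0, 1]."""
    theta = random.random()
    greater = sum(1 for s in sm_values if s > sm_new)
    equal = sum(1 for s in sm_values if s == sm_new)
    return (greater + theta * equal) / len(sm_values)

def detect_change(stream, eps=0.92, lam=10.0):
    """Power martingale (4): M_i = eps * p_i**(eps-1) * M_{i-1}.
    Returns the index at which M exceeds the threshold lam, or None."""
    points, m, mean = [], 1.0, 0.0
    for i, x in enumerate(stream):
        points.append(x)
        mean = mean + (x - mean) / len(points)   # incremental mean, cf. (2)
        sm = strangeness(points, mean)           # O(n) per step in this sketch
        p = max(randomized_p_value(sm, sm[-1]), 1e-12)  # guard against p == 0
        m *= eps * p ** (eps - 1.0)
        if m > lam:
            return i
    return None

random.seed(1)
calm = [random.gauss(0, 1) for _ in range(300)]
shifted = [random.gauss(8, 1) for _ in range(100)]
idx = detect_change(calm + shifted)
print(idx)  # index where the martingale first exceeded the threshold
```

While the data stay close to their mean, the p-values are roughly uniform and M hovers near 1; after the shift, every new point is maximally strange, the p-values collapse, and M grows multiplicatively until it crosses λ.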
According to Doob's Maximal Inequality,[41] it is unlikely for M_i to attain a high value. Thus, we can detect changes when the martingale value is greater than a threshold λ.[42] When a change occurs in any cluster, all the previous information is removed. To be more illustrative, we present the outline of our method for clustering a data stream as follows:

Algorithm StatisStreamClust
  Partition the first 50 arriving instances using the k-means algorithm
  Loop
    A new unlabeled instance z_i is received
    Find the proper cluster and compute the distance of z_i to the mean of the instances of that cluster
    Compute p_value and p_value' = 1 − p_value using (3)
    Compute M_i and M_i' using (4), from the p_values (M_i) and the 1 − p_values (M_i')
    If (M_i + M_i')/2 > λ then
      Update the clustering model
      Delete all previous information
    End if
  End loop

4 Experimental Results and Discussion

This section is composed of two subsections, covering our observations and analysis. The first subsection presents the data sets and evaluation measures; the latter presents and analyzes the obtained results.

4.1 Experimental Setup

This section introduces the examined data sets and evaluation measures, respectively.

4.1.1 Data sets

To explore the advantages of StatisStreamClust, we conduct our experiments on a data set previously used in Aggarwal's work.[11] The instances of this data set follow a series of Gaussian distributions. In order to reflect concept change, we change the mean and variance of the current Gaussian distribution after the arrival of every 1050 instances.
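Such a stream can be simulated along the following lines; only the change period of 1050 instances comes from the text, while the particular means, variances and dimensionality are illustrative.

```python
import numpy as np

def gaussian_stream(n_concepts=3, period=1050, dim=2, seed=0):
    """Concatenate segments of `period` points, redrawing the mean and
    spread of the generating Gaussian at every concept boundary."""
    rng = np.random.default_rng(seed)
    segments = []
    for _ in range(n_concepts):
        mean = rng.uniform(-10, 10, size=dim)   # new concept: new mean
        std = rng.uniform(0.5, 2.0)             # ... and new spread
        segments.append(rng.normal(mean, std, size=(period, dim)))
    return np.vstack(segments)

stream = gaussian_stream()
print(stream.shape)  # (3150, 2)
```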
4.1.2 Evaluation measures
There are many evaluation measures to assess the performance of clustering algorithms. They can be categorized into two groups: internal and external criteria. An internal criterion formulates the quality of the clustering model as a function of the instances and the similarities between them. External criteria use external information not given to the clustering algorithm, such as the labels of the instances. In this paper, we evaluate our method with both groups.
As an internal criterion, we use a famous cluster validation technique, Silhouette validation.[43] This measure first assigns a quality measure to each instance:

sil(d_i) = \frac{b(d_i) - a(d_i)}{\max\{a(d_i), b(d_i)\}}    (5)
where a(d_i) is the average Euclidean distance of instance d_i to all other instances in the same cluster and b(d_i) is the average Euclidean distance of instance d_i to all other instances in the closest cluster. The silhouette of a cluster C_k that has m instances is defined as follows:
sil(C_k) = \frac{1}{m} \sum_{d_i \in C_k} sil(d_i)    (6)
Finally, the Global Silhouette, GS, which is used to evaluate clustering quality as an internal criterion, is defined as Formula (7):

GS = \frac{1}{p} \sum_{k=1}^{p} sil(C_k)    (7)
where p is the number of clusters. Silhouette validation
takes into account the compactness of the instances in each
cluster and the separation of clusters.
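Formulas (5)-(7) can be computed directly from pairwise Euclidean distances; the following is a small sketch on toy data (the points and labels are made up for illustration):

```python
import numpy as np

def global_silhouette(X, labels):
    """Global Silhouette (7): mean over clusters of the cluster
    silhouette (6), itself the mean of per-instance silhouettes (5)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    sils = []
    for c in clusters:
        in_c = labels == c
        vals = []
        for i in np.where(in_c)[0]:
            # a(d_i): average distance to the other members of the cluster
            a = dist[i, in_c].sum() / max(in_c.sum() - 1, 1)
            # b(d_i): average distance to the closest other cluster
            b = min(dist[i, labels == o].mean() for o in clusters if o != c)
            vals.append((b - a) / max(a, b))                   # sil(d_i), (5)
        sils.append(np.mean(vals))                             # sil(C_k), (6)
    return float(np.mean(sils))                                # GS, (7)

X = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(round(global_silhouette(X, [0, 0, 1, 1]), 3))  # → 0.9
```

Two tight, well-separated pairs score close to 1, as expected; scikit-learn's `silhouette_score` offers an instance-averaged (rather than cluster-averaged) variant of the same idea.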
To evaluate the accuracy of the proposed method with an external criterion, we focus on Normalized Mutual Information (NMI), an information-theoretic measure previously used in.[44]
NMI(\lambda^{(a)}, \lambda^{(b)}) = \frac{\sum_{h=1}^{k^{(a)}} \sum_{l=1}^{k^{(b)}} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(a)} n_l^{(b)}}\right)}{\sqrt{\left(\sum_{h=1}^{k^{(a)}} n_h^{(a)} \log\frac{n_h^{(a)}}{n}\right)\left(\sum_{l=1}^{k^{(b)}} n_l^{(b)} \log\frac{n_l^{(b)}}{n}\right)}}    (8)

Figure 1: The instances of this data set come from five Gaussian distributions whose means change after 1050 instances. Each color indicates the instances generated from one Gaussian distribution.
In this formula, λ^{(a)} is the true labeling of the instances and λ^{(b)} is the result of clustering using the proposed method. k^{(a)} and k^{(b)} are the numbers of clusters in λ^{(a)} and λ^{(b)} respectively, n_h^{(a)} is the number of data points in the h-th cluster, and n_{h,l} is the number of instances common to clusters h and l.
The Normalized Mutual Information measures the degree of similarity between two clusterings. If the two clusterings have much information in common, this measure takes a high value, i.e. close to 1, and vice versa.
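Formula (8) can be evaluated from the contingency counts of the two partitions. The sketch below uses natural logarithms; the choice of base cancels in the ratio.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information, Formula (8): mutual information of
    the two partitions over the geometric mean of their entropies."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))       # n_{h,l} counts
    num = sum(c * math.log(n * c / (ca[h] * cb[l]))
              for (h, l), c in joint.items())
    den_a = sum(c * math.log(c / n) for c in ca.values())
    den_b = sum(c * math.log(c / n) for c in cb.values())
    return num / math.sqrt(den_a * den_b)

# Identical partitions up to renaming score 1; independent ones score 0.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # → 1.0
print(round(nmi([0, 0, 1, 1], [0, 1, 0, 1]), 6))  # → 0.0
```

Note the measure is invariant to cluster labels, which is why it suits comparing a clustering against ground truth.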
Figure 2: The ability of the Basic method to partition streaming data. The Basic algorithm performs clustering once all examples are collected. The horizontal axis shows time and the vertical axis the value of the instances at each time.
4.2 Results and discussion

To assess the performance of StatisStreamClust, we compare it with two other methods: one called Basic and the other a Window-based algorithm. The Basic algorithm performs clustering once all instances are collected, and the Window-based algorithm partitions the streaming data after the arrival of every 2000 instances. For a fair comparison, StatisStreamClust and the other two algorithms all use the k-means algorithm. Figures 2, 3 and 4 visualize the ability of Basic, Window-based and our method to partition streaming data, respectively. The horizontal axis shows time and the vertical axis the value of the instances at each time. The instances of this data set come from five Gaussian distributions whose means change after 1050 instances. Figure 1 illustrates this data set; each colour indicates the instances generated from one Gaussian distribution.

As Figure 2 shows, the Basic algorithm partitions instances over time poorly, because it does not consider the concept change whereby the distribution of the instances changes. The Window-based algorithm partitions streaming data better, because it partitions the data after the arrival of every 2000 instances and then removes all the instances of the previous window. Although the performance of the Window-based algorithm is fairly good, it is not comparable to StatisStreamClust. The proposed algorithm partitions instances into k clusters using the k-means algorithm after the arrival of the first 50 instances. Whenever an instance is received, it is assigned to the proper cluster according to its distance to the means of the clusters, and a martingale is then run for that cluster. If the martingale of any cluster exceeds λ, the clustering model is updated and all the previous information is deleted. To conclude, StatisStreamClust assigns instances to the proper cluster and updates the clustering model whenever needed.
Our method takes into account the trend of the data behaviour, which can be moving toward or away from the centre of the data, and after detecting a change it removes all the previous information. The Basic algorithm does not consider concept change and runs the clustering algorithm once all instances are collected. The Window-based algorithm partitions instances after the arrival of every 2000 instances, so it performs better than Basic. Still, the performance of the Window-based algorithm is not comparable to our algorithm, because our method updates the clustering model whenever it is needed; otherwise it simply assigns each instance to the proper cluster according to its distance to the mean of the cluster.
Figure 5: The comparison of the proposed method with
two algorithms of Basic and Window based in the different
dimensions with NMI.
Figure 6: The comparison of the proposed method with
two algorithms of Basic and Window based in the different
dimensions with Silhouette.
Figure 3: The ability of Window based algorithm to
partition streaming data. Window based algorithm does
clustering after arriving a fixed number of instances (2000
instances). Horizontal axis shows the time and the vertical
one illustrates the value of instances in each time
Figure 4: The ability of statisStreamClust to partition
streaming data. The proposed approach for clustering
streaming data reclusters instances whenever a change is
detected. Otherwise the clustering model is updated.
Horizontal axis shows the time and the vertical one
illustrates the value of instances in each time
Our experimental observations and theoretical analysis clearly reveal that StatisStreamClust is able to partition streaming data in a robust manner: the method is robust to both the number of clusters and the number of dimensions. To be more illustrative, we draw the readers' attention to the fact that part of our evaluations is carried out on data streams with different numbers of clusters (Table 1). From Figures 5 and 6, one may conclude that robustness to the number of dimensions is an intrinsic property of StatisStreamClust.
The complexity of StatisStreamClust is O(n log n). Upon arrival of a new data point, a proper cluster for that instance is found according to its distance to the means of the clusters, with complexity O(k), where k is the number of clusters. After that, a hypothesis test takes place to determine whether a concept change has occurred in that cluster. This hypothesis test is driven by a family of martingales based on Doob's Maximal Inequality. To perform the test, StatisStreamClust computes the strangeness values SM, which rank the data points according to their distance to the mean, with complexity O(n). In the next step, the p-value statistic ranks the SM of the new data point with respect to the other SM values; this can be done by sorting the values, which takes O(n log n) when heap-sort is used. Deciding whether to accept H_0 via the martingale defined over the sequence of p-values costs O(1). Therefore, the complexity of our algorithm is O(k) + O(n) + O(n log n) + O(1) = O(n log n).
Table 1: Comparison of StatisStreamClust with the Basic and Window-based algorithms for different numbers of clusters, with two evaluation measures.
               2 clusters    3 clusters    4 clusters    5 clusters    6 clusters    7 clusters
Our Approach
  NMI          0.8861        0.9241        0.9398        0.9367        0.9348        0.9331
               (5.6460e-16)  (0.0024)      (0.0037)      (0.0015)      (0.0024)      (0.0025)
  Silhouette   0.8759        0.8513        0.8291        0.7723        0.7725        0.7525
               (5.6460e-16)  (0.0062)      (0.0072)      (0.0072)      (0.0094)      (0.0162)
Basic
  NMI          0.1449        0.3070        0.3912        0.4405        0.4851        0.5122
               (2.0639e-04)  (3.9502e-04)  (3.1308e-05)  (1.6552e-04)  (3.2827e-04)  (2.7958e-04)
  Silhouette   0.7277        0.7092        0.7087        0.7136        0.6979        0.7026
               (1.5197e-05)  (3.6691e-04)  (2.7655e-05)  (1.3144e-05)  (8.2859e-05)  (4.9147e-05)
Window Based
  NMI          0.7250        0.7794        0.8060        0.8076        0.8238        0.7740
               (3.3876e-16)  (1.1292e-16)  (3.4237e-04)  (4.5168e-16)  (0.0013)      (0.0670)
  Silhouette   0.8150        0.7904        0.7724        0.7652        0.7562        0.7544
               (1.1292e-16)  (1.0000e-07)  (1.1052e-05)  (3.3876e-16)  (3.1007e-05)  (0.0084)
(Each cell shows the mean value; the number in parentheses is the corresponding deviation.)
5 Conclusion
Clustering, as the most important unsupervised learning task, has attracted considerable attention in the data stream community. It places similar objects together and separates dissimilar ones into different clusters. Although data stream research has recently focused on the unsupervised domain, and on clustering as its most important task, the proposed approaches are not yet mature enough to be relied on. In other words, most of them provide merely mediocre performance, especially when applied to multi-dimensional data streams. In this paper, we propose a statistical algorithm that partitions streaming data whenever it is needed. To do that, our algorithm first detects when the nature of the data changes over time and then reclusters the instances at those times. The advantages of our approach are the following: first, it does not require a sliding window on the data stream, whose size is a well-known challenging issue; second, it works well on multi-dimensional data streams. To explore the advantages of our approach, a large number of experiments with different settings and specifications are conducted. The obtained results are very promising. In future work, we will investigate clustering data streams where the number of clusters changes over time.
References
[1] Babcock, B., Babu, S., Datar, R., Motwani, R. and Widom, J.: Models and Issues in Data Stream Systems, in proceedings of ACM
Symp, Principles of Databases Systems (PODS), pp. 1-16, 2002.
[2] Fan, W.: Systematic Data Selection to Mine Concept Drifting Data
Streams, in Proceedings of ACM SIGKDD, pp. 128-137, 2004.
PMid:15079794
[3] Liu, X., Guan, J., Hu, P.: Mining Frequent Closed Item Sets from
a Landmark Window Over Online Data Streams, in journal of computers and mathematics with applications, 57: 927-936, 2009.
[4] Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: On Demand Classification of Data Streams, in proceedings of ACM SIGKDD, pp. 503508, 2004.
[5] Jiang, T., Feng, Y., Zhang, B.: Online Detecting and Predicting
Special Patterns over Financial Data Streams, Journal of Universal
Computer Science, 15(13): 2566-2585, 2009.
[6] Hashemi, S., Yang, Y.: Flexible decision tree for data stream classification in the presence of concept change, noise and missing values, Data Mining and Knowledge Discovery, 19(1): 95-131 (2009).
http://dx.doi.org/10.1007/s10618-009-0130-9
[7] Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Morgan Kaufmann, (2001).
[8] Nasraoui, O., Rojas, C.: Robust Clustering for Tracking Noisy
Evolving Data Streams, Proceedings of Sixth SIAM International
Conference of Data Mining (SDM), 2006.
[9] Hashemi, S., Yang, Y., Mirzamomen, Z., Kangavari, M.: Adapted
One-versus-All Decision Trees for Data Stream Classification, in
IEEE Transactions on Knowledge and Data Engineering, 21(5): 624-637, 2009. http://dx.doi.org/10.1109/TKDE.2008.181
[10] Mirkin, B. G.: Mathematical Classification and Clustering, 1996.
[11] Aggarwal, C., Han, J., Wang, J. and Yu, P.: A Framework for Clustering Evolving Data Streams, in Proceedings of the VLDB Conference, 2003.
[12] Cao, F., Ester, M., Qian, W. and Zhou, A.: Density-Based Clustering over an Evolving Data Stream with Noise, in Proceedings of the Sixth SIAM International Conference on Data Mining (SDM), 2006.
[13] Chakrabarti, D., Kumar, R. and Tomkins, A.: Evolutionary Clustering, in Proceedings of ACM SIGKDD '06, pp. 554-560, 2006.
[14] Chi, Y., Song, X.-D., Zhou, D.-Y., Hino, K. and Tseng, B. L.: Evolutionary Spectral Clustering by Incorporating Temporal Smoothness, in Proceedings of ACM SIGKDD '07, pp. 153-162, 2007.
[15] Lloyd, S. P.: Least Squares Quantization in PCM, IEEE Transactions on Information Theory, vol. 28, pp. 129-137, 1982. http://dx.doi.org/10.1109/TIT.1982.1056489
[16] Ng, A., Jordan, M. and Weiss, Y.: On Spectral Clustering: Analysis and an Algorithm, in Advances in Neural Information Processing Systems 14, 2001.
[17] Dai, B.-R., Huang, J.-W., Yeh, M.-Y. and Chen, M.-S.: Adaptive Clustering for Multiple Evolving Streams, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, 2006.
[18] Yang, J.: Dynamic Clustering of Evolving Streams with a Single Pass, in Proceedings of the International Conference on Data Engineering (ICDE), pp. 695-697, 2003.
[19] Beringer, J. and Hullermeier, E.: Online Clustering of Parallel Data Streams, Data and Knowledge Engineering, vol. 58, no. 2, pp. 180-204, 2005. http://dx.doi.org/10.1016/j.datak.2005.05.009
ISSN 1927-6974; E-ISSN 1927-6982
[20] Rodrigues, P. P., Gama, J. and Pedroso, J. P.: ODAC: Hierarchical Clustering of Time Series Data Streams, in Proceedings of the 6th International Conference on Data Mining, pp. 499-503, 2006.
[21] Nasraoui, O., Soliman, M., Saka, E., Badia, A. and Germain, R.: A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites, IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 202-215, 2008.
[22] Yeh, M.-Y., Dai, B.-R. and Chen, M.-S.: Clustering over Multiple Evolving Streams by Events and Correlations, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 10, 2007.
[23] Chen, H., Chen, M. and Lin, S.: Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data, IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, 2009.
[24] Ho, S.-S. and Wechsler, H.: A Martingale Framework for Detecting Changes in Data Streams by Testing Exchangeability, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[25] Bickel, P. J. and Doksum, K.: Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, Inc., 1977.
[26] Carlstein, E., Muller, H.-G. and Siegmund, D. (Eds.): Change-Point Problems, Institute of Mathematical Statistics, Hayward, California, 1994.
[27] Glaz, J. and Balakrishnan, N. (Eds.): Scan Statistics and Applications, Boston, 1999. http://dx.doi.org/10.1007/978-1-4612-1578-3
[28] Glaz, J., Naus, J. and Wallenstein, S.: Scan Statistics, Springer, New York, 2001. http://dx.doi.org/10.1007/978-1-4757-3460-7
[29] Klinkenberg, R. and Joachims, T.: Detecting Concept Drift with Support Vector Machines, in Proceedings of the 17th International Conference on Machine Learning, P. Langley, Ed., Morgan Kaufmann, pp. 487-494, 2000.
[30] Widmer, G. and Kubat, M.: Learning in the Presence of Concept Drift and Hidden Contexts, Machine Learning, 23, pp. 69-101, 1996.
[31] Klinkenberg, R.: Learning Drifting Concepts: Example Selection vs. Example Weighting, Intelligent Data Analysis, Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, vol. 8, no. 3, pp. 281-300, 2004.
[32] Chu, F., Wang, Y. and Zaniolo, C.: An Adaptive Learning Approach for Noisy Data Streams, in Proceedings of the 4th IEEE International Conference on Data Mining, IEEE Computer Society, pp. 351-354, 2004.
Published by Sciedu Press
[33] Kolter, J. Z. and Maloof, M. A.: Dynamic Weighted Majority: A New Ensemble Method for Tracking Concept Drift, in Proceedings of the 3rd IEEE International Conference on Data Mining, IEEE Computer Society, pp. 123-130, 2003.
[34] Wang, H., Fan, W., Yu, P. S. and Han, J.: Mining Concept-Drifting Data Streams Using Ensemble Classifiers, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, L. Getoor, T. E. Senator, P. Domingos and C. Faloutsos, Eds., ACM, pp. 226-235, 2003.
[35] Scholz, M. and Klinkenberg, R.: Boosting classifiers for drifting
concepts, in Intelligent Data Analysis, Vol. 11 No. 1, pp. 3-28, 2007.
[36] Bousquet, O. and Warmuth, M.: Tracking a Small Set of Experts by
Mixing Past Posteriors, in Journal of Machine Learning Research,
Vol. 3, pp. 363-396, 2002.
[37] Ho, S.-S. and Wechsler, H.: Detecting Changes in Unlabeled Data Streams Using Martingale, in Proceedings of the 20th International Joint Conference on Artificial Intelligence, M. Veloso, Ed., pp. 1912-1917, 2007.
[38] Aggarwal, C. C.: A Framework for Change Diagnosis of Data Streams, in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 575-586, 2003.
[39] Kifer, D., Ben-David, S. and Gehrke, J.: Detecting Change in Data Streams, in Proceedings of the 30th International Conference on Very Large Data Bases, M. A. Nascimento, M. T. Özsu, D. Kossmann, R. J. Miller, J. A. Blakeley and K. B. Schiefer, Eds., Morgan Kaufmann, pp. 180-191, 2004.
[40] Dasu, T., Krishnan, S., Venkatasubramanian, S. and Yi, K.: An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams, in Interface, 2006.
[41] Wetherill, G. B. and Glazebrook, K. D.: Sequential methods in
statistics, 3rd ed. Chapman and Hall, 1986.
[42] Mozafari, N., Hashemi, S., Hamzeh, A.: A Precise Statistical Approach for Concept Change Detection in Unlabeled Data Streams, to appear in Journal of Computers and Mathematics with Applications, Elsevier, 2011.
[43] Rousseeuw, P.: Silhouettes: A Graphical Aid to the Interpretation
and Validation of Cluster Analysis, J. Computational and Applied
Math., vol. 20, pp. 53-65, 1987. http://dx.doi.org/10.1016/
0377-0427(87)90125-7
[44] Strehl, A., Ghosh, J.: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research (JMLR), 2002.
[45] Ward, J.: Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 58(301): 236-244, 1963. http://dx.doi.org/10.1080/01621459.1963.
10500845