Online Classification of Nonstationary Data Streams
Mark Last1
Ben-Gurion University of the Negev
Department of Information Systems Engineering
Beer-Sheva 84105, Israel
Email: [email protected]
Abstract
Most classification methods are based on the assumption that the data conforms to
a stationary distribution. However, real-world data is usually collected over certain
periods of time, ranging from seconds to years, and ignoring possible changes in the
underlying concept, also known as concept drift, may degrade the predictive performance
of a classification model. Moreover, the computation time, the amount of required
memory, and the model complexity may grow indefinitely with the continuous arrival of
new training instances. This paper describes and evaluates OLIN, an online classification
system, which dynamically adjusts the size of the training window and the number of
new examples between model re-constructions to the current rate of concept drift. By
using a fixed amount of computer resources, OLIN produces models which have nearly
the same accuracy as the ones that would be produced by periodically re-constructing the
model from all accumulated instances. We evaluate the system performance on sample
segments from two real-world streams of non-stationary data.
Keywords. Classification, incremental learning, online learning, concept drift,
info-fuzzy networks.
Also with Department of Computer Science and Engineering, University of South Florida, USA
1. Introduction
The so-called Information Age has provided us with huge amounts of digital data, which keep growing at an unprecedented pace. Business information systems, Internet servers, and telecommunication providers are producing and storing new data every day, hour, and even minute. The reported rates include a maximum of 17,400 Web page requests per minute at a single university campus [6], 500,000 transactions recorded
during less than one day by the Lycos search engine [1], and tens of millions of queries
sent every day to Yahoo! Search [25]. The accumulated data may contain some valuable
information for an organization that is storing it. Data mining is the core stage of the
KDD (Knowledge Discovery in Databases) process, which is aimed at identifying valid,
novel, potentially useful, and ultimately understandable patterns in data [9]. The
problems of knowledge discovery in online data include real-time monitoring of
manufacturing processes, prediction of stock prices, and intrusion detection in computer
networks.
Continuous streams of data pose new problems to classification methods of data
mining, like CART [4], ID3 [28], C4.5 [29], IFN [23], and many others. The common
approach of these methods is to store and process the entire set of training examples. The
growing amounts of training data increase the processing requirements of data mining
systems up to a point, where they either run out of memory, or their computation time
becomes prohibitively long. Furthermore, even if all the available examples can be
handled by the system, the patterns discovered by an algorithm in data from the past may hardly be valid and useful for new data obtained hours or even minutes later, due to unexpected changes in the data-generating process (e.g., a political event which affects stock prices).
Methods for extracting patterns from continuous streams of data are known as
incremental (online) learning algorithms. The basic idea of incremental induction is that
upon receiving a new instance, it is much less expensive to update an existing model than
to build a new one. On the other hand, as indicated in [6], the incremental algorithms
suffer from several shortcomings, like high sensitivity to the order of training examples
and longer training times than the non-incremental (batch) methods. Pure incremental
methods consider every new instance, which may be impractical in environments where
transactions arrive at the rate of thousands per second. The algorithms for purely
incremental learning include WINNOW [21], COBWEB [10], ITI [31], and the stochastic
gradient descent algorithm for training a neural network (see [27]). Most of these
methods are focused on efficient ways of inducing a classification model from stationary
data streams.
The general problem of learning a drifting concept, that is, a concept which changes over time, is studied by Helmbold and Long in [12]. The rate of drift is defined
there as the probability that the target function disagrees over two successive examples.
Disagreements are minimized by an algorithm, which is polynomial in the sample size,
given that the rate of drift is bounded. Thus, for high-volume non-stationary data
streams, where the actual rate of drift is unknown in advance, the run time of the
algorithm may grow indefinitely.
Widmer and Kubat [32] describe a family of purely incremental algorithms for
learning in the presence of drift. These algorithms use a simple representational
framework, called FLORA, which stores descriptions of positive, negative, and noisy
examples in separate description sets. One of the incremental algorithms, FLORA2,
maintains a dynamically adjustable window of the latest training examples. Whenever a
concept drift is suspected, due to a drop in predictive accuracy or an explosion in the
number of descriptions, the window size is decreased, by discarding the oldest examples.
If the concept appears to be stable, the window size is left unchanged. As long as the
presence of drift is uncertain, no examples are forgotten, thus incrementally increasing
the window size. According to [32], this window adjustment strategy may efficiently
detect radical changes in the underlying concept, subject to a relatively low rate of
change. The FLORA algorithms also assume a limited rate of data arrival, since they
process one example at a time.
A recent paper by Domingos and Hulten [6] deals directly with the problem of
mining high-speed streams of data. Their data mining system, called VFDT, builds
decision trees from symbolic attributes by using sub-sampling of the entire data stream
generated by a stationary process. A similar assumption of stationary concepts is used by
the incremental method of Fan et al. [7]. The sample size is determined in VFDT from
distribution-free Hoeffding bounds. A new version of the VFDT system, called CVFDT
[16], learns decision trees from continuously changing data streams by repeatedly
applying the VFDT algorithm to a sliding window of fixed size. CVFDT is aimed at
detecting only one type of concept drift at the node level of the tree: namely, the
importance of the current input attribute vs. other attributes. The algorithm grows an
alternative subtree for each attribute having a relatively high information gain and
replaces the old subtree when a new one becomes more accurate.
Harries and Sammut [11] have developed an off-line meta-learning system (based
on C4.5) for partitioning the data stream into a set of time-dependent "conceptual
clusters." The off-line approach is aimed at analyzing concept drifts in historic data
rather than at the online detection of evolving changes. A strategy for dynamically
updating a linear regression classifier by using a linear model of dynamic behavior is
presented in [17]. The linear approach of [17] is computationally efficient, but it
obviously restricts the search space by eliminating nonlinear concepts and nonlinear
(especially, abrupt) patterns of concept drift.
Time-dependent changes in the class distributions of rules induced from data can
be detected by the CD3 algorithm of Black and Hickey [3][13]. CD3 treats the time-stamp as an additional input attribute in a decision tree. Consequently, paths where the value of the time-stamp attribute refers to the old period(s) represent rules which are out of date. When the process is stable for a long period of time, the time-stamp attribute
should not appear in any path of the tree.
Closely associated with the problem of change detection is the task of discovering
robust knowledge, which is unlikely to be affected by database changes. A Bayesian
Network model for evaluating robustness of database rules is described in [14]. The
network probabilities are estimated by an off-line procedure, which assumes stationarity
of the database transactions.
Another related area is change detection, including an increasingly important
problem of intrusion detection in computers and computer networks. Lane and Brodley
[18] suggest a compromise between purely batch and purely incremental learning for the
task of detecting abnormal (possibly hostile) behavior of a computer user: their algorithm
is trained on a batch of the most recent user transactions. However, the minimum batch
size (80 examples) makes the system vulnerable to short-term cyber attacks. According
to Fawcett and Provost [8], the most efficient method for detecting an intruder is the
profiling approach, where a model of normal (stationary) activity is built and then used to
raise an alarm on significant deviations from normal behavior.
This paper proposes an online classification system, which uses an info-fuzzy
network [23], or IFN, as a base classifier. As shown in [19][23], the IFN method is able
to produce much more compact models than other decision-tree methods, like CART and
C4.5, while preserving nearly the same level of predictive accuracy. Moreover, it can also
be used as an efficient feature selection method [20]. The proposed system, called OLIN
for On-Line Information Network, adapts itself automatically to the rate of concept drift
in a non-stationary data stream by dynamically adjusting the size of the training window
and the rate of model update. The system does not impose any limitations on the rate, the
extent, or the type of change in the underlying concept. Like the batch version of the IFN
method, it can handle both discrete and continuous attributes. OLIN saves computer
resources by lengthening the update cycle when the concept appears to be stable, and it shrinks the size of the training window whenever a concept drift is detected. Thus,
OLIN can be applied to a time-changing data stream of arbitrary duration. The
cumulative accuracy of the models produced by OLIN tends to be higher than the
accuracy obtained with a fixed-size sliding window, though it may be slightly lower than
the accuracy of an incremental system that does not forget any past examples.
This paper is organized as follows. In the next section, we provide a brief
overview of the IFN method and its features. The OLIN system is described in the
following section. We then present the empirical results of applying OLIN to real-world
streams of non-stationary data coming from two different domains: semiconductor
manufacturing and stock market. The paper concludes with a discussion of results and
future work.
2. Information Networks
2.1. Network Structure
IFN, or info-fuzzy network [23], is a tree-like classification model designed to minimize the total number of predicting attributes. Beyond classification,
this model can be used for the tasks of discretization [23], feature selection [20], rule
induction [19], and evaluation of data reliability [24]. An info-fuzzy network has the
following components:
1) I - a subset of input (predicting) attributes used by the model. Input attributes are
selected from the set C of candidate input attributes (available features).
2) |I| - total number of hidden layers (levels) in a network. Unlike the standard decision
tree structure used in CART [4], ID3 [28], and C4.5 [29], where the nodes of the same
tree level are independent of each other, all nodes of a given network layer are labeled
by the same input attribute associated with that layer. This is why the number of
network layers is equal to the number of input attributes. In layers associated with
continuous attributes, an information network uses multiple splits, which are identical
at all nodes of the corresponding layer. Most other decision-tree methods apply only
binary splits in each layer. The first layer in the network (Layer 0) includes only the
root node and is not associated with any input attribute.
example is labeled with a single predicted class having the maximum probability at the
node or with the probability distribution of classes.
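To make the structure concrete, the following minimal Python sketch shows one way such a network could be represented and used for classification; the class and attribute names are illustrative only and are not taken from the IFN implementation. It reflects the two properties described above: all nodes of a hidden layer are split on the single attribute associated with that layer, and a classified example receives either the single class with the maximum probability at the node it reaches or the probability distribution of classes at that node.

```python
from collections import Counter

class IFNLayer:
    """A hidden layer of an info-fuzzy network: every node of the layer is split
    on the same input attribute, using the same list of (discrete) split values."""
    def __init__(self, attribute, splits):
        self.attribute = attribute   # key/index of the input attribute of this layer
        self.splits = splits         # the common list of values used to split the layer

class IFNModel:
    """A minimal info-fuzzy network: an ordered list of layers plus the class
    counts observed at every node, keyed by the path of split indices
    (the empty tuple () is the root node of Layer 0)."""
    def __init__(self, layers, node_class_counts):
        self.layers = layers
        self.node_class_counts = node_class_counts

    def _node_of(self, example):
        """Follow the splits layer by layer down to the deepest node that exists."""
        path = ()
        for layer in self.layers:
            child = path + (layer.splits.index(example[layer.attribute]),)
            if child not in self.node_class_counts:
                break                # this branch was not expanded further
            path = child
        return path

    def predict(self, example):
        """Return the single class with the maximum probability at the node."""
        return self.node_class_counts[self._node_of(example)].most_common(1)[0][0]

    def predict_proba(self, example):
        """Return the probability distribution of classes at the node."""
        counts = self.node_class_counts[self._node_of(example)]
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}
```

For example, a one-layer network over a binary attribute could be created as IFNModel([IFNLayer("a1", [0, 1])], {(): Counter({"yes": 6, "no": 4}), (0,): Counter({"yes": 5, "no": 1}), (1,): Counter({"yes": 1, "no": 3})}) and queried with predict({"a1": 0}), which returns "yes".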
[Figure 1. An example of an info-fuzzy network: the root node (Layer No. 0), Layer No. 1 associated with the first input attribute, Layer No. 2 associated with the second input attribute, connection weights, and the target layer.]
MI(A_{i'}; A_i / z) = \sum_{j=0}^{M_i - 1} \sum_{j'=0}^{M_{i'} - 1} P(V_{ij}; V_{i'j'}; z) \log \frac{P(V_{ij}; V_{i'j'} / z)}{P(V_{ij} / z) \, P(V_{i'j'} / z)}     (1)

where

M_i / M_{i'} - the number of distinct values of the target attribute i / candidate input attribute i'.

P(V_{i'j'} / z) - an estimated conditional (a posteriori) probability of a value j' of a candidate input attribute i' given a node z.

P(V_{ij} / z) - an estimated conditional (a posteriori) probability of a value j of the target attribute i given a node z.

P(V_{ij}; V_{i'j'} / z) - an estimated conditional (a posteriori) probability of a value j' of a candidate input attribute i' and a value j of the target attribute i given a node z.

P(V_{ij}; V_{i'j'}; z) - an estimated joint probability of a value j of the target attribute i, a value j' of a candidate input attribute i', and a node z.
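As an illustration, Equation 1 can be estimated directly from frequency counts of the examples that reach a node, as in the sketch below (Python). The function name and its arguments are illustrative and not part of the IFN implementation; the logarithm is taken here to base 2, so the result is expressed in bits.

```python
import math
from collections import Counter

def conditional_mutual_information(node_examples, input_attr, target_attr, total_examples):
    """Estimate MI(A_i'; A_i / z) of Equation 1 for a single node z.

    node_examples:  the training examples that reached node z (dict-like records)
    input_attr:     key of the candidate input attribute A_i'
    target_attr:    key of the target attribute A_i
    total_examples: overall number of training examples (used for the joint
                    probability P(V_ij; V_i'j'; z) of a value pair and the node)
    """
    joint = Counter()          # counts of (target value, input value) pairs at z
    input_counts = Counter()   # counts of input values at z
    target_counts = Counter()  # counts of target values at z
    n_z = len(node_examples)
    for ex in node_examples:
        joint[(ex[target_attr], ex[input_attr])] += 1
        input_counts[ex[input_attr]] += 1
        target_counts[ex[target_attr]] += 1

    mi = 0.0
    # value pairs that never occur at z contribute zero terms and are simply skipped
    for (t_val, in_val), count in joint.items():
        p_joint_z = count / total_examples                  # P(V_ij; V_i'j'; z)
        p_joint_given_z = count / n_z                       # P(V_ij; V_i'j' / z)
        p_target_given_z = target_counts[t_val] / n_z       # P(V_ij / z)
        p_input_given_z = input_counts[in_val] / n_z        # P(V_i'j' / z)
        mi += p_joint_z * math.log2(p_joint_given_z / (p_target_given_z * p_input_given_z))
    return mi
```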
Conditional mutual information measures the benefit of adding an input attribute
to the information network. If the input and the target attributes are conditionally
independent given a node z, their conditional joint probability should be equal to the
product of their individual conditional probabilities. This makes the logarithmic terms in
Equation (1) equal to zero. On the other hand, if the knowledge of an input value either
(2)
(3)
At each iteration, the algorithm builds a new hidden layer by choosing an input
attribute (either discrete, or continuous), which provides the maximum significant
increase in mutual information relative to the previous layer. The nodes of a new layer
are defined for a Cartesian product of split nodes of the previous layer and the values of a
new input attribute. The chain rule of information theory (see [5]) implies that the mutual information between an info-fuzzy network and the target attribute is equal to the sum of drops in conditional entropy (information gains) across all hidden layers. If there is no candidate input attribute that significantly increases the mutual information, the network construction is stopped and the algorithm outputs the final network structure.
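A sketch of this layer-growing step is shown below (Python), reusing the conditional_mutual_information function from the previous sketch. Since the likelihood-ratio statistic of Equations 2-3 is not reproduced in this excerpt, the significance test used here (2 ln 2 times the number of examples times the conditional mutual information, compared with a chi-square quantile) is an assumption made for illustration only.

```python
import math
from scipy.stats import chi2

def choose_next_attribute(split_nodes, candidate_attrs, target_attr, total_examples, alpha):
    """Pick the candidate input attribute with the maximum significant increase in
    conditional mutual information over all split nodes of the previous layer.

    split_nodes: one list of examples per split node of the previous layer
    Returns (best_attribute, mi_increase), or (None, 0.0) if no attribute is significant."""
    best_attr, best_mi = None, 0.0
    for attr in candidate_attrs:
        significant_mi = 0.0
        for node_examples in split_nodes:
            mi = conditional_mutual_information(node_examples, attr, target_attr, total_examples)
            # Assumed significance test (Equations 2-3 are not shown above):
            # G^2 = 2 * ln(2) * total_examples * MI(A_i'; A_i / z), compared with the
            # chi-square quantile at level alpha, df = (#input values - 1) * (#classes - 1).
            n_in = len({ex[attr] for ex in node_examples})
            n_t = len({ex[target_attr] for ex in node_examples})
            g2 = 2 * math.log(2) * total_examples * mi
            df = max((n_in - 1) * (n_t - 1), 1)
            if g2 >= chi2.ppf(1 - alpha, df):
                significant_mi += mi       # only significantly informative nodes contribute
        if significant_mi > best_mi:
            best_attr, best_mi = attr, significant_mi
    return best_attr, best_mi
```

If choose_next_attribute returns None, no candidate attribute provides a significant increase in mutual information and the network construction stops, as described above.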
[32], is reasonable as long as the delay between receiving an example and receiving its
class is negligible with respect to the time between the arrivals of successive examples.
The general architecture of the OLIN system is presented in Figure 2. The system
has three modules: the Learning Module, which implements the IN algorithm to produce
an info-fuzzy network; the Classification Module, which uses the current network to
classify the incoming examples; and the Meta-Learning Module, which controls the
operation of the Learning Module. In the sample data stream of Figure 2, each one of V0
examples in the validation interval [t2, t3] is classified by a model induced from T0
examples of the training interval [t0, t2]. The numbers of examples in the training and the validation intervals do not have to be equal. At time t3, the network is re-constructed
by the Learning Module using T1 examples from the training interval [t1, t3] and
subsequently applied to V1 examples in the validation interval [t3, t4]. We assume here
that the first example in the interval [t3, t4] arrives after the network construction has been
completed. In massive data streams, examples that arrive in the process of network
construction may be classified by a partially constructed model using the anytime nature
of the IN algorithm (see [23]).
The Meta-Learning Module obtains as input the training and the validation
accuracy rates of the model, measured on the training and the validation intervals
respectively. It also gets the description of the model itself (selected attributes, entropy
information, etc.). Using the OLIN algorithm (see below), the module re-calculates the
size of the next training window (interval) and the number of validation examples to be
classified with the new model. The last validation example of the current window is
always the last training example of the next window. To classify every example in the
data stream exactly one time, the validation intervals ([t2, t3], [t3, t4], etc.) have to be
disjoint and consecutive. However, the training windows may overlap with each other
(but not with their respective validation intervals).
[Figure 2. The architecture of the OLIN system: the Meta-Learning Module passes the window size and the update cycle to the Learning Module and the Classification Module; the Learning Module induces a classification model (IN) from the training data; the Classification Module applies the model to the validation data; the training and the validation accuracy are fed back to the Meta-Learning Module. The timeline of the data stream shows the training intervals [t0, t2] and [t1, t3] and the validation intervals [t2, t3] (V0 examples) and [t3, t4] (V1 examples).]
OLIN uses the difference between the training and the validation accuracy of the current model as an indicator of concept stability.
The purely incremental approach of FLORA2 does not seem appropriate for
massive data streams, where the concepts are not expected to change drastically between
the arrivals of consecutive instances. For this reason, the CVFDT algorithm of Hulten et
al. [16] checks for drift only once in a fixed number of examples (20,000). The OLIN approach is to dynamically adjust the number of examples between model re-constructions by using the following heuristic: keep the current model for more examples if the concept appears to be stable, and drastically reduce the size of the validation window if a concept drift is detected.
Table 1 shows the pseudo-code outline of the OLIN algorithm. The algorithm
does some initializations and then processes a user-specified number of incoming
examples from a continuous data stream. If the user does not provide the number of the
last example to be classified by the system, the algorithm runs indefinitely. The
following parameters are calculated by OLIN: initial size of the training window, updated
window size, and the maximum difference between the training and the validation errors.
These calculations are described in the next sub-sections.
The IN algorithm, like other batch decision-tree methods, stores all the training
examples in the computer memory. Thus, applying IN to an indefinitely growing training
window is not feasible. Consequently, we have limited the maximum size of the training
window used by OLIN to Max_Win examples. This number can be adjusted to the
amount of available memory on a given computer. Another important parameter is the
training time required per new example. As we have shown in [23], the computation
time of the IN algorithm is directly proportional to the number of training examples for
each discrete attribute and to the square of this number, if an attribute is continuous. This
imposes an additional limitation on the size of the training window, which can be handled
by a given computer system. Since the size of the training window is bounded by
Max_Win and the minimum number of examples in a validation interval is
Min_Add_Count, we can say that during the slowest periods of its operation, the training time of OLIN per new discrete-valued example can be proportional to Max_Win / Min_Add_Count. With continuous attributes, this value increases to (Max_Win)² / Min_Add_Count. However, when the concept appears to be stable, the training time of the algorithm goes down to the order of Max_Win / Max_Add_Count or (Max_Win)² / Max_Add_Count, depending on the nature of the data stream attributes.
Table 1 The OLIN Algorithm

Input:
S - a continuous stream of examples
n_min - the number of the first example to be classified by the system
n_max - the number of the last example to be classified by the system
C - the set of candidate input attributes
Sign - a user-specified significance level
Pe - the maximum allowable prediction error of the model
Init_Add_Count - the initial number of examples to be classified by the first model
Inc_Add_Count - the amount (in percent) by which to increase the number of examples between model re-constructions if the concept is stable
Max_Add_Count - the maximum number of examples between model re-constructions
Red_Add_Count - the amount (in percent) by which to reduce the number of examples between model re-constructions if a concept drift is detected
Min_Add_Count - the minimum number of examples between model re-constructions
Max_Win - the maximum number of examples in a training window

Output:
IFN - the current classification model

Procedure OLIN:
Calculate the initial size of the training window Winit (using Equation 7)
Let the training window size W = Winit
Initialize the index i of the first training example to n_min - W
Initialize the index j of the last training example to n_min
Initialize the number of validation examples Add_Count to Init_Add_Count
While j < n_max Do
    Obtain a model (IFN) by applying the IN algorithm to the W latest training examples
    Calculate the training error rate E_tr of the obtained model on the W training examples
    Calculate the index of the last validation example: k = j + Add_Count
    Calculate the validation error rate E_val of the obtained model on the Add_Count validation examples
    Update the index of the last training example: j = k
    Find the maximum allowed difference between the training and the validation errors, Max_Diff (using Equation 10)
    If (E_val - E_tr) < Max_Diff Then        // the concept is stable
        Add_Count = Min(Add_Count * (1 + Inc_Add_Count / 100), Max_Add_Count)
        W = Min(W + Add_Count, Max_Win)
    Else                                     // a concept drift is detected
        Re-calculate the size of the training window W (using Equation 8)
        Update the index of the first training example: i = j - W
        Add_Count = Max(Add_Count * (1 - Red_Add_Count / 100), Min_Add_Count)
    End If
    Return the current model (IFN)
End Do
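For readability, the same control flow can be written as a short Python generator. This is a sketch only: train_ifn, error_rate, initial_window_size, recalculated_window_size, and max_allowed_difference are hypothetical placeholders standing for the IN algorithm and for Equations 7, 8, and 10, and the sketch assumes that the initial window fits before the first classified example (Winit <= n_min).

```python
def olin(stream, p):
    """A possible rendering of the OLIN control loop of Table 1.

    stream is an indexable sequence of labeled examples; p is a dict holding the
    input parameters of Table 1 (n_min, n_max, Sign, the Add_Count limits, Max_Win)."""
    W = initial_window_size(p)                     # Equation 7
    j = p["n_min"]                                 # index of the last training example
    add_count = p["Init_Add_Count"]
    while j < p["n_max"]:
        training = stream[j - W:j]                 # the W latest training examples
        model = train_ifn(training, p["Sign"])     # the IN (batch IFN) algorithm
        e_tr = error_rate(model, training)
        k = j + add_count                          # index of the last validation example
        e_val = error_rate(model, stream[j:k])     # the model classifies the next Add_Count examples
        j = k
        max_diff = max_allowed_difference(e_tr, e_val, W, add_count)   # Equation 10
        if e_val - e_tr < max_diff:
            # the concept is stable: lengthen the update cycle and enlarge the window
            add_count = min(int(add_count * (1 + p["Inc_Add_Count"] / 100)), p["Max_Add_Count"])
            W = min(W + add_count, p["Max_Win"])
        else:
            # a concept drift is detected: shrink the window and the update cycle
            W = recalculated_window_size(model, training)               # Equation 8
            add_count = max(int(add_count * (1 - p["Red_Add_Count"] / 100)), p["Min_Add_Count"])
        yield model        # "Return the current model (IFN)" of Table 1
```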
(4)
window can always be derived from the number of examples required to split every node in the network. We also assume that the first input attribute selected by the algorithm has only two values: NI_i'(z) = 2. This number can be adjusted to the actual domain size of candidate input attributes in a given dataset.
Since we consider only the attribute that is used to split the root node, the conditional mutual information MI(A_i'; A_i / z = 0) is equal to the unconditional mutual information MI(A_i'; A_i), which can be expressed as [5]:
MI(A_{i'}; A_i) = H(A_i) - H(A_i / A_{i'})     (5)
where H(A_i) is the unconditional entropy of the target attribute A_i and H(A_i / A_i') is the conditional entropy of A_i given an input attribute A_i'. The unconditional entropy H(A_i) can only be estimated by processing some training examples. However, it can be approximated by its maximum value of log2(NT_i), where NT_i is the number of classes [5]. If a dataset is very unbalanced, NT_i may include only the most common classes. The upper bound on the conditional entropy follows from Fano's inequality [5]:
H(A_i / A_{i'}) \le H(P_e) + P_e \log_2(NT_i - 1)     (6)
where P_e is the error rate of an information network having conditional entropy H(A_i / A_i'). Since the right-hand side is a non-decreasing function of P_e, we can treat P_e as the maximum allowable error rate of the model. The assumption NI_i'(z) = 2 and Equations 4-6 lead to the following expression for the initial size of the training window:
W_{init} = \frac{\chi^2_{\alpha}(NT_i - 1)}{2 \ln 2 \left( \log_2(NT_i) - H(P_e) - P_e \log_2(NT_i - 1) \right)}     (7)
chi-square. More examples are needed to distinguish between a larger number of target
classes. Given that other factors remain unchanged, the number of required examples also grows with the error rate Pe, since the algorithm needs more examples to confirm
the significance of a less accurate model. It is important to note that a high statistical
significance of the likelihood-ratio test does not necessarily imply a high predictive
accuracy of the resulting model. A significance level just represents the probability that
the model is not random, i.e., it is more accurate than the default (majority) prediction
rule.
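For illustration, the initial window size of Equation 7 can be computed directly, as in the sketch below (Python with SciPy). It assumes, as in the reconstruction of Equation 7 above, that the numerator is the chi-square quantile at the chosen significance level with NT_i - 1 degrees of freedom; the function name and arguments are illustrative only.

```python
import math
from scipy.stats import chi2

def initial_window_size(num_classes, max_error, significance):
    """Initial training window size W_init of Equation 7.

    num_classes:  NT_i, the number of target classes
    max_error:    P_e, the maximum allowable error rate of the model
    significance: the significance level Sign (e.g., 0.001 for 0.1%)
    """
    # Binary entropy H(P_e) of the allowed error rate
    h_pe = 0.0
    if 0.0 < max_error < 1.0:
        h_pe = -(max_error * math.log2(max_error)
                 + (1 - max_error) * math.log2(1 - max_error))
    # Chi-square quantile with NT_i - 1 degrees of freedom (NI_i'(z) = 2 is assumed)
    numerator = chi2.ppf(1 - significance, num_classes - 1)
    # The denominator must stay positive, i.e., the allowed error rate must be low
    # enough for the model to beat a random guess over NT_i classes.
    denominator = 2 * math.log(2) * (math.log2(num_classes) - h_pe
                                     - max_error * math.log2(num_classes - 1))
    return math.ceil(numerator / denominator)
```

With two classes, a maximum allowed error of 0.3, and a 0.1% significance level, this sketch yields a window of roughly 66 examples.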
The expression for the size of an updated window is also based on Equations 4-6, but it uses information which is available from the latest window of the training examples:
(8)
where NI_i' is the number of values (or discretized intervals) of the first input attribute A_i' in the info-fuzzy network, H(A_i) is the entropy of the target, and E_tr is the training error of the current model. The resulting number of examples W should be sufficient to
confirm the significance of the first layer in the latest trained model. We assume that a
new concept, which is still unknown, can be learned from at least the same number of
examples. If a new concept persists for fewer examples than W, the algorithm cannot
detect it at all.
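Equation 8 itself is not reproduced in this excerpt. The sketch below therefore assumes, for illustration only, that it parallels Equation 7: log2(NT_i) is replaced by the target entropy H(A_i), P_e by the training error E_tr, and the chi-square quantile uses (NI_i' - 1)(NT_i - 1) degrees of freedom for the first attribute actually selected by the latest model. This assumed form is not taken from the paper.

```python
import math
from scipy.stats import chi2

def recalculated_window_size(num_first_attr_values, num_classes, target_entropy,
                             training_error, significance):
    """Updated training window size after a detected drift (assumed form of Equation 8)."""
    h_err = 0.0
    if 0.0 < training_error < 1.0:
        h_err = -(training_error * math.log2(training_error)
                  + (1 - training_error) * math.log2(1 - training_error))
    df = (num_first_attr_values - 1) * (num_classes - 1)
    numerator = chi2.ppf(1 - significance, df)
    # As with W_init, the denominator must remain positive for the result to be meaningful.
    denominator = 2 * math.log(2) * (target_entropy - h_err
                                     - training_error * math.log2(num_classes - 1))
    return math.ceil(numerator / denominator)
```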
3.3. Comparing the Training and the Validation Error Rates
If a concept is stable, the examples in the training window and in the subsequent
validation interval should conform to the same distribution. Consequently, there should
not be a statistically significant difference between the training and the validation error
rates of the IFN model, as we have seen in our previous studies of static datasets [23].
On the other hand, a sharp increase in the error rate may indicate a possible concept drift
[32]. Using a Normal approximation to the Binomial distribution, we calculate the
variance of the difference between error rates by the following formula (based on [27]):
Var\_Diff = \frac{E_{tr}(1 - E_{tr})}{W} + \frac{E_{Val}(1 - E_{Val})}{Add\_Count}     (9)
If the concept is stable, the maximum difference between the error rates, at the
99% confidence level, is:
Max\_Diff = z_{0.99} \sqrt{Var\_Diff} = 2.326 \sqrt{Var\_Diff}     (10)
If the difference between the error rates exceeds Max_Diff, a concept drift is
detected and the size of the next training window is re-calculated by using Equation 8.
Also, the number of examples in the next validation interval is reduced by
Red_Add_Count percent. Otherwise, the concept is considered stable and both the
training window and the validation interval are increased up to their maximum sizes.
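A minimal sketch of this test is given below (Python; the function name is illustrative, and the variance formula follows the reconstruction of Equation 9 above).

```python
import math

Z_99 = 2.326  # the 99% quantile of the standard Normal distribution

def concept_drift_detected(e_tr, e_val, train_size, val_size):
    """Return True if the validation error exceeds the training error by more than
    Max_Diff of Equation 10, computed from the variance of Equation 9."""
    var_diff = e_tr * (1 - e_tr) / train_size + e_val * (1 - e_val) / val_size
    max_diff = Z_99 * math.sqrt(var_diff)
    return (e_val - e_tr) >= max_diff
```

For example, with a training error of 0.30 measured on 500 examples and a validation error of 0.40 measured on 100 examples, Var_Diff is about 0.0028 and Max_Diff is about 0.12, so the observed difference of 0.10 would not yet be treated as a concept drift.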
4. Empirical Results
4.1. Manufacturing Data
We have applied OLIN to a sample of yield data recorded at a semiconductor
plant. In the semiconductor industry, the yield is defined as the ratio between the number of good parts (microelectronic chips) in a completed batch and the maximum number of parts which can be obtained from the same batch if no chips are scrapped at one of the fabrication steps. Due to the high complexity and variability of the modern microelectronics industry, the yield is anything but a stationary process. It is affected daily by hundreds of
Table 2 OLIN Parameters and Their Values (Manufacturing Data)

Parameter | Meaning | Value
n_min | The number of the first example to be classified by the system | 378
n_max | The number of the last example to be classified by the system | 1377
Sign | A user-specified significance level | 0.1%
Pe | Maximum allowable prediction error of the model | 0.50
Init_Add_Count | Initial number of examples to be classified by the first model | 10
Inc_Add_Count | Amount to increase the number of examples between model re-constructions (if the model is stable) | 50%
Max_Add_Count | Maximum number of examples between model re-constructions | 100
Red_Add_Count | Amount to reduce the number of examples between model re-constructions (if a concept drift is detected) | 75%
Min_Add_Count | Minimum number of examples between model re-constructions | 1
Max_Win | Maximum number of examples in a training window | 1,000
windowing. The columns of the table present the total number of training windows in the
process of classifying 1,000 examples, the average number of examples in a training
window, the total run time of the system on the stream of 378 training and 1,000
validation examples, the aggregated error rate on the validation records (including its
variance), and the probability that the error rate in a given run is significantly different
from the error rate of OLIN. One and two asterisks designate 5% and 1% significance
levels respectively.
Table 3 Manufacturing Data: Summary of Experiments

Run No. | Initial Window | Add Count | Remove Count | Number of Windows | Av. Window Size | Run Time (sec.) | Error Rate | Variance | p-value
0 | 378 | 1000 | 0 | 1 | 1000 | 0.14 | 0.558 | 0.2466 | 0.027 *
1 | 50 | 1 | 0 | 1000 | 550 | 32.57 | 0.507 | 0.2500 | 0.360
2 | 100 | 1 | 0 | 1000 | 600 | 35.81 | 0.512 | 0.2499 | 0.447
3 | 50 | 1 | 1 | 1000 | 50 | 5.16 | 0.563 | 0.2460 | 0.016 *
4 | 50 | 10 | 10 | 100 | 50 | 0.60 | 0.598 | 0.2404 | 0.000 **
5 | 100 | 50 | 50 | 20 | 100 | 0.28 | 0.556 | 0.2469 | 0.033 *
6 | 117 | Dynamic | Dynamic | 23 | 348.2 | 0.66 | 0.515 | 0.2498 | -
The high error rate of the no re-training approach (Run 0) indicates that even
during a relatively short period of several months, the yield does not preserve a stationary
pattern. It is also clear that OLIN (Run 6) is significantly more accurate on this data than
the static windowing (Runs 3-5). The no-forgetting approach (Runs 1-2) appears slightly
more accurate than OLIN, but the differences are not statistically significant. In terms of
system resources, OLIN with its average window size of 348.2 examples requires more
memory than the static windowing. On the other hand, the no-forgetting method requires
an indefinite amount of memory to store its ever-growing window. In terms of run time,
OLIN is comparable with the static windowing and it is considerably faster than the no-forgetting approach, even for a limited stream of 1,000 records.
Figure 3 shows how OLIN adjusts the window size to the rate of concept drift. To
remove short-term effects, the error rate is averaged over the past 100 validation examples.
During the first segment of the run, approximately up to Example no. 500, the error rate
keeps going up until the system recognizes a concept drift and the window size is
drastically decreased. As a result of this window adjustment, the error rate goes down
rapidly from 70% to less than 40%. However, between Examples no. 660 and 760 there
is again a sharp increase in the average error rate, followed by a very slow decline in the
error. The second peak in the error rate is not high enough to be identified as a concept
drift, and, hence, the training window continues to grow until the end of the run, where it
approaches the maximum size of 1,000 examples.
[Figure 3. Manufacturing data: the error rate (moving average over the past 100 validation examples) and the size of OLIN's training window vs. the example number.]
Figure 4 compares the performance of four methods for online learning: no re-training, no-forgetting (initial size = 100), static windowing (size = 100, increment = 50), and dynamic windowing (OLIN). The error rates of all methods are calculated as moving averages over the past 100 validation examples. The no re-training approach is leading in the beginning of the run, but eventually its error goes up and it becomes inferior to the other methods most of the time, probably due to changes in the yield behavior. The no-forgetting method consistently provides the most accurate predictions for nearly the entire length of the data stream. The static windowing does better than OLIN in the
first part of the stream, up to approximately Example no. 900. Afterwards, there is a
sharp increase in the error rate of the static windowing, while OLIN and the no-forgetting
provide the lowest error. In other words, by the end of the run, the large windows of
OLIN and the no-forgetting method are more accurate than the small, fixed-size windows
of the static method.
[Figure 4. Manufacturing data: moving-average error rates of the no re-training, no-forgetting, static windowing, and dynamic windowing (OLIN) methods vs. the example number.]
similar as possible to the settings of the experiment with semiconductor data in subsection 4.1. Due to the larger size of this dataset, the initial number of validation examples
was set to 100 and it was allowed to vary between 10 and 400 examples. The number of
the last example to be classified is 5,461. The values of other parameters remained
unchanged and the runs were performed on the same computer that was used for the first
experiment.
Table 4 compares the performance of OLIN to other methods of online learning.
Run 0 represents the no re-training learner. Runs 1-3 were performed with the no-forgetting learner for a varying number of new examples accumulated before each model re-construction. We could not reduce this number below 100 due to extremely long training times (more than 24 hours for Add_Count = 10). Runs 4-6 show the results for a static window size, while Run 7 represents dynamic windowing with OLIN. As in Table 3, one and two asterisks denote 5% and 1% significance levels, respectively, vs. dynamic windowing.
Table 4 Stock Data: Summary of Experiments

Run No. | Initial Window | Add Count | Remove Count | Number of Windows | Av. Window Size | Variance | p-value
0 | 462 | 5000 | 0 | 1 | 5000 | 0.2475 | 0.000 **
1 | 100 | 100 | 0 | 50 | 2600.0 | 0.2365 | 0.001 **
2 | 400 | 100 | 0 | 50 | 2900.0 | 0.2383 | 0.010 **
3 | 400 | 200 | 0 | 25 | 2900.0 | 0.2396 | 0.042 *
4 | 100 | 100 | 100 | 50 | 100 | 0.2441 | 0.209
5 | 400 | 100 | 100 | 50 | 400 | 0.2423 | 0.380
6 | 400 | 200 | 200 | 25 | 400 | 0.2442 | 0.181
7 | 41 | Dynamic | Dynamic | 41 | 274.0 | 0.2428 | -
500 examples, accompanied by an increase in OLIN's window size, reveals the nature of
data pre-processing rather than a stock market phenomenon: the last interval of each
stock was truncated at the end of the period covered by the dataset (five years).
Consequently, the last intervals tend to be much shorter than the preceding intervals, and
their classification becomes almost deterministic.
[Figure: Stock data: the error rate (moving average) and the size of OLIN's training window vs. the example number.]
successful than OLIN. However, OLIN itself performs worse in this segment than the
no-forgetting learner. This implies that in the second segment, retaining all the past
examples improves the classification accuracy, though it requires an increasing amount of
computer resources. A similar result has been obtained by Hulten et al. [16] in a
comparison between a static and a dynamic system. At the same time, one can see that
the gap between OLIN and the no-forgetting method remains nearly constant as more
examples arrive. It is also noteworthy that OLIN is better at adapting to a new, though
artificial concept in the last segment of the run: its error rate goes down much faster than
the error rates of the other methods, especially the no-forgetting learner. The no re-training approach is close to static windowing most of the time, but it becomes
extremely inaccurate for the last 800 examples of the run, since it does not adjust itself to
an abrupt change in the underlying concept.
[Figure: Stock data: moving-average error rates of the no re-training, no-forgetting, static windowing, and OLIN methods vs. the example number.]
5. Conclusions
This paper has presented OLIN, a new system for on-line classification of
continuous data streams. To sum up, the system is based on the following principles. The induced concept is represented in the form of an info-fuzzy network (IFN), a tree-like classification model. The system repeatedly constructs a new network from a sliding window of the latest examples. The latest model classifies the examples that arrive before the subsequent network re-construction. A concept drift is detected by an unexpected rise in the classification error rate. When a concept is stable, the system increases the size of the training window, up to a pre-specified limit, and reduces the frequency of model updates. When a concept drift is detected, the system re-calculates the size of the training window by using the principles of information theory and statistics.
The efficiency of OLIN has been confirmed by experiments on two real-world
sets of online data. In the case of 1,000 records from a manufacturing dataset, it took
OLIN 50 times less computation time to obtain practically the same classification
accuracy as the brute-force no-forgetting method. In the second experiment, performed on 5,000 records of stock data, the ratio between the computation times of no-forgetting and OLIN was even more dramatic: about 240. In terms of classification accuracy, the no-forgetting approach outperformed OLIN by only 2%.
Hulten and Domingos [15] present a number of criteria for designing an online
data mining system. These criteria include constant time and memory per incoming
record, single scan of data, availability of a usable model at any time, equivalence to an
ordinary data mining algorithm, and adaptability to time-changing concepts along with
the preservation of concepts from the past that are still relevant. With respect to these
criteria, we have seen that OLIN requires a limited amount of time and memory per
record, does not revisit old records after they are removed from the training window, uses
an anytime algorithm for constructing an info-fuzzy network (see [23]), is almost as
accurate as models induced from all the available instances, and has a method for
adjusting the system parameters in response to a concept drift. The empirical results
presented in this paper show that when applied to nonstationary data, OLIN tends to be
more accurate than the static windowing methods. However, OLIN is usually less
accurate than the extremely inefficient no-forgetting approach, due to the apparent presence of long-term concepts, which are eventually forgotten by OLIN but retained by the no-forgetting learner.
The future work includes evaluating OLIN on more real-world datasets, as well as
on artificially built data streams. Considering the time dimension as an explicit input
attribute available to the system (similar to the approach of [3][13]) is another promising
direction. Finding a better trade-off between the efficiency of OLIN and the accuracy of
the no-forgetting method should also be examined. Other topics to study include
incremental updating of information network structure and application of a similar
approach to other classification methods. A visual environment for running beta versions
of the batch algorithm (IN) and the online algorithm (OLIN) is available for download at
http://www.ise.bgu.ac.il/faculty/mlast/ifn.htm.
Acknowledgments. This work was partially supported by the National Institute for
Systems Test and Productivity at the University of South Florida, under SPAWAR
Research Grant N00039-01-1-2248. My thanks to Prof. Eric Backer for his helpful
comments.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
M.G. Kelly, D.J. Hand, and N.M. Adams, The Impact of Changing Populations on Classifier Performance, Proc. of KDD-99, pages 367-371, 1999.
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]