A Comprehensive Study of Data Stream Mining Techniques


Deepak Kumar Mishra¹, Dr. Varsha Sharma² and Dr. Sanjeev Sharma³

¹School of Information Technology, RGPV, Bhopal
²Professor, School of Information Technology, RGPV, Bhopal
³Professor, School of Information Technology, RGPV, Bhopal

International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 3, Issue 9, September 2014, ISSN 2319-4847 (www.ijaiem.org)


ABSTRACT
Rapid advances in data collection and storage mechanisms have enabled firms to accumulate massive amounts of data. Simple transactions of everyday life, such as credit card or supermarket purchases, lead to automated data storage. In many cases these huge volumes of data can be mined for interesting and relevant information in a wide range of applications. When the volume of the underlying data is huge, it raises a number of computational and mining challenges. Today a large class of data-intensive applications, in which data arrives in the form of continuous streams, is widely recognized. Not only is the size of the data for these applications unbounded, but the data also arrives in a highly bursty fashion.

Keywords: data mining, stream, itemset, data.

1. INTRODUCTION

Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. Simple transactions of everyday life, such as using a cash card or credit card, a telephone monitoring system, or browsing the web, lead to automated data storage. In many cases these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the quantity of the underlying data is huge, it leads to a number of computational and mining challenges. Currently, a wide class of data-intensive applications, in which data arrives in the form of continuous streams, is widely recognized. Not only is the volume of the data for these applications unbounded, but the data also arrives in a highly bursty mode. Digital data in many institutions can grow without any upper limit at a breakneck rate of billions of data items per day: every day WalMart records 7.2 billion transactions, Google handles 3.5 billion searches [32], and Amazon's cloud routinely handles more than 500,000 transactions every second. Several applications naturally generate data streams: supermarkets, the banking sector, traffic and network monitoring systems, log records or click-streams in web tracking, manufacturing processes, data from sensor applications, call detail records in telecommunications, e-mail messages over the internet, and so on. The basic challenge is that data-intensive mining is constrained by the bounded availability of resources: time, memory, and sample size. Data mining has traditionally been performed over static data, where algorithms can afford to read the input many times. When the source is a continuous data stream, it becomes impractical to load all the data into memory, and static mining over a fixed-size dataset is no longer technically suitable due to the unique characteristics of streaming data. The following constraints apply in the data stream model:
1. The amount of data that has arrived, and will arrive in the future, is extremely large; in fact, the sequence of arrivals is potentially infinite. Thus it is practically impossible to store it all: only a small summary can be computed and stored, and the rest of the data is discarded. Even if it were possible to store everything, it would be infeasible to go through it all for further processing.
2. The rate of arrival is high, so each single element has to be processed essentially in real time and then discarded.
3. The distribution generating the items can change over time. Thus, data from the far past may become irrelevant (or even harmful) to the current summary.
Most approaches to dealing with time change contain hardwired constants, or else require input parameters, concerning the expected speed or frequency of the change; well-known examples are a priori definitions of the sliding window size, values of decay or forgetting parameters, explicit upper and lower bounds on maximum drift, and so on. These choices embody assumptions about how fast the data are going to evolve and, of course, they may be completely wrong. Worse, no predefined choice may be right, since the stream may experience any combination of abrupt changes, subtle ones, and long stationary periods. More generally, an approach based on fixed and predefined parameters
suffers from the following trade-off: the user would like parameter values that give more accurate and efficient statistics (hence, more precision) during periods of stability, but at the same time would like quite different values in order to react quickly to change when it arises.
1.1 Data Mining
Data mining, as we describe it here, is the exploration and analysis of large volumes of data in order to discover meaningful patterns and rules. It comes in two flavours: current-result oriented and future-result oriented. Current-result-oriented data mining attempts to explain or categorize some specific target field such as income or response. Future-result-oriented data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or a collection of predefined classes. Eight main tasks are commonly performed in data mining: classification, prediction, estimation, affinity, clustering, grouping, description, and identification [35]. The first three are all examples of current-result-oriented data mining, and the last five are examples of future-result-oriented data mining. Data mining transforms data into actionable results; success is about making business sense of the data, not about using specific algorithms or tools. Keep in mind, however, that the data mining techniques and tools described here are equally applicable in fields ranging from law enforcement to astronomy, medicine, and process management. In fact, hardly any of the data mining algorithms were first invented with business applications in mind. The choice of a specific combination of techniques in a particular situation depends on the nature of the data mining task, the nature of the available data, and the skills and preferences of the data miner. Data mining is a way of learning from the past so as to make better decisions in the future. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, these tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions. Various data mining techniques, such as decision trees, association rules, and neural networks, have already been proposed and have been the focus of attention for many years. Data mining is a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as rules, constraints, and regularities) from data in databases [15]. However, with the development of data mining applications and the use of distributed databases that must preserve privacy, more and more privacy-preserving protocols are needed to secure the data. A database may be divided into four main kinds of partition: vertical partitioning, horizontal partitioning, derived horizontal partitioning, and hybrid partitioning.
1.2 Data Stream Mining
Mining frequent itemsets is a necessary step in many data mining problems, such as mining association rules, sequential patterns, closed patterns, maximal patterns, and many other important tasks. The problem of mining frequent itemsets in large databases was first proposed by Agrawal et al. [2] in 1993. Recently, the database and data mining communities have focused on a new data model in which data arrives in the form of continuous streams; it is often referred to as a data stream or streaming data. Many applications generate large amounts of stream data in real time, such as sensor data from sensor networks and transaction flows in the retail world. A data stream is an ordered sequence of items that arrive over time. Unlike data in traditional static databases, data streams are continuous, unbounded, usually arrive at high speed, and have a data distribution that often changes over time. One example application over data streams is estimating missing values in sensor networks; another is predicting items in supermarket transactions over continuous data. Mining data streams is more challenging because of two properties: the streams are continuous and unbounded, and the data in them are not necessarily uniformly distributed; their distributions usually vary over time. In recent years many stream mining algorithms have been proposed, and they can be broadly categorized into two classes: exact and approximate stream mining algorithms. Exact algorithms find the truly frequent itemsets, while approximate algorithms find frequent itemsets using approximation procedures; that is, they may report some non-frequent itemsets or miss some frequent ones. Here we present a survey of stream mining algorithms with their efficiency and accuracy. The first recognized frequent itemset mining algorithm for traditional databases is Apriori. Afterwards, many other algorithms based on the ideas of Apriori were developed for better performance. Apriori-based algorithms require multiple scans of the original database, which leads to high CPU and I/O costs; therefore these algorithms are not suitable for a data stream environment, in which data can be scanned only once. Another class of association rule mining algorithms for traditional databases, proposed by Han and Pei, uses a frequent pattern tree (FP-tree) data structure and the FP-growth algorithm, which mines frequent itemsets without generating candidate itemsets.
Compared with Apriori-based algorithms, FP-growth achieves higher performance by avoiding multiple rounds of candidate generation. Mining streaming data differs from traditional data mining in the following respects [3]: first, every data item in the stream should be examined at most once; second, memory usage for stream mining should remain bounded even though new data items are continuously generated by the stream; third, each data item should be processed as fast as possible; and fourth, the
results obtained by the online processing algorithm should be instantly available whenever the user requests them. Finally, the frequency errors of the outputs produced by the online algorithm should be kept as small as possible. Since data stream mining, as discussed above, differs from traditional data mining, dealing with a continuous stream involves two steps: first, generate a temporary static data set, and second, apply a suitable mining method over this data set, repeating both steps continuously. There are three main approaches to the first step: landmark-based mining, damped-space-based mining, and sliding-window-based mining. Examples of stream sources include financial tickers, performance measurements in network monitoring and traffic management, log records or click-streams in web tracking and personalization, manufacturing processes, data feeds from sensor applications, call detail records in telecommunications, e-mail messages, sensor data, and image data.
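To make the windowing step concrete, the following minimal Python sketch counts item frequencies under a sliding-window model, the simplest of the three approaches above. It is our own illustration, not code from any surveyed paper; the class and parameter names are hypothetical, and transactions are assumed to arrive as Python sets.

from collections import deque, Counter

class SlidingWindowCounter:
    """Keep exact item counts over the last `width` transactions."""
    def __init__(self, width):
        self.width = width
        self.window = deque()       # transactions currently in the window
        self.counts = Counter()     # item -> frequency within the window

    def add(self, transaction):
        self.window.append(transaction)
        self.counts.update(transaction)
        if len(self.window) > self.width:   # expire the oldest transaction
            old = self.window.popleft()
            self.counts.subtract(old)

    def frequent_items(self, min_support):
        """Items whose relative support in the window reaches min_support."""
        n = len(self.window)
        return {i: c for i, c in self.counts.items() if c >= min_support * n}

# Usage: a stream of transactions (sets of items)
w = SlidingWindowCounter(width=3)
for t in [{"a", "b"}, {"b", "c"}, {"a", "b", "d"}, {"b"}]:
    w.add(t)
print(w.frequent_items(min_support=0.5))    # {'b': 3}

A landmark model would simply never expire transactions, and a damped model would multiply the counts by a decay factor at each step instead of expiring them.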
2. LITERATURE SURVEY

Digital data in many organizations can grow without limit at a blistering rate of billions of data items per day. Several applications automatically generate data streams; examples are the financial sector, performance measurements in network monitoring, and traffic control and analysis. The big challenge is that data-dependent mining is constrained by limited resources of time, memory, and sample size. The following constraints also apply in the data stream model. The amount of data that has arrived and will arrive in the future is extremely large; in fact, the sequence is potentially infinite, so it is impossible to store it all. Only a small summary can be computed and stored, and the rest of the information is thrown away; even if the information could all be stored, it would be infeasible to go over it for further processing. The speed of arrival is very fast, so each individual data item has to be processed essentially in real time. Finally, the distribution generating the items can change with time, so data from the past may be irrelevant to the present summary.
Figure 1 describes the data stream mining process in a comparative context with the traditional data mining process.

[Figure not reproduced in this text version.]

Figure 1. Data stream mining process in the comparative context of the traditional data mining process
Various methods and algorithms have been proposed in the literature for data stream mining. Here we present a brief overview of the proposed methods that have contributed significantly.
Y. Dora Cai, David Clutter, Greg Pape et al. [21] proposed a system named MAIDS. It consists of the following components: (1) Stream Query Engine, (2) Stream Data Classifier, (3) Stream Pattern Finder, (4) Stream Cluster Analyzer, and (5) Stream Mining Visualizer. Data streams may contain many hidden patterns, such as sequential patterns and frequent patterns, which can be used to discover hidden incidents by comparing the current patterns with those of previous periods stored in the tilted-time window. A pattern mining mechanism, the Stream Pattern Finder, has been constructed to discover frequent and sequential patterns. This component essentially adopts the approximate frequent-pattern-growth approach [29], which discovers frequent patterns for the sets of items specified by users. Since a user often knows which frequent items are of interest, this approach derives precise patterns of the user's interest; further extensions of the MAIDS framework may handle additional frequent patterns beyond user-specified items with a good approximation bound. The most important part of the system is its use of an FP-tree while scanning data streams. Each node in the tree contains a tilted-time window that accumulates counts of the frequent patterns for each time slot. The frequent patterns and association rules for a requested time horizon can be extracted and visualized
using the FP-tree structure [29]. The VFDT (Very Fast Decision Tree) method has been adapted to create decision trees similar to those constructed by a conventional learner, using sampling-based approximation and concept-drifting classification. VFDT is an implementation of Hoeffding trees with a few heuristics added. The pseudo-code of VFDT is given below in Algorithm 1.
VFDT(Stream, n)
1  Let T be a tree with a single leaf (the root)
2  Initialize the counts n_ijk at the root to 0
3  for each example (x, y) in Stream
4      do VFDTGrow((x, y), T, n)
Algorithm 1
Here VFDTGrow() performs the Hoeffding tree growth step, and the counts n_ijk are the sufficient statistics used to choose a splitting attribute. An efficient algorithm for mining decision trees from data streams, based on VFDT and known as CVFDT, was proposed by Geoff Hulten, Laurie Spencer and Pedro Domingos [12] in 2001. CVFDT keeps its model up to date with O(1) work per example, compared with the O(window-width) cost of relearning in VFDT, while achieving similar accuracy. CVFDT periodically scans the internal nodes looking for ones where the chosen split attribute would no longer be selected; it grows alternate subtrees for such nodes of the Hoeffding tree and replaces a subtree when the alternate becomes more accurate. CVFDT takes on average about 1k seconds and 70 MB of memory to process a data set of 1000k transactions.
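Both VFDT and CVFDT decide when a leaf has seen enough examples to split by comparing the gap between the two best attributes' information gains against the Hoeffding bound. A minimal sketch of that test follows; the function and parameter names are ours, and the tie-breaking threshold is an illustrative heuristic from the VFDT literature, not a fixed constant.

import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean of a variable with the given range
    lies within epsilon of the observed mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n_examples,
                 delta=1e-7, value_range=1.0, tie_threshold=0.05):
    """Split when the observed gain gap provably exceeds chance, or when the
    two attributes are so close that the choice no longer matters (a tie)."""
    eps = hoeffding_bound(value_range, delta, n_examples)
    return (gain_best - gain_second) > eps or eps < tie_threshold

# e.g. after 5000 examples with gains 0.30 vs 0.22:
print(should_split(0.30, 0.22, 5000))   # True, since 0.08 > eps (~0.04)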
In 2002, Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan and Philip S. Yu [5] defined a method for mining frequent patterns from streams at multiple time granularities. They proposed computing and maintaining all the frequent patterns and dynamically updating them as the continuous data stream arrives, and they extended the framework to mine time-sensitive patterns with an approximate minimum support. The method incrementally maintains tilted-time windows for each pattern at multiple time granularities, and interesting queries can be constructed and answered under this framework. Moreover, inspired by the fact that the FP-tree provides an effective data structure for frequent pattern mining, they give FP-stream, an effective FP-tree-based model for mining frequent patterns from data streams. An FP-stream structure consists of an in-memory frequent-pattern tree that captures the frequent and sub-frequent itemset information, and a tilted-time window table for each frequent pattern. Efficient algorithms for constructing, maintaining, and updating an FP-stream structure over data streams are explored.
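A logarithmic tilted-time window keeps fine-grained counts for the recent past and progressively coarser ones for the distant past. The sketch below is our own illustration of that idea, not the FP-stream authors' code; the class name and the two-counts-per-level policy are illustrative assumptions.

class TiltedTimeWindow:
    """Logarithmic tilted-time window: level i summarizes about 2**i batches."""
    def __init__(self, levels=8):
        self.slots = [[] for _ in range(levels)]   # each level holds <= 2 counts

    def add(self, count):
        self.slots[0].append(count)
        # when a level overflows, merge its two oldest counts one level up
        for lvl in range(len(self.slots) - 1):
            if len(self.slots[lvl]) > 2:
                a = self.slots[lvl].pop(0)
                b = self.slots[lvl].pop(0)
                self.slots[lvl + 1].append(a + b)

    def total(self):
        return sum(sum(level) for level in self.slots)

ttw = TiltedTimeWindow()
for batch_count in [5, 3, 7, 2, 9, 4]:   # per-batch frequencies of one pattern
    ttw.add(batch_count)
print(ttw.total())                        # 30: nothing lost, O(log n) slots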
In the BTS algorithm, introduced in 2002 by Manku and Motwani [31], two estimated parameters are used: a minimum support s and a maximum support error e, where 0 < e < 1. The incoming data stream is conceptually divided into buckets of width w = 1/e transactions each, and the current length of the stream is denoted by N transactions. The BTS algorithm is composed of three steps. In the first step, BTS repeatedly reads a batch of buckets into main memory. In the second step, it decomposes each transaction within the current bucket into a set of itemsets and stores these itemsets in a summary data structure D, which contains entries of the form (i, i.freq, i.err), where i is an itemset, i.freq is the approximate frequency of the itemset i, and i.err is the maximum possible error in i.freq. For each itemset i extracted from an incoming transaction T, BTS performs two operations to maintain the summary data structure D. First, it counts the occurrences of i in the current batch and updates the value i.freq if the itemset i already exists in D. Second, BTS creates a new entry (i, i.freq, i.err) in D if the itemset i does not occur in D but its estimated frequency i.freq in the batch reaches the batch-size-proportional threshold, in which case the maximal possible error i.err is set according to the batch size. Finally, BTS outputs those entries i in D for which i.freq >= (s - e)·N. BTS takes approximately 1k seconds of processing time and consumes nearly 40 MB of memory with error rate e = 0.001%.
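For single items this bookkeeping reduces to the classic lossy counting scheme of the same authors; the following sketch is our paraphrase of that standard algorithm (not the BTS itemset variant), with entries (freq, delta) so that the true count lies in [freq, freq + delta].

def lossy_count(stream, e):
    """Approximate item frequencies with undercount at most e*N."""
    D = {}                       # item -> [freq, delta]
    width = int(1 / e)           # bucket width
    for n, item in enumerate(stream, start=1):
        bucket = (n - 1) // width + 1        # current bucket id
        if item in D:
            D[item][0] += 1
        else:
            D[item] = [1, bucket - 1]        # maximum undercount so far
        if n % width == 0:                   # bucket boundary: prune
            D = {i: fd for i, fd in D.items() if fd[0] + fd[1] > bucket}
    return D

def frequent(D, s, e, N):
    """Report items whose estimated support reaches (s - e) * N."""
    return [i for i, (f, d) in D.items() if f >= (s - e) * N]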
T. Calders, N. Dexters, and B. Goethals [34] introduced a new stream mining method in 2007: INSTANT, an algorithm for mining maximal frequent itemsets over a stream landmark model, which uses a very simple but effective data structure to maintain the maximal frequent itemsets and the infrequent itemsets. INSTANT employs arrays A to store the itemsets with equal support; that is, A(i) stores the itemsets with support i. Thus, given the minimum support threshold, the arrays from A(1) up to the threshold level store the itemsets with supports from 1 up to the threshold, and the top array stores the maximal frequent itemsets. When a new transaction T arrives, INSTANT compares T with each itemset I in A(i-1) and inserts the intersection of T and I into A(i) if it is not already one of the itemsets in A(i). If x itemsets are compared, the time complexity is O(x); if the average size of these x itemsets is y, the average cost per itemset is O(y); thus, the overall time complexity is O(xy).
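A bare-bones rendering of that update rule in Python, based on our reading of the description above, could be as follows; frozensets stand in for itemsets, and the real algorithm additionally prunes non-maximal itemsets, which this sketch omits.

def instant_update(A, transaction, sigma):
    """A[i] holds itemsets seen with support i (i = 1..sigma); A[sigma]
    accumulates frequent itemsets (INSTANT also prunes non-maximal ones)."""
    t = frozenset(transaction)
    # walk from high support down so an itemset is promoted once per arrival
    for i in range(sigma, 1, -1):
        for itemset in A[i - 1]:
            inter = t & itemset
            if inter and inter not in A[i]:
                A[i].add(inter)
    A[1].add(t)                     # the transaction itself has support >= 1

sigma = 2
A = {i: set() for i in range(1, sigma + 1)}
for tx in [{"a", "b", "c"}, {"a", "b"}, {"b", "d"}]:
    instant_update(A, tx, sigma)
print(A[sigma])                     # itemsets supported at least twice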
AWSOM (Arbitrary Window Stream mOdeling Method) is a method for discovering interesting patterns from sensors, proposed by Papadimitriou et al. It is a one-pass algorithm that incrementally updates the patterns and requires only O(log n) memory, where n is the length of the sequence. It uses wavelet coefficients as a compact information representation for correlation structure detection, applying a linear regression model in the wavelet domain.
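The O(log n) memory figure reflects the fact that Haar wavelet coefficients can be maintained incrementally while keeping only one pending partial average per resolution level. The sketch below illustrates just that trick; it is our illustration, not AWSOM itself, which adds regression on top and would retain only a bounded number of coefficients per level.

class IncrementalHaar:
    """Maintain Haar detail coefficients over a stream, keeping one pending
    average per level, hence O(log n) working state."""
    def __init__(self):
        self.pending = {}    # level -> average waiting for its sibling
        self.details = []    # (level, detail coefficient), most recent last

    def push(self, x, level=0):
        if self.pending.get(level) is None:
            self.pending[level] = x                      # wait for sibling
        else:
            a = self.pending.pop(level)
            self.details.append((level, (a - x) / 2.0))  # detail coefficient
            self.push((a + x) / 2.0, level + 1)          # carry average upward

h = IncrementalHaar()
for v in [2.0, 4.0, 6.0, 8.0]:
    h.push(v)
print(h.details)    # [(0, -1.0), (0, -1.0), (1, -2.0)]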
An efficient one-pass algorithm over sliding windows with an error bound guarantee was proposed by Congying Han and Lijun Xu [29] in 2008. The algorithm, named FW-SM (Frequent Window based Stream Mining), does not need to refer to obsolete transactions when they are removed from the sliding window. It exploits a compact data structure to maintain potentially frequent itemsets so that it can output recent frequent itemsets at any time, and flexible queries over the transactions currently in the sliding window can be answered with an error bound guarantee. A recently infrequent itemset may become frequent in the future, so in principle the frequency of every itemset would have to be recorded to compute frequent itemsets exactly; however, that is impractical because of the prohibitive number of itemsets occurring in data streams. To reduce the number of monitored itemsets, itemsets whose support is far below the support threshold are not recorded, since they cannot become frequent in the near future; to tell such itemsets apart from potentially frequent ones, the authors introduce the definition of significant itemsets. The window is designed as follows. The stream is divided into blocks of size k/e, where k is an integer that can be adjusted to change the block size; the number of transactions processed in main memory may change according to the available memory. Each block is uniquely identified by an integer starting from 1, so blocks with larger labels contain more recent transactions. FS(N) is defined as a collection of synopses over the window, each of which maintains the e-significant itemsets of one block. A compact data structure is required to store the set of significant itemsets in main memory, and a novel implementation, the CP-tree (Concise Prefix tree), is designed to maintain it. All sibling nodes are placed in an array in the CP-tree, and a single CP-tree is used to maintain all the synopses in order to save space when some synopses contain identical or similar itemsets. Besides the item, each node records a list of pairs <bid, fre>, where bid is the identifier of a block and fre is the estimated frequency of the corresponding itemset in that block. The pseudo-code of the algorithm is given in Algorithm 2.
Procedure: CP-tree generator
Input: the support threshold, the error parameter e, the sliding window size N, and the block size k/e.
1. Process the incoming transactions.
2. Construct the CP-tree.
Output: the updated CP-tree.
Algorithm 2
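The per-node <bid, fre> bookkeeping can be pictured with a tiny class. This is purely illustrative, with hypothetical names, not the authors' implementation; it shows why expired blocks can be dropped without revisiting old transactions.

class CPNode:
    """One CP-tree node: an item plus per-block estimated frequencies."""
    def __init__(self, item):
        self.item = item
        self.children = {}      # item -> CPNode (siblings kept together)
        self.block_freq = {}    # bid -> estimated frequency in that block

    def record(self, bid, count=1):
        self.block_freq[bid] = self.block_freq.get(bid, 0) + count

    def expire(self, oldest_live_bid):
        """Drop counts for blocks that have slid out of the window."""
        self.block_freq = {b: f for b, f in self.block_freq.items()
                           if b >= oldest_live_bid}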
Two data sets, T20 and Kosarak, are used in the experiments. The T20 data set is generated by the IBM data generator [1]. The data sets were broken into batches of 10K transactions and fed to the program through standard input. In the experiments the block size is set to 5×10^4 and the window size is fixed at 5×10^5, while the number of transactions in the buffer is kept below 10^4. The average response time is no more than 150 s with a data block size of 20K, and the memory usage is 300 KB with a 100K data block of T20.
Hua-Fu Li, Suh-Yin Lee and Man-Kwan Shan [17] introduced a landmark-based stream mining algorithm, DSM-FI (Data Stream Mining for Frequent Itemsets), which mines all the frequent itemsets over the entire history of a data stream. DSM-FI has three major features: a single scan of the streaming data for counting itemset frequency information, an extended prefix-tree-based compact pattern representation, and a top-down frequent itemset discovery scheme. The performance study of this algorithm shows that DSM-FI outperforms the well-known Lossy Counting algorithm in the same streaming environment. It is a single-pass algorithm for mining all frequent itemsets in data streams based on a landmark window model, given a user-specified minimum support threshold ms in (0, 1) and a user-defined maximal estimated support error threshold in (0, ms). The algorithm defines an in-memory summary data structure called the IsFI-forest, whose construction proceeds as follows. First, DSM-FI reads a transaction T from the current block. Next, DSM-FI projects this transaction into smaller transactions and inserts them into the IsFI-forest. In detail, each transaction T = (x1 x2 ... xk) in the current block BN is projected into the IsFI-forest by inserting k item-suffix transactions into it; that is, T is converted into the k transactions (x1 x2 x3 ... xk), (x2 x3 ... xk), ..., (xk-1 xk) and (xk). These are called item-suffix transactions, since the first item of each is an item-suffix of the original transaction T. This step is called Item-suffix Projection and is denoted IsProjection(T) = {x1|T, x2|T, ..., xk|T}, where x1|T = (x1 x2 ... xk), x2|T = (x2 x3 ... xk), ..., xk-1|T = (xk-1 xk), and xk|T = (xk). DSM-FI drops the original transaction T after IsProjection(T). Next, the items in these item-suffix transactions are inserted into the corresponding CFI-trees (one per item-suffix) as branches, and the corresponding DHTs (item-suffix header tables) are updated. If an item-suffix transaction shares a prefix with an itemset already in the tree, the new itemset shares the prefix of the branch representing that itemset. In addition, a support counter is associated with each node in the tree and is updated whenever an item-suffix transaction causes the insertion of a branch. After pruning all infrequent information from each DHT, from each CFI-tree and its corresponding DHT, and from the CFI-trees containing the item-suffix, the current IsFI-forest contains all the essential information about all the frequent itemsets of the stream seen so far. The outline of DSM-FI is given in Algorithm 3.
Algorithm: DSM-FI
1. Construct the summary data structure:
   a. Build the IsFI-forest, with a CFI-tree and a DHT (with an item id) for each item-suffix.
   b. Assign a head link to each DHT with its item id and suffix number.
2. Prune the infrequent items from the summary data structure.
3. Perform the top-down frequent itemset discovery scheme.
4. Repeat steps 2 and 3 for all blocks of the data set.
5. End
Algorithm 3
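Item-suffix Projection is easy to state in code. The sketch below is our illustration of the projection step alone; it generates the k item-suffix transactions for one incoming transaction, after which the original can be dropped.

def is_projection(transaction):
    """IsProjection(T): all item-suffix transactions of T.
    T is an ordered tuple of items, e.g. ('x1', 'x2', 'x3')."""
    return [transaction[i:] for i in range(len(transaction))]

print(is_projection(("a", "b", "c")))
# [('a', 'b', 'c'), ('b', 'c'), ('c',)] -- each suffix is inserted into the
# IsFI-forest under its leading item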
To evaluate the performance of DSM-FI, the empirical studies are based on synthetic datasets generated by the IBM synthetic data generator [2], with the following parameters. The first synthetic dataset, T10.I5, has an average transaction size T of 10 items and an average frequent itemset size I of 5 items; it is a sparse dataset. In the second dataset, T30.I20, the average transaction size T and the average frequent itemset size I are set to 30 and 20, respectively; it is a dense dataset. Both synthetic datasets have 1,000,000 transactions. In the experiments, the synthetic data stream is broken into blocks of size 50K (i.e., 50,000, where 1K denotes 1,000) to simulate the continuous character of streaming data; hence there are 20 blocks in total. The experiments performed over the synthetic data show an execution time of about 400 s and a memory usage of 30 KB with data blocks of size 2,000k.
A real-time single-pass algorithm called Top-k-FCI [24] (top-k frequent closed itemsets of data streams) was proposed by Jun Li and Sen Gong in 2011 for efficiently mining the top-k closed itemsets from data streams. A novel procedure, can(T) (candidate itemsets of the transaction T), is developed for mining the essential candidate closed itemsets generated so far. To mine the top-k closed frequent itemsets in a data stream, three major steps are performed. The first step is to delete the old transaction window from the data structure so that the new data of the current window can be placed. The next step is to add the new transaction set to the Top-k-FCI structure; the closed frequent itemsets are updated on the lattice as each new transaction is added. The final step is the generation of the frequent patterns. The general algorithm is given in Algorithm 4.
Algorithm: Top-k-FCI
1. Generate the candidate itemsets using the CIL (Closed Itemset Lattice).
   a. Generate the FCI (Frequent Closed Itemsets) from the CIL of the current window.
2. Update the CIL by adding the new transactions.
3. Generate the HTC (Heap Table of Candidates) from the current window.
4. Generate the frequent itemsets:
   set number = 0
   for each key ki of HTC do
       for each FCI Xi with the key ki of HTC: output Xi from HTC;
           number = number + 1
       endfor
       if (number >= k) then break endif
   endfor
5. End
Algorithm 4
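Step 4 simply pops candidates in support order until k have been reported. With a heap keyed on support, that loop looks roughly as follows; this is our illustration of the idea, not the authors' HTC implementation.

import heapq

def top_k_from_candidates(candidates, k):
    """candidates: dict mapping a closed itemset to its support.
    Returns the k itemsets of highest support (the HTC pop loop)."""
    heap = [(-sup, sorted(iset)) for iset, sup in candidates.items()]
    heapq.heapify(heap)                  # max-heap via negated support
    out = []
    while heap and len(out) < k:
        sup, iset = heapq.heappop(heap)
        out.append((frozenset(iset), -sup))
    return out

cands = {frozenset({"a"}): 5, frozenset({"a", "b"}): 3, frozenset({"c"}): 4}
print(top_k_from_candidates(cands, 2))   # [({'a'}, 5), ({'c'}, 4)]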
The synthetic data set T20I4D100K, generated by the IBM synthetic data generator, is used to evaluate the performance of the proposed algorithm. As the window size increases, the Top-k-FCI algorithm consumes more processing time as well as memory; on the other hand, its efficiency is better than that of its counterpart algorithms. In 2012, Haifeng Li, Ning Zhang and Zhixin Chen [33] explored a new stream maximal frequent itemset mining algorithm named INSTANT+, based on a stream landmark model. The main ideas are as follows. They start by designing a compact and useful data structure named FP-FOREST to store the maximal frequent itemsets; in FP-FOREST, each tree node dynamically maintains the itemsets with equal minimum support, so that compression of the storage cost and efficient computation of supports become possible. Then, when a new transaction arrives, a simple but efficient algorithm dynamically maintains the FP-FOREST, in which itemsets are pruned efficiently and the maximal frequent itemsets can be produced in real time. Finally, they evaluate the INSTANT+ algorithm on two datasets against the state-of-the-art maximal frequent itemset mining method INSTANT; the experimental results show that INSTANT+ is effective and efficient. In INSTANT+, the itemsets with equal support are compressed into an FP-tree and indexed; if an itemset has no intersection with T, the computation is skipped. If the number of related itemsets is w (w << n), the time complexity is O(mw).
In 2013, Luigi Troiano and Giacomo Scibelli [26] presented an advanced algorithm for mining frequent itemsets in a stream of transactions within a limited time horizon. The algorithm, called AWT (Apriori with Window Test), is based on the concept of Apriori and on the sliding-window-based WIS algorithm. In contrast to other approaches presented in the literature, it makes use of a test window that can discard non-frequent itemsets from a set of candidates; the efficiency of this approach relies on the property that the higher the support threshold is, the smaller the test window is. In addition to considering a sharp horizon, the authors consider a smooth window: in many applications of practical interest, not all time slots have the same relevance, e.g., more recent slots can be more interesting than older ones. Smoothness can be determined in both qualitative and quantitative terms. The most immediate approach is to assign a degree of relevance to each slot on a subjective basis, which leads to a trapezoidal window; other approaches could be related to the structural characteristics of the stream. A possible method can be outlined as follows: 1. choose a collection of itemsets of interest; 2. observe the stream for a period of time in such a way that it can be characterized statistically; 3. define the window width that can assure events with a given probability; 4. compute the probability pl of each itemset of interest being frequent at the different possible lengths l of the window; 5. assign to each length l a degree of relevance decreasing with pl. The main outlines of AWT and WIS are given in Algorithm 5 and Algorithm 6, respectively.
Procedure: AWT
1. Define the window size and find the data items in the window using WIS.
2. Generate the L1 itemset.
3. Generate the (k+1)-th candidate itemset as C(k+1) = Lk join Lk.
4. Find the L(k+1) itemsets using the minimum support S(I).
5. Repeat steps 3 and 4 until no more frequent itemsets remain.
6. End
Algorithm 5
Procedure: WIS
1. Fix the optimal window size W.
2. Find the frequent items in the window using S.
3. Find the leftmost item in the window, last(I).
4. Move W forward by h slots and update last(I).
5. Repeat steps 3 and 4 until no more items need to be processed.
6. End
Algorithm 6
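Step 3 of AWT is the classic Apriori candidate-generation join. A minimal version of that join, standard Apriori rather than the authors' windowed refinement, is sketched below; names are ours.

from itertools import combinations

def apriori_gen(frequent_k):
    """Join L_k with itself to form C_{k+1}, then prune candidates that
    contain an infrequent k-subset (the Apriori property)."""
    Lk = {frozenset(s) for s in frequent_k}
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            u = a | b
            if len(u) == k + 1:
                # keep u only if every k-subset of u is frequent
                if all(frozenset(c) in Lk for c in combinations(u, k)):
                    candidates.add(u)
    return candidates

L2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori_gen(L2))   # {frozenset({'a', 'b', 'c'})}

AWT's contribution is what happens after this join: candidates are tested against the (smaller) test window first, so most non-frequent itemsets are discarded cheaply.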
In the experiments on AWT, the T20I6D100K data set, containing 893 different items, was used. With a window size of 100k, each window collecting 15 different items, WIS takes 23 s and 20 MB of memory at a support of 50%. Figure 2 presents an approximate line chart of the execution times of a few major algorithms, and Table 1 gives a brief comparative study of the surveyed stream mining algorithms with their performance measures.

[Figure not reproduced in this text version.]

Figure 2: Outcomes of the respective algorithms
Table 1: Comparison of various stream mining algorithms

S.No. | Algorithm         | Description                                                                                        | Performance
1     | MAIDS             | Single-scan, multidimensional stream alarming system based on the window concept.                 | 40 MB memory used and 400 s execution time.
2     | CVFDT (2001)      | Decision tree mining algorithm based on the concept-drift learning technique.                     | Takes 1k s and 70 MB memory to process 1000k transactions.
3     | OLIN: Last (2002) | Fuzzy-network-based online data stream mining algorithm.                                          | Time complexity O(n); n is the number of transactions.
4     | BTS (2002)        | Landmark-based stream mining technique using a minimum support and an average error rate.         | Approx. 1k s and 40 MB to process a 100k data block.
5     | INSTANT (2007)    | Simple and efficient stream mining algorithm based on the landmark method.                        | Time complexity O(xy); x itemsets of average size y.
6     | AWSOM             | Arbitrary Window Stream mOdeling Method, a single-pass pattern discovery method for sensor data.  | Requires only O(log n) memory; n is the window length.
7     | FW-SM (2008)      | One-pass sliding-window algorithm that uses the new CP-tree structure to store frequent itemsets. | 150 s average execution time for a data block size of 20k.
8     | DSM-FI (2007)     | Data Stream Mining for Frequent Itemsets; top-down frequent itemset discovery scheme over a landmark window model. | Average execution time 400 s and memory 30 KB for a data block of size 2000k.
9     | Top-k-FCI (2011)  | Real-time single-pass algorithm; uses the new can(T) algorithm for candidate generation.          | Efficient for small windows of data.
10    | INSTANT+ (2012)   | Improved INSTANT based on a stream landmark model; uses FP-FOREST to store frequent itemsets.     | Time complexity O(mw); w is the window size.
11    | WIS (2014)        | Windowing itemset shift, based on the concepts of Apriori and the sliding window model.           | Average execution time 232 ms for a data size of 10k; 20 MB memory used.

3. CONCLUSION
In this paper we have discussed many important methods and algorithms for mining continuous data streams. MAIDS and CVFDT are among the earliest stream mining algorithms; later algorithms focus mainly on either block-based or window-based mining techniques. From the study of the various methods we find that there is no efficient and practically reliable way to develop a direct algorithm for continuous stream data. There is, however, considerable room to improve efficiency and to reduce execution time, as demanded by today's real-time data processing and mining industry. We conclude that several good techniques have been presented, but efficiency can still be improved and processing time reduced by combining the FP-tree algorithm with a new sliding-window-based algorithm.

REFERENCES

[1] R. Agrawal, T. Imielinski, A. Swami, "Database mining: a performance perspective," IEEE Trans. Knowl. Data Eng. 5(6), pp. 914-925, 1993.
[2] Charu C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, 1996.
[4] S. Babu and J. Widom, "Continuous queries over data streams," SIGMOD Record 30, pp. 109-120, 2001.
[5] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, "FP-stream: Mining frequent patterns in data streams at multiple time granularities," 2002.
[6] M.J. Zaki and C.J. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," SIAM Int'l Conf. on Data Mining, April 2002.
[7] C. Lucchese, S. Orlando, and R. Perego, "Fast and memory efficient mining of frequent closed itemsets," IEEE Transactions on Knowledge and Data Engineering, January 2006.
[8] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in SIGMOD'97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 255-264, 1997.
[9] J. Han, J. Pei, Y. Yin, R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," in H. Mannila (Ed.), Data Mining and Knowledge Discovery, Kluwer, New York, NY, USA, pp. 53-87, 2004.
[10] C. Giannella, J. Han, J. Pei, et al., "Mining frequent patterns in data streams at multiple time granularities," in Next Generation Data Mining, MIT Press, Cambridge, Massachusetts, pp. 11-212, 2003.
[11] Joong Hyuk Chang and Won Suk Lee, "A sliding window method for finding recently frequent itemsets over online data streams," Journal of Information Science and Engineering 20, pp. 753-762, 2004.
[12] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in 7th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, ACM Press, pp. 97-106, 2001.
[13] M. Last, Y. Klein, A. Kandel, "Knowledge discovery in time series databases," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 31, Part B, No. 1, pp. 160-169, 2001.
[14] T. Uno, T. Asai, Y. Uchida, H. Arimura, "LCM: an efficient algorithm for enumerating frequent closed item sets," 2003.
[15] J. Cheng, Y. Ke, W. Ng, "Maintaining frequent itemsets over high-speed data streams," in Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2006.
[16] M. Last, "Online classification of nonstationary data streams," Intelligent Data Analysis, 6(2), pp. 129-147, 2002.
[17] Hua-Fu Li, Man-Kwan Shan and Suh-Yin Lee, "DSM-FI: an efficient algorithm for mining frequent itemsets in data streams," Springer, 2007.
[18] H.-F. Li, S.-Y. Lee, M.-K. Shan, "An efficient algorithm for mining frequent itemsets over the entire history of data streams," 1st Int. Workshop on Knowledge Discovery in Data Streams, in conjunction with the 15th European Conference on Machine Learning, Pisa, Italy, 2004.
[19] Y. Yan, Z. Li, H. Chen, "Fast mining maximal frequent itemsets based on FP-tree," in ER'04, pp. 348-361, 2004.
[20] G.S. Manku, R. Motwani, "Approximate frequency counts over data streams," in Proceedings of the 28th International Conference on Very Large Data Bases, VLDB'02, VLDB Endowment, pp. 346-357, 2002.
[21] Y. Dora Cai, David Clutter, Greg Pape, Jiawei Han, Michael Welge, Loretta Auvil, "MAIDS: Mining Alarming Incidents from Data Streams," 2004.
[22] J.H. Chang, W.S. Lee, "Finding recent frequent itemsets adaptively over online data streams," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'03, ACM, New York, NY, USA, pp. 487-492, 2003.
[23] C.H. Lee, C.R. Lin, M.S. Chen, "Sliding window filtering: an efficient method for incremental mining on a time-variant database," Inf. Syst. 30, pp. 227-244, 2005.
[24] J.H. Chang, W.S. Lee, "A sliding window method for finding recently frequent itemsets over online data streams," J. Inf. Sci. Eng. 20, pp. 753-762, 2004.
[25] Jun Li, Sen Gong, "Top-k-FCI: Mining top-k frequent closed itemsets in data streams," Journal of Computational Information Systems 7(13), pp. 4819-4826, 2011.
[26] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, "FP-stream: Mining frequent patterns in data streams at multiple time granularities," 2002.
[27] Li Tu, Ling Chen, "Finding frequent items over data streams," Journal of Computational Information Systems, 6(12), pp. 4127-4134, 2010.
[28] Luigi Troiano, Giacomo Scibelli, "Mining frequent itemsets in data streams within a time horizon," Data & Knowledge Engineering 89, pp. 21-37, 2014.
[29] Congying Han, Lijun Xu and Guoping He, "Mining recent frequent itemsets in sliding windows over data streams," Computing and Informatics, Vol. 27, pp. 315-339, 2008.
[30] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12, May 2000.
[31] G. Manku, R. Motwani, "Approximate frequency counts over data streams," in Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, pp. 346-357, 2002.
[32] Web reference: http://www.internetlivestats.com/google-search-statistics.
[33] Haifeng Li, Ning Zhang, Zhixin Chen, "A simple but effective maximal frequent itemset mining algorithm over streams," Journal of Software, Vol. 7, No. 1, pp. 25-32, January 2012.
[34] T. Calders, N. Dexters, and B. Goethals, "Mining frequent itemsets in a stream," in Proc. ICDM, 2007.
[35] K. Kiruba, Dr. B. Rosiline Jeetha, "A comparative study on hierarchical clustering in data mining," International Journal of Engineering Sciences & Research Technology 3(2), pp. 656-659, Feb 2014.
