Volume 03, Issue 09, September 2014 ISSN 2319 - 4847
ABSTRACT
Rapid advances in data collection and storage mechanisms have enabled firms to accumulate massive amounts of data. Simple transactions of everyday life, such as credit card use or supermarket purchases, lead to automated data storage. In many cases these huge volumes of data can be mined for interesting and relevant information in a wide range of applications. When the volume of the underlying data is huge, it raises a number of computational and mining challenges. Today a large class of data-intensive applications, in which data arrives in the form of continuous streams, is widely recognized. Not only is the size of the data for these applications unbounded, but the data also arrives in a highly bursty mode.
Keywords: data mining, stream, itemset, data.
1. INTRODUCTION
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. Simple transactions of everyday life, such as using a cash card or credit card, a telephone monitoring system, or browsing the web, lead to automated data storage. In many cases these large volumes of data can be mined for interesting and relevant information in a wide range of applications. When the quantity of the underlying data is huge, it leads to a number of computational and mining challenges. Currently, a wide class of data-intensive applications, in which data is used in the form of continuous streams, has been widely recognized. Not only is the volume of the data for these applications unbounded, but the data also arrives in a highly bursty mode. Digital data in many institutions can grow without any upper limit, at a rate of billions of data items per day. Every day WalMart records 7.2 billion transactions and Google handles 3.5 billion searches [32], while Amazon's cloud routinely handles more than 500,000 transactions every second. Several applications naturally generate data streams: supermarket and banking transactions, traffic and network monitoring systems, log records or click-streams in web tracking, manufacturing processes, data from sensor applications, call detail records in telecommunications, e-mail messages over the Internet, and so on. The basic challenge is that data-intensive mining is constrained by bounded availability of resources: time, memory, and sample size. Data mining has traditionally been performed over static data sets, where mining algorithms can afford to read the input data many times. When the source is a continuous data stream, it becomes impossible to load all the data into memory, and static mining with a fixed-size dataset is no longer technically suitable due to the unique characteristics of streaming data.
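To make the single-pass, bounded-memory constraint concrete, the following sketch shows the classic Misra-Gries frequent-items summary. It is not one of the algorithms surveyed in this paper; it is shown only as an illustration of how a stream can be read exactly once while keeping a fixed number of counters.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k-1 counters.

    Any item occurring more than len(stream)/k times is guaranteed to
    survive in the summary (its count may be an underestimate).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = misra_gries(stream, k=3)
# "a" occurs 5 of 9 times (> 9/3), so it is guaranteed to be retained.
```

The memory use is O(k) regardless of stream length, which is exactly the trade-off (a small summary, possibly with bounded error) that the stream model forces.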
The following constraints apply in the data stream model:
1. The amount of data that has arrived, and will arrive in the future, is extremely large; in fact, the sequence of arrivals is potentially infinite. Thus it is almost impossible to store it all: only a small summary can be computed and stored, and the rest of the data is discarded. Even if it were possible to store all the information, it would be unfeasible to go through it for further processing.
2. The rate of arrival is high, so each single element has to be processed essentially in real time and then discarded.
3. The distribution generating the items can change over time. Thus, data from the distant past may become irrelevant (or even harmful) for the present summary.
Most approaches for handling time change contain hardwired constants, or else require input parameters, concerning the expected speed or frequency of the change; some well-known examples are a priori definitions of sliding-window size, values of decay or forgetting parameters, and explicit bounds on maximum drift. These choices represent assumptions about how fast the data are going to evolve and, of course, they may be entirely wrong. Worse, no predefined choice may be right, since the stream may experience any combination of abrupt changes, subtle ones, and long stationary periods. More generally, an approach based on fixed and predefined parameters
A Comprehensive Study of Data Stream Mining Techniques
Deepak Kumar Mishra 1 , Dr. Varsha Sharma 2 and Dr. Sanjeev Sharma 3
1 School of Information Technology, RGPV, Bhopal
2 Professor, School of Information Technology, RGPV, Bhopal
3 Professor, School of Information Technology, RGPV, Bhopal
International Journal of Application or Innovation in Engineering& Management (IJAIEM) Web Site: www.ijaiem.org Email: [email protected] Volume 03, Issue 09, September 2014 ISSN 2319 - 4847
faces the following dilemma: the user would like to use parameter values that give more accurate and efficient statistics (hence more precision) during periods of stability, but at the same time use different parameter values to react quickly to changes when they occur.
1.1 Data Mining: Data mining, as we describe it, is the exploration and analysis of large volumes of data in order to discover meaningful patterns and rules. Data mining comes in two flavours: directed and undirected. Directed data mining attempts to explain or categorize some particular target field, such as income or response. Undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes. The main tasks performed in data mining are classification, prediction, estimation, affinity grouping, clustering, description and identification [35]. The first three are examples of directed data mining, and the rest are examples of undirected data mining. Data mining transforms data into actionable results. Success is about making business sense of the data, not about using particular algorithms or tools. Keep in mind, however, that the data mining techniques and tools described here are equally applicable in fields ranging from law enforcement to astronomy, medicine, and process control. In fact, hardly any of the data mining algorithms were first invented with commercial applications in mind.
The selection of a particular combination of techniques to use in a given situation depends on the nature of the data mining task, the nature of the available data, and the skills and preferences of the data miner. Data mining is a way of learning from the past so as to make better decisions in the future. Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions. Various data mining techniques, such as decision trees, association rules and neural networks, have already been proposed and have been a focus of attention for many years. Data mining means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as rules, constraints, and regularities) from data in databases [15]. However, with the development of data mining applications and the use of distributed databases that need increased privacy, more and more privacy-preserving protocols are required to secure the data. A database can be divided into four main kinds of partition: vertical partitioning, horizontal partitioning, derived horizontal partitioning and hybrid partitioning.
1.2 Data Stream Mining: Mining frequent itemsets is a necessary step in several data mining problems, such as mining association rules, sequential patterns, closed patterns, maximal patterns, and many other important tasks. The problem of mining frequent itemsets in large databases was first proposed by Agrawal et al. [2] in 1993.
Recently, the database and data mining communities have focused on a new data model, where data arrives in the form of continuous streams; it is often referred to as data streams or streaming data. Many applications generate large amounts of stream data in real time, such as readings from sensor networks and transaction flows in retail chains. A data stream is an ordered series of items that arrives over time. Unlike data in traditional static databases, data streams are continuous, unbounded, usually arrive at high speed, and have a data distribution that often changes over time. One example application over data streams is estimating missing readings in sensor networks; another is predicting item sales in a supermarket over continuous data. Mining data streams is more challenging because of two properties: the streams are continuous and unbounded, and the data in the streams are not necessarily uniformly distributed; their distributions usually vary over time. In recent years many stream mining algorithms have been proposed, and they can be broadly categorized into two classes: exact and approximate stream mining algorithms. Exact algorithms find the truly frequent itemsets, whereas approximate algorithms find frequent itemsets using approximation procedures; that is, they may report some non-frequent itemsets or miss some frequent ones. Here we present a survey of stream mining algorithms with their efficiency and accuracy. The first recognized frequent itemset mining algorithm for traditional databases is Apriori. Afterwards, many other algorithms based on the ideas of Apriori were developed for performance improvement. Apriori-based algorithms require multiple scans of the original database, which leads to high CPU and I/O costs.
Therefore, these algorithms are not suitable for a data stream environment, in which data can be scanned only once. Another class of association rule mining algorithms for traditional databases, proposed by Han and Pei, uses a frequent pattern tree (FP-tree) data structure and the FP-growth algorithm, which allows mining of frequent itemsets without generating candidate itemsets. Compared with Apriori-based algorithms, it achieves higher performance by avoiding multiple candidate generations. Mining such streaming data differs from traditional data mining in the following aspects [3]. First, every data item in streaming data should be examined at most once. Second, memory usage for data stream mining should be bounded even though new data items are continuously generated from the stream. Third, each data item in a continuous data stream should be processed as fast as possible. Fourth, the
results obtained by the online processing algorithms should be instantly available when users request them. Finally, the frequency errors of the outputs produced by the online algorithms should be kept as small as possible. Data stream mining, as discussed above, is different from traditional data mining approaches, so to deal with a continuous data stream we have to work in two steps: first, generate a temporary static data set, and second, apply a suitable mining method over this data set; these steps are then repeated. There are three main approaches to the first step: landmark-based mining, damped-window-based mining and sliding-window-based mining. Examples of stream sources include financial tickers, performance measurements in network monitoring, traffic management, log records or click-streams in web tracking and personalization, manufacturing processes, data feeds from sensor applications, call detail records in telecommunications, e-mail messages, sensor data and image data.
2. LITERATURE SURVEY
Digital data in many organizations can grow without limit at a rate of billions of data items per day. Several applications automatically generate data streams: examples include the financial sector, performance measurements in network monitoring, and traffic control and analysis. The big challenge is that such mining is constrained by limited resources of time, memory, and sample size. The following constraints also apply in the data stream model. The amount of data that has arrived, and will arrive in the future, is extremely large; in fact, the sequence is potentially infinite, so it is impossible to store it all: only a small summary can be computed and stored, and the rest of the information is thrown away. Even if the information could all be stored, it would be unfeasible to go over it for further processing. The speed of arrival is very high, so each individual data item has to be processed essentially in real time. Finally, the distribution generating the items can change with time, so data from the past may be irrelevant for the present summary. In Figure 1 we describe the data stream mining process in a comparative context with the traditional data mining process.
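The landmark, damped and sliding-window models mentioned above differ only in how arrivals age out of the counts. A minimal sketch of the three bookkeeping styles (our own illustration, with an assumed decay factor for the damped model; none of this code is from the surveyed systems):

```python
from collections import Counter, deque

class LandmarkCounts:
    """Landmark model: counts accumulate from a fixed starting point."""
    def __init__(self):
        self.counts = Counter()
    def add(self, item):
        self.counts[item] += 1

class DampedCounts:
    """Damped (time-fading) model: older arrivals decay by a factor."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = Counter()
    def add(self, item):
        # Fade every existing count, then record the new arrival.
        for key in self.counts:
            self.counts[key] *= self.decay
        self.counts[item] += 1.0

class SlidingWindowCounts:
    """Sliding-window model: only the last `width` arrivals count."""
    def __init__(self, width):
        self.window = deque(maxlen=width)
    def add(self, item):
        self.window.append(item)   # deque drops the oldest item itself
    def counts(self):
        return Counter(self.window)

sw = SlidingWindowCounts(width=2)
for item in ["a", "b", "a", "a"]:
    sw.add(item)
```

Real stream miners avoid the per-arrival O(window) costs sketched here; the point is only the difference in aging semantics between the three models.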
Figure 1. Data stream mining process in a comparative context with the traditional data mining process
There are various methods proposed in the field of data stream mining; here we present a brief overview of the proposed methods which have contributed significantly. Y. Dora Cai, David Clutter, Greg Pape et al. [21] proposed a system named MAIDS. It consists of the following components: (1) Stream Query Engine, (2) Stream Data Classifier, (3) Stream Pattern Finder, (4) Stream Cluster Analyzer, and (5) Stream Mining Visualizer. Data streams may contain many hidden patterns, such as sequential patterns and frequent patterns, which can be used to discover hidden incidents by comparing the current patterns with those of previous durations, based on the patterns stored in the tilted time window. A pattern mining mechanism, the Stream Pattern Finder, has been constructed to discover frequent and sequential patterns. It essentially adopts the approximate frequent-pattern-growth approach [29], which discovers frequent patterns for the sets of items of interest specified by users. Since a user may know which frequent items are of interest and should be traced, this approach derives precise patterns of the user's interest. Further extensions of the MAIDS framework may handle additional frequent patterns beyond user-specified items with good approximation bounds. The most important part of this system is its use of an FP-tree while scanning data streams. Each node in the tree contains a tilted time window that accumulates counts of the frequent patterns for each time slot. The frequent patterns and association rules for a requested time horizon can be extracted and visualized
using the FP-tree structure [29]. The VFDT (Very Fast Decision Tree) method has been adapted to create decision trees which are similar to those constructed by a conventional learner, using sampling-based approximation and concept-drifting classification. VFDT is an implementation of Hoeffding trees with a few heuristics added. The pseudo-code of VFDT is given below in Algorithm 1.
VFDT(Stream, n)
1 Let T be a tree with a single leaf (the root)
2 Initialize the counts n_ijk at the root to 0
3 for each example (x, y) in Stream
4 do VFDTGrow((x, y), T, n)
Algorithm 1
Here VFDTGrow() performs the Hoeffding tree construction, and the counts n_ijk are the sufficient statistics used to choose a splitting attribute. An efficient algorithm for mining decision trees from data streams based on VFDT, known as CVFDT, was proposed by Geoff Hulten, Laurie Spencer and Pedro Domingos [12] in 2001. CVFDT learns a model of comparable accuracy in O(1) time per example, versus O(window width) for VFDT rerun over a sliding window. CVFDT periodically scans the internal nodes looking for nodes where the split attribute would no longer be selected; it grows alternate subtrees for such changed subtrees of HT, and modifies HT when the alternate is more accurate. CVFDT takes an average of 1k sec and 70 MB of memory to process a data set of 1000k transactions. Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan and Philip S. Yu [5], in 2002, define a method for mining frequent patterns in streams at multiple time granularities. They propose computing and maintaining all the frequent patterns and dynamically updating them with the incoming data stream. They extended the framework to mine time-sensitive patterns with an approximate minimum support. The method incrementally maintains tilted-time windows for each pattern at multiple time granularities, and interesting queries can be constructed and answered under this framework.
Moreover, inspired by the fact that the FP-tree provides an effective data structure for frequent pattern mining, they give FP-stream, an effective FP-tree-based model for mining frequent patterns from data streams. An FP-stream structure consists of an in-memory frequent-pattern tree to capture the frequent and sub-frequent itemset information, and a tilted-time window table for each frequent pattern. Efficient algorithms for constructing, maintaining and updating an FP-stream structure over data streams are explored. In the BTS algorithm, introduced in 2002 by Manku G. and Motwani R. [31], two parameters are used: a minimum support s and a maximum support error e, where 0 < e < 1. The incoming data stream is conceptually divided into buckets of w = 1/e transactions each, and the current length of the stream is denoted by N transactions. The BTS algorithm is composed of three steps. In the first step, BTS repeatedly reads a batch of buckets into main memory. In the second step, it decomposes each transaction within the current bucket into a set of itemsets, and stores these itemsets in a summary data structure D which contains entries of the form (i, i.freq, i.Δ), where i is an itemset, i.freq is an approximate frequency of the itemset i, and i.Δ is the maximum possible error in i.freq. For each itemset i extracted from an incoming transaction T, BTS performs two operations to maintain the summary structure D. First, it counts the occurrences of i in the current batch and updates the value i.freq if the itemset i already exists in D. Second, BTS creates a new entry (i, i.freq, i.Δ) in D if the itemset i does not occur in D but its estimated frequency i.freq in the batch is greater than or equal to e·|batch|, where the maximal possible error i.Δ is set to e·|batch|. Finally, BTS outputs those entries i in D where i.freq ≥ (s − e)·N.
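The (freq, Δ) bookkeeping of BTS can be illustrated on single items, leaving the itemset-decomposition machinery aside. The sketch below is a lossy-counting-style simplification of our own (parameter names and the per-bucket pruning rule are the standard single-item variant, not the authors' exact code):

```python
import math

def lossy_counting(stream, support, error):
    """Single-item lossy counting: entries are item -> [freq, delta].

    An item is reported if its counted frequency is at least
    (support - error) * n, mirroring the BTS output condition.
    """
    width = math.ceil(1 / error)       # bucket width w = 1/e
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # delta = bucket - 1 bounds how often the item may have
            # occurred before this entry was created.
            entries[item] = [1, bucket - 1]
        if n % width == 0:             # prune at each bucket boundary
            for key in list(entries):
                freq, delta = entries[key]
                if freq + delta <= bucket:
                    del entries[key]
    return {k: v[0] for k, v in entries.items()
            if v[0] >= (support - error) * n}

freq = lossy_counting(["a", "a", "a", "b"] * 10, support=0.5, error=0.1)
```

Here "a" (75% of arrivals) is reported while "b" (25%) falls below the (s − e)·N threshold, even though "b" survives pruning; reporting and retention are separate decisions, exactly as in BTS.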
BTS takes approximately 1k sec of processing time and consumes nearly 40 MB of memory with an error rate e = 0.001%. T. Calders, N. Dexters and B. Goethals [34] introduced a new stream mining method in 2007: INSTANT, an algorithm for mining maximal frequent itemsets over a stream landmark model, which uses a very simple but effective data structure to maintain the maximal frequent and infrequent itemsets. INSTANT employs arrays A to store itemsets with equal support; that is, A(i) stores the itemsets with support i. Thus, given the minimum support σ, the arrays from A(1) to A(σ) store the itemsets with support from 1 to σ, and A(σ) stores the maximal frequent itemsets. When a new transaction T arrives, INSTANT compares T to each itemset I in A(i−1) and inserts the intersection T ∩ I into A(i) if it is not already one of the itemsets in A(i). If x itemsets must be compared, with average itemset size y, each comparison costs O(y), so the overall time complexity per transaction is O(xy). AWSOM (Arbitrary Window Stream mOdeling Method) is a method for interesting pattern discovery from sensors proposed by Papadimitriou et al. It is a one-pass algorithm that incrementally updates the patterns. This method requires only O(log n) memory, where n is the length of the sequence. It uses wavelet coefficients as a compact information representation and for correlation structure detection, applying a linear regression model in the wavelet domain. An efficient one-pass algorithm over sliding windows on data streams with an error-bound guarantee was proposed by Congying Han and Lijun Xu [29] in 2008. This algorithm does not need to refer to obsolete transactions when they are removed from the sliding window. The algorithm is named FW-SM (Frequent Window based Stream Mining). It exploits a compact data structure to maintain potentially frequent itemsets so that it can output recent frequent itemsets at any time.
Flexible queries for continuous transactions in the sliding window can be answered with an error-bound guarantee. A recently infrequent itemset may become frequent in the future, so the frequency of each itemset would need to be recorded in order to compute frequent itemsets exactly. However, this is impractical due to the prohibitive number of itemsets occurring in data streams. In order to reduce the number of monitored itemsets, some
itemsets whose support values are far below the support threshold are not recorded, as they cannot become frequent in the near future. In order to distinguish these itemsets from potentially frequent itemsets, the authors introduce the definition of significant itemsets. The design of the window is as follows. The size of a block is k/e, where k is an integer; k can be adjusted to change the size of the block, subject to the available memory. A number of transactions of the order of 1/e is assumed to be processed in main memory. Each block is uniquely identified by an integer starting from 1; blocks with larger labels contain more recent transactions. FS(N) is defined as a collection of synopses over the window, each of which maintains the e-significant itemsets of one block. A compact data structure is required to store the set of significant itemsets in main memory, and a novel implementation, the CP-tree (Concise Prefix tree), is designed to maintain it. All sibling nodes are placed in an array in the CP-tree. One CP-tree is used to maintain all synopses, in order to save space when some synopses contain identical or similar itemsets. Besides the item, each node records a list of pairs <bid, fre>, where bid is the identifier of a block and fre is the estimated frequency of the corresponding itemset in the block. The pseudo-code of the algorithm is given in Algorithm 2.
Procedure: CP-tree generator
Input: the support threshold, the error parameter e, the size of the sliding window N and the size of the block;
A. Process the incoming transactions;
B. Construct the CP-tree.
Output: The updated CP-tree.
Algorithm 2
Two data sets, T20 and Kosarak, are used in the experiments. The T20 data set is generated by the IBM data generator [1]. The data sets were broken into batches of 10K transactions and provided to the program through standard input.
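The per-block <bid, fre> bookkeeping on CP-tree nodes can be sketched as follows. This is our own simplified reconstruction for illustration (the sibling-array layout and pruning of the real CP-tree are omitted):

```python
class CPNode:
    """Prefix-tree node; each node stores, per block, the estimated
    frequency of the itemset that ends at this node, as {bid: fre}."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}
        self.block_freq = {}            # bid -> estimated frequency

    def insert(self, itemset, bid, count=1):
        node = self
        for item in sorted(itemset):    # canonical order shares prefixes
            node = node.children.setdefault(item, CPNode(item))
        node.block_freq[bid] = node.block_freq.get(bid, 0) + count

    def window_support(self, itemset):
        """Total estimated frequency of the itemset across all blocks."""
        node = self
        for item in sorted(itemset):
            if item not in node.children:
                return 0
            node = node.children[item]
        return sum(node.block_freq.values())

root = CPNode()
root.insert({"a", "b"}, bid=1)
root.insert({"a", "b"}, bid=2)
root.insert({"a"}, bid=2)
```

Keeping the frequency list keyed by block id is what lets obsolete blocks be dropped when the window slides, without touching the shared prefix structure.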
In the experiments, the block size is set to 5×10^4 and the window size is fixed at 5×10^5, while the number of transactions in the buffer is not greater than 10^4. The average response time is not more than 150 sec with a data block size of 20K, and the memory used is 300 KB with 100K data blocks of T20. Hua-Fu Li, Suh-Yin Lee and Man-Kwan Shan [17] introduced a landmark-based stream mining algorithm, DSM-FI (Data Stream Mining for Frequent Itemsets), which mines all frequent itemsets over the entire history of a data stream. DSM-FI has three major features: a single scan of the streaming data for counting itemset frequency information, an extended prefix-tree-based compact pattern representation, and a top-down frequent itemset discovery scheme. The performance study of this algorithm shows that DSM-FI outperforms the well-known algorithm Lossy Counting in the same streaming environment. It is a single-pass algorithm for mining all frequent itemsets in data streams based on a landmark window model, given a user-specified minimum support threshold ms ∈ (0, 1) and a user-defined maximal estimated support error threshold ε ∈ (0, ms). The algorithm also defines an in-memory summary data structure called the IsFI-forest and describes its construction, which proceeds as follows. First, DSM-FI reads a transaction T from the current block. In the next step, DSM-FI projects this single large transaction into smaller transactions and inserts them into the IsFI-forest. In detail, each transaction T = (x1 x2 … xk) in the current block BN is projected into the IsFI-forest by inserting k item-suffix transactions into it. That is, transaction T is converted into k smaller transactions: (x1 x2 x3 … xk), (x2 x3 … xk), …, (xk−1 xk) and (xk).
These smaller transactions are called item-suffix transactions, since the first item in each is an item-suffix of the original transaction T. This step is called item-suffix projection, denoted IsProjection(T) = {x1|T, x2|T, …, xk|T}, where x1|T = (x1 x2 … xk), x2|T = (x2 x3 … xk), …, xk−1|T = (xk−1 xk), xk|T = (xk), and T = (x1 x2 … xk). Algorithm DSM-FI drops the original transaction T after IsProjection(T). Next, the sets of items in these item-suffix transactions are inserted into the CFI-trees (item-suffixes) as branches, and the DHTs (item-suffixes) are updated according to the item-suffixes. If an itemset (i.e., an item-suffix transaction) shares a prefix with an itemset already in the tree, the new itemset shares a prefix of the branch representing that itemset. In addition, a support counter is associated with each node in the tree; the counter is updated when the next item-suffix transaction causes the insertion of a new branch. After pruning all infrequent information from the DHTs, the CFI-trees (item-suffix) and their corresponding DHTs (item-suffix), the current IsFI-forest contains all essential information about all frequent itemsets of the stream seen so far. The outline of DSM-FI is given in Algorithm 3.
Algorithm: DSM-FI
1. Construct the summary data structure as follows: construct the IsFI-forest with a CFI-tree and a DHT; for each CFI-tree there is a DHT and an item id; assign a head link to each DHT with item-id and suffix-number.
2. Prune the infrequent items from the summary data structure.
3. Perform the top-down frequent itemset discovery scheme.
4. Repeat steps 2 and 3 for all blocks of the data set.
5. End
Algorithm 3
To evaluate the performance of DSM-FI, the empirical study is based on synthetic datasets. The parameters of the synthetic data generated by the IBM synthetic data generator [2] are as follows. The first synthetic dataset, T10.I5, has an average transaction size T of 10 items and an average frequent itemset size I of 5 items; it is a sparse dataset. In the second dataset, T30.I20, the average transaction size T and the average frequent itemset size I are set to 30 and 20, respectively; it is a dense dataset. Both synthetic datasets have 1,000,000 transactions. In the experiments, the synthetic data stream is broken into blocks of size 50K (i.e., 50,000 transactions, where 1K denotes 1,000) to simulate the continuous characteristic of streaming data; hence there are 20 blocks in total. The experiments performed over the synthetic data show a processing time of about 400 sec and memory use of 30 KB with data blocks of size 2,000k. A real-time single-pass algorithm, called Top-k-FCI [24] (top-K frequent closed itemsets of data streams), was proposed by Jun LI and Sen GONG in 2011 for efficiently mining the top-K closed itemsets from data streams. A procedure, called can(T) (candidate itemset of the transaction T), is developed for mining the essential candidates among the closed itemsets generated so far. To mine the top-K closed frequent itemsets in a data stream, three major steps are performed. The first step is to delete the old transaction window from the data structure so that the new data set of the current window can be placed. The next step is to add the new transaction set to the Top-k-FCI structure; the closed frequent itemsets are updated, based on the lattice, while adding a new transaction. The final step is the generation of the frequent patterns. The general method is given in Algorithm 4.
Algorithm: Top-k-FCI
1.
Generate the candidate itemsets using the CIL (Closed Itemset Lattice):
a. Generate the FCI (Frequent Closed Itemsets) from the CIL of the current window.
2. Update the CIL by adding the new transactions.
3. Generate the HTC (Heap Table of Candidates) from the current window.
4. Generate the frequent itemsets:
number = 0;
for each key ki of HTC do
  for each FCI Xi with the key ki of HTC
    output Xi from HTC; number = number + 1;
  endfor
  if (number >= k) then break; endif
endfor
5. End
Algorithm 4
The synthetic data set T20I4D100K, generated by the IBM synthetic data generator, is used to evaluate the performance of the proposed algorithm. As the window size increases, the Top-k-FCI algorithm consumes more processing time as well as memory; on the other hand, its efficiency is better than that of its counterpart algorithms. In 2012, Haifeng Li, Ning Zhang and Zhixin Chen [33] explored a new stream maximal frequent itemset mining algorithm named INSTANT+, based on a stream landmark model. The main ideas are as follows. First, a compact and useful data structure named FP-FOREST is designed to store the maximal frequent itemsets. In FP-FOREST, each tree node dynamically maintains the itemsets with equal support, so that compression of the storage cost and efficient computation of support become possible. Second, when a new transaction arrives, a simple but efficient algorithm dynamically maintains the FP-FOREST, in which itemsets are pruned efficiently and the maximal frequent itemsets can be produced in real time. Finally, the INSTANT+ algorithm is evaluated on two datasets against the state-of-the-art maximal frequent itemset mining method INSTANT; the experimental results show that INSTANT+ is effective and efficient.
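The INSTANT-style array structure that INSTANT+ improves upon (A(i) holding itemsets of support i, as described earlier) can be sketched as follows. This is our own simplified illustration; it omits the pruning and cross-level deduplication of the real algorithms:

```python
def instant_update(arrays, transaction, min_support):
    """One arrival in an INSTANT-style structure.

    arrays[i] holds the frozenset itemsets observed with support i.
    For each itemset I in arrays[i-1], the intersection T ∩ I is
    promoted into arrays[i] if it is not already there.
    """
    # Walk the levels top-down so an itemset promoted in this round
    # is not promoted twice for the same transaction.
    for i in range(min_support, 1, -1):
        for itemset in list(arrays[i - 1]):
            inter = itemset & transaction
            if inter and inter not in arrays[i]:
                arrays[i].add(inter)
    arrays[1].add(frozenset(transaction))

min_support = 3
arrays = {i: set() for i in range(1, min_support + 1)}
for t in [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]:
    instant_update(arrays, frozenset(t), min_support)
# arrays[min_support] now holds itemsets supported by all 3 arrivals
```

After the three transactions above, the item "a" (present in every transaction) reaches the top array A(σ), matching the intersection-promotion behaviour described for INSTANT.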
In INSTANT+, the itemsets with equal support are compressed in an FP-tree and indexed; if an itemset has no intersection with T, the computation is skipped. Supposing the number of related itemsets is w (w << n), the time complexity is O(mw). In 2013, Luigi Troiano and Giacomo Scibelli [26] presented an advanced algorithm for mining frequent itemsets in a stream of transactions within a limited time horizon. This algorithm, called Apriori with Window Test (AWT), is based on the concept of Apriori and the sliding-window-based WIS algorithm. In contrast to other approaches presented in the literature, the proposed algorithm makes use of a test window that can discard non-frequent itemsets from a set of candidates. The efficiency of this approach relies on the property that the higher the support threshold is, the smaller the test window is. In addition to a sharp horizon, a smooth window is considered: indeed, in many applications of practical interest, not all of the time slots have the same relevance, e.g., more recent slots can be more interesting than older slots. Smoothness can be determined in both qualitative and quantitative terms. The most immediate approach is to assign a degree of relevance to each slot on a
Volume 3, Issue 9, September 2014 Page 215 subjective basis. This approach will lead to shape a trapezoidal window. Other approaches could be related to the structural characteristics of the stream. A possible method can be outlined as follows: 1. Choose a collection of itemsets to which we are interested; 2. Observe the stream for a period of time in such a way that we are able to characterize it statistically; 3. Define the window width that can assure events with a given probability; 4. Compute the probability pl of each itemset of interest to be frequent at the different possible lengths l of the window; 5. Assign at each length l a degree of relevance decreasing by p. The main outlines of the AWT and WIS is given in the algorithm 5 and algorithm 6 respectively. Procedure: AWT 1. Define the window size and find data item in window by WIS. 2. Generate the L1 itemset. 3. Generate (k+1) th candidate itemset as C+= L join L. 4. Find L+1 itemset by using min support S(I). 5. Repeat 3 and 4 until no more frequent itemsetremening. 6. End Algorithm 5. Procedure: WIS 1. Fix the optimal window size W 2. Find frequent item in window by using S. 3. Find leftmost item in window last(I) 4. Move W forward by h slots and change last(I). 5. Repeat 3,4 until no more item need to be processed. 6. End. Algorithm 6. In the practical of AWT T20I6D100K data set was used which contain 893 different items. The experiment performed with the window size of 100k each of which collecting 15 different items the WIS takes 23s and 20MB memory to perform with a support of 50%. Here in the figure 2. we present an approximate line chart of few major algorithms execution time in comparative way and table 1 have a brief study of the several algorithms for stream mining, with their performance measure
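The level-wise pass of Algorithm 5 can be sketched as follows. This is a generic Apriori sketch over one window of transactions (assumed available as Python sets); the test-window pruning that distinguishes AWT is omitted.

```python
def awt_window(transactions, min_support):
    """One level-wise Apriori pass over the transactions of the
    current window, as in Algorithm 5 (AWT): generate L1, then join
    each level with itself and keep the candidates meeting the
    minimum support, until no frequent itemsets remain."""
    n = len(transactions)

    def support(itemset):
        # fraction of window transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 2: L1, the frequent single items
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    frequent = list(level)
    while level:
        # Step 3: (k+1)-candidates C+ = L join L
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Step 4: keep candidates meeting the minimum support
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)   # Step 5: repeat until none remain
    return frequent
```

For example, over the window {a,b}, {a,b,c}, {a,c}, {b} with a 50% support threshold, the pass yields a, b, c, ab and ac.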
Figure 2: Outcomes of the respective algorithms on the data source

Table 1: Comparison of various stream mining algorithms
S.No. | Algorithm Name | Description | Performance of algorithm
1 | MAIDS | A single-scan, multidimensional stream alarming system based on the window concept. | 40 MB memory used and 400 s execution time.
2 | CVFDT-2001 | A decision tree mining algorithm based on the concept-drift learning technique. | Takes 1k s and 70 MB memory to process 1000k transactions.
3 | OLIN: Last-2002 | A fuzzy-network-based online data stream mining algorithm. | Time complexity is O(n); n is the number of transactions.
4 | BTS-2002 | A landmark-based stream mining technique based on minimum support and average error rate. | Takes approx. 1k s and 40 MB to process a 100k data block.
5 | INSTANT-2007 | A simple and efficient stream mining algorithm based on the landmark method. | Time complexity is O(xy); x is the average size of window y.
6 | AWSOM | Arbitrary Window Stream mOdeling Method, a single-pass pattern discovery method for sensor data. | Requires only O(log n) memory; n is the window length.
7 | FW-SM-2008 | A one-pass algorithm based on a sliding window; uses a new type of structure, the CP-Tree, to store the frequent itemsets. | 150 s average execution time for a data block size of 20k.
8 | DSM-FI-2007 | Data Stream Mining for Frequent Itemsets, a frequent item discovery scheme; uses the landmark window model. | Average execution time and memory required for a data block of size 2000k: 400 s and 30 KB.
9 | Top-k-FCI-2011 | A real-time single-pass algorithm; uses a new can(T) algorithm for candidate generation. | Efficient for small windows of data.
10 | INSTANT+-2012 | An improved form of INSTANT, based on a stream landmark model; uses FP-FOREST to store the frequent itemsets. | Time complexity is O(mw); w is the window size.
11 | WIS-2014 | Windowing Itemset Shift, based on the Apriori concept and the sliding window model. | Average execution time is 232 ms for a data size of 10k; memory used is 20 MB.
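The basic bookkeeping behind the window-based models compared in Table 1 is to expire old transactions as new ones arrive. A minimal sketch of exact counting over a sliding window (illustrative only, not any specific algorithm from the table):

```python
from collections import deque, Counter

def sliding_window_counts(stream, window_size):
    """Maintain exact item counts over the most recent `window_size`
    transactions: add each arriving transaction, expire the oldest
    one once the window is full, and yield the current counts."""
    window = deque()
    counts = Counter()
    for transaction in stream:
        window.append(transaction)
        counts.update(transaction)
        if len(window) > window_size:   # expire the oldest transaction
            old = window.popleft()
            counts.subtract(old)
        yield dict(+counts)             # unary + drops zero entries
```

For a stream {a}, {a,b}, {b} with window size 2, the final snapshot contains a once and b twice, since the first transaction has expired.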
3. CONCLUSION In this paper we have discussed many important methods and algorithms for mining continuous data streams. MAIDS and CVFDT are among the earliest stream mining algorithms; later algorithms mainly focus on either block-based or window-based mining techniques. From the study of the various methods we found that there is no efficient and practically reliable way to develop a direct algorithm for continuous stream data, but there is considerable scope to improve efficiency and reduce execution time, as demanded by today's real-time data processing and mining industry. From this study we conclude that although several good techniques have been presented, efficiency can still be improved and processing time reduced by applying the FP-Tree algorithm together with a new sliding-window-based algorithm.
REFERENCES
[1.] R. Agrawal, T. Imielinski, A. Swami, "Database mining: a performance perspective," IEEE Trans. Knowl. Data Eng. 5(6), pp. 914-925, 1993.
[2.] Charu C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[3.] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, "Fast discovery of association rules," Advances in Knowledge Discovery and Data Mining, 1996.
[4.] S. Babu and J. Widom, "Continuous queries over data streams," SIGMOD Record 30, pp. 109-120, 2001.
[5.] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, "FP-stream: Mining frequent patterns in data streams at multiple time granularities," 2002.
[6.] M.J. Zaki and C.J. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," SIAM Int'l Conf. on Data Mining, April 2002.
[7.] C. Lucchese, S. Orlando, and R. Perego, "Fast and memory efficient mining of frequent closed itemsets," IEEE Transactions on Knowledge and Data Engineering, January 2006.
[8.] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, "Dynamic itemset counting and implication rules for market basket data," SIGMOD'97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 255-264, 1997.
[9.] J. Han, J. Pei, Y. Yin, R. Mao, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," in H. Mannila (Ed.), Data Mining and Knowledge Discovery, Kluwer, New York, NY, USA, pp. 53-87, 2004.
[10.] C. Giannella, J. Han, J. Pei, et al., "Mining frequent patterns in data streams at multiple time granularities," in Next Generation Data Mining, MIT Press, Cambridge, Massachusetts, pp. 11-212, 2003.
[11.] JoongHyuk Chang and Won Suk Lee, "A sliding window method for finding recently frequent itemsets over online data streams," Journal of Information Science and Engineering 20, pp. 753-762, 2004.
[12.] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," 7th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA, ACM Press, pp. 97-106, 2001.
[13.] M. Last, Y. Klein, A. Kandel, "Knowledge discovery in time series databases," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 31, Part B, No. 1, pp. 160-169, 2001.
[14.] T. Uno, T. Asai, Y. Uchida, H. Arimura, "LCM: an efficient algorithm for enumerating frequent closed item sets," 2003.
[15.] J. Cheng, Y. Ke, W. Ng, "Maintaining frequent itemsets over high-speed data streams," Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, 2006.
[16.] M. Last, "Online classification of nonstationary data streams," Intelligent Data Analysis, 6(2), pp. 129-147, 2002.
[17.] Hua-Fu Li, Man-Kwan Shan and Suh-Yin Lee, "DSM-FI: an efficient algorithm for mining frequent itemsets in data streams," Springer, 2007.
[18.] H.-F. Li, S.-Y. Lee, M.-K. Shan, "An efficient algorithm for mining frequent itemsets over the entire history of data streams," 1st Int. Workshop on Knowledge Discovery in Data Streams, in conjunction with the 15th European Conference on Machine Learning, Pisa, Italy, 2004.
[19.] Y. Yan, Z. Li, H. Chen, "Fast mining maximal frequent itemsets based on FP-tree," ER'04, pp. 348-361, 2004.
[20.] G.S. Manku, R. Motwani, "Approximate frequency counts over data streams," Proceedings of the 28th International Conference on Very Large Data Bases, VLDB'02, VLDB Endowment, pp. 346-357, 2002.
[21.] Y. Dora Cai, David Clutter, Greg Pape, Jiawei Han, Michael Welge, Loretta Auvil, "MAIDS: Mining Alarming Incidents from Data Streams," 2004.
[22.] J.H. Chang, W.S. Lee, "Finding recent frequent itemsets adaptively over online data streams," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'03, ACM, New York, NY, USA, pp. 487-492, 2003.
[23.] C.H. Lee, C.R. Lin, M.S. Chen, "Sliding window filtering: an efficient method for incremental mining on a time-variant database," Inf. Syst. 30, pp. 227-244, 2005.
[24.] J.H. Chang, W.S. Lee, "A sliding window method for finding recently frequent itemsets over online data streams," J. Inf. Sci. Eng. 20, pp. 753-762, 2004.
[25.] Jun Li, Sen Gong, "Top-k-FCI: Mining top-k frequent closed itemsets in data streams," Journal of Computational Information Systems 7(13), pp. 4819-4826, 2011.
[26.] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, "FP-stream: Mining frequent patterns in data streams at multiple time granularities," 2002.
[27.] Li Tu, Ling Chen, "Finding frequent items over data streams," Journal of Computational Information Systems, 6(12), pp. 4127-4134, 2010.
[28.] Luigi Troiano, Giacomo Scibelli, "Mining frequent itemsets in data streams within a time horizon," Data & Knowledge Engineering 89, pp. 21-37, 2014.
[29.] Congying Han, Lijun Xu and Guoping He, "Mining recent frequent itemsets in sliding windows over data streams," Computing and Informatics, Vol. 27, pp. 315-339, 2008.
[30.] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12, May 2000.
[31.] G. Manku, R. Motwani, "Approximate frequency counts over data streams," Proceedings of the 28th International Conference on Very Large Databases, Hong Kong, China, pp. 346-357, 2002.
[32.] Web reference: http://www.internetlivestats.com/google-search-statistics.
[33.] Haifeng Li, Ning Zhang, Zhixin Chen, "A simple but effective maximal frequent itemset mining algorithm over streams," Journal of Software, Vol. 7, No. 1, pp. 25-32, January 2012.
[34.] T. Calders, N. Dexters, and B. Goethals, "Mining frequent itemsets in a stream," Proc. ICDM, 2007.
[35.] K. Kiruba, Dr. B. Rosiline Jeetha, "A comparative study on hierarchical clustering in data mining," International Journal of Engineering Sciences & Research Technology 3(2), pp. 656-659, Feb 2014.