A Tree Projection Algorithm For Generation of Frequent Itemsets
Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
E-mail: {agarwal, charu, vvprasad} [email protected]
CONTENTS
1. Introduction.
2. The Lexicographic Tree of Itemsets.
3. Algorithmic strategies for lexicographic tree creation.
4. Extensions to parallel association rule mining.
5. Empirical Results.
6. Conclusions and Summary.
1. INTRODUCTION
The problem of finding association rules was first introduced by Agrawal, Imielinski, and Swami [3]. This problem is concerned with finding relationships between different items in a database containing customer transactions. Such information can be used for many sales purposes such as target marketing, because the buying patterns of consumers for different items can be inferred from one another.
FIG. 1. An example of the lexicographic tree (the node acdf is at level 4).
We assume that a lexicographic ordering exists among the items in the database. In order to indicate that an item i occurs lexicographically earlier than j, we will use the notation i <_L j. The lexicographic tree is a structural representation of the frequent itemsets with respect to this ordering. The lexicographic tree is defined in the following way:
(1) A vertex exists in the tree corresponding to each frequent itemset. The root
of the tree corresponds to the null itemset.
(2) Let I = {i1, i2, ..., ik} be a frequent itemset, where i1, i2, ..., ik are listed in lexicographic order. The parent of the node I is the itemset {i1, ..., ik−1}.
The goal in this paper is to use the structure of the lexicographic tree in order to substantially reduce the CPU time for counting frequent itemsets. An example of the lexicographic tree is illustrated in Figure 1. A frequent 1-extension of an itemset such that the added item is lexicographically the last will be called a frequent lexicographic tree extension, or simply a tree extension. Thus, each edge in the lexicographic tree corresponds to a frequent lexicographic tree extension. We will denote the set of frequent lexicographic tree extensions of a node P by E(P). In the example illustrated in Figure 1, the frequent lexicographic extensions of node a are b, c, d, and f.
Let Q be the immediate ancestor of the itemset P in the lexicographic tree. The set of candidate branches of a node P is defined to be those items in E(Q) which occur lexicographically after the node P. These are the possible frequent lexicographic extensions of P. We denote this set by R(P). Thus, we have the following relationship: E(P) ⊆ R(P) ⊆ E(Q). The value of E(P) in Figure 1, when P = ab, is {c, d}. The value of R(P) for P = ab is {c, d, f}, and for P = af, R(P) is empty.
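To make these definitions concrete, the following Python sketch (our own illustration, not the paper's implementation; the class and function names are hypothetical) shows how a node of the lexicographic tree might store its itemset and extension set, and how R(P) is derived from the parent's E(Q):

    class TreeNode:
        # A node of the lexicographic tree.
        def __init__(self, itemset):
            self.itemset = tuple(itemset)  # the frequent itemset P
            self.extensions = []           # E(P): frequent lexicographic tree extensions
            self.children = {}             # one child per item in E(P)

    def candidate_branches(parent, last_item):
        # R(P) for the child P formed by extending the parent Q with last_item:
        # those items of E(Q) which occur lexicographically after last_item.
        return [i for i in parent.extensions if i > last_item]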
The levels in a lexicographic tree correspond to the sizes of the different itemsets. The various levels for the example in Figure 1 are indicated. We shall denote the set of itemsets corresponding to the nodes at level k by Lk.
A node is said to be generated the first time its existence is discovered by virtue of the extension of its immediate parent. A node is said to have been examined when its frequent lexicographic tree extensions have been determined. Thus, the process of examination of a node P results in the generation of further nodes, unless the set E(P) for that node is empty. Obviously, a node can be examined only after
it has been generated. This paper will discuss a set of algorithms which construct
the lexicographic tree in a top-down fashion by starting at the node null and
successively generating nodes until all nodes have been generated and subsequently
examined. At any point in the algorithm, a node in the lexicographic tree is defined to be inactive if it has been determined that the sub-tree rooted at that node cannot be further extended. Otherwise, the node is said to be active. Thus, whether a node is active or inactive depends on the current state of the algorithm which is generating the nodes. A node which has just been generated is usually born active, but it becomes inactive later when all its descendants have been determined. For example, in the case of Figure 1, let us assume that all
nodes up to and including level-2 have already been examined. (Consequently, all
nodes up to and including level-3 have been generated.) In this case, the set of
active nodes would be abc, acd, ab, ac, a, and null. Thus, even though there are
23 nodes corresponding to the top three levels which have been generated, only 6
of them are active. Note that we have not labeled the unexamined nodes abd and
acf as active since even the set of candidate branches for these nodes is empty.
We will be using this example repeatedly in our paper. For the sake of notational
convenience, we shall label it A.
An active node is said to be a boundary node if it has been generated but not examined. In example A, the active boundary node set is {abc, acd}. As we can
see from the complete tree in Figure 1, the subsequent examination of the node abc
will not lead to any further extensions, while the examination of the node acd will
indeed lead to the node acdf .
The extension set E(P) was produced when P was first examined. As the algorithm progresses, some of these 1-extensions are no longer active. We introduce the term AE(P) to denote the subset of E(P) which is currently active. We call these active extensions. These represent the branches at node P which are currently active. Next, we introduce the concept of active items.
The set of active items F(P) at a node P is recursively defined as follows:
(1) If the node P is a boundary node, then F(P) = R(P).
(2) If the node P is not a boundary node, then F(P) is the union of AE(P) with the active items of all nodes included in AE(P).
Clearly, AE(P) ⊆ F(P) ⊆ E(P). The first containment holds because of the nature of the relationship between AE(P) and F(P). F(P) is a set which reduces in size as more itemsets are generated, since fewer items form active extensions. For example A, for the null node, the only active extension is a, and the set of active items is {a, b, c, d, f}. For node a, its active extensions are {b, c}, and the set of active items is {b, c, d, f}.
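A minimal Python sketch of this recursion (the field names candidates for R(P), active_ext for AE(P), and is_boundary are our own, not the paper's):

    def active_items(node):
        # F(P), following the recursive definition above.
        if node.is_boundary:                  # case (1): F(P) = R(P)
            return set(node.candidates)
        result = set(node.active_ext)         # case (2): start from AE(P) ...
        for item in node.active_ext:          # ... and add F of every active child
            result |= active_items(node.children[item])
        return result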
In the next section, we will discuss an algorithm which constructs the lexicographic tree. The following information is stored at each node during the process of this construction:
(1) The itemset P at that node.
(2) The set of lexicographic tree extensions at that node which are currently active, AE(P).
(3) The set of active items F(P) at that node. F(P) and AE(P) will be updated whenever the set of boundary nodes changes.
Let P be a node in the lexicographic tree at level-m, and let all levels of the lexicographic tree up to level k > m have already been generated. Then, for a transaction T we define the projected transaction T(P) to be equal to T ∩ F(P). However, if T does not contain the itemset corresponding to node P, then T(P) is null. If T(P) has fewer than (k − m + 1) items, then it is also eliminated. This is because a transaction T(P) at a level-m node with fewer than (k − m + 1) items does not have enough items to extend to a node at level-k. The rationale for this is that any successive projection to lower levels strictly reduces the number of items in the projected transaction, and hence, if the projected transaction at node P contains fewer than (k − m + 1) items, then the same transaction will not contain any item when projected to a node at level-k. For a set of transactions T, we define the projected transaction set T(P) to be the set of projected transactions in T with respect to the active items F(P) at P.
Consider the transaction abcdefghk. Then, for the example A of Figure 1, the projected transaction at the null node would be {a, b, c, d, e, f, g, h, k} ∩ {a, b, c, d, f} = abcdf. The projected transaction at node a would be bcdf. For the transaction abdefg, its projection on node ac is null because it does not contain the required itemset ac.
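The projection rule can be stated in a few lines of Python (a sketch under the definitions above; the function name and calling convention are ours):

    def project(transaction, P_itemset, F_P, k, m):
        # T(P) for a transaction T and a level-m node P, with the tree
        # generated up to level k.  Returns None if the projection is null.
        T = set(transaction)
        if not set(P_itemset) <= T:        # T must contain the itemset P
            return None
        TP = T & set(F_P)                  # T(P) = T intersected with F(P)
        if len(TP) < k - m + 1:            # too few items to reach level k
            return None
        return ''.join(sorted(TP))

    # Example A: project('abcdefghk', '', 'abcdf', k=3, m=0) yields 'abcdf',
    # while project('abdefg', 'ac', 'df', k=3, m=2) yields None.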
Let P be a level-m node of the lexicographic tree, and let us assume that the top k > m levels have already been generated. We emphasize the following points:
(1) An inactive node does not provide any extra information which is useful for further processing. Thus, it can be eliminated from the lexicographic tree.
(2) For a given transaction T, the information required to count the support of any itemset which is a descendant of a node P is completely contained in its projection T(P).
(3) The number of items in a projected transaction T(P) is typically much smaller than in the original transaction. This number continues to reduce as the algorithm progresses, more frequent itemsets are generated, and F(P) shrinks. As F(P) shrinks and k increases, at some point T(P) contains fewer than (k − m + 1) items. At that time, T(P) becomes null.
(4) For a given transaction set T and node P, the ratio of the number of transactions in T(P) and T is approximately determined by the support of the itemset corresponding to P. Since projected transactions with fewer than (k − m + 1) items are not included, the ratio is actually much smaller.
FIG. 4. Flowchart of one iteration of breadth-first tree creation: triangular matrices are initialized at the level-(k−1) nodes, each transaction is read from disk and added to the matrix counts, the support of all matrix entries is checked, and new nodes with sufficient support are created at level-(k+1).
FIG. 5. Flowchart of the projection of a single transaction T: for each active node P at level-(k−1) and each active extension i of node P, T is projected onto extension i of node P to form the transaction T(i).
Algorithm AddCounts(T)
begin
{ Add to the counts of the matrices maintained at the nodes at level-(k − 1)
by using a depth-first projection of the transaction T }
end

Algorithm AddTree(k)
begin
Lk+1 = All (k + 1)-itemsets which satisfy the minimum support requirement
Add nodes to the tree at level-(k + 1)
end
FIG. 7. Adding New Itemsets to the Tree
Algorithm PruneTree(k)
begin
Remove all inactive nodes at level-(k + 1)
Set the active item list at each level-(k + 1) node to
the set of candidate branches at that node
for r = k down to 0 do
begin
Remove all inactive nodes at level-r
Update active item lists of nodes at level-r to the
union of their active extensions along with
active item lists of their children
end
end
FIG. 8. Pruning Inactive Nodes
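As a concrete rendering of the pseudocode in Figure 8, the following Python sketch performs the same bottom-up pass (the node fields and the levels list are our own bookkeeping, not the paper's data structures):

    def prune_tree(levels, k):
        # levels[r] lists the nodes at level r of the lexicographic tree.
        levels[k + 1] = [n for n in levels[k + 1] if n.active]
        for node in levels[k + 1]:                # boundary nodes: F(P) = R(P)
            node.active_items = set(node.candidates)
        for r in range(k, -1, -1):                # r = k down to 0
            levels[r] = [n for n in levels[r] if n.active]
            for node in levels[r]:
                # drop inactive children and recompute AE(P) and F(P)
                node.children = {i: c for i, c in node.children.items() if c.active}
                node.active_ext = set(node.children)
                node.active_items = set(node.active_ext)
                for child in node.children.values():
                    node.active_items |= child.active_items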
FIG. 9. An example of counting using a triangular matrix at the null node. The transactions are ACDF, ABCDEF, BDE, and CDEF. The active item list at the null node (level 0) is {A, B, C, D, E, F}; the level-1 nodes A, B, C, D, E, and F carry the active item lists {B, C, D, E, F}, {C, D, E, F}, {D, E, F}, {E, F}, {F}, and {}, respectively. The triangular matrix holds the pair counts AB(1), AC(2), AD(2), AE(1), AF(2), BC(1), BD(2), BE(2), BF(1), CD(3), CE(2), CF(3), DE(3), DF(3), and EF(2).
nodes at level-(k + 1). The new nodes at level-(k + 1) are added by the procedure
AddTree() of Figure 7. The algorithm prunes all those nodes which have been
determined to be unnecessary for further counting and tree generation (Figure 8).
The process of pruning inactive nodes proceeds by first removing all inactive nodes at level-(k + 1), then all inactive nodes at level-k, and so on, until the null node.
At the same time the active item lists for the nodes are updated. Thus, in the
next iteration, when (k + 2) itemsets are being counted, the time for projecting
transactions is greatly reduced. This is because only active items need to be used
in the projection. Another way to look at it is that at any node P , as the algorithm
progresses, the projected transaction set T (P ) keeps shrinking both in terms of
the number of transactions as well as the number of items in transactions. This
is because the active item list F(P) also shrinks as the algorithm progresses. If level-k of the tree has already been generated, then for a node P at level-m, the projection of a transaction T(P) must have at least (k − m + 1) items for it to be useful to extend the tree to level-(k + 1). If it has fewer items, it is eliminated. For a given value of m, as k increases, fewer transactions satisfy the (k − m + 1)-items criterion.
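The overall breadth-first iteration described above can be summarized schematically (Python; add_counts, add_tree, and prune_tree stand for the routines of Figures 7 and 8, and the tree interface is hypothetical):

    def breadth_first(tree, db, minsup):
        # One pass of the loop counts level-(k+1) candidates in triangular
        # matrices at the level-(k-1) nodes, extends the tree, and prunes it.
        k = 1
        while tree.levels[k]:               # stop when no new nodes were created
            tree.init_matrices(k - 1)       # triangular matrices at level-(k-1)
            for T in db:                    # one scan of the transaction file
                add_counts(tree, T, k)      # depth-first projection of T
            add_tree(tree, k, minsup)       # create level-(k+1) nodes
            prune_tree(tree.levels, k)      # remove inactive nodes (Figure 8)
            k += 1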
3.2. Depth-First Creation of the Lexicographic Tree
In depth-first search, we create the nodes of the lexicographic tree in depth-first order. At any point in the search, we maintain the projected transaction sets for all nodes (and their siblings) on the path from the root to the node which is currently being extended. The root of the tree contains the entire transaction database. The key point to understand is that once we have projected all the transactions at a given node, then finding the sub-tree rooted at that node is a completely independent itemset generation problem with a substantially reduced transaction set. Furthermore, it is possible to prune other branches of the tree quickly. For example, in Figure 1, once the node acdf has been discovered by the depth-first search procedure, it is possible to prune off all other sub-trees hanging at c, d, e, and f, since none of these generate any itemsets whose existence is not implied by acdf.
In breadth-first search, it is necessary to initiate the process of transaction projection in each iteration, starting from the null node. Depth-first search has the advantage that it is not necessary to re-create the projected transactions for each level-k. The problem with this technique is that since the entire transaction database needs to be carried down the tree at one time, disk I/O may be necessary in order to read and write the projected transactions. Thus, although the CPU times are reduced, the disk I/O times may increase so substantially that the method may often be infeasible for large databases, except at lower levels of the tree where the projected transactions fit in memory. A very efficient version of the depth-first algorithm has been proposed in [2] which is capable of finding very long patterns. This algorithm is more than one order of magnitude faster than the MaxMiner algorithm [6] for finding long patterns.
3.3. Combined Depth First and Breadth First Creation
In this case, we process the first few levels of the tree using breadth-first search. At some stage of the algorithm, the projected transaction set at individual boundary
level nodes is small enough to fit in memory. At that point, we write all the projected transactions to disk in separate files for each boundary node. These files are read one at a time, and the sub-tree rooted at that node is created in depth-first order. This is done entirely in memory and, therefore, does not require any additional I/O. This process is repeated for all boundary nodes. Thus, the inefficiencies associated with each approach can be avoided by using the correct strategy for the case in question. Many other hybrid schemes are possible, each having different trade-offs in terms of I/O, memory, and CPU time.
The implementation discussed in this paper uses a pure breadth-first strategy since it is geared towards short itemsets in large databases.
3.4. Transaction Projection Strategy
As in tree creation, several strategies are possible in projecting transactions to
boundary nodes where the eventual counting is done. The most important factors
are available memory and cache size. For simplicity, let us assume that the tree has
been created in breadth-first order and all nodes up to level-k have been created. To create nodes at level-(k + 1), for all nodes P at level-(k − 1) we allocate storage for triangular matrices of size |E(P)| × |E(P)|. The collective amount of memory required to hold all these matrices is likely to be far larger than the cache size of the machine. However, it is expected that they will fit in the amount of available memory. Any remaining memory is available to hold projected transactions. One possible strategy is to process one transaction at a time and project it to all nodes at level-(k − 1). In practice, a transaction will get eliminated at several intermediate nodes of the tree, reaching only a small fraction of the nodes at level-(k − 1). If the projected transaction at a level-(k − 1) node has m items, then it will require C(m, 2) = m(m − 1)/2 operations to update the triangular matrix stored at that node. This is likely to result in very poor cache performance. However, if a large fraction of the transaction database is projected to that node, then many transactions will be available for counting at that node together. Then, we can set up a loop to count the contributions of all these transactions to the triangular matrix. This results in a much better cache performance for the triangular matrices. This approach requires a larger
amount of memory. When a transaction is simultaneously projected to all nodes at level-(k − 1), the total amount of memory required for all the projections is far larger. However, this problem can be substantially avoided by following the depth-first strategy in projecting the transactions, even though the tree has been created in breadth-first order. The discussion of this section can be considered an implementation of Figure 5, using a block of transactions instead of a single transaction T. The actual implementation of the algorithm, as discussed in the empirical section, used a block of transactions at one time. The size of this block was determined by the memory availability.
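As an illustration of the counting loop at a single level-(k − 1) node, the following Python sketch (our own; a real implementation would use a flat array rather than a dictionary for the triangular matrix) processes a buffered block of projected transactions at one node:

    from itertools import combinations

    def count_block(block, extensions):
        # Count pair frequencies over E(P) for one node P, processing a whole
        # block of projected transactions at a time for better cache behavior.
        pos = {item: j for j, item in enumerate(extensions)}
        counts = {}                                   # triangular matrix over E(P)
        for t in block:
            items = sorted((i for i in t if i in pos), key=pos.get)
            for a, b in combinations(items, 2):       # C(m, 2) updates per transaction
                counts[a, b] = counts.get((a, b), 0) + 1
        return counts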
3.5. Implementation Details
The performance of this algorithm is quite sensitive to the lexicographic ordering of the items in the tree. When the ordering of the items in the tree is such that the least frequent item occurs first, the running times are best. This has also been observed earlier by Bayardo [6]. The reason for this is that the average size of the set E(P) is sensitive to the nature of the ordering. If the least frequent items are picked
FIG. 10. Cache-blocking of the triangular matrix over L1 × L1: all elements in a particular strip are counted together, since they remain blocked in cache.
first, then the size of E(P) is small. This contributes to the reduced running times. In reality, it is possible to improve the performance even further by re-ordering the items at lower level nodes based on the support counts of the corresponding frequent extensions. In our paper, we choose to maintain a fixed ordering from least to most support after counting 1-itemsets.
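Establishing this fixed ordering is straightforward once the 1-itemset supports are known; a minimal sketch (function and argument names are ours):

    def order_items(supports, minsup):
        # Keep the frequent items and order them from least to most support.
        frequent = [i for i, s in supports.items() if s >= minsup]
        return sorted(frequent, key=lambda i: supports[i])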
Our current implementation is primarily based on a pure breadth-first strategy for tree generation, combined with the depth-first strategy for transaction projection and counting. The counting of the support for frequent (k + 1)-itemsets which are descendants of a (k − 1)-itemset P was performed in a structured way so that the cache performance was optimized. We have discussed earlier that a block of transactions is used at one time in order to perform the counting. Here we will discuss the method for counting L2. The general technique of counting Lk+1 is very similar. We assume that the database which is used is expressed in terms of L1 for counting frequent 2-itemsets. This corresponds to the projection at the null node. Consider the case when |L1| = 10000. In that case, the number of elements in the triangular matrix of size |L1| × |L1| is equal to 49,995,000. The memory required to hold this triangle is far larger than the cache size of a typical machine. Therefore, cache-blocking was performed.
In cache-blocking, the current buffer of transactions is counted against one strip of elements in the matrix at a time (see Figure 10). Thus, it is more likely for this entire strip to remain in cache while the counting is performed. Each cache block represents a set of items. (An item is represented by a column in this trapezoidal cache block.) This strip of elements was chosen such that the total number of elements in it fits in the cache. The number of cache blocks NC is obtained by dividing the total amount of memory required by the cache size of the machine. We say that a transaction touches a block if it has at least one item in the cache block. If a transaction touches a block, then we maintain pointers which carry the information for the first and the last item of the transaction in each block, as well as the last item in the entire transaction. Thus, three pointers are maintained for each transaction which touches a cache-block. In this way, the transactions are implicitly projected to the cache-blocks by storing this pointer information. For a given cache block, let there be M transactions which touch it. For this block, we maintain a list of 3M pointers for these M transactions. Similar lists are maintained for each of the blocks. Memory buffers are allocated in order to hold the lists of pointers for
each cache block. These pointers may be used to perform the counting effectively. Let p1(C, T) and p2(C, T) be the pointers to the first and last item, respectively, at which a transaction T touches a cache block C, and let p3(C, T) be the pointer to the end of the transaction. Note that the value of p3(C, T) is the same for each cache block C. We first determine the projections for each transaction in the buffer onto each cache block C.
The buffer is first projected to the null node using the active item list at the null node. This will result in a large reduction in the buffer size. In the next step, it is projected to all currently active level-1 nodes. Each of these projections is likely to be an order of magnitude smaller and may fit in the highest level of cache of the machine (say L3). Next, the block at the leftmost level-1 node is projected to all its active children. These projections are likely to be another order of magnitude smaller and may fit in L2. These blocks are again projected in depth-first order to lower level nodes. At this point, these projections may fit in L1, and all subsequent processing is done with data in L1. Note that the algorithm does not need to know the sizes of the various levels of cache. The depth-first projection automatically adjusts to the various levels and sizes of cache. For the higher levels of the tree, where the transaction buffer does not fit in cache, cache performance is still good because transactions are accessed with stride one. To summarize, our tree projection counting algorithm provides good data locality and exploits multiple levels of cache.
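The strip-at-a-time discipline can be rendered as the following simplified Python sketch (our own; it recomputes the positions at which a transaction touches a block on the fly, rather than storing the three pointers p1, p2, and p3 explicitly):

    def strip_count(buffer, items, block_size):
        # Process the triangular matrix one strip (cache block) at a time:
        # every buffered transaction is counted against the current strip
        # before moving on, so the strip tends to stay resident in cache.
        index = {item: j for j, item in enumerate(items)}
        counts = {}
        for start in range(0, len(items), block_size):
            block = set(items[start:start + block_size])
            for t in buffer:
                t = sorted(t, key=index.get)
                for j, first in enumerate(t):
                    if first not in block:            # each pair is counted in the
                        continue                      # strip of its first item only
                    for second in t[j + 1:]:          # up to the end of T (cf. p3)
                        counts[first, second] = counts.get((first, second), 0) + 1
        return counts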
3.7. Further Optimizations
A number of optimizations can be performed on this algorithm. For example, when the matrix E(P) × E(P) is counted, it is also possible to count the support of the itemset P ∪ E(P). If this itemset is discovered to be frequent, then it is possible to stop exploring the sub-tree rooted at P. This is because all subsets of P ∪ E(P) are then known to be frequent.
Another technique which may be used to speed up the performance is a method for counting the lower branches of the tree. In this case, we can use bucket counting instead of matrix counting. Whenever the size of E(P) is below a certain level, a set of 2^|E(P)| − 1 buckets may be used in order to find the sub-tree rooted at P. Since there are only 2^|E(P)| − 1 possible distinct (non-null) projected transactions at a node, we can find a one-to-one mapping between the buckets and the distinct transactions. The count of a bucket is equal to the number of projected transactions which map on to it. The bucket counting is a two-phase operation. In the first phase, the number of transactions corresponding to each bucket is counted. In the second phase, the counts from the various buckets are aggregated together to form the counts for all the nodes of the sub-tree rooted at P. In this sub-tree, only those nodes which meet the desired support condition are retained. This technique can result in a substantial reduction in the CPU time and I/O requirements when the number of projected transactions at a given node is large, but the size of E(P) is relatively small.
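The two phases can be sketched as follows (Python; our own rendering, in which a bucket is the bit mask of a projected transaction over E(P), and the second phase is a standard superset-sum aggregation; the array below includes a bucket for the empty projection, which the paper excludes from its 2^|E(P)| − 1 count):

    def bucket_count(projected, extensions):
        n = len(extensions)
        pos = {item: j for j, item in enumerate(extensions)}
        counts = [0] * (1 << n)
        for t in projected:                   # phase 1: one bucket per distinct
            mask = 0                          # projected transaction
            for item in t:
                if item in pos:
                    mask |= 1 << pos[item]
            counts[mask] += 1
        for j in range(n):                    # phase 2: aggregate, so counts[b]
            for mask in range(1 << n):        # becomes the support of itemset b
                if not mask & (1 << j):
                    counts[mask] += counts[mask | (1 << j)]
        return counts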
3.8. A Note on Memory Requirements
In this subsection, we will discuss the memory requirements of the TreeProjection algorithm. The memory requirement of TreeProjection is equal to the sum of the memory requirements for the triangular matrices at each level-(k − 1) node of the lexicographic tree. At first sight, this might seem like a rather large requirement. However, it is proportional to the number of candidate (k + 1)-itemsets at that level. (Assuming that each entry of the matrix requires 4 bytes, the proportionality factor will be 4.) Most other itemset generation algorithms require the explicit generation of the candidate itemsets in some form or the other. Thus, the memory requirements for maintaining the matrices in the TreeProjection algorithm are quite modest.
FIG. 11. CPU time versus support for the T20.I6.D100K synthetic data.
FIG. 12. Relative CPU time versus database size for the T20.I6.D100K data at 0.75% support (database size in thousands of transactions).
5. EMPIRICAL RESULTS
The experiments were performed on an IBM RS/6000 43P-140 workstation with a CPU clock rate of 200 MHz, 128 MB of main memory, 1 MB of L2 cache, and running AIX 4.1. The data resided in the AIX file system and was stored on a 2GB SCSI drive. We tested the algorithm for both synthetic and real data.
The synthetic data set which we used for the purpose of our experiments was T20.I6.D100K [4]. The corresponding performance curves are illustrated in Figures 11 and 12. In Figure 11, we have illustrated the performance of the algorithm versus the support, both for the Apriori algorithm [4] and our method (we will call it the TreeProjection algorithm), on the synthetic data set T20.I6.D100K. As we see, the TreeProjection algorithm quickly outstrips the Apriori method when support
FIG. 13. CPU time versus support for retail data.
FIG. 14. CPU time per itemset versus support for retail data.
TABLE 2
Number of frequent itemsets found for the splice data set
Support in percent No of sets Longest Set
6 19112 7
5.5 26932 7
5 35047 7
4.5 45727 7
4 62258 7
3.5 93928 7
3 164584 8
2.5 309091 8
2 702065 8
FIG. 15. CPU time versus support for the splice data set.
values are low. For the lowest support value of 0.25 percent, the Apriori algorithm required 147.87 seconds, while our technique required only 33.05 seconds. On the other hand, when the support values are high, there was not much of a difference between the TreeProjection and Apriori algorithms. This is because the times for performing I/O dominate at the higher levels of support. The scaling with database size was linear for both methods.
We tested the algorithm on two real data sets. One of them was a retail data set which was first used in [6] to test the efficiency of association algorithms. This data set is not publicly available for proprietary reasons. The retail data set contains 213,972 transactions with an average transaction length of 31. The performance gap between the two methods was even more substantial for the case of the retail
data set. The corresponding performance curves are illustrated in Figures 13 and 14. In Figure 13, we have illustrated the performance variation with support. At the lowest support value of 0.1 percent, the TreeProjection algorithm required only 64.28 seconds, in comparison with the 1199.44 seconds required by the Apriori algorithm. This difference is more than an order of magnitude. The time required for counting at each level of the tree for a support of 0.1% is illustrated in Table 1. In addition, 1.48 seconds were required for L1 counting. Note that very little CPU time is spent in counting support at lower levels of the tree. The reason for this is that our PruneTree() method removes a very large fraction of the nodes. An even
more interesting characteristic of the TreeProjection algorithm was the CPU time required per frequent itemset generated. The TreeProjection algorithm improved in terms of the time required to generate each frequent itemset when the support was reduced. It is easy to see why this is the case. As the required support level drops, more nodes are generated at the lower levels of the tree. At these levels, the projected transaction database is very small, and therefore much smaller effort is required to do the counting per frequent itemset.
We also tested the algorithm on the splice data set. The splice data set was taken from the University of California at Irvine (UCI) machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) and subsequently cleaned [6]. The resulting data set had 3174 records with an average number of items equal to 61. The total number of items in the data was 244. The size of the data set was 0.8 MB. Table 2 illustrates the variation of the number of patterns found and the length of the longest pattern with support. Most of the patterns were relatively short, and even at very low levels of support (down to 2%) it was found that the length of the longest pattern was 8. The corresponding performance chart is illustrated in
Figure 15. As we see from Figure 15, the TreeProjection algorithm outperforms the Apriori method by more than an order of magnitude. At very low levels of support, the Apriori algorithm runs out of memory and is unable to run to completion. This behavior illustrates the memory advantages of using a lexicographic tree projection algorithm over the hash tree implementation of the Apriori algorithm.
ACKNOWLEDGMENT
We would like to thank Roberto Bayardo and Rakesh Agrawal for providing us with the retail and splice data on which we tested the algorithms. We would also like to thank them for providing us with their latest code for finding frequent itemsets. We thank Anant Jhingran for his comments on an earlier draft of this paper.
REFERENCES
1. Agarwal R. C., Aggarwal C. C., Prasad V. V. V., Crestana V., "A Tree Projection Algorithm for Generation of Large Itemsets for Association Rules." IBM Research Report, RC 21341.
2. Agarwal R. C., Aggarwal C. C., Prasad V. V. V., "Depth First Generation of Long Patterns." Proceedings of the ACM SIGKDD Conference, 2000.
3. Agrawal R., Imielinski T., Swami A., "Mining association rules between sets of items in very large databases." Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207-216, 1993.
4. Agrawal R., Mannila H., Srikant R., Toivonen H., Verkamo A. I., "Fast Discovery of Association Rules." Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Chapter 12, pages 307-328. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, 1994.
5. Agrawal R., Shafer J. C., "Parallel Mining of Association Rules." IEEE Transactions on Knowledge and Data Engineering, 8(6), pages 962-969, 1996.
6. Bayardo R. J., "Efficiently Mining Long Patterns from Databases." Proceedings of the ACM SIGMOD Conference, pages 85-93, 1998.
7. Brin S., Motwani R., Ullman J. D., Tsur S., "Dynamic Itemset Counting and Implication Rules for Market Basket Data." Proceedings of the ACM SIGMOD Conference, pages 255-264, 1997.
8. Han E.-H., Karypis G., Kumar V., "Scalable Parallel Data Mining for Association Rules." Proceedings of the ACM SIGMOD Conference, pages 277-288, 1997.
9. Han J., Fu Y., "Discovery of Multi-level Association Rules from Large Databases." Proceedings of the International Conference on Very Large Databases, pages 420-431, Zurich, Switzerland, September 1995.
10. Kumar V., Grama A., Gupta A., Karypis G., Introduction to Parallel Computing: Algorithm Design and Analysis. Benjamin Cummings/Addison Wesley, Redwood City, 1994.
11. Lin D., Kedem Z. M., "Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Itemset." EDBT Conference Proceedings, pages 105-119, 1998.
12. Mannila H., Toivonen H., Verkamo A. I., "Efficient algorithms for discovering association rules." AAAI Workshop on Knowledge Discovery in Databases, pages 181-192, 1994.
13. Savasere A., Omiecinski E., Navathe S. B., "An efficient algorithm for mining association rules in large databases." Proceedings of the 21st International Conference on Very Large Databases, 1995.
14. Srikant R., Agrawal R., "Mining Generalized Association Rules." Proceedings of the 21st International Conference on Very Large Data Bases, pages 407-419, 1995.
15. Srikant R., Agrawal R., "Mining quantitative association rules in large relational tables." Proceedings of the ACM SIGMOD Conference on Management of Data, pages 1-12, 1996.
16. Toivonen H., "Sampling Large Databases for Association Rules." Proceedings of the 22nd International Conference on Very Large Databases, Bombay, India, September 1996.
17. Zaki M. J., Parthasarathy S., Ogihara M., Li W., "New Algorithms for Fast Discovery of Association Rules." KDD Conference Proceedings, pages 283-286, 1997.