Fast Algorithms For Mining Association Rules
Rakesh Agrawal Ramakrishnan Srikant IBM Almaden Research Center 650 Harry Road, San Jose, CA 95120
Abstract
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Experiments with synthetic as well as real-life data show that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database.
1 Introduction
Database mining is motivated by the decision support problem faced by most large retail organizations [S+93]. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Successful organizations view such databases as important pieces of the marketing infrastructure [Ass92]. They are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies [Ass90].
The problem of mining association rules over basket data was introduced in [AIS93b]. An example of such a rule might be that 98% of customers that purchase tires and auto accessories also get automotive services done. Finding all such rules is valuable for cross-marketing and attached mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. The databases involved in these applications are very large. It is imperative, therefore, to have fast algorithms for this task.
Visiting from the Department of Computer Science, University of Wisconsin, Madison.
The following is a formal statement of the problem [AIS93b]: Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y. Our rules are somewhat more general than in [AIS93b] in that we allow a consequent to have more than one item.
Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively. Our discussion is neutral with respect to the representation of D. For example, D could be a data file, a relational table, or the result of a relational expression.
An algorithm for finding all association rules, henceforth referred to as the AIS algorithm, was presented in [AIS93b]. Another algorithm for this task, called the SETM algorithm, has been proposed in [HS93]. In this paper, we present two new algorithms, Apriori and AprioriTid, that differ fundamentally from these algorithms. We present experimental results, using both synthetic and real-life data, showing that the proposed algorithms always outperform the earlier algorithms. The performance gap is shown to increase with problem size, and ranges from a factor of three for small problems to more than an order of magnitude for large problems. We then discuss how the best features of Apriori and AprioriTid can be combined into a hybrid algorithm, called AprioriHybrid. Experiments show that AprioriHybrid has excellent scale-up properties, opening up the feasibility of mining association rules over very large databases.
The problem of finding association rules falls within the purview of database mining [AIS93a] [ABN92] [HS94] [MKKR92] [S+93] [Tsu90], also called knowledge discovery in databases [HCC92] [Lub89] [PS91b]. Related, but not directly applicable, work includes the induction of classification rules [BFOS84] [Cat91] [FWD93] [HCC92] [Qui93], discovery of causal rules [CH92] [Pea92], learning of logical definitions [MF92] [Qui90], fitting of functions to data [LSBZ87] [Sch90], and clustering [ANB92] [C+88] [Fis87]. The closest work in the machine learning literature is the KID3 algorithm presented in [PS91a]. If used for finding all association rules, this algorithm will make as many passes over the data as the number of combinations of items in the antecedent, which is exponentially large. Related work in the database literature is the work on inferring functional dependencies from data [Bit92] [MR87]. Functional dependencies are rules requiring strict satisfaction. Consequently, having determined a dependency X → A, the algorithms in [Bit92] [MR87]
consider any other dependency of the form X + Y → A redundant and do not generate it. The association rules we consider are probabilistic in nature. The presence of a rule X → A does not necessarily mean that X + Y → A also holds because the latter may not have minimum support. Similarly, the presence of rules X → Y and Y → Z does not necessarily mean that X → Z holds because the latter may not have minimum confidence.
There has been work on quantifying the "usefulness" or "interestingness" of a rule [PS91a]. What is useful or interesting is often application-dependent. The need for a human in the loop and for providing tools to allow human guidance of the rule discovery process has been articulated, for example, in [B+93] [KI91] [Tsu90]. We do not discuss these issues in this paper, except to point out that these are necessary features of a rule discovery system that may use our algorithms as the engine of the discovery process.
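To make the definitions of support and confidence concrete, the following is a small illustrative sketch (not from the paper; the transactions, item names, and thresholds are invented for illustration):

# Illustrative sketch: computing support and confidence of a rule X => Y
# over a toy transaction database. All names and values are invented.
transactions = [
    {"tires", "auto_accessories", "automotive_services"},
    {"tires", "auto_accessories"},
    {"tires", "auto_accessories", "automotive_services"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

X = {"tires", "auto_accessories"}
Y = {"automotive_services"}
print(support(X | Y, transactions))    # 0.5   -> the rule has 50% support
print(confidence(X, Y, transactions))  # 0.666...-> the rule holds with ~67% confidence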
Notation  We assume that items in each transaction are kept sorted in their lexicographic order. It is straightforward to adapt these algorithms to the case where the database D is kept normalized and each database record is a <TID, item> pair, where TID is the identifier of the corresponding transaction. We call the number of items in an itemset its size, and call an itemset of size k a k-itemset. Items within an itemset are kept in lexicographic order. We use the notation c[1], c[2], ..., c[k] to represent a k-itemset c consisting of items c[1], c[2], ..., c[k], where c[1] < c[2] < ... < c[k]. If c = X ∪ Y and Y is an m-itemset, we also call Y an m-extension of X. Associated with each itemset is a count field to store the support for this itemset. The count field is initialized to zero when the itemset is first created. We summarize in Table 1 the notation used in the algorithms. The set C̄k is used by AprioriTid and will be further discussed when we describe this algorithm.

Table 1: Notation

k-itemset   An itemset having k items.
Lk          Set of large k-itemsets (those with minimum support).
            Each member of this set has two fields: i) itemset and ii) support count.
Ck          Set of candidate k-itemsets (potentially large itemsets).
            Each member of this set has two fields: i) itemset and ii) support count.
C̄k          Set of candidate k-itemsets when the TIDs of the generating transactions
            are kept associated with the candidates.
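As an illustration only (not from the paper), the notation above maps naturally onto simple data structures, for example:

# Illustrative sketch: a k-itemset as a sorted tuple, with its count field
# kept in a dictionary that is initialized to zero on first use.
from collections import defaultdict

counts = defaultdict(int)      # count field per itemset
c = (1, 3, 4)                  # a 3-itemset with c[1] < c[2] < c[3]
counts[c] += 1                 # increment the support count for c
L3 = {itemset for itemset, n in counts.items() if n >= 1}   # large itemsets at minsup = 1
print(len(c), counts[c], L3)   # 3 1 {(1, 3, 4)}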
In the join step, we join Lk-1 with Lk-1:

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;

Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then delete c from Ck;
Concurrent to our work, the following two-step candidate generation procedure has been proposed in [MTV94]:

C'k = {X ∪ X' | X, X' ∈ Lk-1, |X ∩ X'| = k - 2}
Ck  = {X ∈ C'k | X contains k members of Lk-1}

These two steps are similar to our join and prune steps respectively. However, in general, step 1 would produce a superset of the candidates produced by our join step. For example, if L2 were {{1 2}, {2 3}}, then step 1 of [MTV94] will generate the candidate {1 2 3}, whereas our join step will not generate any candidate.
Example  Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. After the join step, C4 will be {{1 2 3 4}, {1 3 4 5}}. The prune step will delete the itemset {1 3 4 5} because the itemset {1 4 5} is not in L3. We will then be left with only {1 2 3 4} in C4.

Contrast this candidate generation with the one used in the AIS and SETM algorithms. In pass k of these algorithms (see Section 4 for details), a database transaction t is read and it is determined which of the large itemsets in Lk-1 are present in t. Each of these large itemsets l is then extended with all those large items that are present in t and occur later in the lexicographic ordering than any of the items in l. Continuing with the previous example, consider a transaction {1 2 3 4 5}. In the fourth pass, AIS and SETM will generate two candidates, {1 2 3 4} and {1 2 3 5}, by extending the large itemset {1 2 3}. Similarly, an additional three candidate itemsets will be generated by extending the other large itemsets in L3, leading to a total of five candidates for consideration in the fourth pass. Apriori, on the other hand, generates and counts only one itemset, {1 2 3 4}, because it concludes a priori that the other combinations cannot possibly have minimum support.

Correctness  We need to show that Ck ⊇ Lk. Clearly, any subset of a large itemset must also have minimum support. Hence, if we extended each itemset in Lk-1 with all possible items and then deleted all those whose (k-1)-subsets were not in Lk-1, we would be left with a superset of the itemsets in Lk. The join is equivalent to extending Lk-1 with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)th item is not in Lk-1. The condition p.itemk-1 < q.itemk-1 simply ensures that no duplicates are generated. Thus, after the join step, Ck ⊇ Lk. By similar reasoning, the prune step, where we delete from Ck all itemsets whose (k-1)-subsets are not in Lk-1, also does not delete any itemset that could be in Lk.

Variation: Counting Candidates of Multiple Sizes in One Pass  Rather than counting only candidates of size k in the kth pass, we can also count the candidates C'k+1 generated from Ck, etc. Note that C'k+1 ⊇ Ck+1, since Ck+1 is generated from Lk ⊆ Ck. This variation can pay off in the later passes when the cost of counting and keeping in memory the additional C'k+1 - Ck+1 candidates becomes less than the cost of scanning the database.

Membership Test  The prune step requires testing that all (k-1)-subsets of a newly generated k-candidate-itemset are present in Lk-1. To make this membership test fast, large itemsets are stored in a hash table.
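To make the join and prune steps concrete, here is a minimal Python sketch (not the authors' implementation; itemsets are plain sorted tuples rather than the hash-table and hash-tree structures used in the paper):

def apriori_gen(L_prev):
    """Candidate generation: join L_{k-1} with itself, then prune.
    L_prev: set of (k-1)-itemsets, each a tuple of items in lexicographic order."""
    L_prev = set(L_prev)
    candidates = set()
    # Join step: combine itemsets that agree on the first k-2 items.
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: drop candidates with a (k-1)-subset that is not large.
    pruned = set()
    for c in candidates:
        subsets = (c[:i] + c[i + 1:] for i in range(len(c)))
        if all(s in L_prev for s in subsets):
            pruned.add(c)
    return pruned

# The example from the text: L3 = {123, 124, 134, 135, 234} yields only {1 2 3 4}.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))   # {(1, 2, 3, 4)}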
At the root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t. Similar arguments apply at lower depths. The only additional factor is that, since the items in any itemset are ordered, if we reach the current node by hashing the item i, we only need to consider the items in t that occur after i. If k is the size of a candidate itemset in the hash-tree, we can find in O(k) time whether the itemset is contained in a transaction by using a temporary bitmap. Each bit of the bitmap corresponds to an item. The bitmap is created once for the data structure, and reinitialized for each transaction. This initialization takes O(size(transaction)) time for each transaction.
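A minimal sketch of the bitmap-based containment test just described (the hash-tree traversal itself is omitted, and item identifiers are assumed to be small integers):

def candidates_in_transaction(candidates, transaction, num_items):
    """Find which candidate itemsets are contained in a transaction.
    A bitmap with one entry per item is filled in O(|transaction|) time;
    each candidate of size k is then tested in O(k) time, as in the text."""
    bitmap = [False] * num_items          # in the paper this is created once, not per call
    for item in transaction:
        bitmap[item] = True
    found = [c for c in candidates if all(bitmap[i] for i in c)]
    for item in transaction:              # reinitialize for the next transaction
        bitmap[item] = False
    return found

print(candidates_in_transaction([(1, 3, 5), (1, 4, 5)], [1, 2, 3, 5], num_items=10))
# [(1, 3, 5)]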
2.1.3 Buffer Management
In the candidate generation phase of pass k, we need storage for the large itemsets Lk-1 and the candidate itemsets Ck. In the counting phase, we need storage for Ck and at least one page to buffer the database transactions.
First, assume that Lk-1 fits in memory but that the set of candidates Ck does not. The apriori-gen function is modified to generate as many candidates of Ck as will fit in the buffer, and the database is scanned to count the support of these candidates. Large itemsets resulting from these candidates are written to disk, while those candidates without minimum support are deleted. This procedure is repeated until all of Ck has been counted.
If Lk-1 does not fit in memory either, we externally sort Lk-1. We bring into memory a block of Lk-1 in which the first k-2 items are the same. We now generate candidates using this block. We keep reading blocks of Lk-1 and generating candidates until the memory fills up, and then make a pass over the data. This procedure is repeated until all of Ck has been counted. Unfortunately, we can no longer prune those candidates whose subsets are not in Lk-1, as the whole of Lk-1 is not available in memory.
The number of entries in C̄k may be smaller than the number of transactions in the database, especially for large values of k. In addition, for large values of k, each entry may be smaller than the corresponding transaction because very few candidates may be contained in the transaction. However, for small values of k, each entry may be larger than the corresponding transaction because an entry in C̄k includes all candidate k-itemsets contained in the transaction. We further explore this trade-off in Section 4.
We establish the correctness of the algorithm in Section 2.2.1. In Section 2.2.2, we give the data structures used to implement the algorithm, and we discuss buffer management in Section 2.2.3.
1)  L1 = {large 1-itemsets};
2)  C̄1 = database D;
3)  for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
4)     Ck = apriori-gen(Lk-1);   // New candidates; see Section 2.1.1
5)     C̄k = ∅;
6)     forall entries t ∈ C̄k-1 do begin
7)        // determine candidate itemsets in Ck contained in the transaction with identifier t.TID
          Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
8)        forall candidates c ∈ Ct do
9)           c.count++;
10)       if (Ct ≠ ∅) then C̄k += <t.TID, Ct>;
11)    end
12)    Lk = {c ∈ Ck | c.count ≥ minsup};
13) end
14) Answer = ∪k Lk;
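The following is a small Python sketch (not the authors' code) of the counting loop above; each entry of C̄k-1 is represented as a pair (TID, set of (k-1)-itemsets), candidates are sorted tuples, and the usage line uses the C̄1 entries from the example that follows (Figure 3):

def aprioritid_pass(candidates_k, cbar_prev):
    """One counting pass of AprioriTid (steps 5-12 above, sketched).
    candidates_k: iterable of candidate k-itemsets (sorted tuples), i.e. Ck.
    cbar_prev:    list of (tid, set of (k-1)-itemsets), i.e. C-bar_{k-1}.
    Returns (support counts for Ck, C-bar_k)."""
    counts = {c: 0 for c in candidates_k}
    cbar_k = []
    for tid, itemsets in cbar_prev:
        # Step 7: c is contained in the transaction iff both of the
        # (k-1)-itemsets that generated c appear in t.set-of-itemsets.
        ct = {c for c in counts
              if c[:-1] in itemsets and (c[:-2] + c[-1:]) in itemsets}
        for c in ct:                     # steps 8-9
            counts[c] += 1
        if ct:                           # step 10
            cbar_k.append((tid, ct))
    return counts, cbar_k

cbar_1 = [(100, {(1,), (3,), (4,)}), (200, {(2,), (3,), (5,)}),
          (300, {(1,), (2,), (3,), (5,)}), (400, {(2,), (5,)})]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts, cbar_2 = aprioritid_pass(C2, cbar_1)
print(counts[(1, 3)], counts[(2, 5)])   # 2 3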
Example  Consider the database in Figure 3 and assume that minimum support is 2 transactions. Calling apriori-gen with L1 at step 4 gives the candidate itemsets C2. In steps 6 through 10, we count the support of candidates in C2 by iterating over the entries in C̄1 and generate C̄2. The first entry in C̄1 is { {1}, {3}, {4} }, corresponding to transaction 100. The Ct at step 7 corresponding to this entry t is { {1 3} }, because {1 3} is a member of C2 and both ({1 3} - {1}) and ({1 3} - {3}) are members of t.set-of-itemsets.
Calling apriori-gen with L2 gives C3. Making a pass over the data with C̄2 and C3 generates C̄3. Note that there is no entry in C̄3 for the transactions with TIDs 100 and 400, since they do not contain any of the itemsets in C3. The candidate {2 3 5} in C3 turns out to be large and is the only member of L3. When we generate C4 using L3, it turns out to be empty, and we terminate.
2.2.1 Correctness
Rather than using the database transactions, AprioriTid uses the entries in C̄k to count the support of candidates in Ck. To simplify the proof, we assume that in step 10 of AprioriTid, we always add <t.TID, Ct> to C̄k, rather than adding an entry only when Ct is non-empty.
For correctness, we need to establish that the set Ct generated in step 7 in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID. We say that the set C̄k is complete if ∀t ∈ C̄k, t.set-of-itemsets includes all large k-itemsets contained in the transaction with identifier t.TID. We say that the set C̄k is correct if ∀t ∈ C̄k, t.set-of-itemsets does not include any k-itemset not contained in the transaction with identifier t.TID. The set Lk is correct if it is the same as the set of all large k-itemsets. We say that the set Ct generated in step 7 in the kth pass is correct if it is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

C̄1:
TID   Set-of-Itemsets
100   { {1}, {3}, {4} }
200   { {2}, {3}, {5} }
300   { {1}, {2}, {3}, {5} }
400   { {2}, {5} }

C̄2:
TID   Set-of-Itemsets
100   { {1 3} }
200   { {2 3}, {2 5}, {3 5} }
300   { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400   { {2 5} }

L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

L2:
Itemset   Support
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

L3:
Itemset   Support
{2 3 5}   2

Figure 3: Example
Lemma 1  ∀k > 1, if C̄k-1 is correct and complete and Lk-1 is correct, then the set Ct generated in step 7 in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.
Proof: By simple rewriting, a candidate itemset c = c[1], c[2], ..., c[k] is present in transaction t.TID if and only if both c1 = (c - c[k]) and c2 = (c - c[k-1]) are in transaction t.TID. Since Ck was obtained by calling apriori-gen(Lk-1), all subsets of c ∈ Ck must be large. So, c1 and c2 must be large itemsets. Thus, if c ∈ Ck is contained in transaction t.TID, c1 and c2 must be members of t.set-of-itemsets, since C̄k-1 is complete. Hence c will be a member of Ct. Since C̄k-1 is correct, if c1 (c2) is not present in transaction t.TID then c1 (c2) is not contained in t.set-of-itemsets. Hence, if c ∈ Ck is not contained in transaction t.TID, c will not be a member of Ct. □
Lemma 2  ∀k > 1, if Lk-1 is correct and the set Ct generated in step 7 in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID, then the set C̄k is correct and complete.
Proof: Since the apriori-gen function guarantees that Ck ⊇ Lk, the set Ct includes all large k-itemsets contained in t.TID. These are added in step 10 to C̄k, and hence C̄k is complete. Since Ct only includes itemsets contained in the transaction t.TID, and only itemsets in Ct are added to C̄k, it follows that C̄k is correct. □
Theorem 1  ∀k > 1, the set Ct generated in step 7 in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.
Proof: We first prove by induction on k that the set C̄k is correct and complete and Lk is correct for all k ≥ 1. For k = 1, this is trivially true since C̄1 corresponds to the database D. By definition, L1 is also correct. Assume this holds for k = n. From Lemma 1, the set Ct generated in step 7 in the (n+1)th pass will consist of exactly those itemsets in Cn+1 contained in the transaction with identifier t.TID. Since the apriori-gen function guarantees that Cn+1 ⊇ Ln+1 and Ct is correct, Ln+1 will be correct. From Lemma 2, the set C̄n+1 will be correct and complete. Since C̄k is correct and complete and Lk is correct for all k ≥ 1, the theorem follows directly from Lemma 1. □
in transaction t.TID. For each such candidate ck-1, the extensions field gives Tk, the set of IDs of all the candidate k-itemsets that are extensions of ck-1. For each ck in Tk, the generators field gives the IDs of the two itemsets that generated ck. If these itemsets are present in the entry for t.set-of-itemsets, we can conclude that ck is present in transaction t.TID, and add ck to Ct. (We actually need to store only one of the two generators in the generators field, since we reached ck starting from the ID of the other generator in t. We omitted this optimization in the above description to simplify exposition.)
Given an ID and the data structures above, we can find the associated candidate itemset in constant time. We can also find in constant time whether or not an ID is present in the t.set-of-itemsets field by using a temporary bitmap. Each bit of the bitmap corresponds to an ID in Ck. This bitmap is created once at the beginning of the pass and is reinitialized for each entry t of C̄k.
2.2.3 Buffer Management
In the kth pass, AprioriTid needs memory for Lk-1 and Ck during candidate generation. During the counting phase, it needs memory for Ck-1, Ck, and a page each for C̄k-1 and C̄k. Note that the entries in C̄k-1 are needed sequentially and that the entries in C̄k can be written to disk as they are generated.
At the time of candidate generation, when we join Lk-1 with itself, we fill up roughly half the buffer with candidates. This allows us to keep the relevant portions of both Ck and Ck-1 in memory during the counting phase. In addition, we ensure that all candidates with the same first (k-1) items are generated at the same time. The computation is now effectively partitioned because none of the candidates in memory that turn out to be large at the end of the pass will join with any of the candidates not yet generated to derive potentially large itemsets. Hence we can assume that the candidates in memory are the only candidates in Ck and find all large itemsets that are extensions of candidates in Ck by running the algorithm to completion. This may cause further partitioning of the computation downstream. Having thus run the algorithm to completion, we return to Lk-1, generate some more candidates in Ck, count them, and so on. Note that the prune step of the apriori-gen function cannot be applied after partitioning because we do not know all the large k-itemsets.
When Lk does not fit in memory, we need to externally sort Lk as in the buffer management scheme used for Apriori.
3 Discovering Rules
The association rules that we consider here are somewhat more general than in [AIS93b] in that we allow a consequent to have more than one item; rules in [AIS93b] were limited to single-item consequents. We first give a straightforward generalization of the algorithm in [AIS93b] and then present a faster algorithm.
To generate rules, for every large itemset l, we find all non-empty subsets of l. For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf. We consider all subsets of l to generate rules with multiple consequents. Since the large itemsets are stored in hash tables, the support counts for the subset itemsets can be found efficiently.
We can improve the above procedure by generating the subsets of a large itemset in a recursive depth-first fashion. For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc. Then if a subset a of a large itemset l does not generate a rule, the subsets of a need not be considered for generating rules using l. For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds. We do not miss any rules because the support of any subset ã of a must be as great as the support of a. Therefore, the confidence of the rule ã ⇒ (l - ã) cannot be more than the confidence of a ⇒ (l - a). Hence, if a did not yield a rule involving all the items in l with a as the antecedent, neither will ã. The following algorithm embodies these ideas:
// Simple Algorithm
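The pseudocode of the simple algorithm is not reproduced in this excerpt; as a stand-in, here is a hedged Python sketch of the recursive procedure just described (the function name and the supports in the usage example are ours, not from the paper):

from itertools import combinations

def genrules(l_k, a_m, supports, minconf, rules):
    """Simple recursive rule generation: try every (m-1)-subset a of a_m as an
    antecedent; only recurse below subsets that yielded a confident rule.
    l_k, a_m: frozensets of items; supports: dict mapping frozensets to counts."""
    for a in combinations(sorted(a_m), len(a_m) - 1):
        a = frozenset(a)
        conf = supports[l_k] / supports[a]
        if conf >= minconf:
            rules.append((a, l_k - a, conf))          # rule a => (l_k - a)
            if len(a) > 1:
                genrules(l_k, a, supports, minconf, rules)
    return rules

# Toy usage with the supports of Figure 3 (counts out of 4 transactions):
supports = {frozenset(s): c for s, c in
            [({2, 3, 5}, 2), ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2),
             ({2}, 3), ({3}, 3), ({5}, 3)]}
print(genrules(frozenset({2, 3, 5}), frozenset({2, 3, 5}), supports, 1.0, []))
# [({2, 3} => {5}, conf 1.0), ({3, 5} => {2}, conf 1.0)]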
Recall that if an itemset is large then so are all its subsets. From a large itemset l, therefore, we first generate all rules with one item in the consequent. We then use the consequents of these rules and the function apriori-gen in Section 2.1.1 to generate all possible consequents with two items that can appear in a rule generated from l, etc. An algorithm using this idea is given below. The rules having one-item consequents in step 2 of this algorithm can be found by using a modified version of the preceding genrules function in which steps 8 and 9 are deleted to avoid the recursive call.
// Faster Algorithm
1) forall large k-itemsets lk, k ≥ 2 do begin
2)    H1 = { consequents of rules derived from lk with one item in the consequent };
3)    call ap-genrules(lk, H1);
4) end
procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
   if (k > m + 1) then begin
      Hm+1 = apriori-gen(Hm);
      forall hm+1 ∈ Hm+1 do begin
         conf = support(lk) / support(lk - hm+1);
         if (conf ≥ minconf) then
            output the rule (lk - hm+1) ⇒ hm+1 with confidence = conf and support = support(lk);
         else
            delete hm+1 from Hm+1;
      end
      call ap-genrules(lk, Hm+1);
   end
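For comparison, a sketch of ap-genrules in the same style (again not the authors' code; apriori-gen is specialized here to sets of consequents, and the supports in the usage example are invented):

def apriori_gen_consequents(H_m):
    """Join-and-prune over the m-item consequents, as apriori-gen does over itemsets."""
    H_m = {tuple(sorted(h)) for h in H_m}
    joined = {p + (q[-1],) for p in H_m for q in H_m
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in H_m for i in range(len(c)))}

def ap_genrules(l_k, H_m, supports, minconf, rules):
    """Sketch of ap-genrules. l_k is a frozenset, H_m a set of m-item consequents
    (tuples), supports maps frozensets to support counts."""
    if H_m and len(l_k) > len(next(iter(H_m))) + 1:
        H_next = apriori_gen_consequents(H_m)
        confident = set()
        for h in H_next:
            conf = supports[l_k] / supports[l_k - frozenset(h)]
            if conf >= minconf:
                rules.append((l_k - frozenset(h), frozenset(h), conf))
                confident.add(h)
        ap_genrules(l_k, confident, supports, minconf, rules)
    return rules

supports = {frozenset(s): c for s, c in
            [({2, 3, 5}, 2), ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2),
             ({2}, 3), ({3}, 3), ({5}, 3)]}
H1 = {(2,), (3,), (5,)}   # consequents of the confident one-item-consequent rules
print(ap_genrules(frozenset({2, 3, 5}), H1, supports, 0.6, []))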
As an example of the advantage of this algorithm, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item consequent rules derived from this itemset that have the minimum confidence. If we use the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test if the two-item consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ BA, and ACE ⇒ BD hold. The first of these rules cannot hold, because E ⊂ BE, and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons. The call genrules(ABCDE, ABCE) will test if the rules ABC ⇒ DE, ABE ⇒ DC, BCE ⇒ DA and ACE ⇒ BD hold, and will find that the first three of these rules do not hold. In fact, the only two-item consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents in the valid one-item consequent rules. This is the only rule that will be tested by the faster algorithm.
4 Performance
To assess the relative performance of the algorithms for discovering large itemsets, we performed several experiments on an IBM RS/6000 530H workstation with a CPU clock rate of 33 MHz, 64 MB of main memory, and running AIX 3.2. The data resided in the AIX file system and was stored on a 2 GB SCSI 3.5" drive, with measured sequential throughput of about 2 MB/second.
We first give an overview of the AIS [AIS93b] and SETM [HS93] algorithms against which we compare the performance of the Apriori and AprioriTid algorithms. We then describe the synthetic datasets used in the performance evaluation and show the performance results. Next, we show the performance results for three real-life datasets obtained from a retail and a direct mail company. Finally, we describe how the best performance features of Apriori and AprioriTid can be combined into an AprioriHybrid algorithm and demonstrate its scale-up properties.
Data Structures  The data structures required for maintaining large and candidate itemsets were not specified in [AIS93b]. We store the large itemsets in a dynamic multi-level hash table to make the subset operation in step 5 fast, using the algorithm described in Section 2.1.2. Candidate itemsets are kept in a hash table associated with the respective large itemsets from which they originate in order to make the membership test in step 9 fast.

Buffer Management  When a newly generated candidate itemset causes the buffer to overflow, we discard from memory the corresponding large itemset and all candidate itemsets generated from it. This reclamation procedure is executed as often as necessary during a pass. The large itemsets discarded in a pass are extended in the next pass. This technique is a simplified version of the buffer management scheme presented in [AIS93b].
Buffer Management  The performance of the SETM algorithm critically depends on the size of the set C̄k relative to the size of memory. If C̄k fits in memory, the two sorting steps can be performed using an in-memory sort. In [HS93], C̄k was assumed to fit in main memory and buffer management was not discussed. If C̄k is too large to fit in memory, we write the entries in C̄k to disk in FIFO order when the buffer allocated to the candidate itemsets fills up, as these entries are not required until the end of the pass. However, C̄k now requires two external sorts.
To create a dataset, our synthetic data generation program takes the parameters shown in Table 2.

Table 2: Parameters

|D|   Number of transactions
|T|   Average size of the transactions
|I|   Average size of the maximal potentially large itemsets
|L|   Number of maximal potentially large itemsets
N     Number of items
We first determine the size of the next transaction. The size is picked from a Poisson distribution with mean equal to |T|. Note that if each item is chosen with the same probability p, and there are N items, the expected number of items in a transaction is given by a binomial distribution with parameters N and p, and is approximated by a Poisson distribution with mean Np.
We then assign items to the transaction. Each transaction is assigned a series of potentially large itemsets. If the large itemset on hand does not fit in the transaction, the itemset is put in the transaction anyway in half the cases, and the itemset is moved to the next transaction in the rest of the cases.
Large itemsets are chosen from a set T of such itemsets. The number of itemsets in T is set to |L|. There is an inverse relationship between |L| and the average support for potentially large itemsets. An itemset in T is generated by first picking the size of the itemset from a Poisson distribution with mean equal to |I|. Items in the first itemset are chosen randomly. To model the phenomenon that large itemsets often have common items, some fraction of items in subsequent itemsets are chosen from the previous itemset generated. We use an exponentially distributed random variable with mean equal to the correlation level to decide this fraction for each itemset. The remaining items are picked at random. In the datasets used in the experiments, the correlation level was set to 0.5. We ran some experiments with the correlation level set to 0.25 and 0.75 but did not find much difference in the nature of our performance results.
Each itemset in T has a weight associated with it, which corresponds to the probability that this itemset will be picked. This weight is picked from an exponential distribution with unit mean, and is then normalized so that the sum of the weights for all the itemsets in T is 1. The next itemset to be put in the transaction is chosen from T by tossing an |L|-sided weighted coin, where the weight for a side is the probability of picking the associated itemset.
To model the phenomenon that all the items in a large itemset are not always bought together, we assign each itemset in T a corruption level c. When adding an itemset to a transaction, we keep dropping an item from the itemset as long as a uniformly distributed random number between 0 and 1 is less than c. Thus for an itemset of size l, we will add l items to the transaction 1 - c of the time, l - 1 items c(1 - c) of the time, l - 2 items c²(1 - c) of the time, etc. The corruption level for an itemset is fixed and is obtained from a normal distribution with mean 0.5 and variance 0.1.
We generated datasets by setting N = 1000 and |L| = 2000. We chose 3 values for |T|: 5, 10, and 20. We also chose 3 values for |I|: 2, 4, and 6. The number of transactions was set to 100,000 because, as we will see in Section 4.4, SETM could not be run for larger values. However, for our scale-up experiments, we generated datasets with up to 10 million transactions (838 MB for |T| = 20). Table 3 summarizes the dataset parameter settings. For the same |T| and |D| values, the sizes of the datasets in megabytes were roughly equal for the different values of |I|.

Table 3: Parameter settings (Synthetic datasets)
Name            |T|   |I|   |D|     Size in Megabytes
T5.I2.D100K       5     2   100K    2.4
T10.I2.D100K     10     2   100K    4.4
T10.I4.D100K     10     4   100K    -
T20.I2.D100K     20     2   100K    8.4
T20.I4.D100K     20     4   100K    -
T20.I6.D100K     20     6   100K    -
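For concreteness, the following is a condensed Python sketch of the generator described above. It is a simplification, not the authors' program: it does not carry unfilled itemsets over to the next transaction, it clips the corruption level slightly below 1, and its default parameter values are illustrative only.

import math
import random

def poisson(rng, lam):
    """Sample from a Poisson distribution (Knuth's method; fine for small means)."""
    l, k, p = math.exp(-lam), 0, 1.0
    while p > l:
        k += 1
        p *= rng.random()
    return k - 1

def generate_transactions(n_trans=1000, avg_trans_size=10, avg_itemset_size=4,
                          n_itemsets=200, n_items=1000, corr=0.5, seed=1):
    """Simplified sketch of the synthetic data generator described in the text."""
    rng = random.Random(seed)
    # Pool T of potentially large itemsets: size ~ Poisson(|I|), a fraction of
    # items shared with the previous itemset (fraction ~ exponential, mean = corr),
    # weight ~ exponential(1), corruption level ~ Normal(0.5, variance 0.1).
    itemsets, weights, corruption, prev = [], [], [], []
    for _ in range(n_itemsets):
        size = max(1, poisson(rng, avg_itemset_size))
        n_common = 0
        if prev:
            frac = min(1.0, rng.expovariate(1.0 / corr))
            n_common = min(len(prev), int(round(frac * size)))
        items = set(rng.sample(prev, n_common))
        while len(items) < size:
            items.add(rng.randrange(n_items))
        itemsets.append(tuple(sorted(items)))
        weights.append(rng.expovariate(1.0))
        corruption.append(min(0.99, max(0.0, rng.gauss(0.5, 0.1 ** 0.5))))
        prev = list(itemsets[-1])
    transactions = []
    for _ in range(n_trans):
        size = max(1, poisson(rng, avg_trans_size))     # transaction size ~ Poisson(|T|)
        t = set()
        while len(t) < size:
            i = rng.choices(range(n_itemsets), weights=weights)[0]
            items = list(itemsets[i])
            while items and rng.random() < corruption[i]:   # drop items ("corruption")
                items.pop(rng.randrange(len(items)))
            t.update(items)
        transactions.append(sorted(t))
    return transactions

print(len(generate_transactions(n_trans=5)))   # 5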
[Figure: Execution times (in seconds) of the algorithms (SETM, AIS, AprioriTid, Apriori) as a function of minimum support, for the synthetic datasets T5.I2.D100K, T10.I2.D100K, T10.I4.D100K, T20.I2.D100K, T20.I4.D100K, and T20.I6.D100K.]
Apriori beats AIS for all problem sizes, by factors ranging from 2 for high minimum support to more than an order of magnitude for low levels of support. AIS always did considerably better than SETM. For small problems, AprioriTid did about as well as Apriori, but performance degraded to about twice as slow for large problems.
[Figure: Sizes, per pass, of C̄k (SETM), C̄k (AprioriTid), Ck (AIS, SETM), Ck (Apriori, AprioriTid), and Lk.]
Figure 7: Sizes of the large and candidate sets (T10.I4.D100K, minsup = 0.75%)

The fundamental problem with the SETM algorithm is the size of its C̄k sets. Recall that the size of the set C̄k is given by ∑ (over candidate itemsets c) support-count(c). Thus, the sets C̄k are roughly S times bigger than the corresponding Ck sets, where S is the average support count of the candidate itemsets. Unless the problem size is very small, the C̄k sets have to be written to disk, and externally sorted twice, causing the SETM algorithm to perform poorly [2]. This explains the jump in time for SETM in Table 4 when going from 1.5% support to 1.0% support for datasets with transaction size 10. The largest dataset in the scale-up experiments for SETM in [HS93] was still small enough that C̄k could fit in memory; hence they did not encounter this jump in execution time.
Note that for the same minimum support, the support count for candidate itemsets increases linearly with the number of transactions. Thus, as we increase the number of transactions for the same values of |T| and |I|, though the size of Ck does not change, the size of C̄k goes up linearly. Thus, for datasets with more transactions, the performance gap between SETM and the other
[2] The cost of external sorting in SETM can be reduced somewhat as follows. Before writing out entries in C̄k to disk, we can sort them on itemsets using an internal sorting procedure, and write them as sorted runs. These sorted runs can then be merged to obtain support counts. However, given the poor performance of SETM, we do not expect this optimization to affect the algorithm choice.
algorithms will become even larger.
The problem with AIS is that it generates too many candidates that later turn out to be small, causing it to waste too much effort. Apriori also counts too many small sets in the second pass (recall that C2 is really a cross-product of L1 with L1). However, this wastage decreases dramatically from the third pass onward. Note that for the example in Figure 7, after pass 3, almost every candidate itemset counted by Apriori turns out to be a large set.
AprioriTid also has the problem of SETM that C̄k tends to be large. However, the apriori candidate generation used by AprioriTid generates significantly fewer candidates than the transaction-based candidate generation used by SETM. As a result, the C̄k of AprioriTid has fewer entries than that of SETM. AprioriTid is also able to use a single word (ID) to store a candidate rather than requiring as many words as the number of items in the candidate [3]. In addition, unlike SETM, AprioriTid does not have to sort C̄k. Thus, AprioriTid does not suffer as much as SETM from maintaining C̄k.
AprioriTid has the nice feature that it replaces a pass over the original dataset by a pass over the set C̄k. Hence, AprioriTid is very effective in later passes when the size of C̄k becomes small compared to the size of the database. Thus, we find that AprioriTid beats Apriori when its C̄k sets can fit in memory and the distribution of the large itemsets has a long tail. When C̄k doesn't fit in memory, there is a jump in the execution time for AprioriTid, such as when going from 0.75% to 0.5% for datasets with transaction size 10 in Figure 6. In this region, Apriori starts beating AprioriTid.
Retail Sales Data  The data from the retail chain consists of the sales transactions from one store over a short period of time. A transaction contains the names of the departments from which a customer bought a product in a visit to the store. There are a total of 63 items, representing departments. There are 46,873 transactions with an average size of 2.47. The size of the dataset is very small, only 0.65 MB. Some performance results for this dataset were reported in [HS93]. Figure 8 shows the execution times of the four algorithms [4]. The C̄k sets for both SETM and AprioriTid fit in memory for this dataset. Apriori and AprioriTid are roughly three times as fast as AIS and four times faster than SETM.

[3] For SETM to use IDs, it would have to maintain two additional in-memory data structures: a hash table to find out whether a candidate has been generated previously, and a mapping from the IDs to candidates. However, this would destroy the set-oriented nature of the algorithm. Also, once we have the hash table which gives us the IDs of candidates, we might as well count them at the same time and avoid the two external sorts. We experimented with this variant of SETM and found that, while it did better than SETM, it still performed much worse than Apriori or AprioriTid.
Figure 8: Execution times: Retail sales data

Mail Order Data  A transaction in the first dataset from the mail order company consists of items ordered by a customer in a single mail order. There are a total of 15836 items. The average size of a transaction is 2.62 items and there are a total of 2.9 million transactions. The size of this dataset is 42 MB. A transaction in the second dataset consists of all the items ordered by a customer from the company in all orders together. Again, there are a total of 15836 items, but the average size of a transaction is now 31 items and there are a total of 213,972 transactions. The size of this dataset is 27 MB. We will refer to these datasets as M.order and M.cust respectively.
The execution times for these two datasets are shown in Figures 9 and 10 respectively. For both datasets, AprioriTid is initially comparable to Apriori but becomes up to twice as slow for lower supports. For M.order, Apriori outperforms AIS by a factor of 2 to 6 and beats SETM by a factor of about 15. For M.cust, Apriori beats AIS by a factor of 3 to 30. SETM had to be aborted (after taking 20 times the time Apriori took to complete) because, even for 2% support, the set C̄2 became larger than the disk capacity.

[4] The execution times for SETM in this figure are a little higher compared to those reported in [HS93]. The timings in [HS93] were obtained on an RS/6000 350 processor, whereas our experiments have been run on a slower RS/6000 530H processor. The execution time for 1% support for AIS is lower than that reported in [AIS93b] because of improvements in the data structures for storing large and candidate itemsets.
[Figures 9 and 10: Execution times (in seconds) for the M.order and M.cust datasets as a function of minimum support.]
Figure 11: Per pass execution times of Apriori and AprioriTid (T10.I4.D100K, minsup = 0.75%)

Based on these observations, we can design a hybrid algorithm, which we call AprioriHybrid,
that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄k at the end of the pass will fit in memory. We use the following heuristic to estimate if C̄k would fit in memory in the next pass. At the end of the current pass, we have the counts of the candidates in Ck. From this, we estimate what the size of C̄k would have been if it had been generated. This size, in words, is (∑ (over candidates c ∈ Ck) support(c) + number of transactions). If C̄k in this pass was small enough to fit in memory, and there were fewer large candidates in the current pass than the previous pass, we switch to AprioriTid. The latter condition is added to avoid switching when C̄k in the current pass fits in memory but C̄k in the next pass may not.
Switching from Apriori to AprioriTid does involve a cost. Assume that we decide to switch from Apriori to AprioriTid at the end of the kth pass. In the (k+1)th pass, after finding the candidate itemsets contained in a transaction, we will also have to add their IDs to C̄k+1 (see the description of AprioriTid in Section 2.2). Thus there is an extra cost incurred in this pass relative to just running Apriori. It is only in the (k+2)th pass that we actually start running AprioriTid. Thus, if there are no large (k+1)-itemsets, or no (k+2)-candidates, we will incur the cost of switching without getting any of the savings of using AprioriTid.
Figure 12 shows the performance of AprioriHybrid relative to Apriori and AprioriTid for large datasets. AprioriHybrid performs better than Apriori in almost all cases. For T10.I2.D100K with 1.5% support, AprioriHybrid does a little worse than Apriori since the pass in which the switch occurred was the last pass; AprioriHybrid thus incurred the cost of switching without realizing the benefits. In general, the advantage of AprioriHybrid over Apriori depends on how the size of the C̄k set declines in the later passes. If C̄k remains large until nearly the end and then has an abrupt drop, we will not gain much by using AprioriHybrid since we can use AprioriTid only for a short period of time after the switch. This is what happened with the M.cust and T20.I6.D100K datasets. On the other hand, if there is a gradual decline in the size of C̄k, AprioriTid can be used for a while after the switch, and a significant improvement can be obtained in the execution time.
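A small sketch of this switching heuristic (our interpretation, not the authors' code; buffer_words is an assumed memory budget in words, and the numbers in the usage lines are invented):

def should_switch_to_aprioritid(candidate_supports, n_transactions,
                                n_large_now, n_large_prev, buffer_words):
    """Estimate whether C-bar_k of the pass just completed would have fit in
    memory: estimated size in words = sum of support(c) over candidates c in Ck,
    plus the number of transactions. Switch only if that fits in the buffer and
    the number of large itemsets is falling."""
    estimated_words = sum(candidate_supports.values()) + n_transactions
    return estimated_words <= buffer_words and n_large_now < n_large_prev

supports = {('a', 'b'): 1200, ('a', 'c'): 800, ('b', 'c'): 400}   # invented counts
print(should_switch_to_aprioritid(supports, n_transactions=100_000,
                                  n_large_now=2, n_large_prev=5,
                                  buffer_words=4_000_000))        # True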
[Figure 12: Execution times (in seconds) of AprioriTid, Apriori, and AprioriHybrid as a function of minimum support, for the M.order, M.cust, T10.I2.D100K, T10.I4.D100K, T20.I4.D100K, and T20.I6.D100K datasets.]
with respect to the 1 million transaction dataset in the second. As shown, the execution times scale quite linearly.
[Figure: Number-of-transactions scale-up for the T5.I2, T10.I4, and T20.I6 parameter settings, shown as relative execution time against the number of transactions.]
Figure 14: Number of items scale-up
Figure 15: Transaction size scale-up

Next, we examined how AprioriHybrid scaled up with the number of items. We increased the number of items from 1000 to 10,000 for the three parameter settings T5.I2.D100K, T10.I4.D100K and T20.I6.D100K. All other parameters were the same as for the data in Table 3. We ran experiments at a minimum support of 0.75%, and obtained the results shown in Figure 14. The execution times decreased a little since the average support for an item decreased as we increased the number of items. This resulted in fewer large itemsets and, hence, faster execution times.
Finally, we investigated the scale-up as we increased the average transaction size. The aim of this experiment was to see how our data structures scaled with the transaction size, independent of other factors like the physical database size and the number of large itemsets. We kept the
physical size of the database roughly constant by keeping the product of the average transaction size and the number of transactions constant. The number of transactions ranged from 200,000 for the database with an average transaction size of 5 to 20,000 for the database with an average transaction size of 50. Fixing the minimum support as a percentage would have led to large increases in the number of large itemsets as the transaction size increased, since the probability of an itemset being present in a transaction is roughly proportional to the transaction size. We therefore fixed the minimum support level in terms of the number of transactions. The results are shown in Figure 15. The numbers in the key (e.g. 500) refer to this minimum support. As shown, the execution times increase with the transaction size, but only gradually. The main reason for the increase was that, in spite of setting the minimum support in terms of the number of transactions, the number of large itemsets increased with increasing transaction length. A secondary reason was that finding the candidates present in a transaction took a little more time.
We did not consider the quantities of the items bought in a transaction, which are useful for some applications. Finding such rules needs further work.
The work reported in this paper has been done in the context of the Quest project at the IBM Almaden Research Center. In Quest, we are exploring the various aspects of the database mining problem. Besides the problem of discovering association rules, some other problems that we have looked into include the enhancement of the database capability with classification queries [AGI+92] and similarity queries over time sequences [AFS93]. We believe that database mining is an important new application area for databases, combining commercial interest with intriguing research questions.
Acknowledgment We wish to thank Mike Carey for his insightful comments and suggestions.
References
[ABN92] Tarek M. Anwar, Howard W. Beck, and Shamkant B. Navathe. Knowledge mining by imprecise querying: A classification-based approach. In IEEE 8th Int'l Conf. on Data Engineering, Phoenix, Arizona, February 1992.

[AFS93] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In Proc. of the Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, October 1993. Also in Lecture Notes in Computer Science 730, Springer Verlag, 1993, 69-84.

[AGI+92] Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, and Arun Swami. An interval classifier for database mining applications. In Proc. of the VLDB Conference, pages 560-573, Vancouver, British Columbia, Canada, August 1992.

[AIS93a] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993. Special Issue on Learning and Discovery in Knowledge-Based Databases.

[AIS93b] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

[ANB92] Tarek M. Anwar, Shamkant B. Navathe, and Howard W. Beck. Knowledge mining in databases: A unified approach through conceptual clustering. Technical report, Georgia Institute of Technology, May 1992.

[Ass90] David Shepard Associates. The new direct marketing. Business One Irwin, Illinois, 1990.

[Ass92] Direct Marketing Association. Managing database marketing technology for success, 1992.

[B+93] R. J. Brachman et al. Integrated support for data archeology. In AAAI-93 Workshop on Knowledge Discovery in Databases, July 1993.

[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

[Bit92] D. Bitton. Bridging the gap between database theory and practice, 1992.

[C+88] P. Cheeseman et al. AutoClass: A Bayesian classification system. In 5th Int'l Conf. on Machine Learning. Morgan Kaufman, June 1988.

[Cat91] J. Catlett. Megainduction: A test flight. In 8th Int'l Conf. on Machine Learning, June 1991.

[CH92] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 1992.

[Fis87] Douglas H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 1987.

[FWD93] Usama Fayyad, Nicholas Weir, and S. G. Djorgovski. Skicat: A machine learning system for automated cataloging of large scale sky surveys. In 10th Int'l Conf. on Machine Learning, June 1993.

[HCC92] Jiawei Han, Yandong Cai, and Nick Cercone. Knowledge discovery in databases: An attribute oriented approach. In Proc. of the VLDB Conference, pages 547-559, Vancouver, British Columbia, Canada, 1992.

[HS93] Maurice Houtsma and Arun Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.

[HS94] M. Holsheimer and A. Siebes. Data mining: The search for knowledge in databases. Technical Report CS-R9406, CWI, Netherlands, 1994.

[KI91] Ravi Krishnamurthy and Tomasz Imielinski. Practitioner problems in need of database research: Research directions in knowledge discovery. SIGMOD RECORD, 20(3):76-78, September 1991.

[LSBZ87] P. Langley, H. Simon, G. Bradshaw, and J. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, 1987.

[Lub89] David J. Lubinsky. Discovery from databases: A review of AI and statistical techniques. In IJCAI-89 Workshop on Knowledge Discovery in Databases, pages 204-218, Detroit, August 1989.

[MF92] S. Muggleton and C. Feng. Efficient induction of logic programs. In Steve Muggleton, editor, Inductive Logic Programming. Academic Press, 1992.

[MKKR92] R. S. Michalski, L. Kerschberg, K. A. Kaufman, and J. S. Ribeiro. Mining for knowledge in databases: The INLEN architecture, initial implementation, and first results. Journal of Intelligent Information Systems, 1:85-113, 1992.

[MR87] Heikki Mannila and Kari-Jouku Raiha. Dependency inference. In Proc. of the VLDB Conference, pages 155-158, Brighton, England, 1987.

[MTV94] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, July 1994.

[Pea92] J. Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference, 1992.

[PS91a] G. Piatestsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatestsky-Shapiro, editor, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[PS91b] G. Piatestsky-Shapiro, editor. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[Qui90] J. Ross Quinlan. Learning logical definitions from examples. Machine Learning, 5(3), 1990.

[Qui93] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, 1993.

[S+93] M. Stonebraker et al. The DBMS research at crossroads. In Proc. of the VLDB Conference, Dublin, August 1993.

[Sch90] C. Schaffer. Domain-Independent Function Finding. PhD thesis, Rutgers University, 1990.

[Tsu90] S. Tsur. Data dredging. IEEE Data Engineering Bulletin, 13(4):58-63, December 1990.