Chapter 5 Association Rules FP Tree
Chapter 5 Association Rules FP Tree
Chapter 5 Association Rules FP Tree
In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it suffer from two nontrivial costs:
It may generate a huge number of candidates (for example, if we have 10^4 1-itemset, it may generate more than 10^7 candidata 2-itemset) It may need to scan database many times
Multiple database scans are costly Mining long patterns needs many passes of scanning and generates lots of candidates
Grow long patterns from short ones using local frequent items
Process of FP growth
Sort frequent items in frequency descending order Scan DB again, construct FP-tree
Association Rules
FP Tree
Completeness Preserve complete information for frequent pattern mining Never break a long pattern of any transaction Compactness Reduce irrelevant infoinfrequent items are gone Items in frequency descending order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field) For Connect-4 DB, compression ratio could be over 100
Exercise
A dataset has five transactions, let minsupport=60% and min_confidence=80% Find all frequent itemsets using FP Tree
TID T1 T2 T3 T4 T5
Items_bought M, O, N, K, E, Y D, O, N, K , E, Y M, A, K, E M, U, C, K ,Y C, O, O, K, I ,E
=>
=>
KEO
Run time(sec.)
Divide-and-conquer:
decompose both the mining task and DB according to the frequent patterns obtained so far
Other factors
6,000 of the customer included computer games, while 7,500 include videos, And 4,000 included both computer games and videos
Confidence (Game => Video) = 4,000 / 6,000 = 66% Suppose it pass our minimum support and confidence (30% , 60%, respectively)
However, the truth is : computer games and videos are negatively associated Which means the purchase of one of these items actually decreases the likelihood of purchasing the other. (How to get this conclusion??)
60% of customers buy the game 75% of customers buy the video Therefore, it should have 60% * 75% = 45% of people buy both That equals to 4,500 which is more than 4,000 (the actual value)
The occurrence of itemset A is independent of the occurrence of itemset B if P(AUB) = P(A)P(B) Otherwise, itemset A and B are dependent and correlated as events
If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B If the value is greater than 1, then A and B are positively correlated
uniform support
Level 1 min_sup = 5%
Level 2 min_sup = 5%
Level 2 min_sup = 3%
Some rules may be redundant due to ancestor relationships between items. Example