Chapter 5 Association Rules FP Tree

Chapter 5 Mining Association Rules with FP Tree
Dr. Bernard Chen Ph.D.

University of Central Arkansas Fall 2010
Mining Frequent Itemsets without Candidate Generation
In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it suffers from two nontrivial costs:
It may generate a huge number of candidates (for example, 10^4 frequent 1-itemsets lead to more than 10^7 candidate 2-itemsets)
It may need to scan the database many times
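The candidate count quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes every pair of frequent 1-itemsets becomes a candidate 2-itemset, and the variable names are illustrative.

```python
from math import comb

# Assumption: each pair of frequent 1-itemsets yields one candidate 2-itemset.
n_frequent_1_itemsets = 10**4
n_candidate_2_itemsets = comb(n_frequent_1_itemsets, 2)

print(n_candidate_2_itemsets)  # 49995000, i.e. more than 10^7
```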
Association Rules with Apriori

Minimum support = 2/9
Minimum confidence = 70%
Bottleneck of Frequent-pattern Mining

Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find the frequent itemset i1 i2 ... i100

# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
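As a quick check of the candidate count for a 100-item frequent pattern, the sum of binomial coefficients can be evaluated directly. This is a small arithmetic sketch; the variable name is illustrative.

```python
from math import comb

# Every non-empty subset of the 100 items is a candidate at some level.
total_candidates = sum(comb(100, k) for k in range(1, 101))

# The sum of C(100, k) for k = 1..100 equals 2^100 - 1.
assert total_candidates == 2**100 - 1
print(f"{total_candidates:.2e}")  # 1.27e+30
```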
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items
abc is a frequent pattern

Get all transactions having abc: DB|abc
d is a local frequent item in DB|abc
abcd is a frequent pattern
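The pattern-growth step above can be sketched in Python. The toy database, the threshold, and the helper names `project` and `local_frequent_items` are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter

def project(db, pattern):
    """Return DB|pattern: the transactions containing every item of
    `pattern`, with the pattern's own items removed."""
    p = set(pattern)
    return [set(t) - p for t in db if p <= set(t)]

def local_frequent_items(projected_db, min_count):
    """Count items inside the projected database and keep the frequent ones."""
    counts = Counter(item for t in projected_db for item in t)
    return {item: c for item, c in counts.items() if c >= min_count}

# Hypothetical toy database (not from the slides).
toy_db = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "d", "e"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"b", "c", "e"},
]

projected = project(toy_db, "abc")            # three transactions contain abc
local = local_frequent_items(projected, 2)    # d occurs twice in DB|abc
print(local)
```

With a minimum count of 2, d is the only local frequent item in DB|abc, so abcd is reported as frequent without ever generating it as a candidate.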
Process of FP growth
Scan DB once, find frequent 1-itemset (single item pattern)
Sort frequent items in frequency descending order
Scan DB again, construct FP-tree
Association Rules
Let's have an example

TID   Items
T100  1, 2, 5
T200  2, 4
T300  2, 3
T400  1, 2, 4
T500  1, 3
T600  2, 3
T700  1, 3
T800  1, 2, 3, 5
T900  1, 2, 3
FP Tree
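The two scans can be sketched in Python on the nine example transactions, assuming a minimum support count of 2 (2/9). This is a minimal sketch: the `FPNode` class and `build_fp_tree` function are illustrative names, ties in frequency are broken by item number (an assumption), and the header table and node-links of a full FP-tree are left out.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count each item and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Frequency-descending order; ties broken by item id (assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert each transaction's frequent items in that order.
    root = FPNode()
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, order

transactions = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
    [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
]
root, order = build_fp_tree(transactions, min_count=2)
print(order)  # item order used when inserting transactions
```

Under these assumptions the item order is 2, 1, 3, 4, 5, and the root has two branches: one starting at item 2 with count 7 and one starting at item 1 with count 2.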
Mining the FP tree
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info: infrequent items are gone
Items in frequency descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count field)
For Connect-4 DB, compression ratio could be over 100
Exercise
A dataset has five transactions. Let min_support = 60% and min_confidence = 80%. Find all frequent itemsets using the FP-tree.
TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E
Association Rules with Apriori

Frequent 1-itemsets: K:5 E:4 M:3 O:3 Y:3
=> Candidate 2-itemsets: KE:4 KM:3 KO:3 KY:3 EM:2 EO:3 EY:2 MO:1 MY:2 OY:2
=> Frequent 2-itemsets: KE KM KO KY EO
=> Frequent 3-itemset: KEO
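The level-wise Apriori run on the exercise data can be reproduced with a short Python sketch. It assumes a minimum support count of 3 (60% of 5 transactions); the function names are illustrative, and the join step is written naively rather than with the prefix-based join used in optimized implementations.

```python
from itertools import combinations

# The five exercise transactions; duplicate items (the second O in T5) collapse in a set.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]

def support_count(itemset, db):
    return sum(1 for t in db if itemset <= t)

def apriori(db, min_count):
    """Return {frequent itemset: support count}, found level by level."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items]
    level = [s for s in level if support_count(s, db) >= min_count]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, db)
        k += 1
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        # Test: scan the database to count the surviving candidates.
        level = [c for c in candidates if support_count(c, db) >= min_count]
    return frequent

frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

Under these assumptions the run yields five frequent 1-itemsets, five frequent 2-itemsets, and the single frequent 3-itemset KEO.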
Association Rules with FP Tree

K:5 E:4 M:3 O:3 Y:3
Association Rules with FP Tree

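Y: conditional pattern base {KEMO:1, KEO:1, KM:1}; conditional FP-tree K:3; frequent pattern KY
O: conditional pattern base {KEM:1, KE:2}; conditional FP-tree KE:3; frequent patterns KO, EO, KEO
M: conditional pattern base {KE:2, K:1}; conditional FP-tree K:3; frequent pattern KM
E: conditional pattern base {K:4}; conditional FP-tree K:4; frequent pattern KE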
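The conditional pattern bases for the exercise can be derived with a short Python sketch. Instead of walking an actual FP-tree, it collects the prefix of each frequency-ordered transaction, which gives the same prefix paths for this small data set. The function name and the alphabetical tie-break among M, O and Y are assumptions.

```python
from collections import Counter

transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_count = 3  # 60% of 5 transactions

counts = Counter(i for t in transactions for i in t)
# Frequent items in frequency-descending order; ties broken alphabetically (assumption).
order = sorted((i for i in counts if counts[i] >= min_count),
               key=lambda i: (-counts[i], i))

def conditional_pattern_base(item):
    """Prefix paths that lead to `item`, with the number of transactions sharing each."""
    base = Counter()
    for t in transactions:
        if item in t:
            ordered = [i for i in order if i in t]
            prefix = "".join(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

for item in reversed(order):
    print(item, conditional_pattern_base(item))
```

For example, T4 (M, U, C, K, Y) keeps only K, M, Y after ordering, so it contributes the prefix path KM to the conditional pattern base of Y.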
FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) versus support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime]
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to the frequent patterns obtained so far
leads to focused search of smaller databases

Other factors
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
Strong Association Rules Are Not Necessarily Interesting

Example 5.8 Misleading Strong Association Rule
Of the 10,000 transactions analyzed, the data show that
6,000 of the customer transactions included computer games
7,500 included videos
4,000 included both computer games and videos
Misleading Strong Association Rule
For this example:
Support (Game & Video) = 4,000 / 10,000 =40%
Confidence (Game => Video) = 4,000 / 6,000 ≈ 66.7%
Suppose the rule passes our minimum support and confidence thresholds (30% and 60%, respectively)
However, computer games and videos are in fact negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. (How do we reach this conclusion?)
If the two purchases were independent:

60% of customers buy the game
75% of customers buy the video
Therefore 60% * 75% = 45% of customers would be expected to buy both
That equals 4,500 transactions, which is more than the 4,000 actually observed
From Association Analysis to Correlation Analysis
Lift is a simple correlation measure, defined as follows
The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events
(Here P(A ∪ B) denotes the probability that a transaction contains every item of both A and B)
Lift(A,B) = P(A ∪ B) / (P(A)P(B))
If the value is less than 1, the occurrence of A is negatively correlated with the occurrence of B
If the value is greater than 1, then A and B are positively correlated
If the value equals 1, A and B are independent
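Applying the lift formula to the game and video figures from Example 5.8 takes only a few lines of Python. The variable names are illustrative; the counts are the ones given in the example.

```python
# Counts from Example 5.8.
n_total = 10_000
n_game = 6_000
n_video = 7_500
n_both = 4_000

p_game = n_game / n_total
p_video = n_video / n_total
p_both = n_both / n_total

# Number of joint purchases expected if the two items were independent.
expected_both = p_game * p_video * n_total

# Lift below 1 indicates negative correlation.
lift = p_both / (p_game * p_video)

print(round(expected_both), round(lift, 3))
```

The lift comes out at about 0.89, which is below 1, so the rule Game => Video is misleading even though it passes the support and confidence thresholds.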
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support

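Example hierarchy:
Level 1: Milk [support = 10%]
Level 2: 2% Milk [support = 6%], Skim Milk [support = 4%]
Uniform support: the same min_sup at every level (Level 1 min_sup = 5%)
Reduced support: a lower min_sup at lower levels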
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to ancestor relationships between items.
Example:

milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
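One common way to act on the ancestor relationship is to treat a descendant rule as redundant when its support is close to the value expected from its ancestor. The Python sketch below illustrates that check. The share of milk purchases that are 2% milk is not given in the slides, so the value 0.25 is a hypothetical assumption, as are the function names and the tolerance.

```python
def expected_support(ancestor_support, share_of_ancestor):
    """Support a descendant rule would have if it simply inherited the
    ancestor's behaviour in proportion to its share of the ancestor item."""
    return ancestor_support * share_of_ancestor

def looks_redundant(rule_support, ancestor_support, share_of_ancestor, tolerance=0.005):
    """True when the rule's support is within `tolerance` of the expected value."""
    expected = expected_support(ancestor_support, share_of_ancestor)
    return abs(rule_support - expected) <= tolerance

# Hypothetical assumption: one quarter of milk purchases are 2% milk.
ASSUMED_SHARE_2PCT_MILK = 0.25

redundant = looks_redundant(
    rule_support=0.02,        # 2% milk => wheat bread
    ancestor_support=0.08,    # milk => wheat bread
    share_of_ancestor=ASSUMED_SHARE_2PCT_MILK,
)
print(redundant)
```

Under that assumed share the expected support is 8% * 0.25 = 2%, which matches the observed support, and the two confidences (70% and 72%) are also close, so the second rule adds little beyond its ancestor.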
