Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 6 —

Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Frequent Pattern Mining Important?
Broad applications
Basic Concepts: Frequent Patterns

Tid    Items bought
10     Beer, Nuts, Diaper
20     Beer, Coffee, Diaper
30     Beer, Diaper, Eggs
40     Nuts, Eggs, Milk
50     Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

itemset: a set of one or more items
k-itemset X = {x1, …, xk}
(absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
(relative) support s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
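To make the support definitions concrete, here is a minimal Python sketch (not part of the slides) that counts absolute and relative support over the example transactions above; the helper name and the illustrative minsup of 50% (3 of 5 transactions) are assumptions.

```python
# Minimal sketch (not from the slides): absolute and relative support of
# itemsets over the example transaction table; minsup = 50% is an assumption.
from itertools import combinations

transactions = [
    {"Beer", "Nuts", "Diaper"},                     # Tid 10
    {"Beer", "Coffee", "Diaper"},                   # Tid 20
    {"Beer", "Diaper", "Eggs"},                     # Tid 30
    {"Nuts", "Eggs", "Milk"},                       # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},   # Tid 50
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

min_sup = 0.5                      # relative minsup threshold (illustrative)
n = len(transactions)
items = sorted(set().union(*transactions))

# Enumerate all 1- and 2-itemsets and keep the frequent ones.
for k in (1, 2):
    for cand in combinations(items, k):
        cnt = support_count(set(cand), transactions)
        if cnt / n >= min_sup:
            print(set(cand), "support count:", cnt, "relative support:", cnt / n)
```

With this threshold, the frequent itemsets of size 1 and 2 are {Beer}, {Nuts}, {Diaper}, {Eggs}, and {Beer, Diaper}.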
Basic Concepts: Association Rules

An association rule X → Y holds with support s (the fraction of transactions containing X ∪ Y) and confidence c (the fraction of transactions containing X that also contain Y).
Association rules from the table above (many more exist):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
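A short Python sketch (not from the slides) showing how the two rule measures above are computed from the same example transactions; rule_measures is an illustrative helper, not a library function.

```python
# Minimal sketch: rule support and confidence over the example transactions.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_both / n, n_both / n_ante

print(rule_measures({"Beer"}, {"Diaper"}, transactions))   # (0.6, 1.0)
print(rule_measures({"Diaper"}, {"Beer"}, transactions))   # (0.6, 0.75)
```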
Closed Patterns and Max-Patterns
An itemset X is closed if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X.
An itemset X is a max-pattern (maximal frequent itemset) if X is frequent and there exists no frequent super-itemset Y ⊃ X.
Closed patterns are a lossless compression of the set of frequent patterns (the support of every frequent itemset can be recovered); max-patterns are a more compact but lossy compression.
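A minimal Python sketch (not from the slides) that filters closed and maximal itemsets out of a given collection of frequent itemsets; the frequent dictionary shown is the one obtained from the earlier example table with an absolute minsup of 3.

```python
# Minimal sketch (assumption: frequent itemsets and their support counts are
# already known, e.g. from the example table with an absolute minsup of 3).
frequent = {
    frozenset({"Beer"}): 3,
    frozenset({"Nuts"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Eggs"}): 3,
    frozenset({"Beer", "Diaper"}): 3,
}

def closed_patterns(frequent):
    """X is closed if no frequent proper superset has the same support."""
    return {x: s for x, s in frequent.items()
            if not any(x < y and frequent[y] == s for y in frequent)}

def max_patterns(frequent):
    """X is maximal if no frequent proper superset exists at all."""
    return {x: s for x, s in frequent.items()
            if not any(x < y for y in frequent)}

print(closed_patterns(frequent))  # {Nuts}, {Diaper}, {Eggs}, {Beer, Diaper}
print(max_patterns(frequent))     # {Nuts}, {Eggs}, {Beer, Diaper}
```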
Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
The Apriori Algorithm, worked example (final step)
3rd scan
C3: Itemset {B, C, E}
L3: Itemset {B, C, E}, sup 2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k;  Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
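Below is a compact, runnable Python sketch of this pseudo-code (an illustration under assumed names, not the textbook's reference implementation). Candidate generation self-joins Lk and then applies the Apriori pruning step: any candidate with an infrequent k-subset is dropped before the database scan.

```python
# Minimal runnable sketch of the Apriori pseudo-code above (illustrative only).
from itertools import combinations

def apriori(transactions, min_sup_count):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    Lk = {frozenset({i}) for i in items if support(frozenset({i})) >= min_sup_count}
    all_frequent = {}

    k = 1
    while Lk:
        all_frequent.update({x: support(x) for x in Lk})
        # Candidate generation: self-join Lk to form (k+1)-itemsets ...
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # ... then Apriori pruning: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count the surviving candidates and keep those meeting min support.
        Lk = {c for c in candidates if support(c) >= min_sup_count}
        k += 1
    return all_frequent

# Example: the transaction table from the Basic Concepts slide, minsup count = 3.
db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
for itemset, count in sorted(apriori(db, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```

Run on the five-transaction example with an absolute minsup of 3, this returns the four frequent 1-itemsets plus {Beer, Diaper}.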
Implementation of Apriori

DIC (dynamic itemset counting): reduce the number of scans
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97

[Figure: the itemset lattice over {A, B, C, D} ({}, 1-itemsets, 2-itemsets, 3-itemsets, ABCD) next to a timeline of transaction scans showing when Apriori and DIC start counting 1-, 2-, and 3-itemsets]

Once both A and D are determined frequent, the counting of AD begins.
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
Scalable Frequent Itemset Mining Methods
Construct the FP-tree from a Transaction Database (min_support = 3)

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

Header Table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

[Figure: the resulting FP-tree. Root {}; main branch f:4, c:3, a:3, m:2, p:2; side branches b:1, m:1 under a:3 and b:1 under f:4; a separate branch c:1, b:1, p:1; header-table node links connect the occurrences of each item.]
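A compact Python sketch (illustrative; the FPNode/build_fptree names are assumptions, not the textbook's code) of the three steps above: count item frequencies, reorder each transaction by the F-list, and insert it into a prefix tree while maintaining header-table node links.

```python
# Minimal FP-tree construction sketch (illustrative; names are assumptions).
from collections import defaultdict, Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fptree(transactions, min_sup_count):
    # Step 1: scan the DB once to find the frequent items.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup_count}
    # Step 2: F-list = frequent items in frequency-descending order.
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(flist)}
    # Step 3: scan the DB again; insert each ordered transaction into the tree.
    root = FPNode(None, None)
    header = defaultdict(list)      # item -> list of nodes (node links)
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print("F-list:", flist)                             # ['c', 'f', 'a', 'b', 'm', 'p']
print("node counts for p:", [n.count for n in header["p"]])   # [2, 1]
```

Ties between equally frequent items (f/c and a/b/m/p here) can be broken arbitrarily as long as the order is fixed once; the slide uses f-c-a-b-m-p, while the sketch breaks ties alphabetically.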
Partition Patterns and Databases

Cond. pattern base of “cam”: (f:3)
cam-conditional FP-tree: root {} with the single node f:3
A Special Case: Single Prefix Path in FP-tree

[Figure: a (conditional) FP-tree whose upper part is a single prefix path (…, a3:n3) that then branches into subtrees such as C2:k2 and C3:k3]
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more frequently occurring, the more likely to be shared
Never be larger than the original database
The Frequent Pattern Growth Mining Method

Projected (conditional) databases, e.g.:
am-proj DB    cm-proj DB    …
fc            f
fc            f
fc            f
Performance of FPGrowth in Large Datasets

[Figure: two runtime plots. Left: data set D1 (T25I20D10K), runtime (sec.) of FP-growth vs. Apriori as the support threshold (%) varies from 0 to 3. Right: data set D2 (T25I20D100K), runtime (sec.) of FP-growth vs. TreeProjection as the support threshold (%) varies from 0 to 2.]
Advantages of the Pattern Growth Approach
Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
Leads to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of the entire database
Basic ops: counting local frequent items and building sub FP-trees, no pattern search and matching
A good open-source implementation and refinement of FPGrowth: FPGrowth+ (Grahne and Zhu, FIMI'03)
Further Improvements of Mining Methods
Mining sequential patterns
PrefixSpan (ICDE'01), CloSpan (SDM'03), BIDE (ICDE'04)
Pattern-growth-based clustering
MaPle (Pei, et al., ICDM'03)
Pattern-growth-based classification
Mining frequent and discriminative patterns (Cheng, et al., ICDE'07)
Scalable Frequent Itemset Mining Methods
Visualization of Association Rules: Rule Graph
[Figure: rule-graph visualization of association rules]

Visualization of Association Rules (SGI/MineSet 3.0)
[Figure: association-rule visualization in SGI/MineSet 3.0]
Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Evaluation Methods
Summary
Interestingness Measure: Correlations (Lift)
lift(A, B) = P(A ∪ B) / (P(A) P(B)) = c(A → B) / s(B)
lift > 1: A and B are positively correlated; lift = 1: independent; lift < 1: negatively correlated
Are Lift and χ2 Good Measures of Correlation?
“Buy walnuts → buy milk [1%, 80%]” is misleading if 85% of customers buy milk
Support and confidence are not good indicators of correlation
Over 20 interestingness measures have been proposed (see Tan, Kumar, and Srivastava @ KDD'02)
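To make the point concrete, the following is a small Python sketch (not from the slides) that computes confidence, lift, and the chi-square statistic from a hypothetical 2x2 contingency table consistent with the walnuts/milk example above; the specific counts are assumptions chosen for illustration.

```python
# Illustrative counts consistent with the walnuts/milk example:
# 10,000 customers, 1% buy both, confidence 80%, 85% buy milk.
n_both    = 100     # buy walnuts and milk
n_w_only  = 25      # buy walnuts, not milk
n_m_only  = 8400    # buy milk, not walnuts
n_neither = 1475
n = n_both + n_w_only + n_m_only + n_neither

p_w    = (n_both + n_w_only) / n          # P(walnuts) = 1.25%
p_m    = (n_both + n_m_only) / n          # P(milk)    = 85%
p_both = n_both / n                       # P(walnuts and milk) = 1%

confidence = p_both / p_w                 # 0.80
lift = p_both / (p_w * p_m)               # ~0.94 < 1: negatively correlated

# Pearson chi-square over the 2x2 contingency table.
observed = [[n_both, n_w_only], [n_m_only, n_neither]]
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

print(f"confidence = {confidence:.2f}, lift = {lift:.3f}, chi2 = {chi2:.2f}")
# Despite the 80% confidence, lift < 1: buying walnuts is negatively
# correlated with buying milk in this (assumed) data.
```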
Null-Invariant Measures
A measure is null-invariant if its value is not affected by null transactions (transactions that contain neither A nor B); all_confidence, max_confidence, Kulczynski, and cosine are null-invariant, whereas lift and χ2 are not.
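A small Python sketch (not from the slides) of two of these null-invariant measures, Kulczynski and cosine, computed from support counts; unlike lift and chi-square, the results do not change if transactions containing neither item are added.

```python
# Minimal sketch of two null-invariant measures (illustrative).
import math

def kulczynski(n_both, n_a, n_b):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, from support counts."""
    return 0.5 * (n_both / n_a + n_both / n_b)

def cosine(n_both, n_a, n_b):
    """cosine(A, B) = P(A and B) / sqrt(P(A) P(B)) = n_both / sqrt(n_a * n_b)."""
    return n_both / math.sqrt(n_a * n_b)

# Walnuts/milk counts from the previous sketch; both measures depend only on
# n_both, n_a, n_b, so adding null transactions leaves them unchanged.
n_both, n_walnuts, n_milk = 100, 125, 8500
print(kulczynski(n_both, n_walnuts, n_milk))   # ~0.41
print(cosine(n_both, n_walnuts, n_milk))       # ~0.097
```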
Comparison of Interestingness Measures
Subtle: they disagree
Analysis of DBLP Coauthor Relationships

Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Frequent Itemset Mining Methods
Evaluation Methods
Summary
Summary
Ref: Apriori and Its Improvements
R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. VLDB'94
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms
for discovering association rules. KDD'94
A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association rules in large databases.
VLDB'95
J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based
algorithm for mining association rules. SIGMOD'95
H. Toivonen. Sampling large databases for association rules.
VLDB'96
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket analysis.
SIGMOD'97
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP Mining
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection
algorithm for generation of frequent itemsets. J. Parallel and
Distributed Computing, 2002.
G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent
Itemsets, Proc. FIMI'03
B. Goethals and M. Zaki. An introduction to workshop on frequent
itemset mining implementations. Proc. ICDM’03 Int. Workshop on
Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL,
Nov. 2003
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD’ 00
J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by
Opportunistic Projection. KDD'02
J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed
Patterns without Minimum Support. ICDM'02
Ref: Vertical Format and Row Enumeration Methods
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. DAMI'97.
M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for
Closed Itemset Mining, SDM'02.
C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A
Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.
F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki ,
CARPENTER: Finding Closed Patterns in Long Biological
Datasets. KDD'03.
H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns
from Very High Dimensional Data: A Top-Down Row
Enumeration Approach, SDM'06.
Ref: Mining Correlations and Interesting Rules
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:
Generalizing association rules to correlations. SIGMOD'97.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I.
Verkamo. Finding interesting rules from large sets of discovered
association rules. CIKM'94.
R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and
Measures of Interest. Kluwer Academic, 2001.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable
techniques for mining causal structures. VLDB'98.
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right
Interestingness Measure for Association Patterns. KDD'02.
E. Omiecinski. Alternative Interest Measures for Mining
Associations. TKDE’03.
T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness
Measures in Pattern Mining: A Unified Framework", Data Mining and
Knowledge Discovery, 21(3):371-397, 2010