Data Mining: Concepts and Techniques
(3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 6: Mining Frequent Patterns,
Associations, and Correlations: Basic
Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern
Evaluation Methods

 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
Why Is Freq. Pattern Mining Important?

 Freq. pattern: An intrinsic and important property of
datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia,
time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 itemset: A set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or
number of occurrences of the itemset X
 (relative) support, s, is the fraction of transactions that
contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a
minsup threshold

(Figure: Venn diagram of customers buying beer, customers buying
diaper, and customers buying both)
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X → Y with minimum support and confidence
 support, s, probability that a transaction contains X ∪ Y
 confidence, c, conditional probability that a transaction
having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
 Association rules: (many more!)
 Beer → Diaper (60%, 100%)
 Diaper → Beer (60%, 75%)
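To make the support/confidence arithmetic concrete, here is a minimal Python sketch (not part of the original slides; function names are illustrative) that recomputes the rule statistics above from the five-transaction table:

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, db):
    # Relative support: fraction of transactions containing the itemset
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # Conditional probability that a transaction with lhs also has rhs
    return support(lhs | rhs, db) / support(lhs, db)

beer, diaper = {"Beer"}, {"Diaper"}
print(support(beer | diaper, transactions))    # 0.6  -> 60% support
print(confidence(beer, diaper, transactions))  # 1.0  -> 100% confidence
print(confidence(diaper, beer, transactions))  # 0.75 -> 75% confidence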
Closed Patterns and Max-Patterns

 A long pattern contains a combinatorial number of
sub-patterns, e.g., {a1, …, a100} contains
C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30
sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists
no super-pattern Y ⊃ X with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
 An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X
(proposed by Bayardo @ SIGMOD’98)
 Closed pattern is a lossless compression of freq. patterns
 Reduces the # of patterns and rules
Closed Patterns and Max-Patterns

 Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all patterns?
 All 2^100 − 1 nonempty subsets of {a1, …, a100}—far too
many to enumerate!
Computational Complexity of Frequent Itemset Mining

 How many itemsets may be generated in the worst case?
 The number of frequent itemsets to be generated is
sensitive to the minsup threshold
 When minsup is low, there exist potentially an
exponential number of frequent itemsets
 The worst case: M^N, where M is the # of distinct items
and N is the max length of transactions
Chapter 6: Mining Frequent Patterns,
Associations, and Correlations: Basic
Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern
Evaluation Methods

 Summary
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format


The Downward Closure Property and
Scalable Mining Methods

 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant @VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
 Vertical data format approach (Eclat—Zaki et al. @KDD’97;
CHARM—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: If there is any itemset that is
infrequent, its supersets should not be generated/tested!
(Agrawal & Srikant @VLDB’94; Mannila, et al. @KDD’94)
 Method:
 Initially, scan DB once to get the frequent 1-itemsets
 Generate length-(k+1) candidate itemsets from length-k
frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example

min_sup = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1 × L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B, C, E}
3rd scan → L3: {B, C, E}:2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
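A compact, runnable Python sketch of the loop above (illustrative, not the book's reference code): candidates are generated by self-joining Lk and pruned by the downward closure property, then counted in one DB scan per level.

from itertools import combinations

def gen_candidates(lk, k):
    # Self-join Lk with itself; keep (k+1)-sets all of whose
    # k-subsets are frequent (downward closure pruning)
    items, cands = list(lk), set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            u = items[i] | items[j]
            if len(u) == k + 1 and all(frozenset(s) in lk
                                       for s in combinations(u, k)):
                cands.add(u)
    return cands

def apriori(db, min_sup):
    # db: list of transactions (sets of items); min_sup: absolute count
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    lk = {s for s, c in counts.items() if c >= min_sup}            # L1
    result = {s: counts[s] for s in lk}
    k = 1
    while lk:
        cands = gen_candidates(lk, k)
        cand_counts = {c: sum(c <= t for t in db) for c in cands}  # one scan
        lk = {c for c, n in cand_counts.items() if n >= min_sup}   # Lk+1
        result.update({c: cand_counts[c] for c in lk})
        k += 1
    return result

# The TDB example above: ends with L3 = {B, C, E} at support 2
tdb = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(apriori(tdb, min_sup=2))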
Implementation of Apriori

 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of candidate generation
 L3 = {abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
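The join/prune steps can be checked directly on this example; the standalone Python sketch below (names illustrative) reproduces it:

from itertools import combinations

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

# Step 1: self-join -- pairs of L3 itemsets agreeing on their
# first k-1 = 2 items in sorted order
joined = set()
for p in L3:
    for q in L3:
        sp, sq = sorted(p), sorted(q)
        if sp[:-1] == sq[:-1] and sp[-1] < sq[-1]:
            joined.add(p | q)
print(sorted("".join(sorted(c)) for c in joined))   # ['abcd', 'acde']

# Step 2: prune -- drop candidates with an infrequent 3-subset
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(c, 3))}
print(sorted("".join(sorted(c)) for c in C4))       # ['abcd'] (ade not in L3)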
How to Count Supports of Candidates?

 Why is counting supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
and counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained
in a transaction
Candidate Generation: An SQL Implementation

 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and table
functions for efficient implementation [S. Sarawagi, S. Thomas,
and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications.
SIGMOD’98]
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format


Further Improvement of the Apriori Method

 Major computational challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice

 Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent patterns
 Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95

DB = DB1 ∪ DB2 ∪ … ∪ DBk: if sup_j(i) < σ·|DBj| in every
partition DBj, then sup(i) < σ·|DB|, so itemset i cannot be
globally frequent (the contrapositive of the first bullet)
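A two-scan sketch of the partition idea in Python (illustrative; the local miner below is deliberately brute-force, whereas Savasere et al. run a full frequent-itemset algorithm per partition):

from itertools import chain, combinations

def all_itemsets(t):
    # All non-empty subsets of one transaction (fine for tiny demo data)
    t = sorted(t)
    return chain.from_iterable(combinations(t, r) for r in range(1, len(t) + 1))

def local_frequent(part, min_count):
    # Brute-force local mining within one partition
    counts = {}
    for t in part:
        for s in all_itemsets(t):
            counts[frozenset(s)] = counts.get(frozenset(s), 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(db, sigma, n_parts=2):
    # sigma: relative minsup, e.g. 0.5
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: every globally frequent itemset is locally frequent somewhere,
    # so the union of local results is a complete candidate set
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, sigma * len(part))
    # Scan 2: one global pass over DB verifies the candidates
    counts = {c: sum(c <= frozenset(t) for t in db) for c in candidates}
    return {c: n for c, n in counts.items() if n >= sigma * len(db)}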
DHP: Reduce the Number of Candidates

 A k-itemset whose corresponding hashing bucket count
is below the threshold cannot be frequent
 Candidates: a, b, c, d, e
 Hash entries (bucket → count):
   {ab, ad, ae} → 35
   {bd, be, de} → 88
   …
   {yz, qs, wt} → 102
 Frequent 1-itemsets: a, b, d, e
 ab is not a candidate 2-itemset if the sum of the counts
of {ab, ad, ae} is below the support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based
algorithm for mining association rules. SIGMOD’95
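A Python sketch of the DHP bucket-counting trick (the hash function and bucket count are illustrative choices):

from itertools import combinations

def dhp_prune_pairs(db, min_sup, n_buckets=7):
    # While counting 1-itemsets, hash every 2-itemset of each transaction
    # into a bucket; a pair can only become a candidate if its bucket's
    # total count reaches min_sup
    item_counts, buckets = {}, [0] * n_buckets
    for t in db:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    f1 = sorted(i for i, c in item_counts.items() if c >= min_sup)
    # Candidate 2-itemsets: both items frequent AND bucket count passes
    return {frozenset(p) for p in combinations(f1, 2)
            if buckets[hash(p) % n_buckets] >= min_sup}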
Sampling for Frequent Patterns

 Select a sample of the original database; mine frequent
patterns within the sample using Apriori
 Scan the database once to verify the frequent itemsets
found in the sample; only the borders of the closure of
frequent patterns are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan the database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association
rules. VLDB’96
DIC: Reduce Number of Scans

 Once both A and D are determined frequent, the
counting of AD can begin
 Once all length-2 subsets of BCD are determined
frequent, the counting of BCD can begin
 Unlike Apriori’s strict level-by-level passes, DIC starts
counting longer itemsets partway through a scan
 (Figure: the itemset lattice from {} up to ABCD, and a
timeline of when Apriori vs. DIC begin counting 1-, 2-,
and 3-itemsets)
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic
itemset counting and implication rules for market basket
data. SIGMOD’97
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format


Frequent Patterns Without Candidate Generation

 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones
using local frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project DB on
abc: DB|abc
 “abcd” is a frequent pattern iff “d” is a frequent item
in DB|abc
Construct FP-tree from a Transaction Database

min_support = 3

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3
(Figure: FP-tree rooted at {}; main path f:4 → c:3 → a:3 with
branches m:2 → p:2 and b:1 → m:1; plus f:4 → b:1 and a second
root branch c:1 → b:1 → p:1)
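The two scans can be written down directly; this is a minimal FP-tree construction sketch in Python (class and function names are illustrative), returning the root and the header table's node-links:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(db, min_sup):
    # Scan 1: frequent 1-itemsets
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    # Any fixed frequency-descending order works; ties are broken
    # lexicographically here, so it may differ from the slide's F-list
    order = {i: r for r, i in enumerate(
        sorted(freq, key=lambda i: (-freq[i], i)))}
    root, header = FPNode(None, None), defaultdict(list)
    # Scan 2: insert each transaction's ordered frequent items as a path
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
      set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, min_sup=3)
# Node-link counts per item, e.g. {'c': [4], 'f': [3, 1], 'a': [3], ...}
print({i: [n.count for n in nodes] for i, nodes in header.items()})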
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets
according to the f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a, b, m, or p
 Pattern f
 Completeness and non-redundancy
Find Patterns Having p From p’s Conditional Database

 Start at the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the node-links of each
frequent item p
 Accumulate all of the transformed prefix paths of item p
to form p’s conditional pattern base

Conditional pattern bases (from the FP-tree above):
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
From Conditional Pattern-Bases to Conditional FP-trees

 For each pattern base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the
pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
(b is dropped: its count in the base, 1, is below min_support)
All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

m-conditional FP-tree: {} → f:3 → c:3 → a:3

 Cond. pattern base of “am”: (fc:3)
am-conditional FP-tree: {} → f:3 → c:3
 Cond. pattern base of “cm”: (f:3)
cm-conditional FP-tree: {} → f:3
 Cond. pattern base of “cam”: (f:3)
cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree

 Suppose a (conditional) FP-tree T has a shared single
prefix path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one node
 Concatenation of the mining results of the two parts

(Figure: a tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3
branches below into subtrees b1:m1 → C1:k1 and C2:k2, C3:k3 is
decomposed into the single path ending at r1 plus the multi-branch
part rooted at r1)
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent
pattern mining
 Never break a long pattern of any
transaction
 Compactness
 Reduce irrelevant info—infrequent items are
gone
 Items in frequency descending order: the
more frequently occurring, the more likely
to be shared
 Never be larger than the original database
The Frequent Pattern Growth Mining
Method

 Idea: Frequent pattern growth


 Recursively grow frequent patterns by
pattern and database partition
 Method
 For each frequent item, construct its
conditional pattern-base, and then its
conditional FP-tree
 Repeat the process on each newly created
conditional FP-tree
 Until the resulting FP-tree is empty, or it
contains only one path—single path will
generate all the combinations of its sub-
paths, each of which is a frequent pattern
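A deliberately simplified pattern-growth sketch in Python (illustrative; it recurses on plain projected transaction lists rather than a compressed FP-tree, but follows the same divide-and-conquer recursion described above):

def pattern_growth(db, min_sup, suffix=()):
    # Count items in the (conditional) database
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    result = {}
    # A fixed item order guarantees each pattern is generated exactly once
    for item in sorted(i for i, c in counts.items() if c >= min_sup):
        pattern = (*suffix, item)
        result[frozenset(pattern)] = counts[item]
        # item's conditional DB: its transactions, restricted to
        # items after `item` in the fixed order
        proj = [[x for x in t if x > item] for t in db if item in t]
        result.update(pattern_growth(proj, min_sup, pattern))
    return result

db = [set("fcamp"), set("fcabm"), set("fb"), set("cbp"), set("fcamp")]
print(pattern_growth(db, min_sup=3))
# includes {'f'}:4, {'c'}:4, {'c','f'}:3, {'a','c','f','m'}:3, ...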
Scaling FP-growth by Database
Projection
 What about if FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected
DB
 Parallel projection vs. partition projection techniques
 Parallel projection
 Project the DB in parallel for each frequent
item
 Parallel projection is space costly
 All the partitions can be processed in parallel
 Partition projection
 Partition the DB based on the ordered frequent
items
Partition-Based Projection

 Parallel projection needs a lot of disk space
 Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …

am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
Performance of FPGrowth in Large Datasets

(Figure: two runtime plots, runtime (sec.) vs. support threshold (%).
Left: D1 FP-growth runtime vs. D1 Apriori runtime on data set
T25I20D10K; right: D2 FP-growth vs. D2 TreeProjection on data set
T25I20D100K. In both, FP-growth's runtime grows far more slowly as
the support threshold decreases.)
Advantages of the Pattern Growth
Approach

 Divide-and-conquer:
 Decompose both the mining task and DB
according to the frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building
sub FP-tree, no pattern search and matching
 A good open-source implementation and refinement of
FPGrowth
 FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining
Methods

 AFOPT (Liu, et al. @ KDD’03)


 A “push-right” method for mining condensed
frequent pattern (CFP) tree
 Carpenter (Pan, et al. @ KDD’03)
 Mine data sets with small rows but numerous
columns
 Construct a row-enumeration tree for efficient
mining
 FPgrowth+ (Grahne and Zhu, FIMI’03)
 Efficiently Using Prefix-Trees in Mining Frequent
Itemsets, Proc. ICDM'03 Int. Workshop on Frequent
Itemset Mining Implementations (FIMI’03)
Extension of Pattern Growth Mining Methodology

 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, FIMI’03)
 Mining sequential patterns
 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
 Mining graph patterns
 gSpan (ICDM’02), CloseGraph (KDD’03)
 Constraint-based mining of frequent patterns
 Convertible constraints (ICDE’01), gPrune (PAKDD’03)
 Computing iceberg data cubes with complex measures
 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
 Pattern-growth-based clustering
 MaPle (Pei, et al., ICDM’03)
 Pattern-growth-based classification
 Mining frequent and discriminative patterns (Cheng, et al. @ ICDE’07)
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format


ECLAT: Mining by Exploring Vertical Data Format

 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of transaction ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): a transaction having X always has Y
 Using diffsets to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset(XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining closed patterns using vertical format: CHARM
(Zaki & Hsiao @SDM’02)
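A Python sketch of Eclat's depth-first tid-list intersection scheme (names and tids below are illustrative): the tid-list of an extended itemset is simply the intersection of the two tid-lists being joined.

def eclat(tidlists, min_sup, prefix=()):
    # tidlists: {item: set of tids}; returns {itemset_tuple: support}
    result = {}
    items = sorted(i for i, tids in tidlists.items() if len(tids) >= min_sup)
    for i, item in enumerate(items):
        pattern = (*prefix, item)
        result[pattern] = len(tidlists[item])
        # Extend with the items after `item`; intersect the tid-lists
        ext = {j: tidlists[item] & tidlists[j] for j in items[i + 1:]}
        result.update(eclat(ext, min_sup, pattern))
    return result

# Vertical layout of the earlier five-transaction example
vertical = {
    "Beer": {10, 20, 30}, "Nuts": {10, 40, 50}, "Diaper": {10, 20, 30, 50},
    "Coffee": {20, 50}, "Eggs": {30, 40, 50}, "Milk": {40, 50},
}
print(eclat(vertical, min_sup=3))
# includes ('Beer', 'Diaper'): 3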
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format


Mining Frequent Closed Patterns: CLOSET

Min_sup = 2
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

 F-list: list of all frequent items in support-ascending order
 F-list: d-a-f-e-c
 Divide search space
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed patterns recursively
 Every transaction having d also has c, f, a → cfad is a
frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm
for Mining Frequent Closed Itemsets”, DMKD’00
CLOSET+: Mining Closed Itemsets by Pattern-Growth

 Itemset merging: if Y appears in every occurrence of X,
then Y is merged with X
 Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then
X and all of X’s descendants in the set enumeration tree
can be pruned
 Hybrid tree projection
 Bottom-up physical tree-projection
 Top-down pseudo tree-projection
 Item skipping: if a local frequent item has the same
support in several header tables at different levels, one
can prune it from the header table at higher levels
MaxMiner: Mining Max-Patterns

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

 1st scan: find frequent items: A, B, C, D, E
 2nd scan: find support for
 AB, AC, AD, AE, ABCDE
 BC, BD, BE, BCDE        ← potential max-patterns
 CD, CE, CDE, DE
 Since BCDE is a max-pattern, there is no need to check
BCD, BDE, CDE in a later scan
 R. Bayardo. Efficiently mining long patterns from
databases. SIGMOD’98
CHARM: Mining by Exploring Vertical Data Format

 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of transaction ids containing an itemset
 Deriving closed patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): a transaction having X always has Y
 Using diffsets to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3} → Diffset(XY, X) = {T2}
Visualization of Association Rules: Plane Graph

(figure)

Visualization of Association Rules: Rule Graph

(figure)

Visualization of Association Rules (SGI/MineSet 3.0)

(figure)
Chapter 6: Mining Frequent Patterns,
Associations, and Correlations: Basic
Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern
Evaluation Methods

 Summary
Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more
accurate, although with lower support and confidence
 Measure of dependent/correlated events:
lift = P(A ∪ B) / (P(A) P(B))
(where P(A ∪ B), in the book’s notation, is the probability
that a transaction contains both A and B)

            Basketball   Not basketball   Sum (row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum (col.)  3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
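Checking the slide's arithmetic with a short Python sketch (the numbers come from the contingency table above):

n = 5000
basketball, cereal, both = 3000, 3750, 2000

def lift(p_ab, p_a, p_b):
    # lift = P(A and B) / (P(A) * P(B)); 1 means independence
    return p_ab / (p_a * p_b)

print(lift(both / n, basketball / n, cereal / n))          # ~0.89
print(lift((basketball - both) / n, basketball / n,
           (n - cereal) / n))                              # ~1.33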
Are Lift and χ² Good Measures of Correlation?

 “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if
85% of customers buy milk
 Support and confidence are not good indicators of
correlation
 Over 20 interestingness measures have been proposed
(see Tan, Kumar, Srivastava @KDD’02)
Null-Invariant Measures

(Figure: table of the five null-invariant measures and their definitions)
Comparison of Interestingness Measures

 Null-(transaction) invariance is crucial for correlation analysis
 Lift and χ² are not null-invariant
 5 null-invariant measures

            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum (col.)  m       ~m        Σ

Null-transactions w.r.t. m and c. The Kulczynski measure (1927) is
null-invariant. Subtle: the measures disagree.
Analysis of DBLP Coauthor Relationships

Recent DB conferences, removing balanced associations, low support, etc.

 Advisor-advisee relation: Kulc: high, coherence: low, cosine: middle
 Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large
Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf.
Principles and Practice of Knowledge Discovery in Databases
(PKDD’07), Sept. 2007
Which Null-Invariant Measure Is Better?

 IR (Imbalance Ratio): measures the imbalance of two
itemsets A and B in rule implications
 Kulczynski and Imbalance Ratio (IR) together present a
clear picture for all three datasets D4 through D6
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral
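For reference (the formula graphic did not survive conversion), the two measures as defined in the textbook are:

Kulc(A, B) = ½ · (P(A|B) + P(B|A))
IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))

where sup(A ∪ B) is the support of A and B occurring together (the book's notation). Kulc is the average of the two rule confidences; IR is 0 for perfectly balanced itemsets and approaches 1 as A and B become very imbalanced.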
Chapter 6: Mining Frequent Patterns,
Associations, and Correlations: Basic
Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern
Evaluation Methods

 Summary
Summary

 Basic concepts: association rules, support-confidence
framework, closed and max-patterns
 Scalable frequent pattern mining methods
 Apriori (candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
Ref: Basic Concepts of Frequent
Pattern Mining

 (Association Rules) R. Agrawal, T. Imielinski, and A. Swami.


Mining association rules between sets of items in large
databases. SIGMOD'93
 (Max-pattern) R. J. Bayardo. Efficiently mining long patterns
from databases. SIGMOD'98
 (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L.
Lakhal. Discovering frequent closed itemsets for association
rules. ICDT'99
 (Sequential pattern) R. Agrawal and R. Srikant. Mining
sequential patterns. ICDE'95

Ref: Apriori and Its Improvements
 R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. VLDB'94
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms
for discovering association rules. KDD'94
 A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association rules in large databases.
VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based
algorithm for mining association rules. SIGMOD'95
 H. Toivonen. Sampling large databases for association rules.
VLDB'96
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket analysis.
SIGMOD'97
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association
rule mining with relational database systems: Alternatives and
implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP
Mining
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection
algorithm for generation of frequent itemsets. J. Parallel and
Distributed Computing, 2002.
 G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent
Itemsets, Proc. FIMI'03
 B. Goethals and M. Zaki. An introduction to workshop on frequent
itemset mining implementations. Proc. ICDM’03 Int. Workshop on
Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL,
Nov. 2003
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD’ 00
 J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by
Opportunistic Projection. KDD'02
 J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed
Patterns without Minimum Support. ICDM'02
Ref: Vertical Format and Row
Enumeration Methods
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel
algorithm for discovery of association rules. DAMI:97.
 M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for
Closed Itemset Mining, SDM'02.
 C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A
Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.
 F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki ,
CARPENTER: Finding Closed Patterns in Long Biological
Datasets. KDD'03.
 H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns
from Very High Dimensional Data: A Top-Down Row
Enumeration Approach, SDM'06.
Ref: Mining Correlations and
Interesting Rules
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:
Generalizing association rules to correlations. SIGMOD'97.
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I.
Verkamo. Finding interesting rules from large sets of discovered
association rules. CIKM'94.
 R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and
Measures of Interest. Kluwer Academic, 2001.
 C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable
techniques for mining causal structures. VLDB'98.
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right
Interestingness Measure for Association Patterns. KDD'02.
 E. Omiecinski. Alternative Interest Measures for Mining
Associations. TKDE’03.
 T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness
Measures in Pattern Mining: A Unified Framework", Data Mining and
Knowledge Discovery, 21(3):371-397, 2010
