Lesson 2.2 - Frequent Pattern Analysis

Download as pdf or txt
Download as pdf or txt
You are on page 1of 54

Frequent Pattern Analysis

CS 822 Data Mining

Anis ur Rahman
Department of Computing
NUST-SEECS
Islamabad

September 24, 2018

1 / 54
Mining Frequent Patterns, Associations and Correlations

Road Map

Basic Concepts
Apriori Algorithm
Optimization Techniques
FP-Growth

2 / 54
Mining Frequent Patterns, Associations and Correlations

Frequent Pattern Analysis

Frequent Pattern. a pattern (a set of items, subsequences,


substructures, etc.) that occurs frequently in a dataset

Goal. Finding inherent regularities in data


What products were often purchased together? Juice and diapers?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify Web documents

Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.

3 / 54
Mining Frequent Patterns, Associations and Correlations

Frequent Patterns

Itemset. A set of one or more items


K-itemset X = {x1 , · · · , xk }

(absolute) support, or, support count of X: Frequency or


occurrence of an itemset X
(relative) support, s, is the fraction of transactions that contains
X (i.e., the probability that a transaction contains X)

An itemset X is frequent if X’s support is no less than a minsup


threshold.

4 / 54
Mining Frequent Patterns, Associations and Correlations

Association Rules

Q. Find all the rules X → Y with minimum support and confidence


threshold
Support, s, probability that a transaction contains X ∪ Y
Confidence, c, conditional probability that a transaction having
X also contains Y

5 / 54
Mining Frequent Patterns, Associations and Correlations

Association Rules

Q. Find all the rules X → Y with minimum support and confidence


threshold
Tid Items bought
10 Juice, Nuts, Diaper
20 Juice, Coffee, Diaper
30 Juice, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk

Let minsup = 50%, minconf = 50%


Freq. Pat. Juice:3, Nuts:3, Diaper:4, Eggs:3, {Juice, Diaper}:3
Association rules. (many more!)
Juice → Diaper (60%, 100%)
Diaper → Juice (60%, 75%)

Rules that satisfy both minsup and minconf are called strong rules.
6 / 54
Mining Frequent Patterns, Associations and Correlations

Closed Patterns and Max Patterns

A long pattern contains a combinatorial number of sub-patterns,


e.g., {a1 , · · · , a100 } contains = 2100 − 1 = 1.27 ∗ 1030 sub-patterns!
Solution. Mine closed patterns and max-patterns instead

An itemset X is closed if X is frequent and there exists no


super-pattern Y ⊃ X , with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no
frequent super-pattern Y ⊃ X

Closed pattern. is a lossless compression of freq. patterns


Reducing the number of patterns and rules

7 / 54
Mining Frequent Patterns, Associations and Correlations

Closed Patterns and Max Patterns

DB = {< a1 , · · · , a100 >, < a1 , · · · , a50 >}


Min_sup = 1

1 What is the set of closed itemset?


< a1 , · · · , a100 >: 1
< a1 , · · · , a50 >: 2
2 What is the set of max-pattern?
< a1 , · · · , a100 >: 1
3 What is the set of all patterns?
!!

8 / 54
Mining Frequent Patterns, Associations and Correlations

Computational Complexity

Q. How many itemsets are potentially to be generated in the worst


case?
The number of frequent itemsets to be generated is sensitive to
the minsup threshold
When minsup is low, there exist potentially an exponential number
of frequent itemsets

The worst case: MN where M: # distinct items, and N : max length of


transactions

9 / 54
Mining Frequent Patterns, Associations and Correlations

Concepts and Principles

The downward closure property of frequent patterns


Any subset of a frequent itemset must be frequent
If {Juice, Diaper, Nuts} is frequent, so is {Juice, Diaper}
i.e., every transaction having {Juice, Diaper, Nuts} also contains
{Juice, Diaper}

Apriori pruning principle


If there is any itemset which is infrequent, its superset should not be
generated/tested.

10 / 54
Mining Frequent Patterns, Associations and Correlations

Road Map

1 Basic Concepts
2 Apriori Algorithm
3 Optimization Techniques
4 FP-Growth

11 / 54
Mining Frequent Patterns, Associations and Correlations

Apriori: Method

1 Initially, scan DB once to get frequent 1-itemset


2 Generate length (k + 1) candidate itemsets from length k frequent
itemsets
3 Test the candidates against DB
4 Terminate when no frequent or candidate set can be generated

12 / 54
Mining Frequent Patterns, Associations and Correlations

Apriori Example

Database
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

Supmin = 2
1st scan

Itemset
Itemset sup
Itemset sup
{A, B}
{A} 2
{A} 2 {A, C}
{B} 3 ⇒ {B} 3 ⇒ {A, E}
{C} 3
{C} 3 {B, C}
{D} 1
{E} 3 {B, E}
{E} 3
{C, E}

13 / 54
Mining Frequent Patterns, Associations and Correlations

Apriori Example

Database
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

Supmin = 2
2nd scan

Itemset sup
Itemset sup
{A, B} 1
{A, C} 2 {A, C} 2 Itemset
{A, E} 1 ⇒ {B, C} 2 ⇒
{B, C, E}
{B, C} 2 {B, E} 3
{B, E} 3 {C, E} 2
{C, E} 2

14 / 54
Mining Frequent Patterns, Associations and Correlations

Apriori Example

Database

Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

Supmin = 2
3rd scan

Itemset sup
{B, C, E} 2

15 / 54
Mining Frequent Patterns, Associations and Correlations

Apriori Algorithm

Ck : Candidate itemset of size k


Lk : frequent itemset of size k
L1 = frequent items;
for (k = 1; Lk ! = φ; k + +) do begin
Ck +1 = candidates generated from Lk ;
for each transaction t in database do
increment the count of all candidates in Ck +1 that are contained in t
Lk +1 = candidates in Ck +1 with min_support
end
S
return k Lk ;

16 / 54
Mining Frequent Patterns, Associations and Correlations

Candidate Generation

How to generate candidates?


Step 1. self-joining Lk
Step 2. pruning
Join Lk p with Lk q, as follows:
insert into Ck +1
select {p.itemi }{1,··· ,k −1} , p.itemk , q.itemk
from Lk p, Lk q
where {p.itemi }{1,··· ,k −1} = {q.itemi }{1,··· ,k −1} and p.itemk < q.itemk

17 / 54
Mining Frequent Patterns, Associations and Correlations

Example of Candidate Generation

Suppose we have the following frequent 3-itemsets and we would like to


generate the 4-itemsets candidates
L3 = {{I1 , I2 , I3 }, {I1 , I2 , I4 }, {I1 , I3 , I4 }, {I1 , I3 , I5 }, {I2 , I3 , I4 }}
Self-joining: L3 ∗ L3 gives:
{I1 , I2 , I3 , I4 } from {I1 , I2 , I3 } , {I1 , I2 , I4 }, and {I2 , I3 , I4 }
{I1 , I3 , I4 , I5 } from {I1 , I3 , I4 } and {I1 , I3 , I5 }
L4 = {I1 , I2 , I3 , I5 }
Pruning: {I1 , I3 , I4 , I5 } is removed because {I1 , I4 , I5 } is not in L3
L4 = {I1 , I2 , I3 , I5 }

18 / 54
Mining Frequent Patterns, Associations and Correlations

Generating Association Rules

Once the frequent itemsets have been found, it is straightforward


to generate strong association rules that satisfy:
minimum support
minimum confidence
Relation between Support and Confidence
support_count(A ∪ B )
Confidence(A ⇒ B ) = P (B |A ) =
support_count(A )

Support_count(A ∪ B ) is the number of transactions containing the


itemsets A ∪ B
Support_count(A) is the number of transactions containing the
itemset A.

19 / 54
Mining Frequent Patterns, Associations and Correlations

Generating Association Rules

For each frequent itemset L , generate all non empty subsets of L


For every non empty subset S of L , output the rule:

S ⇒ (L − S )

If (support_count(L)/support_count(S)) ≥ min_conf

20 / 54
Mining Frequent Patterns, Associations and Correlations

Example

Suppose the frequent Itemset


L = {I1 , I2 , I5 } Transactional Database
Subsets of L are:
TID List of item IDS
{I1 , I2 }, {I1 , I5 }, {I2 , I5 }, {I1 }, {I2 }, {I5 }
Association rules: T100 I1 , I2 , I5
T200 I2 , I4
I1 ∧ I2 ⇒ I5 confidence = 2/4 = 50% T300 I2 , I3
T400 I1 , I2 , I4
I1 ∧ I5 ⇒ I2 confidence = 2/2 = 100% T500 I1 , I3
I2 ∧ I5 ⇒ I1 confidence = 2/2 = 100% T600 I2 , I3
T700 I1 , I3
I1 ⇒ I2 ∧ I5 confidence = 2/6 = 33% T800 I1 , I2 , I3 , I5
I2 ⇒ I1 ∧ I5 confidence = 2/7 = 29% T900 I1 , I2 , I3

I5 ⇒ I1 ∧ I2 confidence = 2/2 = 100%

Question: If the minimum confidence = 70%, what are the strong rules?

21 / 54
Mining Frequent Patterns, Associations and Correlations

Strong Rules are Not Necessarily Interesting

Buys(X, “computer games”) ⇒ buys(X, “videos”)[support 40%,


confidence=66%]
This rule is strong but it is misleading
The probability of purshasing videos is 75% which is even larger
than 66%
In fact computer games and videos are negatively associated
because the purchase of one of these items actually decreases the
likelihood of purchasing the other
The confidence of a rule A ⇒ B can be deceiving
It is only an estimate of the conditional probability of itemset B
given itemset A
It does not measure the real strength of the correlation implication
between A and B
Need to use Correlation Analysis
22 / 54
Mining Frequent Patterns, Associations and Correlations

From Association to Correlation Analysis

Use Lift, a simple correlation measure


The occurrence of itemset A is independent of the occurrence of
itemset B if P (A ∪ B ) = P (A )P (B ), otherwise itemsets A and B are
dependent and correlated as events
The lift between theoccurences of A and B is given by
Lift(A , B ) = P (A ∪ B )/P (A )P (B )
If > 1, then A and B are positively correlated (the occurrence of one
implies the occurrence of the other)
If <1, then A and B are negatively correlated
If =1, then A and B are independent
Example: P ({game, video}) = 0.4/(0.60 × 0.75) = 0.89

23 / 54
Mining Frequent Patterns, Associations and Correlations

Improving the Efficiency of Apriori

Major computational challenges


Huge number of candidates
Multiple scans of transaction database
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Shrink number of candidates
Reduce passes of transaction database scans
Facilitate support counting of candidates

24 / 54
Mining Frequent Patterns, Associations and Correlations

Road Map

Basic Concepts
Apriori Algorithm
Optimization Techniques
FP-Growth

25 / 54
Mining Frequent Patterns, Associations and Correlations

DHP: Direct Hashing with efficient Pruning

DHP has two major features:


Efficient generation for large itemsets
Effective reduction on transaction database size

26 / 54
Mining Frequent Patterns, Associations and Correlations

DHP: Direct Hashing with efficient Pruning

Itemset sup
Tid Items Itemset
{A} 2
10 A, C, D {A}
⇒ {B} 3 ⇒
20 B, C, E {B}
{C} 3
30 A, B, C, E {C}
{D} 1
40 B, E {E}
{E} 3

Tid Items 2-itemsets


10 A, C, D {A,C}, {A, D}, {C,D}
20 B, C, E {B,C}, {B, E}, {C,E}
30 A, B, C, E {A,B}, {A, C}, {A,E}, {B,C}, {B, E}, {C,E}
40 B, E {B, E}

Note. min-support=2

27 / 54
Mining Frequent Patterns, Associations and Correlations

Tid Items 2-itemsets


10 A, C, D {A,C}
20 B, C, E {B,C}, {B, E}, {C,E}
30 A, B, C, E {A,B}, {A, C}, {A,E}, {B,C}, {B, E}, {C,E}
40 B, E {B, E}

Making a hash table

h ({x, y}) = ((order of x) ∗ 10 + (order of y)) mod 7

Hash codes 0 1 2 3 4 5 6
Buckets {A,C} {B,C}
{A,B} {B,E}
{A,C} {C,E}
{A,E} {B,C}
{B,E}
{C,E}
{B,E}
Buckets counters 0 0 4 0 0 7 0
binary vector 0 0 1 0 0 1 0
28 / 54
Mining Frequent Patterns, Associations and Correlations

Itemset sup
Itemset
{A,B} 1
{A,C} 2 {A,C}
{A,E} 1 ⇒ {B,C}
{B,C} 2 {B,E}
{B,E} 3 {C,E}
{C,E} 2

29 / 54
Mining Frequent Patterns, Associations and Correlations

Tid Items 2-itemsets


10 A, C, D φ
20 B, C, E {B,C,E}
30 A, B, C, E {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}
40 B, E φ

Making a hash table

h ({x, y, z}) = ((order of x) ∗ 100 + (order of y) ∗ 10 + (order of z)) mod 7

Hash codes 0 1 2 3 4 5 6
Buckets {A,B,C} {B,C,E}
{A,B,E} {B,C,E}
{A,C,E}
Buckets counters 0 0 3 0 0 2 0
binary vector 0 0 1 0 0 1 0

30 / 54
Mining Frequent Patterns, Associations and Correlations

DHP: Direct Hashing with efficient Pruning

Itemset sup
{A,B,C} 1 Itemset
{A,B,E} 1 ⇒
{B,C,E}
{A,C,E} 1
{B,C,E} 2

31 / 54
Mining Frequent Patterns, Associations and Correlations

DHP: Direct Hashing with efficient Pruning

32 / 54
Mining Frequent Patterns, Associations and Correlations

Partition Large Databases

Subdivide the transactions of D into k non overlapping partitions


Each partition can fit into main memory, thus it is read only once

D1 + D2 + · · · + Dk = D

33 / 54
Mining Frequent Patterns, Associations and Correlations

Partition Large Databases

Step1. Partition database and find local frequent patterns (1


scan)
Step2. Consolidate global frequent patterns (1 scan)
The database is scanned only twice
Correctness. The key to correctness is that any potential large
itemset appears as a large itemset in at least one of the partitions

D1 + D2 + · · · + Dk = D

34 / 54
Mining Frequent Patterns, Associations and Correlations

Partition Large Databases

The counts for the candidate itemsets are generated using a


structure called tidlist
item tidlist
Tid Items
A 10,30
10 A, C, D
20 B, C, E ⇒ B 20,30,40 ⇒ {B,C} — 20,30
C 10,20,30
30 A, B, C, E
D 10
40 B, E
E 20,30,40

Correctness:
It is easy to see that the intersection of tidlists gives the correct
support for an itemset
Since the partitions are non overlapping, a cumulative count over
all partitions gives the global support for an itemset

35 / 54
Mining Frequent Patterns, Associations and Correlations

Partition Large Databases

#Transactions= 200,000

36 / 54
Mining Frequent Patterns, Associations and Correlations

Road Map

Basic Concepts
Apriori Algorithm
Optimization Techniques
FP-Growth

37 / 54
Mining Frequent Patterns, Associations and Correlations

Frequent Pattern Growth

Adopts a divide and conquer strategy


Compress the database representing frequent items into a
frequent–pattern tree or FP-tree
Retains the itemset association information
Divide the compressed database into a set of conditional
databases, each associated with one frequent item
Mine each such databases separately

38 / 54
Mining Frequent Patterns, Associations and Correlations

Example FP-Growth

The first scan of data is the same as Apriori


Derive the set of frequent 1-itemsets
Let min-sup=2
Generate a set of ordered items
Transactional Database

TID List of item IDS


T100 I1 , I2 , I5 Item ID Support count
T200 I2 , I4 I2 7
T300 I2 , I3 I1 6
T400 I1 , I2 , I4 I3 6
T500 I1 , I3 I4 2
T600 I2 , I3 I5 2
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

39 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


T100 I1 , I2 , I5
T200 I2 , I4 Create a branch for each transaction
T300 I2 , I3 Items in each transaction are processed in order
T400 I1 , I2 , I4
T500 I1 , I3
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

40 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T100 : {I2 , I1 , I5 }
T500 I1 , I3 2 Construct the first branch: < I2 : 1 >, < I1 : 1 >, < I5 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

41 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T200 : {I2 , I4 }
T500 I1 , I3 2 Construct the second branch: < I2 : 1 >, < I4 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

42 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T300 : {I2 , I3 }
T500 I1 , I3 2 Construct the third branch: < I2 : 2 >, < I3 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

43 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T400 : {I2 , I1 , I4 }
T500 I1 , I3 2 Construct the fourth branch: < I2 : 3 >, < I1 : 1 >, < I4 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

44 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T400 : {I2 , I1 , I4 }
T500 I1 , I3 2 Construct the fourth branch: < I2 : 3 >, < I1 : 1 >, < I4 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

45 / 54
Mining Frequent Patterns, Associations and Correlations

TID List of item IDS


Create a branch for each transaction
T100 I1 , I2 , I5 Items in each transaction are processed in order
T200 I2 , I4
T300 I2 , I3
T400 I1 , I2 , I4 1 Order the items T400 : {I1 , I3 }
T500 I1 , I3 2 Construct the fifth branch: < I1 : 1 >, < I3 : 1 >
T600 I2 , I3
T700 I1 , I3
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

46 / 54
Mining Frequent Patterns, Associations and Correlations

Construct the FP-Tree

The problem of mining frequent patterns in databases is transformed to


that of mining the FP-tree

47 / 54
Mining Frequent Patterns, Associations and Correlations

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

Occurrences of I5 :< I2 , I1 , I5 > and < I2 , I1 , I3 , I5 >


Two prefix Paths < I2 , I1 : 1 > and < I2 , I1 , I3 : 1 >
Conditional FP tree contains only < I2 : 2, I1 : 2 >, I3 is not
considered because its support count of 1 is less than the
minimum support count.
Frequent patterns {I2 , I5 : 2}, {I1 , I5 : 2}, {I2 , I1 , I5 : 2}

48 / 54
Mining Frequent Patterns, Associations and Correlations

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

TID Conditional Pattern Base Conditional FP-tree


I5 {{I2 , I1 : 1}, {I2 , I1 , I3 : 1}} < I2 : 2, I1 : 2 >
I4 {{I2 , I1 : 1}, {I2 , 1}} < I2 : 2 >
I3 {{I2 , I1 : 2}, {I2 : 2}, {I1 : 2}} < I2 : 4, I1 : 2 >, < I1 : 2 >
I1 {I2 , 4} < I2 : 4 >

49 / 54
Mining Frequent Patterns, Associations and Correlations

Construct the FP-Tree

Item ID Support count


I2 7
I1 6
I3 6
I4 2
I5 2

TID Conditional FP-tree Frequent Patterns Generated


I5 < I2 : 2, I1 : 2 > {I2 , I5 : 2}, {I1 , I5 : 2}, {I2 , I1 , I5 : 2}
I4 < I2 : 2 > {I2 , I4 : 2}
I3 < I2 : 4, I1 : 2 >, < I1 : 2 > {I2 , I3 : 4}, {I1 , I3 : 4}, {I2 , I1 , I3 : 2}
I1 < I2 : 4 > {I2 , I1 : 4}

50 / 54
Mining Frequent Patterns, Associations and Correlations

FP-growth properties

FP-growth transforms the problem of finding long frequent


patterns to searching for shorter ones recursively and
concatenating the suffix
It uses the least frequent suffix offering a good selectivity
It reduces the search cost
If the tree does not fit into main memory, partition the database
Efficient and scalable for mining both long and short frequent
patterns

51 / 54
Mining Frequent Patterns, Associations and Correlations

ECLAT: FP Mining with Vertical Data Format

Both Apriori and FP-growth use horizontal data format


Alternatively data can also be represented in vertical format
Transform by scanning the database once

TID List of item IDS


T100 I1 , I2 , I5 itemset TID_set
T200 I2 , I4
T300 I2 , I3 I1 {T100 , T400 , T500 , T700 , T800 , T900 }
T400 I1 , I2 , I4 ⇒ I2 {T100 , T200 , T300 , T400 , T600 , T800 , T900 }
T500 I1 , I3 I3 {T300 , T500 , T600 , T700 , T800 , T900 }
T600 I2 , I3 I4 {T200 , T400 }
T700 I1 , I3 I5 {T100 , T800 }
T800 I1 , I2 , I3 , I5
T900 I1 , I2 , I3

The support count of an itemset is simply the length of the TID_set


of the itemset 52 / 54
Mining Frequent Patterns, Associations and Correlations

Frequent 1-itemsets in vertical format

itemset TID_set
I1 {T100 , T400 , T500 , T700 , T800 , T900 }
I2 {T100 , T200 , T300 , T400 , T600 , T800 , T900 }
I3 {T300 , T500 , T600 , T700 , T800 , T900 }
I4 {T200 , T400 }
I5 {T100 , T800 }

The frequent k-itemsets are used to construct candidate (k+1)-itemsets based on the Apriori
property

Frequent 2-itemsets in vertical format (consider min_sup = 2)

itemset TID_set
{I1 , I2 } {T100 , T400 , T800 , T900 }
{I1 , I3 } {T500 , T700 , T800 , T900 }
{I1 , I4 } {T400 }
{I1 , I5 } {T100 , T800 }
{I2 , I3 } {T300 , T600 , T800 , T900 }
{I2 , I4 } {T200 , T400 }
{I2 , I5 } {T100 , T800 }
{I3 , I5 } {T800 }
53 / 54
Mining Frequent Patterns, Associations and Correlations

Frequent 3-itemsets in vertical format

itemset TID_set
{I1 , I2 , I3 } {T800 , T900 }
{I1 , I2 , I5 } {T100 , T800 }

This process repeats, with k incremented by 1 each time, until no


frequent items or no candidate itemsets can be found
Properties of mining with vertical data format
Take the advantage of the Apriori property in the generation of
candidate (k + 1)-itemset from k -itemsets
No need to scan the database to find the support of (k + 1) itemsets,
for k ≥ 1
The TID_set of each k -itemset carries the complete information
required for counting such support
The TID-sets can be quite long, hence expensive to manipulate
Use diffset technique to optimize the support count computation

54 / 54

You might also like