Mining Frequent Itemsets Using Vertical Data Format

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 14

Mining Frequent Itemsets Using

Vertical Data Format


Data Format

Horizontal Data Format

Vertical Data Format


Mining frequent itemsets using vertical data format -
ECLAT
• ECLAT – Equivalence Class Clustering and bottom-up Lattice
Traversal
• Mining is performed by intersecting the TID sets of every pair
of frequent single items.
support count of an itemset is
simply the length of the TID set of
the itemset

Minimum Support Count =2

{I1, I4} and {I3, I5} each contain only one


transaction, they do not belong to the set of
frequent 2-itemsets.
Mining frequent itemsets using vertical data
format
• Based on the Apriori property, a given 3-itemset is a candidate
3-itemset only if every one of its 2-itemset subsets is
frequent.
The candidate generation
process here will generate
only two 3-itemsets: {I1, I2, I3}
and {I1, I2, I5}.

There are only two frequent 3-


itemsets: {I1, I2, I3: 2} and {I1, I2,
I5: 2}

Minimum Support Count =2


Mining frequent itemsets using vertical data
format
• Advantages
– Take advantage of the Apriori property in the generation of
candidate (k + 1)-itemset from frequent k-itemsets
– There is no need to scan the database to find the support

• Drawback
– The TID sets can be quite long, taking substantial memory
space as well as computation time for intersecting the long
sets.
Mining frequent itemsets using vertical data
format
• To overcome the drawback:
– a technique called diffset is used, which keeps track of only
the differences of the TID sets of a (k + 1)-itemset and a
corresponding k-itemset
– Example:
{I1} = {T100, T400, T500, T700, T800, T900} and
{I1, I2} = {T100, T400, T800, T900}.
The diffset between the two is diffset({I1, I2}, {I1}) = {T500,
T700}.
Challenge in mining a Large Set
• Mining frequent itemsets from a large data set often
generates a huge number of itemsets satisfying the minimum
support (min sup) threshold, especially when min sup is set
low.

• To overcome the difficulty, closed frequent itemset and


maximal frequent itemset are introduced
Closed Frequent Itemset
• An itemset X is closed in a data set D if there exists no proper
superitemset Y such that Y has the same support count as X in
D.
• An itemset X is a closed frequent itemset in set D if X is both
closed and frequent in D.
• An itemset X is a maximal frequent itemset (or max-itemset)
in a data set D if X is frequent, and there exists no super-
itemset Y such that X ⊂ Y and Y is frequent in D.
https://towardsdatascience.com/how-to-find-closed-and-maximal-frequent-itemsets-
from-fp-growth-861a1ef13e21
Limitations of Frequent Itemset Mining
• Quantity of items are not taken into account
– ie., if a customer has bougth five breads, ten breads or twenty
breads, it is viewed as the same.
• All items are viewed as having the same importance, utility
of weight.
– For example, if a customer buys a very expensive item or just a
piece of bread, it is viewed as being equally important.
• Thus, frequent pattern mining may find many frequent
patterns that are not interesting.
– For example, one may find that {bread, milk} is a frequent pattern.
However, from a business perspective, this pattern may be
uninteresting because it does not generate much profit.
Moreover, frequent pattern mining algorithms may miss the rare
patterns that generate a high profit 

https://data-mining.philippe-fournier-viger.com/introduction-high-utility-itemset-mining/
High-Utility Itemset Mining
• In this problem, a transaction database contains transactions
where purchase quantities are taken into account as well as the
unit profit of each item.

• The objective is to find the itemsets (group of items) that


generate a high profit in a database when they are sold together. 
High-Utility Itemset Mining
• The user has to provide the minimum utility threshold “minutil”.
• A high-utility itemset mining algorithm outputs all the high-
utility itemsets, that generate at least “minutil” profit.  
• For example, consider that “minutil” is set to 25 $ by the user.
The result of a high utility itemset mining algorithm would be
the following.

for {a,e}, its utility is the sum of 8$ + 16 $ = 24$ because it appears only in
transactions T0 and transaction T3

You might also like