Mining Frequent, Patterns, Associations, and Correlations

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 13

Mining Frequent,Patterns,

Associations, and Correlations


• Frequent patterns are patterns (such as itemsets, subsequences, or

substructures) that appear in a data set frequently. For example, a set of

items, such as milk and bread, that appear frequently together in a transaction

data set is a frequent itemset.

• A subsequence, such as buying first a PC, then a digital camera, and then a

memory card, if it occurs frequently in a shopping history database, is a

(frequent) sequential pattern.

• A substructure can refer to different structural forms, such as subgraphs,

subtrees, or sub lattices, which may be combined with itemsets or

subsequences.

• If a substructure occurs frequently, it is called a (frequent) structured pattern.


Basic Concepts and a Road Map - Market
Basket Analysis
• Example of frequent itemset mining is market
basket analysis.
• This process analyzes customer buying habits by
finding associations between the different items
that customers place in their “shopping baskets”
• The discovery of such associations can help
retailers develop marketing strategies by gaining
insight into which items are frequently
purchased together by customers.
• The Boolean vectors can be analyzed for buying
patterns that reflect items that are frequently
associated or purchased together.
• These patterns can be represented in the form of
association rules.
• For example, the information that customers who
purchase computers also tend to buy antivirus
software at the same time is represented in
Association Rule :
• Computer=>antivirus software [support = 2%;
confidence = 60%]
• Association Rule support and confidence are two measures of rule
interestingness

• A support of 2% for Association Rule (5.1) means that 2% of all the


transactions under analysis show that computer and antivirus software
are purchased together.

• A confidence of 60% means that 60% of the customers who purchased


a computer also bought the software.

• Typically, association rules are considered interesting if they satisfy both


a minimum support threshold and a minimum confidence threshold.

• Such thresholds can be set by users or domain experts.


Frequent Itemsets, Closed Itemsets, and Association
Rules
• Itemset: a set of items
• k-itemset: an itemset containing k items
• Let I = {I1, …, Im} be the set of all possible items
• T  I is a transaction, each associated with TID
• DB is a set of transactions
• An association rule is of the form A =>B with minimum support and confidence,
where A  I, B  I, A  B = 

Transaction-id Items bought


10 a, b, d
20 a, c, d
• In the example, I = {a, b, c, d, e, f}
30 a, d, e
• A possible association rule: d => a
– Better notation {d} => {a} 40 b, e, f
50 b, c, d, e, f
• support (A=>B) = P(A U B)
– Probability that a transaction contains A  B, i.e., each item in the union; not to be confused with
P(A or B), which indicates either A or B

– Percentage of transactions in DB that contain A  B,


– Also referred to as relative support
– Absolute support: number of transactions that contain A  B. also called occurrence frequency,
frequency, support count, count,
– Relative support = absolute support / |DB|
• confidence(A=>B) = P(B|A)
– conditional probability that a transaction having A also contains B
– confidence(A=>B) = P(B|A) = support(A  B) / support (A)

= support_count(A  B) / support_count(A)

• confidence(A=>B) can be easily derived from the support count of A and A  B,


association rule mining can be reduced to frequent pattern mining
• frequent (large) itemset: itemset satisfying min_sup threshold
• strong rule: rules satisfying both min_sup and min_conf thresholds
• In general, association rule mining can be
viewed as a two-step process:
1. Find all frequent itemsets: By definition, each
of these itemsets will occur at least as
frequently as a predetermined minimum
support count, min sup.
2. Generate strong association rules from the
frequent itemsets: By definition, these rules
must satisfy minimum support and minimum
confidence.
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-
patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) =
2100 – 1 = 1.27*1030 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-
pattern Y ‫ כ‬X, with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists
no frequent super-pattern Y ‫ כ‬X
• Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules
• A frequent itemset is one that occurs in at
least a user-specific percentage of the
database. That percentage is called support.
• An itemset is closed if none of its immediate
supersets has the same support as the
itemset.
• An itemset is maximal frequent if none of its
immediate supersets is frequent.

You might also like