Data Mining & Business Intelligence (2170715) : Unit-5 Concept Description and Association Rule Mining

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 39

Data Mining & Business

Intelligence(2170715)

Unit-5
Concept Description
and Association Rule
Mining
Outline
 What is concept description?
 Market basket analysis
 Association Rule Mining
 Generating Rules
 Improved apriori algorithm
 Incremental ARM (Association Rule Mining)
 Associative Classification
 Rule Mining

Unit: 5 – Concept Description and Association Rule Mining 2


Concept description
 Data mining can be classified into two categories: descriptive data
mining and predictive data mining.
 Descriptive data mining describes the data set in a concise and
summative manner and presents interesting general properties
of the data.
 Predictive data mining analyzes the data in order to construct
one or a set of models, and attempts to predict the behavior of
new data sets.
 Database is usually storing the large amounts of data in great
detail. However users often like to view sets of summarized data
in concise, descriptive terms.

Unit: 5 – Concept Description and Association Rule Mining 3


Concept description (Cont..)
The simplest kind of descriptive data mining is called
concept description.
 A concept usually refers to a collection of data such as
frequent_buyers, graduate_students etc.
 As a data mining task, concept description is not a simple
enumeration (number of things done one by one) of the data.
 Concept description generates descriptions for characterization
and comparison of the data it is also called class description.

Unit: 5 – Concept Description and Association Rule Mining 4


Concept description (Cont..)
 Characterization provides a concise and brief summarization of
the data.
 While concept or class comparison (also known as discrimination)
provides discriminations (inequity) comparing two or more
collections of data.
 Example
• Given the ABC Company database, for example, examining individual
customer transactions.
• Sales managers may prefer to view the data generalized to higher levels,
such as summarized by customer groups according to geographic regions,
frequency of purchases per group and customer income.

Unit: 5 – Concept Description and Association Rule Mining 5


Market basket analysis
 Market Basket Analysis is a modelling technique.
 It is based on, if you buy a certain group of items, you are more
(or less) likely to buy another group of items.
 For example, if you are in a store and you buy a car then you are
more likely to buy insurance at the same time than somebody
who didn't buy insurance.
 The set of items a customer buys is referred to as an itemset.
 Market basket analysis seeks to find relationships between
purchases (Items).
 E.g. IF {Car, Accessories} THEN {Insurance}
{Car, Accessories}  {Insurance}

Unit: 5 – Concept Description and Association Rule Mining 6


Market basket analysis (Cont..)
{Car, Accessories}  {Insurance}
 The probability that a customer will buy car without an
accessories is referred as the support for rule.
 The conditional probability that a customer will purchase
Insurance is referred to as the confidence.

Unit: 5 – Concept Description and Association Rule Mining 7


Association rule mining
 Given a set of transactions, we need rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction.
 Market-Basket transactions
TID Items
1 Bread, Milk Example of Association Rules
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke
{Chocolate} → {Pepsi},
{Milk, Bread} → {Eggs, Coke},
4 Bread, Milk, Chocolate, Pepsi
{Pepsi, Bread} → {Milk}
5 Bread, Milk, Chocolate, Coke

Unit: 5 – Concept Description and Association Rule Mining 8


Association rule mining (Cont..)
 Itemset TID Items

• A collection of one or more items 1 Bread, Milk


o E.g. : {Milk, Bread, Chocolate} 2 Bread, Chocolate, Pepsi, Eggs
• k-itemset 3 Milk, Chocolate, Pepsi, Coke
An itemset that contains k items 4 Bread, Milk, Chocolate, Pepsi
5 Bread, Milk, Chocolate, Coke
 Support count (σ)
• Frequency of occurrence of an itemset
o E.g. σ({Milk, Bread, Chocolate}) = 2

 Support
• Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Chocolate}) = 2/5

 Frequent Itemset
• An itemset whose support is greater than or equal to a minimum support
threshold
Unit: 5 – Concept Description and Association Rule Mining 9
Association rule mining (Cont..)
 Association Rule
• An implication expression of the form X → Y, where X and Y are itemsets
o E.g.: {Milk, Chocolate} → {Pepsi}

 Rule Evaluation
• Support (s)
o Fraction of transactions that contain both X and Y
• Confidence (c)
o Measures how often items in Y appear in transactions that contain X

Unit: 5 – Concept Description and Association Rule Mining 10


Association rule mining (Cont..)
TID Items
1 Bread, Milk
2 Bread, Chocolate, Pepsi, Eggs
3 Milk, Chocolate, Pepsi, Coke
4 Bread, Milk, Chocolate, Pepsi
5 Bread, Milk, Chocolate, Coke

Example:
Find support & confidence for {Milk, Chocolate} ⇒ Pepsi
   
= c = 67

Unit: 5 – Concept Description and Association Rule Mining 11


Association rule mining - Example
TID Items Calculate Support & Confidence:
1 Bread, Milk {Milk, Chocolate} → {Pepsi}
2 Bread, Chocolate, Pepsi, Eggs {Milk, Pepsi} → {Chocolate}
{Chocolate, Pepsi} → {Milk}
3 Milk, Chocolate, Pepsi, Coke
{Pepsi} → {Milk, Chocolate}
4 Bread, Milk, Chocolate, Pepsi {Chocolate} → {Milk, Pepsi}
5 Bread, Milk, Chocolate, Coke {Milk} → {Chocolate, Pepsi}

Answer
Support (s) : 0.4
{Milk, Chocolate} → {Pepsi} c = 0.67
{Milk, Pepsi} → {Chocolate} c = 1.0
{Chocolate, Pepsi} → {Milk} c = 0.67
{Pepsi} → {Milk, Chocolate} c = 0.67
{Chocolate} → {Milk, Pepsi} c = 0.5
{Milk} → {Chocolate, Pepsi} c = 0.5
Unit: 5 – Concept Description and Association Rule Mining 12
Association rule mining (Cont..)
 A common strategy adopted by many association rule
mining algorithms is to decompose the problem into two
major subtasks:
1. Frequent Itemset Generation
• The objective is to find all the item-sets that satisfy
the minimum support threshold.
• These itemsets are called frequent itemsets.
2. Rule Generation
• The objective is to extract all the high-confidence
rules from the frequent itemsets found in the
previous step.
Unit: 5 – Concept Description and Association Rule Mining 13
Apriori algorithm
 Purpose: The Apriori Algorithm is an influential algorithm for
mining frequent itemsets for Boolean association rules.
 Key Concepts:
• Frequent Itemsets:
The sets of item which has minimum support (denoted by Li for ith-Itemset).
• Apriori Property:
Any subset of frequent itemset must be frequent.
• Join Operation:
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 itself.

Unit: 5 – Concept Description and Association Rule Mining 14


Apriori algorithm (Cont..)
 Find the frequent itemsets
• The sets of items that have minimum support and a subset of a frequent
itemset must also be a frequent itemset (Apriori Property).
• E.g. if {AB} is a frequent itemset, both {A} and {B} should be a frequent
itemset.
• Use the frequent item sets to generate association rules.
 The Apriori Algorithm : Pseudo code
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1) itemset that is not frequent cannot be a subset of a
frequent k-itemset

Unit: 5 – Concept Description and Association Rule Mining 15


Apriori algorithm steps (Cont..)
 Step 1:
• Start with itemsets containing just a single item (Individual items).
 Step 2:
• Determine the support for itemsets.
• Keep the itemsets that meet your minimum support threshold and
remove itemsets that do not support minimum support.
 Step 3:
• Using the itemsets you have kept from Step 1, generate all the possible
itemset combinations.
 Step 4:
• Repeat steps 1 & 2 until there are no more new itemsets.

Unit: 5 – Concept Description and Association Rule Mining 16


Apriori algorithm - Pseudo code
(Cont..)
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1= {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
Increment the count of all candidates in C k+1
That are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
Unit: 5 – Concept Description and Association Rule Mining 17
Apriori algorithm - Example Minimum Support = 2

TID Items C1 ItemSet Min. Sup L1 ItemSet Min. Sup


100 134 {1} 2 {1} 2
Scan D
200 235 {2} 3 {2} 3
300 1235 {3} 3 {3} 3
400 25 {4} 1 {5} 3
{5} 3
C2 C2
Itemset Min. Sup Itemset
L2 ItemSet Min. Sup
{1 2} 1 {1 2}
{1 3} 2
{1 3} 2 Scan D {1 3}
{2 3} 2
{1 5} 1 {1 5}
{2 5} 3
{2 3} 2 {2 3}
{3 5} 2
{2 5} 3 {2 5}
{3 5} 2 {3 5}

Unit: 5 – Concept Description and Association Rule Mining 18


Apriori algorithm - Example Minimum Support = 2

ItemSet Min. Sup ItemSet Min. Sup


L2 {1 3} 2 C3 {1 2 3} 1
L3 Items Sup
{2 3} 2 Scan D {1 2 5} 1
{2 3 5} 2
{2 5} 3 {1 3 5} 1
{3 5} 2 {2 3 5} 2
Rules generation
Association Rule Support Confidence Confidence (%)
2^35 2 2/2 = 1 100 %
3^52 2 2/2 = 1 100 %
2^53 2 2/3 = 0.66 66%
2=3^5 2 2/3 = 0.66 66%
3=2^5 2 2/3 = 0.66 66%
5=2^3 2 2/3 = 0.66 66%

Unit: 5 – Concept Description and Association Rule Mining 19


Improve apriori’s efficiency
 Hash-based itemset counting: A k-itemset whose corresponding hashing
bucket count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any frequent
k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB.
 Sampling: Mining on a subset of given data, lower support threshold + a
method to determine the completeness.
 Dynamic itemset counting: Add new candidate itemsets only when all of
their subsets are estimated to be frequent.

Unit: 5 – Concept Description and Association Rule Mining 20


Incremental Mining of Association
Rules
 It is noted that analysis of past transaction data can provide very
valuable information on customer buying behavior, and thus improve
the quality of business decisions.
 With the increasing use of the record-based databases whose data is
being continuously added, updated, deleted etc.
 Examples of such applications include Web log records, stock market
data, grocery sales data, transactions in e-commerce, and daily
weather/traffic records etc.
 In many applications, we would like to mine the transaction database
for a fixed amount of most recent data (say, data in the last 12 months).

Unit: 5 – Concept Description and Association Rule Mining 21


Incremental Mining of Association
Rules
 Mining is not a one-time operation, a naive approach to solve the
incremental mining problem is to re-run the mining algorithm on the
updated database.
Mining

Data

Feedback Business
Strategy

Unit: 5 – Concept Description and Association Rule Mining 22


FP-Growth Algorithm
 The FP-Growth Algorithm is proposed by Han.
 It is an efficient and scalable method for mining the complete set
of frequent patterns.
 Using prefix-tree structure for storing information about frequent
patterns named frequent-pattern tree (FP-tree).
 Once an FP-tree has been constructed, it uses a recursive divide-
and-conquer approach to mine the frequent item sets.

Unit: 5 – Concept Description and Association Rule Mining 23


FP-Growth Algorithm (Cont..)
 Building the FP-Tree
1. Scan data to determine the support count of each item.
• Infrequent items are discarded, while the frequent items are sorted in
decreasing support counts.
2. Make a second pass over the data to construct the FP­-tree.
• As the transactions are read, before being processed, their items are sorted
according to the above order.

Unit: 5 – Concept Description and Association Rule Mining 24


FP-Growth Algorithm - Example
FP-Tree Generation Step:1 Step:2
Freq. 1-Itemsets. Transactions with items sorted based on
TID Transactions
Min_Sup  2 frequencies, and ignoring the infrequent
1 ABCEFO Transactions items.

2 ACG A:8 ACEBF


3 EI C:8 ACG
4 ACDEG E:8 E
G:5 ACEGD
5 ACEGL
B:2 ACEG
6 EJ D:2 E
7 ABCEFP F:2 ACEBF
8 ACD Remaining all ACD
O,I,J,L,P,M & N
9 ACEGM is with ACEG
min_sup = 1
10 ACEGN ACEG
Unit: 5 – Concept Description and Association Rule Mining 25
FP-Tree after reading 1st transaction
ACEBF Header null
ACG
A:8 A:1
E
C:8
ACEGD
E:8 C:1
ACEG
G:5
E B:2 E:1
ACEBF D:2
ACD F:2 B:1
ACEG
ACEG F:1

Unit: 5 – Concept Description and Association Rule Mining 26


FP-Tree after reading 2nd transaction
ACEBF Header null
ACG
A:8 A:2
E
C:8
ACEGD
E:8 C:2
ACEG
G:5
E B:2 G:1
E:1
ACEBF D:2
ACD F:2 B:1
ACEG
ACEG F:1

Unit: 5 – Concept Description and Association Rule Mining 27


FP-Tree after reading 3rd transaction
ACEBF Header null
ACG
A:8
E A:2 E:1
C:8
ACEGD
E:8 C:2
ACEG
G:5
E B:2 G:1
E:1
ACEBF D:2
ACD F:2 B:1
ACEG
ACEG F:1

Unit: 5 – Concept Description and Association Rule Mining 28


FP-Tree after reading 4th transaction
ACEBF Header null
ACG
E A:8 A:3 E:1
C:8
ACEGD
E:8 C:3
ACEG
G:5
E B:2 G:1
E:2
ACEBF D:2
ACD F:2 B:1
G:1
ACEG
ACEG F:1 D:1

Unit: 5 – Concept Description and Association Rule Mining 29


FP-Tree after reading 5th transaction
ACEBF Header null
ACG
E A:8 A:4 E:1
ACEGD C:8
E:8 C:4
ACEG G:5
E B:2 G:1
E:3
ACEBF D:2
ACD F:2 B:1
G:2
ACEG
ACEG F:1 D:1

Unit: 5 – Concept Description and Association Rule Mining 30


FP-Tree after reading 6th transaction
ACEBF Header null
ACG
E A:8 A:4 E:2
ACEGD C:8
E:8 C:4
ACEG
G:5
E B:2 E:3 G:1
ACEBF D:2
ACD F:2 B:1
G:2
ACEG
ACEG F:1 D:1

Unit: 5 – Concept Description and Association Rule Mining 31


FP-Tree after reading 7th transaction
ACEBF Header null
ACG
E A:8 A:5 E:2
ACEGD C:8
E:8 C:5
ACEG
G:5
E
B:2 E:4 G:1
ACEBF D:2
ACD F:2 B:2
G:2
ACEG
ACEG F:2 D:1

Unit: 5 – Concept Description and Association Rule Mining 32


FP-Tree after reading 8th transaction
ACEBF Header null
ACG
E A:8 A:6 E:2
ACEGD C:8
E:8 C:6
ACEG
G:5
E
B:2 E:4 G:1 D:1
ACEBF
D:2
ACD F:2 B:2
G:2
ACEG
ACEG F:2 D:1

Unit: 5 – Concept Description and Association Rule Mining 33


FP-Tree after reading 9th transaction
ACEBF Header null
ACG
E A:8 A:7 E:2
ACEGD C:8
E:8 C:7
ACEG
G:5
E
B:2 E:5 G:1 D:1
ACEBF
D:2
ACD F:2 G:3
B:2
ACEG
ACEG F:2 D:1

Unit: 5 – Concept Description and Association Rule Mining 34


FP-Tree after reading 10th transaction
ACEBF Header null
ACG
E A:8 A:8 E:2
ACEGD C:8
E:8 C:8
ACEG
G:5
E
B:2 E:6 G:1 D:1
ACEBF
D:2
ACD F:2 G:4
B:2
ACEG
ACEG F:2 D:1

Unit: 5 – Concept Description and Association Rule Mining 35


FP-Growth algorithm - Example
Minimum Support
>= 2 FP-Tree

TID Items
null
1 AB
Header B:8 A:2
2 BCD
3 ACDE Item Support
A:5 C:3 C:1 D:1
4 ADE B 8
5 ABC A 7 C:3 D:1 D:1 E:1 D:1 E:1
6 ABCD C 7
D:1 E:1
7 BC D 5
8 ABC E 3
9 ABD
10 BCE

Unit: 5 – Concept Description and Association Rule Mining 36


FP-Growth Example (Try it) Minimum Support = 3

TID Items Bought Frequent Items


(sorted order)
100 FACDGIMP FCAMP
200 ABCFLMO FCABM
300 BFHJOW FB
400 BCKSP CBP
500 AFCELPMN FCAMP

Unit: 5 – Concept Description and Association Rule Mining 37


FP-Growth Example - Answer
FP-Tree Construction

null
Header Table
f:4 c:1
Item frequency
f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3 m:2 b:1

p:2 m:1

Unit: 5 – Concept Description and Association Rule Mining 38

You might also like