Data Mining: Department of Information Technology University of The Punjab, Jhelum Campus


DATA MINING

Ayesha Irfan
Department of Information Technology
University Of The Punjab, Jhelum Campus
Data mining
 Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, for example information that can be used to increase revenue.
 Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD.
 Data, Information and Knowledge
 Data is any facts, numbers, or text that can be processed by a computer.
 Information: the patterns, associations, or relationships among the data can provide information.
Data mining
 Knowledge: the information can be converted into knowledge about historical trends and future patterns.
 Data mining —> searching for knowledge (interesting patterns) in data.
Knowledge discovery process

Data mining as a step in the process of knowledge discovery.


Knowledge discovery process
 Steps of knowledge discovery process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task
are retrieved from the database)
4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied to extract data patterns)
Knowledge discovery process

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
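The seven steps above can be sketched as a tiny pipeline. This is an illustrative sketch only: the toy records and the helper functions (clean, integrate, select, transform) are invented for the example, not taken from any library.

```python
# Toy raw data gathered from two hypothetical sources.
raw = [
    {"id": 1, "salary": 30},
    {"id": 2, "salary": None},   # inconsistent record, to be cleaned
    {"id": 1, "salary": 30},     # duplicate arriving from a second source
    {"id": 3, "salary": 110},
]

def clean(records):
    # Step 1: data cleaning - drop records with missing values
    return [r for r in records if r["salary"] is not None]

def integrate(records):
    # Step 2: data integration - merge duplicates from multiple sources
    seen, merged = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            merged.append(r)
    return merged

def select(records):
    # Step 3: data selection - keep only the attribute relevant to the task
    return [r["salary"] for r in records]

def transform(values):
    # Step 4: data transformation - consolidate into an aggregated form
    return {"n": len(values), "total": sum(values)}

# Steps 5-7 (mining, evaluation, presentation) would consume this summary.
summary = transform(select(integrate(clean(raw))))
print(summary)  # {'n': 2, 'total': 140}
```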
Data mining: Confluence of multiple disciplines
Basic Statistical Description of data

 For data pre-processing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
Basic Statistical Description of data

 Mean
 Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
 Mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
 = 696 / 12 = 58
 Thus, the mean salary is $58,000.
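The arithmetic above can be checked with a few lines of Python:

```python
# Salary values (in thousands of dollars) from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = sum(salaries) / len(salaries)
print(mean)  # 58.0, i.e. a mean salary of $58,000
```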


Basic Statistical Description of data

 Median
 Odd number of values. Data: 1, 3, 3, 6, 7, 8, 9 (n = 7)
 Median = ((n + 1)/2)th value = 4th value = 6
 Even number of values. Data: 1, 2, 3, 4, 7, 8, 9, 10 (n = 8)
 Median = (4th value + 5th value)/2 = (4 + 7)/2 = 5.5
Basic Statistical Description of data

 Mode
 The mode is the value that appears most often.
 Some data may not have a mode.
 Some data may have more than one mode; data with two modes are known as bimodal.
 Data: 0, 2, 1, 0, 0, 3, 2, 4, 2, 2
 Mode: 2
 Data: 5, 2, 1, 3, 1, 4, 5, 5, 4, 1
 Mode: 1 and 5
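A sketch in Python that finds every mode (so bimodal data returns two values), using the standard library's Counter:

```python
from collections import Counter

def modes(values):
    counts = Counter(values)   # value -> number of occurrences
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0, 2, 1, 0, 0, 3, 2, 4, 2, 2]))  # [2]
print(modes([5, 2, 1, 3, 1, 4, 5, 5, 4, 1]))  # [1, 5]
```

Note this sketch always returns the most frequent values; detecting "no mode" (every value equally frequent) would need one extra check.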
Basic Statistical Description of data

 Standard Deviation
 Data: 4, 2, 5, 8, 6 (n = 5, mean x̄ = 5)

x    x̄    x - x̄    (x - x̄)²
2    5    -3        9
4    5    -1        1
5    5     0        0
6    5     1        1
8    5     3        9
              Σ(x - x̄)² = 20

 σ = √(Σ(x - x̄)² / (n - 1)) = √(20/4) = √5 = 2.24
 Variance
 Variance = σ² = (√5)² = 5
Box Plot Theory
 Dataset: 14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20
 Arranged data: 1, 1, 2, 2, 3, 3, 4, 4, 6, 7, 8, 10, 11, 14, 15, 20, 22 (n = 17)
 Median = 9th value = 6
 Q1 = median of the lower 8 values = (2 + 3)/2 = 2.5
 Q3 = median of the upper 8 values = (11 + 14)/2 = 12.5
 Whiskers extend to the minimum and maximum: 1 and 22
 Five-number summary: 1, 2.5, 6, 12.5, 22
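The five-number summary above can be computed with a short sketch (using the slide's quartile convention: the halves on either side of the median, excluding it, give Q1 and Q3):

```python
def median(s):  # s must already be sorted
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

data = sorted([14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20])
n = len(data)                      # 17 values
med = median(data)                 # 6, the 9th value
q1 = median(data[: n // 2])        # median of the lower 8 values -> 2.5
q3 = median(data[(n + 1) // 2 :])  # median of the upper 8 values -> 12.5
print(data[0], q1, med, q3, data[-1])  # 1 2.5 6 12.5 22
```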
MINING FREQUENT ITEMSETS
Apriori Algorithm
 Let’s look at an example. The dataset is given in the table below. We will apply the Apriori algorithm to find the frequent itemsets. min_sup = 3

TID   Items
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E
Apriori Algorithm
 Step 1: Count the support of each item to form the candidate set C1, then keep the items with sup_count ≥ min_sup to form L1.

C1:
Item  Sup_count
M     3
O     4
N     2
K     5
E     4
Y     3
D     1
A     1
U     1
C     2

L1:
Item  Sup_count
M     3
O     4
K     5
E     4
Y     3
Apriori Algorithm

C2 (candidate 2-itemsets):
Item  Sup_count
MO    1
MK    3
ME    2
MY    2
OK    3
OE    3
OY    2
KE    4
KY    3
EY    2

L2 (frequent 2-itemsets):
Item  Sup_count
MK    3
OK    3
OE    3
KE    4
KY    3
Apriori Algorithm

C3 (candidate 3-itemsets):
Item  Sup_count
MKO   1
MKE   2
MKY   2
OKE   3
OKY   2
KEY   2

L3 (frequent 3-itemsets):
Item  Sup_count
OKE   3
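The L1/L2/L3 results above can be verified with a brute-force support count (a sketch only: it checks every candidate rather than using Apriori's candidate generation and pruning):

```python
from itertools import combinations

# Each transaction is treated as a set, so the repeated O in T5 counts
# once, giving O a support of 3 (still frequent at min_sup = 3).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
min_sup = 3

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c) for c in combinations(items, k)
             if support(set(c)) >= min_sup}
    if not level:
        break
    frequent[k] = level

print(sorted("".join(sorted(s)) for s in frequent[2]))  # ['EK', 'EO', 'KM', 'KO', 'KY']
print(sorted("".join(sorted(s)) for s in frequent[3]))  # ['EKO']
```

The printed 2-itemsets are KE, OE, MK, OK, KY and the only 3-itemset is OKE, matching L2 and L3 above.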
MINING FREQUENT PATTERNS
FP growth tree
 Dataset (min_sup = 3)
Step 1: Calculate the sup_count for each item.

ID    Items bought
100   f, c, a, d, g, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o, w
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Item  Sup_count
f     4
c     4
a     3
b     3
m     3
p     3
FP growth tree
Step 2: Keep only the frequent items in each transaction, order them by descending support, and insert each transaction into the FP-tree.

ID    Items bought (ordered)
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

Resulting FP-tree (node:count):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
FP growth tree

Step 3: For each frequent item, collect its conditional pattern base from the tree.

Item  Conditional Pattern Base    Frequent pattern generated
p     {fcam: 2, cb: 1}
m     {fca: 2, fcab: 1}
a     {fc: 3}                     fc: 3
c     {f: 3}                      f: 3
b     {fca: 1, f: 1, c: 1}
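The conditional pattern bases above can be derived directly from the ordered transactions of Step 2 (a sketch only; real FP-growth walks the tree via node links instead of rescanning transactions):

```python
# Transactions after Step 2: frequent items only, in descending-support order.
ordered_txns = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item):
    # Map each prefix path preceding `item` to the number of times it occurs.
    base = {}
    for t in ordered_txns:
        if item in t:
            prefix = tuple(t[: t.index(item)])  # items before `item` in the path
            if prefix:
                base[prefix] = base.get(prefix, 0) + 1
    return base

print(conditional_pattern_base("p"))  # {('f', 'c', 'a', 'm'): 2, ('c', 'b'): 1}
print(conditional_pattern_base("m"))  # {('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1}
```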
Assignment 2
 Apply the Apriori algorithm to find the frequent itemsets. min_sup = 2

Dataset
TID   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
