Data Mining: Department of Information Technology University of The Punjab, Jhelum Campus


DATA MINING

Ayesha Irfan
Department of Information Technology
University Of The Punjab, Jhelum Campus
Data mining
 Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, for example information that can be used to increase revenue.
 Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD.
 Data, Information and Knowledge
 Data is any facts, numbers, or text that can be processed by a computer.
 Information: the patterns, associations, or relationships among the data can provide information.
Data mining
 Knowledge: the information can be converted into knowledge about historical trends and future patterns.
 Data mining —> searching for knowledge (interesting patterns) in data.
Knowledge discovery process

Data mining as a step in the process of knowledge discovery.


Knowledge discovery process
 Steps of knowledge discovery process
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task
are retrieved from the database)
4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied to extract data patterns)
Knowledge discovery process

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
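The seven steps above can be sketched as a tiny pipeline. This is an illustrative sketch only: the toy records and the helper functions (clean, integrate, select, transform) are invented for the example, not taken from any library.

```python
# Toy raw data gathered from two hypothetical sources.
raw = [
    {"id": 1, "salary": 30},
    {"id": 2, "salary": None},   # inconsistent record, to be cleaned
    {"id": 1, "salary": 30},     # duplicate arriving from a second source
    {"id": 3, "salary": 110},
]

def clean(records):
    # Step 1: data cleaning - drop records with missing values
    return [r for r in records if r["salary"] is not None]

def integrate(records):
    # Step 2: data integration - merge duplicates from multiple sources
    seen, merged = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            merged.append(r)
    return merged

def select(records):
    # Step 3: data selection - keep only the attribute relevant to the task
    return [r["salary"] for r in records]

def transform(values):
    # Step 4: data transformation - consolidate into an aggregated form
    return {"n": len(values), "total": sum(values)}

# Steps 5-7 (mining, evaluation, presentation) would consume this summary.
summary = transform(select(integrate(clean(raw))))
print(summary)  # {'n': 2, 'total': 140}
```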
Data mining: Confluence of multiple disciplines
Basic Statistical Description of data

 For data pre-processing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
Basic Statistical Description of data

 Mean
 Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
 Mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
 = 696 / 12 = 58
 Thus, the mean salary is $58,000.
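The arithmetic above can be checked with a few lines of Python:

```python
# Salary values (in thousands of dollars) from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = sum(salaries) / len(salaries)
print(mean)  # 58.0, i.e. a mean salary of $58,000
```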


Basic Statistical Description of data

 Median
 Odd number of values. Data: 1, 3, 3, 6, 7, 8, 9 (n = 7)
 Median = ((n + 1)/2)th value = 4th value = 6
 Even number of values. Data: 1, 2, 3, 4, 7, 8, 9, 10 (n = 8)
 Median = (4th value + 5th value)/2 = (4 + 7)/2 = 5.5
Basic Statistical Description of data

 Mode
 The mode is the value that appears most often.
 Some data may not have a mode.
 Some data may have more than one mode; data with two modes are known as bimodal.
 Data: 0, 2, 1, 0, 0, 3, 2, 4, 2, 2
 Mode: 2
 Data: 5, 2, 1, 3, 1, 4, 5, 5, 4, 1
 Mode: 1 and 5
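A sketch in Python that finds every mode (so bimodal data returns two values), using the standard library's Counter:

```python
from collections import Counter

def modes(values):
    counts = Counter(values)   # value -> number of occurrences
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0, 2, 1, 0, 0, 3, 2, 4, 2, 2]))  # [2]
print(modes([5, 2, 1, 3, 1, 4, 5, 5, 4, 1]))  # [1, 5]
```

Note this sketch always returns the most frequent values; detecting "no mode" (every value equally frequent) would need one extra check.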
Basic Statistical Description of data

 Standard Deviation
 Data: 4, 2, 5, 8, 6 (n = 5, mean x̄ = 5)

x    x̄    x - x̄    (x - x̄)²
2    5    -3        9
4    5    -1        1
5    5     0        0
6    5     1        1
8    5     3        9
              Σ(x - x̄)² = 20

 σ = √(Σ(x - x̄)² / (n - 1)) = √(20/4) = √5 = 2.24
 Variance
 Variance = σ² = (√5)² = 5
Box Plot Theory
 Dataset: 14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20
 Arranged data: 1, 1, 2, 2, 3, 3, 4, 4, 6, 7, 8, 10, 11, 14, 15, 20, 22 (n = 17)
 Median = 9th value = 6
 Q1 = median of the lower 8 values = (2 + 3)/2 = 2.5
 Q3 = median of the upper 8 values = (11 + 14)/2 = 12.5
 Whiskers extend to the minimum and maximum: 1 and 22
 Five-number summary: 1, 2.5, 6, 12.5, 22
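The five-number summary above can be computed with a short sketch (using the slide's quartile convention: the halves on either side of the median, excluding it, give Q1 and Q3):

```python
def median(s):  # s must already be sorted
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

data = sorted([14, 6, 3, 2, 4, 15, 11, 8, 1, 7, 2, 1, 3, 4, 10, 22, 20])
n = len(data)                      # 17 values
med = median(data)                 # 6, the 9th value
q1 = median(data[: n // 2])        # median of the lower 8 values -> 2.5
q3 = median(data[(n + 1) // 2 :])  # median of the upper 8 values -> 12.5
print(data[0], q1, med, q3, data[-1])  # 1 2.5 6 12.5 22
```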
MINING FREQUENT ITEMSETS
Apriori Algorithm
 Let’s look at an example. The dataset is given in the table below. We will apply the Apriori algorithm to find the frequent itemsets. min_sup = 3

TID   Items
T1    M, O, N, K, E, Y
T2    D, O, N, K, E, Y
T3    M, A, K, E
T4    M, U, C, K, Y
T5    C, O, O, K, I, E
Apriori Algorithm
 Step 1: Count the support of each item to form the candidate set C1, then keep the items with sup_count ≥ min_sup to form L1.

C1:
Item  Sup_count
M     3
O     4
N     2
K     5
E     4
Y     3
D     1
A     1
U     1
C     2

L1:
Item  Sup_count
M     3
O     4
K     5
E     4
Y     3
Apriori Algorithm

C2 (candidate 2-itemsets):
Item  Sup_count
MO    1
MK    3
ME    2
MY    2
OK    3
OE    3
OY    2
KE    4
KY    3
EY    2

L2 (frequent 2-itemsets):
Item  Sup_count
MK    3
OK    3
OE    3
KE    4
KY    3
Apriori Algorithm

C3 (candidate 3-itemsets):
Item  Sup_count
MKO   1
MKE   2
MKY   2
OKE   3
OKY   2
KEY   2

L3 (frequent 3-itemsets):
Item  Sup_count
OKE   3
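The L1/L2/L3 results above can be verified with a brute-force support count (a sketch only: it checks every candidate rather than using Apriori's candidate generation and pruning):

```python
from itertools import combinations

# Each transaction is treated as a set, so the repeated O in T5 counts
# once, giving O a support of 3 (still frequent at min_sup = 3).
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
min_sup = 3

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c) for c in combinations(items, k)
             if support(set(c)) >= min_sup}
    if not level:
        break
    frequent[k] = level

print(sorted("".join(sorted(s)) for s in frequent[2]))  # ['EK', 'EO', 'KM', 'KO', 'KY']
print(sorted("".join(sorted(s)) for s in frequent[3]))  # ['EKO']
```

The printed 2-itemsets are KE, OE, MK, OK, KY and the only 3-itemset is OKE, matching L2 and L3 above.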
MINING FREQUENT PATTERNS
FP growth tree
 Dataset (min_sup = 3)
Step 1: Calculate the sup_count for each item.

ID    Items bought
100   f, c, a, d, g, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o, w
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

Item  Sup_count
f     4
c     4
a     3
b     3
m     3
p     3
FP growth tree
Step 2: Keep only the frequent items in each transaction, order them by descending support, and insert each transaction into the FP-tree.

ID    Items bought (ordered)
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

Resulting FP-tree (node:count):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
FP growth tree

Step 3: For each frequent item, collect its conditional pattern base from the tree.

Item  Conditional Pattern Base    Frequent pattern generated
p     {fcam: 2, cb: 1}
m     {fca: 2, fcab: 1}
a     {fc: 3}                     fc: 3
c     {f: 3}                      f: 3
b     {fca: 1, f: 1, c: 1}
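The conditional pattern bases above can be derived directly from the ordered transactions of Step 2 (a sketch only; real FP-growth walks the tree via node links instead of rescanning transactions):

```python
# Transactions after Step 2: frequent items only, in descending-support order.
ordered_txns = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item):
    # Map each prefix path preceding `item` to the number of times it occurs.
    base = {}
    for t in ordered_txns:
        if item in t:
            prefix = tuple(t[: t.index(item)])  # items before `item` in the path
            if prefix:
                base[prefix] = base.get(prefix, 0) + 1
    return base

print(conditional_pattern_base("p"))  # {('f', 'c', 'a', 'm'): 2, ('c', 'b'): 1}
print(conditional_pattern_base("m"))  # {('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1}
```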
Assignment 2
 Apply the Apriori algorithm to find the frequent itemsets. min_sup = 2

Dataset
TID   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
