Data Mining Course Overview
Course Overview
Instructor:
Home Page:
http://www.cs.bu.edu/fac/gkollios/dm07
Check frequently! Syllabus, schedule,
assignments, announcements
Grading
Overview of terms
Knowledge Discovery
Scientific
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining shown together with related fields: AI / Machine Learning, Statistics, Database systems]
2. Association Rules:
Used to find associations between sets of
attributes
3. Sequential patterns:
Used to find temporal associations in time series
4. Hierarchical clustering:
Used to group customers, web users, etc.
Data Cleaning
Classification: Definition
Classification Example
[Table column types: categorical, categorical, continuous, class]
[Figure: Training Set --> Learn Classifier --> Model; the Test Set is applied to the Model]
[Training data column types: categorical, categorical, continuous, class]
Splitting Attributes
Tree 1 (learned from the Training Data):
HO
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO

Tree 2 (same Training Data, different split order):
MarSt
  Married: NO
  Single, Divorced: HO
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES
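The two trees on these slides can be read as nested conditions. The following is a minimal Python sketch of that reading, assuming HO is a Yes/No attribute, MarSt is one of Single, Divorced, or Married, and TaxInc is a number in thousands. The function names are hypothetical, and sending an income of exactly 80K down the "> 80K" branch is an assumption, since the slides do not say which branch it takes.

```python
def classify_tree_1(ho, marst, taxinc):
    """First tree: split on HO, then MarSt, then TaxInc."""
    if ho == "Yes":
        return "NO"
    if marst == "Married":
        return "NO"
    # Single or Divorced: decide on taxable income (in thousands).
    # Assumption: exactly 80 is treated like "> 80K".
    return "NO" if taxinc < 80 else "YES"


def classify_tree_2(ho, marst, taxinc):
    """Second tree: split on MarSt, then HO, then TaxInc."""
    if marst == "Married":
        return "NO"
    if ho == "Yes":
        return "NO"
    return "NO" if taxinc < 80 else "YES"


# Hypothetical record, not taken from the slides.
record = {"ho": "No", "marst": "Single", "taxinc": 95}
print(classify_tree_1(**record), classify_tree_2(**record))
```

Under this reading both functions return the same label for every input, which illustrates that more than one tree can fit the same training data.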
Classification: Application 1
Direct Marketing
Classification: Application 2
Fraud Detection
Use credit card transactions and information about the account holder as attributes.
When does a customer buy, what does the customer buy, how often does the customer
pay on time, etc.
Label past transactions as fraud or fair transactions. This forms
the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
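The label, learn, and apply steps above can be sketched in a few lines of Python. This is only an illustration: the transaction values are made up, the attribute choice (amount, hours since last purchase) is hypothetical, and the model is a one-attribute threshold rule (a "decision stump"), chosen here for brevity rather than because the slides prescribe it.

```python
# Hypothetical labeled history: (amount, hours_since_last_purchase, label).
# These values are made up for illustration only.
history = [
    (25.0, 30.0, "fair"),
    (40.0, 12.0, "fair"),
    (15.0, 48.0, "fair"),
    (900.0, 0.5, "fraud"),
    (1200.0, 0.2, "fraud"),
    (60.0, 20.0, "fair"),
    (750.0, 1.0, "fraud"),
]


def predict(threshold, value):
    """Apply the learned rule: values above the threshold are flagged."""
    return "fraud" if value > threshold else "fair"


def learn_stump(rows, attr):
    """Pick the threshold on one attribute that misclassifies the fewest rows."""
    values = sorted({row[attr] for row in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for t in candidates:
        errors = sum(predict(t, row[attr]) != row[-1] for row in rows)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# Learn on the labeled history (attribute 0 is the amount), then apply.
threshold, training_errors = learn_stump(history, attr=0)
print(threshold, training_errors)
print(predict(threshold, 1000.0))
```

A real fraud model would use many attributes and would be checked on a held-out test set; the stump only shows how labeled past transactions turn into a rule that can be applied to new ones.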
Clustering Definition
Similarity Measures:
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
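To make the two quantities concrete, the sketch below computes an average intracluster and an average intercluster Euclidean distance for points in 3-D space. The points, the cluster names, and the choice of "average over pairs" as the summary are assumptions made for illustration; the slides do not fix a particular definition.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical 3-D points, already grouped into two clusters.
clusters = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "B": [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)],
}


def mean_intracluster_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    pairs = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)


def mean_intercluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)


intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
print(f"intracluster: {intra:.2f}, intercluster: {inter:.2f}")
```

A good clustering of these points keeps the first number small relative to the second.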
Clustering: Application 1
Market Segmentation:
Clustering: Application 2
Document Clustering:
Illustrating Document Clustering
Clustering Points: 3204 articles from the Los Angeles Times.
Similarity Measure: How many words are common in
these documents (after some word filtering).
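A common-word similarity of this kind could be sketched as follows. The stop-word list, the punctuation handling, and the example sentences are all assumptions made for illustration; the slides do not say which words were filtered or how the text was tokenized.

```python
# Hypothetical stop-word list; the actual filtering is not specified.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}


def content_words(text):
    """Lowercase the text, strip simple punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct words the two documents share."""
    return len(content_words(doc_a) & content_words(doc_b))


# Made-up snippets, not taken from the Los Angeles Times collection.
doc_1 = "The stock market rallied on strong bank earnings."
doc_2 = "Bank earnings lifted the stock market."
doc_3 = "The home team won the final game of the season."

print(common_word_count(doc_1, doc_2))  # shares stock, market, bank, earnings
print(common_word_count(doc_1, doc_3))  # shares no content words
```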
Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
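The fraction of articles placed correctly in each category follows directly from these counts. The short sketch below does the arithmetic; the variable names are hypothetical and the numbers are copied from the table.

```python
# (total articles, correctly placed) per category, copied from the table.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())

# Per-category fraction, then the overall fraction.
for category, (total, correct) in results.items():
    print(f"{category:<14}{correct / total:6.1%}")
print(f"{'Overall':<14}{total_correct / total_articles:6.1%}")
```

The category totals add up to the 3204 articles mentioned above, and National has by far the lowest fraction of correctly placed articles.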
Association Rule Discovery: Definition
TID   Items
1
2
3
4
5
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
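How strongly a rule such as {Milk} --> {Coke} holds is usually summarized by its support and confidence. The sketch below computes both for the two rules above. The item lists for the five transactions are not given in this text, so the transactions used here are hypothetical, and the function names are illustrative.

```python
# Hypothetical market-basket transactions keyed by TID (illustrative only;
# the actual item lists are not given in the text).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(
        sorted(lhs), "-->", sorted(rhs),
        f"support={support(lhs | rhs):.2f}",
        f"confidence={confidence(lhs, rhs):.2f}",
    )
```

With these assumed transactions, a rule is reported when its support and confidence clear thresholds chosen by the analyst; the thresholds themselves are a design decision, not something the data dictates.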
Data Compression
[Figure: Original Data --> Compressed Data. Lossless: the Original Data is recovered exactly. Lossy: only an approximation is recovered (Original Data Approximated).]
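The difference between the two arrows in the figure can be shown with a small Python sketch. The lossless half uses the standard-library zlib module; the lossy half uses rounding to one decimal place as a stand-in for a real lossy scheme, which is an assumption made purely for illustration. The sample values are made up.

```python
import zlib

# Lossless: the decompressed bytes are identical to the original.
original = b"data mining " * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# Lossy (illustrative): keep one decimal place, so only an approximation
# of the original readings can be recovered afterwards.
readings = [20.13, 20.17, 19.98, 21.42]  # made-up values
approximated = [round(x, 1) for x in readings]
max_error = max(abs(a - b) for a, b in zip(readings, approximated))
print(approximated, max_error)
```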
Numerosity Reduction:
Reduce the volume of data
Parametric methods
Non-parametric methods
Clustering
Sampling
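As one illustration of a parametric method, the data can be assumed to follow a model and only the model parameters stored. The sketch below fits a straight line by ordinary least squares; the linear model, the function name, and the data values are assumptions chosen for the example, not something the slides specify.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying close to y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 7.0, 8.8, 11.1, 13.0, 14.9]

slope, intercept = fit_line(xs, ys)
# Two numbers now stand in for the eight stored y values.
print(round(slope, 2), round(intercept, 2))
```

Non-parametric methods such as clustering and sampling make no such model assumption and instead keep a reduced representation of the data itself.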
Sampling
[Figure: Raw Data sampled two ways: SRSWOR (simple random sample without replacement) and SRSWR]
Sampling
[Figure: Raw Data --> Cluster/Stratified Sample]
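The sampling schemes named on these two slides can be sketched with the Python standard library. The raw data, the stratum labels, the sample sizes, and the helper name stratified_sample are all hypothetical; only the stratified variant of the second slide is shown, not cluster sampling.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical raw data: (record id, stratum label).
raw_data = [(i, "young" if i % 4 == 0 else "senior") for i in range(40)]

# SRSWOR: simple random sample without replacement (no record repeats).
srswor = rng.sample(raw_data, k=8)

# SRSWR: simple random sample with replacement (records may repeat).
srswr = rng.choices(raw_data, k=8)


def stratified_sample(rows, fraction, rng):
    """Draw the same fraction, without replacement, from each stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[1]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(raw_data, fraction=0.2, rng=rng)
print(len(srswor), len(srswr), len(stratified))
```

Sampling each stratum separately keeps small groups represented in proportion, which a simple random sample of the same size does not guarantee.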