Data Mining: Concepts and Techniques
Chapter 3: Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty: incomplete, noisy, and inconsistent, e.g., containing discrepancies in codes or names.
No quality data, no quality mining results!
Quality decisions must be based on quality data
A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
Data discretization
Part of data reduction, but of particular importance, especially for numerical data
Forms of Data Preprocessing
[Figure: the forms of preprocessing: data cleaning, integration, transformation, and reduction]
Data Cleaning
Data cleaning tasks: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to
equipment malfunction
inconsistency with other recorded data, leading to deletion
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter.
Use the most probable value to fill in the missing value: e.g., inference-based methods such as decision tree induction.
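To make the filling strategies concrete, here is a minimal sketch using pandas; the toy table and the column names income and class are hypothetical:

```python
import pandas as pd

# Hypothetical sales data with missing customer incomes.
df = pd.DataFrame({
    "income": [50_000, None, 72_000, None, 61_000, 58_000],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class.
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```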
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to, e.g., technology limitations.
Other data problems that require cleaning: incomplete data, inconsistent data.
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, or bin boundaries
Regression:
smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
divides the range into N intervals of equal size; if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N
Equal-depth (frequency) partitioning:
divides the range into N intervals, each containing approximately the same number of samples
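A minimal sketch of equi-depth binning with smoothing by bin means, using NumPy; the price list is made up for illustration:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Equi-depth binning: sort the data, split it into bins of (nearly)
    equal size, and replace each value by its bin mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)                 # indices in sorted order
    smoothed = np.empty_like(values)
    for bin_idx in np.array_split(order, n_bins):
        smoothed[bin_idx] = values[bin_idx].mean()
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, n_bins=3))
```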
Regression
[Figure: a regression line y = x + 1 fitted to data; a point (X1, Y1) is smoothed to the value Y1' on the line]
Data integration and transformation
Data Integration
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources.
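As a minimal sketch, combining records from two sources on a shared key can be done with a pandas merge; the tables and column names (cust_id, customer_id, income) are hypothetical, and real schema integration also has to reconcile metadata such as differing attribute names:

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
sales = pd.DataFrame({"customer_id": [1, 2, 4], "income": [50_000, 61_000, 58_000]})

# Schema integration step: the key is named differently in each source.
merged = crm.merge(sales, left_on="cust_id", right_on="customer_id", how="outer")
print(merged)
```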
Data Reduction Strategies
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction: obtains a reduced representation of the data set.
Strategies include:
Dimensionality reduction
Numerosity reduction
Data Cube Aggregation
The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
Reference appropriate levels: use the smallest representation that is sufficient to solve the task.
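As a minimal sketch, climbing one level of aggregation can be emulated with a pandas group-by; the sales table and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical sales at the lowest level: one row per (branch, year, quarter).
sales = pd.DataFrame({
    "branch":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":    [2023] * 8,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [224, 408, 350, 586, 301, 512, 410, 620],
})

# Roll up from quarters to years: a smaller representation that still
# answers yearly queries exactly.
yearly = sales.groupby(["branch", "year"], as_index=False)["amount"].sum()
print(yearly)
```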
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
select a minimum set of features such that the class distribution given those features stays as close as possible to the distribution given all features; the resulting patterns are also easier to understand
Heuristic methods (due to the exponential number of choices):
step-wise forward selection
Example of Decision Tree Induction
[Figure: a decision tree whose internal nodes test attributes such as A1 and A6; the attributes used in the tree form the selected feature subset]
Heuristic Feature Selection Methods
There are 2^d possible sub-features of d features.
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Step-wise feature elimination: repeatedly eliminate the worst feature
Best combined feature selection and elimination
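By way of illustration, here is a minimal sketch of the complementary step-wise forward strategy; the scoring function (R^2 of a least-squares fit) and the synthetic data are assumptions for the demo:

```python
import numpy as np

def forward_select(X, y, score, k):
    """Greedy step-wise forward selection: at each step, add the single
    feature whose inclusion gives the best score (higher is better)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_f, best_s = None, -np.inf
        for f in remaining:
            s = score(X[:, selected + [f]], y)
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy score: R^2 of a least-squares fit on the candidate subset.
def r2(Xs, y):
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(size=200)
print(forward_select(X, y, r2, k=2))   # expect features 1 and 4
```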
Data Compression
String compression:
there are extensive theories and well-tuned algorithms
typically lossless, but only limited manipulation is possible without expansion
Audio/video compression:
typically lossy compression, with progressive refinement
sometimes small fragments of the signal can be reconstructed without reconstructing the whole
[Figure: original data mapped to a compressed form losslessly, or lossily to an approximation of the original data]
Wavelet Transforms
[Figure: Haar-2 and Daubechies-4 wavelet families]
Discrete wavelet transform (DWT): linear signal processing.
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method:
the length, L, must be an integer power of 2 (pad with 0s when necessary)
each transform has 2 functions: smoothing and difference
each pass applies to pairs of data, resulting in two sets of data of length L/2
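As a minimal sketch of the method, a Haar transform applies the smoothing (pairwise average) and difference functions recursively, halving the length each pass; the 1/sqrt(2) normalization is one common convention:

```python
import numpy as np

def haar_dwt(signal):
    """Full Haar wavelet transform. Input length must be a power of 2
    (pad with 0s first). Returns detail coefficients per level plus the
    final smoothed value."""
    data = np.asarray(signal, dtype=float)
    coeffs = []
    while len(data) > 1:
        smooth = (data[0::2] + data[1::2]) / np.sqrt(2)   # length L/2
        detail = (data[0::2] - data[1::2]) / np.sqrt(2)   # length L/2
        coeffs.append(detail)
        data = smooth
    coeffs.append(data)
    return coeffs

# Keep only the strongest coefficients to get a compressed approximation.
c = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print([np.round(level, 2) for level in c])
```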
Principal Component Analysis
[Figure: data in the (X1, X2) plane with principal component axes Y1 and Y2; Y1 points along the direction of greatest variance]
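A minimal sketch of PCA via the SVD of mean-centered data; the strongly correlated synthetic data set is made up for the demo:

```python
import numpy as np

def pca(X, c):
    """Project n-by-k data onto its top-c principal components,
    computed from the SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)               # center each attribute
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:c].T                  # n-by-c reduced representation

rng = np.random.default_rng(1)
# Correlated 2-D data: most variance lies along one direction.
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) \
    + 0.1 * rng.normal(size=(100, 2))
Z = pca(X, c=1)
print(Z.shape)   # (100, 1)
```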
Numerosity Reduction
Parametric methods:
assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
log-linear models: obtain the value at a point in m-D space as a product over the appropriate marginal subspaces
Non-parametric methods:
do not assume models
major families: histograms, clustering, sampling
Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line, often found by the least-squares method.
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
Log-linear model: approximates discrete multidimensional probability distributions.
Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket.
Can be constructed optimally in one dimension using dynamic programming.
Related to quantization problems.
[Figure: example histogram with counts (0 to 40) over value buckets from 10,000 to 100,000]
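A minimal sketch of equal-width histogram reduction, storing only a count and a mean per bucket; the simulated price data is hypothetical:

```python
import numpy as np

def histogram_reduce(values, n_buckets):
    """Equal-width histogram reduction: per bucket, keep only the
    count and mean instead of the raw values."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_buckets + 1)
    # Assign each value to a bucket (the last edge is inclusive).
    idx = np.clip(np.searchsorted(edges, values, side="right") - 1,
                  0, n_buckets - 1)
    counts = np.bincount(idx, minlength=n_buckets)
    sums = np.bincount(idx, weights=values, minlength=n_buckets)
    means = np.divide(sums, counts, out=np.zeros(n_buckets), where=counts > 0)
    return edges, counts, means

prices = np.random.default_rng(2).integers(10_000, 100_000, size=1_000)
edges, counts, means = histogram_reduce(prices, n_buckets=9)
print(counts, means.round(0))
```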
Clustering
Partition the data set into clusters, and store only a compact representation of each cluster (e.g., centroid and diameter).
Sampling
[Figure: raw data reduced by a simple random sample without replacement (SRSWOR) and a simple random sample with replacement (SRSWR)]
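Both sampling schemes are one-liners in Python's standard library; the data here is just a stand-in:

```python
import random

data = list(range(1, 101))          # hypothetical raw data

# SRSWOR: simple random sample without replacement.
srswor = random.sample(data, k=10)

# SRSWR: simple random sample with replacement.
srswr = random.choices(data, k=10)

print(srswor, srswr, sep="\n")
```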
Hierarchical Reduction
Use a multi-resolution structure with different degrees of reduction.
Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”.
Parametric methods are usually not amenable to hierarchical representation.
Hierarchical aggregation: an index tree hierarchically divides a data set.
Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals.
Some classification algorithms only accept categorical attributes.
Discretization also reduces data size.
Discretization and Concept Hierarchy
Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
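For instance, replacing numeric ages with the higher-level concepts above takes one pandas call; the cut points 30 and 55 are hypothetical:

```python
import pandas as pd

ages = pd.Series([23, 35, 47, 58, 62, 71, 19, 44])

# Replace numeric ages with interval labels (higher-level concepts).
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())
```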
Discretization and Concept Hierarchy Generation for Numeric Data
Entropy-based discretization
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., Ent(S) − E(T, S) > δ.
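Here is a minimal sketch of finding one such boundary T; the recursion and the δ-based stopping test are omitted for brevity:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Class entropy Ent(S) of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_boundary(values, labels):
    """Return the boundary T minimizing the weighted entropy E(S, T),
    trying the midpoints between consecutive sorted values."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    y = np.asarray(labels)[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        t = (v[i] + v[i - 1]) / 2
        # |S1| = i samples below t, |S2| = len(v) - i samples above.
        e = (i * entropy(y[:i]) + (len(v) - i) * entropy(y[i:])) / len(v)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

vals = [1, 2, 3, 10, 11, 12]
labs = ["a", "a", "a", "b", "b", "b"]
print(best_boundary(vals, labs))   # boundary 6.5, entropy 0.0
```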
[Figure: stepwise natural partitioning of a numeric range, e.g., Step 3: (-$1,000 - $2,000); Step 4: (-$4,000 - $5,000)]
Specification of a Set of Attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
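A minimal sketch of that ordering rule; the location table and attribute names are hypothetical:

```python
# Order attributes into a hierarchy by their number of distinct values
# (most distinct values -> lowest level of the hierarchy).
locations = [
    ("Canada", "British Columbia", "Vancouver"),
    ("Canada", "Ontario", "Toronto"),
    ("Canada", "Ontario", "Ottawa"),
    ("USA", "California", "San Francisco"),
    ("USA", "California", "Los Angeles"),
]
attrs = {"country": 0, "province_or_state": 1, "city": 2}

distinct = {name: len({row[i] for row in locations})
            for name, i in attrs.items()}
# Highest level first (fewest distinct values).
hierarchy = sorted(distinct, key=distinct.get)
print(hierarchy)   # ['country', 'province_or_state', 'city']
```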