Lecture 09 DM

Data Preprocessing
Why data preprocessing
• Today's real-world databases are noisy, contain missing values, and are inconsistent, largely because of their huge size.
• Good data preprocessing before data mining not only improves the quality of the mining results but also eases the mining process.
• There are a number of data preprocessing techniques:
– Data Cleaning:- Remove noise, correct inconsistencies
– Data Integration:- Merge data from multiple sources
– Data Transformation:- Improve distance measurements
– Data Reduction:- Eliminate redundant features or instances
Data Cleaning
• Data cleaning routines
– Fill in missing values.
– Identify outliers.
– Correct inconsistencies.
• Missing values:-
– Note that many tuples have no recorded value for several attributes, such as Country and Refund.

TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2    ?       UK         100K            No
3    No      Australia  70K             No
4    ?       ?          120K            No
5    No      NZL        95K             Yes
6    No      USA        60K             No
7    Yes     UK         220K            No
8    No      ?          85K             Yes
9    No      UK         ?               No
10   No      ?          ?               Yes
(? = no recorded value)
Missing values Techniques
• Ignore the record:-
– This is usually done when the class label is missing.
• Use the attribute mean to fill in the missing value:-
– For example, take the average value of the income attribute and use it to replace the missing income value.
• Use the attribute median to fill in the missing value:-
– The main difference between the mean and the median is that the median is robust to outliers (extreme values).
• Use classification rules to fill in the missing value:-
– The most probable value can be predicted with a decision tree, a neural network, or Bayesian modeling.
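A minimal sketch of mean and median imputation, assuming missing values are represented as None. The function name impute is hypothetical; the income figures are the Taxable Income column of the example table, read with TIDs 9 and 10 as missing.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Return a copy of `values` with None replaced by the mean or median
    of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Taxable income in thousands for TIDs 1-10; None marks a missing value.
income = [125, 100, 70, 120, 95, 60, 220, 85, None, None]

print(impute(income, "mean")[8])    # 109.375
print(impute(income, "median")[8])  # 97.5
```

The single high income of 220K pulls the mean up to 109.375K, while the median stays at 97.5K. This is the robustness difference described above.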
Missing values Techniques (cont)
• K Nearest Neighbors
– The missing value is replaced on the basis of the record's K nearest neighbors. Neighbors are determined by a distance measure.
– Once the K neighbors are determined, the mean or median of their values is used for the missing value.
[Figure: the missing-value record contains a missing value for the income attribute]
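The K nearest neighbors idea can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: records are dictionaries, distance is Euclidean over the chosen feature attributes, and the fill value is the mean over the K neighbors. The attribute names and data are hypothetical.

```python
import math
from statistics import mean

def knn_impute(records, target, features, k=3):
    """Fill missing `target` values with the mean of that attribute over
    the k nearest complete records (Euclidean distance on `features`)."""
    complete = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        # Sort complete records by distance to the incomplete one, keep k.
        nearest = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        filled.append({**r, target: mean(c[target] for c in nearest)})
    return filled

# Hypothetical records; income is in thousands, None marks a missing value.
records = [
    {"age": 25, "years": 2, "income": 40},
    {"age": 27, "years": 3, "income": 45},
    {"age": 45, "years": 20, "income": 110},
    {"age": 50, "years": 25, "income": 120},
    {"age": 26, "years": 2, "income": None},
]

result = knn_impute(records, "income", ["age", "years"], k=2)
print(result[-1]["income"])  # 42.5
```

In practice the features would usually be normalized first, so that an attribute with a large range does not dominate the distance.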
Noisy Data
• Noise is a random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection tools
– data entry problems
– data transmission problems
• Binning methods:-
– first sort the data and partition it into (equi-depth) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
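The binning example above can be reproduced with a short sketch. The function names are hypothetical, and two choices are assumptions: the number of values is taken to be divisible by the number of bins, and a value equally close to both boundaries goes to the lower one.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.
    Assumes len(values) is divisible by n_bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max.
    Ties go to the lower boundary (an assumption)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75 and 29.25; rounding gives the 9, 23 and 29 shown in the example.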
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar values are organized into groups or "clusters".
• Values that fall outside of the set of clusters may be considered outliers.
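As a rough illustration of the idea, the sketch below clusters one-dimensional values by the gaps between sorted neighbors and reports values in very small clusters as outliers. The lecture does not name a clustering algorithm, so this gap-based scheme, the function names, the thresholds and the data are all assumptions chosen for brevity.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values into clusters; a new cluster starts
    whenever the gap to the previous value exceeds max_gap."""
    data = sorted(values)
    if not data:
        return []
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Values belonging to clusters smaller than min_size are outliers."""
    return [v for c in gap_clusters(values, max_gap)
            if len(c) < min_size for v in c]

# Hypothetical data: two dense groups and one isolated value.
values = [10, 11, 12, 13, 50, 51, 52, 53, 54, 200]
print(gap_clusters(values, max_gap=5))
# [[10, 11, 12, 13], [50, 51, 52, 53, 54], [200]]
print(find_outliers(values, max_gap=5))  # [200]
```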
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store, as in data warehousing.
• Sources may include
– Databases
– Data cubes
– Flat files
• Problem (1): Schema integration
– Entity identification problem: how can we be sure that customer_id in one database and cust_number in another refer to the same entity?
• Problem (2): Detecting and resolving data value conflicts
– For the same entity, attribute values from different sources may differ.
– Possible reasons: different representations or different scales, e.g., price in dollars vs. price in rupees.
Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases are integrated
– The same attribute may have different names in different databases, e.g. customer_id and customer_number.
– One attribute may be a "derived" attribute in another table.
• Technique:
– Redundant data can often be detected by correlation analysis.
• Advantages of careful integration:
– Reduces or avoids redundancies
– Improves mining speed and quality.
Handling Redundant Data in Data Integration (cont)
• Correlation between two attributes A and B can be checked by computing their correlation coefficient:-
1. Resulting value > 0: A and B are positively correlated. If A increases, B also tends to increase. A high value suggests that A or B can be removed as redundant.
2. Resulting value = 0: A and B are uncorrelated (there is no linear relationship between them).
3. Resulting value < 0: A and B are negatively correlated. If the value of A increases, the value of B tends to decrease.
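The three cases can be checked numerically. The slide text does not preserve the formula it uses, so the sketch below assumes the Pearson correlation coefficient; the function name and the sample data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

a = [1, 2, 3, 4, 5]
doubled = [2, 4, 6, 8, 10]    # rises with a: coefficient near +1
reversed_ = [10, 8, 6, 4, 2]  # falls as a rises: coefficient near -1
v_shape = [2, 1, 0, 1, 2]     # |a - 3|: coefficient near 0

print(pearson(a, doubled), pearson(a, reversed_), pearson(a, v_shape))
```

The last series is a deterministic function of a, yet its coefficient is zero. A zero coefficient therefore rules out a linear relationship only, not dependence in general.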
Data Transformation: Data Normalization
• Numeric attributes are normalized so that their values fall within a small specified range, such as 0 to 1.0 or -1.0 to 1.0.
Data Normalization
• min-max normalization
$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$
– Example:- Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000. By min-max normalization to the range 0 to 1.0, a value of 73,600 for income is transformed to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
Data Normalization
• z-score normalization
– This method of normalization is useful when the actual minimum and maximum of an attribute are unknown,
– or when there are outliers that dominate the min-max normalization.
$$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$$
• Decimal scaling normalization
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that $\max(|v'|) < 1$.
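Both normalizations can be sketched in a few lines. Two points are assumptions not settled by the slides: the z-score uses the population standard deviation, and j in decimal scaling is restricted to non-negative integers. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """z-score normalization: subtract the mean, divide by the
    (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer such
    that every scaled absolute value is below 1. Returns (scaled, j)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(decimal_scaling([-986, 917]))
# ([-0.986, 0.917], 3)
```

In the second call the largest absolute value is 986, so j = 3 and every value is divided by 1,000.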
