Lecture 09 DM


Data Preprocessing

Why data preprocessing

 Today’s real-world databases tend to be noisy, incomplete (missing values), and inconsistent because of their huge size.

 Good preprocessing of the data before mining not only “improves the quality of the mining results” but also “eases the mining process”.

 There are a number of data preprocessing techniques:
– Data Cleaning:- Remove noise, correct inconsistencies
– Data Integration:- Merge data from multiple sources
– Data Transformation:- Improve distance measurements
– Data Reduction:- Eliminate redundant features or instances
Data Cleaning

 Data cleaning routines:
– Fill in missing values.
– Identify outliers.
– Correct inconsistencies.
 Missing values:-
– You can note that many tuples have no recorded value for several attributes, such as Country and Refund.

TID   Refund   Country     Taxable Income   Cheat
1     Yes      USA         125K             No
2              UK          100K             No
3     No       Australia   70K              No
4                          120K             No
5     No       NZL         95K              Yes
6     No       USA         60K              No
7     Yes      UK          220K             No
8     No                   85K              Yes
9     No       UK                           No
10    No                                    Yes
Missing Values: Techniques

 Ignore the record:-


– This is usually done when the class label is missing.
 Use the attribute mean to fill in the missing value:-
– For example, take the average value of the income attribute and use it to replace the missing income values.

 Use the attribute Median to fill in the missing value:-


– The main difference between the mean and the median is that the median is robust to noise and outliers.

 Use classification rules to fill in the missing value:-


– The missing value can be predicted with a decision tree, a neural network, or a Bayesian model.
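As a concrete illustration of the mean and median strategies, here is a minimal Python sketch; the income values and the helper name fill_missing are hypothetical, not from the slides:

```python
from statistics import mean, median

# Hypothetical income column; None marks a missing value.
income = [125, 100, 70, 120, 95, 60, 220, 85, None, None]

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

print(fill_missing(income, "mean"))    # missing incomes become the average (109.375)
print(fill_missing(income, "median"))  # the median (97.5) is more robust to the 220 outlier
```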
Missing Values: Techniques (cont.)
 K Nearest Neighbors
– The missing values are replaced on the basis of the K nearest neighbors, where neighbors are determined using a distance measure.
– Once the K neighbors are found, the mean or median of their values can fill in the missing value.
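A minimal sketch of KNN-based imputation, assuming hypothetical (age, income) records and a simple one-attribute distance measure:

```python
from statistics import mean

# Hypothetical records: (age, income); the income of the last record is missing.
records = [(25, 40.0), (30, 48.0), (45, 80.0), (50, 85.0), (28, None)]

def knn_impute(records, target, k=2):
    """Fill the missing income of `target` using the mean income of its
    k nearest neighbours; distance is measured on the age attribute."""
    complete = [r for r in records if r[1] is not None]
    # Sort complete records by distance to the target record.
    neighbours = sorted(complete, key=lambda r: abs(r[0] - target[0]))[:k]
    return (target[0], mean(r[1] for r in neighbours))

missing = records[-1]
print(knn_impute(records, missing, k=2))  # (28, 44.0): mean of the 2 closest incomes
```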
[Figure: a sample record highlighted as containing a missing value for the income attribute]
Noisy Data

 Noise is a random error


 Incorrect attribute values may be due to
– faulty data collection tools
– data entry problems
– data transmission problems

 Binning methods:-
– first sort data and partition into (equi-depth) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
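The example above can be reproduced with a short Python sketch; the helper names and the tie-breaking rule for boundary smoothing (ties go to the lower boundary) are assumptions:

```python
# Equi-depth binning with mean and boundary smoothing,
# reproducing the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Partition sorted values into bins of (roughly) equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9]*4, [23]*4, [29]*4]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```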
Noisy Data (Clustering)

 Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.

 Values that fall outside of the set of clusters may be considered outliers.
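One simple way to realize this idea is to cluster a single attribute by the gaps between sorted values and flag very small clusters; the values, the gap threshold, and the minimum cluster size below are illustrative assumptions, not the lecture's specific method:

```python
# Clustering-based outlier detection: group sorted values into clusters
# wherever the gap between neighbours is small, then flag values in
# very small clusters as outlier candidates.
values = [2, 3, 4, 20, 21, 22, 50, 51, 99]

def cluster_by_gap(values, max_gap=5):
    """Single-linkage style clustering on one attribute: a new cluster
    starts whenever two consecutive sorted values are more than max_gap apart."""
    clusters, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return clusters

clusters = cluster_by_gap(values)
print(clusters)   # [[2, 3, 4], [20, 21, 22], [50, 51], [99]]
# Values that fall outside every sizeable cluster are outlier candidates.
outliers = [v for c in clusters if len(c) < 2 for v in c]
print(outliers)   # [99]
```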
Data Integration
 Data integration:
– Combines data from multiple sources into a coherent store as in data warehousing.
 Sources may include
– Databases
– Data cubes
– Flat files

 Problem(1): Schema integration


– Entity identification problem: how can we be sure that customer_id in one database and cust_number in another refer to the same entity?

 Problem(2): Detecting and resolving data value conflicts


– For the same entity, attribute values from different sources may differ.
– Possible reasons: different representations or different scales, e.g., price in dollars vs. price in rupees.
Handling Redundant Data in
Data Integration
 Redundant data often occur when multiple databases are integrated:
– The same attribute may have different names in different databases, e.g., customer_id and customer_number.
– One attribute may be a “derived” attribute in another table.
 Technique:
– Redundant data can often be detected by correlation analysis.

 Advantages:
– Reduce/avoid redundancies
– Improve mining speed and quality.
Handling Redundant Data in
Data Integration
 Correlation between two attributes A and B can be checked by computing the correlation coefficient
r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B):

1. Resulting value > 0: A and B are positively correlated; if A increases, B also increases. Either A or B can be removed as redundant.

2. Resulting value = 0: A and B are independent.

3. Resulting value < 0: A and B are negatively correlated; if the value of A increases, the value of B decreases.
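A minimal sketch of correlation analysis for redundancy detection, using the Pearson coefficient above; the two attribute columns and the 0.9 redundancy threshold are illustrative assumptions:

```python
from statistics import mean, pstdev

# Hypothetical attribute columns from an integrated dataset.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly 2 * a, so strongly correlated

def correlation(a, b):
    """Pearson correlation coefficient r_{A,B}."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (pstdev(a) * pstdev(b))

r = correlation(a, b)
print(round(r, 3))              # close to 1.0: A and B are positively correlated
if abs(r) > 0.9:                # threshold is an illustrative choice
    print("A and B look redundant; one of them could be dropped")
```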
Data Transformation: Data
Normalization
 Numeric attribute values are normalized so that they fall within a small specified range, such as 0.0 to 1.0 or −1.0 to 1.0.
Data Normalization
 min-max normalization
v  minA
v'  (new _ maxA  new _ minA)  new _ minA
maxA  minA
– Example:- Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000. By min-max normalization, a value of 73,600 for income is transformed to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Data Normalization

 z-score normalization
– This method of normalization is useful when the actual minimum and maximum of an attribute are unknown,
– or when outliers dominate the min-max normalization.

v' = (v − mean_A) / stand_dev_A
 Decimal scaling normalization

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
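Both normalizations as a brief Python sketch; the income values are illustrative:

```python
from statistics import mean, pstdev

# Hypothetical income values to normalize.
incomes = [12_000, 35_000, 54_000, 73_600, 98_000]

def z_score(values):
    """Normalize by subtracting the mean and dividing by the std deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print([round(v, 2) for v in z_score(incomes)])
print(decimal_scaling(incomes))  # 98,000 -> 0.98 with j = 5
```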
