Lecture 09 DM
Binning methods:-
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (each mean rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
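The worked example above can be reproduced with a short Python sketch. The function names (`equi_depth_bins`, `smooth_by_means`, `smooth_by_boundaries`) are illustrative and not from the lecture. The sketch assumes the number of values divides evenly into the bins, rounds bin means to the nearest integer as the example does, and sends a value that is equally close to both boundaries to the lower one.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins holding equally many values."""
    data = sorted(values)
    if n_bins <= 0 or len(data) % n_bins != 0:
        # Assumption of this sketch: the data divides evenly into the bins.
        raise ValueError("len(values) must be a positive multiple of n_bins")
    depth = len(data) // n_bins
    return [data[i:i + depth] for i in range(0, len(data), depth)]


def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a tie between the two boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by bin median, also mentioned above, would follow the same pattern with the median in place of the mean.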
Noisy Data (Clustering)
Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
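As a rough illustration of the idea, the sketch below clusters one-dimensional values by the gaps between sorted neighbours and reports values that end up in very small clusters. This gap-based rule is only one simple choice made for the example, not a method the lecture prescribes. The names `gap_clusters` and `find_outliers`, the thresholds `max_gap` and `min_size`, and the extra value 95 are all hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values: start a new cluster when the gap to the
    previous value is larger than max_gap."""
    data = sorted(values)
    if not data:
        return []
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # large gap: open a new cluster
    return clusters


def find_outliers(values, max_gap, min_size):
    """Return the values that fall in clusters smaller than min_size."""
    return [v
            for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]


# The price data from the binning example plus one made-up extreme value (95).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 95]
print(gap_clusters(prices, max_gap=6))
print(find_outliers(prices, max_gap=6, min_size=2))
```

With these assumed thresholds the original twelve prices form one cluster and 95 is left alone in its own cluster, so it is the only value reported.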
Handling Redundant Data in Data Integration
Advantages:
– Reduce/avoid redundancies
– Improve mining speed and quality.
Correlation between two attributes can be checked by correlation analysis.
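One common measure for this check, assumed here because the text does not name a specific one, is the Pearson correlation coefficient. A value near +1 or -1 suggests that one attribute can largely be derived from the other and may therefore be redundant. The function name `pearson_r` and the sample attributes are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical attributes: the same price stored in dollars and in cents.
price_usd = [4, 8, 9, 15]
price_cents = [400, 800, 900, 1500]
r = pearson_r(price_usd, price_cents)
print(r)  # very close to 1: the second attribute is redundant
```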
z-score normalization
– This method of normalization is useful when the actual minimum
and maximum of an attribute are unknown,
– or when there are outliers that dominate min-max normalization.
$$ v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A} $$
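A minimal Python sketch of z-score normalization follows. The name `z_score_normalize` is illustrative, and using the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or the sample form is meant.

```python
import statistics


def z_score_normalize(values):
    """Map each v to (v - mean_A) / stand_dev_A."""
    mean_a = statistics.mean(values)
    # Assumption: population standard deviation; use statistics.stdev
    # instead if the sample standard deviation is intended.
    stand_dev_a = statistics.pstdev(values)
    if stand_dev_a == 0:
        raise ValueError("all values are equal, so the z-score is undefined")
    return [(v - mean_a) / stand_dev_a for v in values]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
normalized = z_score_normalize(prices)
print(normalized)
```

After normalization the values have mean 0 and standard deviation 1, so no knowledge of the attribute's minimum or maximum is needed.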
Decimal scaling normalization
$$ v' = \frac{v}{10^{j}} $$
where j is the smallest integer such that Max(|v'|) < 1.
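Decimal scaling can be sketched as below. The name `decimal_scale` is illustrative, and the sketch assumes j is searched among non-negative integers only, so values that are already smaller than 1 in absolute value are left unchanged (j = 0).

```python
def decimal_scale(values):
    """Divide every value by 10**j, where j is the smallest non-negative
    integer that brings the largest absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
scaled, j = decimal_scale(prices)
print(j)       # the largest price is 34, so two decimal shifts are needed
print(scaled)
```

For an attribute ranging from -986 to 917, the same procedure gives j = 3, so -986 becomes -0.986 and 917 becomes 0.917.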