Class5 DataPreprocessing DataCleaning 23aug2021


23-08-2021

Data Preprocessing
Data Cleaning: Handling Missing Values, Noisy Data and Outliers

Data Cleaning (Data Cleansing)


• Real-world data tend to be incomplete, noisy and inconsistent
• Data cleaning routines attempt to identify and fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data

• One of the biggest data cleaning tasks is handling missing values


Data Cleaning: Missing Values


• Many tuples (records) have no recorded value for
several attributes
• Identifying missing values:
– When the Pandas library for Python is used, it represents
missing values as “NaN” [1]
– When reading a file, it automatically treats a blank attribute
value and attribute values such as “NaN”, “nan”, “NA”, “n/a”,
“NULL” and “null” as NaN

• Important note: If a numeric attribute has the value 0 (zero),
it is not treated as a missing value
– If 0 is not a correct value for that attribute, then it is simply noise
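
The detection rules above can be checked with a small sketch. The CSV content, the column names and the readings are invented for illustration; only `pandas.read_csv` and `DataFrame.isna` are real library calls.

```python
import io
import pandas as pd

# Hypothetical CSV content; the station readings are invented for illustration.
csv_text = (
    "StationID,Temperature,Humidity\n"
    "S1,25.4,NA\n"
    "S2,,81\n"
    "S3,0,n/a\n"
    "S4,NULL,79\n"
)

# read_csv converts blank fields and its default NA strings to NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask of missing cells and a per-column count of them.
missing_mask = df.isna()
missing_per_column = missing_mask.sum()

# The zero temperature of station S3 is a value, not a missing entry.
zero_is_missing = bool(missing_mask.loc[2, "Temperature"])
print(missing_per_column)
print("0 treated as missing:", zero_is_missing)
```

If a data set uses its own marker for missing entries, `read_csv` accepts extra markers through its `na_values` parameter.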

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


Methods to Handle Missing Values


• Ignore the tuples:
– This method is effective only when a tuple has missing values
for several attributes (> 50% of the attributes)
– This method is also used when the target variable (class
label) is missing, for example when the target attribute
(StationID) has a missing value
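
A minimal sketch of ignoring tuples, assuming a small invented table in which StationID is the target attribute. The 50% threshold follows the slide; the feature names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records; StationID plays the role of the target attribute.
df = pd.DataFrame({
    "StationID": ["S1", "S2", None, "S4"],
    "Temperature": [25.4, np.nan, 24.1, 23.9],
    "Humidity": [80.0, np.nan, 78.0, np.nan],
    "Rain": [0.0, np.nan, 1.2, 0.4],
})
feature_cols = ["Temperature", "Humidity", "Rain"]

# Fraction of feature attributes that are missing in each tuple.
missing_fraction = df[feature_cols].isna().mean(axis=1)

# Drop tuples with more than 50% of their attributes missing.
kept = df[missing_fraction <= 0.5]

# Drop tuples whose target attribute is missing.
kept = kept.dropna(subset=["StationID"])
print(kept)
```

Here the second record is dropped because all of its features are missing, and the third because its target is missing.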

Methods to Handle Missing Values


• Fill in the missing values manually (manual imputation):
– Time consuming
– Not feasible for a large data set with many missing
values

• Use a global constant to fill in the missing values (global-constant
imputation):
– Replace all missing values of an attribute with the same constant
– The imputed value may not be correct
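
Global-constant imputation can be sketched as follows. The table and the constants (-999.0 and "Unknown") are arbitrary illustrative choices, not values taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one numeric and one categorical attribute.
df = pd.DataFrame({
    "Temperature": [25.4, np.nan, 24.1],
    "Sky": ["clear", None, "cloudy"],
})

# One constant per attribute; both constants are arbitrary placeholders.
filled = df.fillna({"Temperature": -999.0, "Sky": "Unknown"})
print(filled)
```

A constant such as -999.0 is easy to spot later, but it can distort statistics such as the mean, which is the sense in which the imputed value may not be correct.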


Methods to Handle Missing Values


• Use the attribute mean/median/mode to fill in the missing
values (mean/median/mode imputation):
– Applicable to numeric data
– The centre of the data does not change, e.g. filling with the
mean leaves the attribute mean unchanged
– However, it does not preserve the relationship with
other variables
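
A short sketch of mean and median imputation on a single numeric attribute. The values are invented so that the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two missing entries.
temperature = pd.Series([20.0, 22.0, np.nan, 24.0, np.nan, 30.0])

# mean() and median() skip NaN by default, so they describe the observed values.
observed_mean = temperature.mean()      # (20 + 22 + 24 + 30) / 4
observed_median = temperature.median()  # middle of 20, 22, 24, 30

filled_mean = temperature.fillna(observed_mean)
filled_median = temperature.fillna(observed_median)

# The mean of the attribute is the same before and after mean imputation.
print(observed_mean, filled_mean.mean())
```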


Methods to Handle Missing Values


• Filling with local mean/median/mode:
– Use attribute mean/median/mode of all samples
belonging to a group (class) to fill in the missing value
• Applicable to numeric data
• Centre of the data of a group won’t change
• However, it does not preserve the relationship with other
variables
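
Filling with a local (per-group) mean might look like the sketch below. It assumes StationID is the grouping attribute; the readings are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two stations (groups).
df = pd.DataFrame({
    "StationID": ["A", "A", "A", "B", "B", "B"],
    "Temperature": [20.0, np.nan, 22.0, 30.0, 34.0, np.nan],
})

# Mean of each group, broadcast back to the rows of that group.
group_mean = df.groupby("StationID")["Temperature"].transform("mean")

# Each missing value is filled with the mean of its own group only.
df["Temperature_filled"] = df["Temperature"].fillna(group_mean)
print(df)
```

Replacing `"mean"` with `"median"` in `transform` gives the local-median variant.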


Methods to Handle Missing Values


• Use the values from the previous/next record (within
a group) to fill in the missing value (padding)
– Useful only when the domain understanding is good

• If the data is categorical or text, one can replace the
missing values with the most frequent observation
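
Padding within a group and most-frequent-value filling can be sketched as below. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records, ordered in time within each station.
df = pd.DataFrame({
    "StationID": ["A", "A", "A", "B", "B"],
    "Temperature": [20.0, np.nan, 22.0, np.nan, 31.0],
    "Sky": ["clear", None, "clear", "cloudy", None],
})

# Padding with the previous record of the same group (forward fill).
df["Temp_prev"] = df.groupby("StationID")["Temperature"].ffill()

# Padding with the next record of the same group (backward fill).
df["Temp_next"] = df.groupby("StationID")["Temperature"].bfill()

# Categorical attribute: fill with the most frequent observation (the mode).
most_frequent = df["Sky"].mode().iloc[0]
df["Sky_filled"] = df["Sky"].fillna(most_frequent)
print(df)
```

Forward fill leaves the first record of station B empty because that group has no earlier record to copy from, which is one reason the choice of direction needs domain understanding.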

Methods to Handle Missing Values


• Use the most probable value to fill in the missing value:
– Use an interpolation technique to predict the missing value
• Linear interpolation draws a straight line between the two
known points adjacent to the missing value and reads the
missing value off that line
• Interpolation happens column-wise
• Popular strategy
• It does not preserve the relationship with other variables
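
A minimal sketch of column-wise linear interpolation with pandas. The series is invented; `method="linear"` treats the records as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute with two consecutive missing values.
temperature = pd.Series([20.0, np.nan, np.nan, 26.0, 24.0])

# Each gap is filled from the straight line joining its known neighbours:
# 20 and 26 are three steps apart, so each step adds 2.
filled = temperature.interpolate(method="linear")
print(filled.tolist())
```

Calling `DataFrame.interpolate` applies the same rule to each column separately, which is why the relationship with other variables is not used.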


Methods to Handle Missing Values


• Use the most probable value to fill in the missing value:
– Use regression techniques to predict the missing value
(regression imputation)
• Let $y_1, y_2, \dots, y_d$ be a set of $d$ attributes
• Regression (multivariate): The $n$th value is predicted as
$x_n = f(y_{n1}, y_{n2}, \dots, y_{nd})$

[Figure: the $d$ attribute values $y$ are passed through $f(\cdot)$ to give the predicted value $x$]

• Linear regression (multivariate): $x_n = w_1 y_{n1} + w_2 y_{n2} + \dots + w_d y_{nd}$


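• Example with the attributes Temperature, Humidity and Rain:
$\text{Temperature} = f(\text{Humidity}, \text{Rain})$
$\text{Temperature} = w_{T1}\,\text{Humidity} + w_{T2}\,\text{Rain}$

$\text{Humidity} = f(\text{Temperature}, \text{Rain})$
$\text{Humidity} = w_{H1}\,\text{Temperature} + w_{H2}\,\text{Rain}$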



• Popular strategy
• It uses the most information from the available data to
predict the missing values
• It preserves the relationship with other variables
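
Linear regression imputation can be sketched with a least-squares fit. This sketch uses the weights-only form from the slides (no intercept term) and invented data that follow an exact linear relation; real data would also need a decision on whether to add an intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Temperature = 0.4 * Humidity - 1.0 * Rain, last value missing.
df = pd.DataFrame({
    "Humidity": [80.0, 70.0, 60.0, 50.0, 65.0],
    "Rain": [2.0, 1.0, 0.0, 0.0, 1.0],
    "Temperature": [30.0, 27.0, 24.0, 20.0, np.nan],
})
features = ["Humidity", "Rain"]
target = "Temperature"

# Fit the weights on the tuples where the target is known.
known = df[target].notna()
Y_known = df.loc[known, features].to_numpy()
x_known = df.loc[known, target].to_numpy()
weights, *_ = np.linalg.lstsq(Y_known, x_known, rcond=None)

# Predict the missing target values from the other attributes.
df_imputed = df.copy()
df_imputed.loc[~known, target] = df.loc[~known, features].to_numpy() @ weights
print(weights, df_imputed[target].tolist())
```

Because the prediction is built from Humidity and Rain, the imputed Temperature stays consistent with those attributes, unlike a mean or an interpolated value.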

Data Cleaning: Handling the Noisy Data


• Noise is a random error or variance in a measured variable
• Consider the case where many of the entries in a numeric
attribute are 0 (zero)
• Example 1:

Date      Temperature   ---
Sept 1    25.47         ---
Sept 2    26.19         ---
Sept 3    0             ---
Sept 4    24.30         ---
Sept 5    24.07         ---
Sept 6    21.21         ---
Sept 7    0             ---
Sept 8    21.79         ---
Sept 9    25.09         ---
Sept 10   0             ---
---

• Example 2: Pima-Indians-Diabetes

BMI    Age   ---
33.6   50    ---
26.6   31    ---
23.3   32    ---
0      21    ---
43.1   33    ---
25.6   30    ---
0      26    ---
35.3   29    ---
30.5   53    ---
0      54    ---
---

• Replace the 0s (zeros) based on domain knowledge
• Replace the 0s (zeros) using regression-based methods
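
One way to act on this, sketched under the assumption that a BMI of 0 is never a valid measurement: mark the zeros as missing so that any of the imputation methods can then be applied. The BMI values are the ten shown in Example 2; the median fill is only a stand-in for the chosen method.

```python
import numpy as np
import pandas as pd

# BMI values from Example 2; a BMI of 0 is assumed to be noise.
bmi = pd.Series([33.6, 26.6, 23.3, 0, 43.1, 25.6, 0, 35.3, 30.5, 0])

# Step 1: turn the invalid zeros into NaN so they count as missing values.
bmi_marked = bmi.replace(0, np.nan)
n_marked = int(bmi_marked.isna().sum())

# Step 2: impute; the median is used here only as a placeholder method.
bmi_filled = bmi_marked.fillna(bmi_marked.median())
print(n_marked, bmi_filled.tolist())
```

Whether a zero is noise depends on the attribute: a rainfall of 0 can be a valid reading, which is where domain knowledge comes in.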


Data Cleaning: Smoothing the Noisy Data


• Noise is a random error or variance in a measured variable
• Due to noise, many tuples (records) have incorrect values for
several attributes
• Most data contain noise
• Smooth out the data to remove the effect of noise
• Data smoothing allows important patterns to stand out
• The idea is to sharpen the patterns (values) in the data and
highlight the trends the data is pointing to

• Methods for data smoothing:
– Binning
– Regression (function approximation)

Binning Methods for Data Smoothing


• Binning methods smooth a sorted data value of a noisy
attribute by consulting its neighbourhood, i.e., the
values around it
• They perform local smoothing, because they consult only the
neighbourhood of values
• The sorted values are partitioned into (almost) equal-frequency
bins


Binning Methods for Data Smoothing


• Different approaches for smoothing by bin:
1. Smoothing by bin means:
– Each value in a bin is replaced by the mean value of the
bin
2. Smoothing by bin medians:
– Each value in a bin is replaced by the median value of
the bin
3. Smoothing by bin boundaries:
– The minimum and maximum values in a given bin are
identified as bin boundaries
– Each bin value is then replaced by the closest boundary
value
• The larger the bin width, the greater the effect of the
smoothing

Illustration of Binning Methods for Data Smoothing

• Example:
• Noisy data for price (in Rs) : 8, 15, 34, 24, 4, 21, 28, 21, 25
• Sorted data for price (in Rs) : 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins:     Smoothing by bin means:
Bin1: 4, 8, 15           Bin1: 9, 9, 9
Bin2: 21, 21, 24         Bin2: 22, 22, 22
Bin3: 25, 28, 34         Bin3: 29, 29, 29

• Smoothing by bin means (values in the original order of the noisy data) : 9, 9, 29, 22, 9, 22, 29, 22, 29

Partition into bins:     Smoothing by bin boundaries:
Bin1: 4, 8, 15           Bin1: 4, 4, 15
Bin2: 21, 21, 24         Bin2: 21, 21, 24
Bin3: 25, 28, 34         Bin3: 25, 25, 34

• Smoothing by bin boundaries (values in the original order of the noisy data) : 4, 15, 34, 24, 4, 21, 25, 21, 25
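
The price example can be reproduced with a short sketch. The function name `smooth_by_bins` and its parameters are hypothetical, and a value that lies exactly midway between the two boundaries is sent to the lower one, which is an assumption the slides do not cover.

```python
import numpy as np

def smooth_by_bins(values, n_bins, method="mean"):
    """Smooth values using (almost) equal-frequency bins of the sorted data."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")  # positions in sorted order
    smoothed = np.empty_like(values)
    # Split the sorted positions into n_bins bins of (almost) equal size.
    for bin_idx in np.array_split(order, n_bins):
        bin_vals = values[bin_idx]
        if method == "mean":
            smoothed[bin_idx] = bin_vals.mean()
        elif method == "median":
            smoothed[bin_idx] = np.median(bin_vals)
        elif method == "boundaries":
            lo, hi = bin_vals.min(), bin_vals.max()
            # Replace each value by the closer boundary (ties go to lo).
            smoothed[bin_idx] = np.where(bin_vals - lo <= hi - bin_vals, lo, hi)
        else:
            raise ValueError(f"unknown method: {method}")
    return smoothed  # same order as the input values

prices = [8, 15, 34, 24, 4, 21, 28, 21, 25]
by_means = smooth_by_bins(prices, 3, "mean")
by_boundaries = smooth_by_bins(prices, 3, "boundaries")
print(by_means.tolist())
print(by_boundaries.tolist())
```

The results are returned in the original order of the noisy data, so they can be compared directly with the two smoothed sequences in the example.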


Outlier Detection and Replacement with Central Tendency

• Compute the first quartile (Q1) and third quartile (Q3) for
an attribute
• Compute the interquartile range (IQR) as IQR = Q3 - Q1
for that attribute
• Compute
– Lower Bound = Q1 - (1.5 x IQR)
– Upper Bound = Q3 + (1.5 x IQR)
• Flag an attribute value as an outlier if
– it is less than the Lower Bound OR
– it is larger than the Upper Bound
• Replace these outlier values with the mean/median/mode
of the attribute
• Important note: If the outliers are due to noise, then
it is better to replace them
– Domain knowledge is very important
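
The quartile rule can be sketched as below. The readings are invented, the median is chosen as the replacement, and `Series.quantile` uses linear interpolation between data points by default, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

# Hypothetical attribute values with one suspicious reading (95.0).
values = pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 95.0])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds outside which a value is flagged as an outlier.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (values < lower_bound) | (values > upper_bound)

# Replace flagged values with the attribute median; keep all other values.
cleaned = values.where(~is_outlier, values.median())
print(lower_bound, upper_bound, cleaned.tolist())
```

The median is less affected by the outlier itself than the mean, which is one reason to prefer it as the replacement here.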

Summary of Data Cleaning


• Cleaning the data is often said to take about 80% of a data
analyst’s time
• Data cleaning routines attempt to identify and fill in missing
values and smooth out noise while identifying outliers
• One of the biggest data cleaning tasks is handling
missing values
• Popular methods for filling in missing values:
– Filling by central tendency (mean/median/mode)
– Filling by interpolation
– Filling by regression
• When the data are noisy, smooth them to remove the effect
of noise (binning and regression)
• Outliers can be detected using quartiles and the IQR
– Detected outliers can be replaced by the
mean/median/mode
