Why Data Preprocessing?
• Mean:
• Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed mean: mean after chopping off extreme values.
• Median:
• Middle value if odd number of values, or average of the middle two values otherwise
• Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
• Mode:
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
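The measures of central tendency listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a reference implementation. The helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`), the sample values, and the `(lower_bound, width, frequency)` interval layout are all assumptions made for the example. In `grouped_median`, `total` plays the role of N and `below` the role of the summed frequencies of the intervals below the median interval.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (fraction < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each tail
    return mean(ordered[k:len(ordered) - k])


def grouped_median(intervals):
    """Interpolated median for grouped data.

    `intervals` is a list of (lower_bound, width, frequency) tuples,
    sorted by lower_bound (a hypothetical layout chosen for this sketch).
    """
    total = sum(freq for _, _, freq in intervals)
    if total == 0:
        raise ValueError("all interval frequencies are zero")
    below = 0  # cumulative frequency of the intervals below the current one
    for lower, width, freq in intervals:
        if below + freq >= total / 2:
            # This is the median interval: interpolate inside it.
            return lower + (total / 2 - below) / freq * width
        below += freq


# Made-up sample values for illustration only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
intervals = [(0, 10, 5), (10, 10, 10), (20, 10, 5)]

print(mean(salaries))               # arithmetic mean
print(trimmed_mean(salaries, 0.1))  # drops the lowest and highest value
print(median(salaries))             # average of the two middle values
print(multimode(salaries))          # all most-frequent values (bimodal here)
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median(intervals))
```

The trimmed mean drops the extreme value 110, which pulls the plain mean upward, so it lands closer to the median. `multimode` (Python 3.8+) is used instead of `mode` so that bimodal or trimodal data returns every mode.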
Data Cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• values that were inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious and often infeasible
• Fill in automatically with
• a global constant, e.g., “unknown” (which may effectively act as a new class)
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based, such as regression, a Bayesian formula, or a decision tree
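The simpler strategies above (ignoring the tuple, a global constant, the attribute mean, and the class-conditional mean) might look as follows in plain Python. This is a sketch under stated assumptions: tuples are represented as dictionaries, a missing value is `None`, and the attribute names `income` and `risk`, the helper names, and the sample rows are all invented for illustration. The inference-based option is not shown.

```python
from collections import defaultdict
from statistics import mean


def drop_missing_label(rows, label):
    """Ignore tuples whose class label is missing."""
    return [r for r in rows if r[label] is not None]


def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Replace missing values of `attr` with the mean over tuples of the same class.

    Assumes every class with a missing value has at least one observed value.
    """
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    means = {cls: mean(vals) for cls, vals in observed.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]


# Made-up sample tuples; None marks a missing value.
rows = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 30, "risk": None},
]

labeled = drop_missing_label(rows, "risk")
by_mean = fill_mean(rows, "income")
by_class = fill_class_mean(labeled, "income", "risk")
print(by_mean[1]["income"], by_class[1]["income"], by_class[4]["income"])
```

In this sample the overall mean gives both missing incomes the same value, while the class-conditional mean gives the low-risk and high-risk tuples different values, which is why the slide calls it the smarter choice.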
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may occur due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data
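Two of the problems above, inconsistent naming conventions and duplicate records, often interact: records only show up as duplicates once their names have been put into one form. The sketch below illustrates that idea. The `aliases` lookup table, the field names, and the sample records are hypothetical. Real cleaning pipelines usually need fuzzier matching than this exact comparison.

```python
def normalize(record, aliases):
    """Collapse whitespace, lowercase the name, and map known aliases to one form."""
    name = " ".join(record["name"].split()).lower()
    return {**record, "name": aliases.get(name, name)}


def deduplicate(records):
    """Keep the first occurrence of each record, comparing all fields exactly."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Hypothetical alias table and records, for illustration only.
aliases = {"bill clinton": "william clinton"}
records = [
    {"name": "Bill  Clinton", "city": "NY"},
    {"name": "william clinton", "city": "NY"},
    {"name": "Ann Lee", "city": "LA"},
]

cleaned = deduplicate([normalize(r, aliases) for r in records])
print(cleaned)
```

Without the normalization step, `deduplicate` would keep all three records, because the first two differ in spelling even though they describe the same entity.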