Chapter 2, Part 1: Data Preparation
Machine Learning Team
UP GL-BD
Why Prepare Data?
• Some data preparation is needed for all mining tools
• The prediction error rate should be lower (or at least the same) after preparation than before it.
• Real-world data is often dirty:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation=“”)
– noisy: containing errors or outliers (e.g., Salary=“-10”, Age=“222”)
– inconsistent: containing discrepancies in codes or names
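A minimal sketch of how the incomplete and noisy examples above could be flagged automatically. The function name `find_problems`, the sample records, and the age limit of 120 are assumptions made for illustration; real validity rules depend on the dataset.

```python
# Hypothetical validity limit; the right bound depends on the dataset.
MAX_AGE = 120


def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if record["occupation"] == "":
        problems.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if not 0 <= record["age"] <= MAX_AGE:
        problems.append("noisy: age is out of range")
    return problems


# Illustrative records reusing the slide's examples (occupation="", Salary=-10, Age=222).
records = [
    {"occupation": "engineer", "salary": 3500, "age": 41},
    {"occupation": "", "salary": 2800, "age": 35},
    {"occupation": "teacher", "salary": -10, "age": 29},
    {"occupation": "nurse", "salary": 3100, "age": 222},
]

for record in records:
    print(record, find_problems(record))
```

Checks like these only detect problems; deciding whether to repair, impute, or drop the record is a separate step.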
Machine Learning
ESPRIT 2022/2023
Major Tasks in Data Preparation
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
– Part of data reduction, but of particular importance, especially for numerical data
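As one illustration of the discretization task listed above, the sketch below maps numeric values to equal-width bins. Equal-width binning is only one possible technique and is not prescribed by the slides; the function name `equal_width_bins` and the sample ages are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 60, 78]  # illustrative data
print(equal_width_bins(ages, 3))
```

Each age is replaced by the index of its interval, which reduces a numeric attribute to a small number of categories.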
Data Preparation as a Step in the Knowledge Discovery Process
[Figure: the knowledge discovery process. Databases (DB) → cleaning and integration → data warehouse (DW) → selection and transformation → data mining → evaluation and presentation → knowledge]
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
• Labels: final desired classes

In the table below, each row is an object, each column is an attribute, and the last column (Cheat) holds the labels.

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
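The objects, attributes, and labels described above can be sketched in code as follows. The values are taken from the slide's table; representing each object as a dictionary and treating the Cheat column as the label are assumptions made for illustration.

```python
# Column names follow the slide's table; "Cheat" is treated as the label.
columns = ["Tid", "Refund", "Marital Status", "Taxable Income", "Cheat"]
rows = [
    (1, "Yes", "Single", "125K", "No"),
    (2, "No", "Married", "100K", "No"),
    (3, "No", "Single", "70K", "No"),
    (4, "Yes", "Married", "120K", "No"),
    (5, "No", "Divorced", "95K", "Yes"),
    (6, "No", "Married", "60K", "No"),
    (7, "Yes", "Divorced", "220K", "No"),
    (8, "No", "Single", "85K", "Yes"),
    (9, "No", "Married", "75K", "No"),
    (10, "No", "Single", "90K", "Yes"),
]

# Each object (record) is a mapping from attribute name to value.
objects = [dict(zip(columns, row)) for row in rows]

# Separate the descriptive attributes from the label.
label_name = "Cheat"
features = [{k: v for k, v in obj.items() if k != label_name} for obj in objects]
labels = [obj[label_name] for obj in objects]

print(features[0])
print(labels)
```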
Standard Activities of data preparation
Data Cleaning
“Data cleaning is used to refer to all kinds of tasks and activities to detect and
repair errors in the data.”
“Cleaning up your data is not the most glamourous of tasks, but it’s an essential
part of data wrangling. […] Knowing how to properly clean and assemble your data
will set you miles apart from others in your field.”
Data Cleaning: Outliers
“We will generally define outliers as samples that are exceptionally far from the
mainstream of the data.”
Causes of outliers:
– Measurement or input error.
– Data corruption.
– …
Data Cleaning: Outliers
• Some methods to detect outliers:
– Boxplot
– Histogram
– Scatter plot
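A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically, without drawing the plot. The function name `iqr_outliers`, the factor `k=1.5`, and the sample ages are assumptions; the slides do not specify a threshold.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative ages, including the implausible value 222.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 222]
print(iqr_outliers(ages))
```

A flagged value is a candidate for inspection, not automatically an error: whether it is removed or kept depends on its cause.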
Boxplot
MISSING DATA
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to:
– equipment malfunction
– data that was inconsistent with other recorded data and was therefore deleted
– data not entered due to misunderstanding
– data not considered important at the time of entry
– history or changes of the data not being registered
How to Handle Missing Data?
• Ignore records with missing values
– Usually done when the class label is missing, as most prediction methods do not handle missing data well
– Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
– Use only features (attributes) with all values (may leave out important features)
• Fill in the missing values manually
– tedious + infeasible?
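The two "ignore" options above can be sketched as follows, using `None` to mark a missing value. The function names and the sample records are hypothetical and serve only to show the trade-off.

```python
def drop_incomplete_records(records):
    """Keep only the records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


def complete_attributes(records):
    """Return the attribute names that have a value in every record."""
    return [
        name for name in records[0]
        if all(r[name] is not None for r in records)
    ]


# Illustrative records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "occupation": "engineer"},
    {"age": None, "income": 48000, "occupation": "teacher"},
    {"age": 45, "income": None, "occupation": "nurse"},
    {"age": 29, "income": 61000, "occupation": "clerk"},
]

print(len(drop_incomplete_records(records)))  # records that survive
print(complete_attributes(records))           # attributes that survive
```

On this small example, dropping records discards half of the data and dropping attributes keeps only one column, which is why both options can be costly.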
Data Imputation
Some tools ignore missing values; others use some metric to fill in replacements.
Filling in missing values with substitute data is called data imputation.
Data Imputation Strategies
A popular approach for data imputation is to calculate a statistical value for
each column (such as a mean) and replace all missing values for that column
with the statistic.
Strategy        Definition                                                                  Data type
Mean            Replace missing values using the mean along each column                     numeric data
Median          Replace missing values using the median along each column                   numeric data
Most frequent   Replace the missing value using the most frequent value along each column   numeric data/strings
Constant        Replace missing values with fill_value                                      numeric data/strings
                • fill_value=0 when imputing numeric data
                • fill_value=“missing_value” for strings or object data types
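The four imputation strategies described above can be sketched in plain Python for a single column. The function name `impute`, the strategy strings, and the sample data are assumptions made for illustration; `None` marks a missing value.

```python
import statistics


def impute(column, strategy="mean", fill_value=0):
    """Return a copy of the column with None replaced per the chosen strategy."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        replacement = statistics.mean(observed)      # numeric data only
    elif strategy == "median":
        replacement = statistics.median(observed)    # numeric data only
    elif strategy == "most_frequent":
        replacement = statistics.mode(observed)      # numeric data or strings
    elif strategy == "constant":
        replacement = fill_value                     # numeric data or strings
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in column]


scores = [1.0, None, 3.0, 3.0, None, 8.0]   # illustrative numeric column
cities = ["Tunis", None, "Sfax", "Tunis"]   # illustrative string column

print(impute(scores, "mean"))
print(impute(scores, "median"))
print(impute(cities, "most_frequent"))
print(impute(cities, "constant", fill_value="missing_value"))
```

The statistic is computed from the observed values only, so the choice of strategy matters most when many values are missing or the column is skewed.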
Feature Selection
• Feature selection is the process of selecting the most important features to feed into machine learning algorithms
• Benefit: reduces overfitting
Feature Selection Methods
• “Filter” vs. “Wrapper”:
– Filter: choose the features independently of the model.
– Wrapper: choose the features in an iterative way, evaluating candidate feature sets with the model.
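The contrast between the two families can be sketched as follows. Everything specific here is an assumption chosen for illustration, not the slides' method: the filter scores features by absolute Pearson correlation with the label, and the wrapper performs greedy forward selection scored by the leave-one-out accuracy of a 1-nearest-neighbour model. The toy data is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_select(X, y, k):
    """Filter: score each feature on its own, without training any model."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour model on the given features."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((row[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def wrapper_select(X, y):
    """Wrapper: add one feature at a time while the model's score improves."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in range(len(X[0])) if f not in selected]
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in candidates]
        improved = [(s, f) for s, f in scored if s > best]
        if not improved:
            return selected
        best, chosen = max(improved, key=lambda sf: sf[0])
        selected.append(chosen)


# Hypothetical toy data: features 0 and 2 track the label, feature 1 is noise.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 1.0, 0.1],
    [0.9, 3.0, 0.3],
    [3.0, 4.9, 0.9],
    [3.2, 1.1, 0.8],
    [2.9, 3.1, 1.0],
]
y = [0, 0, 0, 1, 1, 1]

print(filter_select(X, y, k=2))
print(wrapper_select(X, y))
```

The filter needs one pass over the features, while the wrapper retrains and re-evaluates the model for every candidate set, which is more expensive but takes the model's behaviour into account.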
Data Transformation
• Data transformation is used to remove noise from a dataset.
• Noise is any distorted or meaningless data in a dataset.
• Most common strategies for data transformation:
1) Normalization
2) Data aggregation
3) Standardization
4) Generalization …
Normalization
• For distance-based methods, normalization helps to prevent attributes with large ranges from outweighing attributes with small ranges
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Normalization
• min-max normalization
$$v' = \frac{v - \min_v}{\max_v - \min_v}$$
• z-score normalization
$$v' = \frac{v - \bar{v}}{\sigma_v}$$
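A minimal sketch of the three normalizations named above. The function names and sample values are assumptions. The z-score version uses the population standard deviation, which is one possible choice, and decimal scaling is taken in its usual form, dividing by the smallest power of ten that brings every absolute value below 1.

```python
import statistics


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max normalization is sensitive to outliers because a single extreme value sets the range, whereas the z-score depends on the mean and standard deviation of the whole column.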
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
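The change of scale described above (cities aggregated into regions) can be sketched as a simple group-and-sum. The records, the region names, and the function `aggregate_by_region` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (city, region, sales) records.
city_sales = [
    ("City A", "North", 120),
    ("City B", "North", 80),
    ("City C", "South", 100),
    ("City D", "South", 60),
]


def aggregate_by_region(records):
    """Sum the sales of all cities belonging to the same region."""
    totals = defaultdict(int)
    for _city, region, sales in records:
        totals[region] += sales
    return dict(totals)


region_sales = aggregate_by_region(city_sales)
print(region_sales)
```

Four city-level objects become two region-level objects, which is the data reduction effect listed above.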
Aggregation
Variation of Precipitation in Australia