Chapter 2 Part1

Chapter 1: Data Preparation
Part I
Machine Learning Team
UP GL-BD
Why Prepare Data?
• Some data preparation is needed for all mining tools
• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
• The prediction error rate should be lower (or at least the same) after preparation than before it
• In a machine learning project, raw data typically cannot be used directly:
- Most machine learning algorithms require numeric input.
- Some machine learning algorithms impose requirements on the data.
- Statistical noise and errors in the data may need to be corrected…
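To illustrate the point that algorithms expect numbers, here is a minimal sketch that turns a text column into numeric columns. One-hot encoding with pandas is only one possible choice and is an assumption here, as are the toy column names and values.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical (string) column.
df = pd.DataFrame({
    "income": [125, 100, 70],
    "marital_status": ["Single", "Married", "Single"],
})

# One-hot encode the string column so that every column is numeric.
encoded = pd.get_dummies(df, columns=["marital_status"], dtype=int)

print(encoded)
```

Each distinct category becomes its own 0/1 column, so the result can be passed to an algorithm that accepts only numbers.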
Why Prepare Data?
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“”
– noisy: containing errors or outliers
  • e.g., Salary=“-10”, Age=“222”
– inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42” Birthday=“03/07/1997”
  • e.g., Was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records
  • e.g., Address: Travessa da Igreja de Nevogilde; Parish: Paranhos (the street name points to Nevogilde, but the recorded parish is Paranhos)
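The incomplete and noisy cases above can be flagged with simple checks. This is a minimal sketch on hypothetical records; the column names and the plausible-range bounds are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical records reproducing the kinds of problems listed above.
records = pd.DataFrame({
    "occupation": ["engineer", "", "teacher", "teacher"],
    "salary": [3200, 2800, -10, 2900],
    "age": [35, 41, 222, 29],
})

# Incomplete: empty strings stand in for missing attribute values.
incomplete = records["occupation"].str.strip() == ""

# Noisy: values outside a plausible range (the bounds are assumptions).
noisy = (records["salary"] < 0) | ~records["age"].between(0, 120)

# Show only the rows that need attention.
print(records[incomplete | noisy])
```

Inconsistencies such as an age that disagrees with a birthday need a rule that compares two columns, which depends on the reference date and is left out of this sketch.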
Thus...
• The raw data must be pre-processed before being used to fit and evaluate a machine learning model.
• This step in a predictive modeling project is referred to as “Data Preparation”.
Standard Activities of Data Preparation
Data Cleaning
Identifying and correcting mistakes or errors in the data.
Machine Learning
ESPRIT 2022/2023
Major Tasks in Data Preparation
• Data discretization
  – Part of data reduction, but of particular importance, especially for numerical data
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtain a representation that is reduced in volume but produces the same or similar analytical results
Data Preparation as a Step in the Knowledge Discovery Process
[Figure: the knowledge discovery process. Databases (DB) → cleaning and integration → data warehouse (DW) → selection and transformation → data mining → evaluation and presentation → knowledge]
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance
• Labels: the final desired classes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Rows are objects and columns are attributes; the Cheat column holds the labels.)
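The objects, attributes, and labels above map naturally onto a table structure in code. This is a minimal sketch assuming pandas; the column names without spaces, and treating Tid as an identifier rather than a feature, are choices made here for illustration.

```python
import pandas as pd

# The ten objects (rows) from the table above; income is in thousands.
data = pd.DataFrame({
    "Tid": range(1, 11),
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Attributes (features) describe each object; the label is the class to predict.
X = data[["Refund", "MaritalStatus", "TaxableIncome"]]
y = data["Cheat"]

print(X.shape)
print(y.value_counts().to_dict())
```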
Data Cleaning
“Data cleaning is used to refer to all kinds of tasks and activities to detect and
repair errors in the data.”
Page xiii, Data Cleaning, 2019.
“Cleaning up your data is not the most glamourous of tasks, but it’s an essential
part of data wrangling. […] Knowing how to properly clean and assemble your data
will set you miles apart from others in your field.”
Page 149, Data Wrangling with Python, 2016.
Data Cleaning
• Data cleaning includes simple tasks such as:
– Defining what normal data looks like.
– Removing duplicate rows and redundant or irrelevant columns.
– Identifying outliers.
– Dealing with missing values.
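Two of these tasks, removing duplicate rows and dropping irrelevant columns, can be sketched with pandas as follows. The data is hypothetical, and treating a single-valued column as irrelevant is an assumption made for the example, not a general rule.

```python
import pandas as pd

# Hypothetical data with one duplicated row and one constant column.
df = pd.DataFrame({
    "age": [25, 32, 25, 47],
    "city": ["Tunis", "Sfax", "Tunis", "Sousse"],
    "country": ["TN", "TN", "TN", "TN"],
})

# Remove duplicate rows, keeping the first occurrence.
deduplicated = df.drop_duplicates()

# Treat a column with a single distinct value as irrelevant (an assumption:
# it cannot help distinguish one row from another).
constant_columns = [c for c in deduplicated.columns
                    if deduplicated[c].nunique() == 1]
cleaned = deduplicated.drop(columns=constant_columns)

print(cleaned)
```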
Data Cleaning: Outliers
“We will generally define outliers as samples that are exceptionally far from the
mainstream of the data.”
Page 33, Applied Predictive Modeling, 2013.
Causes of outliers:
– Measurement or input error.
– Data corruption.
– …
Data Cleaning: Outliers
• Some methods to detect outliers:
– Boxplot
– Histogram
– Scatter plot
Boxplot
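A boxplot draws points that lie far outside the interquartile range (IQR) as individual markers. Here is a minimal sketch of that rule, assuming the common 1.5 × IQR convention (the slides do not state a threshold) and hypothetical age values.

```python
import numpy as np

# Hypothetical ages; 222 is the kind of value described earlier as noisy.
ages = np.array([23, 25, 27, 29, 31, 33, 35, 37, 222])

# First and third quartiles, and the interquartile range between them.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Common boxplot convention: points beyond 1.5 * IQR from the quartiles
# are treated as candidate outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers)
```

A flagged point is only a candidate: whether it is an error or a genuine extreme value still has to be judged from the context.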
MISSING DATA
Missing Data
• Data is not always available
  – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – data that was inconsistent with other recorded data and was therefore deleted
  – data not entered due to misunderstanding
  – data not considered important at the time of entry
  – history or changes of the data not being recorded
• Missing data may need to be inferred.
• Missing values may carry some information content: e.g., a credit application may carry information through which fields the applicant did not complete
How to Handle Missing Data?
• Ignore records (use only cases with all values)
  – Usually done when the class label is missing, as most prediction methods do not handle missing data well
  – Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  – Use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  – Tedious and possibly infeasible
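The first two options, ignoring records and ignoring attributes, can be sketched with pandas as follows. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; NaN marks a missing value.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51],
    "income": [3200, 2800, np.nan, 4100],
    "city": ["Tunis", "Sfax", "Sousse", "Tunis"],
})

# Ignore records: keep only the rows where every value is present.
complete_rows = df.dropna(axis=0)

# Ignore attributes: keep only the columns where every value is present.
complete_columns = df.dropna(axis=1)

print(len(complete_rows), list(complete_columns.columns))
```

In this toy case each option discards half of the rows or two of the three columns, which shows how quickly either approach can shrink a data set.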
Data Imputation
Some tools ignore missing values; others use some metric to fill in replacements.
Filling in missing values with substitute data is called data imputation.
Some data imputation methods:
1. Delete individuals with missing data
2. Replace missing data with a fixed value
3. Replace missing data with values predicted by a decision tree
4. Replace missing data with values taken from the nearest (most similar) records
5. Replace missing data with dedicated algorithms…
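Method 4 can be read as nearest-neighbour imputation. This is a minimal sketch under that assumption, using scikit-learn's `KNNImputer`; the choice of tool, the value of `n_neighbors`, and the data are all illustrative.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: two numeric columns, one missing value in the second.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# Fill each missing entry with the mean of that column over the
# 2 nearest rows (distance is computed on the observed columns).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The incomplete row is closest to the first two rows in the observed column, so its missing value is filled from those two rows rather than from the distant fourth row.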
Data Imputation Strategies
A popular approach for data imputation is to calculate a statistical value for
each column (such as a mean) and replace all missing values for that column
with the statistic.
Strategy            Definition                                                                Data type
« mean »            Replace missing values using the mean along each column                   numeric data
« median »          Replace missing values using the median along each column                 numeric data
« most_frequent »   Replace missing values using the most frequent value along each column    numeric data / strings
« constant »        Replace missing values with fill_value                                    numeric data / strings
For « constant »:
• fill_value=0 when imputing numeric data
• fill_value=“missing_value” for strings or object data types
Attention: Missing values must be marked with NaN
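The strategy names and the `fill_value` parameter match scikit-learn's `SimpleImputer`, so the sketch below assumes that is the tool the table describes. The data is hypothetical and the missing value is marked with NaN.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value marked as NaN.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# One imputer per strategy; fill_value is only used by "constant".
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant": SimpleImputer(strategy="constant", fill_value=0),
}

# Record the value each strategy writes into the missing position (row 3).
filled = {name: imputer.fit_transform(X)[3, 0]
          for name, imputer in imputers.items()}

print(filled)
```

The single large value pulls the mean well above the median, which is one reason to compare strategies before choosing one.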
Feature Selection
• Feature selection is the process of selecting the most important features to use as input to machine learning algorithms
• One benefit: reduced overfitting
Feature Selection Methods
• « Filter » vs. « Wrapper »:
– Filter: assign a score to each input feature and select the best-scoring features.
– Wrapper: iteratively choose the set of features that gives the best-performing model.
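The two families can be contrasted in a short sketch. The slides name no specific tools, so everything concrete here is an assumption: scikit-learn's `SelectKBest` with an ANOVA F-test stands in for a filter method, recursive feature elimination (`RFE`) around a logistic regression stands in for a wrapper method, and the Iris data set is used only as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: score each feature independently (ANOVA F-test) and keep the top 2.
filter_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
filter_mask = filter_selector.get_support()

# Wrapper: repeatedly fit a model and drop the weakest feature until 2 remain.
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=2).fit(X, y)
wrapper_mask = wrapper_selector.get_support()

print(filter_mask)
print(wrapper_mask)
```

The filter scores features without training the final model, so it is cheap; the wrapper retrains the model at each step, which costs more but takes the model's behaviour into account.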
Data Transformation
• Data transformation converts data into a form better suited to analysis; one of its uses is to remove noise from a data set.
• Noise is any distorted and meaningless data in a data set.
• Most common strategies for data transformation:
1) Normalization
2) Data aggregation
3) Standardization
4) Generalization …
Normalization
• For distance-based methods, normalization helps prevent attributes with large ranges from outweighing attributes with small ranges
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Normalization
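• min-max normalization
$$v' = \frac{v - \min_v}{\max_v - \min_v}$$
• z-score normalization
$$v' = \frac{v - \bar{v}}{\sigma_v}$$
where $\min_v$ and $\max_v$ are the minimum and maximum of the attribute, $\bar{v}$ is its mean, and $\sigma_v$ is its standard deviation.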
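The three normalization methods can be sketched with NumPy as follows. The attribute values are hypothetical. The decimal-scaling rule used here (divide by the smallest power of 10 that brings every absolute value below 1) is the usual textbook definition and is not spelled out in the slides.

```python
import numpy as np

# Hypothetical attribute values.
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to the interval [0, 1].
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: subtract the mean, divide by the standard
# deviation (NumPy's default is the population standard deviation).
z_score = (v - v.mean()) / v.std()

# Decimal scaling: find the smallest j such that max(|v|) / 10**j < 1.
j = 0
while np.abs(v).max() / 10 ** j >= 1:
    j += 1
decimal_scaled = v / 10 ** j

print(min_max)
print(z_score)
print(decimal_scaled)
```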
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
Aggregation
[Figure: Variation of precipitation in Australia. Two panels: standard deviation of average monthly precipitation, and standard deviation of average yearly precipitation]
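The effect of aggregation on variability can be sketched with synthetic data. The values below are random numbers, not real precipitation measurements, and the gamma distribution and its parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly precipitation for 10 years (random values, not real data).
rng = np.random.default_rng(0)
months = pd.date_range("2000-01-01", periods=120, freq="MS")
monthly = pd.Series(rng.gamma(shape=2.0, scale=30.0, size=120), index=months)

# Aggregate objects: average the 12 monthly values of each year into one.
yearly = monthly.groupby(monthly.index.year).mean()

# Fewer objects after aggregation, and a smaller spread between them.
print(len(monthly), len(yearly))
print(monthly.std(), yearly.std())
```

Averaging twelve values per year both reduces the number of objects and smooths out month-to-month fluctuations, which is the "more stable data" effect described above.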