Machine Learning Chapter 2

Chapter 2
Data preprocessing for Machine Learning
1
Outline
• What is data preprocessing
• Data quality problems
• Data preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
2
What is data pre-processing
• Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model.
• It is the first, and a crucial, step in building a machine learning
model.
• Why do we need data preprocessing?
– Real-world data generally contains noise and missing values, and
may be in a format that cannot be used directly by machine
learning models.
– Data preprocessing cleans the data and makes it suitable for a
machine learning model, which can also improve the model's
accuracy and efficiency.
3
Data pre-processing
• Data processing is the collection and manipulation of items of data to
produce meaningful information.
• The processing may be any change to the data that is detectable by an
observer.
• It converts data into a usable, desired form through a predefined
sequence of operations, carried out by either:
– Manual data processing, or
– Automatic (electronic) data processing
4
Data pre-processing
• Data processing may involve various processes:
5
Data quality problem
6
Incomplete data
7
Data Duplication
8
Inconsistent Format
9
Accessibility
10
System Upgrades
11
Data purging and storage
12
Poor organization
13
Example of data quality problem
14
Example of data quality problem (cont.)
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
– Fake data
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
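The detection question above can be made concrete with a small sketch. It uses only the Python standard library; the records, the field names, and the 1.5 × IQR outlier rule are illustrative assumptions, not part of the original slides.

```python
from statistics import quantiles

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 25, "income": 3200},
    {"id": 2, "age": None, "income": 4100},
    {"id": 3, "age": 31, "income": 3900},
    {"id": 3, "age": 31, "income": 3900},   # exact copy of the previous row
    {"id": 4, "age": 29, "income": 98000},  # unusually large income
    {"id": 5, "age": 40, "income": 3600},
]


def count_missing(rows):
    """Count missing (None) entries per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen, duplicates = set(), []
    for index, row in enumerate(rows):
        key = tuple(row[field] for field in sorted(row))
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


incomes = [row["income"] for row in records if row["income"] is not None]
print(count_missing(records))    # missing values per field
print(find_duplicates(records))  # positions of duplicate rows
print(find_outliers(incomes))    # candidate outliers
```

Flagged values are only candidates: whether an extreme income is wrong data or a genuine observation still has to be decided from domain knowledge.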
15
Noise
16
Outliers
17
Missing value
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate data objects or variables
– Estimate missing values
(e.g., time series of temperature; census results)
– Ignore the missing value during analysis
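The three handling strategies listed above can be sketched as follows. This is a minimal illustration on a hypothetical temperature series; linear interpolation is one assumed way to estimate missing readings, chosen here because neighbouring values in a time series tend to be similar.

```python
from statistics import mean

# Hypothetical hourly temperature readings; None marks a missing reading.
temps = [20.0, 21.0, None, 23.0, None, None, 26.0]


def drop_missing(values):
    """Strategy 1: eliminate the objects that have a missing value."""
    return [v for v in values if v is not None]


def interpolate(values):
    """Strategy 2: estimate interior gaps by linear interpolation.

    Gaps at the very start or end have only one known neighbour
    and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def mean_ignoring_missing(values):
    """Strategy 3: ignore missing values during the analysis itself."""
    return mean(drop_missing(values))


print(drop_missing(temps))
print(interpolate(temps))
print(mean_ignoring_missing(temps))
```

For survey-style data such as a census, where rows have no natural ordering, a common alternative estimate is the mean or most frequent value of the attribute.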
18
Quality Data
• Data have quality if they satisfy the requirements of the intended
use, that is, when the data quality problems above have been
resolved. The dimensions of data quality include:
• Accuracy, completeness, consistency, timeliness,
believability, and interpretability.
19
Data pre-processing
20
Data Engineering ( Raw data )
21
Data Engineering ( Prepared data )
22
Feature Engineering
23
Major Tasks in Data pre-processing
• Data cleaning: this step involves identifying and correcting or removing
missing, inconsistent, or irrelevant data. This can include removing
duplicate records, filling in missing values, and handling outliers.
• Data integration: this step involves combining data from multiple sources,
such as databases, spreadsheets, and text files. The goal of integration is to
create a single, consistent view of the data.
• Data transformation: this step involves converting the data into a format that
is more suitable for the learning task. This can include normalizing
numerical data, creating dummy variables, and encoding categorical data.
• Data reduction: this step selects a subset of the data that is relevant
to the machine learning task. This can include feature selection.
• Data discretization: this step converts continuous numerical data
into categorical data, which can be used by decision trees and other
machine learning techniques that work on categorical inputs.
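The first three tasks can be sketched on a toy example. The two data sources, the field names, mean imputation, and min-max scaling are assumptions chosen for illustration; a real project would typically rely on a data-frame library instead of plain lists.

```python
# Hypothetical data from two sources, linked by a customer id.
regions = {1: "north", 2: "south", 3: "north"}         # source A: id -> region
sales = [(1, 120.0), (2, 80.0), (2, 80.0), (3, None)]  # source B: (id, amount)

# 1. Cleaning: drop exact duplicates (keeping order), then fill missing
#    amounts with the mean of the known amounts.
unique_sales = list(dict.fromkeys(sales))
known = [amount for _, amount in unique_sales if amount is not None]
mean_amount = sum(known) / len(known)
cleaned = [(cid, mean_amount if amount is None else amount)
           for cid, amount in unique_sales]

# 2. Integration: join the two sources on the customer id.
merged = [{"id": cid, "amount": amount, "region": regions[cid]}
          for cid, amount in cleaned if cid in regions]

# 3. Transformation: min-max normalisation of the numeric column and
#    dummy (one-hot) variables for the categorical column.
#    Assumes the amounts are not all equal, otherwise hi - lo is zero.
lo = min(row["amount"] for row in merged)
hi = max(row["amount"] for row in merged)
categories = sorted(set(regions.values()))
for row in merged:
    row["amount_scaled"] = (row["amount"] - lo) / (hi - lo)
    for category in categories:
        row["region_" + category] = int(row["region"] == category)

for row in merged:
    print(row)
```

After these steps every row has a scaled amount between 0 and 1 and one 0/1 column per region, which is the kind of numeric input most learning algorithms expect.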
24
Data Cleaning
25
Handling Missing value
26
Data Integration
27
Data Integration approach
28
Data transformation
29
Data transformation
30
Data Reduction
31
Example of data reduction: feature selection
• Feature selection means choosing a subset of the input variables that
describes the data efficiently, reduces the effect of noise and irrelevant
variables, and still gives good prediction results. It targets:
• Redundant features
– Duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
– Contain no information that is useful for the data mining task at
hand
– Example: students' ID is often irrelevant to the task of predicting
students' GPA
32
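One simple way to spot redundant features such as price and sales tax is to look for pairs of columns that are almost perfectly correlated. The sketch below assumes hypothetical column values, a Pearson correlation measure, and a threshold of 0.95; none of these come from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[other])) < threshold
               for other in kept):
            kept.append(name)
    return kept


# Hypothetical product data: sales tax is a fixed 15% of the price.
features = {
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales_tax": [1.5, 3.0, 4.5, 6.0, 7.5],
    "weight": [2.0, 1.0, 4.0, 3.0, 5.0],
}

print(drop_redundant(features))
```

Correlation only detects linear redundancy between pairs of features. An irrelevant feature such as a student ID needs a different check, for example scoring each feature against the target.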
Feature selection (cont.)
• Types of feature selection
– Filter methods: rank the variables by a scoring criterion (e.g., gain
ratio) and select the top-ranked ones, independently of any learning
algorithm.
– Wrapper methods: the selection is driven by the specific machine
learning algorithm that we are trying to fit on the given dataset. It
follows a greedy search approach, evaluating candidate combinations of
features against the evaluation criterion.
– Embedded methods: combine the advantages of filter and wrapper
methods by performing feature selection as part of model training.
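As an illustration of a filter method, the sketch below ranks categorical features by gain ratio (information gain divided by the entropy of the feature itself). The feature names and the tiny dataset are hypothetical, and the code covers only the filter approach, not wrapper or embedded methods.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((count / total) * log2(count / total)
                for count in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split info."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    split_info = entropy(feature)
    if split_info == 0:  # feature has a single value and cannot split the data
        return 0.0
    return (entropy(labels) - conditional) / split_info


# Hypothetical dataset with three categorical features and a binary label.
labels = ["yes", "yes", "no", "no"]
features = {
    "f_noise": ["p", "q", "p", "q"],    # tells nothing about the label
    "f_partial": ["u", "u", "u", "v"],  # partly informative
    "f_perfect": ["x", "x", "y", "y"],  # determines the label exactly
}

ranking = sorted(features,
                 key=lambda name: gain_ratio(features[name], labels),
                 reverse=True)
print(ranking)
```

A filter method would then keep the top-ranked features and pass only those to the learning algorithm.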
33
Data reduction Techniques
34
Dimensionality Reduction
• Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
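To show what PCA does, the sketch below reduces hypothetical 2-D points to a single coordinate along their first principal component. It finds that component by power iteration on the covariance matrix, which is an assumption made to keep the example dependency-free; libraries usually compute PCA through an eigendecomposition or SVD.

```python
from math import sqrt


def first_principal_component(points, iterations=100):
    """Return (direction, scores) for the first principal component.

    direction: unit vector of largest variance, found by power iteration.
    scores: coordinate of each centred point along that direction.
    """
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centred = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centred data.
    cov = [[sum(row[a] * row[b] for row in centred) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    direction = [1.0] * dim
    for _ in range(iterations):
        product = [sum(cov[a][b] * direction[b] for b in range(dim))
                   for a in range(dim)]
        norm = sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    scores = [sum(row[j] * direction[j] for j in range(dim))
              for row in centred]
    return direction, scores


# Hypothetical 2-D points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
direction, scores = first_principal_component(points)
print(direction)  # roughly proportional to (1, 2)
print(scores)     # one number per point instead of two
```

Because the points lie almost on a line, the single score per point keeps nearly all of the variance of the original two attributes.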
35
Data discretization
• Discretization is the process of converting a continuous attribute into an
ordinal attribute
– Discretization is commonly used in classification
– Many classification algorithms work best if both the independent and
dependent variables have only a few values.
– A potentially infinite number of values is mapped into a small number
of categories.
[Figure: discretization example]
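A minimal sketch of discretization by equal-width binning is given here. The ages and the three category labels are hypothetical, and equal-width binning is only one of several possible schemes (equal-frequency and entropy-based binning are others).

```python
def equal_width_bins(values, labels):
    """Map each numeric value to one of len(labels) equal-width intervals.

    Assumes the values are not all equal, otherwise the width is zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for value in values:
        # The maximum value falls on the upper edge, so clamp it
        # into the last interval.
        index = min(int((value - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


# Hypothetical ages mapped to three ordered categories.
ages = [3, 17, 25, 38, 41, 59, 63]
print(equal_width_bins(ages, ["young", "middle", "senior"]))
```

The labels are ordered, so the result is an ordinal attribute: the continuous age range from 3 to 63 is replaced by three categories of width 20.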
36
End of Chapter 2
37
