Machine Learning Chapter 2
Outline
• What is data preprocessing
• Data quality problems
• Data preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
What is data pre-processing
Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model.
It is the first step, and a crucial one, in creating a machine learning
model.
Why do we need data preprocessing?
Real-world data generally contains noise and missing values, and
may be in an unusable format that cannot be fed directly to
machine learning models.
Data preprocessing cleans the data and makes it suitable for a
machine learning model, which can also improve the model's
accuracy and efficiency.
Data pre-processing
Data quality problems
• Incomplete data
• Data duplication
• Inconsistent format
• Accessibility
• System upgrades
• Data purging and storage
• Poor organization
Example of data quality problem
Example of data quality problem (continued)
Outliers
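The slides do not say how outliers are detected. As one possible illustration, the sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption made here and not a method taken from the slides. The function name `iqr_outliers`, the parameter `k`, and the sample readings are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1                    # interquartile range (IQR)
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical readings with one value far away from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(readings)
```

Whether a flagged value should be removed, capped, or kept depends on the task; the rule only marks candidates for inspection.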
Missing values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Data pre-processing
Data Engineering (raw data)
Data Engineering (prepared data)
Feature Engineering
Major tasks in data pre-processing
• Data cleaning: identifying and handling missing, inconsistent, or
irrelevant data. This can include removing duplicate records, filling in
missing values, and handling outliers.
• Data integration: combining data from multiple sources, such as databases,
spreadsheets, and text files. The goal of integration is to create a single,
consistent view of the data.
• Data transformation: converting the data into a format that is more suitable
for the data mining task. This can include normalizing numerical data and
encoding categorical data, for example by creating dummy variables.
• Data reduction: selecting the subset of the data that is relevant to the
machine learning task. This can include feature selection.
• Data discretization: converting continuous numerical data into categorical
data, which can be used by decision trees and other machine learning
techniques that work on categorical data.
Data Cleaning
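As a minimal sketch of one cleaning step, the code below removes duplicate records after first normalizing inconsistent formatting (extra whitespace and letter case). The record layout, the sample values, and the helper names `normalize_record` and `drop_duplicates` are hypothetical and chosen only for illustration.

```python
def normalize_record(record):
    """Trim whitespace and lowercase string fields so equal values compare equal."""
    return {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in record.items()
    }


def drop_duplicates(records):
    """Return the normalized records with duplicates removed, keeping the first copy."""
    seen = set()
    unique = []
    for record in map(normalize_record, records):
        # A sorted tuple of (field, value) pairs is hashable and order-independent.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Hypothetical raw data: the second row repeats the first in a different format.
raw = [
    {"name": "Alice", "city": "London"},
    {"name": " alice ", "city": "LONDON"},
    {"name": "Bob", "city": "Paris"},
]
cleaned = drop_duplicates(raw)
```

Normalizing before comparing matters here: without it, the first two rows would be treated as different records.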
Handling missing values
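One simple way to fill in missing values is mean imputation: each missing entry is replaced by the mean of the observed entries of the same attribute. The sketch below assumes missing values are represented as `None`; the function name `impute_mean` and the sample ages are hypothetical. Other strategies (dropping the record, using the median or the most frequent value) may suit a given dataset better.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical attribute where two people declined to give their age.
ages = [25, None, 35, 30, None]
filled = impute_mean(ages)
```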
Data Integration
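As a rough illustration of building a single, consistent view from two sources, the sketch below joins a customer list with an order list on a shared key. The field names (`customer_id`, `amount`, `total_spent`), the function name `integrate`, and the sample rows are all hypothetical.

```python
def integrate(customers, orders):
    """Attach each customer's total order amount to the customer record."""
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0) + order["amount"]
    # Customers without any order still appear, with a total of 0.
    return [
        {**customer, "total_spent": totals.get(customer["customer_id"], 0)}
        for customer in customers
    ]


# Two hypothetical sources that share the customer_id key.
customers = [{"customer_id": 1, "name": "A"}, {"customer_id": 2, "name": "B"}]
orders = [{"customer_id": 1, "amount": 30}, {"customer_id": 1, "amount": 20}]
unified = integrate(customers, orders)
```

Real sources often disagree on key names, units, or formats, so a cleaning or renaming step is usually needed before a join like this one.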
Data Integration approach
Data transformation
Data transformation
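Two transformations named among the major tasks, normalizing numerical data and creating dummy variables for categorical data, can be sketched as follows. Min-max scaling is used here as one possible normalization; that choice, the function names `min_max_scale` and `one_hot`, and the sample values are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode each label as a row of 0/1 dummy variables, one per category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return rows, categories


scaled = min_max_scale([10, 20, 30])
dummies, categories = one_hot(["red", "blue", "red"])
```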
Data Reduction
Example of data reduction: feature selection
• Feature selection means choosing a subset of the input variables that can
efficiently describe the input data, reducing the effects of noise and
irrelevant variables while still giving good prediction results. It targets:
• Redundant features
– Duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
– Contain no information that is useful for the data mining task at
hand
– Example: a student's ID is often irrelevant to the task of predicting
the student's GPA
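The redundancy between purchase price and sales tax can be made concrete with a correlation check. The sketch below flags feature pairs whose Pearson correlation is close to ±1; using Pearson correlation, the 0.95 threshold, the flat 15% tax rate, and the names `pearson` and `redundant_pairs` are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def redundant_pairs(features, threshold=0.95):
    """Return the pairs of feature names whose |correlation| reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


# Hypothetical data: sales tax is a flat 15% of the purchase price.
features = {
    "purchase_price": [100.0, 250.0, 40.0, 80.0],
    "sales_tax": [15.0, 37.5, 6.0, 12.0],
    "quantity": [1.0, 3.0, 2.0, 5.0],
}
pairs = redundant_pairs(features)
```

Only one feature of each flagged pair needs to be kept. Correlation detects linear redundancy only, so it will miss features that are related in a non-linear way.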
Feature selection (continued)
Types of feature selection
• Filter methods: rank the variables by a scoring criterion and select the
top-ranked ones, independently of any learning algorithm. E.g. gain ratio.
• Wrapper methods: the selection is driven by the specific machine learning
algorithm that we are trying to fit on the given dataset. Candidate feature
subsets are scored against the evaluation criterion, usually with a greedy
search, because evaluating every possible combination of features is rarely
feasible.
• Embedded methods: feature selection happens as part of model training,
which combines advantages of the filter and wrapper approaches.
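A filter method can be sketched with the gain ratio criterion mentioned above: each categorical feature is scored on its own, and the features are then ordered by score. This is a minimal sketch that assumes gain ratio is defined as information gain divided by split information; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(feature, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(features, labels):
    """Order feature names from highest to lowest gain ratio."""
    scores = {name: gain_ratio(column, labels) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: "outlook" determines the label, "windy" does not.
labels = ["yes", "yes", "no", "no"]
features = {
    "windy": ["t", "f", "t", "f"],
    "outlook": ["sunny", "sunny", "rain", "rain"],
}
ranking = rank_features(features, labels)
```

Because no model is trained during scoring, a filter like this is cheap to run, but it judges each feature in isolation and ignores interactions between features.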
Data reduction Techniques
Dimensionality Reduction
• Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
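To give a feel for what PCA does, the sketch below finds only the first principal component of a small dataset, using power iteration on the covariance matrix, and projects the data onto it. This is a teaching sketch under stated assumptions and not how libraries typically implement PCA (they generally rely on SVD or a full eigendecomposition). The function name `first_principal_component`, the fixed iteration count, and the sample rows are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumption: the start vector is not orthogonal to the top component.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # the data has no variance along the current vector
        vec = [x / norm for x in nxt]
    projections = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, projections


# Hypothetical 2-D data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, projections = first_principal_component(rows)
```

For this data the two attributes are reduced to a single coordinate per row with no loss of information, because all points lie on one line.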
Data discretization
• Discretization is the process of converting a continuous attribute into an
ordinal attribute
– Discretization is commonly used in classification
– Many classification algorithms work best if both the independent and
dependent variables have only a few values.
– A potentially infinite number of values are mapped into a small number
of categories.
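The mapping from many continuous values to a few ordered categories can be sketched with equal-width binning. Equal-width binning is only one possible strategy and is an assumption here; the function name `discretize`, the sample ages, and the category labels are hypothetical.

```python
def discretize(values, n_bins):
    """Map each value to the index of its equal-width bin (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of opening a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous attribute mapped to three ordered categories.
ages = [3, 17, 25, 42, 68, 90]
labels = ["young", "middle", "old"]
binned = [labels[i] for i in discretize(ages, 3)]
```

Here the range 3 to 90 is cut into three bins of width 29, so the ages become `young, young, young, middle, old, old`.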
End of Chapter 2