Machine Learning Chapter 2


Chapter 2

Data preprocessing for Machine Learning

Outline
• What is data preprocessing
• Data quality problems
• Data preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction

What is data pre-processing
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and a crucial step in creating a machine learning model.
• Why do we need data preprocessing?
– Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
– Data preprocessing is therefore required to clean the data and make it suitable for a machine learning model, which also increases the model's accuracy and efficiency.
Data pre-processing
• The collection and manipulation of items of data to produce meaningful information.
• The processing may be carried out in any manner detectable by an observer.
• The conversion of data into a usable and desired form, carried out using a predefined sequence of operations, either:
– Manual data processing, or
– Automatic (electronic) data processing
Data pre-processing
• Data processing may involve various processes:

Data quality problems

Common data quality problems include:
• Incomplete data
• Data duplication
• Inconsistent format
• Accessibility
• System upgrades
• Data purging and storage
• Poor organization
Examples of data quality problems
– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
– Fake data
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
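
As a partial answer to the detection question above, here is a minimal sketch (not part of the original slides) that uses pandas to flag missing values, duplicate rows, inconsistent formatting, and outliers; the toy data, column names, and the 1.5 * IQR rule are illustrative assumptions.

# Illustrative sketch: detecting common data quality problems with pandas.
# The toy data, column names, and the 1.5*IQR outlier rule are assumptions.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 25, 200],                          # missing value and an implausible entry
    "income": [30000, 45000, 52000, 30000, 48000],
    "city":   ["Adama", "adama", "Hawassa", "Adama", "Hawassa"],  # inconsistent spelling
})

print(df.isna().sum())                        # missing values per column
print(df.duplicated().sum())                  # exact duplicate rows
print(df["city"].nunique(), df["city"].str.lower().nunique())  # inconsistent categories collapse when normalized

# Simple outlier check using the 1.5 * IQR rule on the "age" column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df.loc[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr), "age"])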
Noise
Outliers
Missing values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate data objects or variables
– Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis
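
A minimal pandas sketch of these options (eliminate, estimate, ignore); the toy temperature readings and names below are assumptions, not from the slides.

# Illustrative sketch: three ways of handling missing values with pandas.
# The temperature readings are made up for demonstration.
import pandas as pd
import numpy as np

temps = pd.Series([21.0, np.nan, 23.5, np.nan, 25.0], name="temperature")

dropped      = temps.dropna()              # eliminate objects with missing values
mean_filled  = temps.fillna(temps.mean())  # estimate: replace with the column mean
interpolated = temps.interpolate()         # estimate: fill from neighbours (time-series style)
ignored_mean = temps.mean()                # many analyses simply skip NaN values

print(dropped, mean_filled, interpolated, ignored_mean, sep="\n\n")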
Quality Data
• Data have quality if they satisfy the requirements of their intended use and avoid the data quality problems above. Quality dimensions include:
• Accuracy, completeness, consistency, timeliness, believability, and interpretability.

Data pre-processing

Data Engineering (Raw data)

Data Engineering (Prepared data)

Feature Engineering
Major Tasks in Data pre-processing
• Data cleaning: this step involves identifying and correcting missing, inconsistent, or irrelevant data. This can include removing duplicate records, filling in missing values, and handling outliers.
• Data integration: this step involves combining data from multiple sources,
such as databases, spreadsheets, and text files. The goal of integration is to
create a single, consistent view of the data.
• Data transformation: this step involves converting the data into a format that
is more suitable for the data mining task. This can include normalizing
numerical data, creating dummy variables, and encoding categorical data.
• Data reduction: this step is used to select a subset of the data that is relevant
to the machine learning task. This can include feature selection.
• Data discretization: this step converts continuous numerical data into categorical data, which can be used by decision trees and other machine learning techniques that work on categorical attributes.
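
The sketch below walks through the cleaning, integration, and transformation tasks above on two tiny made-up tables with pandas and scikit-learn; the tables, column names, and encoding choices are illustrative assumptions, not the chapter's own example.

# Illustrative sketch of the cleaning, integration, and transformation tasks.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Data integration: combine two sources on a shared key
customers = pd.DataFrame({"id": [1, 2, 2, 3],
                          "city": ["Adama", "Hawassa", "Hawassa", None]})
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [120.0, 80.0, 200.0]})
data = customers.merge(orders, on="id", how="inner")

# Data cleaning: drop duplicate records and fill in missing values
data = data.drop_duplicates().fillna({"city": "Unknown"})

# Data transformation: encode the categorical column and normalize the numeric one
data = pd.get_dummies(data, columns=["city"])
data[["amount"]] = StandardScaler().fit_transform(data[["amount"]])

print(data)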
Data Cleaning
Handling missing values
Data Integration
Data Integration approach
Data transformation
Data Reduction
Example of data reduction: feature selection
• Feature selection is selecting a subset of the input variables that can efficiently describe the input data, reduce the effects of noise or irrelevant variables, and still provide good prediction results. It targets:
• Redundant features
– Duplicate much or all of the information contained in one or more
other attributes
– Example: purchase price of a product and the amount of sales tax
paid
• Irrelevant features
– Contain no information that is useful for the data mining task at
hand
– Example: a student's ID is often irrelevant to the task of predicting the student's GPA
Feature selection (cont.)
Types of feature selection:
• Filter methods: use variable ranking techniques as the principal criterion for selecting variables by ordering them, e.g. gain ratio.
• Wrapper methods: the feature selection process is based on a specific machine learning algorithm that we are trying to fit on a given dataset. It typically follows a greedy search approach, evaluating candidate subsets of features against an evaluation criterion.
• Embedded methods: combine the qualities of the filter and wrapper approaches; feature selection is performed as part of training the model itself.
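
The sketch below illustrates the three families with scikit-learn on the built-in Iris data; the particular estimators (mutual information, RFE with logistic regression, a random-forest importance threshold) are illustrative choices, not ones prescribed by the slides.

# Illustrative sketch of filter, wrapper, and embedded feature selection.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: rank features by a score computed independently of any learning algorithm
filter_sel = SelectKBest(mutual_info_classif, k=2).fit(X, y)

# Wrapper: repeatedly fit a model and greedily eliminate the weakest features
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

# Embedded: selection falls out of training the model itself (feature importances)
embedded_sel = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support())   # boolean mask of the selected features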
Data reduction techniques
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
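
A minimal PCA sketch with scikit-learn, again on the Iris data; standardizing the features first and keeping two components are illustrative choices.

# Illustrative sketch: reducing 4 features to 2 principal components with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # 150 x 4  ->  150 x 2

print(X_reduced.shape)
print(pca.explained_variance_ratio_)           # variance captured by each component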
Data discretization
• Discretization is the process of converting a continuous attribute into an
ordinal attribute
– Discretization is commonly used in classification
– Many classification algorithms work best if both the independent and
dependent variables have only a few values.
– A potentially infinite number of values are mapped onto a small number of categories, as in the sketch below.
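
A minimal sketch of mapping a continuous attribute onto a few ordered categories with pandas; the ages, bin edges, and labels are made up for illustration.

# Illustrative sketch: discretizing a continuous "age" attribute into ordinal bins.
import pandas as pd

ages = pd.Series([3, 17, 25, 41, 66, 78], name="age")
age_group = pd.cut(ages,
                   bins=[0, 18, 35, 60, 100],
                   labels=["child", "young adult", "adult", "senior"])
print(age_group)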

End of Chapter Two
