Data Preprocessing Implementation
Data Types
• Categorical Data
• Text Data
What is Data Preprocessing
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and crucial step in creating a machine learning model.
Why do we need Data Preprocessing
• Real-world data generally contains
• noise
• missing values
• It may also be in an unusable format that cannot be fed directly to machine learning models.
Steps for data preprocessing
• Acquire the dataset
• Import all the crucial libraries
• Import the dataset
• Identifying and handling the missing values
• Encoding the categorical data
• Splitting the dataset
• Feature scaling
Acquiring the dataset
• The first step in data preprocessing in machine learning
• The dataset comprises data gathered from multiple, disparate sources, which is then combined in a proper format to form a dataset.
• Dataset formats differ according to use cases.
• A business dataset will be entirely different from a medical dataset.
• A business dataset will contain relevant industry and business data
• A medical dataset will include healthcare-related data.
• Once the dataset is ready, save it in a CSV, HTML, or XLSX file format.
https://www.kaggle.com/datasets https://archive.ics.uci.edu/ml/index.php
Importing the libraries
• NumPy
• It is the fundamental package for scientific computation in Python.
• It is used to perform any type of mathematical operation in the code.
• It also provides support for large multidimensional arrays and matrices.
• Pandas
• Pandas is an open-source Python library for data manipulation and analysis.
• It is used for importing and managing datasets.
• Matplotlib
• Matplotlib is a Python 2D plotting library that is used to plot any type of chart in Python.
Code:
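A minimal sketch of the imports for this exercise, using the conventional aliases np, pd, and plt:

import numpy as np                 # numerical arrays and mathematical operations
import pandas as pd                # importing and managing datasets
import matplotlib.pyplot as plt    # 2D plotting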
Sample dataset
• For our exercise, the dataset is given in the Data.csv file
• It has 10 instances/examples
• It has three independent variables
• Country
• Age
• Salary
• It has one dependent variable
• Purchased
• Two values are missing
• one in the Age independent variable
• one in the Salary independent variable
• One variable is categorical, i.e., Country
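A quick way to confirm this structure with Pandas (a sketch, assuming the file is named Data.csv as stated above):

import pandas as pd

dataset = pd.read_csv('Data.csv')
print(dataset.shape)           # expected (10, 4): 10 instances, 3 features + 1 target
print(dataset.dtypes)          # Country is an object (categorical) column; Age and Salary are numeric
print(dataset.isnull().sum())  # expected: one missing value each in Age and Salary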
Importing the dataset
• Save your Python file in the directory containing the dataset.
• read_csv() is a function of the Pandas library; it reads a CSV file into a DataFrame.
• For every machine learning model, it is necessary to separate the independent variables and the dependent variable in a dataset.
• To extract the independent variables, you can use the iloc[] indexer of the Pandas library.
Code:
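A minimal sketch of this step, assuming the column order Country, Age, Salary, Purchased described earlier:

import pandas as pd

# Read the dataset from the directory containing this script.
dataset = pd.read_csv('Data.csv')

# Independent variables: all rows, every column except the last (Country, Age, Salary).
X = dataset.iloc[:, :-1].values

# Dependent variable: all rows, the last column only (Purchased).
y = dataset.iloc[:, -1].values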
Identifying and handling missing values
• In data preprocessing, it is pivotal to identify and correctly handle missing values.
• If you fail to handle missing values, you might draw inaccurate and faulty conclusions and inferences from the data.
• There are two commonly used methods to handle missing data (ask a domain expert which method to use):
• Deleting a particular row
• Imputing the data
• Replacing with the mean
• Replacing with the median
• Replacing with the most frequently occurring value
• Replacing with a constant value
Deleting a particular row
• You remove a specific row that has a null value for a feature, or a particular column where more than 75% of the values are missing.
• However, this method is not always effective, and it is recommended that you use it only when the dataset has an adequate number of samples.
• You must ensure that deleting the data does not introduce bias.
Code: Deleting rows with nan values
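A short sketch of both deletion strategies in Pandas; the 75% threshold follows the rule of thumb above:

import pandas as pd

dataset = pd.read_csv('Data.csv')

# Drop every row that contains at least one NaN value.
rows_dropped = dataset.dropna(axis=0)

# Keep only the columns in which at most 75% of the values are missing.
cols_dropped = dataset.loc[:, dataset.isnull().mean() <= 0.75]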
Impute data
• This method can add variance to the dataset, and any loss of data can
be efficiently negated.
• Hence, it yields better results compared to the first method (omission
of rows/columns)
Code: Replacing nan values
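A minimal sketch of mean imputation with scikit-learn's SimpleImputer, applied to the numeric Age and Salary columns (columns 1 and 2 of X):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values    # Country, Age, Salary

# Replace NaN values in the numeric columns with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
# strategy='median' replaces with the median instead.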
Replacing nan values (most frequent)
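A sketch of most-frequent (mode) imputation using Pandas; scikit-learn's SimpleImputer with strategy='most_frequent' achieves the same result:

import pandas as pd

dataset = pd.read_csv('Data.csv')

# Replace NaN values in each column with that column's most frequent value.
dataset_filled = dataset.fillna(dataset.mode().iloc[0])
print(dataset_filled.isnull().sum())   # all zeros after imputation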