Data Preprocessing

Data Types
Categorical Data
Text Data
What is Data Preprocessing?
• Data preprocessing is the process of preparing raw data and making it
suitable for a machine learning model.
• It is the first and a crucial step in creating a machine learning model.
Why do we need Data Preprocessing?
• Real-world data generally contains
• noise
• missing values
• It may also be in an unusable format that cannot be used directly by
machine learning models.
Steps for data preprocessing
• Acquire the dataset
• Import all the crucial libraries
• Import the dataset
• Identifying and handling the missing values
• Encoding the categorical data
• Splitting the dataset
• Feature scaling
Acquiring the dataset
• The first step in data preprocessing in machine learning
• The dataset is comprised of data gathered from multiple, disparate
sources, which are then combined in a proper format.
• Dataset formats differ according to use cases.
• A business dataset will be entirely different from a medical dataset.
• A business dataset will contain relevant industry and business data
• A medical dataset will include healthcare-related data.
• Once the dataset is ready, you must save it in a CSV, HTML, or XLSX
file format.
https://www.kaggle.com/datasets https://archive.ics.uci.edu/ml/index.php
Importing the libraries
• Numpy
• It is the fundamental package for scientific computation in Python.
• It is used to perform any kind of mathematical operation in the code.
• It is also used to work with large multidimensional arrays and matrices in your code.
• Pandas
• Pandas is an open-source Python library for data manipulation and analysis.
• It is used for importing and managing the datasets.
• Matplotlib
• Matplotlib is a Python 2D plotting library that is used to plot various types of
charts in Python.
Code:
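A minimal sketch of these imports, using the conventional aliases (np, pd, and plt are conventions, not requirements):

# Core libraries for numerical work, data handling, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt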
Sample dataset
• For our exercise, the dataset is given in the Data.csv file
• It has 10 instances/examples
• It has three independent variables
• Country
• Age
• Salary
• It has one dependent variable
• Purchased
• Two values are missing
• one in the Age variable
• one in the Salary variable
• One variable is categorical, i.e., Country
Importing the dataset
• Save your Python file in the directory
containing the dataset.
• read_csv() is a function of the Pandas
library. This function reads a CSV file
into a DataFrame.
• For every machine learning model, it is
necessary to separate the independent
variables and the dependent variable in
the dataset.
• To extract the independent variables, you
can use the iloc[] indexer of the Pandas
library.
Code:
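A minimal sketch of this step, assuming the file is named Data.csv (as in our sample dataset) and the dependent variable Purchased is the last column:

import pandas as pd

# Read the dataset into a DataFrame (Data.csv must be in the working directory)
dataset = pd.read_csv('Data.csv')

# Independent variables: all rows, every column except the last
X = dataset.iloc[:, :-1].values

# Dependent variable: all rows, last column only (Purchased)
y = dataset.iloc[:, -1].values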
Identifying and handling missing values
• In data preprocessing, it is pivotal to identify and correctly handle
missing values.
• If you fail to handle missing values, you might draw inaccurate and faulty
conclusions and inferences from the data.
• There are two commonly used methods to handle missing data (ask a
domain expert which method to use):
• Deleting a particular row
• Impute the data
• Replacing with the mean
• Replacing with the median
• Replacing with the most frequently occurring value
• Replacing with a constant value
Deleting a particular row
• You remove a specific row that has a null value for a feature or a
particular column where more than 75% of the values are missing.
• However, this method is not 100% efficient, and it is recommended
that you use it only when the dataset has adequate samples.
• You must ensure that after deleting the data, there remains no
addition of bias.
Code: Deleting rows with nan values
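A minimal sketch using pandas' dropna(), assuming the dataset DataFrame loaded earlier:

# Drop every row that contains at least one NaN value
dataset_clean = dataset.dropna(axis=0)

# Alternatively, drop columns with too many missing values
# (roughly: keep only columns with at least 25% non-missing values)
dataset_clean = dataset.dropna(axis=1, thresh=int(0.25 * len(dataset)))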
Impute data
• This method can add variance to the dataset, and any loss of data can
be efficiently negated.
• Hence, it yields better results compared to the first method (omission
of rows/columns)
Code: Replacing nan values
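One possible sketch using scikit-learn's SimpleImputer, assuming Age and Salary occupy columns 1 and 2 of the feature array X (as in our sample dataset):

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN entries in the numeric columns with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])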
Replacing nan values (most frequent)
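The same sketch with the most-frequent value; only the strategy argument changes:

# Replace NaN entries with the most frequently occurring value per column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])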

Replacing nan values (median/mean)
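And likewise for the median (strategy='mean' would use the mean instead):

# Replace NaN entries with the column median
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])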


Splitting the dataset
• Every dataset for a machine learning model must be split into two separate sets:
• training set
• test set
• This is one of the crucial steps of data preprocessing, as it helps enhance the
performance of our machine learning model.
• Suppose we train our machine learning model on one dataset and then test it
on a completely different dataset. The model will then struggle to understand
the correlations in the data.
• Training Set
• Training set denotes the subset of a dataset that is used for training the machine learning model.
• In the training set, you are already aware of the output.
• Test Set
• A test set is the subset of the dataset that is used for testing the machine learning model.
• The model predicts outcomes on the test set, which is used to evaluate the trained model.
• Usually, the dataset is split in a 70:30 or 80:20 ratio.
• 70:30 ratio
• This means that you take 70% of the data for training the model, leaving
the remaining 30% for testing.
• 80:20 ratio
• This means that you take 80% of the data for training the model, leaving
the remaining 20% for testing.
• The code includes four variables:
• X_train – features for the training data
• X_test – features for the test data
• y_train – dependent variable for the training data
• y_test – dependent variable for the test data

• The train_test_split() function includes four parameters:
• The first two are the arrays of data (the features X and the target y).
• The test_size parameter specifies the size of the test set. It may be 0.5, 0.3, or 0.2 –
this specifies the dividing ratio between the training and test sets.
• The last parameter, random_state, sets the seed for the random number generator so
that the output is always the same for a fixed value such as 0.
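A minimal sketch using scikit-learn's train_test_split, assuming the X and y arrays extracted earlier and an 80:20 split:

from sklearn.model_selection import train_test_split

# 80:20 split; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)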
