Group A Assignment No2 Writeup
Assignment No: 2
Title: Data Wrangling
Problem Statement: Create an “Academic performance” dataset of students and perform the
following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Objective: Students should be able to perform data wrangling operations using Python on
any open-source dataset.
Prerequisite:
Theory:
Syntax: RANDBETWEEN(bottom, top), where bottom is the smallest integer and top is the
largest integer RANDBETWEEN will return.
For better understanding and visualization, 20% impurities are added into each
variable of the dataset.
The steps to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other Formats.
Step 2: Enter the name of the dataset and save it as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature, fill the
data by considering the range specified above.
Step 4: In 20% of the data, fill in impurities. The range of Math_Score is [60, 80], so update a few
instances with values below 60 or above 80. Repeat this for Writing_Score [60, 80],
Placement_Score [75, 100], and Club_Join_Date [2018, 2021].
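The same dataset can also be generated directly in Python instead of Excel. The sketch below is a minimal, hypothetical version of the steps above: the column names, row count, and seed are assumptions, and impurities (out-of-range values) are injected into 20% of one column for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50  # assumed number of students

# Step 3 equivalent: fill each feature within its specified range
df = pd.DataFrame({
    "Math_Score": rng.integers(60, 81, n),        # [60, 80]
    "Writing_Score": rng.integers(60, 81, n),     # [60, 80]
    "Placement_Score": rng.integers(75, 101, n),  # [75, 100]
    "Club_Join_Date": rng.integers(2018, 2022, n) # [2018, 2021]
})

# Step 4 equivalent: inject impurities into 20% of Math_Score rows
# by writing values outside the valid [60, 80] range
bad_rows = rng.choice(n, size=n // 5, replace=False)
df.loc[bad_rows, "Math_Score"] = rng.choice([45, 95], size=len(bad_rows))

print(df.head())
```

Using a fixed seed makes the generated dataset reproducible across runs, which is useful when comparing results before and after cleaning.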
1. None: None is a Python singleton object that is often used for missing data in Python code.
2. NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation. Pandas treats None and
NaN as essentially interchangeable for indicating missing or null values. To facilitate this
convention, there are several useful functions for detecting, removing, and replacing null
values in a Pandas DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
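A small sketch of the first four functions on a toy Series (the values here are made up for illustration):

```python
import pandas as pd
import numpy as np

# Both np.nan and None are treated as missing by pandas
s = pd.Series([70, np.nan, 65, None, 80])

print(s.isnull())            # boolean mask: True at missing positions
print(s.isnull().sum())      # count of missing values: 2
print(s.notnull())           # the inverse mask
print(s.dropna())            # drops the two missing entries
print(s.fillna(s.median()))  # replaces missing values with the median (70.0)
```

Note that dropna() and fillna() return new objects by default; the original Series is left unchanged.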
We have all heard the idiom 'odd one out', which refers to something unusual compared to its group.
Similarly, an outlier is an observation in a given dataset that lies far from the rest of the
observations. That means an outlier is vastly larger or smaller than the remaining values in
the set.
An outlier may occur due to the variability in the data, or due to experimental error/human
error.
They may indicate an experimental error or heavy skewness in the data(heavy- tailed
distribution).
In statistics, we have three measures of central tendency namely Mean, Median, and Mode.
They help us describe the data.
The mean is an accurate measure to describe the data when no outliers are present.
The median is used if there is an outlier in the dataset. The mode is used if there is an outlier and
about half or more of the data values are the same.
The mean is the only measure of central tendency that is affected by outliers, which in turn
impacts the standard deviation.
Example: Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.
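The effect of '101' on the mean versus the median can be verified directly with the standard library:

```python
import statistics

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# The outlier drags the mean well above most of the data
print(statistics.mean(sample))    # ~20.08
print(statistics.median(sample))  # 14.0

# Removing the outlier barely moves the median
no_outlier = [x for x in sample if x != 101]
print(statistics.mean(no_outlier))    # ~12.73
print(statistics.median(no_outlier))  # 13
```

The median shifts only from 14.0 to 13, while the mean drops by more than 7, illustrating why the median is preferred when outliers are present.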
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But what if we
have a huge dataset, how do we identify the outliers then? We need to use visualization and
mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Interquartile Range (IQR)
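As a sketch of the IQR method on the sample above: values falling below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged as outliers.

```python
import numpy as np

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# The conventional 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in sample if x < lower or x > upper]
print(outliers)  # [101]
```

Only 101 falls outside the fences, matching the intuition from inspecting the data by eye.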
To remove an outlier, one must delete the corresponding entry from the dataset using its
exact position, because all of the above detection methods ultimately produce a list of the
data items that satisfy the outlier definition according to the method used.
1) These lines import the necessary libraries for data manipulation, handling missing values,
statistical operations, visualization, and other operations on data.
import pandas as pd
# NumPy provides support for large, multi-dimensional arrays and matrices, along with
# mathematical functions to operate on these arrays.
import numpy as np
from sklearn.impute import SimpleImputer  # filling missing values
from scipy.stats import zscore            # z-score computation
import matplotlib.pyplot as plt           # plotting
import seaborn as sns                     # statistical visualization
2) This line loads a dataset from a CSV file into a pandas DataFrame named data.
data = pd.read_csv('academic_performance.csv')  # file name assumed
3) This counts the missing values in each column.
missing_values = data.isnull().sum()
print("Missing Values:")
print(missing_values)
4) This section handles missing values in categorical columns by replacing them with the
most frequent value in each column using the SimpleImputer from scikit-learn.
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
5) This section handles missing values in numeric columns by replacing them with the
median value of each column using the SimpleImputer.
handle_missing_values_numeric_median = SimpleImputer(strategy='median')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
6) This calculates the z-scores for each numeric column in the dataset.
z_scores = data.select_dtypes(include='number').apply(zscore)
7) This line identifies outliers by checking if the z-scores are greater than 3 or less than -3.
outliers = (z_scores > 3) | (z_scores < -3)
8) This masks the outliers in the numeric columns of the dataset with NaN values.
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
9) This section loops over the numeric columns and prints the first few rows of each column
after masking outliers.
for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())
10) This calculates and prints the skewness of the 'Fees' column before transformation.
skew_before = data_no_outliers['Fees'].skew()
print("Skewness before transformation:", skew_before)
11) This applies a square root transformation to the 'Fees' column and stores the
transformed values in a new column named 'Fees_sqrt'.
data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])
12) This calculates and prints the skewness of the 'Fees_sqrt' column after the square root
transformation, then plots the distributions before and after side by side.
skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()
print("Skewness after square root transformation:", skew_after_sqrt)
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data_no_outliers['Fees'], kde=True)
plt.title('Before transformation')
plt.subplot(1, 2, 2)
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True)
plt.title('After square root transformation')
plt.show()
Conclusion: Hence we have implemented this assignment on data wrangling and gained
insights into data cleaning, outlier detection using the Z-score method, and data
transformation.