Group A Assignment No. 2 Writeup


Group A

Assignment No: 2
Title: Data Wrangling

Problem Statement: Create an “Academic performance” dataset of students and perform the
following operations using Python.

1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.

Objective: Students should be able to perform data wrangling operations using Python on
any open-source dataset.

Prerequisites:

1. Basics of Python Programming


2. Concepts of Data Preprocessing, Data Formatting, Data Normalization, and Data
Cleaning.

Theory:

1. Creation of Dataset using Microsoft Excel.

The dataset is created in “CSV” format.

● The name of the dataset is Academics

● The features of the dataset are:

Gender, NationalITy, PlaceofBirth, StageID, GradeID, SectionID, Topic, Semester

Syntax: RANDBETWEEN(bottom, top), where bottom is the smallest integer and top is the
largest integer that RANDBETWEEN will return.

For better understanding and visualization, impurities are added to 20% of the values of
each variable in the dataset.
The steps to create the dataset are as follows:

Step 1: Open Microsoft Excel and click on Save As. Select Other Formats.

Step 2: Enter the name of the dataset and save it as type CSV (MS-DOS).

Step 3: Enter the names of the features as column headers.

Step 4: Fill the data using the RANDBETWEEN function. For every feature, fill the data
within the range specified above.

Step 5: Introduce impurities into about 20% of the data. For example, the valid range of
Math_Score is [60, 80], so update a few instances with values below 60 or above 80. Repeat
this for Writing_Score [60, 80], Placement_Score [75, 100], and Club_Join_Date [2018, 2021].
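As an alternative to the Excel steps above, the same kind of dataset can also be sketched directly in Python. The snippet below is only an illustrative sketch: the column names follow the ranges listed in Step 5, and the output file name academic.csv is an assumption chosen to match the file loaded later in the code explanation.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n = 50

# Generate values within the ranges given in Step 5
# (the high bound of integers() is exclusive, so add 1 to mimic RANDBETWEEN).
df = pd.DataFrame({
    'Math_Score': rng.integers(60, 81, n),
    'Writing_Score': rng.integers(60, 81, n),
    'Placement_Score': rng.integers(75, 101, n),
    'Club_Join_Date': rng.integers(2018, 2022, n),
})

# Add impurities to roughly 20% of the rows:
# some out-of-range values and some missing values.
impure = rng.choice(n, size=n // 5, replace=False)
half = len(impure) // 2
df.loc[impure[:half], 'Math_Score'] = rng.integers(0, 60, half)
df.loc[impure[half:], 'Placement_Score'] = np.nan

df.to_csv('academic.csv', index=False)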

2. Identification and Handling of Null Values


Missing data can occur when no information is provided for one or more items or for a
whole unit. Missing data is a very common problem in real-life scenarios. Missing data is
also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with
missing data, either because it exists and was not collected or because it never existed. For
example, some users being surveyed may choose not to share their income or their address;
in this way, many values end up missing.
In Pandas, missing data is represented by two values:

1. None: None is a Python singleton object that is often used for missing data in Python code.

2. NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation. Pandas treats None and
NaN as essentially interchangeable for indicating missing or null values. To facilitate this
convention, there are several useful functions for detecting, removing, and replacing null
values in a Pandas DataFrame:

 ● isnull()
 ● notnull()
 ● dropna()
 ● fillna()
 ● replace()

Checking for missing values using isnull() and notnull()

● Checking for missing values using isnull()


To check for null values in a Pandas DataFrame, the isnull() function is used. This function
returns a DataFrame of Boolean values which are True for NaN values. Conversely, notnull()
returns True wherever a value is present.
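A minimal sketch of these functions on a toy DataFrame (the column names here are illustrative, not taken from the Academics dataset):

import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [72, np.nan, 65, 80],
                   'grade': ['A', 'B', None, 'A']})

print(df.isnull())        # True where a value is missing (NaN or None)
print(df.notnull())       # True where a value is present
print(df.isnull().sum())  # count of missing values per column

print(df.dropna())        # drop rows that contain any missing value
print(df.fillna({'score': df['score'].median(), 'grade': 'B'}))  # fill per column
print(df.replace({np.nan: 0}))  # replace missing values with a constant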

3. Identification and Handling of Outliers


3.1 Identification of Outliers
One of the most important steps in data preprocessing is detecting and treating outliers, as
they can negatively affect statistical analysis and the training of a machine learning model,
resulting in lower accuracy.
1. What are Outliers?

We have all heard of the idiom 'odd one out', which means something unusual in
comparison to the others in a group.

Similarly, an outlier is an observation in a given dataset that lies far from the rest of the
observations. That means an outlier is vastly larger or smaller than the remaining values in
the set.

2. Why do they occur?

An outlier may occur due to variability in the data, or due to experimental or human error.

They may indicate an experimental error or heavy skewness in the data (a heavy-tailed
distribution).

3. What do they affect?

In statistics, we have three measures of central tendency: Mean, Median, and Mode. They
help us describe the data.

The mean is an accurate measure for describing the data when no outliers are present. The
median is used if there is an outlier in the dataset. The mode is used if there is an outlier
AND about half or more of the data is the same.

Among these, the mean is the measure most affected by outliers, which in turn impacts the
standard deviation.

Example: Consider a small dataset, sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say '101' is an outlier that is much larger than the other
values.
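A quick check with Python's statistics module confirms this: removing '101' barely moves the median but shifts the mean substantially.

from statistics import mean, median

sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
print(mean(sample), median(sample))                    # about 20.08 and 14.0
without_outlier = [x for x in sample if x != 101]
print(mean(without_outlier), median(without_outlier))  # about 12.73 and 13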
4. Detecting Outliers

If our dataset is small, we can detect an outlier by just looking at the dataset. But what if we
have a huge dataset? How do we identify the outliers then? We need to use visualization and
mathematical techniques.

Below are some of the techniques for detecting outliers; a short sketch of the last two
follows the list.

 ● Boxplots
 ● Scatterplots
 ● Z-score
 ● Interquartile Range (IQR)
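As a small illustration of the last two techniques, the sketch below applies the Z-score rule and the IQR rule to the sample above; the ±3 and 1.5×IQR cutoffs are conventional defaults, not fixed requirements.

import numpy as np
from scipy.stats import zscore

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = zscore(sample)
print(sample[np.abs(z) > 3])  # [101]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
print(sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)])  # [101]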

3.2 Handling of Outliers:

To remove an outlier, one must delete the corresponding entry from the dataset using its
exact position, because each of the detection methods above ultimately produces a list of the
data items that satisfy the outlier definition according to the method used.

Below are some of the methods for treating outliers; a short sketch follows the list.

 ● Trimming/removing the outlier
 ● Quantile-based flooring and capping
 ● Mean/median imputation
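A minimal sketch of these three treatments on the same sample, using pandas; the Z-score rule marks the outliers, and the 5th/95th percentile caps are an illustrative choice.

import numpy as np
import pandas as pd
from scipy.stats import zscore

s = pd.Series([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])
is_outlier = np.abs(zscore(s)) > 3

# 1. Trimming: drop the outlying entries entirely.
trimmed = s[~is_outlier]

# 2. Quantile-based flooring and capping: clip values to the 5th/95th percentiles.
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

# 3. Median imputation: replace the outliers with the median of the series.
imputed = s.mask(is_outlier, s.median())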
4. Data Transformation

Data transformation is the process of converting raw data into a format or structure that is
more suitable for model building and for data discovery in general. The process of data
transformation is also referred to as extract/transform/load (ETL). The extraction phase
involves identifying and pulling data from the various source systems that create data and
then moving the data to a single repository. Next, the raw data is cleansed, if needed. It is
then transformed into a target format that can be fed into operational systems or into a data
warehouse, a data lake, or another repository for use in business intelligence and analytics
applications. In the transformation phase, the data are transformed into forms that are ideal
for mining. Here, the purpose of the transformation is one of the following: to change the
scale for better understanding of the variable, to convert a non-linear relation into a linear
one, or to decrease skewness and bring the distribution closer to normal.
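For the first purpose, changing the scale, a common example is min-max normalization, which rescales a variable to the [0, 1] range; the sketch below uses an illustrative column, not one from the Academics dataset. The skewness-reduction purpose is demonstrated with a square-root transformation in the code explained below.

import pandas as pd

scores = pd.Series([62, 75, 68, 80, 71])  # illustrative raw scores
scaled = (scores - scores.min()) / (scores.max() - scores.min())
print(scaled.tolist())  # values rescaled to the [0, 1] range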
Explanation of the Code:

1) These lines import the necessary libraries for data manipulation, handling missing values,
encoding categorical variables, statistical operations, and visualization.

import pandas as pd

# numpy provides support for large, multi-dimensional arrays and matrices,
# along with mathematical functions to operate on these arrays
import numpy as np

# SimpleImputer is used for handling missing values in a dataset
from sklearn.impute import SimpleImputer

# LabelEncoder is used for encoding categorical variables into numerical values
from sklearn.preprocessing import LabelEncoder

# zscore measures how many standard deviations a data point is from the mean;
# skew, shapiro, and probplot help assess the shape of a distribution
from scipy.stats import zscore, skew, shapiro, probplot

import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

2) This line loads the dataset from a CSV file into a pandas DataFrame named data.

data = pd.read_csv("/Users/apple/Downloads/BVCOEW/TE SEM 6/DSBDA Practicals/academic.csv")
3) These lines print and display the count of missing values in each column of the dataset
before handling them.

print("Missing values before handling: ")

missing_values = data.isnull().sum()

print("Missing Values:")

print(missing_values)

4) This section handles missing values in categorical columns by replacing them with the
most frequent value in each column using the SimpleImputer from scikit-learn.

handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)

5) This section handles missing values in numeric columns by replacing them with the
median value of each column using the SimpleImputer.

handle_missing_values_numeric_median = SimpleImputer(strategy='median')
data_numeric = data.select_dtypes(include='number')
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)

6) This calculates the z-scores for each numeric column in the dataset.

z_scores = zscore(data.select_dtypes(include='number'), axis=0)

7) This line identifies outliers by checking whether the z-scores are greater than 3 or less than -3.

outliers = (z_scores > 3) | (z_scores < -3)

8) This masks the outliers in the numeric columns of the dataset with NaN values.

data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)

9) This section prints the first few rows of each column after masking outliers.

for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())

10) This calculates and prints the skewness of the 'Fees' column before transformation.

skew_before = data_no_outliers['Fees'].skew()

print(f"\nSkewness before transformation: {skew_before}")

11) This applies a square root transformation to the 'Fees' column and stores the
transformed values in a new column named 'Fees_sqrt'.

data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])

12) This calculates and prints the skewness of the 'Fees_sqrt' column after the square root
transformation.

skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()

print(f"\nSkewness after square root transformation: {skew_after_sqrt}")


13) These lines plot a histogram and a Q-Q plot for the square root-transformed 'Fees'
column, showing the distribution and quantiles of the transformed data.

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)

sns.histplot(data_no_outliers['Fees_sqrt'], kde=True)

plt.title('Histogram of Square Root-transformed Fees')

plt.subplot(1, 2, 2)

probplot(data_no_outliers['Fees_sqrt'], dist="norm", plot=plt)

plt.title('Q-Q Plot of Square Root-transformed Fees')

plt.show()
Conclusion: Hence, we have implemented this assignment on data wrangling and gained
insights and knowledge about data cleaning, outlier handling using the Z-score method, and
data transformation.
