DS&BD Lab Manual


Savitribai Phule Pune University
Faculty of Computer Engineering

310256: Data Science and Big Data Analytics Laboratory
CE (2019 Course)
Semester: VI

Teaching Scheme: Practical: 4 Hrs. / Week
Examination Scheme: Term Work: 50 Marks, Practical: 25 Marks

LABORATORY MANUAL (Version 1.0)

ACADEMIC YEAR 2022-23


VISION

To provide excellent Information Technology education by building a strong teaching and research environment.

MISSION

1) To transform the students into innovative, competent and high-quality IT professionals to meet the growing global challenges.
2) To achieve and impart quality education with an emphasis on practical skills and social relevance.
3) To endeavor for continuous up-gradation of the technical expertise of students to cater to the needs of society.
4) To achieve an effective interaction with industry for mutual benefit.
PROGRAM EDUCATIONAL OBJECTIVES

The students of the Information Technology course, after passing out, will:

1. Possess strong fundamental concepts in mathematics, science, engineering and technology to address technological challenges with emerging trends.

2. Possess knowledge and skills in the field of Computer Science & Engineering and Information Technology for analyzing, designing and implementing multifaceted engineering problems of any domain with innovative and efficient approaches.

3. Acquire an attitude and aptitude for research, entrepreneurship and higher studies in the field of Computer Science & Engineering and Information Technology.

4. Learn commitment to ethical practices and societal contributions through communities and life-long intellect.

5. Attain better communication, presentation, time management and teamwork skills, leading to responsible and competent professionals able to address challenges in the field of IT at a global level.
PROGRAM OUTCOMES
The students in the Information Technology course will attain:

a. An ability to apply knowledge of computing, mathematics (including discrete mathematics as well as probability and statistics), science, engineering and technology.

b. An ability to define a problem and provide a systematic solution with the help of conducting experiments, as well as analyzing and interpreting the data.

c. An ability to design, implement, and evaluate a software or software/hardware system, component, or process to meet desired needs within realistic constraints.

d. An ability to identify, formulate, and provide systematic solutions to complex engineering problems.

e. An ability to use the techniques, skills, modern engineering tools and standard processes necessary for practice as an IT professional.

f. An ability to apply mathematical foundations, algorithmic principles, and Information Technology theory in the modeling and design of computer-based systems with necessary constraints and assumptions.

g. An ability to analyze the local and global impact of computing on individuals, organizations and society.

h. An ability to understand professional, ethical, legal, security and social issues and responsibilities.

i. An ability to function effectively as an individual or as a team member to accomplish desired goals.

j. An ability to engage in life-long learning and continuing professional development to cope with fast changes in technologies/tools with the help of electives, professional organizations and extra-curricular activities.

k. An ability to communicate effectively with the engineering community at large by means of effective presentations, report writing, paper publications and demonstrations.

l. An ability to understand the engineering, management and financial aspects, performance, optimization and time complexity necessary for professional practice.
Savitribai Phule Pune University, Pune
Third Year Computer Engineering (2019 Course)

310256: Data Science and Big Data Analytics

Prerequisite Courses:
• Discrete Mathematics (210241)
• Database Management Systems (310341)

Course Objectives:
1. To understand the need of Data Science and Big Data
2. To understand computational statistics in Data Science
3. To analyze and demonstrate knowledge of statistical data analysis techniques
for decision-making
4. To gain practical, hands-on experience with statistics programming languages
and big data tools

Course Outcomes: After completion of the course, students should be able to

CO1: Analyze needs and challenges for Data Science and Big Data Analytics
CO2: Apply statistics for Big Data Analytics
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
INDEX

Sr. No. | Title

1 | Data Wrangling I: Perform the following operations using Python on any open source dataset (eg. data.csv): (1) Import all the required Python libraries. (2) Locate an open source dataset from the web (eg. https://www.kaggle.com). Provide a clear description of the data and its source (i.e. URL of the web site).

2 | Data Wrangling II: Perform the following operations using Python on any open source dataset (eg. data.csv): (1) Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.

3 | Basic Statistics - Measures of Central Tendencies and Variance: Perform the following operations on any open source dataset (eg. data.csv): (1) Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric variables grouped by one of the qualitative (categorical) variables.

4 | Data Analytics I: Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).

5 | Data Analytics II: Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.

6 | Data Analytics III: (1) Implement the Simple Naïve Bayes classification algorithm using Python/R on the iris.csv dataset. (2) Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision and Recall on the given dataset.

7 | Text Analytics: (1) Extract a sample document and apply the following document preprocessing methods: Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization. (2) Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.

8 | Data Visualization I: Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.

9 | Data Visualization II: Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for the distribution of age with respect to each gender along with the information about whether they survived or not. (Column names: 'sex' and 'age')

10 | Data Visualization III: Download the Iris flower dataset or any other dataset into a DataFrame (eg. https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as: (1) How many features are there and what are their types (e.g., numeric, nominal)? (2) Create a histogram for each feature in the dataset to illustrate the feature distributions.

11 | Write a code in JAVA for a simple WordCount application that counts the number of occurrences of each word in a given input set using the Hadoop MapReduce framework on a local standalone set-up.

12 | Design a distributed application using MapReduce which processes a log file of a system.

13 | Locate a dataset (eg. sample_weather.txt) for working on weather data which reads the text input files and finds the average for temperature, dew point and wind speed.
Assignment No: 1

Aim/Problem Statement:- Data Wrangling I


Perform the following operations using Python on any open source dataset (eg. data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (eg. https://www.kaggle.com). Provide a clear description of the data
and its source (i.e. URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using pandas isnull(), describe() function to get some
initial statistics. Provide variable descriptions. Types of variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct
data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python

In addition to the codes and outputs, explain every operation that you do in the above steps and explain everything
that you do to import/read/scrape the data set.

Theory:-

Introduction to Dataset

A dataset is a collection of records, similar to a relational database table. Records are similar to table
rows, but the columns can contain not only strings or numbers, but also nested data structures such as
lists, maps, and other records.

Instance: A single row of data is called an instance. It is an observation from the domain.

Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute of a data instance. Some features may be inputs to a model (the predictors) and others may be outputs or the features to be predicted.

Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or
ordinal value. You can have strings, dates, times, and more complex types, but typically they are
reduced to real or categorical values when working with traditional machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning methods we
typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to train the
model. It may be called the validation dataset.

Data Represented in a Table:

Data should be arranged in a two-dimensional space made up of rows and columns. This type of data
structure makes it easy to understand the data and pinpoint any problems. An example of some raw data
stored as a CSV (comma separated values).

The representation of the same data in a table is as follows:

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to
store and manipulate data.
A possibly confusing point about pandas data types is that there is some overlap between pandas, Python and NumPy. The table below summarizes the key points.


Pandas dtype | Python type | NumPy type | Usage
object | str or mixed | string_, unicode_, mixed types | Text or mixed numeric and non-numeric values
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers
float64 | float | float_, float16, float32, float64 | Floating point numbers
bool | bool | bool_ | True/False values
datetime64 | NA | datetime64[ns] | Date and time values
timedelta[ns] | NA | NA | Differences between two datetimes
category | NA | NA | Finite list of text values

Python Libraries for Data Science

a) Pandas
Pandas is an open-source Python package that provides high-performance, easy-to-use data structures
and data analysis tools for the labeled data in Python programming language.

What can you do with Pandas?

1. Indexing, manipulating, renaming, sorting and merging data frames
2. Update, add and delete columns from a data frame
3. Impute missing values, handle missing data or NaNs
4. Plot data with histograms or box plots

b) NumPy

One of the most fundamental packages in Python, NumPy is a general-purpose array- processing
package. It provides high-performance multidimensional array objects and tools to work with the arrays.
NumPy is an efficient container of generic multi- dimensional data. NumPy’s main object is the
homogeneous multidimensional array. It is a table of elements or numbers of the same datatype, indexed
by a tuple of positive integers. In NumPy, dimensions are called axes and the number of axes is called
rank. NumPy’s array class is called ndarray aka array.

What can you do with NumPy?

1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic slicing and advanced indexing in NumPy
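
A minimal sketch of these basic operations (the array values are arbitrary, chosen only for illustration):

import numpy as np

# Create a 1-D array and inspect its attributes
a = np.array([1, 2, 3, 4, 5, 6])
print(a.dtype, a.ndim, a.shape)

# Reshape into a 2-D array with 2 rows and 3 columns
b = a.reshape(2, 3)

# Element-wise arithmetic, slicing and flattening
print(b * 10)        # multiply every element by 10
print(b[:, 1])       # slice out the second column
print(b.flatten())   # back to a 1-D array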

c) Matplotlib

This is undoubtedly my favorite and a quintessential Python library. You can create stories with the data
visualized with Matplotlib. Another library from the SciPy Stack, Matplotlib plots 2D figures.

What can you do with Matplotlib?

From histograms, bar plots and scatter plots to area plots and pie plots, Matplotlib can depict a wide range of visualizations. With a bit of effort, you can create just about any visualization with Matplotlib:

● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms


Matplotlib also facilitates labels, grids, legends, and other formatting entities.

d) Seaborn

The official Seaborn documentation defines it as a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.

What can you do with Seaborn?

1. Determine relationships between multiple variables (correlation)


2. Observe categorical variables for aggregate statistics
3. Analyze univariate or bi-variate distributions and compare them between different
data subsets
4. Plot linear regression models for dependent variables
5. Provide high-level abstractions, multi-plot grids
6. Seaborn is a great counterpart to R visualization libraries like corrplot and ggplot.

e) Scikit Learn

Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust machine learning
library for Python. It features ML algorithms like SVMs, random forests, k-means clustering, spectral
clustering, mean shift, cross-validation and more... Even NumPy, SciPy and related scientific operations
are supported by Scikit Learn with Scikit Learn being a part of the SciPy Stack.

What can you do with Scikit Learn?

1. Classification: spam detection, image recognition
2. Regression: drug response, stock prices
3. Clustering: customer segmentation, grouping experiment outcomes
4. Dimensionality reduction: visualization, increased efficiency
5. Model selection: improved accuracy via parameter tuning
6. Pre-processing: preparing input data (e.g. text) for processing with machine learning algorithms
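
As a quick, hedged illustration (not part of the assignment steps), a typical Scikit Learn workflow loads a dataset, splits it, fits a model and scores it:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the bundled iris dataset (features X, class labels y)
X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a simple classifier and report its accuracy on the test set
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))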

Description of Dataset:

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in
Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One
flower species is linearly separable from the other two, but the other two are not linearly separable from
each other.

Total samples: 150

The columns in this dataset are:


1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species

Three different species, each containing 50 samples:


Pandas DataFrame functions for loading the dataset:

1. The dataset is downloaded from the UCI repository.

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

2. Now read the CSV file as a DataFrame in Python from the path (or URL) where it is stored. The Iris data set is stored in .csv format; '.csv' stands for comma separated values. It is easy to load .csv files into a Pandas data frame and perform various analytical operations on them.

Load Iris.csv into a Pandas data frame:

Syntax:

iris = pd.read_csv(csv_url, header = None)

3. The csv file at the UCI repository does not contain the variable/column names. They are located in a separate file.

col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']

4. Read in the dataset from the UCI Machine Learning Repository link and specify the column names to use:

iris = pd.read_csv(csv_url, names = col_names)

# The columns of the resulting DataFrame have different dtypes.
iris.dtypes
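
Putting steps 1-4 together, a minimal runnable sketch for loading and inspecting the dataset could look as follows (the column names are the ones listed above):

import pandas as pd

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

# Read the raw CSV (it has no header row) and attach the column names
iris = pd.read_csv(csv_url, names=col_names)

# Quick inspection: dimensions, data types and the first few rows
print(iris.shape)
print(iris.dtypes)
print(iris.head())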

5. Pandas DataFrame functions for data preprocessing (DataFrame operations):

Sr. No. | DataFrame Function | Description
1 | dataset.head(n=5) | Return the first n rows.
2 | dataset.tail(n=5) | Return the last n rows.
3 | dataset.index | The index (row labels) of the Dataset.
4 | dataset.columns | The column labels of the Dataset.
5 | dataset.shape | Return a tuple representing the dimensionality of the Dataset.
6 | dataset.dtypes | Return the dtypes in the Dataset. This returns a Series with the data type of each column. The result's index is the original Dataset's columns. Columns with mixed types are stored with the object dtype.
7 | dataset.columns.values | Return the column values of the Dataset in array format.
8 | dataset.describe(include='all') | Generate descriptive statistics: basic statistical details like percentile, mean, std etc. of a data frame or a series of numeric values. Analyzes both numeric and object series, as well as Dataset column sets of mixed data types.
9 | dataset['Column name'] | Read the data column-wise.
10 | dataset.sort_index(axis=1, ascending=False) | Sort object by labels (along an axis).
11 | dataset.sort_values(by="Column name") | Sort values by column name.
12 | dataset.iloc[5] | Purely integer-location based indexing for selection by position.
13 | dataset[0:3] | Selecting via [], which slices the rows.
14 | dataset.loc[:, ["Col_name1", "col_name2"]] | Selection by label.
15 | dataset.iloc[:n, :] | A subset of the first n rows of the original data.
16 | dataset.iloc[:, :n] | A subset of the first n columns of the original data.
17 | dataset.iloc[:m, :n] | A subset of the first m rows and the first n columns.
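
A short example exercising a few of these functions (it assumes the iris DataFrame loaded earlier is used as dataset):

dataset = iris  # assuming the DataFrame loaded in the previous step

print(dataset.head(3))                                  # first three rows
print(dataset.describe(include='all'))                  # summary statistics for all columns
print(dataset.sort_values(by='Sepal_Length').head())    # rows ordered by sepal length
print(dataset.loc[:, ['Sepal_Length', 'Species']])      # selection by label
print(dataset.iloc[:5, :2])                             # first 5 rows, first 2 columns by position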

A few examples of iloc to slice data for the iris dataset:

Sr. No. | DataFrame Function | Description
1 | dataset.iloc[3:5, 0:2] | Slice the data.
2 | dataset.iloc[[1, 2, 4], [0, 2]] | Select by lists of integer position locations, similar to the NumPy/Python style.
3 | dataset.iloc[1:3, :] | For slicing rows explicitly.
4 | dataset.iloc[:, 1:3] | For slicing columns explicitly.
5 | dataset.iloc[1, 1] | For getting a value explicitly.
6 | dataset['SepalLengthCm'].iloc[5] | Accessing columns and rows by position.
7 | cols_2_4 = dataset.columns[2:4]; dataset[cols_2_4] | Get the column names, then get data from those columns.
8 | dataset[dataset.columns[2:4]].iloc[5:10] | The above two commands combined in one expression.

Checking for Missing Values in the Dataset:

● isnull() is the function used to check for missing (null) values in pandas.
● isna() is an equivalent function; combined with sum() it gives the column-wise and row-wise counts of missing values.
● The dataset considered for explanation is:


a. Is there any missing value in the dataframe as a whole?

Function: DataFrame.isnull()
Output:

b. Is there any missing value across each column?

Function: DataFrame.isnull().any()
Output:

c. Count of missing values across each column using isna() and isnull():
In order to get the count of missing values of the entire dataframe, the isnull() function is used together with sum(). sum() does the column-wise sum first, and applying sum() again gives the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output: 8

d. Count of row-wise missing values using isnull():
Function: dataframe.isnull().sum(axis = 1)
Output:

e. Count of column-wise missing values using isnull():
Method 1:
Function: dataframe.isnull().sum()

Output:

Method 2:
Function: dataframe.isna().sum()

f. Count of missing values of a specific column:

Function: dataframe.col_name.isnull().sum()

df1.Gender.isnull().sum()
Output: 2

g. Group-by count of missing values of a column:
In order to get the count of missing values of a particular column by group, we use isnull() and sum() with apply() and groupby(), which performs the group-wise count of missing values as shown below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda x: x.isnull().sum())
Output:
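
A self-contained sketch of the counting patterns above; the small frame df1 with Gender and Score columns is a hypothetical example built only for illustration:

import pandas as pd
import numpy as np

# Hypothetical data with a few missing values
df1 = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F', 'M'],
                    'Score': [56, np.nan, 45, np.nan, 67]})

print(df1.isnull().sum().sum())     # total number of missing values in the frame
print(df1.isnull().sum(axis=1))     # row-wise count of missing values
print(df1.isnull().sum())           # column-wise count of missing values

# Group-wise count of missing Score values per Gender
print(df1.groupby(['Gender'])['Score'].apply(lambda x: x.isnull().sum()))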

6. Panda functions for Data Formatting and Normalization

The Transforming data stage is about converting the data set into a format that can be

analyzed or modelled effectively, and there are several techniques for this process.

a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the exact format to use
special date-time functions.

Functions used for data formatting

Sr. No. | DataFrame Function | Description
1 | df.dtypes | To check the data types.
2 | df['petal length (cm)'] = df['petal length (cm)'].astype("int") | To change the data type (the data type of 'petal length (cm)' is changed to int).

b. Data normalization: Data normalization involves mapping all the numeric data values onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with statistical analysis and ensures better comparisons later on. It is also known as Min-Max scaling.

Algorithm:

Step 1: Import pandas and the sklearn preprocessing module

import pandas as pd
from sklearn.datasets import load_iris
from sklearn import preprocessing
Step 2: Load the iris dataset into dataframe object df
iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

Step 3: Print the iris dataset.
df.head()
Step 4: Create x, where x holds the values of one feature column (e.g. 'sepal length (cm)') as floats
x = df[['sepal length (cm)']].values.astype(float)

Step 5: Create a minimum and maximum processor object


min_max_scaler = preprocessing.MinMaxScaler()
Step 6: Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)
Step 7:Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
Step 8: View the dataframe
df_normalized
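
The steps above collected into one runnable sketch; scaling a single feature column ('sepal length (cm)') is an illustrative choice, and all feature columns could be scaled the same way:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn import preprocessing

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Values of one feature column as a float array
x = df[['sepal length (cm)']].values.astype(float)

# Min-Max scaling maps the values into the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

df_normalized = pd.DataFrame(x_scaled, columns=['sepal length (scaled)'])
print(df_normalized.head())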

Output: After Step 3:


Output after step 8:

7. Panda Functions for handling categorical variables

● Categorical variables have values that describe a ‘quality’ or ‘characteristic’


of a data unit, like ‘what type’ or ‘which category’.
● Categorical variables fall into mutually exclusive (in one category or in
another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a
non-numeric value.
● Categorical features refer to string type data and can be easily understood by
human beings. But in case of a machine, it cannot interpret the categorical
data directly. Therefore, the categorical data must be translated into numerical
data that can be understood by machine.
There are many ways to convert categorical data into numerical data. Here the three most used
methods are discussed.
a. Label Encoding: Label Encoding refers to converting the labels into a numeric form
so as to convert them into the machine-readable form. It is an important preprocessing
step for the structured dataset in supervised learning.
Example : Suppose we have a column Height in some dataset. After applying label
encoding, the Height column is converted into:


where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.

Sklearn Functions for Label Encoding:

● preprocessing.LabelEncoder: Encodes labels with values between 0 and n_classes-1.

● fit_transform(y):

Parameters: y, array-like of shape (n_samples,): the target values.

Returns: array-like of shape (n_samples,): the encoded labels.

This transformer should be used to encode target values, and not the input.

Algorithm:

Step 1 : Import pandas and sklearn library for preprocessing


from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-
versicolor', 'Iris-virginica'], dtype=object)

Step 4: Define the label_encoder object, which knows how to understand word labels.

label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.

df['Species'].unique()

Output: array([0, 1, 2], dtype=int64)
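
A consolidated sketch of label encoding; the file name Iris.csv (a CSV containing a 'Species' column) is an assumption:

import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('Iris.csv')   # assumed path to a CSV containing a 'Species' column

# Encode the three species names as integer codes 0, 1, 2
label_encoder = preprocessing.LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
print(df['Species'].unique())                     # expected: array([0, 1, 2])

# The original class names remain recoverable
print(label_encoder.inverse_transform([0, 1, 2]))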


● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.

Limitation: Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may lead to priority issues in the data set: a label with a higher value may be considered to have higher priority than a label with a lower value.

b. One-Hot Encoding:

In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.

In one-hot encoding,

“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
One-hot encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.

Sklearn Functions for One-hot Encoding:

● sklearn.preprocessing.OneHotEncoder(): Encode
categorical integer features using a one-hot aka one-of-K scheme

Algorithm:

Step 1 : Import pandas and sklearn library for preprocessing


from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-
versicolor', 'Iris-virginica'], dtype=object)

Step 4: Apply the label_encoder object for label encoding, then observe the unique values for the Species column.

df['Species'].unique()

Output: array([0, 1, 2], dtype=int64)


Step 5: Remove the target variable from dataset
features_df=df.drop(columns=['Species'])
Step 6: Apply the one-hot encoder to the Species column.
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())

Step 7: Join the encoded values with Features variable


df_encode = features_df.join(enc_df)

Step 8: Observe the merge dataframe

df_encode
Step 9: Rename the newly encoded columns.
df_encode.rename(columns = {0:'Iris-Setosa',
1:'Iris-Versicolor',2:'Iris-virginica'}, inplace = True)
Step 10: Observe the merge dataframe
df_encode

Output after Step 8:

Output after Step 10:
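
The same procedure as a single hedged sketch, continuing from the label-encoded df above; the renamed output columns are the ones used in Step 9:

import pandas as pd
from sklearn import preprocessing

# Keep only the feature columns
features_df = df.drop(columns=['Species'])

# One-hot encode the (label-encoded) Species column
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray(),
                      columns=['Iris-Setosa', 'Iris-Versicolor', 'Iris-virginica'])

# Join the dummy columns back onto the features and inspect the result
df_encode = features_df.join(enc_df)
print(df_encode.head())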

c. Dummy Variable Encoding

Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using the dummy encoding, we need to use only two dummy variables.


In dummy encoding,
"Red" color is encoded as [1 0] vector of size 2.
"Green" color is encoded as [0 1] vector of size 2.
"Blue" color is encoded as [0 0] vector of size 2.
Dummy encoding removes a duplicate category present in one-hot encoding.

Pandas Functions for One-hot Encoding with dummy variables:

● pandas.get_dummies(data, prefix=None, prefix_sep='_',


dummy_na=False, columns=None, sparse=False,
drop_first=False, dtype=None): Convert categorical variable into
dummy/indicator variables.

● Parameters:

data : array-like, Series, or DataFrame
Data of which to get dummy indicators.

prefix : str, list of str, or dict of str, default None
String to append to DataFrame column names.

prefix_sep : str, default '_'
If appending prefix, the separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na : bool, default False
Add a column to indicate NaNs; if False, NaNs are ignored.

columns : list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.


● Return : DataFrame with Dummy-coded data.

Algorithm:

Step 1 : Import pandas and sklearn library for preprocessing


from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-
versicolor', 'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object for label encoding, then observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Apply one-hot encoding with dummy variables for the Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=False)
Step 6: Observe the merged dataframe
one_hot_df
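
For comparison, a hedged one-step pandas alternative; with drop_first=True, get_dummies produces the k-1 dummy variable encoding described above, while drop_first=False (as in Step 5) keeps all k columns:

import pandas as pd

# k-1 dummy columns for the Species variable (dummy variable encoding)
dummy_df = pd.get_dummies(df, prefix='Species', columns=['Species'], drop_first=True)
print(dummy_df.head())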


Conclusion: In this way we have explored the functions of the Python libraries for data preprocessing and data wrangling, and how to handle missing values, on the Iris dataset.

Assignment Question

1. Explain Data Frame with Suitable example.


2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?


Assignment No: 2

Aim/Problem Statement: - Data Wrangling II


Perform the following operations using Python on any open source dataset (eg. data.csv)
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use
any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of
the following reasons: to change the scale for better understanding of the variable, to convert a non-linear relation
into a linear one, or to decrease the skewness and convert the distribution into a normal distribution.

Reason and document your approach properly.

Theory:-
1. Creation of the Dataset using Microsoft Excel.
The dataset is created in "CSV" format.
● The name of the dataset is StudentsPerformance.
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score, Placement_Score, Club_Join_Date.
● Number of Instances: 30

● The response variable is: Placement_Offer_Count.

● Range of Values:
Math_Score [60-80], Reading_Score [75-95], Writing_Score [60-80], Placement_Score [75-100], Club_Join_Date [2018-2021].
● The response variable is the number of placement offers facilitated to a particular student, which largely depends on Placement_Score.
To fill the values in the dataset, the RANDBETWEEN function is used. It returns a random integer between the numbers you specify.
Syntax: RANDBETWEEN(bottom, top), where Bottom is the smallest integer and Top is the largest integer RANDBETWEEN will return.
For better understanding and visualization, 20% impurities are added into each variable of the dataset.
The steps to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other Formats.


Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).

Step 3: Enter the names of the features as column headers.

Step 4: Fill the data by using the RANDBETWEEN function. For every feature, fill the data by considering the above specified range.


One example is given below:

Scroll the cursor down for 30 rows to create 30 instances.

Repeat this for the features Reading_Score, Writing_Score, Placement_Score and Club_Join_Date.

The placement offer count largely depends on the placement score. It is considered that if the placement score is <75, 1 offer is facilitated; if it is between 75 and 85, 2 offers are facilitated; and for scores >85, 3 offers are facilitated. A nested IF formula is used for ease of data filling.


Step 5: Fill impurities into 20% of the data. The range of Math_Score is [60-80]; update a few instance values to be below 60 or above 80. Repeat this for Writing_Score [60-80], Placement_Score [75-100] and Club_Join_Date [2018-2021].

Step 6: To violate the rule of the response variable, update a few values, e.g. if the placement score is greater than 85, facilitate only 1 offer.

The dataset is created with the given description.

2. Identification and Handling of Null Values


Missing Data can occur when no information is provided for one or more items or for a whole
unit. Missing Data is a very big problem in real-life scenarios. Missing Data can also refer to
as NA(Not Available) values in pandas. In DataFrame sometimes many datasets simply arrive


with missing data, either because it exists and was not collected or because it never existed. For example, different users being surveyed may choose not to share their income, or some users may choose not to share their address; in this way many values go missing in datasets.
In Pandas missing data is represented by two value:

1. None: None is a Python singleton object that is often used for missing data in Python
code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.Pandas treat None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful functions
for detecting, removing, and replacing null values in Pandas DataFrame :

● isnull()

● notnull()

● dropna()

● fillna()

● replace()
1. Checking for missing values using isnull() and notnull()

● Checking for missing values using isnull()


In order to check null values in Pandas DataFrame, isnull() function is used. This
function return dataframe of Boolean values which are True for NaN values.

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame


df


Step 4: Use isnull() function to check null values in the dataset.


df.isnull()

Step 5: Create a series that is True for NaN values of a specific column, for example 'math score', and display only the rows where 'math score' is NaN.
series = pd.isnull(df["math score"])
df[series]

● Checking for missing values using notnull()


In order to check null values in Pandas Dataframe, notnull() function is used. This function
return dataframe of Boolean values which are False for NaN values.

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame df


Step 4: Use notnull() function to check null values in the dataset.


df.notnull()

Step 5: Create a series that is True for non-NaN values of a specific column, for example 'math score', and display only the rows where 'math score' is not NaN.
series1 = pd.notnull(df["math score"])
df[series1]


See that there are also categorical values in the dataset, for this, you need to use Label
Encoding or One Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
newdf = df
df

2. Filling missing values using dropna(), fillna(), replace()

In order to fill null values in a datasets, fillna(), replace() functions are used. These
functions replace NaN values with some value of their own. All these functions help in
filling null values in datasets of a DataFrame.

● For replacing invalid entries (e.g. "Na", "na") with NaN while reading the file:

missing_values = ["Na", "na"]
df = pd.read_csv("StudentsPerformanceTest1.csv", na_values = missing_values)
df


● Filling null values with a single value


Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: filling missing value using fillna()
ndf=df
ndf.fillna(0)

Step 5: Filling missing values using the mean, median and standard deviation of that column:

data['math score'] = data['math score'].fillna(data['math score'].mean())

data['math score'] = data['math score'].fillna(data['math score'].median())

data['math score'] = data['math score'].fillna(data['math score'].std())

Replacing missing values in a column with the minimum/maximum value of that column:

data['math score'] = data['math score'].fillna(data['math score'].min())

data['math score'] = data['math score'].fillna(data['math score'].max())

● Filling null values in dataset


To fill null values in the dataset in place, use inplace=True:

m_v = df['math score'].mean()
df['math score'].fillna(value=m_v, inplace=True)
df

● Filling a null values using replace() method

The following line will replace NaN values in the dataframe with the value -99:

ndf.replace(to_replace = np.nan, value = -99)

● Deleting null values using dropna() method


In order to drop null values from a dataframe, dropna() function is used. This function
drops Rows/Columns of datasets with Null values in different ways.
1. Dropping rows with at least 1 null value

2. Dropping rows if all values in that row are missing


3. Dropping columns with at least 1 null value.
4. Dropping Rows with at least 1 null value in CSV file
Algorithm:


Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4:To drop rows with at least 1 null value
ndf.dropna()

Step 5: To Drop rows if all values in that row are missing


ndf.dropna(how = 'all')

Step 6: To Drop columns with at least 1 null value.


ndf.dropna(axis = 1)


Step 7 : To drop rows with at least 1 null value in CSV file.


making new data frame with dropped NA values
new_data = ndf.dropna(axis = 0, how ='any')
new_data

3. Identification and Handling of Outliers


3.1 Identification of Outliers
One of the most important steps as part of data preprocessing is detecting and treating the
outliers as they can negatively affect the statistical analysis and the training process of a
machine learning algorithm resulting in lower accuracy.
1. What are Outliers?
We all have heard of the idiom ‘odd one out' which means something unusual in
comparison to the others in a group.

Similarly, an Outlier is an observation in a given dataset that lies far from the rest of


the observations. That means an outlier is vastly larger or smaller than the remaining values in
the set.

2. Why do they occur?


An outlier may occur due to the variability in the data, or due to experimental
error/human error.

They may indicate an experimental error or heavy skewness in the data(heavy-tailed


distribution).

3. What do they affect?


In statistics, we have three measures of central tendency namely Mean, Median, and
Mode. They help us describe the data.

Mean is the accurate measure to describe the data when we do not have any outliers
present. Median is used if there is an outlier in the dataset. Mode is used if there is an outlier
AND about ½ or more of the data is the same.

‘Mean’ is the only measure of central tendency that is affected by the outliers which in
turn impacts Standard deviation. Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other values.

fig. Computation with and without outlier

From the above calculations, we can clearly say the Mean is more affected than the Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But what
if we have a huge dataset, how do we identify the outliers then? We need to use visualization
and mathematical techniques.


Below are some of the techniques of detecting outliers

● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)

4.1 Detecting outliers using Boxplot:


It captures the summary of the data effectively and efficiently with only a simple box
and whiskers. Boxplot summarizes sample data using 25th, 50th, and 75th percentiles. One
can just get insights(quartiles, median, and outliers) into the dataset by just looking at its
boxplot.
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df

Step 4: Select the columns for the boxplot and draw the boxplot.

col = ['math score', 'reading score', 'writing score', 'placement score']


df.boxplot(col)

Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score'] > 90))
print(np.where(df['reading score'] < 25))
print(np.where(df['writing score'] < 30))

4.2 Detecting outliers using Scatterplot:


It is used when you have paired numerical data, when your dependent variable has multiple values for each reading of your independent variable, or when trying to determine the relationship between the two variables. In the process of utilizing the scatter plot, one can also use it for outlier detection.
To plot the scatter plot one requires two variables that are somehow related to each
other. So here Placement score and Placement count features are used.
Algorithm:
Step 1 : Import pandas , numpy and matplotlib libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot with placement score and placement offer count
fig, ax = plt.subplots(figsize = (18,10))
ax.scatter(df['placement score'], df['placement offer count'])
plt.show()
Labels for the axes can be assigned (optional):
ax.set_xlabel('Placement score')
ax.set_ylabel('Placement offer count')

Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score'] < 50) & (df['placement offer count'] > 1)))
print(np.where((df['placement score'] > 85) & (df['placement offer count'] < 3)))

4.3 Detecting outliers using Z-Score:


Z-score is also called a standard score. This value helps us understand how far a data point is from the mean. After setting up a threshold value, one can use the z-scores of data points to define the outliers.
Zscore = (data_point - mean) / std. deviation

Algorithm:
Step 1 : Import numpy and stats from scipy libraries


import numpy as np

from scipy import stats

Step 2: Calculate the Z-score for the 'math score' column

z = np.abs(stats.zscore(df['math score']))
Step 3: Print the Z-score values. It prints the z-score of each data item of the column
print(z)

Step 4: Now, to define an outlier, a threshold value is chosen (a value of 2.5 or 3 is commonly used).

threshold = 3
Step 5: Display the sample outliers, i.e. the data points whose absolute z-score exceeds the threshold
sample_outliers = np.where(z > threshold)
sample_outliers
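
A self-contained sketch of the same idea on the small sample used earlier in this assignment; the threshold of 3 is an illustrative choice:

import numpy as np
from scipy import stats

sample = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9])

# Absolute z-score of every value in the sample
z = np.abs(stats.zscore(sample))

# Values whose z-score exceeds the threshold are flagged as outliers
threshold = 3
print(np.where(z > threshold))   # index of the outlying value
print(sample[z > threshold])     # the outlying value itself (here 101)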

4.4 Detecting outliers using Inter Quantile Range(IQR):


The IQR (Inter Quartile Range) approach to finding outliers is the most commonly used and most trusted approach in the research field.

IQR = Quartile3 - Quartile1

To define the outlier base values, an upper and a lower bound are defined above and below the dataset's normal range (a margin of 1.5*IQR is considered):

upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR

In the above formula, according to statistics, a 0.5 scale-up of the IQR (new_IQR = IQR + 0.5*IQR) is taken.


Algorithm:
Step 1 : Import numpy library
import numpy as np

Step 2: Sort Reading Score feature and store it into sorted_rscore.


sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1, q3)

Step 5: Calculate value of IQR (Inter Quartile Range)


IQR = q3-q1
Step 6: Calculate and print Upper and Lower Bound to define the
outlier base value.
lwr_bound = q1-(1.5*IQR)
upr_bound = q3+(1.5*IQR)
print(lwr_bound, upr_bound)

Step 7: Print Outliers


r_outliers = []
for i in sorted_rscore:
if (i<lwr_bound or i>upr_bound):
r_outliers.append(i)
print(r_outliers)
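
Equivalently, the same bounds can be applied without an explicit loop, using boolean masking (a sketch that assumes df and the bounds computed above):

import numpy as np

# Boolean mask of reading scores outside [lwr_bound, upr_bound]
scores = df['reading score']
outlier_mask = (scores < lwr_bound) | (scores > upr_bound)

print(scores[outlier_mask].tolist())   # the outlying reading scores
print(np.where(outlier_mask)[0])       # their row positions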

3.2 Handling of Outliers:


To remove an outlier, one must remove the corresponding entry from the dataset using its exact position, because in all the above detection methods the end result is the list of data items that satisfy the outlier definition according to the method used.

Below are some of the methods of treating the outliers

● Trimming/removing the outlier


● Quantile based flooring and capping


● Mean/Median imputation
Trimming/removing the outlier:
In this technique, we remove the outliers from the dataset. Although it is not a good
practice to follow.
new_df=df
for i in sample_outliers:
new_df.drop(i,inplace=True)
new_df

Here sample_outliers contains the indices of the detected outliers; in this sample run the instances with index 0, 12, 16 and 17 are deleted.

● Quantile based flooring and capping:

In this technique, the outlier is capped at a certain value above the 90th percentile value or floored at a value below the 10th percentile value.

df = pd.read_csv("/demo.csv")
df_stud = df
ninetieth_percentile = np.percentile(df_stud['math score'], 90)
b = np.where(df_stud['math score'] > ninetieth_percentile, ninetieth_percentile, df_stud['math score'])


print("New array:",b)

df_stud.insert(1,"m score",b,True)
df_stud

● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for reading score:
col = ['reading score']
df.boxplot(col)

2. Outliers are seen in box plot.


3. Calculate the median of reading score by using sorted_rscore
median=np.median(sorted_rscore)


median
4. Replace the upper bound outliers using median value
refined_df=df
refined_df['reading score'] = np.where(refined_df['reading
score'] >upr_bound, median,refined_df['reading score'])
5. Display refined_df

6. Replace the lower bound outliers using median value refined_df['reading


score'] = np.where(refined_df['reading score'] <lwr_bound,
median,refined_df['reading score'])
7. Display refined_df

8. Draw the box plot for refined_df:
col = ['reading score']


refined_df.boxplot(col)

4. Data Transformation for the purpose of :


Data transformation is the process of converting raw data into a format or structure that is more suitable for model building and also for data discovery in general. The process of data transformation can also be referred to as extract/transform/load (ETL). The extraction phase involves identifying and pulling data from the various source systems that create data and then moving the data to a single repository. Next, the raw data is cleansed, if needed. It is then transformed into a target format that can be fed into operational systems or into a data warehouse, a data lake or another repository for use in business intelligence and analytics applications. The data are transformed in ways that are ideal for mining. Data transformation involves the following steps:

● Smoothing: It is a process that is used to remove noise from the dataset using some algorithms. It allows important features present in the dataset to be highlighted and helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting data
in a summary format. The data may be obtained from multiple data sources to integrate
these data sources into a data analysis description. This is a crucial step since the
accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used.
● Generalization:It converts low-level data attributes to high-level data attributes
using concept hierarchy. For Example Age initially in Numerical form (22, 25) is


converted into categorical value (young, old).


● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization) the
values of an attribute (A), are normalized based on the mean of A and its standard
deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
● Attribute or feature construction.
○ New attributes constructed from the given ones: Where new attributes are created
& applied to assist the mining process from the given set of attributes. This simplifies
the original data & makes the mining more efficient.
In this assignment, the purpose of the transformation should be one of the following reasons:
a. To change the scale for better understanding (attribute or feature construction).
Here the Club_Join_Date is transformed into Duration.

Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df


Step 4: Change the scale of the joining year to a duration.
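
No code is shown in the manual for this step; one possible hedged sketch, assuming the column is named 'Club_Join_Date' and holds the joining year, and taking 2022 as the reference year, is:

import pandas as pd

df = pd.read_csv("/content/demo.csv")

# Assumed column name and reference year; adjust to match the actual CSV header
reference_year = 2022
df['Duration'] = reference_year - df['Club_Join_Date']
print(df[['Club_Join_Date', 'Duration']].head())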

b. To decrease the skewness and convert distribution into normal distribution


(Normalization by decimal scaling)
Data Skewness: It is asymmetry in a statistical distribution, in which the curve
appears distorted or skewed either to the left or to the right. Skewness can be
quantified to define the extent to which a distribution differs from a normal
distribution.
Normal Distribution: In a normal distribution, the graph appears as a classical,
symmetrical “bell-shaped curve.” The mean, or average, and the mode, or maximum
point on the curve, are equal.

Positively Skewed Distribution


A positively skewed distribution means that the extreme data results are larger. This
skews the data in that it brings the mean (average) up. The mean will be larger than the
median in a Positively skewed distribution.
A negatively skewed distribution means the opposite: that the extreme data results
are smaller. This means that the mean is brought down, and the median is larger than
the mean in a negatively skewed distribution.


Reducing skewness: A data transformation may be used to reduce skewness. A distribution that is symmetric or nearly so is often easier to handle and interpret than a skewed distribution. The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2 of x, is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It cannot be applied to zero or negative values.

Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and remove the
outliers.
Step 2: Observe the histogram for the math score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind = 'hist')
Step 3: Convert the variables to logarithm at the scale 10.
df['log_math'] = np.log10(df['math score'])

Step 4: Observe the histogram for math_score variable.


df['log_math'].plot(kind = 'hist')

It is observed that skewness is reduced at some level.


Conclusion: In this way we have explored the functions of the Python libraries for identifying and handling outliers. Data transformation techniques were explored with the purpose of creating a new variable and reducing skewness in the dataset.
Assignment Question:
1. Explain the methods to detect outliers.
2. Explain data transformation methods.
3. Write the algorithm to display the statistics of null values present in the dataset.
4. Write an algorithm to replace the outlier values with the mean of the variable.


Assignment No: 3

Aim/Problem Statement:-Basic Statistics - Measures of Central Tendencies and


Variance
Perform the following operations on any open source dataset (eg. data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset
(age, income etc.) with numeric variables grouped by one of the qualitative (categorical) variable. For
example, if your categorical variable is age groups and quantitative variable is income, then provide
summary statistics of income grouped by the age groups. Create a list that contains a numeric value for
each response to the categorical variable.
2. Write a Python program to display basic statistical details such as percentile, mean and standard deviation for the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.

Provide the code with outputs and explain everything that you do in this step.

Theory:-

Statistical Inference:
Statistical inference is the process of generating conclusions about a population from a noisy sample. Without statistical inference we're simply living within our data. With statistical inference, we're trying to generate new knowledge.

Statistical analysis and probability influence our lives on a daily basis. Statistics is used to predict the
weather, restock retail shelves, estimate the condition of the economy, and much more. Used in a variety
of professional fields, statistics has the power to derive valuable insights and solve complex problems in
business, science, and society. Without hard science, decision making relies on emotions and gut
reactions. Statistics and data override intuition, inform decisions, and minimize risk and uncertainty.

In data science, statistics is at the core of sophisticated machine learning algorithms, capturing and
translating data patterns into actionable evidence. Data scientists use statistics to gather, review, analyze,
and draw conclusions from data, as well as apply quantified mathematical models to appropriate
variables.

Data science knowledge is grouped into three main areas: computer science; statistics and mathematics;
and business or field expertise. These areas separately result in a variety of careers. Combining computer science and statistics without business knowledge enables
professionals to perform an array of machine learning functions. Computer science and business
expertise leads to software development skills. Mathematics and statistics (combined with business


expertise) result in some of the most talented researchers. It is only with all three areas combined that
data scientists can maximize their performance, interpret data, recommend innovative solutions, and
create a mechanism to achieve improvements.

Statistical functions are used in data science to analyze raw data, build data models, and infer results.
Below is a list of the key statistical terms:
● Population: the source of data to be collected.
● Sample: a portion of the population.
● Variable: any data item that can be measured or counted.
● Quantitative analysis (statistical): collecting and interpreting data with patterns and data
visualization.
● Qualitative analysis (non-statistical): producing generic information from other non-data forms
of media.
● Descriptive statistics: characteristics of a population.
● Inferential statistics: predictions for a population.
● Central tendency (measures of the center): mean (average of all values), median (central value of
a data set), and mode (the most recurrent value in a data set).
● Measures of the Dispersion:
○ Range: the difference between the largest and smallest values in a data set.
○ Variance: the average of the squared deviations of a variable from its expected value (mean).
○ Standard deviation: the square root of the variance; it measures the dispersion of a data set from the mean.
Statistical techniques for data scientists

There are a number of statistical techniques that data scientists need to master. When just starting out, it
is important to grasp a comprehensive understanding of these principles, as any holes in knowledge will
result in compromised data or false conclusions.


General statistics: The most basic concepts in statistics include bias, variance, mean, median, mode, and
percentiles.
Probability distributions: Probability is defined as the chance that something will occur, characterized as
a simple “yes” or “no” percentage. For instance, when weather reporting indicates a 30 percent chance
of rain, it also means there is a 70 percent chance it will not rain. Determining the distribution calculates
the probability that all those potential values in the study will occur. For example, calculating the
probability that the 30 percent chance for rain will change over the next two days is an example of
probability distribution.
Dimension reduction: Data scientists reduce the number of random variables under consideration
through feature selection (choosing a subset of relevant features) and feature extraction (creating new
features from functions of the original features). This simplifies data models and streamlines the process
of entering data into algorithms.
Over and under sampling: Sampling techniques are implemented when data scientists have too much or
too little of a sample size for a classification. Depending on the balance between two sample groups,
data scientists will either limit the selection of a majority class or create copies of a minority class in
order to maintain equal distribution.
Bayesian statistics: Frequency statistics uses existing data to determine the probability of a future event.
Bayesian statistics, however, takes this concept a step further by accounting for factors we predict will
be true in the future. For example, imagine trying to predict whether at least 100 customers will visit
your coffee shop each Saturday over the next year. Frequency statistics will determine probability by
analyzing data from past Saturday visits. But Bayesian statistics will determine probability by also
factoring for a nearby art show that will start in the summer and take place every Saturday afternoon.
This allows the Bayesian statistical model to provide a much more accurate figure.
The goals of inference

1. Estimate and quantify the uncertainty of an estimate of a population quantity (the proportion of
people who will vote for a candidate).

2. Determine whether a population quantity is a benchmark value (“is the treatment effective?”).
3. Infer a mechanistic relationship when quantities are measured with noise (“What is the slope for
Hooke’s law?”)

4. Determine the impact of a policy ("If we reduce pollution levels, will asthma rates decline?").
5. Talk about the probability that something occurs.


Algorithm:-
Step 1. Import Dataset:
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
train_df.shape, test_df.shape

train_df['label']='train'
test_df['label']='test'

combined_data_df=pd.concat([train_df,test_df])
combined_data_df.shape
#The reasons for combining both the training and test datasets are:
#To find missing values in both datasets
#If we need to transform/remove any features, we can do it in both datasets at one time
#To convert categorical variables to numerical variables in both datasets

Step 2. Statistical Inference


#Statistical Analysis
combined_data_df.mean()
combined_data_df.median()
combined_data_df.mode()
combined_data_df.std()
combined_data_df.describe()
combined_data_df.min()
combined_data_df.max()
combined_data_df.dtypes
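For the grouped summary statistics asked for in the problem statement, a minimal sketch assuming the numeric column 'age' grouped by the categorical column 'workclass' (column names taken from the income dataset used here; any numeric/categorical pair works the same way):
grouped = combined_data_df.groupby('workclass')['age']
print(grouped.mean())
print(grouped.median())
print(grouped.min())
print(grouped.max())
print(grouped.std())
# list containing the numeric values for each response (category) of the categorical variable
age_by_workclass = [grp['age'].tolist() for _, grp in combined_data_df.groupby('workclass')]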

Step 3. Handling Missing Values


#Handle the missing value
combined_data_df.info()
#Data types in data set:

#Categorical = 10
#Numerical = 5
#Target =1
combined_data_df.isnull().sum()
combined_data_df.dropna(subset=['workclass','occupation','native-country'],axis=0,inplace=True)
combined_data_df.isnull().sum()
combined_data_df.dropna(subset=['income_>50K'],axis=0,inplace=True)
combined_data_df.isnull().sum()


Step 4. Data Visualization


import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="darkgrid")

#frequency distribution of work class


plt.figure(figsize=(10,10))
sns.countplot(data= combined_data_df, x = combined_data_df['workclass'])
combined_data_df.drop(combined_data_df.index[combined_data_df['workclass'] == 'Without-pay'],
inplace=True)
combined_data_df.shape
plt.figure(figsize=(10,15))
sns.countplot(data= combined_data_df, y = "native-country")
#This graph clearly shows that the given dataset is from the US. We can remove the other countries since almost all records are of US origin.
combined_data_df=combined_data_df[combined_data_df['native-country']=='United-States']
combined_data_df.shape
#Since, now we only have US as the only native country, this feature is useless. We can drop this feature
altogether
combined_data_df=combined_data_df.drop(columns='native-country',axis=1)
combined_data_df.shape
#frequency distribution of education class

plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "education")
combined_data_df['education'] = combined_data_df['education'].replace(['1st-4th','5th-6th'],'elementary-
school')
combined_data_df['education'] = combined_data_df['education'].replace(['7th-8th'],'middle-school')
combined_data_df['education'] = combined_data_df['education'].replace(['9th','10th','11th','12th'],'high-
school')
combined_data_df['education'] = combined_data_df['education'].replace(['Doctorate','Bachelors','Some-
college','Masters','Prof-school','Assoc-voc','Assoc-acdm'],'postsecondary-education')

plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "education")

plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "marital-status")

combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Divorced','Never-
married','Widowed'],'single')


combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Married-civ-
spouse','Separated','Married-spouse-absent','Married-AF-spouse'],'married')
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "marital-status")

plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, y = "occupation")

plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "relationship")

Step 5. convert categorical variable to numerical variable


#Split the dataset into categorical and numerical values
#categorical
cat_columns = [ col for col in list(combined_data_df.columns) if combined_data_df[col].dtype
=='object' and col!= 'label']
cat_columns

#numerical
num_columns = [ col for col in list(combined_data_df.columns) if combined_data_df[col].dtype in ['int64','float64']]
num_columns
fig = plt.figure(figsize=(15,15))
corr_matrix = combined_data_df.corr()
sns.heatmap(data=corr_matrix,annot=True)
plt.show()
combined_data_df.drop(columns='fnlwgt',inplace=True)
combined_data_df.shape
#Converting categorical variables to numerical ( Dummy variables )
#get dummies
features_df = pd.get_dummies(data=combined_data_df, columns=cat_columns)
features_df.shape
features_df.columns
#split your data
train_df = features_df[features_df['label'] == 'train']
test_df = features_df[features_df['label'] == 'test']
# Drop your labels
train_df = train_df.drop('label', axis=1)
test_df = test_df.drop(columns=['label','income_>50K'], axis=1)
train_df.shape, test_df.shape
train_df.columns
train_df.isnull().sum().sum()
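For the second part of the problem statement (statistics of each Iris species), a minimal sketch assuming iris.csv has a 'Species' column holding the three class labels (the column name may differ depending on the file used):
iris = pd.read_csv('iris.csv')
for species in ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']:
    print(species)
    print(iris[iris['Species'] == species].describe())   # count, mean, std, min, percentiles, max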


Input: train.csv, test.csv

Output: Performed statistical analysis on the income prediction dataset and also converted categorical
data into numerical data.

Conclusion:-
● Handled the missing values by dropping them from the dataset
● From the data visualization, combined/categorized the features
● Using dummy variables, converted categorical variables to numerical variables to create a better model.


Assignment No: 4

Aim/Problem Statement:- Data Analytics I

Create a Linear Regression Model using Python/R to predict home prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about
various houses in Boston through different parameters. There are 506 samples and 14 feature variables
in this dataset.
The objective is to predict the value of prices of the house using the given features.

Theory:-
Linear Regression: Linear regression is a machine learning algorithm based on supervised learning that performs a regression task. Regression models a target prediction value based on independent variables and is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.

Linear regression predicts a dependent variable value (y) based on a given independent variable (x), i.e. it finds a linear relationship between x (input) and y (output); hence the name linear regression.

As a simple example, if X (input) is work experience and Y (output) is the salary of a person, the regression line is the best-fit line for the model.

Hypothesis function for Linear Regression :

While training the model we are given :


x: input training data (univariate – one input variable(parameter))
y: labels to data (supervised learning)


When training the model – it fits the best line to predict the value of y for a given value of x. The
model gets the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using our
model for prediction, it will predict the value of y for the
input value of x.

How to update θ1 and θ2 values to get the best fit line ?


Cost Function (J):
To achieve the best-fit regression line, the model aims to predict y such that the error between the predicted value and the true value is minimum. So it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).

The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y).
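In the notation above, the hypothesis and cost function take the standard form:
y_pred = θ1 + θ2 * x
J(θ1, θ2) = sqrt( (1/m) * Σ (pred_i - y_i)^2 ),  where m is the number of training examples.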
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and achieve the best-fit line, the model uses gradient descent. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached; a sketch follows below.
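A minimal NumPy sketch of gradient descent for this single-variable case (the toy data, learning rate and number of iterations are illustrative choices; the mean squared error is used for the gradients, which has the same minimiser as the RMSE):
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=1000):
    theta1, theta2 = 0.0, 0.0            # start from arbitrary (here zero) values
    m = len(x)
    for _ in range(epochs):
        pred = theta1 + theta2 * x
        error = pred - y
        # gradients of the mean squared error with respect to theta1 and theta2
        theta1 -= lr * (2 / m) * error.sum()
        theta2 -= lr * (2 / m) * (error * x).sum()
    return theta1, theta2

x = np.array([1, 2, 3, 4, 5], dtype=float)   # toy data generated from y = 2x + 1
y = np.array([3, 5, 7, 9, 11], dtype=float)
print(gradient_descent(x, y))                # values approach (1, 2)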

Algorithm:-
Step 1: Download the data set of Boston Housing Prices
(https://www.kaggle.com/c/boston-housing).
Step 2: Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 3: Importing Data
from sklearn.datasets import load_boston
boston = load_boston()
Step 4: Converting data from nd-array to data frame and adding feature names to the data
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
Step 5: Adding 'Price' (target) column to the data

data['Price'] = boston.target
Step 6: Getting input and output data and further splitting data to training and testing dataset.

# Input Data
x = boston.data

# Output Data
y = boston.target
Step 7: splitting data to training and testing dataset.

#from sklearn.cross_validation import train_test_split
#the submodule cross_validation has been renamed/deprecated in favour of model_selection
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size =0.2,random_state = 0)

Step 8: #Applying Linear Regression Model to the dataset and predicting the prices.
# Fitting Multi Linear regression model to training model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(xtrain, ytrain)

# predicting the test set results


y_pred = regressor.predict(xtest)

Step 9: Plotting Scatter graph to show the prediction


# results - 'ytrue' value vs 'y_pred' value
plt.scatter(ytest, y_pred, c = 'green')
plt.xlabel("Price: in $1000's")
plt.ylabel("Predicted value")
plt.title("True value vs predicted value : Linear Regression")
plt.show()
Step 10: Results of Linear Regression.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(ytest, y_pred)
print("Mean Square Error : ", mse)

Input: Dataset of Boston Housing Prices. This dataset concerns the housing prices in the housing city
of Boston. The dataset provided has 506 instances with 14 features.

Output: Prediction of Boston Housing Prices by plotting graph to show prediction


Conclusion:- Housing prices of Boston city were predicted using linear regression.

Questions:
1) What is Linear regression
2) What are different types of linear regressions
3) Applications where linear regression is used
4) What are the limitations of linear regression

Assignment No: 5

Aim/Problem Statement:- Data Analytics II


Implement logistic regression using Python/R to perform classification on Social_Network_Ads.csv
dataset. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on
the given dataset.
Theory:-
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a given
set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must
be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic regression is very similar to linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification. The logistic (sigmoid) function, described next, produces the characteristic S-shaped curve.


Note: Logistic regression uses the concept of predictive modelling just as regression does, which is why it is called logistic regression; however, it is used to classify samples, so it falls under classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the probability of
either 0 or 1. Such as values above the threshold value tends to 1, and a value below the
threshold values tends to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:

o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation
by (1-y):

o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:

The above equation is the final equation for Logistic Regression.
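Written out in a standard form (with independent variables x1…xn and coefficients b0…bn), the steps described above are:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
y / (1 - y), which is 0 for y = 0 and infinity for y = 1
log( y / (1 - y) ) = b0 + b1*x1 + b2*x2 + ... + bn*xn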


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.


o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered


types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those
predicted by the machine learning model. This gives us a holistic view of how well our classification
model is performing and what kinds of errors it is making.
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

The target variable has two values: Positive or Negative


The columns represent the actual values of the target variable
The rows represent the predicted values of the target variable
True Positive (TP)
The predicted value matches the actual value
The actual value was positive and the model predicted a positive value
True Negative (TN)
The predicted value matches the actual value
The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
The predicted value was falsely predicted
The actual value was negative but the model predicted a positive value
Also known as the Type 1 error
False Negative (FN) – Type 2 error
The predicted value was falsely predicted
The actual value was positive but the model predicted a negative value
Also known as the Type 2 error
Accuracy measures the proportion of all predictions that are correct.

Precision tells us how many of the cases predicted as positive actually turned out to be positive; it indicates how reliable the model's positive predictions are.

Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
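In terms of the confusion matrix entries defined above:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)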

Algorithm:-

Step 1: Download the data set of Social_Network_Ads


(https://www.kaggle.com/)
Step 2: Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Step 3: Importing Data


dataset = pd.read_csv('Social_Network_Ads.csv')
#select only age and estimated salary as the features
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Step 4: Perform splitting for training and testing. We will take 75% of the data for training, and
test on the remaining data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

Step 5: Scale the features to avoid variation and let the features follow a normal distribution
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 6:Fit the Model

from sklearn.linear_model import LogisticRegression


classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)


Step 7: Predict the labels of test data

y_pred = classifier.predict(X_test)

Step 8: Evaluate performance of the model

from sklearn.metrics import confusion_matrix,classification_report


cm = confusion_matrix(y_test, y_pred)
print(cm)

Step 9: Evaluate accuracy from the confusion matrix

#Accuracy=(TN+TP)/Total

Step 10: Evaluate the error rate (a sketch for Steps 9 and 10 is given below)

#Error_rate=(FN+FP)/Total
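A minimal sketch of Steps 9 and 10, assuming the confusion matrix cm computed in Step 8 (for binary labels, cm.ravel() returns TN, FP, FN, TP):
TN, FP, FN, TP = cm.ravel()
total = TN + FP + FN + TP
accuracy = (TP + TN) / total          # proportion of correct predictions
error_rate = (FP + FN) / total        # equivalently 1 - accuracy
print("Accuracy :", accuracy)
print("Error rate:", error_rate)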

Step 11: Compute Precision and Recall

cl_report=classification_report(y_test,y_pred)

cl_report

Input: Social_Network_Ads.csv dataset

Output: Classification using Logistic Regression, Computed Confusion matrix to find TP, FP, TN, FN,
Accuracy, Error rate, Precision, Recall on the given dataset.

Conclusion:- Implemented logistic regression to perform classification on Social_Network_Ads.csv


dataset using python

Questions:
1) What is logistic regression
2) How it is different from linear regression
3) What are the types of logistic regression
4) What are the limitations of logistic regression
5) List application where logistic regression can be applied


Assignment No: 6

Aim/Problem Statement:- Data Analytics III


Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given
dataset.
Theory:-
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used
for solving classification problems. It is mainly used in text classification that includes a high-
dimensional training dataset. Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine learning models that can make quick
predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object. Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. Such as if the fruit is identified on the bases of
color, shape, and taste, then red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identify that it is an apple without depending on each other.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
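In standard notation the theorem reads:
P(A|B) = P(B|A) * P(A) / P(B)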

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.


o It performs well in Multi-class predictions as compared to the other Algorithms.


o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or not in
a document. This model is also famous for document classification tasks.

Algorithm:-

Step 1: Download the data set of Iris


(https://www.kaggle.com/uciml/iris)
Step 2: Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Step 3: Load the dataset, assign the 4 independent variables to X and the dependent variable 'species' to y, and display the first 5 rows.
dataset = pd.read_csv('iris.csv')
X = dataset.iloc[:,:4].values
y = dataset['species'].values
dataset.head(5)

Step 4: Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Step 5: Feature Scaling


from sklearn.preprocessing import StandardScaler
sc = StandardScaler()


X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Step 6: Training the Naive Bayes Classification model on the Training Set

from sklearn.naive_bayes import GaussianNB


classifier = GaussianNB()
classifier.fit(X_train, y_train)

Step 7: Predicting the Test set results


y_pred = classifier.predict(X_test)
y_pred

Step 8: Confusion Matrix and Accuracy


from sklearn.metrics import confusion_matrix,classification_report
cm = confusion_matrix(y_test, y_pred)
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test, y_pred))
print(cm)
print("Error rate:",(1-accuracy_score(y_test, y_pred)))

Step 9: Compute Precision and Recall

cl_report=classification_report(y_test,y_pred)
print(cl_report)
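Since Iris has three classes, TP, FP, FN and TN are defined per class. A minimal sketch that derives them from the 3 x 3 confusion matrix cm computed in Step 8 (variable names are illustrative):
import numpy as np
for i, label in enumerate(np.unique(y_test)):
    TP = cm[i, i]
    FP = cm[:, i].sum() - TP   # predicted as this class but actually another class
    FN = cm[i, :].sum() - TP   # actually this class but predicted as another class
    TN = cm.sum() - TP - FP - FN
    print(label, " TP:", TP, " FP:", FP, " FN:", FN, " TN:", TN)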

Step 10: #Comparing the Real Values with Predicted Values


df = pd.DataFrame({'Real Values':y_test, 'Predicted Values':y_pred})
df

Input: Iris Dataset

Output: Confusion matrix, Accuracy, Error rate, Precision, Recall on the given dataset.

Conclusion: Implemented successfully Simple Naïve Bayes classification algorithm using Python on
iris.csv dataset

Questions:
1) What is confusion matrix
2) How to calculate Accuracy, precision and recall
3) Explain applications of naïve Bays classification algorithm
4) State advantages and limitations of Naïve Bays algorithm


Assignment No: 7

Aim/Problem Statement:- Text Analytics


Extract a sample document and apply the following document preprocessing methods:
1) Tokenization, POS tagging, stop word removal, stemming and lemmatization
2) Create a representation of the document by calculating term frequency and inverse document frequency (TF-IDF)

Theory:-
Natural language processing is one of the fields in programming where the natural language is processed
by the software. This has many applications like sentiment analysis, language translation, fake news
detection, grammatical error detection etc.
The input in natural language processing is text. The data collection for this text happens from a lot of
sources. This requires a lot of cleaning and processing before the data can be used for analysis.
These are some of the methods of processing the data in NLP:
 Tokenization
 Stop words removal
 Stemming
 Normalization
 Lemmatization
 Parts of speech tagging
Tokenization:
Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words,
sentences called tokens. These tokens help in understanding the context or developing the model for the
NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the
words.
For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’

There are different methods and libraries available to perform tokenization. NLTK, Gensim, Keras are
some of the libraries that can be used to accomplish the task.

Tokenization can be done to either separate words or sentences. If the text is split into words using some
separation technique it is called word tokenization and same separation done for sentences is called
sentence tokenization.
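A minimal sketch of both kinds of tokenization using NLTK (the sample text is illustrative):
import nltk
nltk.download('punkt')                       # tokenizer models, needed once
from nltk.tokenize import word_tokenize, sent_tokenize

sample = "It is raining. Take an umbrella."
print(word_tokenize(sample))                 # word tokenization
print(sent_tokenize(sample))                 # sentence tokenization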
There are various tokenization techniques available which can be applicable based on the language and
purpose of modeling. Below are a few of the tokenization techniques used in NLP.


White Space Tokenization


This is the simplest tokenization technique. Given a sentence or paragraph, it tokenizes it into words by splitting the input whenever a white space is encountered. This is the fastest tokenization technique, but it only works for languages in which white space breaks the sentence into meaningful words, for example English.
Dictionary Based Tokenization
In this method the tokens are found based on the tokens already existing in the dictionary. If the token is
not found, then special rules are used to tokenize it. It is an advanced technique compared to whitespace
tokenizer.
Rule Based Tokenization
In this technique a set of rules is created for the specific problem and tokenization is done based on these rules, for example rules based on the grammar of a particular language.
Regular Expression Tokenizer
This technique uses regular expressions to control the tokenization of text into tokens. Regular expressions can range from simple to complex and are sometimes difficult to comprehend. This technique should be preferred when the above methods do not serve the required purpose. It is a rule-based tokenizer.
Penn TreeBank Tokenization
Tree bank is a corpus created which gives the semantic and syntactical annotation of language. Penn
Treebank is one of the largest treebanks which was published. This technique of tokenization separates
the punctuation, clitics (words that occur along with other words like I’m, don’t) and hyphenated words
together.
Spacy Tokenizer
This is a modern tokenization technique which is fast and easily customizable. It provides the flexibility to specify special tokens that need not be segmented, or need to be segmented using special rules. For example, if you want to keep $ as a separate token, it takes precedence over other tokenization operations.
Moses Tokenizer


This is an advanced tokenizer that was available before SpaCy was introduced. It is basically a collection of complex normalization and segmentation logic which works very well for a structured language like English.
Subword Tokenization
This tokenization is very useful for applications where subwords are significant. In this technique the most frequently used words are given unique ids, and less frequent words are split into subwords that best represent the meaning independently. For example, if the word "few" appears frequently in the text it will be assigned a unique id, whereas "fewer" and "fewest", which are rarer in the text, will be split into subwords like "few", "er" and "est". This keeps the language model from learning "fewer" and "fewest" as two separate words and allows unknown words in the data set to be identified during training. There are different types of subword tokenization, given below; Byte-Pair Encoding and WordPiece are discussed briefly.
 Byte-Pair Encoding (BPE)
 WordPiece
 Unigram Language Model
 SentencePiece
Byte-Pair Encoding (BPE)
This technique is based on concepts from information theory and compression. BPE uses Huffman-style encoding for tokenization, meaning it uses more symbols (embeddings) to represent less frequent words and fewer symbols for more frequently used words.
The BPE tokenization is a bottom-up subword tokenization technique. The steps involved in the BPE algorithm are given below.
1. Start by splitting the input words into single unicode characters, each of which corresponds to a symbol in the final vocabulary.
2. Find the most frequently occurring pair of symbols in the current vocabulary.
3. Add this pair to the vocabulary, so the size of the vocabulary increases by one.
4. Repeat steps 2 and 3 until the defined number of tokens is built or no new combination of symbols exists with the required frequency.
WordPiece
WordPiece is similar to the BPE technique except for the way a new token is added to the vocabulary. BPE merges the most frequently occurring pair of symbols into the vocabulary, while WordPiece also considers the frequency of the individual symbols and merges based on the count below:
Count(x, y) = frequency of (x, y) / ( frequency(x) * frequency(y) )
The pair of symbols with the maximum count is merged into the vocabulary. This allows rarer tokens to be included in the vocabulary compared to BPE.
Tokenization with NLTK
NLTK (Natural Language Toolkit) is an open-source Python library widely used for NLP.

#Tokenization
#Splitting a sentence into words (tokens) with nltk
import nltk as nlp               # nlp is used as an alias for the nltk library in this manual
nlp.download('punkt')            # tokenizer models, needed once
print("\ntokenize with nltk: ", nlp.word_tokenize(data))   # data holds the input text shown below

Input: Text can be sentences, strings, words, characters and large documents.
Now lets create a sentence to understand basics of text mining methods.
Our sentence is "no woman no cry" from Bob Marley
playing swimming dancing played danced

Output: tokenize with nltk: ['Text', 'can', 'be', 'sentences', ',', 'strings', ',', 'words', ',', 'characters', 'and', 'large', 'documents', '.', 'Now', 'lets', 'create', 'a', 'sentence', 'to', 'understand', 'basics', 'of', 'text', 'mining', 'methods', '.', 'Our', 'sentence', 'is', '``', 'no', 'woman', 'no', 'cry', "''", 'from', 'Bob', 'Marley', 'playing', 'swimming', 'dancing', 'played', 'danced']

Stemming:

Stemming is a natural language processing technique that lowers inflection in words to their root forms,
hence aiding in the preprocessing of text, words, and documents for text normalization.
According to Wikipedia, inflection is the process through which a word is modified to communicate
many grammatical categories, including tense, case, voice, aspect, person, number, gender, and mood.
Thus, although a word may exist in several inflected forms, having multiple inflected forms inside the
same text adds redundancy to the NLP process.
As a result, we employ stemming to reduce words to their basic form or stem, which may or may not be
a legitimate word in the language.
For instance, the stem of these three words, connections, connected, connects, is “connect”. On the other
hand, the root of trouble, troubled, and troubles is “troubl,” which is not a recognized word.

Why Stemming is Important?


As previously stated, the English language has several variants of a single term. The presence of these
variances in a text corpus results in data redundancy when developing NLP or machine learning models.
Such models may be ineffective.
To build a robust model, it is essential to normalize text by removing repetition and transforming words
to their base form through stemming.

Application of Stemming
Stemming is employed in information retrieval, text mining, SEO, web search results, indexing, tagging systems and word analysis. For instance, a Google search for "prediction" and "predicted" returns comparable results.

Martin Porter invented the Porter Stemmer, or Porter algorithm, in 1980. Five steps of word reduction are used in the method, each with its own set of mapping rules. Porter Stemmer is the original stemmer and is renowned for its ease of use and rapidity. Frequently, the resultant stem is a shorter word with the same root meaning.
PorterStemmer() is a module in NLTK that implements the Porter Stemming technique. Let us examine
this with the aid of an example.
Example of PorterStemmer()
In the example below, we construct an instance of PorterStemmer() and use the Porter algorithm to stem
the list of words.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ['Connects','Connecting','Connections','Connected','Connection','Connectings','Connect']
for word in words:
    print(word, "--->", porter.stem(word))
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect

Lemmatization:

Lemmatization is another technique which is used to reduce words to a normalized form. In


lemmatization, the transformation uses a dictionary to map different variants of a word back to its root
format. So, with this approach, we are able to reduce non trivial inflections such as “is”, “was”, “were”
back to the root “be”.

Import the WordNetLemmatizer and try it out with the following example.

# lemmatization (words_list holds the lower-cased word tokens of the input text)
nlp.download('wordnet')          # WordNet data, needed once
lemma = nlp.WordNetLemmatizer()
lemma_roots = [lemma.lemmatize(each) for each in words_list]
print("result of lemmatization: ", lemma_roots)


result of lemmatization: ['text', 'can', 'be', 'sentences,', 'strings,', 'words,', 'character', 'and', 'large', 'documents.\nnow', 'let', 'create', 'a', 'sentence', 'to', 'understand', 'basic', 'of', 'text', 'mining', 'methods.', '\nour', 'sentence', 'is', '"no', 'woman', 'no', 'cry"', 'from', 'bob', 'marley\nplaying', 'swimming', 'dancing', 'played', 'danced']

Lemmatization is similar to stemming, with one difference: the final form is also a meaningful word. Unlike lemmatization, the stemming operation does not need a dictionary.

Stopword:

There are different techniques for removing stop words from strings in Python. Stop words are those words in natural language that carry very little meaning, such as "is", "an", "the", etc. Search engines and other enterprise indexing platforms often filter out stop words while fetching results from the database against user queries.

Stop words are often removed from the text before training deep learning and machine learning models, since stop words occur in abundance and hence provide little to no unique information that can be used for classification or clustering.

Removing Stop Words with Python

With the Python programming language we have a myriad of options for removing stop words from strings. We can use one of several natural language processing libraries such as NLTK, SpaCy, Gensim or TextBlob.

There are a number of different approaches, depending on the NLP library in use:

 Stop Words with NLTK
 Stop Words with Gensim
 Stop Words with SpaCy


Using Python's NLTK Library

The NLTK library is one of the oldest and most commonly used Python libraries for Natural Language Processing. NLTK supports stop word removal, and the list of stop words can be found in the corpus module. To remove stop words from a sentence, we divide the text into words and then remove a word if it exists in the list of stop words provided by NLTK, as sketched below.
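A minimal sketch of this approach, assuming the word tokens produced earlier are available in a list named text_tokens (the name is illustrative):
import nltk
nltk.download('stopwords')                   # stop word lists, needed once
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in text_tokens if w.lower() not in stop_words]
print("after stop word removal:", filtered_tokens)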

Steps Involved in the POS tagging

 Tokenize text (word_tokenize)


 apply pos_tag to above step that is nltk.pos_tag(tokenize_text)

NLTK POS tag examples are given below (abbreviation and meaning):

CC: coordinating conjunction
CD: cardinal digit
DT: determiner
EX: existential there
FW: foreign word
IN: preposition/subordinating conjunction
JJ: adjective (large)
JJR: adjective, comparative (larger)
JJS: adjective, superlative (largest)
LS: list item marker
MD: modal (could, will)
NN: noun, singular (cat, tree)
NNS: noun, plural (desks)
NNP: proper noun, singular (Sarah)
NNPS: proper noun, plural (Indians or Americans)
PDT: predeterminer (all, both, half)
POS: possessive ending (parent's)
PRP: personal pronoun (hers, herself, him, himself)
PRP$: possessive pronoun (her, his, mine, my, our)
RB: adverb (occasionally, swiftly)
RBR: adverb, comparative (greater)
RBS: adverb, superlative (biggest)
RP: particle (about)
TO: infinitive marker (to)
UH: interjection (goodbye)
VB: verb, base form (ask)
VBG: verb, gerund (judging)
VBD: verb, past tense (pleaded)
VBN: verb, past participle (reunified)
VBP: verb, present tense, not 3rd person singular (wrap)
VBZ: verb, present tense, 3rd person singular (bases)
WDT: wh-determiner (that, what)
WP: wh-pronoun (who)
WRB: wh-adverb (how)

The above list contains the common NLTK POS tags. The NLTK POS tagger is used to assign grammatical information to each word of the sentence; install and import NLTK and download the tagger package before use.

import nltk
nltk.download('averaged_perceptron_tagger')
# text_tokens holds the word tokens produced earlier
print("\nPOS tagging:", nltk.pos_tag(text_tokens))

POS tagging: [('Text', 'NN'), ('can', 'MD'), ('be', 'VB'), ('sentences', 'NNS'), (',', ','), ('strings', 'NNS'), (',', ','), ('words', 'NNS'), (',', ','), ('characters', 'NNS'), ('and', 'CC'), ('large', 'JJ'), ('documents', 'NNS'), ('.', '.'), ('Now', 'RB'), ('lets', 'VBZ'), ('create', 'VB'), ('a', 'DT'), ('sentence', 'NN'), ('to', 'TO'), ('understand', 'VB'), ('basics', 'NNS'), ('of', 'IN'), ('text', 'NN'), ('mining', 'NN'), ('methods', 'NNS'), ('.', '.'), ('Our', 'PRP$'), ('sentence', 'NN'), ('is', 'VBZ'), ('``', '``'), ('no', 'DT'), ('woman', 'NN'), ('no', 'DT'), ('cry', 'NN'), ("''", "''"), ('from', 'IN'), ('Bob', 'NNP'), ('Marley', 'NNP'), ('playing', 'VBG'), ('swimming', 'VBG'), ('dancing', 'VBG'), ('played', 'VBN'), ('danced', 'VBD')]

What is feature engineering of text data?

The procedure of converting raw text data into a machine-understandable format (numbers) is called feature engineering of text data. Machine learning and deep learning algorithm performance and accuracy are fundamentally dependent on the type of feature engineering techniques used.
We are going to see a feature engineering technique using TF-IDF, along with the mathematical calculation of TF, IDF and TF-IDF, to understand the mathematical logic behind libraries such as TfidfTransformer from the sklearn.feature_extraction package in Python.


TF (Term Frequency):
 Term frequency is simply the count of a word present in a document (sentence).
 TF captures the importance of the word irrespective of the length of the document.
 A word with a frequency of 3 in a sentence of 10 words is not the same as a word with a frequency of 3 in a sentence of 100 words; it should get more importance in the first scenario, and that is what TF reflects.

IDF (Inverse Document Frequency):
 The IDF of each word is the log of the ratio of the total number of documents (rows) to the number of documents in which that word is present.
 IDF measures the rareness of a term: words like 'a' and 'the' show up in all the documents of the corpus, whereas rare words do not appear in all the documents.

TF-IDF:
It is simply the product of TF and IDF, so that both of the drawbacks described above are addressed, which makes predictions and information retrieval relevant.
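With the default settings of sklearn's TfidfTransformer (smooth_idf=True, norm='l2'), the IDF used in the example below is
idf(t) = ln( (1 + n) / (1 + df(t)) ) + 1
where n is the number of documents and df(t) is the number of documents containing term t; each document's TF-IDF vector (tf * idf) is then L2-normalized. For example, 'mouse' occurs in all 5 documents, so its idf is ln(6/6) + 1 = 1.0, while a word occurring in only one document gets ln(6/2) + 1 ≈ 2.0986, which matches the idf_weights printed later in this section.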

Now import TfidfTransformer and CountVectorizer from sklearn.feature_extraction module.


from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer is used to find the word count in each document of a dataset, i.e. to calculate the term frequency.

docs=['the cat see the mouse',


'the house has a tiny little mouse',
'the mouse ran away from the house',
'the cat finally ate the mouse',
'the end of the mouse story'
]


#Extracting features by using TfidfTransformer from sklearn.feature_extraction package


from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

#fit all the sentences using count-vectorizer and get an array of word counts of each document
import pandas as pd
cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)
tf = pd.DataFrame(word_count_vector.toarray(), columns=cv.get_feature_names())
print(tf)

ate away cat end finally from has house little mouse of ran \
0 0 0 1 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 1 1 1 0 0
2 0 1 0 0 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 0 0 0 1 0 0
4 0 0 0 1 0 0 0 0 0 1 1 0

see story the tiny


0 1 0 2 0
1 0 0 1 1
2 0 0 2 0
3 0 0 2 0
4 0 1 2 0

#declare TfidfTransformer() instance and fit it with the above word_count_vector to get IDF and final
normalized features
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(word_count_vector)
idf = pd.DataFrame({'feature_name':cv.get_feature_names(), 'idf_weights':tfidf_transformer.idf_})
print(idf)

feature_name idf_weights
0 ate 2.098612
1 away 2.098612
2 cat 1.693147
3 end 2.098612
4 finally 2.098612
5 from 2.098612
6 has 2.098612
7 house 1.693147
8 little 2.098612
9 mouse 1.000000
10 of 2.098612
11 ran 2.098612
12 see 2.098612
13 story 2.098612


14 the 1.000000
15 tiny 2.098612

#Final feature vectors are


tf_idf = pd.DataFrame(X.toarray() ,columns=cv.get_feature_names())
print(tf_idf)

ate away cat end finally from has \


0 0.000000 0.000000 0.483344 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.493562
2 0.000000 0.457093 0.000000 0.000000 0.000000 0.457093 0.000000
3 0.513923 0.000000 0.414630 0.000000 0.513923 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.491753 0.000000 0.000000 0.000000

house little mouse of ran see story \


0 0.000000 0.000000 0.285471 0.000000 0.000000 0.599092 0.000000
1 0.398203 0.493562 0.235185 0.000000 0.000000 0.000000 0.000000
2 0.368780 0.000000 0.217807 0.000000 0.457093 0.000000 0.000000
3 0.000000 0.000000 0.244887 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.234323 0.491753 0.000000 0.000000 0.491753

the tiny
0 0.570941 0.000000
1 0.235185 0.493562
2 0.435614 0.000000
3 0.489774 0.000000
4 0.468646 0.000000

df = pd.DataFrame({'docs': ["the cat see the mouse",


"the house has a tiny little mouse",
"the mouse ran away from the house",
"the cat finally ate the mouse",
"the end of the mouse story"
]})
print(df)

docs
0 the cat see the mouse
1 the house has a tiny little mouse
2 the mouse ran away from the house
3 the cat finally ate the mouse
4 the end of the mouse story

Input: text document

Output: list of tokens, POS tags, lemmatization and stemming output, TF, IDF and TF-IDF


Conclusion: In this way we have applied document pre-processing methods for tokenization, stop word removal, stemming, lemmatization and PoS tagging on the input document and created a document representation by calculating TF-IDF.

Questions:
1) What is tokenization explain with example
2) Differentiate between stemming and lemmatization
3) Explain TF-IDF with an example
4) What is NLTK?
5) Explain different methods of stopword removal


Assignment No: 8

Aim/Problem Statement:- Data Visualization I

1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to
see if we can find any patterns in the data.

2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.

Theory:-

Data visualization is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and
patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.

The advantages and benefits of good data visualization:

Our eyes are drawn to colors and patterns. We can quickly identify red from blue, square from circle. Our culture
is visual, including everything from art and advertisements to TV and movies. Data visualization is another form
of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends
and outliers. If we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever


stared at a massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be.

Big Data is here and we need to know what it says

As the “age of Big Data” kicks into high-gear, visualization is an increasingly key tool to make sense of the
trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form
easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise
from data and highlighting the useful information. However, it’s not simply as easy as just dressing up a graph to
make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate
balancing act between form and function. The plainest graph could be too boring to catch any notice, or it may tell a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could
speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with
great storytelling.

Algorithm:

Step 1: Load the inbuilt 'titanic' dataset, which is available through the Seaborn library (no separate download is required).

Step 2: Importing Libraries


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt


import seaborn as sns

dataset = sns.load_dataset('titanic')

dataset.head()

Step 3: Draw distributional plot

sns.distplot(dataset['fare'])

Step 4: Removal of Kernal Density line

sns.distplot(dataset['fare'], kde=False)

Step 5: Draw a histogram with 10 bins

sns.distplot(dataset['fare'], kde=False, bins=10)


Step 6: The Joint Plot

sns.jointplot(x='age', y='fare', data=dataset)

Step 7: Draw a hexagonal joint plot

sns.jointplot(x='age', y='fare', data=dataset, kind='hex')

Step 8: The Pair Plot

sns.pairplot(dataset)
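As one way of looking for further patterns (part 1 of the problem statement), a minimal sketch comparing survival counts across gender and passenger class (column names as in the seaborn titanic dataset):
sns.countplot(data=dataset, x='sex', hue='survived')
plt.show()
sns.countplot(data=dataset, x='class', hue='survived')
plt.show()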

Conclusion: Successfully implemented simple data visualization techniques using Python and Seaborn on the Titanic dataset.


Assignment No: 9

Aim/Problem Statement:- Data Visualization II

1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution
of age with respect to each gender along with the information about whether they survived
or not. (Column names : 'sex' and 'age')

Write observations on the inference from the above statistics.

Theory:-
Why data visualization is important

It’s hard to think of a professional industry that doesn’t benefit from making data more understandable. Every
STEM field benefits from understanding data—and so do fields in government, finance, marketing, history,
consumer goods, service industries, education, sports, and so on. While it is easy to wax poetic about data visualization, there are practical, real-life applications that are undeniable. And, since visualization is so prolific, it's also one of the most useful professional skills to develop.
The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can
leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to
accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make
decisions and use visuals to tell stories of when data informs the who, what, when, where, and how. While
traditional education typically draws a distinct line between creative storytelling and technical analysis, the
modern professional world also values those who can cross between the two: data visualization sits right in the
middle of analysis and visual storytelling.

The different types of visualizations

When you think of data visualization, your first thought probably immediately goes to simple bar graphs or pie
charts. While these may be an integral part of visualizing data and a common baseline for many data graphics, the
right visualization must be paired with the right set of information. Simple graphs are only the tip of the iceberg.
There’s a whole selection of visualization methods to present data in effective and interesting ways. Common
general types of data visualization:

● Charts
● Tables
● Graphs
● Maps
● Infographics
● Dashboards

More specific examples of methods to visualize data:

● Area Chart
● Bar Chart
● Box-and-whisker Plots
● Bubble Cloud
● Bullet Graph
● Cartogram
● Circle View
● Dot Distribution Map
● Gantt Chart
● Heat Map
● Highlight Table
● Histogram
● Matrix
● Network
● Polar Area
● Radial Tree
● Scatter Plot (2D or 3D)

Algorithm:-
Step 1: Download the Titanic dataset (train.csv, available from the Kaggle Titanic competition at
https://www.kaggle.com/c/titanic)

Step 2: Importing Libraries


import math #library for mathematical calculation
import numpy as np #library for scientific computing
import pandas as pd #library for easy-to-use data structures and data analysis tools
import matplotlib.pyplot as plt #library for plotting
import seaborn as sns #library for plotting

Step 3: Reading of dataset


df = pd.read_csv('train.csv')

Step 4: Print details of the dataset


print('_'*50)
print('*'*50)
print(df.info(memory_usage=False))
print('_'*50)
Step 5: Finding Null values
df.isnull().sum()

Step 6: Dealing with missing values


#drop the PassengerId, Name and Cabin columns
df = df.drop(['PassengerId', 'Name', 'Cabin'], axis=1)

#drop rows which have a NaN value in any column
df = df.dropna(axis=0, how='any')
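
A quick sanity check (an illustrative addition, not part of the original steps) to confirm that no missing values remain:

# Every column should now report zero missing values; df.shape shows the reduced row count
print(df.isnull().sum())
print(df.shape)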

Step 7: map 0 to not survived and 1 to survived in survived column


df['Survived'] = df['Survived'].map({0: 'Not Survived', 1: 'Survived'})
Step 8: group passengers into bins of children, teenager, adult, senior citizen…
df['Age'] = pd.cut(df['Age'], bins=[-1, 5, 19, 60, 150], labels=['Children', 'Teenager', 'Adult', 'Senior Citizen'], right=True)
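
A small worked example (hypothetical ages, for illustration only) showing how pd.cut assigns these labels with
right=True, i.e. bins (-1, 5], (5, 19], (19, 60] and (60, 150]:

import pandas as pd

ages = pd.Series([2, 15, 35, 70])
print(pd.cut(ages, bins=[-1, 5, 19, 60, 150],
             labels=['Children', 'Teenager', 'Adult', 'Senior Citizen'], right=True))
# 2 -> Children, 15 -> Teenager, 35 -> Adult, 70 -> Senior Citizen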
Step 9: Set the plot
sns.set(context='notebook',style='whitegrid',font_scale=1.5)
Step 10: The two helper methods below display the count value on top of each bar in the bar graphs

def assignAxis(df, g):
    # Iterate over every axes in the FacetGrid figure and annotate its bars
    for i in range(0, g.axes.size):
        ax = g.fig.get_axes()[i]
        displayCount(df, ax)

def displayCount(df, ax):
    # Raise the y-limit so the count labels fit above the tallest bar
    y_max = df.value_counts().max() + 75
    ax.set_ylim(top=y_max)

    # Iterate through the list of axes' patches (one patch per bar)
    for p in ax.patches:
        # A NaN height means this category has a zero count in this facet
        if math.isnan(p.get_height()):
            ax.text(p.get_x() + p.get_width()/2, 0, 0,
                    fontsize=12, color='red', ha='center', va='bottom')
            continue
        else:
            ax.text(p.get_x() + p.get_width()/2, p.get_height(), int(p.get_height()),
                    fontsize=12, color='red', ha='center', va='bottom')

#number of males and females


g = sns.factorplot(x='Sex', data=df,kind='count',size=4, aspect=.8,alpha=0.7,
palette='muted').set(xlabel='Gender',ylabel='Count',title='Number of Male and Female')

assignAxis(df['Sex'],g)
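
Note: factorplot was renamed to catplot in newer seaborn releases (0.9+), where the size argument also became
height. A minimal sketch of the equivalent call for the plot above, assuming a recent seaborn version:

# Equivalent of the factorplot call above: factorplot -> catplot, size -> height
g = sns.catplot(x='Sex', data=df, kind='count', height=4, aspect=.8, alpha=0.7,
                palette='muted').set(xlabel='Gender', ylabel='Count', title='Number of Male and Female')
assignAxis(df['Sex'], g)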
Step 11: Number of males and females based on age
g = sns.factorplot(x='Age', data=df,kind='count',size=4, aspect=1.8, hue='Sex',alpha=0.7,
palette='muted').set(xlabel='Age',ylabel='Count',title='Gender Distribution by Age')

assignAxis(df['Age'],g)
Step 12: People Survived based on gender and age
g = sns.factorplot(x='Survived', data=df, kind='count', size=4, aspect=1.2, hue='Age', col='Sex', alpha=0.7,
palette='muted', order=['Survived', 'Not Survived'], legend_out=True).set(xlabel='Survival', ylabel='Count')
g.fig.subplots_adjust(wspace=.3)

assignAxis(df['Survived'],g)
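
The problem statement for this assignment also asks for a box plot of age with respect to gender and survival, which
the steps above do not draw. A minimal sketch using seaborn's built-in 'titanic' dataset (column names 'sex', 'age',
'survived'); this is an assumption, since the steps above use train.csv and bin Age into categories in Step 8:

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

# Box plot of age for each gender, split by survival status
sns.boxplot(x='sex', y='age', hue='survived', data=titanic)
plt.title('Age distribution by gender and survival')
plt.show()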

Conclusion: Successfully implemented data visualization techniques (count plots of survival by gender and age
group) in Python on the Titanic dataset.


Assignment No: 10

Aim/Problem Statement:- Data Visualization III

Download the Iris flower dataset or any other dataset into a DataFrame.
(e.g., https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.

Theory:-
Data Visualization: Data visualization is the practice of translating information into a visual context, such as a
map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data
visualization is to make it easier to identify patterns, trends and outliers in large data sets. The term is often used
interchangeably with others, including information graphics, information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data has been collected,
processed and modeled, it must be visualized for conclusions to be made. Data visualization is also an element of
the broader data presentation architecture (DPA) discipline, which aims to identify, locate, manipulate, format and
deliver data in the most efficient way possible.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using a Microsoft Excel
spreadsheet to transform the information into a table, bar graph or pie chart. While these visualization methods are
still commonly used, more intricate techniques are now available, including the following:
● infographics
● bubble clouds
● bullet graphs
● heat maps
● fever charts
● time series charts
Some other popular techniques are as follows.
Line charts: This is one of the most basic and common techniques used. Line charts display how variables can
change over time.
Area charts: This visualization method is a variation of a line chart; it displays multiple values in a time series --
or a sequence of data collected at consecutive, equally spaced points in time.
Scatter plots: This technique displays the relationship between two variables. A scatter plot takes the form of an
x- and y-axis with dots to represent data points.
Treemaps: This method shows hierarchical data in a nested format. The size of the rectangles used for each
category is proportional to its percentage of the whole. Treemaps are best used when multiple categories are
present, and the goal is to compare different parts of a whole.
Population pyramids: This technique uses a stacked bar graph to display the complex social narrative of a
population. It is best used when trying to display the distribution of a population.
Common data visualization use cases
Common use cases for data visualization include the following:


Sales and marketing: Research from the media agency Magna predicts that half of all global advertising dollars
will be spent online by 2020. As a result, marketing teams must pay close attention to their sources of web traffic
and how their web properties generate revenue. Data visualization makes it easy to see traffic trends over time as
a result of marketing efforts.
Politics: A common use of data visualization in politics is a geographic map that displays the party each state or
district voted for.
Healthcare: Healthcare professionals frequently use choropleth maps to visualize important health data. A
choropleth map displays divided geographical areas or regions that are assigned a certain color in relation to a
numeric variable. Choropleth maps allow professionals to see how a variable, such as the mortality rate of heart
disease, changes across specific territories.
Scientists: Scientific visualization, sometimes referred to in shorthand as SciVis, allows scientists and researchers
to gain greater insight from their experimental data than ever before.
Finance: Finance professionals must track the performance of their investment decisions when choosing to buy or
sell an asset. Candlestick charts are used as trading tools and help finance professionals analyze price movements
over time, displaying important information, such as securities, derivatives, currencies, stocks, bonds and
commodities. By analyzing how the price has changed over time, data analysts and finance professionals can
detect trends.
Logistics: Shipping companies can use visualization tools to determine the best global shipping routes.

Algorithm:-
Step 1: Download the Iris flower dataset (e.g., from https://archive.ics.uci.edu/ml/datasets/Iris) and save it locally as iris-flower-dataset.csv
Step 2: Importing Libraries
import numpy as np
import pandas as pd
Step 3: Reading the dataset
df = pd.read_csv("iris-flower-dataset.csv")
df
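
Note: the step above assumes the CSV has header names sepal_length, sepal_width, petal_length, petal_width and
species; 'iris-flower-dataset.csv' is a local filename. If the raw UCI iris.data file (which has no header row) is used
instead, the column names can be assigned explicitly, for example:

# Only needed for a header-less file such as the raw UCI iris.data
df = pd.read_csv("iris.data",
                 names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])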
Step 4: Perform statistical analysis on the data
df.mean(numeric_only=True)    # column-wise mean (numeric_only skips the non-numeric 'species' column on recent pandas)
df.median(numeric_only=True)  # column-wise median
df.std(numeric_only=True)     # column-wise standard deviation
df.min()
df.max()
df.describe()                 # summary statistics of the numeric columns
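
As an illustrative extension (not part of the original steps), the same statistics can also be computed per species to
compare the three classes:

# Per-species summary statistics, grouped on the nominal 'species' column
print(df.groupby('species').describe())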
Step 5: Print number of features and their datatypes
column = len(list(df))
column
df.info()
np.unique(df["species"])
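
To answer part 1 of the problem statement (listing the features and their types), a small sketch that classifies each
column as numeric or nominal, assuming the standard iris column names used above:

# Print each feature name with a numeric/nominal label based on its dtype
for col in df.columns:
    kind = 'numeric' if pd.api.types.is_numeric_dtype(df[col]) else 'nominal'
    print(f"{col}: {kind}")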
Step 6: Data Visualization - Create a histogram for each feature in the dataset to illustrate the
feature distributions, and plot each histogram.
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt


%matplotlib inline

fig, axes = plt.subplots(2, 2, figsize=(16, 8))

axes[0,0].set_title("Distribution of Sepal Length")
axes[0,0].hist(df["sepal_length"]);

axes[0,1].set_title("Distribution of Sepal Width")
axes[0,1].hist(df["sepal_width"]);

axes[1,0].set_title("Distribution of Petal Length")
axes[1,0].hist(df["petal_length"]);

axes[1,1].set_title("Distribution of Petal Width")
axes[1,1].hist(df["petal_width"]);
Step 7: Create a boxplot for each feature in the dataset. All of the boxplots should be combined
into a single plot. Compare distributions and identify outliers.
data_to_plot = [df["sepal_length"],df["sepal_width"],df["petal_length"],df["petal_width"]]

sns.set_style("whitegrid")

# Creating a figure instance
fig = plt.figure(1, figsize=(12, 8))

# Creating an axes instance
ax = fig.add_subplot(111)

# Creating the boxplot
bp = ax.boxplot(data_to_plot);
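
To make the comparison in Step 7 easier to read (and outliers simpler to attribute to a feature), an alternative sketch
that labels each box with its feature name using seaborn; this is an optional addition, not part of the original algorithm:

# Reshape the four numeric columns into long form and draw labelled box plots
melted = df.melt(value_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
                 var_name='feature', value_name='value')
sns.boxplot(x='feature', y='value', data=melted)
plt.show()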

Conclusion: Successfully implemented simple data visualization techniques (histograms and box plots) in Python on
the Iris flower dataset.
