DSBDA Lab Manual
WITH AFFILIATION TO
2019 PATTERN (THIRD YEAR)
Group A
Assignment No: 1
Objective of the Assignment: Students should be able to perform the data wrangling
operation using Python on any open source dataset
Prerequisite:
1. Basics of Python Programming
2. Concept of Data Preprocessing, Data Formatting, Data Normalization and Data
Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Dataset
2. Python Libraries for Data Science
3. Description of Dataset
4. Pandas DataFrame functions for loading the dataset
5. Pandas functions for Data Preprocessing
6. Pandas functions for Data Formatting and Normalization
7. Pandas functions for handling categorical variables
---------------------------------------------------------------------------------------------------------------
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors) and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not
used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. An
example of some raw data stored as a CSV (comma separated values).
Pandas dtype   Python type    NumPy type                                           Usage
object         str or mixed   string_, unicode_, mixed types                       Text or mixed numeric and non-numeric values
int64          int            int_, int8, int16, int32, int64, uint8, uint16,      Integer numbers
                              uint32, uint64
b. NumPy
In NumPy, dimensions are called axes and the number of axes is called rank. NumPy's
array class is called ndarray, also known simply as array.
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.
From histograms, bar plots, scatter plots and area plots to pie plots, Matplotlib can depict a wide
range of visualizations. With a bit of effort, you can create just about any visualization with
Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and some more formatting entities.
d. Seaborn
So when you read the official documentation on Seaborn, it is defined as the data
visualization library based on Matplotlib that provides a high-level interface for drawing
attractive and informative statistical graphics. Putting it simply, seaborn is an extension
of Matplotlib with advanced features.
e. Scikit-learn
Introduced to the world as a Google Summer of Code project, Scikit-learn is a robust
machine learning library for Python. It features ML algorithms like SVMs, random
forests, k-means clustering, spectral clustering, mean shift, cross-validation and more...
Even NumPy, SciPy and related scientific operations are supported by Scikit Learn with
Scikit Learn being a part of the SciPy Stack.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Samples: 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
The 3 different species each contain 50 samples.
Description of Dataset:
3. The CSV file at the UCI repository does not contain the variable/column names; they are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Read in the dataset from the UCI Machine Learning Repository link and specify the column
names to use.
iris = pd.read_csv(csv_url, names = col_names)
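Putting these steps together, a minimal end-to-end loading sketch might look like the following (the csv_url shown is the usual UCI location of the Iris data file):
import pandas as pd
# Assumed UCI location of the Iris data file
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
# The raw file has no header row, so the column names are supplied explicitly
iris = pd.read_csv(csv_url, names=col_names)
print(iris.head())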
Pandas DataFrame functions for loading and inspecting the dataset:
● dataset.dtypes : Returns a Series with the data type of each column. The result's index is the
original DataFrame's columns. Columns with mixed types are stored with the object dtype.
● dataset.iloc[:m, :n] : A subset of the first m rows and the first n columns.
A few examples of iloc to slice data for the Iris dataset:
● dataset.iloc[1, 1] : For getting a single value explicitly.
● dataset[cols_2_4] : For selecting columns using a previously defined list of column labels (here named cols_2_4).
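A few more slices of this kind, shown as a hedged sketch on the Iris DataFrame loaded above (the chosen positions are only illustrative):
import pandas as pd
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
dataset = pd.read_csv(csv_url, names=col_names)
print(dataset.iloc[:5, :3])        # first 5 rows, first 3 columns
print(dataset.iloc[1, 1])          # single value at row position 1, column position 1
print(dataset.iloc[:, -1])         # last column (Species) for all rows
cols_2_4 = dataset.columns[2:4]    # illustrative definition of a column-label list
print(dataset[cols_2_4].head())    # select those columns by label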
5. Pandas functions for Data Preprocessing: handling missing values
Function: DataFrame.isnull()
Output:
Function: DataFrame.isnull().any()
Output:
c. Count of missing values across each column using isna() and isnull()
To get the count of missing values of the entire dataframe, the isnull() function is used
together with sum(): the first sum() gives the column-wise counts and a second sum() gives
the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output: 8
d. Count of row-wise missing values using isnull()
Function: dataframe.isnull().sum(axis=1)
Output:
e. Count of column-wise missing values using isnull()
Method 1:
Function: dataframe.isnull().sum()
Output:
Method 2:
Function: dataframe.isna().sum()
f. Count of missing values of a particular column
Function: df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
In order to get the count of missing values of the particular column by group in
pandas we will be using isnull() and sum() function with apply() and groupby()
which performs the group wise count of missing values as shown below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda x:
x.isnull().sum())
Output:
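These counting idioms can be tried on a small hypothetical dataframe (df1 with Gender and Score columns is an illustrative stand-in matching the calls above):
import pandas as pd
import numpy as np
# Small illustrative dataframe with a few missing values
df1 = pd.DataFrame({
    'Gender': ['M', 'F', np.nan, 'F', 'M', np.nan],
    'Score':  [56, np.nan, 45, np.nan, 78, 62],
})
print(df1.isnull().sum().sum())     # total count of missing values in the dataframe
print(df1.isnull().sum(axis=1))     # row-wise counts
print(df1.isnull().sum())           # column-wise counts
print(df1.Gender.isnull().sum())    # missing values in one column
print(df1.groupby(['Gender'])['Score'].apply(lambda x: x.isnull().sum()))  # group-wise counts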
6. Pandas functions for Data Formatting and Normalization
Raw data must be cleaned and transformed before it can be analyzed or modelled effectively,
and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
point number, integer, etc.) is another part of this initial 'cleaning' process. If you are
working with dates in Pandas, they also need to be stored in the correct format to use
special date-time functions.
b. Data Normalization: Mapping the numeric data values onto a uniform scale
(e.g. from 0 to 1) is involved in data normalization. Making the ranges consistent
across variables helps with statistical analysis and ensures better comparisons
later on. It is also known as Min-Max scaling.
Algorithm:
Step 1: Import pandas and the sklearn preprocessing library.
import pandas as pd
from sklearn import preprocessing
Step 2: Load the iris dataset into the dataframe object df.
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
Step 3: Print the iris dataset.
df.head()
Step 4: Create x, where x holds the values of the column to be normalized as floats.
x = df[['sepal length (cm)']].values.astype(float)
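The remaining scaling steps can be completed with sklearn's MinMaxScaler; a minimal hedged sketch, continuing from the x and df defined in the steps above (the choice of scaler and column are assumptions), is:
from sklearn import preprocessing
# Min-Max scaling maps the selected column onto the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df['sepal length (norm)'] = x_scaled
print(df.head())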
7. Panda Functions for handling categorical variables
● Categorical variables have values that describe a 'quality' or 'characteristic'
of a data unit, like 'what type' or 'which category'.
● Categorical variables fall into mutually exclusive (in one category or in
another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a
non-numeric value.
● Categorical features refer to string type data and can be easily understood by
human beings. But a machine cannot interpret categorical data directly.
Therefore, the categorical data must be translated into numerical data that can be
understood by the machine.
There are many ways to convert categorical data into numerical data. Here the three most used
methods are discussed.
a. Label Encoding: Label Encoding refers to converting the labels into a numeric form
so as to convert them into the machine-readable form. It is an important preprocessing
step for the structured dataset in supervised learning.
Example : Suppose we have a column Height in some dataset. After applying label
encoding, the Height column is converted into:
Where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on the Iris dataset: For the Iris dataset the target column is Species. It
contains three species: Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder : It encodes labels with values between 0
and n_classes-1.
● fit_transform(y):
Parameters: y : array-like of shape (n_samples,)
Target values.
Returns: y : array-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Define a label_encoder object that knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a
unique number (starting from 0) to each class of data. This may lead to the generation
of priority issues in the data sets. A label with a high value may be considered to have
higher priority than a label having a lower value.
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
One-hot encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encode categorical
integer features using a one-hot aka one-of-K scheme
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object for label encoding, then observe the unique
values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Remove the target variable from dataset
features_df=df.drop(columns=['Species'])
Step 6: Apply one_hot encoder for Species column.
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())
Step 7: Join the encoded values with Features variable
df_encode = features_df.join(enc_df)
Output after Step 7:
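As a consolidated, hedged sketch of the one-hot encoding steps above (loading the data directly from the UCI URL used earlier in this manual):
import pandas as pd
from sklearn import preprocessing
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
df = pd.read_csv(csv_url, names=col_names)
features_df = df.drop(columns=['Species'])     # remove the target variable
enc = preprocessing.OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense before wrapping in a DataFrame
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())
df_encode = features_df.join(enc_df)           # join the encoded columns with the features
print(df_encode.head())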
c. Dummy Encoding: Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using dummy encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False,
drop_first=False, dtype=None): Convert categorical variable into
dummy/indicator variables.
● Parameters:
data:array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix : str, list of str, or dict of str, default None
String to append to DataFrame column names.
prefix_sep : str, default '_'
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as
with prefix.
dummy_na : bool, default False
Add a column to indicate NaNs; if False NaNs are ignored.
sparse : bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray
(True) or a regular NumPy array (False).
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply the label_encoder object for label encoding, then observe the unique
values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 6: Apply one_hot encoder with dummy variables for Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=False)
Step 7: Observe the merged dataframe
one_hot_df
Conclusion- In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values on Iris Dataset.
Assignment Question
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Group A
Assignment No: 2
Objective of the Assignment: Students should be able to perform the data wrangling
operation using Python on any open source dataset
Prerequisite:
1. Basics of Python Programming
2. Concept of Data Preprocessing, Data Formatting, Data Normalization and Data
Cleaning.
---
Contents for Theory:
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
3. Identification and Handling of Outliers
---
Theory:
1. Creation of Dataset using Microsoft Excel.
The dataset is created in “CSV” format.
● The name of dataset is StudentsPerformance
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date .
● Number of Instances: 30
● The response variable is: Placement_Offer_Count .
● Range of Values:
Math_Score [60-80], Reading_Score [75-95], Writing_Score [60-80],
Placement_Score [75-100], Club_Join_Date [2018-2021].
● The response variable is the number of placement offers facilitated to particular
students, which largely depends on Placement_Score.
To fill the values in the dataset, the RANDBETWEEN function is used. It returns a random
integer between the numbers you specify.
Syntax: RANDBETWEEN(bottom, top), where bottom is the smallest integer and
top is the largest integer RANDBETWEEN will return.
For better understanding and visualization, 20% impurities are added into each variable
of the dataset.
The steps to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other Formats.
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 4: Fill the data by using the RANDBETWEEN function. For every feature, fill
the data by considering the above specified range. One example is given:
The placement offer count largely depends on the placement score. It is considered that if the
placement score is below 75, 1 offer is facilitated; for a placement score between 75 and 85, 2 offers
are facilitated; and for a placement score above 85, 3 offers are facilitated. A nested IF formula is
used for ease of data filling, as shown below.
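For instance, assuming the placement score is stored in column D starting at row 2, a nested IF formula of the following hypothetical form can be dragged down the Placement_Offer_Count column:
=IF(D2>85, 3, IF(D2>=75, 2, 1))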
Step 5: In 20% of the data, fill in the impurities. The range of Math_Score is [60-80], so update a
few instance values to below 60 or above 80. Repeat this for Writing_Score [60-80],
Placement_Score [75-100], Club_Join_Date [2018-2021].
Step 6: To violate the rule of the response variable, update a few values, e.g. a placement score
greater than 85 that gets only 1 offer.
2. Identification and Handling of Null Values
Many datasets simply arrive with missing data, either because it exists and was not collected or
because it never existed. For example, different users being surveyed may choose not to share
their income, and some users may choose not to share their address; in this way many values in
a dataset end up missing.
In Pandas missing data is represented by two values:
1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing
or null values. To facilitate this convention, there are several useful functions for
detecting, removing, and replacing null values in a Pandas DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a series that is True for NaN values of a specific column, for example
math score, and display only the rows where math score is NaN.
series = pd.isnull(df["math score"])
df[series]
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a series that is True for non-NaN values of a specific column, for example
math score, and display only the rows where math score is not NaN.
series1 = pd.notnull(df["math score"])
df[series1]
See that there are also categorical values in the dataset; for these, you need to use
Label Encoding or One-Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df
In order to fill null values in a datasets, fillna(), replace() functions are used.
These functions replace NaN values with some value of their own. All these
functions help in filling null values in datasets of a DataFrame.
Algorithm:
Step 1: Import pandas and numpy libraries.
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: Filling missing values using fillna()
ndf = df
ndf.fillna(0)
Step 5: Filling missing values using the mean, median and standard deviation of that column, as sketched below.
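The code for this step is not reproduced here; a hedged sketch for one column (continuing from the df loaded in Step 2, with 'math score' as an illustrative column) is:
# Fill missing 'math score' values with the column mean (median and std work the same way)
df['math score'] = df['math score'].fillna(df['math score'].mean())
# df['math score'] = df['math score'].fillna(df['math score'].median())
# df['math score'] = df['math score'].fillna(df['math score'].std())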
Replacing missing values in a column with the minimum/maximum value of that column works
the same way. The following line will replace NaN values in the dataframe with the value -99.
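The corresponding code is not shown here; a hedged sketch (the 'math score' column name is illustrative) is:
import numpy as np
# Replace NaN in one column with that column's minimum (use .max() for the maximum)
df['math score'] = df['math score'].fillna(df['math score'].min())
# Replace every remaining NaN in the dataframe with the value -99
df = df.replace(to_replace=np.nan, value=-99)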
Algorithm:
Step 1: Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: To drop rows with at least 1 null value
ndf.dropna()
3. Identification and Handling of Outliers
Similarly, an outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
data points in the dataset.
Mean is the accurate measure to describe the data when we do not have any
outliers present. Median is used if there is an outlier in the dataset. Mode is used if there
is an outlier AND about ½ or more of the data is the same.
'Mean' is the only measure of central tendency that is affected by the outliers,
which in turn impacts the standard deviation.
Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.
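The calculations referred to below are not reproduced here; a short hedged sketch of how they can be verified in Python is:
import numpy as np
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]
print(np.mean(sample))      # ~20.08, pulled upward by the outlier 101
print(np.median(sample))    # 14.0, barely affected
without_outlier = [x for x in sample if x != 101]
print(np.mean(without_outlier))     # ~12.73
print(np.median(without_outlier))   # 13.0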
From the above calculations, we can clearly say the Mean is more affected than the
Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
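Step 4 (drawing the box plot that Step 5 refers to) is not shown here; a hedged sketch, with column names assumed from the StudentsPerformance dataset, is:
import matplotlib.pyplot as plt
col = ['math score', 'reading score', 'writing score', 'placement score']
df.boxplot(col)     # one box per column; points beyond the whiskers are potential outliers
plt.show()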
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
print(np.where(df['reading score']<25))
print(np.where(df['writing score']<30))
Algorithm:
Step 1 : Import pandas , numpy and matplotlib libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot of placement score versus placement offer count.
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(df['placement score'], df['placement offer count'])
plt.show()
Labels for the axes can be assigned (optional):
ax.set_xlabel('placement score')
ax.set_ylabel('placement offer count')
Step 5: We can now print the outliers with reference to the scatter plot.
print(np.where((df['placement score'] < 50) & (df['placement offer count'] > 1)))
print(np.where((df['placement score'] > 85) & (df['placement offer count'] < 3)))
Algorithm:
Step 1: Import numpy and stats from the scipy library.
import numpy as np
from scipy import stats
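The remaining Z-score steps are not shown here; a minimal hedged sketch (the threshold of 3 and the 'math score' column are assumptions) is:
# Z-score of every value: how many standard deviations it lies from the column mean
z = np.abs(stats.zscore(df['math score']))
print(z)
threshold = 3                       # a common rule of thumb
print(np.where(z > threshold))      # positions of potential outliers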
To detect outliers with the Inter Quartile Range (IQR), the dataset's normal range, namely the
Upper and Lower bounds, is defined (a 1.5*IQR value is considered):
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formula, according to statistics, a 0.5 scale-up of IQR
(new_IQR = IQR + 0.5*IQR) is taken.
Algorithm:
Step 1 : Import numpy library
import numpy as np
Step 2: Sort Reading Score feature and store it into sorted_rscore.
sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3.
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1, q3)
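The steps that compute the IQR, the bounds and the outlying values are not shown here; a hedged continuation of the algorithm (using q1, q3 and sorted_rscore from the steps above) is:
IQR = q3 - q1
lwr_bound = q1 - 1.5 * IQR
upr_bound = q3 + 1.5 * IQR
print(lwr_bound, upr_bound)
# Values outside the bounds are treated as outliers
r_outliers = [x for x in sorted_rscore if x < lwr_bound or x > upr_bound]
print(r_outliers)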
Handling of Outliers:
For removing an outlier, one must follow the same process as removing any entry
from the dataset, using its exact position in the dataset, because all of the above methods of
detecting outliers produce as their end result the list of data items that satisfy the outlier
definition according to the method used.
df_stud.insert(1, "m score", b, True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for reading score:
col = ['reading score']
df.boxplot(col)
median = np.median(sorted_rscore)
median
4. Replace the upper bound outliers using the median value.
refined_df = df
refined_df['reading score'] = np.where(refined_df['reading score'] > upr_bound,
median, refined_df['reading score'])
5. Display refined_df
Data transformation is the process of converting data from source systems, such as
operational systems, into a data warehouse, a data lake or another repository for use in
business intelligence and analytics applications. The data are transformed in ways that are
ideal for mining the data. Data transformation involves the following steps:
● Smoothing: It is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset and
helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Attribute construction (feature construction): new attributes are constructed from the
given set of attributes. Here the Club_Join_Date is transformed to Duration.
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Algorithm:
Step 1: Detect outliers using the Z-score for the math score variable and remove the outliers.
Step 2: Observe the histogram for the math score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind = 'hist')
Step 3: Convert the variable to logarithm at the scale 10.
df['log_math'] = np.log10(df['math score'])
Group A
Assignment No: 3
Provide the codes with outputs and explain everything that you do in this step.
Objective of the Assignment: Students should be able to perform the Statistical operations
using Python on any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of statistics such as mean, median, minimum, maximum, standard deviation
etc.
Contents for Theory:
1. Summary statistics
2. Types of Variables
1. Summary statistics:
What is Statistics?
Statistics is the science of collecting data and analysing them to infer proportions (sample)
that are representative of the population. In other words, statistics is interpreting data in
order to make predictions for the population.
Branches of Statistics:
There are two branches of Statistics.
DESCRIPTIVE STATISTICS: Descriptive Statistics is a statistics or a measure that
describes the data.
INFERENTIAL STATISTICS: Using a random sample of data taken from a population to
describe and make inferences about the population is called Inferential Statistics.
Descriptive Statistics
Descriptive Statistics is summarising the data at hand through certain numbers like mean,
median etc. so as to make the understanding of the data easier. It does not involve any
generalisation or inference beyond what is available. This means that the descriptive
statistics are just the representation of the data (sample) available and not based on any
theory of probability.
b. Median: Median is the point which divides the entire data into two equal halves.
One-half of the data is less than the median, and the other half is greater than it.
Median is calculated by first arranging the data in either ascending or descending order.
○ If the number of observations is odd, the median is given by the middle
observation in the sorted form.
○ If the numbers of observations are even, median is given by the mean of the
two middle observations in the sorted form.
An important point to note is that the order of the data (ascending or
descending) does not affect the median.
c. Mode: Mode is the number which has the maximum frequency in the entire data
set, or in other words, mode is the number that appears the maximum number of
times. A data can have one or more than one mode.
● If there is only one number that appears the maximum number of times,
the data has one mode, and is called Uni-modal.
● If there are two numbers that appear the maximum number of times, the
data has two modes, and is called Bi-modal.
● If there are more than two numbers that appear the maximum number of
times, the data has more than two modes, and is called Multi-modal.
Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is a Bimodal data and the modes
are 17 and 21.
Measures of Dispersion describe the spread of the data around the central value (or the
Measures of Central Tendency).
1. Absolute Deviation from Mean — The Absolute Deviation from Mean, also
called Mean Absolute Deviation (MAD), describes the variation in the data set, in
the sense that it tells the average absolute distance of each data point from the mean. It
is calculated as shown below.
2. Variance — Variance measures how far data points are spread out from the mean.
A high variance indicates that data points are spread widely and a small variance
indicates that the data points are closer to the mean of the data set. It is calculated
as shown below.
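The two formulas referred to above are, in the usual notation (with x̄ the sample mean and n the number of observations; the sample variance divides by n − 1 instead of n):
MAD = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2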
4. Range — Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as
5. Quartiles — Quartiles are the points in the data set that divides the data set into
four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data
set.
● 25% of the data points lie below Q1 and 75% lie above it.
● 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing but
Median.
● 75% of the data points lie below Q3 and 25% lie above it.
Positive Skew — This is the case when the tail on the right side of the curve is
bigger than that on the left side. For these distributions, mean is greater than the
mode.
Negative Skew — This is the case when the tail on the left side of the curve is
bigger than that on the right side. For these distributions, mean is smaller than the
mode.
Python Code:
1. Mean
To find the mean of all columns
Syntax:
df.mean()
Output:
2. Median
To find the median of all columns
Syntax:
df.median()
3. Mode
To find mode of all columns
Syntax:
df.mode()
Output:
In the Genre column the mode is Female, for the column Age the mode is 32, etc. If a
particular column does not have a mode, all the values will be displayed in
the column.
4. Minimum
To find the minimum of a column
Syntax:
df.loc[:,'Age'].min(skipna = False)
Output:
18
5. Maximum
To find the maximum of a column
Syntax:
df.loc[:,'Age'].max(skipna = False)
6. Standard Deviation
To find Standard Deviation of all columns
Syntax:
df.std()
Output:
13.969007331558883
2. Types of Variables:
A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables. Variables are classified into two main categories:
● Categorical and
● Numeric.
Each category is then classified into two subcategories: nominal or ordinal for categorical
variables, discrete or continuous for numeric variables.
● Categorical variables
○ Ordinal Variable
An ordinal variable is a variable whose values are defined by an order relation
between the different categories. In following table, the variable “behaviour” is
ordinal because the category “Excellent” is better than the category “Very good,”
which is better than the category “Good,” etc. There is some natural ordering, but
it is limited since we do not know by how much “Excellent” behaviour is better
than “Very good” behaviour.
● Numerical Variables
A numeric variable (also called a quantitative variable) is a quantifiable characteristic
whose values are numbers (except numbers which are codes standing in for categories).
○ Continuous variables
A variable is said to be continuous if it can assume an infinite number of real
values within a given interval.
For instance, consider the height of a student. The height cannot take just any
value. It can't be negative and it can't be higher than three metres. But between 0
and 3, the number of possible values is theoretically infinite. A student may be
1.6321748755 … metres tall.
○ Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a finite
number of real values within a given interval.
An example of a discrete variable would be the score given by a judge to a
gymnast in competition: the range is 0 to 10 and the score is always given to one
decimal (e.g. a score of 8.5)
Syntax:
df_u = df.rename(columns={'Annual Income (k$)': 'Income'}, inplace=False)
df_u.groupby(['Genre']).Income.mean()
Output:
To create a list that contains a numeric value for each response to the categorical variable:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Genre']]).toarray())
enc_df
5. Algorithm: Display basic statistical details (percentile, mean, standard deviation, etc.) for each species of the Iris dataset.
1. Import Pandas Library
2. The dataset is downloaded from UCI repository.
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
3. Assign Column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Load Iris.csv into a Pandas data frame
iris = pd.read_csv(csv_url, names = col_names)
5. Load all rows with Iris-setosa species in variable irisSet
irisSet = (iris['Species']== 'Iris-setosa')
6. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-setosa use describe()
print('Iris-setosa')
print(iris[irisSet].describe())
7. Load all rows with Iris-versicolor species in variable irisVer
irisVer = (iris['Species'] == 'Iris-versicolor')
8. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-versicolor use describe()
print('Iris-versicolor')
print(iris[irisVer].describe())
9. Load all rows with Iris-virginica species in variable irisVir
irisVir = (iris['Species'] == 'Iris-virginica')
10. To display basic statistical details like percentile, mean, standard deviation etc. for
Iris-virginica use describe()
print('Iris-virginica')
print(iris[irisVir].describe())
Conclusion:
Measures of central tendency describe the centre of a data set. It includes the
mean, median, and mode.
Measures of variability or spread describe the dispersion of data within the set and
include the standard deviation, variance, minimum and maximum variables.
Assignment Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables? Explain with examples.
3. Which method is used to display summary statistics of the dataframe? Write the code.
Group A
Assignment No: 4
Title of the Assignment: Create a Linear Regression Model using Python/R to predict
home prices using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
The Boston Housing dataset contains information about various houses in Boston through
different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform data analysis using linear
regression using Python for any open source dataset.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2. Concept of Regression.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Linear Regression : Univariate and Multivariate
2. Least Square Method for Linear Regression
3. Measuring Performance of Linear Regression
4. Example of Linear Regression
5. Training data set and Testing data set
---------------------------------------------------------------------------------------------------------------
1. Linear Regression: It is a machine learning algorithm based on supervised learning. It
predicts a target value on the basis of independent variables.
● It is preferred for finding out the relationship between variables and for forecasting.
● It models the relationship between a continuous dependent (target) variable and one or
more independent (predictor) variables, which may be continuous or discrete. Because a
linear relationship should exist between the predictor and target variables, it is known
as Linear Regression.
● Linear regression is popular because the cost function is Mean Squared Error
(MSE) which is equal to the average squared difference between an observation’s
actual and predicted values.
● It is shown as an equation of line like :
Y = m*X + b + e
where b is the intercept, m is the slope of the line and e is the error term.
This equation can be used to predict the value of the target variable Y based on the given
predictor variable(s) X, as shown in Fig. 1.
● Fig. 2 shown below is about the relation between weight (in Kg) and height (in
cm), a linear relation. It is an approach of studying in a statistical manner to
summarise and learn the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’ is considered as the predictor, explanatory, or
independent variable.
Fig.2 : Relation between weight (in Kg) and height (in cm)
Multivariate Regression: It concerns the study of two or more predictor variables.
Usually a transformation of the original features into polynomial features from a given
degree is preferred and further Linear Regression is applied on it.
● A simple linear model Y = a + bX in the original feature will be transformed into a
polynomial feature, and a linear regression is then applied to it; it will be something like
Y = a + bX + cX²
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.
2. Least Square Method for Linear Regression
● A simple linear model is one which involves only one dependent and one independent
variable. Regression models are usually denoted in matrix notation.
● However, a simple univariate linear model can be denoted by the regression equation
ŷ = β0 + β1 x     (1)
● This linear equation represents a line, also known as the 'regression line'. The least square
estimation technique is one of the basic techniques used to guess the values of the
parameters based on a sample set.
● This technique estimates the parameters β0 and β1 by trying to minimise the square
of errors at all the points in the sample set. The error is the deviation of the actual sample
data point from the regression line. The technique can be represented by the equation
min Σ (yi − ŷi)² , i = 1 … n     (2)
Using differential calculus on equation (2) we can find the values of β0 and β1 such that the
sum of squared errors is minimum:
β1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²     (3)
β0 = ȳ − β1 x̄     (4)
Once the Linear Model is estimated using equations (3) and (4), we can estimate the
value of the dependent variable in the given range only. Going outside the range is called
extrapolation which is inaccurate if simple regression techniques are used.
3. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model
created based on the given set of observations in the sample. Two or more regression
models created using a given sample data can be compared based on their MSE. The
lesser the MSE, the better the regression model is. When the linear regression model is
trained using a given set of observations, the model with the least mean sum of squares
error (MSE) is selected as the best model. The Python or R packages select the best-fit
model as the model with the lowest MSE or lowest RMSE when training the linear
regression models.
Mathematically, the MSE can be calculated as the average sum of the squared difference
between the actual value and the predicted or estimated value represented by the
regression model (line or plane).
An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE:
Root Mean Squared Error (RMSE) basically calculates the least-squares error and takes the
root of the summed values.
Mathematically speaking, Root Mean Squared Error is the square root of the average of all
squared errors. This is the formula to calculate RMSE.
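The formula itself is not reproduced here; in the usual notation (yi the actual value, ŷi the predicted value, n the number of observations) it is:
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
and the R-Squared measure described next is R^2 = \frac{SSR}{SST}.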
R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total
(SST).
SST: total sum of squares (SST), regression sum of squares (SSR), Sum of square of errors
(SSE) are all showing the variation with different measures.
A value of R-squared closer to 1 would mean that the regression model covers most part
of the variance of the values of the response variable and can be termed as a good
model.
One can alternatively use MSE or R-Squared based on what is appropriate and the need of the
hour. However, the disadvantage of using MSE rather than R-squared is that it is difficult
to gauge the performance of the model using MSE, as the value of MSE can vary from 0 to
any larger number. In the case of R-squared, the value is bounded between 0 and 1.
4. Example of Linear Regression
Consider following data for 5 students.
Each Xi (i = 1 to 5) represents the score of ith student in standard X and corresponding
Yi (i = 1 to 5) represents the score of ith student in standard XII.
(i) Linear regression equation best predicts standard XIIth score
(ii) Interpretation for the equation of Linear Regression
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?
x        y        x − x̄     y − ȳ     (x − x̄)²     (x − x̄)(y − ȳ)
95       85        17         8         289           136
85       95         7        18          49           126
80       70         2        -7           4           -14
70       65        -8       -12          64            96
60       70       -18        -7         324           126
x̄ = 78,  ȳ = 77,  Σ(x − x̄)² = 730,  Σ(x − x̄)(y − ȳ) = 470
(i) The linear regression equation that best predicts the standard XIIth score:
ŷ = β0 + β1 x
β1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = 470 / 730 = 0.644
β0 = ȳ − β1 x̄ = 77 − 0.644 × 78 = 26.768
ŷ = 26.768 + 0.644 x
Interpretation 1
For an increase in the value of x by one unit, there is an increase in the value of y of 0.644 units.
Interpretation 2
The score in XII standard (Yi) depends on the score in X standard (Xi) with a slope of 0.644, while
other factors contribute the constant 26.768 to the XII standard result.
(iii) If a student's score is 80 in std X, then his expected score in XII standard is
26.768 + 0.644 × 80 = 78.288.
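As a quick hedged check of this worked example in Python:
import numpy as np
x = np.array([95, 85, 80, 70, 60])
y = np.array([85, 95, 70, 65, 70])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)            # ~0.644 and ~26.77
print(b0 + b1 * 80)      # ~78.28, the expected XIIth-standard score for x = 80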
5. Training data set and Testing data set
(a) Training Phase
● Machines can learn when they observe enough relevant data. Using this one can model
algorithms to find relationships, detect patterns, understand complex problems and make
decisions.
● Training error is the error that occurs by applying the model to the same data from which
the model is trained.
● In simple terms, when the actual output of the training data and the predicted output of the
model do not match, the training error Ein is said to have occurred.
● Training error is much easier to compute.
(b) Testing Phase
● Testing dataset is provided as input to this phase.
● A test dataset is a dataset for which the class label is unknown; it is tested using the model.
● A test dataset used for assessment of the finally chosen model.
● Training and Testing dataset are completely different.
● Testing error is the error that occurs by assessing the model by providing the unknown
data to the model.
● In simple terms, when the actual output of the testing data and the predicted output of the
model do not match, the testing error Eout is said to have occurred.
● Eout is generally observed to be larger than Ein.
(c) Generalization
● Generalization is the prediction of the future based on the past system.
● It needs to generalize beyond the training data to some future data that it might not have
seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never
seen.
● In general, the dataset is divided into two partition training and test sets.
● The fit method is called on the training set to build the model.
● The fitted model is then applied to the test set to estimate the target values and
evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to estimate
how well the model trained on the training data and how well it would perform on the
unseen data.
Output:
68.63
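The earlier steps that define model and predict are not shown here; a hedged sketch consistent with the outputs, fitting the worked example data with np.polyfit, might be:
import numpy as np
x = np.array([95, 85, 80, 70, 60])
y = np.array([85, 95, 70, 65, 70])
model = np.polyfit(x, y, 1)      # [slope, intercept] of the least-squares line
predict = np.poly1d(model)       # callable that evaluates the fitted line
print(predict(65))               # ~68.63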
Step 6: Predict the y_pred for all values of x.
y_pred= predict(x)
y_pred
Output:
array([81.50684932, 87.94520548, 71.84931507, 68.63013699, 71.84931507])
Output:
0.4803218090889323
Step 8: Plotting the linear regression model
y_line = model[1] + model[0]* x
plt.plot(x, y_line, c = 'r')
plt.scatter(x, y_pred)
plt.scatter(x,y,c='r')
Output:
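The intermediate steps for the Boston Housing part (loading the data, splitting it into training and test sets, and fitting the model that produces ytrain_pred and ytest_pred) are not reproduced here; a hedged sketch, assuming the Kaggle CSV has been saved locally as 'Boston.csv' with the target column named 'MEDV', is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston = pd.read_csv('Boston.csv')       # assumed local copy of the dataset
X = boston.drop(columns=['MEDV'])        # the 13 feature columns
y = boston['MEDV']                       # house price target
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)
lm = LinearRegression()
lm.fit(xtrain, ytrain)
ytrain_pred = lm.predict(xtrain)
ytest_pred = lm.predict(xtest)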
Step 12: Calculate the Mean Squared Error for train_y and test_y
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
mse = mean_squared_error(ytrain_pred,ytrain)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
plt.scatter(ytrain, ytrain_pred, c='blue', marker='o', label='Training data')
plt.scatter(ytest,ytest_pred ,c='lightgreen',marker='s',label='Test data')
plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc= 'upper left')
#plt.hlines(y=0,xmin=0,xmax=50)
plt.plot()
plt.show()
Conclusion:
In this way we have done data analysis using linear regression for Boston Dataset and
predict the price of houses using the features of the Boston Dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical session)
Group A
Assignment No:5
Objective of the Assignment: Students should be able to perform data analysis using logistic
regression using Python for any open source dataset.
Prerequisite:
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Logistic Regression
2. Differentiate between Linear and Logistic Regression
3. Sigmoid Function
4. Types of Logistic Regression
5. Confusion Matrix Evaluation Metrics
Logistic Regression can be used for various classification problems such as spam
detection, diabetes prediction, whether a given customer will purchase a particular product or
churn to another competitor, whether a user will click on a given advertisement
link or not, and many more examples are in the bucket.
Logistic Regression is one of the most simple and commonly used Machine Learning
algorithms for two-class classification. It is easy to implement and can be used as the
baseline for any binary classification problem. Its basic fundamental concepts are also
constructive in deep learning. Logistic regression describes and estimates the relationship
between one dependent binary variable and independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or
target variable is dichotomous in nature. Dichotomous means there are only two possible
classes. For example, it can be used for cancer detection problems. It computes the
probability of an event occurring.
It is a special case of linear regression where the target variable is categorical in nature. It
uses a log of odds as the dependent variable. Logistic Regression predicts the probability of
occurrence of a binary event using a logit (sigmoid) function,
where y is the dependent variable and x1, x2 ... and xn are the explanatory variables.
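The equation referred to above is not reproduced here; in its usual form it can be written as:
p(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n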
3. Sigmoid Function
The sigmoid function, also called logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The output cannot go below 0 or above 1.
For example: if the output is 0.75, we can say in terms of probability that there is a 75 percent
chance that the patient will suffer from cancer.
The following table shows the confusion matrix for a two class classifier.
Here each row indicates the actual classes recorded in the test data set and the each column
indicates the classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal
concerns prediction errors.
● Number of positive (Pos) : Total number instances which are labelled as positive in
a given dataset.
● Number of negative (Neg) : Total number instances which are labelled as negative in a
given dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as
positive and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as
negative and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as
negative and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as
positive and the class predicted by the classifier is negative.
Accuracy = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / (Pos + Neg)
Precision = TP / (TP + FP)
Step 6: Predict y_pred for all values of train_x and test_x.
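The surrounding steps of this algorithm are not reproduced here; a hedged end-to-end sketch on the Social Network Ads data (the file name 'Social_Network_Ads.csv' and its Age, EstimatedSalary and Purchased columns are assumptions) is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
df = pd.read_csv('Social_Network_Ads.csv')      # assumed local copy
X = df[['Age', 'EstimatedSalary']]
y = df['Purchased']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
# Feature scaling helps the solver converge
sc = StandardScaler()
train_x = sc.fit_transform(train_x)
test_x = sc.transform(test_x)
model = LogisticRegression()
model.fit(train_x, train_y)
y_pred = model.predict(test_x)
print(confusion_matrix(test_y, y_pred))
print(accuracy_score(test_y, y_pred))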
Conclusion:
In this way we have done data analysis using logistic regression for the Social Media Adv. dataset
and evaluated the performance of the model.
Value Addition:
Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider a binary classification task with two classes, positive and negative.
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4, and explain every step in detail.
Group A
Assignment No: 6
Objective of the Assignment: Students should be able to perform data analysis using the Naïve
Bayes classification algorithm using Python for any open source dataset.
Prerequisite:
1. Basic of Python Programming
2. Concept of Joint and Marginal Probability.
For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A),
occurrences of events B or C are not concerned, i.e. no information about the occurrence
of any other event is used.
Conditional Probabilities:
We have a dataset with some features Outlook, Temp, Humidity, and Windy, and
the target here is to predict whether a person or team will play tennis or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If I
try to write the same formula in terms of classes and features, we will get the following
equation
Now we have two classes and four features, so if we write this formula for class C1, it will be
something like
this.
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. You might wonder
why we take the intersection; it's because we are considering the situation where all these features
are present at the same time.
The Naive Bayes algorithm assumes that all the features are independent of each other, or in
other words that all the features are unrelated. With that assumption, we can further simplify the
above equation. This gives the final equation of Naive Bayes, and we have to calculate the probability
of both classes for a given observation and predict the class with the higher probability.
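The simplified equation itself is not reproduced here; under the independence assumption it takes the usual form:
P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)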
P(No | Today) > P(Yes | Today), so the prediction whether golf would be played is 'No'.
Step 5: Use the Naive Bayes algorithm (train the machine) to create the model.
# import the class
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)
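The earlier steps (loading the Iris data and splitting it into the train and test sets that define X_train, y_train and X_test) are not reproduced here; a hedged end-to-end sketch is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
iris = pd.read_csv(csv_url, names=col_names)
X = iris.drop(columns=['Species'])
y = iris['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Y_pred = gaussian.predict(X_test)
print(confusion_matrix(y_test, Y_pred))
print(accuracy_score(y_test, Y_pred))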
Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Assignment Question:
1) Consider the observations for the car theft scenario having 3 attributes Color, Type and
Origin. Find the probability of car theft for the scenario Red SUV and Domestic.
2) Write python code for the preprocessing mentioned in step 4, and explain every step in
detail.
Group A
Assignment No: 7
Objective of the Assignment: Students should be able to perform Text Analysis using the
TF-IDF algorithm.
Prerequisite:
1. Basics of Python Programming
2. Basics of the English language.
Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text Mining processes the text itself, while NLP
processes the underlying metadata. Finding frequency counts of words, the length of sentences, and the presence or absence of specific words is known as text mining.
● Sentence tokenization: split a paragraph into a list of sentences using the sent_tokenize() method
● Word tokenization: split a sentence into a list of words using the word_tokenize() method
Lemmatization Vs Stemming
A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts either the beginning or the end of the word.
Lemmatization, on the other hand, considers the morphological analysis of the word, and dictionaries are required to look up the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent operation where the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features.
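A small sketch contrasting the two operations with NLTK's PorterStemmer and WordNetLemmatizer (the sample words are taken from the assignment question at the end of this experiment; requires nltk.download('wordnet')):
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "studying", "cries", "cry"]:
    # stemming just chops the suffix; lemmatization looks up the dictionary form
    print(word, "->", ps.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))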
POS Tagging
POS (Parts of Speech) tagging gives us grammatical information about the words of a sentence by assigning a specific token (determiner, noun, adjective, adverb, verb, personal pronoun, etc.) as a tag (DT, NN, JJ, RB, VB, PRP, etc.) to each word.
A word can have more than one POS depending upon the context in which it is used.
POS tags are used in statistical NLP tasks. They distinguish the sense of a word,
which is very helpful in text realization and in inferring semantic information from the text, for example in sentiment analysis.
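A minimal POS-tagging sketch with NLTK (the sample sentence is only illustrative; it assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have already been run):
from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))   # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]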
● Term Frequency (TF)
Term frequency is the ratio of the number of times a word appears in a document to the total number of words in that document: TF(t, d) = (count of t in d) / (total terms in d).
Example:
The initial step is to make a vocabulary of unique words and calculate TF for each document. TF will be higher for words that frequently appear in a document and lower for rare words in a document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as 'of', 'and', etc. can be present most frequently but are of little significance. IDF provides weightage to each word based on its frequency in the corpus D: IDF(t) = ln(N / df(t)), where N is the total number of documents in the corpus and df(t) is the number of documents containing the term t.
After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
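A minimal sketch of the TF, IDF and TFIDF computation described above (the two sample documents are only illustrative):
import math

docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

N = len(docs)
# IDF(t) = ln(N / number of documents containing t)
idf = {w: math.log(N / sum(1 for doc in tokenized if w in doc)) for w in vocab}

for doc in tokenized:
    tf = {w: doc.count(w) / len(doc) for w in vocab}      # TF(t, d) = count / total words
    tfidf = {w: tf[w] * idf[w] for w in vocab}
    print(tfidf)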
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.
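A short Bag-of-Words sketch using scikit-learn's CountVectorizer (the corpus below is only a sample):
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # vocabulary (use get_feature_names() on older scikit-learn)
print(bow.toarray())                        # one count vector per document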
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Word Tokenization
from nltk.tokenize import word_tokenize
text = "How to remove stop words with NLTK library in Python?"
tokenized_word = word_tokenize(text)
print(tokenized_word)

# Removing stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text = re.sub('[^a-zA-Z]', ' ', text)
tokens = word_tokenize(text.lower())
filtered_text = []
for w in tokens:
    if w not in stop_words:
        filtered_text.append(w)
print("Tokenized Sentence:", tokens)
print("Filtered Sentence:", filtered_text)
Conclusion:
In this way we have done text data analysis using the TF-IDF algorithm.
Assignment Question:
1) Perform stemming for text = "studies studying cries cry". Compare the results generated with lemmatization. Comment in your answer on how stemming and lemmatization differ from each other.
2) Write Python code for removing stop words from the below documents, convert the documents into lowercase and calculate the TF, IDF and TFIDF score for each document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Group A
Assignment No: 8
---------------------------------------------------------------------------------------------------------------
Theory:
Data visualisation plays a very important role in data mining. Data scientists spend much of their time exploring data through visualisation, so to accelerate this process we need good documentation of all the plots. Even plenty of resources cannot be transformed into valuable goods without planning and architecture.
Let's see what the Titanic dataset looks like. Execute the following script:
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
A. Distribution Plots
a. Dist-Plot
b. Joint Plot
c. Pair Plot
d. Rug Plot
B. Categorical Plots
a. Bar Plot
b. Count Plot
c. Box Plot
d. Violin Plot
C. Advanced Plots
a. Strip Plot
b. Swarm Plot
D. Matrix Plots
a. Heat Map
b. Cluster Map
A. Distribution Plots:
These plots help us to visualize the distribution of data. We can use these plots to understand the central tendency, spread and shape of the data.
a. Distplot
The distplot() plots a histogram for the data of a single column.
● We can change the number of bins, i.e. the number of vertical bars in the histogram, by passing a value for the bins parameter.
The line that you see represents the kernel density estimation. You can remove this line by passing False for the kde parameter, as shown below.
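A minimal sketch of the two calls described above (assuming the Titanic dataset loaded earlier as dataset; on newer Seaborn versions distplot is replaced by histplot/displot):
import matplotlib.pyplot as plt

sns.distplot(dataset['age'].dropna(), bins=10)              # histogram with kde line; dropna() avoids errors from missing ages
sns.distplot(dataset['age'].dropna(), bins=10, kde=False)   # histogram only
plt.show()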
Here the x-axis is the age and the y-axis displays the frequency. For example, for bins = 10, the age range is divided into 10 intervals and the height of each bar shows how many passengers fall in each interval.
b. Joint Plot
The joint plot shows the mutual distribution of two columns; the more the colour intensity, the greater the number of observations.
# For Plot 1
sns.jointplot(x='age', y='fare', data=dataset)
# For Plot 2
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')
● From the output, you can see that a joint plot has three parts. A distribution plot at the top
for the column on the x-axis, a distribution plot on the right for the column on the y-axis
and a scatter plot in between that shows the mutual distribution of data for both the
columns. You can see that there is no correlation observed between the ages and the fares.
● You can change the type of the joint plot by passing a value for the kind parameter. For
instance, if instead of a scatter plot, you want to display the distribution of data in the
form of a hexagonal plot, you can pass the value hex for the kind parameter.
● In the hexagonal plot, the hexagon with the most number of points gets darker colour. So
if you look at the above plot, you can see that most of the passengers are between the
ages of 20 and 30 and most of them paid between 10-50 for the tickets.
b. The rugplot() is used to draw small bars along the x-axis for each point in the dataset. To
plot a rug plot, you need to pass the name of the column. Let's plot a rug plot for fare.
sns.rugplot(dataset['fare'])
From the output, you can see that most of the instances for the fares have values between 0 and
100.
These are some of the most commonly used distribution plots offered by Python's Seaborn library. Let's now see some of the categorical plots in the Seaborn library.
B. Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The categorical plots plot the values in the categorical column against another categorical column or a numeric column. Let's see some of the most commonly used categorical plots.
a. Bar Plot
The bar plot displays the mean value of a numeric column for each category; for example, sns.barplot(x='sex', y='age', data=dataset) plots the average age for each gender.
From the output, you can clearly see that the average age of male passengers is just less than 40, while the average age of female passengers is around 33.
In addition to finding the average, the bar plot can also be used to calculate other aggregate
values for each category. To do so, you need to pass the aggregate function to the estimator. For
instance, you can calculate the standard deviation for the age of each gender as follows:
import numpy as np
import seaborn as sns

sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)
Notice, in the above script we use the std aggregate function from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output looks like this:
b. Count Plot
The count plot is similar to the bar plot, however it displays the count of the categories in a specific column. For example, to count the number of male and female passengers:
sns.countplot(x='sex', data=dataset)
c. Box Plot
The box plot is used to display the distribution of the categorical data in the form of quartiles.
The centre of the box shows the median value. The value from the lower whisker to the bottom
of the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and finally
from the top of the box to the top whisker lies the last quartile.
Now let's plot a box plot that displays the distribution for the age with respect to each gender.
You need to pass the categorical column as the first parameter (which is sex in our case) and the
numeric column (age in our case) as the second parameter. Finally, the dataset is passed as the
third parameter, take a look at the following script:
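The script referred to above (this is the same call used again in Assignment 9):
sns.boxplot(x='sex', y='age', data=dataset)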
Let's try to understand the box plot for females. The first quartile starts at around 1 and ends at 20, which means that 25% of the passengers are aged between 1 and 20. The second quartile starts at around 20 and ends at around 28, which means that 25% of the passengers are aged between 20 and 28. Similarly, the third quartile starts and ends between 28 and 38, hence 25% of the passengers are aged within this range, and finally the fourth or last quartile starts at 38 and ends around 64.
If there are any passengers that do not belong to any of the quartiles, they are called outliers and are represented by dots on the box plot.
You can make your box plots fancier by adding another layer of distribution. For instance, if you want to see the box plots for the age of passengers of both genders, along with information about whether or not they survived, you can pass survived as the value for the hue parameter as shown below:
sns.boxplot(x='sex', y='age', data=dataset, hue="survived")
Now in addition to the information about the age of each gender, you can also see the
distribution of the passengers who survived. For instance, you can see that among the male
passengers, on average, younger people survived as compared to the older ones. Similarly,
you can see that the variation among the age of female passengers who did not survive is much
greater than the age of the surviving female passengers.
d. Violin Plot
The violin plot is similar to the box plot, but it also shows the distribution (kernel density) of the data. Let's plot a violin plot that displays the distribution of age with respect to each gender:
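A minimal call matching the description above (assuming the same Titanic dataset):
sns.violinplot(x='sex', y='age', data=dataset)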
You can see from the figure above that violin plots provide much more information about the
data as compared to the box plot. Instead of plotting the quartile, the violin plot allows us to see
all the components that actually correspond to the data. The area where the violin plot is thicker
has a higher number of instances for the age. For instance, from the violin plot for males, it is
clearly evident that the number of passengers with age between 20 and 40 is higher than all the
rest of the age brackets.
Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter as shown below:
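A sketch of the hue variant described above:
sns.violinplot(x='sex', y='age', data=dataset, hue='survived')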
Now you can see a lot of information on the violin plot. For instance, if you look at the bottom of
the violin plot for the males who survived (left-orange), you can see that it is thicker than the
bottom of the violin plot for the males who didn't survive (left-blue). This means that the number
of young male passengers who survived is greater than the number of young male passengers
who did not survive.
C. Advanced Plots:
a. Strip Plot
The stripplot() function is used to draw a strip plot, a scatter plot in which one of the variables is categorical. Like the box plot, the first parameter is the categorical column, the second parameter is the numeric column and the third parameter is the dataset. Look at the following script:
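A minimal call for the strip plot described above (a sketch, assuming the Titanic dataset):
sns.stripplot(x='sex', y='age', data=dataset)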
You can see the scattered plots of age for both males and females. The data points look like
strips. It is difficult to comprehend the distribution of data in this form. To better comprehend the
data, pass True for the jitter parameter which adds some random noise to the data. Look at the
following script:
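The jitter variant described above, as a sketch:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True)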
Now you have a better view of the distribution of age across the genders.
Like violin and box plots, you can add an additional categorical column to the strip plot using the hue parameter as shown below:
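A sketch of the hue variant:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived')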
b. Swarm Plot
The swarm plot is a combination of the strip and the violin plots. In the output of the script below, you can clearly see that the data points are scattered like in the strip plot but are not overlapping; rather, they are arranged to give a view similar to that of a violin plot.
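A minimal call for the basic swarm plot (a sketch, assuming the Titanic dataset):
sns.swarmplot(x='sex', y='age', data=dataset)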
Let's add another categorical column to the swarm plot using the hue parameter.
sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')
From the output, it is evident that the ratio of surviving males is less than the ratio of surviving females: for the male plot there are more blue points (not surviving) and fewer orange points (surviving), while for the females there are more orange points than blue points.
Another observation is that amongst males of age less than 10, more passengers survived as
compared to those who didn't.
D. Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and columns. Heat maps are
the prime examples of matrix plots.
a. Heat Maps
Heat maps are normally used to plot correlation between numeric columns in the form of a
matrix. It is important to mention here that to draw matrix plots, you need to have meaningful
information on rows as well as columns. Let's plot the first five rows of the Titanic dataset to see
if both the rows and column headers have meaningful information. Execute the following script:
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
From the output, you can see that the column headers contain useful information such as
passengers surviving, their age, fare etc. However the row headers only contain indexes 0, 1, 2,
etc. To plot matrix plots, we need useful information on both columns and row headers. One way
to do this is to call the corr() method on the dataset. The corr() function returns the correlation
between all the numeric columns of the dataset. Execute the following script:
dataset.corr()
In the output, you will see that both the columns and the rows have meaningful header
information, as shown below:
Now to create a heat map with these correlation values, you need to call the heatmap() function
and pass it your correlation dataframe. Look at the following script:
corr = dataset.corr()   # on newer pandas versions, pass numeric_only=True
sns.heatmap(corr)
From the output, it can be seen that what the heatmap essentially does is plot a box for every combination of row and column values. The colour of the box depends upon the gradient. For instance, in the above image, if there is a high correlation between two features, the corresponding cell or box is white; on the other hand, if there is no correlation, the corresponding cell remains black.
The correlation values can also be plotted on the heatmap by passing True for the annot
parameter. Execute the following script to see this in action:
sns.heatmap(corr, annot=True)
You can also change the colour of the heatmap by passing an argument for the cmap parameter. For now, just look at the following script:
corr = dataset.corr()
sns.heatmap(corr, cmap='coolwarm')   # any Matplotlib colormap name can be used for cmap
b. Cluster Map:
In addition to the heat map, another commonly used matrix plot is the cluster map. The
cluster map basically uses Hierarchical Clustering to cluster the rows and columns of the
matrix.
Let's plot a cluster map for the number of passengers who travelled in a specific month of
a specific year. Execute the following script:
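The month/year passenger counts described above match Seaborn's built-in flights dataset, so a sketch under that assumption could look like:
import matplotlib.pyplot as plt

flights = sns.load_dataset('flights')
flights_matrix = flights.pivot(index='month', columns='year', values='passengers')   # rows: months, columns: years
sns.clustermap(flights_matrix)
plt.show()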
4. Checking how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.
import seaborn as sns
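A sketch of the histogram described above (bins=10 is an assumption; on newer Seaborn use histplot instead of distplot):
import matplotlib.pyplot as plt

sns.distplot(dataset['fare'], kde=False, bins=10)
plt.show()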
From the histogram, it can be seen that for around 730 passengers the price of the ticket is up to 50. For around 100 passengers the price of the ticket is around 100, and so on.
Conclusion-
Seaborn is an advanced data visualisation library built on top of the Matplotlib library. In this assignment, we looked at how we can draw distributional and categorical plots using the Seaborn library. We have also seen how to plot matrix plots in Seaborn, how to change plot styles, and how to use grid functions to manipulate subplots.
Assignment Questions
1. List out the different types of plots used to find patterns in data.
2. Explain when you will use distribution plots and when you will use categorical
plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset)
Group A
Assignment No: 9
In [5]:
dataset.head()
Out[5]: first five rows of the Titanic dataset, with columns survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, ...
In [7]:
sns.boxplot(x='sex',y='age',data=dataset)
Out[7]:
<AxesSubplot:xlabel='sex', ylabel='age'>
Let's try to understand the box plot for females. The first quartile starts at around 5 and ends at 22, which means that 25% of the passengers are aged between 5 and 22. The second quartile starts at around 23 and ends at around 32, which means that 25% of the passengers are aged between 23 and 32. Similarly, the third quartile starts and ends between 34 and 42, hence 25% of the passengers are aged within this range, and finally the fourth or last quartile starts at 43 and ends around 65.
Passengers that do not belong to any of the quartiles are called outliers and are represented by dots on the box plot.
In [8]:
sns.boxplot(x='sex',y='age',data=dataset,hue='survived')
Out[8]:
<AxesSubplot:xlabel='sex', ylabel='age'>
You can make your box plots fancier by adding another layer of distribution. For instance, if you want to see the box plots for the age of passengers of both genders, along with information about whether or not they survived, you can pass survived as the value for the hue parameter.
Now in addition to the information about the age of each gender, you can also see the distribution of
the passengers who survived. For instance, you can see that among the male passengers, on average
younger people survived as compared to the older ones. Similarly, you can see that the variation among
the age of female passengers who did not survive is much greater than the age of the surviving female
passengers.
Conclusion-
Seaborn is an advanced data visualisation library built on top of the Matplotlib library. In this assignment, we looked at how to plot a box plot for the distribution of age with respect to each gender using the Seaborn library. We also saw how to change plot styles.
Group A
Assignment No: 10
Title of the Assignment: Download the Iris flower dataset or any other dataset into a
DataFrame.(e.g.,https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the
inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a boxplot for each feature in the dataset.
4. Compare distributions and identify outliers.
Download the Iris flower dataset or any other dataset into a DataFrame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris).
In [1]:
import pandas as pd
import numpy as np
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
In [2]:
iris = pd.read_csv(csv_url, names = col_names)
Q1. How many features are there and what are their types?
In [3]:
column = len(list(iris))
column
Out[3]:
5
Clearly, the dataset has 5 columns, indicating 5 features in the data.
In [4]:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
0 Sepal_Length 150 non-null float64
1 Sepal_Width 150 non-null float64
2 Petal_Length 150 non-null float64
3 Petal_Width 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Hence the dataset contains 4 numerical columns and 1 object column
In [6]:
np.unique(iris["Species"])
Out[6]:
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
Q2. Compute and display summary statistics for each feature available in the dataset.
In [13]:
iris.describe()
Out[13]:
       Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Q3. Data Visualization - Create a histogram for each feature in the dataset to illustrate the feature distributions. Plot each histogram.
Solution 1:
In [8]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
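A minimal sketch for Solution 1, relying on the imports in the cell above and using pandas' built-in hist(), which draws one histogram per numeric column:
iris.hist(figsize=(10, 8))
plt.show()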
OR
Solution 2:
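A sketch for Solution 2, drawing one histogram per feature with Seaborn (histplot; on older Seaborn versions use distplot instead):
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(iris['Sepal_Length'], ax=axes[0, 0])
sns.histplot(iris['Sepal_Width'], ax=axes[0, 1])
sns.histplot(iris['Petal_Length'], ax=axes[1, 0])
sns.histplot(iris['Petal_Width'], ax=axes[1, 1])
plt.show()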
Q4. Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a single plot. Compare distributions and identify outliers.
In [12]:
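A sketch for the combined boxplot (dropping the non-numeric Species column so each remaining feature gets its own box in one figure):
sns.boxplot(data=iris.drop(columns=['Species']))
plt.show()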
If we observe box 2 (Sepal_Width) closely, the interquartile distance is roughly around 0.75, hence the values lying beyond the range of (third quartile + interquartile distance), i.e. roughly around 4.05, will be considered as outliers.
Similarly, outliers for the other boxplots can be found.
Conclusion-
In this way we have loaded the Iris dataset into a DataFrame, listed its features and their types, computed summary statistics, and visualised the feature distributions with histograms and boxplots to identify outliers.
Group B
Assignment No: 11
Theory:
● Steps to Install Hadoop for distributed environment
● Java Code for processes a log file of a system
Step 1) Go to the Hadoop home directory and format the NameNode.
cd hadoop-2.7.3
bin/hdfs namenode -format
Step 2) Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons/nodes.
cd hadoop-2.7.3/sbin
1) Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks all the files stored across the cluster.
./hadoop-daemon.sh start namenode
2) Start DataNode:
On startup, a DataNode connects to the NameNode and responds to requests from the NameNode for different operations.
./hadoop-daemon.sh start datanode
3) Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its work is to manage each NodeManager and each application's ApplicationMaster.
./yarn-daemon.sh start resourcemanager
4) Start NodeManager:
The NodeManager in each machine framework is the agent responsible for managing containers, monitoring their resource usage and reporting the same to the ResourceManager.
./yarn-daemon.sh start nodemanager
5) Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history related requests from clients.
./mr-jobhistory-daemon.sh start historyserver
Step 3) To check that all the Hadoop services are up and running, run the below command.
jps
Step 4) cd
Step 9) cd mapreduce_vijay/
Step 10) ls
Step 14) ls
Step 17) cd ..
Step 20) ls
Step 21) cd
Step 29) Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the
NameNode interface.
Java Code to process the log file
Mapper Class:
package SalesCountry;

import java.io.IOException;
Reducer Class:
package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // sum up the counts emitted by the mapper for this key
        while (values.hasNext()) {
            IntWritable value = values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}
Driver Class:
package SalesCountry;
        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Input File
Pune Mumbai
Nashik Pune
Nashik
Kolapur
Assignment Questions
1. Write down the steps to design a distributed application using MapReduce which processes a log file of a system.
--------------------------------------------------------------------------------------------------------------------------
Group B
Experiment No: 12
--------------------------------------------------------------------------------------------------------------------------
Theory:
What is Scala ?
Scala is an acronym for “Scalable Language”. It is a general-purpose programming language
designed for the programmers who want to write programs in a concise, elegant, and type-
safe way. Scala enables programmers to be more productive. Scala is developed as an object-
oriented and functional programming language.
Installing Scala
Scala can be installed on any Unix or Windows based system. Below are the steps to
install Scala version 2.11.7 on Ubuntu 14.04.
Install Java
If you are asked to accept Java license terms, click on “Yes” and proceed. Once finished, let us
check whether Java has installed successfully or not. To check the Java version and installation,
you can type:
$ java -version
$ cd ~/Downloads
$ wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
$ sudo dpkg -i scala-2.11.7.deb
$ scala -version
Choosing the right environment depends on your preference and use case. I personally prefer writing a program on the shell because it provides a lot of good features like
suggestions for method calls, and you can also run your code while writing it line by line.
In Scala, you can declare a variable using ‘var’ or ‘val’ keyword. The decision is based on
whether it is a constant or a variable. If you use ‘var’ keyword, you define a variable as
mutable variable. On the other hand, if you use ‘val’, you define it as immutable. Let’s first
declare a variable using “var” and then using “val”.
Declare using var
In the above Scala statement, you declare a mutable variable called “Var1” which takes a
string value. You can also write the above statement without specifying the type of variable.
In the above Scala statement, we have declared an immutable variable "Var2" which takes a value that cannot be changed later.
Operations on variables
You can perform various operations on variables. There are various kinds of operators defined in
Scala. For example: Arithmetic Operators, Relational Operators, Logical Operators, Bitwise
Operators, Assignment Operators. Let's see the "+" and "==" operators on two variables, "Var4" and "Var5". But before that, let us first assign values to "Var4" and "Var5".
Var4==Var5
Output:
res2: Boolean = false
If you want to know the complete list of operators in Scala, refer to this link:
In Scala, the if-else expression is used for conditional statements. You can write one or more conditions inside "if". Let's declare a variable called "Var3" with a value of 1 and then compare "Var3" using an if-else expression.
var Var3 = 1
if (Var3 == 1){
    println("True")
} else {
    println("False")
}
Output: True
In the above snippet, the condition evaluates to True and hence True will be printed in the output.
Like most languages, Scala also has a FOR-loop, which is the most widely used method for iteration. For example, the output of a loop printing values of a:
Value of a: 6
Value of a: 7
Value of a: 8
Value of a: 9
Value of a: 10
You can define a function in Scala using the "def" keyword. Let's define a function called "mul2" which will take a number and multiply it by 10. You need to define the return type of the function; if a function does not return any value, you should use the "Unit" keyword. In the below
example, the function returns an integer value. Let's define the function "mul2":
def mul2(m: Int): Int = m * 10
mul2(2)
Output:
res9: Int = 20
If you want to read more about the function, please refer this tutorial.
Arrays
Lists
Sets
Tuple
Maps
Option
Arrays in Scala
In Scala, an array is a collection of similar elements. It can contain duplicates. An array has a fixed size, but its elements can be updated, and you can access the elements of an array using an index.
To declare an array in Scala, you can define it either using the new keyword or by directly assigning values to it.
In the above program, we have defined an array called name with 5 string values.
The following is the syntax for declaring an array variable using the new keyword.
Here you have declared an array of Strings called "name" that can hold up to three elements. You can check the contents of the array by typing its name in the shell:
scala> name
Accessing an array
You can access an element of an array by its index; indexing in Scala starts from 0. Let's access the first element of the array "name" by giving index 0.
name(0)
Output:
res11: String = jal
List in Scala
Lists are one of the most versatile data structures in Scala. Lists can contain items of different types in Python, but in Scala the items all have the same type. Scala lists are immutable.
You can define list simply by comma separated values inside the “List” method.
You can also define multi dimensional list in Scala. Lets define a two dimensional list:
Accessing a list
Let's get the third element of the list "numbers". The index should be 2 because indexing in Scala starts from 0.
scala> numbers(2)
res6: Int = 3
Let us start with a "Hello World!" program. It is a good simple way to understand how to write, compile and run code in Scala. No prizes for telling the outcome of this code!
object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}
As mentioned before, if you are familiar with Java, it will be easier for you to understand
Scala. If you know Java, you can easily see that the structure of the above "HelloWorld" program is very similar to a Java program. This program contains a method "main" (not returning any value) which takes an argument, a string array, through the command line. Next, it calls a predefined method called "println" and passes the argument "Hello, world!". You can define the main method as static in Java, but in Scala the static method is no longer available; Scala programmers use singleton objects instead of static methods.
To run any Scala program, you first need to compile it. "scalac" is the compiler which compiles the Scala source code into Java byte code. Let's start compiling your "HelloWorld" program using the following steps:
1. For compiling it, you first need to paste this program into a text file, then you need to save this program as HelloWorld.scala
2. Now you need to change your working directory to the directory where your program is saved
3. After changing the directory, you can compile the program by issuing the command:
scalac HelloWorld.scala
After compiling, you will get HelloWorld.class as an output in the same directory. If you can see this file, the compilation was successful.
After compiling, you can now run the program using the following command:
scala HelloWorld
If the above command runs successfully, the program will print "Hello, world!".
Conclusion: In this way we have successfully implemented a simple program in Scala using the Apache Spark framework.