DSBDA Lab Manual
INDEX
(Columns: Sr. No., Title, Page Number, Signature)
Part II: 310251: Data Science and Big Data Analytics
1  Data Wrangling I
In addition to the codes and outputs, explain every operation that you do in the above steps and explain everything that you do to import/read/scrape the data set.
2  Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following reasons: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution.
Provide the codes with outputs and explain everything that you do in this step.
4  Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various houses in Boston through different parameters. There are 506 samples and 14 feature variables in this dataset.
a. The objective is to predict the value of prices of the house using the given features.
5  Data Analytics II
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.
2. Compute the Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
7  Text Analytics
1. Extract a sample document and apply the following document preprocessing methods: Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.
8  Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a histogram.
9  Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each gender along with the information about whether they survived or not. (Column names: 'sex' and 'age')
Group B
1 Write a code in JAVA for a simple Word Count application that counts the number of occurrences of each word in a given input set using the Hadoop Map-Reduce framework on a local standalone set-up.
2 Design a distributed application using Map-Reduce which processes a log
file of a system.
3 Locate dataset (e.g., sample_weather.txt) for working on weather data
which reads the text input files and finds average for temperature, dew
point and wind speed.
4 Write a simple program in SCALA using Apache Spark framework
Group C- Mini Projects/ Case Study – PYTHON/R (Any TWO Mini Project)
2 Use the following dataset and classify tweets into positive and negative
tweets.
https://www.kaggle.com/ruchi798/data-science-tweets
3 Develop a movie recommendation model using the scikit-learn library in python.
Refer dataset: https://github.com/rashida048/Some-NLP-Projects/blob/master/movie_dataset.csv
4 Use the following covid_vaccine_statewise.csv dataset and perform the following analytics on the given dataset:
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons state wise vaccinated for first dose in India
c. Number of persons state wise vaccinated for second dose in India
d. Number of Males vaccinated
e. Number of Females vaccinated
5 Write a case study on processing data for Digital Marketing OR Health care systems with Hadoop Ecosystem components as shown. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
Group A
Objective of the Assignment: Students should be able to perform the data wrangling
operation using Python on any open source dataset
Prerequisite:
1. Basics of Python Programming
2. Concepts of Data Preprocessing, Data Formatting, Data Normalization and Data Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Dataset
2. Python Libraries for Data Science
3. Description of Dataset
4. Pandas DataFrame functions to load the dataset
5. Pandas functions for Data Preprocessing
6. Pandas functions for Data Formatting and Normalization
7. Pandas functions for handling categorical variables
---------------------------------------------------------------------------------------------------------------
1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are similar to
table rows, but the columns can contain not only strings or numbers, but also nested data
structures such as lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation and is
also called an attribute of a data instance. Some features may be inputs to a model (the predictors)
and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types, but
typically they are reduced to real or categorical values when working with traditional machine
learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to
train the model. It may be called the validation dataset.
| Pandas dtype | Python type | NumPy type | Usage |
|---|---|---|---|
| object | str or mixed | string_, unicode_, mixed types | Text or mixed numeric and non-numeric values |
| int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
b. NumPy
One of the most fundamental packages in Python, NumPy is a general-purpose array-processing package. It provides high-performance multidimensional array objects and tools to work with the arrays. NumPy is an efficient container of generic multidimensional data.
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
Matplotlib is a quintessential Python library for data visualization. You can create stories with the data visualized with Matplotlib. Another library from the SciPy Stack, Matplotlib plots 2D figures. From histograms, bar plots, scatter plots and area plots to pie charts, Matplotlib can depict a wide range of visualizations. With a bit of effort, you can create just about any visualization, including:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and other formatting entities.
d. Seaborn
The official documentation defines Seaborn as a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.
e. Scikit-learn
Introduced to the world as a Google Summer of Code project, Scikit-learn is a robust machine learning library for Python. It features ML algorithms like SVMs, random forests, k-means clustering, spectral clustering, mean shift, cross-validation and more. Even NumPy, SciPy and related scientific operations are supported by Scikit-learn, which is part of the SciPy Stack.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in
Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One
flower species is linearly separable from the other two, but the other two are not linearly separable
from each other.
Description of Dataset-
3. The csv file at the UCI repository does not contain the variable/column names. They are located
in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Read in the dataset from the UCI Machine Learning Repository link and specify the column names to use, as sketched below.
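A minimal sketch of steps 3 and 4, assuming the standard UCI location of the iris.data file (the URL below is an assumption based on the repository mentioned above):

import pandas as pd

# Column names are supplied manually because the UCI csv has no header row
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
dataset = pd.read_csv(url, names=col_names)

print(dataset.head())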
Pandas DataFrame functions for loading and inspecting the dataset:

| Sr. No | Function | Description |
|---|---|---|
| 1 | dataset.tail(n=5) | Return the last n rows. |
| 2 | dataset.dtypes | Returns a Series with the data type of each column. The result's index is the original DataFrame's columns. Columns with mixed types are stored with the object dtype. |
| 3 | dataset.iloc[:m, :n] | A subset of the first m rows and the first n columns. |
| 4 | dataset.iloc[3:5, 0:2] | Slice rows 3 to 4 and columns 0 to 1. |
| 5 | dataset[cols_2_4] | Select the columns whose names are listed in cols_2_4. |
Checking for missing values in the dataframe:
Function: DataFrame.isnull()
Output:
Checking whether any value is missing in each column:
Function: DataFrame.isnull().any()
Output:
c. Count of missing values across each column using isna() and isnull()
In order to get the count of missing values of the entire dataframe, the isnull() function is used together with sum(). The first sum() gives the column-wise counts, and a second sum() gives the total count of missing values in the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output: 8
d. Count row-wise missing values using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
Method 2:
Function: dataframe.isna().sum()
Count of missing values of a particular column (e.g. Gender):
df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
In order to get the count of missing values of the particular column by group in pandas
we will be using isnull() and sum() function with apply() and groupby() which performs
the group wise count of missing values as shown below.
Function: df1.groupby(['Gender'])['Score'].apply(lambda x: x.isnull().sum())
Output:
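A minimal consolidated sketch of the checks described above, assuming a small hypothetical dataframe df1 with Gender and Score columns as in the examples:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Gender': ['M', 'F', None, 'M', None, 'F'],
    'Score': [56, np.nan, 80, np.nan, 65, 72]
})

print(df1.isnull())                 # element-wise True where a value is missing
print(df1.isnull().any())           # per-column flag: is any value missing?
print(df1.isnull().sum().sum())     # total count of missing values in the dataframe
print(df1.isnull().sum(axis=1))     # row-wise count of missing values
print(df1.Gender.isnull().sum())    # missing values in a single column
print(df1.groupby(['Gender'])['Score'].apply(lambda x: x.isnull().sum()))  # group-wise count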
Data must be cleaned and prepared before it can be analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating number, integer, etc.) is another part of this initial 'cleaning' process. If you are working with dates in Pandas, they also need to be stored in the exact format required to use the special date-time functions.
b. Data normalization: Data normalization involves mapping all the numeric data values onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with statistical analysis and ensures better comparisons later on. It is also known as Min-Max scaling.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
x=df.iloc[:,:4]
Step 6: Create a MinMaxScaler object and use it to transform the data.
x_scaled = min_max_scaler.fit_transform(x)
Step 7: Run the normalizer on the dataframe.
df_normalized = pd.DataFrame(x_scaled)
Step 8: View the dataframe
df_normalized
Output: After Step 3:
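The complete normalization step, as a minimal runnable sketch (the UCI URL and column names are assumptions carried over from the loading example above):

import pandas as pd
from sklearn import preprocessing

col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=col_names)

x = df.iloc[:, :4]                                # the four numeric feature columns
min_max_scaler = preprocessing.MinMaxScaler()     # scaler mapping each column to [0, 1]
x_scaled = min_max_scaler.fit_transform(x)
df_normalized = pd.DataFrame(x_scaled, columns=x.columns)
print(df_normalized.head())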
Example : Suppose we have a column Height in some dataset. After applying label encoding,
the Height column is converted into:
where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height. Label
Encoding on iris dataset: For iris dataset the target column which is Species. It contains three
species Iris-setosa, Iris-versicolor, Iris-virginica.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor', 'Iris-
virginica'], dtype=object)
Step 4: Define a label_encoder object that knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
● Use LabelEncoder when there are only two possible values of a categorical feature, for example features having values such as yes or no, or a gender feature with only two possible values such as male or female.
Limitation: Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may lead to priority issues in the data set: a label with a higher value may be considered to have higher priority than a label with a lower value.
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the number
of categories (k) in the variable. For example, let’s say we have a categorical variable Color with
three categories called “Red”, “Green” and “Blue”, we need to use three dummy variables to
encode this variable using one-hot encoding. A dummy (binary) variable just takes the value 0 or
1 to indicate the exclusion or inclusion of a category.
In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
One-hot encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor', 'Iris-
virginica'], dtype=object)
Step 4: Apply the label_encoder object for label encoding, then observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Remove the target variable from dataset
features_df=df.drop(columns=['Species'])
Step 6: Apply one_hot encoder for Species column.
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Species']]).toarray())
Step 7: Join the encoded values with Features variable
df_encode = features_df.join(enc_df)
Step 8: Observe the merged dataframe
df_encode
Step 9: Rename the newly encoded columns.
df_encode.rename(columns = {0:'Iris-Setosa',
1:'Iris-Versicolor',2:'Iris-virginica'}, inplace = True)
Step 10: Observe the merged dataframe
df_encode
c. Dummy Encoding:
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of dummy
variables that is equal to the number of categories (k) in the variable, dummy encoding uses k-1
dummy variables. To encode the same Color variable with three categories using the dummy
encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2. “Green”
color is encoded as [0 1] vector of size 2. “Blue” color is
encoded as [0 0] vector of size 2.
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas function for one-hot encoding with dummy variables: pd.get_dummies(). Its important parameters are:
● columns: Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
● drop_first: Whether to get k-1 dummies out of k categorical levels by removing the first level.
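A minimal sketch of both behaviours on a small hypothetical Color column (the dataframe below is illustrative only):

import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})

# One-hot encoding: k dummy columns
print(pd.get_dummies(colors, columns=['Color']))

# Dummy encoding: k-1 dummy columns (the first level is dropped)
print(pd.get_dummies(colors, columns=['Color'], drop_first=True))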
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor', 'Iris-
virginica'], dtype=object)
Step 4: Apply the label_encoder object for label encoding, then observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 6: Apply one_hot encoder with dummy variables for Species column.
Conclusion: In this way we have explored the functions of the Python libraries for data preprocessing, data wrangling techniques, and handling missing values on the Iris dataset.
Assignment Questions:
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Assignment No-2
Objective of the Assignment: Students should be able to perform the data wrangling operation using Python on any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concepts of Data Preprocessing, Data Formatting, Data Normalization and Data Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
3. Identification and Handling of Outliers
4. Data Transformation for the purpose of :
a. To change the scale for better understanding
b. To decrease the skewness and convert distribution into normal distribution
---------------------------------------------------------------------------------------------------------------
Theory:
1. Creation of Dataset using Microsoft Excel.
The dataset is created in “CSV” format.
● The name of dataset is StudentsPerformance
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date .
● Number of Instances: 30
● The response variable is: Placement_Offer_Count .
● Range of Values:
Math_Score [60-80], Reading_Score [75-95], Writing_Score [60-80], Placement_Score [75-100], Club_Join_Date [2018-2021].
● The response variable is the number of placement offers facilitated to particular students, which largely depends on Placement_Score.
To fill the values in the dataset, the RANDBETWEEN function is used. It returns a random integer between the numbers you specify.
Syntax: RANDBETWEEN(bottom, top), where bottom is the smallest integer and top is the largest integer RANDBETWEEN will return.
For better understanding and visualization, 20% impurities are added into each variable of the dataset.
The steps to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other Formats.
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature, fill the data by considering the above specified range.
One example is given:
The placement offer count largely depends on the placement score. It is considered that if the placement score is less than 75, 1 offer is facilitated; for a placement score between 75 and 85, 2 offers are facilitated; and for a placement score greater than 85, 3 offers are facilitated. A nested IF formula is used for ease of data filling, as shown below.
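For example, assuming the placement score of the current row sits in cell F2 (a hypothetical column layout), the nested IF formula could look like:

=IF(F2>85, 3, IF(F2>=75, 2, 1))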
Step 4: In 20% of the data, fill in the impurities. The range of Math_Score is [60,80], so update a few instance values to fall below 60 or above 80. Repeat this for Writing_Score [60,80], Placement_Score [75-100] and Club_Join_Date [2018-2021].
Step 5: To violate the rule of the response variable, update a few values: if the placement score is greater than 85, facilitate only 1 offer.
1. None: None is a Python singleton object that is often used for missing data in Python
code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value recognized
by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values. To facilitate this convention, there are several useful functions for detecting, removing,
and replacing null values in Pandas DataFrame :
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: Use the isnull() function to check null values in the dataset.
df.isnull()
Step 5: Create a series that is True for NaN values of a specific column, for example the math score column, and display only the rows whose math score is NaN, as sketched below.
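A minimal sketch of this step, assuming the dataframe df loaded above has a 'math score' column:

series = pd.isnull(df['math score'])   # True where the math score is missing
print(df[series])                      # rows whose math score is NaN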
See that there are also categorical values in the dataset; for these, you need to use Label Encoding or One Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
newdf = df
df
In order to fill null values in a dataset, the fillna() and replace() functions are used. These functions replace NaN values with some value of their own. All these functions help in filling null values in the datasets of a DataFrame.
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: Fill missing values using fillna()
ndf = df
ndf.fillna(0)
Step 5: Fill missing values using the mean, median or standard deviation of that column, or with a fixed sentinel value. The following line will replace NaN values in the dataframe with the value -99.
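A minimal sketch of that replacement using the replace() function (-99 is just a sentinel value; ndf is the dataframe defined in Step 4):

ndf.replace(to_replace=np.nan, value=-99)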
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: To drop rows with at least 1 null value
ndf.dropna()
Step 5: To drop columns with at least 1 null value
ndf.dropna(axis = 1)
new_data
Similarly, an Outlier is an observation in a given dataset that lies far from the rest of the
observations. That means an outlier is vastly larger or smaller than the remaining values in the
set.
They may indicate an experimental error or heavy skewness in the data (heavy-tailed distribution).
Mean is the accurate measure to describe the data when we do not have any outliers
present. Median is used if there is an outlier in the dataset. Mode is used if there is an outlier
AND about ½ or more of the data is the same.
‘Mean’ is the only measure of central tendency that is affected by the outliers which in
turn impacts Standard deviation.
Example:
Consider a small dataset, sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By looking at it, one can quickly say '101' is an outlier that is much larger than the other values.
From the above calculations, we can clearly say the Mean is more affected than the Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But what if
we have a huge dataset, how do we identify the outliers then? We need to use visualization and
mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4:Select the columns for boxplot and draw the boxplot.
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
print(np.where(df['reading score']<25))
print(np.where(df['writing score']<30))
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot with placement score and placement offer count.
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(df['placement score'], df['placement offer count'])
plt.show()
Labels for the axes can be assigned (optional):
ax.set_xlabel('Placement Score')
ax.set_ylabel('Placement Offer Count')
Step 5: We can now print the outliers with reference to scatter plot.
Algorithm:
Step 1: Import numpy and stats from the scipy library.
import numpy as np
from scipy import stats
Step 2: Calculate the Z-scores for the math score column.
z = np.abs(stats.zscore(df['math score']))
Step 3: Print the Z-score values. It prints the z-score value of each data item of the column.
print(z)
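To actually flag the outliers, a threshold is applied to the Z-scores; 3 is a common choice (an assumption here, not a value fixed by the manual):

threshold = 3
outlier_positions = np.where(z > threshold)   # indices of data items with |z| above the threshold
print(outlier_positions)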
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formula, following the standard statistical convention, the IQR is scaled up by 0.5 of itself (new_IQR = IQR + 0.5*IQR = 1.5*IQR) to define the outlier fences.
Algorithm:
Step 1 : Import numpy library
import numpy as np
Step 2: Sort the Reading Score feature and store it in sorted_rscore.
sorted_rscore = np.sort(df['reading score'])
Step 3: Calculate the first and third quartiles, q1 and q3.
q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1, q3)
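Continuing the IQR method, a minimal sketch of the remaining steps (the fence names lwr_bound and upr_bound are reused later in this section):

IQR = q3 - q1
lwr_bound = q1 - 1.5 * IQR
upr_bound = q3 + 1.5 * IQR
print(lwr_bound, upr_bound)

# Values outside the fences are treated as outliers
r_outliers = sorted_rscore[(sorted_rscore < lwr_bound) | (sorted_rscore > upr_bound)]
print(r_outliers)

An alternative to removing outliers is capping them, for example at the 90th percentile, as shown next.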
b = np.where(df_stud['math score'] > ninetieth_percentile, ninetieth_percentile, df_stud['math score'])
print("New array:", b)
df_stud.insert(1, "m score", b, True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the outliers with the median value.
1. Plot the box plot for the reading score.
col = ['reading score']
df.boxplot(col)
median = np.median(sorted_rscore)
median
4. Replace the upper bound outliers using median value
refined_df=df
refined_df['reading score'] = np.where(refined_df['reading score'] > upr_bound, median, refined_df['reading score'])
5. Display refined_df
6. Replace the lower bound outliers using the median value.
refined_df['reading score'] = np.where(refined_df['reading score'] < lwr_bound, median, refined_df['reading score'])
7. Display refined_df
● Smoothing: It is a process that is used to remove noise from the dataset using some algorithms. It allows for highlighting important features present in the dataset and helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization: It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, Age initially in numerical form (22, 25) is converted into a categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
Min–max normalization: This transforms the original data linearly.
Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
● Attribute or feature construction.
New attributes constructed from the given ones: Where new attributes are
created & applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
In this assignment, the purpose of the transformation should be one of the reasons listed in the problem statement (changing the scale, linearizing a relation, or reducing skewness).
df = pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and
remove the outliers.
Step 2: Observe the histogram for the math score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind='hist')
Step 3: Convert the variable to a logarithmic scale (base 10).
df['log_math'] = np.log10(df['math score'])
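A minimal end-to-end sketch of this transformation step, assuming the same /content/demo.csv path used earlier in this assignment and a 'math score' column:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/content/demo.csv")

df['math score'].plot(kind='hist')            # original (possibly skewed) distribution
plt.show()

df['log_math'] = np.log10(df['math score'])   # log10 transform to reduce right skew
df['log_math'].plot(kind='hist')              # transformed distribution
plt.show()

print(df['math score'].skew(), df['log_math'].skew())   # compare skewness before and after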
Assignment No-3
Title of the Assignment: Descriptive Statistics - Measures of Central Tendency and Variability. Perform the following operations on any open source dataset (e.g., data.csv):
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a
dataset (age, income etc.) with numeric variables grouped by one of the qualitative
(categorical) variables.
For example, if your categorical variable is age groups and quantitative variable is income,
then provide summary statistics of income grouped by the age groups. Create a list that
contains a numeric value for each response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile, mean, standard deviation etc. of the species 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica' of the iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
Objective of the Assignment: Students should be able to perform the Statistical operations using Python on
any open source dataset.
Prerequisite:
1. Basics of Python Programming
2. Concept of statistics such as mean, median, minimum, maximum, standard deviation etc.
Introduction
Descriptive Statistics is the building block of data science. Advanced analytics is often incomplete without
analyzing descriptive statistics of the key metrics. In simple terms, descriptive statistics can be defined as the
measures that summarize a given data, and these measures can be broken down further into the measures of
central tendency and the measures of dispersion.
Measures of central tendency include mean, median, and the mode, while the measures of variability include
standard deviation, variance, and the interquartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.
We will cover the topics given below:
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
We will begin by loading the dataset to be used in this guide.
Data
In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10 variables, as
described below:
Measures of central tendency describe the center of the data, and are often represented by the mean, the median,
and the mode.
Mean
Mean represents the arithmetic average of the data. The line of code below prints the mean of the numerical
variables in the data. From the output, we can infer that the average age of the applicant is 49 years, the average
annual income is USD 705,541, and the average tenure of loans is 183 months. The command df.mean(axis =
0) will also give the same output.
df.mean()

Output:
Dependents          0.748333
Income         705541.333333
Loan_amount    323793.666667
Term_months       183.350000
Age                49.450000
dtype: float64
It is also possible to calculate the mean of a particular variable in a data, as shown below, where we calculate the
mean of the variables 'Age' and 'Income'.
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())

Output:
49.45
705541.33
In the previous sections, we computed the column-wise mean. It is also possible to calculate the mean of the
rows by specifying the (axis = 1) argument. The code below calculates the mean of the first five rows.
df.mean(axis = 1)[0:5]

Output:
0     70096.0
1    161274.0
2    125113.4
3    119853.8
4    120653.8
dtype: float64
Median
In simple terms, median represents the 50th percentile, or the middle value of the data, that separates the
distribution into two halves. The line of code below prints the median of the numerical variables in the data. The
command df.median(axis = 0) will also give the same output.
df.median()

Output:
Dependents          0.0
Income         508350.0
Loan_amount     76000.0
Term_months       192.0
Age                51.0
dtype: float64
From the output, we can infer that the median age of the applicants is 51 years, the median annual income is
USD 508,350, and the median tenure of loans is 192 months. There is a difference between the mean and the
median values of these variables, which is because of the distribution of the data. We will learn more about this
in the subsequent sections.
It is also possible to calculate the median of a particular variable in a data, as shown in the first two lines of
code below. We can also calculate the median of the rows by specifying the (axis = 1) argument. The third
line below calculates the median of the first five rows.
#to calculate the median of a particular column
print(df.loc[:,'Age'].median())
print(df.loc[:,'Income'].median())

df.median(axis = 1)[0:5]

Output:
51.0
508350.0

0    102.0
1    192.0
2    192.0
3    192.0
4    192.0
dtype: float64
Mode
Mode represents the most frequent value of a variable in the data. This is the only central tendency measure that
can be used with categorical variables, unlike the mean and the median which can be used only with quantitative
data.
The line of code below prints the mode of all the variables in the data. The .mode() function returns the most
common value or most repeated value of a variable. The command df.mode(axis = 0) will also give the same
output.
df.mode()

Output:
|   | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|---|----------------|------------|-------------|--------|-------------|-------------|--------------|-----------------|-----|-----|
| 0 | Yes            | 0          | Yes         | 333300 | 70000       | 192.0       | Satisfactory | Yes             | 55  | M   |
The interpretation of the mode is simple. The output above shows that most of the applicants are married, as
depicted by the 'Marital_status' value of "Yes". A similar interpretation could be done for the other categorical
variables like 'Sex' and 'Credit-Score'. For numerical variables, the mode value represents the value that occurs
most frequently. For example, the mode value of 55 for the variable 'Age' means that the highest number (or
frequency) of applicants are 55 years old.
Measures of Dispersion
In the previous sections, we have discussed the various measures of central tendency. However, as we have seen
in the data, the values of these measures differ for many variables. This is because of the extent to which a
distribution is stretched or squeezed. In statistics, this is measured by dispersion which is also referred to as
variability, scatter, or spread. The most popular measures of dispersion are standard deviation, variance, and the
interquartile range.
Standard Deviation
Standard deviation is a measure that is used to quantify the amount of variation of a set of data values from its
mean. A low standard deviation for a variable indicates that the data points tend to be close to its mean, and vice
versa. The line of code below prints the standard deviation of all the numerical variables in the data.
df.std()

Output:
Dependents          1.026362
Income         711421.814154
Loan_amount    724293.480782
Term_months        31.933949
Age                14.728511
dtype: float64
While interpreting standard deviation values, it is important to understand them in conjunction with the mean.
For example, in the above output, the standard deviation of the variable 'Income' is much higher than that of the
variable 'Dependents'. However, the unit of these two variables is different and, therefore, comparing the
dispersion of these two variables on the basis of standard deviation alone will be incorrect. This needs to be kept
in mind.
It is also possible to calculate the standard deviation of a particular variable, as shown in the first two lines of
code below. The third line calculates the standard deviation for the first five rows.
print(df.loc[:,'Age'].std())
print(df.loc[:,'Income'].std())

#calculate the standard deviation of the first five rows
df.std(axis = 1)[0:5]

Output:
14.728511412020659
711421.814154101

0    133651.842584
1    305660.733951
2    244137.726597
3    233466.205060
4    202769.786470
dtype: float64
Variance
Variance is another measure of dispersion. It is the square of the standard deviation and the covariance of the
random variable with itself. The line of code below prints the variance of all the numerical variables in the
dataset. The interpretation of the variance is similar to that of the standard deviation.
df.var()

Output:
Dependents     1.053420e+00
Income         5.061210e+11
Loan_amount    5.246010e+11
Term_months    1.019777e+03
Age            2.169290e+02
dtype: float64
Skewness
Another useful statistic is skewness, which is the measure of the symmetry, or lack of it, for a real-valued
random variable about its mean. The skewness value can be positive, negative, or undefined. In a perfectly
symmetrical distribution, the mean, the median, and the mode will all have the same value. However, the
variables in our data are not symmetrical, resulting in different values of the central tendency.
We can calculate the skewness of the numerical variables using the skew() function, as shown below.
1print(df.skew())
python
Output:
1 Dependents 1.169632
2 Income 5.344587
3 Loan_amount 5.006374
4 Term_months -2.471879
5 Age -0.055537
6 dtype: float64
The skewness values can be interpreted in the following manner:
Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
Approximately symmetric distribution: If the skewness value is between −½ and +½.
We have learned the measures of central tendency and dispersion in the previous sections. It is important to analyse these individually; however, there are certain useful functions in Python that can be called upon to find all of these values at once. One such important function is .describe(), which prints the summary statistics of the numerical variables. The line of code below performs this operation on the data.
df.describe()

Output:
|       | Dependents | Income       | Loan_amount  | Term_months | Age        |
|-------|------------|--------------|--------------|-------------|------------|
| count | 600.000000 | 6.000000e+02 | 6.000000e+02 | 600.000000  | 600.000000 |
| mean  | 0.748333   | 7.055413e+05 | 3.237937e+05 | 183.350000  | 49.450000  |
| std   | 1.026362   | 7.114218e+05 | 7.242935e+05 | 31.933949   | 14.728511  |
| min   | 0.000000   | 3.000000e+04 | 1.090000e+04 | 18.000000   | 22.000000  |
| 25%   | 0.000000   | 3.849750e+05 | 6.100000e+04 | 192.000000  | 36.000000  |
Now we have the summary statistics for all the variables. For qualitative variables, we will not have the statistics
such as the mean or the median, but we will have statistics like the frequency and the unique label.
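For the two tasks in this assignment's problem statement, the same ideas can be combined with groupby(). A minimal sketch, assuming a data.csv file with hypothetical 'age_group' and 'income' columns, and an iris.csv file with a 'Species' column:

import pandas as pd

# Task 1: summary statistics of a numeric variable grouped by a categorical variable
data = pd.read_csv("data.csv")                                    # assumed columns: 'age_group', 'income'
print(data.groupby('age_group')['income'].describe())
income_lists = data.groupby('age_group')['income'].apply(list)    # a list of numeric responses per category
print(income_lists)

# Task 2: basic statistical details for each Iris species
iris = pd.read_csv("iris.csv")
for species in ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']:
    print(species)
    print(iris[iris['Species'] == species].describe())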
Conclusion
In this guide, you have learned about the fundamentals of the most widely used descriptive statistics and their
calculations with Python. We covered the following topics in this guide:
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
It is important to understand the usage of these statistics and which one to use, depending on the problem statement and the data.
Assignment Questions:
Assignment No-4
Title of the Assignment: Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about various houses in Boston through different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform data analysis using linear regression in Python for any open source dataset.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
2. Concept of Regression.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Linear Regression : Univariate and Multivariate
2. Least Square Method for Linear Regression
3. Measuring Performance of Linear Regression
4. Example of Linear Regression
5. Training data set and Testing data set
---------------------------------------------------------------------------------------------------------------
1. Linear Regression: It is a machine learning algorithm based on supervised learning. It targets prediction of values on the basis of independent variables.
● It is preferred for finding out the relationship between forecasting and variables.
● The dependent variable (Y) is continuous, while the independent variable (X) may be continuous or discrete. A linear relationship should exist between the predictor and the target variable, hence the name Linear Regression.
● Linear regression is popular because the cost function is Mean Squared Error (MSE)
which is equal to the average squared difference between an observation’s actual and
predicted values.
● It is shown as the equation of a line:
Y = m*X + b + e
where b is the intercept, m is the slope of the line and e is the error term.
This equation can be used to predict the value of target variable Y based on given predictor
variable(s) X, as shown in Fig. 1.
● Fig. 2 shown below is about the relation between weight (in Kg) and height (in cm), a
linear relation. It is an approach of studying in a statistical manner to summarise and
learn the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’ is considered as the predictor, explanatory, or
independent variable.
Fig.2 : Relation between weight (in Kg) and height (in cm)
Multivariate Regression: It concerns the study of two or more predictor variables. Usually a transformation of the original features into polynomial features of a given degree is performed, and Linear Regression is then applied to it.
● A simple linear model Y = a + bX in the original feature will be transformed into a polynomial feature, and after applying linear regression it will look something like
Y = a + bX + cX²
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.
● A simple linear model is the one which involves only one dependent and one independent
variable. Regression Models are usually denoted in Matrix Notations.
● However, for a simple univariate linear model, it can be denoted by the regression equation
ŷ = β0 + β1x   (1)
2. Least Square Method for Linear Regression
The least squares technique estimates the regression coefficients by minimizing the sum of the squares of errors at all the points in the sample set. The error is the deviation of the actual sample data point from the regression line. The technique can be represented by the equation

min Σi (yi - ŷi)²   (2)

Using differential calculus on equation (1), we can find the values of β0 and β1 such that this sum of squared errors is minimized:

β1 = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²   (3)
β0 = ȳ - β1x̄   (4)
Once the Linear Model is estimated using equations (3) and (4), we can estimate the value of the
dependent variable in the given range only. Going outside the range is called extrapolation which
is inaccurate if simple regression techniques are used.
3. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model created
based on the given set of observations in the sample. Two or more regression models created
using a given sample data can be compared based on their MSE. The lesser the MSE, the better
the regression model is. When the linear regression model is trained using a given set of
observations, the model with the least mean sum of squares error (MSE) is selected as the best
model. The Python or R packages select the best-fit model as the model with the lowest MSE or
lowest RMSE when training the linear regression models.
Mathematically, the MSE can be calculated as the average sum of the squared difference between
the actual value and the predicted or estimated value represented by the regression model (line or
plane).
An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE:
The Root Mean Squared Error method basically calculates the least-squares error and takes the root of the summed values. Mathematically speaking, Root Mean Squared Error is the square root of the sum of all squared errors divided by the total number of values:
RMSE = sqrt( Σ (yi - ŷi)² / n )
R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total(SST).
SST : total sum of squares (SST), regression sum of squares (SSR), Sum of square of errors(SSE) are
all showing the variation with different measures.
A value of R-squared closer to 1 would mean that the regression model covers most part of the
variance of the values of the response variable and can be termed as a good model.
One can alternatively use MSE or R-Squared based on what is appropriate and the need of the hour.
However, the disadvantage of using MSE rather than R-squared is that it will be difficult to gauge the
performance of the model using MSE as the value of MSE can vary from 0 to any larger number.
However, in the case of R-squared, the value is bounded between 0 and 1.
4. Example of Linear Regression
Consider following data for 5 students.
Each Xi (i = 1 to 5) represents the score of the ith student in standard X, and the corresponding Yi (i = 1 to 5) represents the score of the ith student in standard XII.
(i) Linear regression equation best predicts standard XIIth score
(ii) Interpretation for the equation of Linear Regression
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?
x    y    x-x̄   y-ȳ   (x-x̄)²   (x-x̄)(y-ȳ)
95   85   17    8     289      136
85   95   7     18    49       126
80   70   2     -7    4        -14
70   65   -8    -12   64       96
60   70   -18   -7    324      126

x̄ = 78, ȳ = 77, Σ(x-x̄)² = 730, Σ(x-x̄)(y-ȳ) = 470
(i) The linear regression equation that best predicts the standard XIIth score:
ŷ = β0 + β1x
β1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)² = 470/730 = 0.644
β0 = ȳ - β1x̄ = 77 - 0.644 × 78 = 26.768
ŷ = 26.768 + 0.644x
Interpretation 1
For an increase of one unit in x, the value of y increases by 0.644 units.
Interpretation 2
The score in standard XII (Yi) depends on the score in standard X (Xi) with a slope of 0.644, while other factors contribute a constant 26.768 to the standard XII result.
(iii) If a student's score is 80 in std X, then his expected score in XII standard is 26.768 + 0.644 × 80 = 78.288.
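The same hand calculation can be verified with a few lines of NumPy (a sketch of the worked example above):

import numpy as np

x = np.array([95, 85, 80, 70, 60])
y = np.array([85, 95, 70, 65, 70])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope = 470/730
b0 = y.mean() - b1 * x.mean()                                                # intercept
print(b1, b0)            # approximately 0.644 and 26.77
print(b0 + b1 * 80)      # predicted standard XII score for a standard X score of 80 (about 78.3)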
● The input of the testing phase is test data, which is passed to the machine learning model, and prediction is done to observe the correctness of the model.
● Machines can learn when they observe enough relevant data. Using this one can model
algorithms to find relationships, detect patterns, understand complex problems and make
decisions.
● Training error is the error that occurs by applying the model to the same data from which the
model is trained.
● In simple terms, when the actual output of the training data and the predicted output of the model do not match, the training error Ein is said to have occurred.
● Training error is much easier to compute.
(b) Testing Phase
● Testing dataset is provided as input to this phase.
● Test dataset is a dataset for which class label is unknown. It is tested using model
● A test dataset used for assessment of the finally chosen model.
● Training and Testing dataset are completely different.
● Testing error is the error that occurs when assessing the model by providing unknown data to the model.
● In simple terms, when the actual output of the testing data and the predicted output of the model do not match, the testing error Eout is said to have occurred.
● Eout is generally observed to be larger than Ein.
(c) Generalization
● Generalization is the prediction of the future based on the past system.
● It needs to generalize beyond the training data to some future data that it might not have seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never seen.
● In general, the dataset is divided into two partition training and test sets.
● The fit method is called on the training set to build the model.
● The predict method is then applied to the fitted model on the test set to estimate the target values and evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to estimate how well
the model trained on the training data and how well it would perform on the unseen data.
Output:
68.63
Step 6: Predict the y_pred for all values of x.
y_pred= predict(x)
y_pred
Output:
array([81.50684932, 87.94520548, 71.84931507, 68.63013699, 71.84931507])
Step 7: Evaluate the performance of the model (R-Square).
R-squared calculation is not implemented in numpy, so that one should be borrowed from sklearn.
from sklearn.metrics import r2_score
r2_score(y, y_pred)
Output:
0.4803218090889323
Step 8: Plotting the linear regression model
y_line = model[1] + model[0] * x
plt.plot(x, y_line, c='r')
plt.scatter(x, y_pred)
plt.scatter(x, y, c='r')
Output:
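The numbering jumps from Step 8 to Step 12 below; the intermediate steps (loading the Boston data, splitting it into training and testing sets, and fitting the model) are not reproduced here. A minimal sketch of those steps, assuming the Boston housing data is available as a local CSV file boston.csv with a MEDV price column (hypothetical file name), producing the variables used in Steps 12 and 13:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

boston = pd.read_csv("boston.csv")        # hypothetical local copy of the Boston housing data
X = boston.drop(columns=['MEDV'])         # the 13 predictor variables
y = boston['MEDV']                        # target: median home value

# Split into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the linear regression model and predict on both sets
lm = LinearRegression()
lm.fit(xtrain, ytrain)
ytrain_pred = lm.predict(xtrain)
ytest_pred = lm.predict(xtest)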
Step 12: Calculate the Mean Squared Error for train_y and test_y.
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(ytrain, ytrain_pred)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
plt.scatter(ytrain, ytrain_pred, c='blue', marker='o', label='Training data')
plt.scatter(ytest, ytest_pred, c='lightgreen', marker='s', label='Test data')
plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc='upper left')
#plt.hlines(y=0, xmin=0, xmax=50)
plt.plot()
plt.show()
Conclusion:
In this way we have done data analysis using linear regression for Boston Dataset and
predict the price of houses using the features of the Boston Dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical session)
.
Assignment No-5
---------------------------------------------------------------------------------------------------------------
1. Logistic Regression: Classification techniques are an essential part of machine learning and data
mining applications. Approximately 70% of problems in Data Science are classification
problems. There are lots of classification problems that are available, but logistic regression is
common and is a useful regression method for solving the binary classification problem. Another
category of classification is Multinomial classification, which handles the issues where multiple
classes are present in the target variable. For example, the IRIS dataset is a very famous example
of multi-class classification. Other examples are classifying article/blog/document categories.
Logistic Regression can be used for various classification problems such as spam detection, diabetes prediction, predicting whether a given customer will purchase a particular product or churn to another competitor, whether a user will click on a given advertisement link or not, and many more.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms
for two-class classification. It is easy to implement and can be used as the baseline for any binary
classification problem. Its basic fundamental concepts are also constructive in deep learning.
Logistic regression describes and estimates the relationship between one dependent binary
variable and independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or target
variable is dichotomous in nature. Dichotomous means there are only two possible classes. For
example, it can be used for cancer detection problems. It computes the probability of an event
occurring.
It is a special case of linear regression where the target variable is categorical in nature. It uses the log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event via a logit function:
log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bn*xn
where y is the dependent variable (through p = P(y = 1)) and x1, x2, ..., xn are explanatory variables.
3. Sigmoid Function
The sigmoid function, also called logistic function, gives an ‘S’ shaped curve that can take any real-
valued number and map it into a value between 0 and 1. If the curve goes to positive infinity, y predicted
will become 1, and if the curve goes to negative infinity, y predicted will become 0. If the output of the
sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. The output is a probability and cannot go outside the 0 to 1 range. For example, if the output is 0.75, we can say in terms of probability: there is a 75 percent chance that a patient will suffer from cancer.
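A minimal sketch of the sigmoid function and the 0.5 decision rule described above:

import numpy as np

def sigmoid(z):
    # Maps any real-valued number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))            # 0.5, the decision boundary
print(sigmoid(1.1) > 0.5)    # classified as 1 / YES
print(sigmoid(-2.0) > 0.5)   # classified as 0 / NO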
4. Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories, such as predicting the type of Wine.
Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
The following table shows the confusion matrix for a two class classifier.
Here each row indicates the actual classes recorded in the test data set and each column indicates the classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns
prediction errors.
● Number of positives (Pos): Total number of instances which are labelled as positive in a given dataset.
● Number of negatives (Neg): Total number of instances which are labelled as negative in a given dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as positive
and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as negative
and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as negative
and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as positive
and the class predicted by the classifier is negative.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positive
and true negative (TP + TN) divided by the total number of instances.
acc = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / (Pos + Neg)
● Error Rate: Error Rate is calculated as the number of incorrectly classified instances divided
by total number of instances.
The ideal value of the error rate is 0, and the worst is 1. It is also calculated as the sum of false positive and false negative (FP + FN) divided by the total number of instances.
err = (FP + FN) / (TP + FP + TN + FN) = (FP + FN) / (Pos + Neg)
or
err = 1 - acc
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances which are predicted positive. It is also called confidence value. The
ideal value is 1, whereas the worst is 0.
precision = TP / (TP + FP)
● Recall: It is calculated as the number of correctly classified positive instances divided by the total number of positive instances. It is also called sensitivity. The ideal value of recall is 1, whereas the worst is 0.
recall = TP / (TP + FN)
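Once predictions are available, the confusion matrix and the metrics above can be obtained with scikit-learn. A minimal sketch with hypothetical label arrays (replace them with the actual and predicted classes from the model):

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by the classifier (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy  :", accuracy_score(y_true, y_pred))
print("Error rate:", 1 - accuracy_score(y_true, y_pred))
print("Precision :", precision_score(y_true, y_pred))
print("Recall    :", recall_score(y_true, y_pred))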
Step 6: Predict the y_pred for all values of train_x and test_x.
Conclusion:
In this way we have done data analysis using logistic regression for the Social Network Ads dataset and evaluated the performance of the model.
Value Addition: Visualising Confusion Matrix using Heatmap
Assignment Question:
1) Consider a binary classification task with two classes, positive and negative. Find out TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall.
2) Comment on whether the model is best fit or not based on the calculated values.
3) Write Python code for the preprocessing mentioned in Step 4 and explain every step in detail.
Assignment No: 6
---------------------------------------------------------------------------------------------------------------------------------
● Naïve Bayes Classifier can be used for classification of categorical data.
○ Let there be 'j' classes, C = {1, 2, ..., j}.
○ Let the input observation be specified by 'P' features. Therefore an input observation x is given as x = {F1, F2, ..., Fp}.
○ The Naïve Bayes classifier depends on Bayes' rule from probability theory.
● Prior probabilities: Probabilities which are calculated for some event based on no other information are called prior probabilities. For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A), occurrences of event B or C are not considered, i.e. no information about the occurrence of any other event is used.
● Conditional probabilities: The probability of an event A given that another event B has already occurred is written P(A | B). Combining the definitions of joint and conditional probability (equations (1) and (2)) gives Bayes' rule:
P(C | X) = P(X | C) · P(C) / P(X)
We have a dataset with some features Outlook, Temp, Humidity, and Windy, and the target here is to
predict whether a person or team will play tennis or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given features. Writing Bayes' rule in terms of classes and features gives:
P(Ck | X) = P(X | Ck) · P(Ck) / P(X)
Now we have two classes and four features, so if we write this formula for class C1, it becomes:
P(C1 | X1 ∩ X2 ∩ X3 ∩ X4) = P(X1 ∩ X2 ∩ X3 ∩ X4 | C1) · P(C1) / P(X1 ∩ X2 ∩ X3 ∩ X4)
Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3 and X4, because we are considering the situation when all these features are present at the same time.
The Naive Bayes algorithm assumes that all the features are independent of each other, in other words all the features are unrelated. With that assumption, we can further simplify the above formula and write it in this form:
P(C1 | X1, X2, X3, X4) ∝ P(X1 | C1) · P(X2 | C1) · P(X3 | C1) · P(X4 | C1) · P(C1)
This is the final equation of Naive Bayes, and we have to calculate the probability of both C1 and C2 for this particular example.
Since P(No | Today) > P(Yes | Today), the prediction for this example is ‘No’ (play will not happen).
Step 5: Use the Naive Bayes algorithm (train the machine) to create the model
# import the class
from sklearn.naive_bayes import GaussianNB
# instantiate the model and train it on the training split
# (X_train, y_train and X_test are assumed to come from the earlier train/test split step)
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)
Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and evaluated the
performance of the model.
Assignment Question:
1) Consider the observations for the car-theft scenario having 3 attributes: Colour, Type, and Origin.
Find the probability of car theft for the scenario (Red, SUV, Domestic).
Assignment No: 7
----------------------------------------------------------------------------------------------------------------
Title of the Assignment:
1. Extract Sample document and apply following document preprocessing methods: Tokenization, POS Tagging, stop
words removal, Stemming and Lemmatization. 2. Create representation of document by calculating Term Frequency
and Inverse Document Frequency.
Objective of the Assignment: Students should be able to perform Text Analysis using TF IDF Algorithm
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
2. Basics of the English language.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Basic concepts of Text Analytics
2. Text Analysis Operations using natural language toolkit
3. Text Analysis Model using TF-IDF.
Text mining is also referred to as text analytics. Text mining is the process of exploring sizable textual data
and finding patterns. Text mining processes the text itself, while NLP processes the underlying
metadata. Finding frequency counts of words, the length of sentences, and the presence or absence of specific words is
known as text mining. Natural language processing is one of the components of text mining. NLP helps
identify sentiment, find entities in a sentence, and categorise blogs or articles. Text mining produces
preprocessed data for text analytics. In text analytics, statistical and machine learning algorithms are used
to classify information.
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces and lexical resources such as WordNet, along with a suite
of text processing libraries for classification, tokenization, stemming, tagging, parsing, semantic
reasoning, and more.
Analysing movie reviews is one of the classic examples to demonstrate a simple NLP Bag-of-words model,
on movie reviews.
2.1. Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into
smaller chunks such as words or sentences is called Tokenization. Token is a single entity that is
the building blocks for a sentence or paragraph.
● Sentence tokenization: splits a paragraph into a list of sentences using the sent_tokenize() method.
● Word tokenization: splits a sentence into a list of words using the word_tokenize() method.
Stemming is a normalization technique where lists of tokenized words are converted into
shortened root words to remove redundancy. Stemming is the process of reducing inflected (or
sometimes derived) words to their word stem, base, or root form.
A computer program that stems words may be called a stemmer.
E.g. a stemmer reduces the words fishing, fished, and fisher to the stem fish. The stem need not be
a word; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the
stem argu.
Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on
its meaning and context. Lemmatization usually refers to the morphological analysis of words,
which aims to remove inflectional endings. It helps in returning the base or dictionary form of a
word known as the lemma.
Eg. Lemma for studies is study
Lemmatization Vs Stemming
Stemming algorithm works by cutting the suffix from the word. In a broader sense cuts either the
beginning or end of the word.
On the contrary, Lemmatization is a more powerful operation, and it takes into consideration
morphological analysis of the words. It returns the lemma which is the base form of all its
inflectional forms. In-depth linguistic knowledge is required to create dictionaries and look for the
proper form of the word. Stemming is a general operation while lemmatization is an intelligent
operation where the proper form will be looked in the dictionary. Hence, lemmatization helps in
forming better machine learning features.
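A minimal sketch comparing the two operations in NLTK (the sample word list is illustrative, and the WordNet corpus is downloaded inside the snippet):
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["studies", "studying", "cries", "cry"]
print("Stemmed:   ", [stemmer.stem(w) for w in words])
print("Lemmatized:", [lemmatizer.lemmatize(w) for w in words])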
● Term Frequency (TF)
The initial step is to make a vocabulary of unique words and calculate TF for each document. TF
will be more for words that frequently appear in a document and less for rare words in a
document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not consider the
importance of words. Some words such as’ of’, ‘and’, etc. can be most frequently present but are of
little significance. IDF provides weightage to each word based on its frequency in the corpus D.
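For reference, the standard definitions can be written as follows (added here as a brief note; df(w) denotes the number of documents in the corpus that contain the word w):
TF(w, d) = (number of times w appears in document d) / (total number of words in d)
IDF(w) = ln( N / df(w) ), where N is the number of documents in the corpus D.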
In our
example, since we have two documents in the corpus, N=2.
TFIDF gives more weightage to the word that is rare in the corpus (all the documents). TFIDF provides
more importance to the word that is more frequent in the document.
After
applying TFIDF, text in A and B documents can be represented as a TFIDF vector of dimension equal to the
vocabulary words. The value corresponding to each word represents the importance of that word in a particular
document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can result in high IDF
for some words, thereby dominating the TFIDF. We don’t want that, and therefore we use ln so that the IDF
does not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but TFIDF does not
capture that. Moreover, TFIDF can be computationally expensive if the vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into
vectors of numbers. In natural language processing, a common technique for extracting features from text
is to place all of the words that occur in the text in a
bucket. This approach is called a bag of words model or BoW for short. It’s referred to as a “bag” of
words because any information about the structure of the sentence is lost.
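As an illustration of the idea (not part of the original steps), scikit-learn's CountVectorizer builds exactly such word-count vectors; the two sample sentences below are made up for the example:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["The quick brown fox", "The lazy dog and the quick dog"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # one row per document, one column per vocabulary word
print(vectorizer.get_feature_names_out())   # the vocabulary (use get_feature_names() on older versions)
print(bow.toarray())                        # word counts per document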
Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization:
Step 1: Download the required packages
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text= "Tokenization is the first step in text analytics. The process of breaking down a text paragraph into
smaller chunks such as words or sentences is called Tokenization."
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)
text= "How to remove stop words with NLTK library in Python?" text= re.sub('[^a-zA-Z]', ' ',text)
tokens = word_tokenize(text.lower())
filtered_text=[]
for w in tokens:
if w not in stop_words:
filtered_text.append(w)
print("Tokenized Sentence:",tokens)
print("Filterd Sentence:",filtered_text)
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
Step 5: Create a dictionary of words and their occurrence for each document in the corpus
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1
Step 6: Compute the term frequency for each of our documents.
def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)
Step 7: Compute the term Inverse Document Frequency.
def computeIDF(documents):
    import math
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    # convert document frequencies into IDF scores
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict
idfs = computeIDF([numOfWordsA, numOfWordsB])
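To finish the calculation, each term's TF is multiplied by its IDF. The helper below is a sketch added to complete the pipeline (it is not one of the original numbered steps) and assumes tfA, tfB and idfs from the previous two steps:
import pandas as pd
def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]   # TF-IDF = TF * IDF
    return tfidf
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)
print(pd.DataFrame([tfidfA, tfidfB]))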
Conclusion:
In this way we have done text data analysis using the TF-IDF algorithm.
Assignment Question:
1) Perform Stemming for text = "studies studying cries cry". Compare the results generated with
Lemmatization. Comment on your answer how Stemming and Lemmatization differ from each other.
2) Write Python code for removing stop words from the documents below, convert the documents into
lowercase, and calculate the TF, IDF and TFIDF score for each document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Experiment No. 8
i. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the
passengers who boarded the unfortunate Titanic ship. Use seaborn library to see if we can find any patterns in
the data.
ii. Write a code to check how the price of the ticket (column name: ‘fare’) by each passenger is distributed
by plotting a histogram.
Introduction:
In this assignment we will look at Seaborn, which is another extremely useful library for data visualization in Python. The
Seaborn library is built on top of Matplotlib and offers many advanced data visualization capabilities.
Although the Seaborn library can be used to draw a variety of charts such as matrix plots, grid plots, regression plots,
etc., here we will see how the Seaborn library can be used to draw distributional and categorical plots.
The Dataset
The dataset that we are going to use to draw our plots will be the Titanic dataset, which is downloaded by default with
the Seaborn library. Now we have to use the load_dataset function and pass it the name of the dataset.
Distributional Plots:
Distributional plots, as the name suggests are type of plots that show the statistical distribution of data. In this section
we will see some of the most commonly used distribution plots in Seaborn.
The Dist Plot:
The distplot() shows the histogram distribution of data for a single column. The column name is passed as a parameter
to the distplot()function.
The Joint Plot
The jointplot()is used to display the mutual distribution of each column. You need to pass three parameters to jointplot.
The first parameter is the column name for which you want to display the distribution of data on x-axis. The second
parameter is the column name for which you want to display the distribution of data on y-axis. Finally, the third
parameter is the name of the data frame.
The Pair Plot: The pairplot() is a type of distribution plot that basically plots a joint plot for all the possible
combinations of numeric and Boolean columns in the dataset.
The Rug Plot: The rugplot() is used to draw small bars along x-axis for each point in the dataset. To plot a rug plot, we
need to pass the name of the column.
Categorical Plots: Categorical plots, as the name suggests are normally used to plot categorical data. The categorical
plots plot the values in the categorical column against another categorical column or a numeric column. Let's see some
of the most commonly used categorical data.
The Bar Plot: The barplot() is used to display the mean value for each value in a categorical column, against a numeric
column. The first parameter is the categorical column, the second parameter is the numeric column while the third
parameter is the dataset.
The Count Plot: The count plot is similar to the bar plot, however it displays the count of the categories in a specific
column.
The Box Plot:
The box plot is used to display the distribution of the categorical data in the form of quartiles. The center of the box
shows the median value. The value from the lower whisker to the bottom of the box shows the first quartile. From the
bottom of the box to the middle of the box lies the second quartile. From the middle of the box to the top of the box lies
the third quartile and finally from the top of the box to the top whisker lies the last quartile.
The Violin Plot: The violin plot is similar to the box plot, however, the violin plot allows us to display all the
components that actually correspond to the data point. The violinplot() function is used to plot the violin plot.
The Strip Plot: The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter plots
in the joint plot and the pair plot sections where we had two numeric variables. The strip plot is different in a way that
one of the variables is categorical in this case, and for each category in the categorical variable, we will see scatter
plot with respect to the numeric column.
The Swarm Plot: The swarm plot is a combination of the strip and the violin plots. In the swarm plots, the points are
adjusted in such a way that they don't overlap. Let's plot a swarm plot for the distribution of age against gender.
The swarmplot() function is used to plot the swarm plot.
Combining Swarm and Violin Plots: Swarm plots are not recommended if you have a huge dataset since they do not
scale well because they have to plot each data point. If you really like swarm plots, a better way is to combine two
plots.
Working:-
Downloading the Seaborn Library
The Seaborn library can be installed in a couple of ways. If you are using the pip installer for Python libraries, you can
execute the following command to install it:
pip install seaborn
import pandas as pd
import numpy as np
import seaborn as sns
dataset = sns.load_dataset('titanic')
dataset.head()
The script above loads the Titanic dataset and displays the first five rows of the dataset using the head function. The
output looks like this:
The dataset contains 891 rows and 15 columns and contains information about the passengers who boarded the
unfortunate Titanic ship. The original task is to predict whether or not the passenger survived depending upon different
features such as their age, ticket, cabin they boarded, the class of the ticket, etc. We will use the Seaborn library to see
if we can find any patterns in the data.
. Let's see how the price of the ticket for each passenger is distributed. Execute the following script:
The Dist Plot:
sns.distplot(dataset['fare'])
Output:
We can see that most of the tickets were sold for between 0 and 50 dollars. The line that we see represents the kernel
density estimation. We can remove this line by passing False for the kde parameter, as shown below:
sns.distplot(dataset['fare'], kde=False)
Output:
Now you can see there is no line for the kernel density estimation on the plot. We can also pass the value for
the bins parameter in order to see more or less details in the graph. Take a look at the following script:
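The script referred to here is not preserved in the extracted text; presumably it passes an explicit bins value, for example (bins=10 is an assumed value):
sns.distplot(dataset['fare'], kde=False, bins=10)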
We can clearly see that for more than 700 passengers, the ticket price is between 0 and 50.
The Joint Plot
Let's plot a joint plot of age and fare columns to see if we can find any relationship between the two.
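The plotting call is not preserved here; a minimal sketch of the usual call:
sns.jointplot(x='age', y='fare', data=dataset)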
From the output, we can see that a joint plot has three parts: a distribution plot at the top for the column on the x-axis, a
distribution plot on the right for the column on the y-axis, and a scatter plot in between that shows the mutual
distribution of data for both columns. We can see that there is no clear correlation between the ages and the fares.
We can change the type of the joint plot by passing a value for the kind parameter. For instance, if instead of scatter
plot, we want to display the distribution of data in the form of a hexagonal plot, we can pass the value hex for
the kind parameter. Look at the following script:
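The script is not preserved here; presumably it is the same joint plot with kind='hex':
sns.jointplot(x='age', y='fare', data=dataset, kind='hex')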
In the hexagonal plot, the hexagon with most number of points gets darker color. So if we look at the above plot, we
can see that most of the passengers are between age 20 and 30 and most of them paid between 10-50 for the tickets.
sns.pairplot(dataset)
A snapshot of the portion of the output is shown below:
Note: Before executing the script above, remove all null values from the dataset using the following command: dataset
= dataset.dropna() From the output of the pair plot you can see the joint plots for all the numeric and Boolean columns
in the Titanic dataset.
To add information from the categorical column to the pair plot, you can pass the name of the categorical column to
the hue parameter. For instance, if we want to plot the gender information on the pair plot, we can execute the
following script:
sns.pairplot(dataset, hue='sex')
Output:
In the output, we can see the information about the males in orange and the information about the female in blue (as
shown in the legend). From the joint plot on the top left, we can clearly see that among the surviving passengers, the
majority were female.
Let's plot a rug plot for fare.
sns.rugplot(dataset['fare'])
Output:
From the output, we can see that as was the case with the distplot(), most of the instances for the fares have values
between 0 and 100. These are some of the most commonly used distribution plots offered by the Python's Seaborn
Library. Let's see some of categorical plots in the Seaborn library.
The Bar Plot:
For instance, if you want to know the mean value of the age of the male and female passengers, you can use the bar
plot as follows.
sns.barplot(x='sex', y='age', data=dataset)
Output:
From the output, we can clearly see that the average age of male passengers is just less than 40 while the average age of
female passengers is around 33. In addition to finding the average, the bar plot can also be used to calculate other
aggregate values for each category. To do so, we need to pass the aggregate function to the estimator. For instance, we
can calculate the standard deviation for the age of each gender as follows:
import numpy as np
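The script that presumably follows passes NumPy's std function as the estimator; a sketch:
sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)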
The Count Plot: For instance, if we want to count the number of male and female passengers, we can do so using a count
plot as follows:
sns.countplot(x='sex', data=dataset)
The output shows the count as follows:
Output:
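The Box Plot:
The box-plot call itself is not preserved in the extracted text; presumably it is the basic call below, plotting the distribution of age for each gender:
sns.boxplot(x='sex', y='age', data=dataset)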
Now try to understand the box plot for females. The first quartile starts at around 5 and ends at around 22, which means that 25%
of the passengers are aged between 5 and 22. The second quartile starts at around 23 and ends at around 32, which
means that 25% of the passengers are aged between 23 and 32. Similarly, the third quartile starts and ends between 34
and 42, hence 25% of passengers are aged within this range, and finally the fourth or last quartile starts at 43 and ends
around 65. Passengers whose ages fall outside this range are treated as outliers and are represented by dots on the box plot.
We can make our box plots fancier by adding another layer of distribution. For instance, if we want to see the box
plots of the age of passengers of both genders, along with the information about whether or not they survived, we can
pass 'survived' as the value for the hue parameter as shown below:
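A sketch of the presumable call:
sns.boxplot(x='sex', y='age', data=dataset, hue='survived')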
Now in addition to the information about the age of each gender, we can also see the distribution of the passengers who
survived. For instance, we can see that among the male passengers, on average more younger people survived as
compared to the older ones. Similarly, we can see that the variation among the age of female passengers who did not
survive is much greater than the age of the surviving female passengers.
The Violin Plot:-
Like the box plot, the first parameter is the categorical column; the second parameter is the numeric column while the
third parameter is the dataset. Now, plot a violin plot that displays the distribution for the age with respect to each
gender.
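The call is not preserved here; presumably:
sns.violinplot(x='sex', y='age', data=dataset)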
We can see from the figure above that violin plots provide much more information about the data as compared to the
box plot. Instead of plotting the quartile, the violin plot allows us to see all the components that actually correspond to
the data. The area where the violin plot is thicker has a higher number of instances for the age. For instance, from the
violin plot for males, it is clearly evident that the number of passengers with age between 20 and 40 is higher than all
the rest of the age brackets. Like box plots, we can also add another categorical variable to the violin plot using
the hue parameter as shown below:
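A sketch of the presumable call:
sns.violinplot(x='sex', y='age', data=dataset, hue='survived')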
Now we can see a lot of information on the violin plot. For instance, if we look at the bottom of the violin plot for the
males who survived (left-orange), you can see that it is thicker than the bottom of the violin plot for the males who
didn't survive (left-blue). This means that the number of young male passengers who survived is greater than the
number of young male passengers who did not survive. The violin plots convey a lot of information, however, on the
downside, it takes a bit of time and effort to understand the violin plots.
Instead of plotting two different graphs for the passengers who survived and those who did not, you can have one violin
plot divided into two halves, where one half represents surviving while the other half represents the non-surviving
passengers. To do so, we need to pass True as value for the split parameter of the violinplot() function. Let's see how
we can do this:
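A sketch of the presumable call with the split parameter:
sns.violinplot(x='sex', y='age', data=dataset, hue='survived', split=True)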
Now we can clearly see the comparison between the age of the passengers who survived and who did not for both
males and females. Both violin and box plots can be extremely useful. However, as a rule of thumb if we are presenting
our data to a non-technical audience, box plots should be preferred since they are easy to comprehend. On the other
hand, if we are presenting our results to the research community it is more convenient to use violin plot to save space
and to convey more information in less time.
The Strip Plot:
The stripplot() function is used to plot the strip plot. Like the box plot, the first parameter is the categorical column,
the second parameter is the numeric column, while the third parameter is the dataset. Look at the following script:
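The script is not preserved here; presumably:
sns.stripplot(x='sex', y='age', data=dataset)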
We can see the scattered plots of age for both males and females. The data points look like strips. It is difficult to
comprehend the distribution of data in this form. To better comprehend the data, pass True for the jitter parameter
which adds some random noise to the data. Look at the following script:
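Presumably the same call with jitter enabled:
sns.stripplot(x='sex', y='age', data=dataset, jitter=True)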
Now you have a better view for the distribution of age across the genders. Like violin and box plots, we can add an
additional categorical column to strip plot using hue parameter as shown below:
Again we can see there are more points for the males who survived near the bottom of the plot compared to those who
did not survive. Like violin plots, we can also split the strip plots. Execute the following script:
Now we can clearly see the difference in the distribution for the age of both male and female passengers who survived
and those who did not survive.
The Swarm Plot:
The swarmplot() function is used to plot the swarm plot. Like the box plot, the first parameter is the categorical column,
the second parameter is the numeric column, while the third parameter is the dataset. Look at the following script:
We can clearly see that the above plot contains scattered data points like the strip plot and the data points are not
overlapping. Rather they are arranged to give a view similar to that of a violin plot. Now add another categorical
column to the swarm plot using the hue parameter.
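A sketch of the presumable call:
sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')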
From the output, it is evident that the ratio of surviving males is less than the ratio of surviving females. Since for the
male plot, there are more blue points and less orange points. On the other hand, for females, there are more orange
points (surviving) than the blue points (not surviving). Another observation is that amongst males of age less than 10,
more passengers survived as compared to those who didn't. We can also split swarm plots as we did in the case of strip
and box plots. Execute the following script to do so:
Output:
Now we can clearly see that more women survived, as compared to men.
Combining Swarm and Violin Plots:
For instance, to combine a violin plot with swarm plot, you need to execute the following script:
sns.violinplot(x='sex', y='age', data=dataset)
sns.swarmplot(x='sex', y='age', data=dataset, color='black')
Output:
There are also a lot of other visualization libraries for Python that have features that go beyond what Seaborn can do.
Conclusion: Seaborn is an advanced data visualization library built on top of the Matplotlib library. In this assignment, we
looked at how we can draw distributional and categorical plots using the Seaborn library. Seaborn also provides grid
functionalities, matrix plots, and regression plots, which are beyond the scope of this assignment.
Assignment Questions
1. List out different types of plot to find patterns of data
2. Explain when you will use distribution plots and when you will use categorical plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset)
Assignment No: 9
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with
respect to each gender along with the information about whether they survived or not. (Column names : 'sex' and
'age')
2. Write observations on the inference from the above statistics.
-----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data visualization operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
2. Seaborn library, concept of data visualization.
Theory:
The theory required for this assignment (loading Seaborn's inbuilt Titanic dataset, distributional plots such as the dist plot, joint plot, pair plot and rug plot, and categorical plots such as the bar plot, count plot, box plot, violin plot, strip plot and swarm plot) is the same as that described in Experiment No. 8 above. Refer to that assignment for the detailed discussion and the corresponding scripts.
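For the task itself, a minimal sketch of the required plot (assuming Seaborn's inbuilt 'titanic' dataset as in Experiment No. 8):
import seaborn as sns
import matplotlib.pyplot as plt
dataset = sns.load_dataset('titanic')
# box plot of age for each gender, split by survival status
sns.boxplot(x='sex', y='age', data=dataset, hue='survived')
plt.show()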
Conclusion:
Seaborn is an advanced data visualization library built on top of the Matplotlib library. In this assignment, we plotted a box plot of the distribution of age for each gender, using the hue parameter with the 'survived' column to show whether the passengers survived, and we wrote observations based on the resulting distributions.
Assignment Questions
1. Write down the code to use the inbuilt dataset ‘titanic’ using the seaborn library.
2. Write code to plot a box plot for distribution of age with respect to each gender along with the information about whether they survived or not.
3. Write the observations from the box plot.
Assignment No: 10
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
-----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data visualization operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of Python Programming
2. Seaborn library, concept of data visualization.
3. Types of variables
--------------------------------------------------------------------------------------------------------------
3. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are similar to
table rows, but the columns can contain not only strings or numbers, but also nested data
structures such as lists, maps, and other records.
Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types, but
typically they are reduced to real or categorical values when working with traditional machine
learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to
train the model. It may be called the validation dataset.
Pandas dtype: object | Python type: str or mixed | NumPy type: string_, unicode_, mixed types | Usage: text or mixed numeric and non-numeric values
Pandas dtype: int64 | Python type: int | NumPy type: int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Usage: integer numbers
One of the most fundamental packages in Python, NumPy is a general-purpose array- processing
package. It provides high-performance multidimensional array objects and tools to work with the
arrays. NumPy is an efficient container of generic multi- dimensional data.
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create stories with
the data visualized with Matplotlib. Another library from the SciPy Stack, Matplotlib plots 2D
figures.
From histograms, bar plots, scatter plots, and area plots to pie plots, Matplotlib can depict a wide range of
visualizations. With a bit of effort, you can create just about any visualization with Matplotlib:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also supports labels, grids, legends, and other formatting entities.
d. Seaborn
So when you read the official documentation on Seaborn, it is defined as the data visualization
library based on Matplotlib that provides a high-level interface for drawing attractive and
informative statistical graphics. Putting it simply, seaborn is an extension of Matplotlib with
advanced features.
Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust machine
learning library for Python. It features ML algorithms like SVMs, random forests, k-means
clustering, spectral clustering, mean shift, cross-validation and more... Even NumPy, SciPy and
related scientific operations are supported by Scikit Learn with Scikit Learn being a part of the
SciPy Stack.
5. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in
Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One
flower species is linearly separable from the other two, but the other two are not linearly separable
from each other.
Description of Dataset-
10. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
11. read in the dataset from the UCI Machine Learning Repository link and specify column
names to use
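The read step itself is not preserved in the extracted text; a minimal sketch, assuming the standard location of iris.data in the UCI repository and the col_names list defined above:
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
dataset = pd.read_csv(url, names=col_names)   # col_names as defined above
dataset.head()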
column=len(list(dataset))
column
axes[1,0].hist(dataset["col4"]);
axes[1,1].set_title("Distribution of first column")
axes[1,1].hist(dataset["col5"]);
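The subplot fragment above is incomplete in the extracted text; a self-contained sketch of the histograms and box plots required by the assignment (column names follow col_names defined earlier):
import seaborn as sns
import matplotlib.pyplot as plt
# histogram for each numeric feature
dataset.hist(figsize=(10, 8))
plt.show()
# box plot for each numeric feature
sns.boxplot(data=dataset[['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']])
plt.show()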
Conclusion: Thus we have studied and performed the data visualization operation using Python on an
open source dataset (Iris).
Assignment Questions
1. For the iris dataset, list down the features and their types.
2. Write a code to create a histogram for each feature. (iris dataset)
3. Write a code to create a boxplot for each feature. (iris dataset)
4. Identify the outliers from the boxplot drawn for iris dataset.
Group B
Assignment No- 1B
Title: Write a code in JAVA for a simple WordCount application that counts the number of occurrences
of each word in a given input set using the Hadoop MapReduce framework on local-standalone set-up
Pre-requisite
Theory:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of
data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in
a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the
map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then
input to the reduce tasks. Typically both the input and the output of the job are stored in a file- system.
The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
The WordCount example reads text files and counts how often words occur. The input is text files and the
output is text files, each line of which contains a word and the count of how often it occurred, separated by
a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and
a count of 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of
data sent across the network by combining the counts for each word into a single record.
Steps to execute:
1. Create a text file on your local machine and write some text into it.
$ nano data.txt
2. In this example, we find out the frequency of each word that exists in this text file.
WordCount.java
package WordCount;
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
// WordCount Class
public class WordCount {
    // Mapper Class: emits (word, 1) for every token in the input line
    public static class MyMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text mykey = new Text();
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                mykey.set(tokenizer.nextToken());
                output.collect(mykey, one);
            }
        }
    }
    // MyReduce Class: sums the counts emitted for each word
    public static class MyReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
    // Main Method: configures and submits the job
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCountProgram");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MyMap.class);
        conf.setCombinerClass(MyReduce.class);   // reducer reused as a combiner, as described above
        conf.setReducerClass(MyReduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input path on HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output path on HDFS
        JobClient.runJob(conf);
    }
}
Step 2: Create Jar File
In Eclipse:
Right click on the project name, then click on Export, select Java and then JAR file, and save the jar in your file system.
Right Click -> Project Name -> Export -> Java -> Jar File -> browse file-path where you want to save
Step 3: Start the Hadoop daemons and run the job
aniruddha@aniruddha-VirtualBox:~$ su hdoop
Password:
hdoop@aniruddha-VirtualBox:/home/aniruddha$ cd $HADOOP_HOME/sbin
hdoop@aniruddha-VirtualBox:~/hadoop-3.2.2/sbin$ start-all.sh
CTRL-C to abort.
Starting datanodes
Starting resourcemanager
Starting nodemanagers
hdoop@aniruddha-VirtualBox:~/hadoop-3.2.2/sbin$ cd
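The HDFS preparation and job-submission commands are not fully preserved in this transcript; a typical sequence is sketched below (the jar name, main-class name, and input path are assumptions based on the code above and the output listing that follows):
hdfs dfs -mkdir /wc_input
hdfs dfs -put ~/data.txt /wc_input
hadoop jar ~/WordCount.jar WordCount.WordCount /wc_input /wc_output
hdfs dfs -cat /wc_output/part-00000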
Found 1 items
/wc_output
2022-05-08 11:57:35,258 INFO mapreduce.Job: The url to track the job: http://aniruddha
VirtualBox:8088/proxy/application_1651991104225_0001/
2022-05-08 11:57:35,263 INFO mapreduce.Job: Running job: job_1651991104225_0001
2022-05-08 11:57:46,889 INFO mapreduce.Job: Job job_1651991104225_0001 running in uber mode : false
2022-05-08 11:57:46,891 INFO mapreduce.Job: map 0% reduce 0%
Job Counters
spent by all reduces in occupied slots (ms)=3111 Total time spent by all
Spilled Records=14
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Bytes Read=77
Bytes Written=47
Found 3 items
Found 2 items
Akshay 1
Aniruddha 1 Ankit 1
Shubham 2 Suraj 2
Assignment No- 2B
Aim:
Write a code in JAVA to design a distributed application using MapReduce which processes a log file of a
system. List out the users who have logged in for the maximum period on the system. Use a simple log file from the
Internet and process it using pseudo-distributed mode on the Hadoop platform.
Objective:
Theory:
Map and Reduce tasks in Hadoop: Within a MapReduce job there are two separate tasks, the map task and the
reduce task.
Map task: A MapReduce job splits the input dataset into independent chunks known as input splits in
Hadoop, which are processed by the map tasks in a completely parallel manner. The Hadoop framework
creates a separate map task for each input split.
Reduce task- The output of the maps is sorted by the Hadoop framework which then becomes input
to the reduce tasks.
The Hadoop MapReduce framework operates exclusively on <key, value> pairs. In a MapReduce job, the
input to the Map function is a set of <key, value> pairs and the output is also a set of
<key, value> pairs. The output <key, value> pair may have a different type from the input
<key, value> pair.
The output from the map tasks is sorted by the Hadoop framework. MapReduce guarantees that the
input to every reducer is sorted by key. Input and output of the reduce task can be represented as
follows.
1. Input Data
Input to MapReduce comes from HDFS, where log files are stored on the processing cluster. By
dividing log files into small blocks we can distribute them over the nodes of a Hadoop cluster. The format of
input files to MapReduce is arbitrary, but it is line-based for log files, as each line is considered as one
record, i.e. one log entry.
2. MapReduce Algorithm
MapReduce is a simple programming model which is easily scalable over multiple nodes in a Hadoop
cluster. A MapReduce job is written in Java, consisting of Map and Reduce functions. MapReduce takes the log
file as input and feeds each record in the log file to the Mapper. The Mapper processes all the records in the
log file, and the Reducer processes all the outputs from the Mapper and gives the final reduced results.
Map Function: Input to the map method is an InputSplit of the log file. It produces intermediate results in (key,
value) pairs. For each occurrence of a key it emits a (key, '1') pair; if there are n occurrences of a key, it
produces n (key, '1') pairs. OutputCollector is the utility provided by the MapReduce framework to collect
output from the mapper and reducer, and Reporter is used to report the progress of the application.
map(LongWritable key, Text value, OutputCollector output, Reporter reporter) {
    for each key in the value:
        EmitIntermediate(key, '1');
}
Reduce Function: Input to the reduce method is (key, values) pairs. It sums together all the counts emitted by the map
method. If the input to the reduce method is (key, (1,1,1,...n times)), then it aggregates all the values for that
key, producing the output pair (key, n). OutputCollector and Reporter work in the same way as in the map method.
reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) {
    int sum = 0;
    for each v in values:
        sum += ParseInt(v);
    output.collect(key, sum);
}
Pig queries are written in the Pig Latin language. Pig Latin statements are generally organized in the following manner:
A LOAD statement reads data from the Hadoop file system.
A series of "transformation" statements process the data.
A STORE statement writes output to the Hadoop file system.
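As a minimal sketch of this structure, a Pig Latin script that counts requests per IP address in an access log might look like the following (the file path, delimiter and field name are only illustrative):
A = LOAD '/demo/access_log.txt' USING PigStorage(' ') AS (ip:chararray); -- LOAD: read from HDFS
B = GROUP A BY ip;                                                       -- transformation
C = FOREACH B GENERATE group, COUNT(A);                                  -- transformation
STORE C INTO '/pig_output';                                              -- STORE: write back to HDFS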
Conclusion:
In this assignment, we have learned what HDFS is and how the Hadoop MapReduce framework is used to process a system log file.
package logAnalysis;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// LogAnalysis Class: driver, mapper and reducer for the log-analysis job
public class LogAnalysis {

    // Mapper Class: emits (user/IP, 1) for every line of the log file
    public static class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Assumption: the user/IP address is the first whitespace-separated field of each log line
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            if (tokenizer.hasMoreTokens()) {
                output.collect(new Text(tokenizer.nextToken()), one);
            }
        }
    }

    // Reducer Class: sums all the 1s emitted for the same user/IP
    public static class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int frequencyForCountry = 0;
            while (values.hasNext()) {
                IntWritable value = values.next();
                frequencyForCountry += value.get();
            }
            output.collect(key, new IntWritable(frequencyForCountry));
        }
    }

    // Main Method: configures and submits the job
    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        JobConf job_conf = new JobConf(LogAnalysis.class);

        job_conf.setJobName("SalePerCountry");
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);
        job_conf.setMapperClass(LogAnalysis.SalesMapper.class);
        job_conf.setReducerClass(LogAnalysis.SalesCountryReducer.class);
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // args[0] = name of input directory on HDFS, and args[1] = name of output directory
        // to be created to store the output file.
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
In Eclipse:
Right-click on the project name, click Export, then under Java select JAR file and save the jar in your file system.
Right Click -> Project Name -> Export -> Java -> JAR File -> browse the file path where you want to save it
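Once the jar is exported, the job can be run on the pseudo-distributed cluster roughly as follows. The jar name and output directory below are illustrative; only /demo/access_log.txt corresponds to the run shown next.
hdfs dfs -mkdir /demo
hdfs dfs -put access_log.txt /demo
hadoop jar logAnalysis.jar logAnalysis.LogAnalysis /demo/access_log.txt /log_output
hdfs dfs -cat /log_output/part-00000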
aniruddha@aniruddha-VirtualBox:~$ su hdoop
Password:
hdoop@aniruddha-VirtualBox:/home/aniruddha$ cd $HADOOP_HOME/sbin
hdoop@aniruddha-VirtualBox:~/hadoop-3.2.2/sbin$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as hdoop in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting datanodes
Starting resourcemanager
Starting nodemanagers
hdoop@aniruddha-VirtualBox:~/hadoop-3.2.2/sbin$ cd
Found 2 items
-rw-r--r-- 1 hdoop supergroup 158551 2022-05-08 12:09 /demo/access_log.txt
-rw-r--r-- 1 hdoop
2022-05-08 12:11:25,033 INFO mapreduce.Job: The url to track the job: http://aniruddha-VirtualBox:8088/proxy/application_1651991934281_0001/
2022-05-08 12:11:25,034 INFO mapreduce.Job: Running job: job_1651991934281_0001
2022-05-08 12:11:34,667 INFO mapreduce.Job: Job job_1651991934281_0001 running in uber mode : false
2022-05-08 12:11:34,668 INFO mapreduce.Job: map 0% reduce 0%
Job Counters
Total time spent by all maps in occupied slots (ms)=5559
Total time spent by all reduces in occupied slots (ms)=3146
Total time spent by all map tasks (ms)=5559
Spilled Records=2588
Shuffled Maps =2
Failed Shuffles=0
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Bytes Read=162647
Bytes Written=3838
Found 4 items
drwxr-xr-x - hdoop supergroup 0 2022-05-08 12:09 /demo
drwxr-xr-x - hdoop
Found 2 items
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53
10.103.63.29 1
10.104.73.51 1
10.105.160.183 1
10.108.91.151 1
10.109.21.76 1
10.11.131.40 1
10.111.71.20 8
10.112.227.184 6
10.114.74.30 1
10.115.118.78 1
10.117.224.230 1
10.117.76.22 12
10.118.19.97 1
10.118.250.30 7
10.119.117.132 23
10.119.33.245 1
10.119.74.120 1
10.12.113.198 2
10.12.219.30 1
10.120.165.113 1
10.120.207.127 4
10.123.124.47 1
10.123.35.235 1
10.124.148.99 1
10.124.155.234 1
10.126.161.13 7
10.127.162.239 1
10.128.11.75 10
10.13.42.232 1
10.130.195.163 8
10.130.70.80 1
10.131.163.73 1
10.131.209.116 5
10.132.19.125 2
10.133.222.184 12
10.134.110.196 13
10.134.242.87 1
10.136.84.60 5
10.14.2.86 8
10.14.4.151 2
10.140.139.116 1
10.140.141.1 9
10.140.67.116 1
10.141.221.57 5
10.142.203.173 7
10.143.126.177 32
10.144.147.8 1
10.15.208.56 1
10.15.23.44 13
10.150.212.239 14
10.150.227.16 1
10.150.24.40 13
10.152.195.138 8
10.153.23.63 2
10.153.239.5 25
10.155.95.124 9
10.156.152.9 1
10.157.176.158 1
10.164.130.155 1
10.164.49.105 8
10.164.95.122 10
10.165.106.173 14
10.167.1.145 19
10.169.158.88 1
10.170.178.53 1
10.171.104.4 1
10.172.169.53 18
10.174.246.84 3
10.175.149.65 1
10.175.204.125 15
10.177.216.164 6
10.179.107.170 2
10.181.38.207 13
10.181.87.221 1
10.185.152.140 1
10.186.56.126 16
10.186.56.183 1
10.187.129.140 6
10.187.177.220 1
10.187.212.83 1
10.187.28.68 1
10.19.226.186 2
10.190.174.142 10
10.190.41.42 5
10.191.172.11 1
10.193.116.91 1
10.194.174.4 7
10.198.138.192 1
10.199.103.248 2
10.199.189.15 1
10.2.202.135 1
10.200.184.212 1
10.200.237.222 1
10.200.9.128 2
10.203.194.139 10
10.205.72.238 2
10.206.108.96 2
10.206.175.236 1
10.206.73.206 7
10.207.190.45 17
10.208.38.46 1
10.208.49.216 4
10.209.18.39 9
10.209.54.187 3
10.211.47.159 10
10.212.122.173 1
10.213.181.38 7
10.214.35.48 1
10.215.222.114 1
10.216.113.172 48
10.216.134.214 1
10.216.227.195 16
10.217.151.145 10
10.217.32.16 1
10.218.16.176 8
10.22.108.103 4
10.220.112.1 34
10.221.40.89 5
10.221.62.23 13
10.222.246.34 1
10.223.157.186 10
10.225.137.152 1
10.225.234.46 1
10.226.130.133 1
10.229.60.23 1
10.230.191.135 6
10.231.55.231 1
10.234.15.156 1
10.236.231.63 1
10.238.230.235 1
10.239.100.52 1
10.239.52.68 4
10.24.150.4 5
10.24.67.131 13
10.240.144.183 15
10.240.170.50 1
10.241.107.75 1
10.241.9.187 1
10.243.51.109 5
10.244.166.195 5
10.245.208.15 20
10.246.151.162 3
10.247.111.104 9
10.247.175.65 1
10.247.229.13 1
10.248.24.219 1
10.248.36.117 3
10.249.130.132 3
10.25.132.238 2
10.25.44.247 6
10.250.166.232 1
10.27.134.23 1
10.30.164.32 1
10.30.47.170 8
10.31.225.14 7
10.32.138.48 11
10.32.247.175 4
10.32.55.216 12
10.33.181.9 8
10.34.233.107 1
10.36.200.176 1
10.39.45.70 2
10.39.94.109 4
10.4.59.153 1
10.4.79.47 15
10.41.170.233 9
10.41.40.17 1
10.42.208.60 1
10.43.81.13 1
10.46.190.95 10
10.48.81.158 5
10.5.132.217 1
10.5.148.29 1
10.50.226.223 9
10.50.41.216 3
10.52.161.126 1
10.53.58.58 1
10.54.242.54 10
10.54.49.229 1
10.56.48.40 16
10.59.42.194 11
10.6.238.124 6
10.61.147.24 1
10.61.161.218 1
10.61.23.77 8
10.61.232.147 3
10.62.78.165 2
10.63.233.249 7
10.64.224.191 13
10.66.208.82 2
10.69.20.85 26
10.70.105.238 1
10.70.238.46 6
10.72.137.86 6
10.72.208.27 1
10.73.134.9 4
10.73.238.200 1
10.73.60.200 1
10.73.64.91 1
10.74.218.123 1
10.75.116.199 1
10.76.143.30 1
10.76.68.178 16
10.78.95.24 8
10.80.10.131 10
10.80.215.116 17
10.81.134.180 1
10.82.30.199 63
10.82.64.235 1
10.84.236.242 1
10.87.209.46 1
10.87.88.214 1
10.88.204.177 1
10.89.178.62 1
10.89.244.42 1
10.94.196.42 1
10.95.136.211 4
10.95.232.88 1
10.98.156.141 1
10.99.228.224 1
Assignment No – 3B
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to locate a dataset (e.g., sample_weather.txt) for working on weather data, read the text input files, and find the average temperature, dew point and wind speed.
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basics of the Eclipse IDE
--------------------------------------------------------------------------------------------------------------
Eclipse
Developed using Java, the Eclipse platform can be used to develop rich client applications, integrated development
environments and other tools. Eclipse can be used as an IDE for any programming language for which a plug-in is
available.
Eclipse is an IDE (Integrated Development Environment) in which mainly Java-based programming is done, but plug-ins are available for many other languages such as Python, C/C++ and Ruby, and additional plug-ins can be installed on the platform to build advanced client applications. The JDT (Java Development Tools) is used for Java programming in the Eclipse IDE.
Scanner Class
The Scanner class is used to get user input, and it is found in the java.util package. To use the Scanner class, create an object of the class and use any of the available methods found in the Scanner class documentation.
There is a delimiter pattern, which, by default, matches white space. Then, using different types of next()
methods, we can convert the resulting tokens.
hasNext()
The hasNext() method checks if the Scanner has another token in its input. A Scanner breaks its input into
tokens using a delimiter pattern, which matches whitespace by default. That is, hasNext() checks the input
and returns true if it has another non-whitespace character.
import java.io.*;
import java.util.Scanner;
public class ReadCSVExample1
{
public static void main(String[] args) throws Exception
{
//parsing a CSV file into Scanner class constructor
Scanner sc = new Scanner(new File("F:\\CSVDemo.csv"));
sc.useDelimiter(","); //sets the delimiter pattern
while (sc.hasNext()) //returns a boolean value
{
System.out.print(sc.next()); //find and returns the next complete token from this scanner
}
sc.close(); //closes the scanner
}
}
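Building on the Scanner example above, a minimal sketch of the average calculation for this assignment could look like the following. The file layout here is an assumption: the comma delimiter and the column positions of temperature, dew point and wind speed (indices 3, 4 and 5) must be adjusted to match the actual sample_weather.txt.
import java.io.File;
import java.util.Scanner;
public class WeatherAverage
{
    public static void main(String[] args) throws Exception
    {
        // Assumption: each line of sample_weather.txt is comma-separated and the
        // temperature, dew point and wind speed are stored in columns 3, 4 and 5 (0-based).
        Scanner sc = new Scanner(new File("sample_weather.txt"));
        double tempSum = 0, dewSum = 0, windSum = 0;
        int count = 0;
        while (sc.hasNextLine())
        {
            String[] fields = sc.nextLine().split(",");
            tempSum += Double.parseDouble(fields[3].trim());
            dewSum += Double.parseDouble(fields[4].trim());
            windSum += Double.parseDouble(fields[5].trim());
            count++;
        }
        sc.close(); // closes the scanner
        if (count > 0)
        {
            System.out.println("Average temperature: " + tempSum / count);
            System.out.println("Average dew point: " + dewSum / count);
            System.out.println("Average wind speed: " + windSum / count);
        }
    }
}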
Conclusion:
Thus, I have studied how to locate a dataset and implemented a program that calculates the average temperature, dew point and wind speed.
Assignment Questions
1. Write down the steps to install and set up the Eclipse IDE.