DS&BD Lab Manual
University Faculty of
Computer Engineering
MISSION
b. An ability to define a problem and provide a systematic solution by conducting experiments and by analyzing and interpreting the data.
e. An ability to use the techniques, skills, modern engineering tools, and standard processes necessary for practice as an IT professional.
h. An ability to understand professional, ethical, legal, security and social issues and
responsibilities.
Prerequisite Courses:
• Discrete Mathematics (210241)
• Database Management Systems (310341)
Course Objectives:
1. To understand the need for Data Science and Big Data
2. To understand computational statistics in Data Science
3. To analyze and demonstrate knowledge of statistical data analysis techniques
for decision-making
4. To gain practical, hands-on experience with statistics programming languages
and big data tools
2. Data Wrangling II
Perform the following operations using Python on any open source dataset (e.g. data.csv):
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
3. Basic Statistics - Measures of Central Tendency and Variance
Perform the following operations on any open source dataset (e.g. data.csv):
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for a dataset (age, income etc.) with numeric variables grouped by one of the qualitative (categorical) variables.
4. Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
5. Data Analytics II
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.
In addition to the code and outputs, explain every operation that you perform in the above steps, including everything that you do to import/read/scrape the data set.
Theory:-
Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are similar to table
rows, but the columns can contain not only strings or numbers, but also nested data structures such as
lists, maps, and other records.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute.
Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or
ordinal value. You can have strings, dates, times, and more complex types, but typically they are
reduced to real or categorical values when working with traditional machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning methods we
typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to train the
model. It may be called the validation dataset.
Data should be arranged in a two-dimensional space made up of rows and columns. This type of data
structure makes it easy to understand the data and pinpoint any problems. An example of some raw data
stored as a CSV (comma separated values).
A data type is essentially an internal construct that a programming language uses to understand how to
store and manipulate data.
A possible confusing point about pandas data types is that there is some overlap between pandas, python
and numpy. This table summarizes the key points
Pandas dtype    Python type     NumPy type                                                        Usage
object          str or mixed    string_, unicode_, mixed types                                    Text or mixed numeric and non-numeric values
int64           int             int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64    Integer numbers
float64         float           float_, float16, float32, float64                                 Floating point numbers
bool            bool            bool_                                                             True/False values
datetime64      NA              datetime64[ns]                                                    Date and time values
timedelta[ns]   NA              NA                                                                Differences between two datetimes
category        NA              NA                                                                Finite list of text values
a) Pandas
Pandas is an open-source Python package that provides high-performance, easy-to-use data structures and data analysis tools for labeled data in the Python programming language.
b) NumPy
One of the most fundamental packages in Python, NumPy is a general-purpose array-processing package. It provides high-performance multidimensional array objects and tools to work with the arrays. NumPy is an efficient container of generic multidimensional data. NumPy's main object is the homogeneous multidimensional array. It is a table of elements or numbers of the same datatype, indexed by a tuple of positive integers. In NumPy, dimensions are called axes and the number of axes is called the rank. NumPy's array class is called ndarray, aka array.
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy
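The basic operations listed above can be illustrated in a few lines of NumPy; the following is a minimal sketch (the array values are arbitrary examples, not taken from the manual):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D ndarray with 2 rows and 3 columns
print(a + 10)           # element-wise addition (broadcasting a scalar)
print(a * 2)            # element-wise multiplication
print(a[0, 1:3])        # basic slicing: first row, columns 1 and 2
print(a.reshape(3, 2))  # reshape to 3 rows and 2 columns
print(a.flatten())      # flatten to a 1-D array
b = np.array([[7, 8, 9]])
print(np.vstack([a, b]))  # stack arrays vertically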
c) Matplotlib
Matplotlib is a quintessential Python library; you can create stories with the data visualized through it. Another library from the SciPy Stack, Matplotlib plots 2D figures. From histograms, bar plots, and scatter plots to area and pie plots, Matplotlib can depict a wide range of visualizations. With a bit of effort, you can create just about any visualization:
● Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and some more formatting entities.
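As a small illustration of the plots, labels, grids, and legends mentioned above, the following sketch uses arbitrary sample data (not taken from the manual):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]              # sample x values (illustrative only)
y = [1, 4, 9, 16, 25]            # sample y values
plt.plot(x, y, label='y = x^2')  # line plot
plt.scatter(x, y, color='red')   # scatter plot on the same axes
plt.xlabel('x')                  # axis labels
plt.ylabel('y')
plt.title('Simple Matplotlib example')
plt.legend()                     # legend
plt.grid(True)                   # grid
plt.show()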
d) Seaborn
The official documentation defines Seaborn as a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.
e) Scikit Learn
Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust machine learning
library for Python. It features ML algorithms like SVMs, random forests, k-means clustering, spectral
clustering, mean shift, cross-validation and more... Even NumPy, SciPy and related scientific operations
are supported by Scikit Learn with Scikit Learn being a part of the SciPy Stack.
Description of Dataset:
The Iris data set is stored in .csv format ('.csv' stands for comma separated values). It is easy to load .csv files into a Pandas data frame and perform various analytical operations on them.
Load Iris.csv into a Pandas data frame:
Syntax:
1. Store the UCI Machine Learning Repository link (or the path where you saved the file):
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
2. Read the CSV file as a data frame in Python from that link (or path) and specify the column names to use.
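A minimal sketch of this loading step; the column names below are the conventional ones for the raw UCI Iris file and are an assumption, since the file itself has no header row:
import pandas as pd

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# the raw UCI file has no header row, so column names are supplied explicitly (assumed names)
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
dataset = pd.read_csv(csv_url, names=col_names)
print(dataset.head(n=5))   # first n rows
print(dataset.tail(n=5))   # last n rows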
dataset.tail(n=5)
Returns the last n rows.
c. Count of missing values across the entire dataframe using isna() and isnull()
To get the count of missing values of the entire dataframe, the isnull() function is used: the first sum() does the column-wise sum, and a second sum() gives the total count of missing values in the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. Count row-wise missing values using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:
e. Count column-wise missing values using isnull()
Method 1:
Function: dataframe.isnull().sum()
Method 2:
Function: dataframe.isna().sum()
f. Count of missing values of a particular column
Function: df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
To get the count of missing values of a particular column by group in pandas, we use the isnull() and sum() functions together with apply() and groupby(), which performs the group-wise count of missing values as shown below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda x:
x.isnull().sum())
Output:
The Transforming data stage is about converting the data set into a format that can be
analyzed or modelled effectively, and there are several techniques for this process.
a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating point number, integer, etc.) is another part of this initial 'cleaning' process. If you are working with dates in Pandas, they also need to be stored in the correct format in order to use the special date-time functions.
b. Data normalization: Data normalization involves mapping all the numerical data values onto a uniform scale (e.g. from 0 to 1). Making the ranges consistent across variables helps with statistical analysis and ensures better comparisons later on. It is also known as Min-Max scaling.
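A minimal sketch of Min-Max scaling with scikit-learn; the small dataframe below is an illustrative example, not the manual's dataset:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# small illustrative dataframe (values are arbitrary)
df_num = pd.DataFrame({'age': [20, 35, 50, 65], 'income': [20000, 45000, 70000, 120000]})
scaler = MinMaxScaler()   # maps each column onto the [0, 1] range
scaled = pd.DataFrame(scaler.fit_transform(df_num), columns=df_num.columns)
print(scaled)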
Algorithm:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on the iris dataset: For the iris dataset, the target column is Species. It contains three species: Iris-setosa, Iris-versicolor, and Iris-virginica.
● fit_transform(y): Fits the label encoder and returns the encoded labels.
This transformer should be used to encode target values (y), not the input (X).
Algorithm:
df['Species'].unique()
Limitation: Label encoding converts the data into machine-readable form, but it assigns a unique number (starting from 0) to each class of data. This may lead to the generation of priority issues in the data sets: a label with a high value may be considered to have higher priority than a label with a lower value.
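A minimal sketch of label encoding the Species column with scikit-learn, assuming the iris CSV has already been loaded into a dataframe df with a categorical 'Species' column:
from sklearn.preprocessing import LabelEncoder

# assumes df is the iris dataframe with a categorical 'Species' column
label_encoder = LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
print(df['Species'].unique())    # e.g. [0 1 2] after encoding
print(label_encoder.classes_)    # the original category names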
b. One-Hot Encoding:
In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.
In one-hot encoding,
"Red" color is encoded as [1 0 0] vector of size 3.
"Green" color is encoded as [0 1 0] vector of size 3.
"Blue" color is encoded as [0 0 1] vector of size 3.
● sklearn.preprocessing.OneHotEncoder(): Encodes categorical integer features using a one-hot (aka one-of-K) scheme.
Algorithm:
Step 4: Apply the label_encoder object for label encoding, then observe the unique values for the Species column.
df['Species'].unique()
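A minimal sketch of one-hot encoding the Color variable described above with sklearn.preprocessing.OneHotEncoder; the color values are illustrative:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# illustrative Color variable (not from the manual's dataset)
colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()  # one row of k=3 binary columns per observation
print(enc.categories_)                        # the category order used for the columns
print(onehot)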
Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using the dummy encoding, we need to use only two dummy variables.
In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
Dummy encoding removes a duplicate category present in the one-hot encoding.
● Parameters (of pandas.get_dummies):
sparse: Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
drop_first: Whether to get k-1 dummies out of k categorical levels by removing the first level.
Algorithm:
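The steps are not listed here; a minimal sketch using pandas.get_dummies with the drop_first behavior described above (the Color data is illustrative):
import pandas as pd

demo = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
# one-hot encoding: k dummy columns
print(pd.get_dummies(demo['Color']))
# dummy encoding: k-1 dummy columns, the first level is dropped
print(pd.get_dummies(demo['Color'], drop_first=True))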
Conclusion- In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values on Iris
Dataset.
Assignment Question
Assignment No: 2
Theory:-
1. Creation of Dataset using Microsoft Excel.
The dataset is created in “CSV” format.
● The name of dataset is StudentsPerformance
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date .
● Number of Instances: 30
Step 2: Enter the name of the dataset and save the dataset as type CSV (MS-DOS).
Step 3: Fill the data by using the RANDBETWEEN function. For every feature, fill the data by considering the above specified range.
The placement offer count largely depends on the placement score. It is considered that if the placement score is less than 75, 1 offer is facilitated; if it is between 75 and 85, 2 offers are facilitated; and if it is greater than 85, 3 offers are facilitated. A nested IF formula is used for ease of data filling.
Step 4: In 20% of the data, introduce impurities. The range of Math_Score is [60, 80]; update a few instance values to below 60 or above 80. Repeat this for Writing_Score [60, 80], Placement_Score [75, 100], and Club_Join_Date [2018, 2021].
Step 5: To violate the rule of the response variable, update a few values; for example, if the placement score is greater than 85, facilitate only 1 offer.
Many datasets contain fields with missing data, either because the data exists and was not collected or because it never existed. For example, different users being surveyed may choose not to share their income, and some users may choose not to share their address; in this way many values in a dataset go missing.
In Pandas, missing data is represented by two values:
1. None: None is a Python singleton object that is often used for missing data in Python
code.
2. NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
● isnull()
● notnull()
● dropna()
● fillna()
● replace()
1. Checking for missing values using isnull() and notnull()
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 5: Create a boolean series that is True for NaN values of a specific column, for example math score, and display only the rows where math score is NaN.
series = pd.isnull(df["math score"])
df[series]
Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame df
Step 5: Create a boolean series that is True for non-NaN values of a specific column, for example math score, and display only the rows where math score is not NaN.
series1 = pd.notnull(df["math score"])
df[series1]
Note that there are also categorical values in the dataset; to handle these, you need to use Label Encoding or One-Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
newdf=df
df
In order to fill null values in a dataset, the fillna() and replace() functions are used. These functions replace NaN values with some value of their own. All these functions help in filling null values in the datasets of a DataFrame.
Step 5: Fill the missing values using the mean or median of that column, or replace every NaN value in the dataframe with a constant such as -99; a sketch follows below.
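A minimal sketch of these filling operations, assuming df is the StudentsPerformance dataframe loaded as in the algorithm above:
# assumes df has been loaded as in the algorithm above
m_v = df['math score'].mean()                         # column mean
df['math score'] = df['math score'].fillna(m_v)       # fill NaN with the mean
median_v = df['reading score'].median()               # column median
df['reading score'] = df['reading score'].fillna(median_v)
ndf = df.fillna(-99)                                  # replace every remaining NaN with -99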
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: To drop rows with at least 1 null value
df.dropna()
Similarly, an Outlier is an observation in a given dataset that lies far from the rest of
the observations. That means an outlier is vastly larger or smaller than the remaining values in
the set.
Mean is the accurate measure to describe the data when we do not have any outliers
present. Median is used if there is an outlier in the dataset. Mode is used if there is an outlier
AND about ½ or more of the data is the same.
‘Mean’ is the measure of central tendency most affected by outliers, which in turn impacts the standard deviation. Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other values.
From the above calculations, we can clearly say the Mean is more affected than the Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But what
if we have a huge dataset, how do we identify the outliers then? We need to use visualization
and mathematical techniques.
● Boxplots
● Scatterplots
● Z-score
● Inter Quartile Range (IQR)
Step 4:Select the columns for boxplot and draw the boxplot.
df.boxplot(col)
Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score'] > 90))
print(np.where(df['reading score'] < 25))
print(np.where(df['writing score'] < 30))
Step 6: Draw a scatter plot of the placement score against the placement offer count to detect outliers across two variables.
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(df['placement score'], df['placement offer count'])
plt.show()
Labels to the axis can be assigned (Optional)
ax.set_xlabel('(Proportion non-retail business
acres)/(town)')
ax.set_ylabel('(Full-value property-tax rate)/(
$10,000)')
Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score']<50) & (df['placement
offer count']>1)))
print(np.where((df['placement score']>85) & (df['placement
offer count']<3)))
Algorithm:
Step 1: Import numpy and stats from the scipy library.
import numpy as np
from scipy import stats
Step 2: Compute the IQR-based upper and lower bounds for detecting outliers.
upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR
In the above formula, following the usual statistical convention, a 0.5 scale-up of the IQR (new_IQR = IQR + 0.5*IQR = 1.5*IQR) is taken.
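A minimal sketch that computes Q1, Q3, and the IQR bounds for one column, assuming df is the students dataframe used above:
import numpy as np

# assumes df is the students dataframe; Q1 and Q3 are the 25th and 75th percentiles
Q1 = np.percentile(df['math score'], 25)
Q3 = np.percentile(df['math score'], 75)
IQR = Q3 - Q1
upper = Q3 + 1.5 * IQR
lower = Q1 - 1.5 * IQR
print(np.where(df['math score'] > upper))   # indices of upper outliers
print(np.where(df['math score'] < lower))   # indices of lower outliers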
Algorithm:
Step 1 : Import numpy library
import numpy as np
print("New array:",b)
df_stud.insert(1,"m score",b,True)
df_stud
● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for the reading score column.
col = ['reading score']
df.boxplot(col)
2. Compute the upper bound of the reading score column.
3. Compute the median of the reading score column (a sketch for steps 2 and 3 follows below).
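A sketch for steps 2 and 3, assuming the same dataframe df; the names upr_bound and median are then used in step 4:
import numpy as np

# assumes df is the students dataframe
q1 = np.percentile(df['reading score'], 25)
q3 = np.percentile(df['reading score'], 75)
upr_bound = q3 + 1.5 * (q3 - q1)         # upper whisker of the box plot
median = df['reading score'].median()    # replacement value for the outliers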
4. Replace the upper bound outliers using the median value.
refined_df = df
refined_df['reading score'] = np.where(refined_df['reading score'] > upr_bound, median, refined_df['reading score'])
5. Display refined_df using a box plot.
refined_df.boxplot(col)
● Smoothing: It is a process that is used to remove noise from the dataset using some algorithms. It allows for highlighting important features present in the dataset and helps in predicting patterns.
● Aggregation: Data collection or aggregation is the method of storing and presenting data
in a summary format. The data may be obtained from multiple data sources to integrate
these data sources into a data analysis description. This is a crucial step since the
accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used.
● Generalization: It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, Age initially in numerical form (22, 25) can be converted into a categorical value such as (young, old).
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and remove the
outliers.
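The manual does not list the code for this step; a minimal sketch using scipy's z-score, assuming df is the students dataframe and a threshold of 3:
import numpy as np
from scipy import stats

# keep only the rows whose absolute z-score for math score is below 3
z = np.abs(stats.zscore(df['math score']))
new_df = df[z < 3]
print(new_df.shape)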
Step 2: Observe the histogram for the math score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind='hist')
Step 3: Convert the variables to logarithm at the scale 10.
df['log_math'] = np.log10(df['math score'])
Conclusion: In this way we have explored the functions of the Python libraries for identifying and handling outliers. Data transformation techniques were explored with the purpose of creating new variables and reducing the skewness of the dataset.
Assignment Question:
1. Explain the methods to detect the outlier.
2. Explain data transformation methods
3. Write the algorithm to display the statistics of Null values present in the dataset.
4. Write an algorithm to replace the outlier value with the mean of the variable.
Assignment No: 3
Provide the code with outputs and explain everything that you do in this step.
Theory:-
Statistical Inference:
Statistical inference is the process of generating conclusions about a population from a noisy sample.
Without statistical inference we’re simply living within our data. With statistical inference, we’re trying
to generate new knowledge.
Statistical analysis and probability influence our lives on a daily basis. Statistics is used to predict the
weather, restock retail shelves, estimate the condition of the economy, and much more. Used in a variety
of professional fields, statistics has the power to derive valuable insights and solve complex problems in
business, science, and society. Without hard science, decision making relies on emotions and gut
reactions. Statistics and data override intuition, inform decisions, and minimize risk and uncertainty.
In data science, statistics is at the core of sophisticated machine learning algorithms, capturing and
translating data patterns into actionable evidence. Data scientists use statistics to gather, review, analyze,
and draw conclusions from data, as well as apply quantified mathematical models to appropriate
variables.
Data science knowledge is grouped into three main areas: computer science; statistics and mathematics;
and business or field expertise. These areas separately result in a variety of careers, as displayed in the
diagram below. Combining computer science and statistics without business knowledge enables
professionals to perform an array of machine learning functions. Computer science and business
expertise leads to software development skills. Mathematics and statistics (combined with business
expertise) result in some of the most talented researchers. It is only with all three areas combined that
data scientists can maximize their performance, interpret data, recommend innovative solutions, and
create a mechanism to achieve improvements.
Statistical functions are used in data science to analyze raw data, build data models, and infer results.
Below is a list of the key statistical terms:
● Population: the source of data to be collected.
● Sample: a portion of the population.
● Variable: any data item that can be measured or counted.
● Quantitative analysis (statistical): collecting and interpreting data with patterns and data
visualization.
● Qualitative analysis (non-statistical): producing generic information from other non-data forms
of media.
● Descriptive statistics: characteristics of a population.
● Inferential statistics: predictions for a population.
● Central tendency (measures of the center): mean (average of all values), median (central value of
a data set), and mode (the most recurrent value in a data set).
● Measures of Dispersion:
○ Range: the difference between the largest and smallest values in a data set.
○ Variance: the average squared deviation of the values from their mean (expected value).
○ Standard deviation: the square root of the variance; it measures the spread of a data set around the mean.
Statistical techniques for data scientists
There are a number of statistical techniques that data scientists need to master. When just starting out, it
is important to grasp a comprehensive understanding of these principles, as any holes in knowledge will
result in compromised data or false conclusions.
General statistics: The most basic concepts in statistics include bias, variance, mean, median, mode, and
percentiles.
Probability distributions: Probability is defined as the chance that something will occur, characterized as
a simple “yes” or “no” percentage. For instance, when weather reporting indicates a 30 percent chance
of rain, it also means there is a 70 percent chance it will not rain. Determining the distribution calculates
the probability that all those potential values in the study will occur. For example, calculating the
probability that the 30 percent chance for rain will change over the next two days is an example of
probability distribution.
Dimension reduction: Data scientists reduce the number of random variables under consideration
through feature selection (choosing a subset of relevant features) and feature extraction (creating new
features from functions of the original features). This simplifies data models and streamlines the process
of entering data into algorithms.
Over and under sampling: Sampling techniques are implemented when data scientists have too much or
too little of a sample size for a classification. Depending on the balance between two sample groups,
data scientists will either limit the selection of a majority class or create copies of a minority class in
order to maintain equal distribution.
Bayesian statistics: Frequency statistics uses existing data to determine the probability of a future event.
Bayesian statistics, however, takes this concept a step further by accounting for factors we predict will
be true in the future. For example, imagine trying to predict whether at least 100 customers will visit
your coffee shop each Saturday over the next year. Frequency statistics will determine probability by
analyzing data from past Saturday visits. But Bayesian statistics will determine probability by also
factoring for a nearby art show that will start in the summer and take place every Saturday afternoon.
This allows the Bayesian statistical model to provide a much more accurate figure.
The goals of inference
1. Estimate and quantify the uncertainty of an estimate of a population quantity (the proportion of
people who will vote for a candidate).
2. Determine whether a population quantity is a benchmark value (“is the treatment effective?”).
3. Infer a mechanistic relationship when quantities are measured with noise (“What is the slope for
Hooke’s law?”)
4. Determine the impact of a policy (“If we reduce pollution levels, will asthma rates decline?”).
5. Talk about the probability that something occurs.
Algorithm:-
Step 1. Import the libraries and the dataset:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
train_df.shape, test_df.shape
train_df['label']='train'
test_df['label']='test'
combined_data_df=pd.concat([train_df,test_df])
combined_data_df.shape
#The training and test datasets are combined so that the same cleaning and encoding steps can be applied to both.
#Feature summary: Categorical = 10, Numerical = 5, Target = 1
combined_data_df.isnull().sum()
combined_data_df.dropna(subset=['workclass','occupation','native-country'],axis=0,inplace=True)
combined_data_df.isnull().sum()
combined_data_df.dropna(subset=['income_>50K'],axis=0,inplace=True)
combined_data_df.isnull().sum()
sns.set_theme(style="darkgrid")
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "education")
combined_data_df['education'] = combined_data_df['education'].replace(['1st-4th','5th-6th'],'elementary-
school')
combined_data_df['education'] = combined_data_df['education'].replace(['7th-8th'],'middle-school')
combined_data_df['education'] = combined_data_df['education'].replace(['9th','10th','11th','12th'],'high-
school')
combined_data_df['education'] = combined_data_df['education'].replace(['Doctorate','Bachelors','Some-
college','Masters','Prof-school','Assoc-voc','Assoc-acdm'],'postsecondary-education')
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "education")
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "marital-status")
combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Divorced','Never-
married','Widowed'],'single')
combined_data_df['marital-status'] = combined_data_df['marital-status'].replace(['Married-civ-
spouse','Separated','Married-spouse-absent','Married-AF-spouse'],'married')
plt.figure(figsize=(20,10))
plt.figure()
sns.countplot(data= combined_data_df, x = "marital-status")
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, y = "occupation")
plt.figure(figsize=(20,10))
sns.countplot(data= combined_data_df, x = "relationship")
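The dummy-variable conversion mentioned in the conclusion below is not listed step by step; a minimal sketch assuming the cleaned combined_data_df from the steps above (keeping the train/test 'label' marker column untouched is an assumption):
# convert the remaining categorical columns to dummy (0/1) variables
categorical_cols = combined_data_df.select_dtypes(include='object').columns.drop('label')
combined_data_df = pd.get_dummies(combined_data_df, columns=categorical_cols, drop_first=True)
print(combined_data_df.shape)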
Output: Performed statistical analysis on the income prediction dataset and also converted categorical
data into numerical data.
Conclusion:-
● Handled the missing values by dropping them from the dataset.
● From the data visualization, combined/categorized the features.
● Using dummy variables, converted the categorical variables to numerical variables to create a better model.
Assignment No: 4
Create a Linear Regression Model using Python/R to predict home prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains information about
various houses in Boston through different parameters. There are 506 samples and 14 feature variables
in this dataset.
The objective is to predict the value of prices of the house using the given features.
Theory:-
Linear Regression: It is a machine learning algorithm based on supervised learning. It performs
a regression task. Regression models a target prediction value based on independent variables. It is
mostly used for finding out the relationship between variables and for forecasting. Different regression models differ based on the kind of relationship between the dependent and independent variables that they consider, and on the number of independent variables being used.
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between x (input)
and y(output). Hence, the name is Linear Regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The
regression line is the best fit line for our model.
When training the model, it fits the best line to predict the value of y for a given value of x. The model gets the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using our
model for prediction, it will predict the value of y for the
input value of x.
Cost function(J) of Linear Regression is the Root Mean Squared Error (RMSE) between predicted
y value (pred) and true y value (y).
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and achieve the best fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.
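A minimal sketch of gradient descent for simple linear regression; the data, learning rate, and iteration count are illustrative assumptions, not taken from the manual:
import numpy as np

# illustrative data: y is roughly 2*x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

theta1, theta2 = 0.0, 0.0   # intercept and coefficient, starting values
lr = 0.01                   # learning rate
for _ in range(5000):
    pred = theta1 + theta2 * x
    error = pred - y
    # gradients of the mean squared error with respect to theta1 and theta2
    theta1 -= lr * (2.0 / len(x)) * error.sum()
    theta2 -= lr * (2.0 / len(x)) * (error * x).sum()

print(theta1, theta2)                                      # approach the best-fit intercept and slope
print(np.sqrt(np.mean((theta1 + theta2 * x - y) ** 2)))    # RMSE cost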
Algorithm:-
Step 1: Download the data set of Boston Housing Prices
(https://www.kaggle.com/c/boston-housing).
Step 2: Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 3: Importing Data
from sklearn.datasets import load_boston
boston = load_boston()
Step 4: Converting data from nd-array to data frame and adding feature names to the data
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
Step 5: Adding 'Price' (target) column to the data
data['Price'] = boston.target
Step 6: Getting input and output data and further splitting data to training and testing dataset.
# Input Data
x = boston.data
# Output Data
y = boston.target
Step 7: splitting data to training and testing dataset.
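The split itself is not shown here; a minimal sketch using scikit-learn's train_test_split (the 80/20 split and random_state are assumptions):
from sklearn.model_selection import train_test_split

# split the input x and output y from Step 6 into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
print(xtrain.shape, xtest.shape)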
Step 8: #Applying Linear Regression Model to the dataset and predicting the prices.
# Fitting Multi Linear regression model to training model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(xtrain, ytrain)
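A short continuation for predicting prices and checking the error; the use of sklearn.metrics and RMSE follows the cost function described in the theory above:
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = regressor.predict(xtest)                    # predicted prices for the test set
rmse = np.sqrt(mean_squared_error(ytest, y_pred))    # root mean squared error
print("RMSE:", rmse)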
Input: Dataset of Boston Housing Prices. This dataset concerns the housing prices in the housing city
of Boston. The dataset provided has 506 instances with 14 features.
Questions:
1) What is Linear regression
2) What are different types of linear regressions
3) Applications where linear regression is used
4) What are the limitations of linear regression
Assignment No: 5
Note: Logistic regression uses the concept of predictive modeling as regression, which is why it is called logistic regression; however, it is used to classify samples and therefore falls under classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1. Values above the threshold tend to 1, and values below the threshold tend to 0.
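A minimal sketch of the sigmoid function and a 0.5 threshold (the input values are illustrative):
import numpy as np

def sigmoid(z):
    # maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
print(probs)                        # values between 0 and 1
print((probs >= 0.5).astype(int))   # apply a 0.5 threshold to get class labels 0/1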
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1-y):
y / (1-y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation it becomes:
log[ y / (1-y) ] = b0 + b1*x1 + b2*x2 + ... + bn*xn
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those
predicted by the machine learning model. This gives us a holistic view of how well our classification
model is performing and what kinds of errors it is making.
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:
Precision tells us how many of the cases predicted as positive actually turned out to be positive. It is calculated as:
Precision = TP / (TP + FP)
Similarly, Recall = TP / (TP + FN) tells us how many of the actual positive cases the model predicted correctly.
Algorithm:-
Step 5: Scale the features to avoid variation and let the features follow a normal distribution
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
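The steps for loading Social_Network_Ads.csv, splitting it, and training the classifier are not all listed here; a minimal sketch of the training step, assuming X_train, X_test, y_train, y_test exist as in the step above:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# fit a logistic regression classifier on the scaled training data (assumed variable names)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
# after predicting, the confusion matrix can be obtained with confusion_matrix(y_test, y_pred)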
y_pred = classifier.predict(X_test)
#Accuray=(TN+TP)/Total
#Error_rate=(FN+FP)/Total
cl_report=classification_report(y_test,y_pred)
cl_report
Output: Classification using Logistic Regression, Computed Confusion matrix to find TP, FP, TN, FN,
Accuracy, Error rate, Precision, Recall on the given dataset.
Questions:
1) What is logistic regression
2) How it is different from linear regression
3) What are the types of logistic regression
4) What are the limitations of logistic regression
5) List application where logistic regression can be applied
Assignment No: 6
Bayes' theorem is given as: P(A|B) = (P(B|A) * P(A)) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
Algorithm:-
Step 4: Splitting the dataset into the Training set and Test set.
Step 5: Feature scaling of the training and test sets.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Step 6: Training the Naive Bayes Classification model on the Training Set
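The training code is not shown here; a minimal sketch using GaussianNB, assuming X_train, X_test, y_train, y_test come from the earlier splitting and scaling steps:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

classifier = GaussianNB()
classifier.fit(X_train, y_train)         # train on the scaled training set
y_pred = classifier.predict(X_test)      # predict the test set labels
print(confusion_matrix(y_test, y_pred))  # confusion matrix of actual vs predicted classes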
cl_report=classification_report(y_test,y_pred)
print(cl_report)
Output: Confusion matrix, Accuracy, Error rate, Precision, Recall on the given dataset.
Conclusion: Successfully implemented the simple Naïve Bayes classification algorithm using Python on the iris.csv dataset.
Questions:
1) What is confusion matrix
2) How to calculate Accuracy, precision and recall
3) Explain applications of the Naïve Bayes classification algorithm
4) State advantages and limitations of the Naïve Bayes algorithm
Assignment No: 7
Theory:-
Natural language processing is one of the fields in programming where the natural language is processed
by the software. This has many applications like sentiment analysis, language translation, fake news
detection, grammatical error detection etc.
The input in natural language processing is text. The data collection for this text happens from a lot of
sources. This requires a lot of cleaning and processing before the data can be used for analysis.
These are some of the methods of processing the data in NLP:
Tokenization
Stop words removal
Stemming
Normalization
Lemmatization
Parts of speech tagging
Tokenization:
Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words,
sentences called tokens. These tokens help in understanding the context or developing the model for the
NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the
words.
For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’
There are different methods and libraries available to perform tokenization. NLTK, Gensim, Keras are
some of the libraries that can be used to accomplish the task.
Tokenization can be done to either separate words or sentences. If the text is split into words using some
separation technique it is called word tokenization and same separation done for sentences is called
sentence tokenization.
There are various tokenization techniques available which can be applicable based on the language and
purpose of modeling. Below are a few of the tokenization techniques used in NLP.
This is an advanced tokenizer that was available before SpaCy was introduced. It is basically a collection of complex normalization and segmentation logic which works very well for a structured language like English.
Subword Tokenization
This tokenization is very useful for specific applications where subwords are significant. In this technique the most frequently used words are given unique ids, while less frequent words are split into subwords that best represent the meaning independently. For example, if the word "few" appears frequently in the text it will be assigned a unique id, whereas "fewer" and "fewest", which are rarer and less frequent in the text, will be split into subwords like "few", "er", and "est". This helps the language model not to learn "fewer" and "fewest" as two separate words, and it allows unknown words in the data set to be identified during training. There are different types of subword tokenization; they are listed below, and Byte-Pair Encoding and WordPiece will be discussed briefly.
Byte-Pair Encoding (BPE)
WordPiece
Unigram Language Model
SentencePiece
Byte-Pair Encoding (BPE)
This technique is based on the concepts in information theory and compression. BPE uses Huffman
encoding for tokenization meaning it uses more embedding or symbols for representing less frequent
words and less symbols or embedding for more frequently used words.
The BPE tokenization is a bottom-up subword tokenization technique. The steps involved in the BPE algorithm are given below.
1. Start by splitting the input words into single unicode characters; each of them corresponds to a symbol in the final vocabulary.
2. Find the most frequently occurring pair of symbols in the current vocabulary.
3. Add this pair to the vocabulary, so the size of the vocabulary increases by one.
4. Repeat steps 2 and 3 until the defined number of tokens is built or no new combination of symbols exists with the required frequency.
WordPiece
WordPiece is similar to the BPE technique except for the way the new token is added to the vocabulary. BPE merges the token with the most frequently occurring pair of symbols into the vocabulary, while WordPiece also considers the frequency of the individual symbols and merges into the vocabulary based on the count below:
Count(x, y) = frequency of (x, y) / (frequency(x) * frequency(y))
The pair of symbols with the maximum count will be considered for merging into the vocabulary. So, compared to BPE, it allows rarer tokens to be included in the vocabulary.
Tokenization with NLTK
NLTK (Natural Language Toolkit) is a widely used open-source Python library that aids in NLP.
#Tokenization
Input: Text can be sentences, strings, words, characters and large documents.
Now lets create a sentence to understand basics of text mining methods.
Our sentence is "no woman no cry" from Bob Marley
playing swimming dancing played danced
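A minimal sketch that tokenizes the sample paragraph above with NLTK and produces the token list shown below; storing the paragraph in a variable named text is an assumption:
import nltk
nltk.download('punkt')                      # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize

text = ('Text can be sentences, strings, words, characters and large documents. '
        'Now lets create a sentence to understand basics of text mining methods. '
        'Our sentence is "no woman no cry" from Bob Marley '
        'playing swimming dancing played danced')
text_tokens = word_tokenize(text)
print("tokenize with nltk:", text_tokens)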
Output: tokenize with nltk: ['Text', 'can', 'be', 'sentences', ',', 'strings', ',', 'words', ',', 'characters', 'and', 'large', 'documents', '.', 'Now', 'lets', 'create', 'a', 'sentence', 'to', 'understand', 'basics', 'of', 'text', 'mining', 'methods', '.', 'Our', 'sentence', 'is', '``', 'no', 'woman', 'no', 'cry', "''", 'from', 'Bob', 'Marley', 'playing', 'swimming', 'dancing', 'played', 'danced']
Stemming:
Stemming is a natural language processing technique that lowers inflection in words to their root forms,
hence aiding in the preprocessing of text, words, and documents for text normalization.
According to Wikipedia, inflection is the process through which a word is modified to communicate
many grammatical categories, including tense, case, voice, aspect, person, number, gender, and mood.
Thus, although a word may exist in several inflected forms, having multiple inflected forms inside the
same text adds redundancy to the NLP process.
As a result, we employ stemming to reduce words to their basic form or stem, which may or may not be
a legitimate word in the language.
For instance, the stem of these three words, connections, connected, connects, is “connect”. On the other
hand, the root of trouble, troubled, and troubles is “troubl,” which is not a recognized word.
Application of Stemming
Stemming is employed in information retrieval, text mining, SEO, web search results, indexing, tagging systems, and word analysis. For instance, a Google search for "prediction" and "predicted" returns comparable results.
Martin Porter invented the Porter Stemmer or Porter algorithm in 1980. Five steps of word reduction are
used in the method, each with its own set of mapping rules. Porter Stemmer is the original stemmer and
is renowned for its ease of use and rapidity. Frequently, the resultant stem is a shorter word with the
same root meaning.
PorterStemmer() is a module in NLTK that implements the Porter Stemming technique. Let us examine
this with the aid of an example.
Example of PorterStemmer()
In the example below, we construct an instance of PorterStemmer() and use the Porter algorithm to stem
the list of words.
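A minimal sketch of the PorterStemmer() example described above; the word list mixes the sample words from the tokenization example with the 'connect' family used earlier:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['playing', 'swimming', 'dancing', 'played', 'danced',
         'connections', 'connected', 'connects']
for w in words:
    print(w, '->', ps.stem(w))   # e.g. 'connections' -> 'connect', 'playing' -> 'play'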
Lemmatization:
# lemmatization
lemma = nlp.WordNetLemmatizer()
Lemmatization is similar to stemming with one difference i.e. the final form is also a meaningful word.
Thus, stemming operation does not need a dictionary like lemmatization.
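A minimal sketch of WordNet lemmatization; importing nltk under the alias nlp matches the snippet above, and the word list is illustrative:
import nltk as nlp
nlp.download('wordnet')    # WordNet data required by the lemmatizer
nlp.download('omw-1.4')    # extra WordNet data needed by newer NLTK versions

lemma = nlp.WordNetLemmatizer()
words = ['feet', 'children', 'languages', 'played']
print([lemma.lemmatize(w) for w in words])   # e.g. 'feet' -> 'foot', 'children' -> 'child'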
Stopword:
There are different techniques for removing stop words from strings in Python. Stop words are those words in natural language that have very little meaning, such as "is", "an", "the", etc. Search engines and
other enterprise indexing platforms often filter the stop words while fetching results from the database
against the user queries.
Stop words are often removed from the text before training deep learning and machine learning models
since stop words occur in abundance, hence providing little to no unique information that can be used
for classification or clustering.
With the Python programming language, we have a myriad of options for removing stop words from strings. We can use one of several natural language processing libraries such as NLTK, SpaCy, Gensim, TextBlob, etc.
There are a number of different approaches, depending on the NLP library in use.
The NLTK library is one of the oldest and most commonly used Python libraries for Natural Language
Processing. NLTK supports stop word removal, and we can find the list of stop words in
the corpus module. To remove stop words from a sentence, we can divide the text into words and then remove a word if it exists in the list of stop words provided by NLTK.
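A minimal sketch of stop word removal with NLTK; reusing the text_tokens list produced in the tokenization sketch is an assumption:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# keep only the tokens that are not in NLTK's English stop word list
filtered_tokens = [w for w in text_tokens if w.lower() not in stop_words]
print(filtered_tokens)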
Abbreviation   Meaning
CC             coordinating conjunction
CD             cardinal digit
DT             determiner
EX             existential there
FW             foreign word
IN             preposition/subordinating conjunction
LS             list marker
RP             particle (about)
UH             interjection (goodbye)
VB             verb (ask)
The above NLTK POS tag list contains some of the common NLTK POS tags. The NLTK POS tagger is used to assign grammatical information to each word of the sentence. After installing NLTK, importing it, and downloading the required packages, POS tagging can be performed.
import nltk
nltk.download('averaged_perceptron_tagger')
print("\nPOS tagging:",nltk.pos_tag(text_tokens))
POS tagging: [('Text', 'NN'), ('can', 'MD'), ('be', 'VB'), ('sentences', 'NNS'), (',', ','), ('strings', 'NNS'), (',', ','), ('words', 'NNS'), (',', ','), ('characters', 'NNS'), ('and', 'CC'), ('large', 'JJ'), ('documents', 'NNS'), ('.', '.'), ('Now', 'RB'), ('lets', 'VBZ'), ('create', 'VB'), ('a', 'DT'), ('sentence', 'NN'), ('to', 'TO'), ('understand', 'VB'), ('basics', 'NNS'), ('of', 'IN'), ('text', 'NN'), ('mining', 'NN'), ('methods', 'NNS'), ('.', '.'), ('Our', 'PRP$'), ('sentence', 'NN'), ('is', 'VBZ'), ('``', '``'), ('no', 'DT'), ('woman', 'NN'), ('no', 'DT'), ('cry', 'NN'), ("''", "''"), ('from', 'IN'), ('Bob', 'NNP'), ('Marley', 'NNP'), ('playing', 'VBG'), ('swimming', 'VBG'), ('dancing', 'VBG'), ('played', 'VBN'), ('danced', 'VBD')]
TF (Term Frequency) :
Term frequency is simply the count of a word present in a sentence
TF is basically capturing the importance of the word irrespective of the length of the document.
A word with a frequency of 3 in a sentence of length 10 is not the same as a word with a frequency of 3 in a sentence of 100 words. It should get more importance in the first scenario; that is what TF does.
CountVectorizer is used to find the word count in each document of a dataset. Also known as to calculate
Term Frequency
#fit all the sentences using CountVectorizer and get an array of word counts for each document
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# docs is the list of example sentences shown at the end of this section
cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)
tf = pd.DataFrame(word_count_vector.toarray(), columns=cv.get_feature_names())
print(tf)
ate away cat end finally from has house little mouse of ran \
0 0 0 1 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 1 1 1 0 0
2 0 1 0 0 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 0 0 0 1 0 0
4 0 0 0 1 0 0 0 0 0 1 1 0
#declare a TfidfTransformer() instance and fit it with the above word_count_vector to get IDF and the final normalized features
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X = tfidf_transformer.fit_transform(word_count_vector)
idf = pd.DataFrame({'feature_name':cv.get_feature_names(), 'idf_weights':tfidf_transformer.idf_})
print(idf)
feature_name idf_weights
0 ate 2.098612
1 away 2.098612
2 cat 1.693147
3 end 2.098612
4 finally 2.098612
5 from 2.098612
6 has 2.098612
7 house 1.693147
8 little 2.098612
9 mouse 1.000000
10 of 2.098612
11 ran 2.098612
12 see 2.098612
13 story 2.098612
14 the 1.000000
15 tiny 2.098612
Normalized TF-IDF weights (only the last two columns, 'the' and 'tiny', of the document-term matrix are shown):
        the      tiny
0  0.570941  0.000000
1  0.235185  0.493562
2  0.435614  0.000000
3  0.489774  0.000000
4  0.468646  0.000000
docs
0 the cat see the mouse
1 the house has a tiny little mouse
2 the mouse ran away from the house
3 the cat finally ate the mouse
4 the end of the mouse story
Output: list of tokens, POS tags, lemmatization and stemming output, TF, IDF and TF-IDF.
Conclusion: In this way we have applied document pre-processing methods for tokenization, stop word removal, stemming, lemmatization, and PoS tagging on the input document and created a document representation by calculating TF-IDF.
Questions:
1) What is tokenization explain with example
2) Differentiate between stemming and lemmatization
3) Explain TF-IDF with an example
4) What is NLTK?
5) Explain different methods of stopword removal
Assignment No: 8
Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to
see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
Theory:-
Data visualization is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and
patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, square from circle. Our culture
is visual, including everything from art and advertisements to TV and movies. Data visualization is another form
of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends
and outliers. If we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever
stared at a massive spreadsheet of data and couldn’t see a trend, you know how much more effective a
visualization can be.
As the “age of Big Data” kicks into high-gear, visualization is an increasingly key tool to make sense of the
trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form
easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise
from data and highlighting the useful information. However, it’s not simply as easy as just dressing up a graph to
make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate
balancing act between form and function. The plainest graph could be too boring to catch any notice, or it may make a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could
speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with
great storytelling.
Algorithm:
import seaborn as sns
import matplotlib.pyplot as plt
dataset = sns.load_dataset('titanic')
dataset.head()
sns.distplot(dataset['fare'])
sns.distplot(dataset['fare'], kde=False)
sns.pairplot(dataset)
Conclusion: Successfully implemented simple data visualization techniques using Python on the Titanic dataset.
Assignment No: 9
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution
of age with respect to each gender along with the information about whether they survived
or not. (Column names : 'sex' and 'age')
Theory:-
Why data visualization is important
It’s hard to think of a professional industry that doesn’t benefit from making data more understandable. Every
STEM field benefits from understanding data—and so do fields in government, finance, marketing, history,
consumer goods, service industries, education, sports, and so on. While we’ll always wax poetically about data
visualization (you’re on the Tableau website, after all) there are practical, real-life applications that are
undeniable. And, since visualization is so prolific, it’s also one of the most useful professional skills to develop.
The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can
leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to
accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make
decisions and use visuals to tell stories of when data informs the who, what, when, where, and how. While
traditional education typically draws a distinct line between creative storytelling and technical analysis, the
modern professional world also values those who can cross between the two: data visualization sits right in the
middle of analysis and visual storytelling.
When you think of data visualization, your first thought probably immediately goes to simple bar graphs or pie
charts. While these may be an integral part of visualizing data and a common baseline for many data graphics, the
right visualization must be paired with the right set of information. Simple graphs are only the tip of the iceberg.
There’s a whole selection of visualization methods to present data in effective and interesting ways. Common
general types of data visualization:
● Charts
● Tables
● Graphs
● Maps
● Infographics
● Dashboards
● Area Chart
● Bar Chart
● Box-and-whisker Plots
● Bubble Cloud
● Bullet Graph
● Cartogram
● Circle View
● Dot Distribution Map
● Gantt Chart
● Heat Map
● Highlight Table
● Histogram
● Matrix
● Network
● Polar Area
● Radial Tree
● Scatter Plot (2D or 3D)
Algorithm:-
Step 1: Download the Titanic data set and load it into a data frame.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('train.csv')
def displayCount(df,ax):
# ylim max value to be set
y_max = df.value_counts().max() + 75
ax.set_ylim(top=y_max)
assignAxis(df['Sex'],g)
Step 11: #number of males and females based on Age
g = sns.factorplot(x='Age', data=df, kind='count', size=4, aspect=1.8, hue='Sex', alpha=0.7, palette='muted').set(xlabel='Age', ylabel='Count', title='Gender Distribution by Age')
assignAxis(df['Age'],g)
Step 12: People Survived based on gender and age
g = sns.factorplot(x='Survived', data=df, kind='count', size=4, aspect=1.2, hue='Age', col='Sex', alpha=0.7, palette='muted', order=['Survived','Not Survived'], legend_out=True).set(xlabel='Survival', ylabel='Count')
g.fig.subplots_adjust(wspace=.3)
assignAxis(df['Survived'],g)
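The box plot asked for in the problem statement (distribution of age for each gender, split by survival) is not shown in the steps above; a minimal sketch using seaborn's built-in titanic dataset:
import seaborn as sns
import matplotlib.pyplot as plt

dataset = sns.load_dataset('titanic')                        # built-in titanic data
sns.boxplot(x='sex', y='age', hue='survived', data=dataset)  # age per gender, split by survival
plt.show()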
Conclusion: Successfully implemented simple data visualization techniques using Python on the Titanic dataset.
Assignment No: 10
Download the Iris flower dataset or any other dataset into a DataFrame.
(e.g., https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
Theory:-
Data Visualization: Data visualization is the practice of translating information into a visual context, such as a
map or graph, to make data easier for the human brain to understand and pull insights from. The main goal of data
visualization is to make it easier to identify patterns, trends and outliers in large data sets. The term is often used
interchangeably with others, including information graphics, information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data has been collected,
processed and modeled, it must be visualized for conclusions to be made. Data visualization is also an element of
the broader data presentation architecture (DPA) discipline, which aims to identify, locate, manipulate, format and
deliver data in the most efficient way possible.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using a Microsoft Excel
spreadsheet to transform the information into a table, bar graph or pie chart. While these visualization methods are
still commonly used, more intricate techniques are now available, including the following:
● infographics
● bubble clouds
● bullet graphs
● heat maps
● fever charts
● time series charts
Some other popular techniques are as follows.
Line charts: This is one of the most basic and common techniques used. Line charts display how variables can
change over time.
Area charts: This visualization method is a variation of a line chart; it displays multiple values in a time series --
or a sequence of data collected at consecutive, equally spaced points in time.
Scatter plots: This technique displays the relationship between two variables. A scatter plot takes the form of an
x- and y-axis with dots to represent data points.
Treemaps: This method shows hierarchical data in a nested format. The size of the rectangles used for each
category is proportional to its percentage of the whole. Treemaps are best used when multiple categories are
present, and the goal is to compare different parts of a whole.
Population pyramids: This technique uses a stacked bar graph to display the complex social narrative of a
population. It is best used when trying to display the distribution of a population.
Common data visualization use cases
Common use cases for data visualization include the following:
Sales and marketing: Research from the media agency Magna predicts that half of all global advertising dollars
will be spent online by 2020. As a result, marketing teams must pay close attention to their sources of web traffic
and how their web properties generate revenue. Data visualization makes it easy to see traffic trends over time as
a result of marketing efforts.
Politics: A common use of data visualization in politics is a geographic map that displays the party each state or
district voted for.
Healthcare: Healthcare professionals frequently use choropleth maps to visualize important health data. A
choropleth map displays divided geographical areas or regions that are assigned a certain color in relation to a
numeric variable. Choropleth maps allow professionals to see how a variable, such as the mortality rate of heart
disease, changes across specific territories.
Scientists: Scientific visualization, sometimes referred to in shorthand as SciVis, allows scientists and researchers
to gain greater insight from their experimental data than ever before.
Finance: Finance professionals must track the performance of their investment decisions when choosing to buy or
sell an asset. Candlestick charts are used as trading tools and help finance professionals analyze price movements
over time, displaying important information, such as securities, derivatives, currencies, stocks, bonds and
commodities. By analyzing how the price has changed over time, data analysts and finance professionals can
detect trends.
Logistics: Shipping companies can use visualization tools to determine the best global shipping routes.
Algorithm:-
Step 1: Download the data set of Iris Flower
Step 2: Importing Libraries
import numpy as np
import pandas as pd
Step 3: Reading the dataset
df = pd.read_csv("iris-flower-dataset.csv")
df
Step 4: Perform statistical analysis on data
df.mean()
df.median()
df.std()
df.min()
df.max()
df.describe()
Step 5: Print number of features and their datatypes
column = len(list(df))
column
df.info()
np.unique(df["species"])
Step 6: Data Visualization-Create a histogram for each feature in the dataset to illustrate the
feature distributions. Plot each histogram.
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")
# Creating a figure instance
fig = plt.figure(1, figsize=(12,8))
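A minimal sketch that continues step 6 with the actual histograms and also draws the box plots asked for in the problem statement; it assumes the dataframe df loaded above with a 'species' column and four numeric features:
# assumes df, sns, and plt are available from the steps above
features = [c for c in df.columns if c != 'species']
df[features].hist(figsize=(12, 8))             # one histogram per numeric feature
plt.show()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flatten(), features):
    sns.boxplot(y=df[col], ax=ax)              # one box plot per numeric feature
plt.show()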
Conclusion: Successfully implemented simple data visualization techniques using Python on the Iris flower dataset.