FDS Unit 2


1. EXPLORATORY DATA ANALYSIS (EDA)

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.

The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, as well as better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modeling, including machine learning.

Exploratory data analysis tools

Specific statistical functions and techniques you can perform with EDA tools include:

 Clustering and dimension reduction techniques, which help create graphical displays
of high-dimensional data containing many variables.
 Univariate visualization of each field in the raw dataset, with summary statistics.
 Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable you’re
looking at.
 Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
 K-means Clustering is a clustering method in unsupervised learning where data
points are assigned into K groups, i.e. the number of clusters, based on the distance
from each group’s centroid. The data points closest to a particular centroid will be
clustered under the same category. K-means clustering is commonly used in market
segmentation, pattern recognition, and image compression (a minimal sketch follows
this list).
 Predictive models, such as linear regression, use statistics and data to predict
outcomes.
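As an illustration of the K-means technique above, here is a minimal sketch. It assumes
scikit-learn is installed; the six sample points and the choice of K = 2 are illustrative
assumptions, not part of the original text.

import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points that form two visually obvious groups (illustrative data)
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

# Assign the points to K = 2 groups; n_init=10 restarts avoid poor local optima
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of each group's centroid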

Types of exploratory data analysis

There are four primary types of EDA:

 Univariate non-graphical. This is the simplest form of data analysis, where the data
being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
 Univariate graphical. Non-graphical methods don’t provide a full picture of the
data. Graphical methods are therefore required. Common types of univariate
graphics include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
 Multivariate non-graphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables of the data through cross-tabulation or statistics
(see the cross-tabulation sketch after this list).
 Multivariate graphical: Multivariate data uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.
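To make the multivariate non-graphical idea concrete, here is a minimal cross-tabulation
sketch with pandas. The two categorical columns are invented purely for illustration.

import pandas as pd

df = pd.DataFrame({
    'segment': ['retail', 'retail', 'corporate', 'corporate', 'retail'],
    'churned': ['yes', 'no', 'no', 'no', 'yes'],
})

# Cross-tabulate the two categorical variables to see how often they co-occur
print(pd.crosstab(df['segment'], df['churned']))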

Exploratory Data Analysis Tools

Some of the most common data science tools used to create an EDA include:

 Python: An interpreted, object-oriented programming language with dynamic
semantics. Its high-level, built-in data structures, combined with dynamic typing
and dynamic binding, make it very attractive for rapid application development, as
well as for use as a scripting or glue language to connect existing components
together. Python and EDA can be used together to identify missing values in a data
set, which is important so you can decide how to handle missing values for machine
learning (see the sketch after this list).
 R: An open-source programming language and free software environment for
statistical computing and graphics supported by the R Foundation for Statistical
Computing. The R language is widely used among statisticians in data science in
developing statistical observations and data analysis.
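As an example of the missing-value check mentioned in the Python entry above, here is a
minimal pandas sketch; the small data frame is an illustrative assumption.

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31], 'income': [50000, 62000, np.nan]})

print(df.isnull().sum())                    # count of missing values per column
print(df['age'].fillna(df['age'].mean()))   # one possible way to fill the gaps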

Philosophy of Exploratory Data Analysis

There are important reasons anyone working with data should do EDA.

 Namely, to gain intuition about the data;


 To make comparisons between distributions;
 For sanity checking (making sure the data is on the scale you expect, in the format
you thought it should be);
 To find out where data is missing or if there are outliers;
 To summarize the data.

In the context of data generated from logs, EDA also helps with debugging the logging
process. For example, “patterns” you find in the data could actually be something wrong in
the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll
continue to think your patterns are real. The engineers we’ve worked with are always grateful
for help in this area.
2. THE LIFECYCLE OF DATA SCIENCE

1. Business Understanding: The complete cycle revolves around the business goal. What
will you solve if you do not have a specific problem? It is extremely important to
understand the business objective clearly, because that is the ultimate aim of the
analysis. Only after a proper understanding can we set the precise goal of the analysis,
one that is in sync with the business objective. You need to understand whether the
customer wants to minimize savings loss, predict the rate of a commodity, etc.

2. Data Understanding: After business understanding, the subsequent step is data
understanding. This includes a review of all the available data. Here you need to work
closely with the business team, as they know what information is present, which data
should be used for this business problem, and other details. This step includes
describing the data, their structure, their relevance, and their data types. Explore the
information using graphical plots; basically, extract any knowledge you can about the
data by simply exploring it.

3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
selecting the relevant data, integrating the data by merging the data sets, cleaning it,
treating the missing values by eliminating them, treating inaccurate data by eliminating
it, and also testing for outliers using box plots. Construct new data and derive new
features from existing ones.
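A minimal preparation sketch with pandas follows; the two frames, the column names, and
the 1.5 × IQR outlier rule applied here are illustrative choices, not a prescription.

import pandas as pd

customers = pd.DataFrame({'id': [1, 2, 3], 'age': [34, None, 45]})
orders = pd.DataFrame({'id': [1, 2, 3], 'amount': [120.0, 80.0, 10000.0]})

df = customers.merge(orders, on='id')             # integrate the data sets
df['age'] = df['age'].fillna(df['age'].median())  # treat the missing values

# Test for outliers using the 1.5 * IQR rule that underlies box plots
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df['amount'] < q1 - 1.5 * iqr) | (df['amount'] > q3 + 1.5 * iqr)])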

4. Exploratory Data Analysis: This step involves getting some idea about the solution and
the factors affecting it, before constructing the real model. The distribution of data
within different variables is explored graphically using bar graphs, and relations
between different features are captured through graphical representations like scatter
plots and heat maps. Many data visualization techniques are used extensively to explore
each and every feature individually and in combination with other features.

5. Data Modeling: A model takes the prepared data as input and gives the desired output.
This step consists of choosing the appropriate kind of model, depending on whether the
problem is a classification, regression, or clustering problem. After deciding on the
model family, we need to carefully choose, from the algorithms within that family, the
ones to implement. We also need to tune the hyperparameters of each model to obtain
the desired performance, as sketched below.
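A minimal model-selection and tuning sketch, assuming scikit-learn; the estimator and the
small parameter grid are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Search a small hyperparameter grid with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 4, 8]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)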

6. Model Evaluation: Here the model is evaluated to check whether it is ready to be
deployed. The model is tested on unseen data and evaluated on a carefully thought-out
set of assessment metrics.
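A minimal evaluation sketch, again assuming scikit-learn: hold out unseen data, then score
the model on a chosen metric. The synthetic data, the model, and accuracy as the metric are
illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # score on unseen data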

7. Model Deployment: This is the last step in the data science life cycle. Each step in the
data science life cycle defined above must be worked upon carefully. If any step is
performed improperly, it affects the subsequent step and the complete effort goes to
waste. For example, if data is not collected properly, you will lose records and will not
build an ideal model. If the data is not cleaned properly, the model will not work. If the
model is not evaluated properly, it will fail in the real world. Right from business
understanding to model deployment, every step has to be given appropriate attention,
time, and effort.

3. DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).

The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.

Types of descriptive statistics


There are 3 main types of descriptive statistics:

 The distribution concerns the frequency of each value.


 The central tendency concerns the averages of the values.
 The variability or dispersion concerns how spread out the values are.

Frequency distribution

Frequency distribution in statistics is a representation that displays the number of
observations within a given interval. The representation of a frequency distribution can be
graphical or tabular so that it is easier to understand.

Frequency distributions are particularly useful for normal distributions, which show how
observations are spread across standard deviations around the mean.

In finance, traders use frequency distributions to take note of price action and identify trends.
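A minimal tabular frequency distribution can be built with pandas, as sketched below; the
scores and the interval edges are illustrative assumptions.

import pandas as pd

scores = pd.Series([12, 7, 22, 15, 7, 31, 18, 25, 7, 14])

# Count how many observations fall within each interval
intervals = pd.cut(scores, bins=[0, 10, 20, 30, 40])
print(intervals.value_counts().sort_index())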

Measures of central tendency


Measures of central tendency estimate the center, or average, of a data set.
The mean, median and mode are 3 ways of finding the average.

The mean, or M, is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.

Mean number of library visits


Data set 15, 3, 12, 0, 24, 3
Sum of all values 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses N = 6
Mean Divide the sum of values by N to find M: 57/6 = 9.5
The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then, the
median is the number in the middle. If there are two numbers in the middle, find their mean.

Median number of library visits


Ordered data set 0, 3, 3, 12, 15, 24
Middle numbers 3, 12
Median Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

The mode is simply the most popular or most frequent response value. A data set can
have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.

Mode number of library visits


Ordered data set 0, 3, 3, 12, 15, 24
Mode Find the most frequently occurring response: 3
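The three library-visit examples above can be reproduced with Python's built-in statistics
module, as a quick sketch (pandas or numpy would work equally well).

import statistics

visits = [15, 3, 12, 0, 24, 3]

print(statistics.mean(visits))    # 9.5
print(statistics.median(visits))  # 7.5
print(statistics.mode(visits))    # 3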

Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.

Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.

Range of visits to the library in the past year Ordered data set: 0, 3, 3, 12, 15, 24

Range: 24 – 0 = 24

Standard deviation
The standard deviation (s) is the average amount of variability in your dataset. It tells you, on
average, how far each score lies from the mean. The larger the standard deviation, the more
variable the data set is.

There are six steps for finding the standard deviation:

1. List each score and find their mean.


2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by N – 1.
6. Find the square root of the number you found.

Standard deviation of visits to the library in the past year. In the table below, you
complete Steps 1 through 4.
Raw data Deviation from mean Squared deviation

15 15 – 9.5 = 5.5 30.25

3 3 – 9.5 = -6.5 42.25

12 12 – 9.5 = 2.5 6.25

0 0 – 9.5 = -9.5 90.25

24 24 – 9.5 = 14.5 210.25

3 3 – 9.5 = -6.5 42.25

M = 9.5 Sum = 0 Sum of squares = 421.5

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18

From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.

Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.

To find the variance, simply square the standard deviation. The symbol for variance is s².

Variance of visits to the library in the past year Data set: 15, 3, 12, 0, 24, 3

s = 9.18

s² = 84.3
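The worked values above can be checked with numpy, as a quick sketch; ddof=1 makes
numpy divide by N − 1, matching Step 5 of the procedure.

import numpy as np

visits = np.array([15, 3, 12, 0, 24, 3])

print(round(np.var(visits, ddof=1), 1))  # 84.3
print(round(np.std(visits, ddof=1), 2))  # 9.18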
4. DATA VISUALIZATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.

More specific examples of methods to visualize data:

 Bar Chart
 Box-and-whisker Plots
 Bubble Cloud
 Gantt Chart
 Heat Map
 Histogram
 Radial Tree
 Scatter Plot (2D or 3D)

Scatter Plot
A scatter plot is a chart type that is normally used to observe and visually display the
relationship between variables. The values of the variables are represented by dots.
The positioning of the dots on the vertical and horizontal axis will inform the value of
the respective data point; hence, scatter plots make use of Cartesian coordinates to
display the values of the variables in a data set. Scatter plots are also known as
scattergrams, scatter graphs, or scatter charts.
Scatter Plot Applications and Uses

1. Demonstration of the relationship between two variables

2. Identification of correlational relationships

3. Identification of data patterns


Drawing a Scatter Plot

A scatter plot can be created using the DataFrame.plot.scatter() method.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 50 rows of random values in four columns
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

# Plot column 'a' against column 'b' as a scatter chart
df.plot.scatter(x='a', y='b')
plt.show()

The output is a scatter chart of column 'a' against column 'b'.

Bar Graph
The pictorial representation of grouped data, in the form of vertical or horizontal
rectangular bars, where the lengths of the bars are equivalent to the measure of the data, is
known as a bar graph or bar chart.
The bars drawn are of uniform width, and the variable quantity is represented on one of the
axes. Also, the measure of the variable is depicted on the other axis. The heights or the
lengths of the bars denote the value of the variable, and these graphs are also used to compare
certain quantities. Frequency distribution tables can be easily represented using bar charts,
which simplify the calculation and understanding of data.
The three major attributes of bar graphs are:

 The bar graph helps to compare the different sets of data among different groups
easily.
 It shows the relationship using two axes, with the categories on one axis and the
discrete values on the other axis.
 The graph shows the major changes in data over time.
The types of bar charts are as follows:

1. Vertical bar chart


2. Horizontal bar chart

Properties of Bar Graph


Some of the important properties of a bar graph are as follows:

 All the bars should have a common base.


 Each column in the bar graph should have equal width.
 The height of the bar should correspond to the data value.
 The distance between each bar should be the same.

Advantages:

 A bar graph summarises a large set of data in a simple visual form.
 It displays each category of data in the frequency distribution.
 It clarifies the trend of data better than the table.
 It helps in estimating the key values at a glance.

Following is a simple example of the Matplotlib bar plot. It shows the number of
students enrolled in various courses offered at an institute.

import matplotlib.pyplot as plt

fig = plt.figure()
# Axes occupying the whole figure: [left, bottom, width, height]
ax = fig.add_axes([0, 0, 1, 1])
langs = ['C', 'C++', 'Java', 'Python', 'PHP']
students = [23, 17, 35, 29, 12]
ax.bar(langs, students)
plt.show()

Histogram
A histogram is a graphical representation of a grouped frequency distribution with
continuous classes. It is an area diagram and can be defined as a set of rectangles with bases
along with the intervals between class boundaries and with areas proportional to frequencies
in the corresponding classes. In such representations, all the rectangles are adjacent since the
base covers the intervals between class boundaries. The heights of rectangles are proportional
to corresponding frequencies of similar classes and for different classes, the heights will be
proportional to corresponding frequency densities.
In other words, a histogram is a diagram involving rectangles whose area is proportional to
the frequency of a variable and whose width is equal to the class interval.

When to Use Histogram?


The histogram graph is used under certain conditions. They are:

 The data should be numerical.


 A histogram is used to check the shape of the data distribution.
 Used to check whether the process changes from one period to another.
 Used to determine whether the output is different when it involves two or more
processes.
 Used to analyse whether the given process meets the customer requirements.

Histogram Types
The histogram can be classified into different types based on the frequency distribution of the
data. There are different types of distributions, such as normal distribution, skewed
distribution, bimodal distribution, multimodal distribution, comb distribution, edge peak
distribution, dog food distributions, heart cut distribution, and so on.
The following example plots a histogram of marks obtained by students in a class. Four bins,
0-25, 25-50, 50-75, and 75-100, are defined (matching bins = [0, 25, 50, 75, 100] in the
code), and the histogram shows the number of students falling in each range.
from matplotlib import pyplot as plt
import numpy as np
fig,ax = plt.subplots(1,1)
a = np.array([22,87,5,43,56,73,55,54,11,20,51,5,79,31,27])
ax.hist(a, bins = [0,25,50,75,100])
ax.set_title("histogram of result")
ax.set_xticks([0,25,50,75,100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
The output is a histogram with one bar per marks range.
Heat Map

A heat map (or heatmap) is a graphical representation of data where values are depicted by
color. Heat maps make it easy to visualize complex data and understand it at a glance.

Types of heatmap

Heat map is really an umbrella term for different heatmapping tools: scroll maps, click maps,
and move maps.

Scroll maps show you the exact percentage of people who scroll down to any point on
the page: the redder the area, the more visitors saw it.
Click maps show you an aggregate of where visitors click their mouse on desktop
devices and tap their finger on mobile devices (in this case, they are known as touch
heatmaps). The map is color-coded to show the elements that have been clicked and
tapped the most (red, orange, yellow).
Move maps track where desktop users move their mouse as they navigate the page. The
hot spots in a move map represent where users have moved their cursor on a page.
The example below is a two-dimensional plot of values which are mapped to the indices and
columns of the chart.

from pandas import DataFrame
import matplotlib.pyplot as plt

# Rows of values keyed by index labels I1..I5 and column labels C1..C4
# (plain lists are used here; sets would not preserve element order)
data = [[2, 3, 4, 1], [6, 3, 5, 2], [6, 3, 5, 4], [3, 7, 5, 4], [2, 8, 1, 5]]
Index = ['I1', 'I2', 'I3', 'I4', 'I5']
Cols = ['C1', 'C2', 'C3', 'C4']
df = DataFrame(data, index=Index, columns=Cols)

# Color each cell of the frame by its value
plt.pcolor(df)
plt.show()

The output is a grid of colored cells, one per value in the frame.


Box Plots

When we display the data distribution in a standardized way using the five-number
summary (minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum), it is
called a box plot. It is also termed a box-and-whisker plot.

Parts of Box Plots


The key parts of a box plot are the minimum, maximum, first quartile, third quartile,
median, and any outliers.

Minimum: The minimum value in the given dataset


First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given dataset into
two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range. (i.e.) IQR = Q3-Q1
Outlier: Data that falls far to the left or right of the ordered data is tested as an outlier.
Generally, outliers fall more than 1.5 × IQR below the first quartile or above the third
quartile.

import matplotlib.pyplot as plt
import numpy as np

# Creating dataset: 200 values drawn from a normal distribution
# with mean 100 and standard deviation 20
np.random.seed(10)
data = np.random.normal(100, 20, 200)

fig = plt.figure(figsize=(10, 7))
plt.boxplot(data)

# show plot
plt.show()

The output is a box plot of the 200 simulated values.
