FDS Unit 2
1. EXPLORATORY DATA ANALYSIS
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods. It
helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check
assumptions.
The main purpose of EDA is to help look at data before making any assumptions. It can help
identify obvious errors, better understand patterns within the data, detect outliers or
anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modeling, including machine learning.
Specific statistical functions and techniques you can perform with EDA tools include:
Clustering and dimension reduction techniques, which help create graphical displays
of high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable you’re
looking at.
Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
K-means Clustering is a clustering method in unsupervised learning where data
points are assigned into K groups (K being the number of clusters) based on the
distance from each group’s centroid. The data points closest to a particular centroid
are clustered under the same category. K-means clustering is commonly used in
market segmentation, pattern recognition, and image compression.
Predictive models, such as linear regression, use statistics and data to predict
outcomes (see the sketch after this list).
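As a rough illustration of the last two items, here is a minimal sketch assuming scikit-learn (a library not otherwise used in this unit); all data points are made up:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# K-means: assign six 2-D points to K = 2 groups by distance to each centroid
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids

# Linear regression: a predictive model fit to data that is roughly y = 2x
x = np.array([[1], [2], [3], [4]])
y = np.array([2.1, 3.9, 6.2, 8.1])
model = LinearRegression().fit(x, y)
print(model.predict([[5]]))     # prediction close to 10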
There are four primary types of EDA:
Univariate non-graphical. This is the simplest form of data analysis, where the data
being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
Univariate graphical. Non-graphical methods don’t provide a full picture of the
data. Graphical methods are therefore required. Common types of univariate
graphics include:
o Stem-and-leaf plots, which show all data values and the shape of the
distribution.
o Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
o Box plots, which graphically depict the five-number summary of minimum,
first quartile, median, third quartile, and maximum.
Multivariate non-graphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables of the data through cross-tabulation or statistics (see
the cross-tabulation sketch after this list).
Multivariate graphical: Multivariate data uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or bar
chart with each group representing one level of one of the variables and each bar
within a group representing the levels of the other variable.
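As a small illustration of cross-tabulation for multivariate non-graphical EDA, here is a minimal sketch assuming pandas (not otherwise used in this unit); the survey data is made up:

import pandas as pd

# Hypothetical survey responses: two categorical variables
df = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'preference': ['tea', 'coffee', 'tea', 'tea', 'coffee', 'coffee'],
})

# Cross-tabulation: count of each gender/preference combination
print(pd.crosstab(df['gender'], df['preference']))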
Some of the most common data science tools used to create an EDA are Python, with libraries such as pandas, NumPy, and Matplotlib, and R, a language widely used for statistical computing and graphics.
There are important reasons anyone working with data should do EDA.
In the context of data generated from logs, EDA also helps with debugging the logging
process. For example, “patterns” you find in the data could actually be something wrong in
the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll
continue to think your patterns are real. The engineers we’ve worked with are always grateful
for help in this area.
2. THE LIFECYCLE OF DATA SCIENCE
1. Business Understanding: The complete cycle revolves around the business goal. What
will you solve if you do not have a specific problem? It is extremely important to
understand the business objective clearly, since that will be the ultimate aim of the
analysis. Only after proper understanding can we set the specific goal of analysis that is in
sync with the business objective. You need to understand whether the client wants to
minimize credit loss, or prefers to predict the price of a commodity, etc.
2. Data Collection: The next step is to collect the data relevant to the problem, since if the
data is not gathered properly, records are lost and no good model can be built.
3. Preparation of Data: Next comes the data preparation stage. This consists of steps like
choosing the applicable data, integrating the data by merging the data sets, cleaning it,
treating missing values by either eliminating or imputing them, treating inaccurate data by
eliminating it, and also checking for outliers using box plots. It also involves constructing
new data, i.e., deriving new features from existing ones.
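A minimal data-preparation sketch, assuming pandas; the data set, column names, and the age threshold are made up for illustration:

import pandas as pd

# Hypothetical raw data with a missing value and an obviously inaccurate age
df = pd.DataFrame({'age': [25, 31, None, 29, 400],
                   'income': [40000, 52000, 48000, None, 61000]})

df = df.dropna(subset=['age'])                             # eliminate a missing value
df['income'] = df['income'].fillna(df['income'].median())  # or impute it instead
df = df[df['age'] < 120]                                   # eliminate inaccurate data

# Constructing new data: derive a new feature from existing ones
df['income_per_year_of_age'] = df['income'] / df['age']
print(df)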
4. Exploratory Data Analysis: This step involves getting some idea about the solution and
the factors affecting it before constructing the real model. The distribution of data within
the different variables is explored graphically using bar graphs, and relations between
distinct features are captured through graphical representations like scatter plots and heat
maps. Many data visualization techniques are used extensively to explore each feature
individually and in combination with other features.
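For instance, a heat map of the correlations between numeric features might be sketched as follows, using only Matplotlib and NumPy as elsewhere in this unit (the features are randomly generated for illustration):

import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)

# Made-up feature matrix: 100 observations of three numeric features,
# where feature 2 is constructed to correlate with feature 0
features = rng.normal(size=(100, 3))
features[:, 2] += 2 * features[:, 0]

corr = np.corrcoef(features, rowvar=False)  # 3 x 3 correlation matrix

fig, ax = plt.subplots(1, 1)
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(im, ax=ax)
ax.set_title('correlation heat map')
plt.show()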
5. Data Modeling: A model takes the prepared data as input and gives the desired output.
This step involves choosing the appropriate kind of model, depending on whether the
problem is a classification, regression, or clustering problem. After deciding on the model
family, we need to carefully choose which algorithms within that family to implement, and
then implement them. We also need to tune the hyperparameters of each model to achieve
the desired performance.
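A rough sketch of choosing and tuning a model, assuming scikit-learn (not otherwise used in this unit) and synthetic data; the decision tree and the max_depth grid are arbitrary illustrative choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the prepared data set
X, y = make_classification(n_samples=200, random_state=0)

# Tune the max_depth hyperparameter by 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [2, 4, 8]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)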
6. Model Evaluation: Before deployment, the model has to be evaluated properly, for
example on data it has not been trained on; a model that is not evaluated properly will fail
in the real world.
7. Model Deployment: This is the last step in the data science life cycle. Each step
described above must be worked upon carefully: if any step is performed improperly, it
affects the subsequent step, and the complete effort goes to waste. For example, if data is
not collected properly, you will lose records and will not build an ideal model. If the data is
not cleaned properly, the model will not work. If the model is not evaluated properly, it
will fail in the real world. Right from business understanding to model deployment, every
step has to be given appropriate attention, time, and effort.
3. DESCRIPTIVE STATISTICS
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).
The next step is inferential statistics, which help you decide whether your data confirms or
refutes your hypothesis and whether it is generalizable to a larger population.
Frequency distribution
A frequency distribution describes how often each value, or each range of values, occurs in a
data set. Frequency distributions are particularly useful for normal distributions, which show
how observations are spread across intervals measured in standard deviations from the mean.
In finance, traders use frequency distributions to take note of price action and identify trends.
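A minimal sketch of computing a frequency distribution in Python, using the library-visits data from the examples later in this section:

import collections

# Visits to the library in the past year
data = [15, 3, 12, 0, 24, 3]

# Count how often each value occurs
print(collections.Counter(data))  # Counter({3: 2, 15: 1, 12: 1, 0: 1, 24: 1})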
The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.
To find the median, order each response value from the smallest to the biggest. Then, the
median is the number in the middle. If there are two numbers in the middle, find their mean.
The mode is simply the most popular or most frequent response value. A data set can
have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.
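All three measures can be checked quickly with Python's built-in statistics module, again using the library-visits data:

import statistics

# Visits to the library in the past year
data = [15, 3, 12, 0, 24, 3]

print(statistics.mean(data))    # 9.5 -> sum of 57 divided by N = 6
print(statistics.median(data))  # 7.5 -> mean of the two middle values, 3 and 12
print(statistics.mode(data))    # 3   -> the most frequent value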
Measures of variability
Measures of variability give you a sense of how spread out the response values are. The
range, standard deviation and variance each reflect different aspects of spread.
Range
The range gives you an idea of how far apart the most extreme response scores are. To find
the range, simply subtract the lowest value from the highest value.
Range of visits to the library in the past year:
Ordered data set: 0, 3, 3, 12, 15, 24
Range: 24 – 0 = 24
Standard deviation
The standard deviation (s) is the average amount of variability in your dataset. It tells you, on
average, how far each score lies from the mean. The larger the standard deviation, the more
variable the data set is.
Standard deviation of visits to the library in the past year:
Data set: 15, 3, 12, 0, 24, 3
To calculate s, first find the mean (9.5), then each score’s deviation from the mean, then
square each deviation; these steps are completed in the table below. Finally, sum the squared
deviations, divide by N − 1, and take the square root.

Raw data    Deviation from mean    Squared deviation
15          5.5                    30.25
3           −6.5                   42.25
12          2.5                    6.25
0           −9.5                   90.25
24          14.5                   210.25
3           −6.5                   42.25

Sum of squared deviations: 421.5
s = √(421.5 / (6 − 1)) = √84.3 ≈ 9.18
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
Variance
The variance is the average of squared deviations from the mean. Variance reflects the degree
of spread in the data set. The more spread the data, the larger the variance is in relation to the
mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year:
Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
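The same values can be reproduced in Python; note that statistics.stdev and statistics.variance use the sample formulas (dividing by N − 1), matching the worked example above:

import statistics

# Visits to the library in the past year
data = [15, 3, 12, 0, 24, 3]

print(max(data) - min(data))      # 24      (range)
print(statistics.stdev(data))     # 9.18... (sample standard deviation, s)
print(statistics.variance(data))  # 84.3    (sample variance, s squared)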
4. DATA VISUALIZATION
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
Common types of data visualizations include:
Bar Chart
Box-and-whisker Plots
Bubble Cloud
Gantt Chart
Heat Map
Histogram
Radial Tree
Scatter Plot (2D or 3D)
Scatter Plot
A scatter plot is a chart type that is normally used to observe and visually display the
relationship between variables. The values of the variables are represented by dots.
The position of each dot on the horizontal and vertical axes indicates the values of
the respective data point; hence, scatter plots use Cartesian coordinates to display
the values of the variables in a data set. Scatter plots are also known as
scattergrams, scatter graphs, or scatter charts.
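A minimal Matplotlib sketch of a scatter plot; the hours-studied and exam-score values are made up for illustration:

from matplotlib import pyplot as plt

# Illustrative data: hours studied vs. exam score for ten students
hours = [1, 2, 2.5, 3, 4, 4.5, 5, 6, 7, 8]
score = [35, 42, 48, 50, 58, 60, 65, 72, 80, 85]

fig, ax = plt.subplots(1, 1)
ax.scatter(hours, score)  # each dot is one (hours, score) pair
ax.set_xlabel('hours studied')
ax.set_ylabel('exam score')
plt.show()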
Bar Graph
The pictorial representation of grouped data, in the form of vertical or horizontal
rectangular bars where the lengths of the bars are equivalent to the measure of the data, is
known as a bar graph or bar chart.
The bars drawn are of uniform width, and the variable quantity is represented on one of the
axes; the measure of the variable is depicted on the other axis. The heights or lengths of the
bars denote the value of the variable, and these graphs are also used to compare certain
quantities. Frequency distribution tables can be easily represented using bar charts, which
simplify the calculation and understanding of data.
The three major attributes of bar graphs are:
The bar graph helps to compare the different sets of data among different groups
easily.
It shows the relationship using two axes, with the categories on one axis and the
discrete values on the other axis.
The graph shows the major changes in data over time.
The advantages of bar charts are as follows:
A bar graph summarises a large set of data in a simple visual form.
It displays each category of data in the frequency distribution.
It clarifies the trend of data better than a table.
It helps in estimating key values at a glance.
Following is a simple example of the Matplotlib bar plot. It shows the number of
students enrolled for various courses offered at an institute.
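A minimal sketch of such a plot; the course names and enrollment numbers are made up:

from matplotlib import pyplot as plt

# Illustrative data: number of students enrolled for each course
courses = ['C', 'C++', 'Java', 'Python']
students = [23, 17, 35, 29]

fig, ax = plt.subplots(1, 1)
ax.bar(courses, students)
ax.set_title('students enrolled per course')
ax.set_xlabel('course')
ax.set_ylabel('no. of students')
plt.show()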
Histogram
A histogram is a graphical representation of a grouped frequency distribution with
continuous classes. It is an area diagram and can be defined as a set of rectangles with bases
along with the intervals between class boundaries and with areas proportional to frequencies
in the corresponding classes. In such representations, all the rectangles are adjacent since the
base covers the intervals between class boundaries. The heights of rectangles are proportional
to corresponding frequencies of similar classes and for different classes, the heights will be
proportional to corresponding frequency densities.
In other words, a histogram is a diagram involving rectangles whose area is proportional to
the frequency of a variable and whose width is equal to the class interval.
Histogram Types
The histogram can be classified into different types based on the frequency distribution of the
data. There are different types of distributions, such as normal distribution, skewed
distribution, bimodal distribution, multimodal distribution, comb distribution, edge peak
distribution, dog food distribution, heart cut distribution, and so on.
The following example plots a histogram of marks obtained by students in a class. Four
bins, 0-25, 25-50, 50-75, and 75-100, are defined, and the histogram shows the number of
students falling in each range.
from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)

# Marks obtained by 15 students
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# The bin edges define the four ranges 0-25, 25-50, 50-75, and 75-100
ax.hist(a, bins=[0, 25, 50, 75, 100])
ax.set_title("histogram of result")
ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
The plot appears as shown below:
[Figure: histogram of marks, with four bars over the ranges 0-25, 25-50, 50-75, and 75-100]
Heat Map
A heat map (or heatmap) is a graphical representation of data where values are depicted by
color. Heat maps make it easy to visualize complex data and understand it at a glance.
Types of heatmap
Heat map is really an umbrella term for different heatmapping tools: scroll maps, click maps,
and move maps.
Scroll maps show you the exact percentage of people who scroll down to any point on
the page: the redder the area, the more visitors saw it.
Click maps show you an aggregate of where visitors click their mouse on desktop
devices and tap their finger on mobile devices (in this case, they are known as touch
heatmaps). The map is color-coded to show the elements that have been clicked and
tapped the most (red, orange, yellow).
Move maps track where desktop users move their mouse as they navigate the page. The
hot spots in a move map represent where users have moved their cursor on a page.
Box Plots
A box plot graphically depicts the five-number summary (minimum, first quartile, median,
third quartile, and maximum) described earlier. The following example draws a random
sample from a normal distribution and plots it:
from matplotlib import pyplot as plt
import numpy as np

# Creating dataset: 200 values drawn from a normal distribution
# with mean 100 and standard deviation 20
np.random.seed(10)
data = np.random.normal(100, 20, 200)

fig = plt.figure(figsize=(10, 7))
plt.boxplot(data)

# show plot
plt.show()
Output:
[Figure: box plot of the generated data]