UNIT – III
Exploratory Data Analysis and the Data Science Process – SCSA3016
UNIT 3 EXPLORATORY DATA ANALYSIS AND THE DATA SCIENCE PROCESS
Exploratory Data Analysis and the Data Science Process - Basic tools (plots, graphs and summary statistics) of EDA - Philosophy of EDA - The Data Science Process - Data Visualization - Basic principles, ideas and tools for data visualization - Examples of exciting projects - Data Visualization using Tableau.
Exploring data in a systematic way is a task that statisticians call exploratory data analysis, or EDA. EDA is iterative: what you learn at each step is used to refine your questions and/or generate new questions.
WHAT IS EDA?
Find out whether the assumptions about the data that you or your team started out with are correct or way off.
Extract variables or dimensions on which the data can be pivoted.
Determine whether to apply univariate or multivariate analytical techniques.
EDA is typically used for these four goals:
Exploring a single variable and looking at trends over time.
Checking data for errors.
Checking assumptions.
Looking at relationships between variables.
The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to any desired business outcomes and goals. EDA also helps stakeholders by
confirming they are asking the right questions. EDA can help answer questions about
standard deviations, categorical variables, and confidence intervals. Once EDA is complete
and insights are drawn, its features can then be used for more sophisticated data analysis or
modelling, including machine learning.
3.1.2. Various exploratory data analysis methods include:
Descriptive statistics, which give a brief overview of the dataset we are dealing with, including some measures and features of the sample.
Grouping data (basic grouping with group by).
ANOVA, Analysis of Variance, a computational method to divide the variation in a set of observations into different components.
Correlation and correlation methods.
Descriptive Statistics: This is a helpful way to understand the characteristics of your data and to get a quick summary of it. Pandas in Python provides a useful method, describe(). The describe() function applies basic statistical computations to the dataset, such as the extreme values, the count of data points, the standard deviation, etc. Any missing or NaN value is automatically skipped. describe() gives a good picture of the distribution of the data.
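A minimal sketch of describe() on a small made-up dataframe:
import pandas as pd

# hypothetical dataframe with one numeric and one categorical column
df = pd.DataFrame({"age": [25, 30, 35, None, 40], "city": ["A", "B", "A", "B", "A"]})

# count, mean, std, min, quartiles and max for the numeric column; NaN values are skipped
print(df.describe())

# include the categorical column as well (count, unique, top, freq)
print(df.describe(include="all"))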
Grouping data: group by is a useful operation available in pandas that can help us figure out the effect of different categorical attributes on other data variables.
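A minimal sketch of group by on a small made-up dataframe:
import pandas as pd

df = pd.DataFrame({"segment": ["A", "A", "B", "B"], "sales": [10, 20, 30, 40]})

# mean sales per segment shows the effect of the categorical attribute on the numeric one
print(df.groupby("segment")["sales"].mean())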
ANOVA
Under ANOVA we have two measures as a result:
– F-test score: shows the variation between the group means relative to the variation within the groups.
– p-value: shows the significance of the result.
This can be performed using the f_oneway() method of the Python module scipy.
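A minimal sketch using scipy's f_oneway() on three made-up groups of observations:
from scipy.stats import f_oneway

# three hypothetical groups of measurements
group_a = [85, 86, 88, 75, 78]
group_b = [80, 82, 84, 79, 81]
group_c = [60, 65, 70, 68, 72]

# F-test score and p-value for the null hypothesis that all group means are equal
f_score, p_value = f_oneway(group_a, group_b, group_c)
print(f_score, p_value)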
Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate).
Univariate Non-graphical
Multivariate Non-graphical
Univariate graphical
Multivariate graphical
Univariate non-graphical: This is the simplest form of data analysis, as it uses just one variable to explore the data. The standard goal of univariate non-graphical EDA is to understand the underlying sample distribution of the data and to make observations about the population. Outlier detection is also part of this analysis.
Central tendency: The central tendency or location of a distribution has to do with typical or middle values. The commonly used measures of central tendency are the mean, the median, and sometimes the mode, of which the most common is the mean. For a skewed distribution, or when there is concern about outliers, the median may be preferred.
Spread: Spread is an indicator of how far from the centre we are likely to find the data values. The standard deviation and the variance are two useful measures of spread. The variance is the mean of the squares of the individual deviations, and the standard deviation is the square root of the variance.
Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of peakedness compared to a normal distribution. (A short sketch of these measures follows.)
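A minimal sketch of these univariate non-graphical measures with pandas, on a small made-up sample:
import pandas as pd

# hypothetical sample of a single numeric variable
x = pd.Series([2, 4, 4, 5, 7, 9, 30])

print(x.mean(), x.median())   # central tendency
print(x.var(), x.std())       # spread: variance and standard deviation
print(x.skew(), x.kurtosis()) # asymmetry and peakedness relative to a normal distribution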
Multivariate non-graphical: Multivariate non-graphical EDA techniques are generally used to show the relationship between two or more variables in the form of either cross-tabulation or statistics.
Univariate graphical: Non-graphical methods are quantitative and objective, but they do not give a complete picture of the data; graphical methods, which involve a degree of subjective analysis, are therefore also required.
Histogram: The most basic graph is a histogram, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers.
Stem-and-leaf plots: An easy substitute for a histogram is the stem-and-leaf plot. It shows all data values as well as the shape of the distribution.
Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots are excellent at presenting information about central tendency and show robust measures of location and spread, as well as providing information about symmetry and outliers, although they can be misleading about aspects such as multimodality. One of the most effective uses of boxplots is in the form of side-by-side boxplots.
Quantile-normal plots: The final univariate graphical EDA technique is the most intricate. It is called the quantile-normal or QN plot, or more generally the quantile-quantile or QQ plot. It is used to see how well a particular sample follows a particular theoretical distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis. (A sketch follows.)
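A minimal sketch of a quantile-normal (QQ) plot using scipy and matplotlib, on a made-up sample:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# hypothetical sample drawn from a normal distribution
sample = np.random.normal(loc=0, scale=1, size=200)

# quantile-normal plot: points close to the reference line suggest the sample is roughly normal
stats.probplot(sample, dist="norm", plot=plt)
plt.show()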
Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatterplot, which has one variable on the x-axis, one on the y-axis, and a point for every case in the dataset.
Run chart: A line graph of data plotted over time.
Heat map: A graphical representation of data in which values are depicted by colour.
Multivariate chart: A graphical representation of the relationships between factors and a response.
Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Graphical exploratory data analysis employs visual tools to display data, such as:
Box plots
Box plots are used where there is a need to summarize data on an interval scale, such as stock-market data, where the ticks observed in one whole day may be represented in a single box highlighting the lowest value, the highest value, the median and the outliers.
Heatmap
Heatmaps are most often used to represent the correlation between variables. In the example heatmap (a correlation matrix, not reproduced here), there is a strong correlation between density and residual sugar and almost no correlation between alcohol and residual sugar.
Histograms
The histogram is a graphical representation of numerical data that splits the data into ranges. The taller the bar, the greater the number of data points falling in that range. A good example is the height data of a class of students. You would notice that the height data looks like a bell curve for a particular class, with most of the data lying within a certain range and a few values outside it. There will be outliers too, either very short or very tall.
Line graphs: one of the most basic types of charts that plots data points on a graph; has a
wealth of uses in almost every field of study.
Pictograms: replace numbers with images to visually explain data. They’re common in the
design of infographics, as well as visuals that data scientists can use to explain complex
findings to non-data-scientist professionals and the public.
Scattergrams or scatterplots: typically used to display two variables in a set of data and
then look for correlations among the data. For example, scientists might use it to evaluate
the presence of two particular chemicals or gases in marine life in an effort to look for a
relationship between the two variables.
• The father of EDA is John Tukey who officially coined the term in his 1977
masterpiece. Lyle Jones, the editor of the multi-volume “The collected works of John
W. Tukey: Philosophy and principles of data analysis” describes EDA as “an attitude
towards flexibility that is absent of prejudice”.
• The key frame of mind when engaging with EDA, and thus with visual data analysis (VDA), is to approach the dataset with little to no expectation and not to be influenced by rigid parameterisations. EDA demands that we let the data speak for itself. To use the words of Tukey (1977, preface):
• “It is important to understand what you CAN DO before you learn to measure how
WELL you seem to have DONE it… Exploratory data analysis can never be the
whole story, but nothing else can serve as the foundation stone –as the first step.”
• Since the inception of EDA as a unifying class of methods, it has influenced several other major statistical developments, including non-parametric statistics, robust analysis, data mining, and visual data analytics. These classes of methods are motivated by the need to stop relying on rigid, assumption-driven mathematical formulations that often fail to be confirmed by observations.
• EDA is not identical to statistical graphics although the two terms are used almost
interchangeably. Statistical graphics is a collection of techniques--all graphically
based and all focusing on one data characterization aspect. EDA encompasses a larger
venue; EDA is an approach to data analysis that postpones the usual assumptions
about what kind of model the data follow with the more direct approach of allowing
the data itself to reveal its underlying structure and model. EDA is not a mere
collection of techniques; EDA is a philosophy as to how we dissect a data set; what
we look for; how we look; and how we interpret. It is true that EDA heavily uses the
collection of techniques that we call "statistical graphics", but it is not identical to
statistical graphics.
The next step in Data Science Modelling is Data Extraction: not just any data, but the unstructured data relevant to the business problem you are trying to address. Data extraction is done from various sources: online sources, surveys, and existing databases.
This awesome GitHub repository has high-quality datasets.
https://github.com/awesomedata/awesome-public-datasets
Import the dataset & Libraries
The first step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used. Pandas offers tools for cleaning and processing your data; it is the most popular Python library used for data analysis. In pandas, a data table is called a dataframe.
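A minimal sketch of this step (the file name "data.csv" is only a placeholder):
import pandas as pd
import numpy as np

# read a CSV file into a dataframe and look at the first five rows
dataset = pd.read_csv("data.csv")
print(dataset.head())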
Dealing with Missing Values
Sometimes we may find that some data are missing from the dataset. If so, we can either remove those rows or calculate the mean, median or mode of the feature and replace the missing values with it. This is an approximation, and it can add variance to the dataset.
#Check for null values: dataset.isna() or dataset.isnull() shows the null values in the dataset.
#Drop null values: pandas provides a dropna() function that can be used to drop either rows or columns with missing data.
#Replace null values with a strategy: for replacing null values we use a strategy that can be applied to a feature with numeric data. We can calculate the mean, median or mode of the feature and replace the missing values with it.
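A minimal sketch of these options, assuming the dataframe loaded above is named dataset and that "salary" is an illustrative numeric column:
# count missing values per column
print(dataset.isnull().sum())

# option 1: drop rows that contain any missing value
cleaned = dataset.dropna()

# option 2: replace missing values in a numeric column with its mean
dataset["salary"] = dataset["salary"].fillna(dataset["salary"].mean())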
De-duplication means removing all duplicate values. There is no need for duplicate values in data analysis; they only affect the accuracy and efficiency of the analysis result. To find duplicate values in the dataset we use a simple dataframe function, duplicated(). Let's see the example:
dataset.duplicated()
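A minimal sketch, continuing with the same dataframe:
# number of duplicated rows
print(dataset.duplicated().sum())

# remove duplicate rows, keeping the first occurrence of each
dataset = dataset.drop_duplicates()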
Feature Scaling
The final step of data preprocessing is to apply the very important feature scaling.
Feature Scaling is a technique to standardize the independent features present in the
data in a fixed range. It is performed during the data pre-processing.
Why scaling: most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.
Standardization and Normalization
Data Standardization and Normalization is a common practice in machine learning.
Standardization is another scaling technique where the values are centered around the
mean with a unit standard deviation. This means that the mean of the attribute
becomes zero and the resultant distribution has a unit standard deviation.
Normalization is a scaling technique in which values are shifted and rescaled so that
they end up ranging between 0 and 1. It is also known as Min-Max scaling.
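A minimal sketch of both techniques using scikit-learn (assuming scikit-learn is installed; the columns are made up):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

df = pd.DataFrame({"age": [20, 35, 50, 65], "salary": [20000, 40000, 60000, 80000]})

# standardization: each column gets zero mean and unit standard deviation
standardized = StandardScaler().fit_transform(df)

# normalization (min-max scaling): each column is rescaled to the range [0, 1]
normalized = MinMaxScaler().fit_transform(df)
print(standardized)
print(normalized)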
Step 4: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a robust technique for familiarising yourself with
Data and extracting useful insights. Data Scientists sift through Unstructured Data to
find patterns and infer relationships between Data elements. Data Scientists use
Statistics and Visualisation tools to summarise Central Measurements and variability
to perform EDA.
Step 5: Feature Selection
Feature Selection is the process of identifying and selecting the features that
contribute the most to the prediction variable or output that you are interested in,
either automatically or manually.
The presence of irrelevant characteristics in your data can reduce model accuracy and cause your model to train on irrelevant features. Conversely, if the selected features are strong, the machine learning algorithm will give much better outcomes (a sketch of one simple selection method follows the list below).
Two types of characteristics must be addressed:
Consistent characteristics that are unlikely to change.
Variable characteristics whose values change over time
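As one illustrative approach (not prescribed by the text), scikit-learn's univariate selection ranks features by an ANOVA F-score against the target; a minimal sketch assuming scikit-learn is installed:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# keep the 2 features with the highest F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)   # score of every original feature
print(X_selected.shape)   # (150, 2)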
Step 6: Incorporating Machine Learning Algorithms
This is one of the most crucial processes in Data Science Modelling as the Machine
Learning Algorithm aids in creating a usable Data Model. There are a lot of
algorithms to pick from, the Model is selected based on the problem. There are three
types of Machine Learning methods that are incorporated:
1) Supervised Learning
It is based on the results of a previous operation that is related to the existing business
operation. Based on previous patterns, Supervised Learning aids in the prediction of
an outcome. Some of the Supervised Learning Algorithms are:
Linear Regression
Random Forest
Support Vector Machines
2) Unsupervised Learning
This form of learning has no pre-existing consequence or pattern. Instead, it
concentrates on examining the interactions and connections between the presently
available Data points. Some of the Unsupervised Learning Algorithms are:
K-means Clustering
Hierarchical Clustering
Anomaly Detection
3.5 DATA VISUALIZATION
Data visualization is the process of translating large data sets and metrics into charts,
graphs and other visuals.
The resulting visual representation of data makes it easier to identify and share real-
time trends, outliers, and new insights about the information represented in the data.
Data visualization is one of the steps of the data science process, which states that
after data has been collected, processed and modeled, it must be visualized for
conclusions to be made.
Data visualization is also an element of the broader data presentation architecture
(DPA) discipline, which aims to identify, locate, manipulate, format and deliver data
in the most efficient way possible.
It’s hard to think of a professional industry that doesn’t benefit from making data
more understandable. Every STEM field benefits from understanding data—and so do
fields in government, finance, marketing, history, consumer goods, service industries,
education, sports, and so on. And, since visualization is so prolific, it’s also one of the
most useful professional skills to develop. The better we can convey the points
visually, whether in a dashboard or a slide deck, the better we can leverage that
information. The concept of the citizen data scientist is on the rise. Skill sets are
changing to accommodate a data-driven world. It is increasingly valuable for
professionals to be able to use data to make decisions and to use visuals to tell stories when data informs the who, what, when, where, and how. While traditional education
typically draws a distinct line between creative storytelling and technical analysis, the
modern professional world also values those who can cross between the two: data
visualization sits right in the middle of analysis and visual storytelling.
Some examples of Data Visualization
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards
• More specific examples of methods to visualize data:
• Area Chart
• Bar Chart
• Box-and-whisker Plots
• Bubble Cloud
• Bullet Graph
• Cartogram
• Circle View
• Dot Distribution Map
• Gantt Chart
• Heat Map
• Highlight Table
• Histogram
• Matrix
• Network
• Polar Area
• Radial Tree
• Scatter Plot (2D or 3D)
• Streamgraph
• Text Tables
• Timeline
• Treemap
• Wedge Stack Graph
• Word Cloud
• And any mix-and-match combination in a dashboard
• How can these graphs be made to look good for publication or presentation? A number of dedicated data visualization tools can help:
1. Tableau
Pros:
Cons:
Inflexible pricing
No option for auto-refresh
Restrictive imports
Manual updates for static features
2. Power BI
Power BI, Microsoft's easy-to-use data visualization tool, is available for both on-
premise installation and deployment on the cloud infrastructure. Power BI is one of
the most complete data visualization tools that supports a myriad of backend
databases, including Teradata, Salesforce, PostgreSQL, Oracle, Google Analytics,
Github, Adobe Analytics, Azure, SQL Server, and Excel. The enterprise-level tool
creates stunning visualizations and delivers real-time insights for fast decision-
making.
High-grade security
No speed or memory constraints
Compatible with Microsoft products
3. Dundas BI
Exceptional flexibility
A large variety of data sources and charts
Wide range of in-built features for extracting, displaying, and modifying data
4. JupyteR
A web-based application, JupyteR, is one of the top-rated data visualization tools that
enable users to create and share documents containing visualizations, equations,
narrative text, and live code. JupyteR is ideal for data cleansing and transformation,
statistical modeling, numerical simulation, interactive computing, and machine
learning.
Pros:
Rapid prototyping
Visually appealing results
Facilitates easy sharing of data insights
Cons:
Tough to collaborate
At times code reviewing becomes complicated
5. Zoho Reports
6. GoogleCharts
One of the major players in the data visualization market space, Google Charts, coded
with SVG and HTML5, is famed for its capability to produce graphical and pictorial
data visualizations. Google Charts offers zoom functionality, and it provides users
with unmatched cross-platform compatibility with iOS, Android, and even the earlier
versions of the Internet Explorer browser.
User-friendly platform
Easy to integrate data
Visually attractive data graphs
Compatibility with Google products.
7. Sisense
Regarded as one of the most agile data visualization tools, Sisense gives users access to
instant data analytics anywhere, at any time. The best-in-class visualization tool can identify
key data patterns and summarize statistics to help decision-makers make data-driven
decisions.
8. Plotly
An open-source data visualization tool, Plotly offers full integration with analytics-
centric programming languages like Matlab, Python, and R, which enables complex
visualizations. Widely used for collaborative work, disseminating, modifying,
creating, and sharing interactive, graphical data, Plotly supports both on-premise
installation and cloud deployment.
9. Data Wrapper
Data Wrapper is one of the very few data visualization tools on the market that is
available for free. It is popular among media enterprises because of its inherent ability
to quickly create charts and present graphical statistics on Big Data. Featuring a
simple and intuitive interface, Data Wrapper allows users to create maps and charts
that they can easily embed into reports.
10. QlikView
A major player in the data visualization market, Qlikview provides solutions to over 40,000
clients in 100 countries. Qlikview's data visualization tool, besides enabling accelerated,
customized visualizations, also incorporates a range of solid features, including analytics,
enterprise reporting, and Business Intelligence capabilities.
Pros:
User-friendly interface
Appealing, colorful visualizations
Trouble-free maintenance
A cost-effective solution
Cons:
RAM limitations
Poor customer support
Does not include the 'drag and drop' feature
Python offers multiple great graphing libraries that come packed with lots of different
features.
Plotly: can create interactive plots
Matplotlib
Seaborn
Conceptualized and built originally at Stanford University, this library sits on top of matplotlib. In a sense, it has some flavors of matplotlib, while from the visualization point of view it is much better than matplotlib and has added features as well. Below are its advantages:
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples
Bokeh
Plotly
Plotly includes scientific charts, 3D graphs, statistical charts, and financial charts, among others.
Plotly graphs can be viewed in Jupyter notebooks, standalone HTML files, or hosted
online. Plotly library provides options for interaction and editing. The robust API
works perfectly in both local and web browser mode.
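A minimal sketch of an interactive Plotly chart using the plotly.express interface and its bundled iris sample data:
import plotly.express as px

# small sample dataset bundled with plotly
df = px.data.iris()

# interactive scatter plot; hovering shows the values of each point
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()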
ggplot
Getting Information about the Dataset
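The walkthrough below assumes the iris data has already been read into a dataframe named iris_data. A minimal loading sketch (the file name "Iris.csv" is only a placeholder; the copy bundled with seaborn uses the lowercase column names that appear in the later examples):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# either read a local CSV copy of the iris data ...
# iris_data = pd.read_csv("Iris.csv")
# ... or use the copy bundled with seaborn (columns: sepal_length, sepal_width,
# petal_length, petal_width, species)
iris_data = sns.load_dataset("iris")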
We will use the shape parameter to get the shape of the dataset.
iris_data.shape
Output:
(150, 6)
We can see that the dataframe contains 6 columns and 150 rows.
Gaining information from data
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
We can see that only one column has categorical data and all the other columns are of the
numeric type with non-Null entries.
Data Insights:
1. No column has any null entries.
2. Four columns are of numerical type.
3. Only a single column is of categorical type.
Statistical Insight
iris_data.describe()
Data Insights:
Mean values
Standard deviation
Minimum values
Maximum values
Checking Missing Values
We will check if our data contains any missing values or not. Missing values can
occur when no information is provided for one or more items or for a whole unit. We
will use the isnull() method.
iris_data.isnull().sum()
There are 3 duplicate rows, so we must check whether each species is still balanced in numbers or not.
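A quick way to confirm the duplicate count (a sketch using the same dataframe):
# number of duplicated rows in the iris data
print(iris_data.duplicated().sum())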
Checking the balance
iris_data['species'].value_counts()
Therefore we shouldn't delete these entries, as doing so might imbalance the datasets and make them less useful for valuable insights.
Data Visualization
Visualizing the target column
Our target column will be the species column because, in the end, we will need the result according to the species only. Note: we will use the Matplotlib and Seaborn libraries for data visualization.
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

plt.title('Species Count')
sns.countplot(x='species', data=iris_data)
plt.show()
Data Insight:
This further visualizes that the species are well balanced.
Each species (Iris virginica, setosa, versicolor) has a count of 50.
Bi-variate Analysis
Comparison between various species based on sepal length and width
plt.figure(figsize=(17, 9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'], hue=iris_data['species'], s=50)
plt.show()
Data Insights:
The Iris Setosa species has a smaller sepal length but a higher sepal width.
Versicolor lies almost in the middle for both length and width.
Virginica has larger sepal lengths and smaller sepal widths.
Comparison between various species based on petal length and width
plt.figure(figsize=(16, 9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'], hue=iris_data['species'], s=50)
plt.show()
Data Insights
Setosa species have the smallest petal length as well as petal width
Versicolor species have average petal length and petal width
Virginica species have the highest petal length as well as petal width
Let's plot all the columns' relationships using a pairplot. It can be used for multivariate analysis.
sns.pairplot(iris_data, hue='species', height=4)
Data Insights:
High correlation between the petal length and petal width columns.
Setosa has both low petal length and width.
Versicolor has both average petal length and width.
Virginica has both high petal length and width.
The sepal width for Setosa is high and its sepal length is low.
Versicolor has average values for the sepal dimensions.
Virginica has a small sepal width but a large sepal length.
The heatmap is a data visualization technique that is used to analyze the dataset as colors
in two dimensions. Basically, it shows a correlation between all numerical variables in the
dataset. In simpler terms, we can plot the above-found correlation using the heatmaps.
Checking Correlation
plt.figure(figsize=(10, 11))
# numeric_only avoids errors from the categorical species column in recent pandas versions
sns.heatmap(iris_data.corr(numeric_only=True), annot=True)
plt.show()
Data Insights:
Sepal Length and Sepal Width features are slightly correlated with each other
Checking Mean & Median Values for each species
iris_data.groupby('species').agg(['mean', 'median'])
Visualizing the distribution, mean and median using box plots and violin plots.
Box plots to know about the distribution: the boxplots below show how the categorical feature "species" is distributed with respect to each of the other four input variables.
fig, axes = plt.subplots(2, 2, figsize=(16, 9))
sns.boxplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1])
plt.show()
Data Insights:
The violin plot shows density of the length and width in the species. The thinner part
denotes that there is less density whereas the fatter part conveys higher density
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
sns.violinplot(y='petal_width', x='species', data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y='petal_length', x='species', data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y='sepal_length', x='species', data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y='sepal_width', x='species', data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()
Data Insights:
Setosa has less spread and density in the case of petal length and width.
Versicolor is distributed in an average manner and has average features in the case of petal length and width.
Virginica is highly distributed, with a large number of values and features, in the case of sepal length and width.
High-density regions depict the mean/median values; for example, Iris Setosa has its highest density at 5.0 cm (sepal length feature), which is also the median value (5.0) as per the table.
Mean / Median Table for reference
Plot 1 | Classification feature : Sepal Length
Plot 3 | Classification feature : Petal Length
Plot 3 shows that petal length is a good classification feature, as it clearly separates the species. The overlap is extremely small (between Versicolor and Virginica), and Setosa is well separated from the other two.
Just like Plot 3, Plot 4 also shows that petal width is a good classification feature. The overlap is significantly small (between Versicolor and Virginica), and Setosa is well separated from the other two.
Choosing Plot 3 (classification feature: petal length) to distinguish among the species.
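A sketch of how per-feature distribution plots such as Plot 1 and Plot 3 can be drawn with seaborn, continuing with the imports used above (the exact styling of the original plots is not given in the text, so this is only an approximation):
# distribution of petal length for each species on one set of axes
sns.FacetGrid(iris_data, hue="species", height=5) \
    .map(sns.histplot, "petal_length", kde=True) \
    .add_legend()
plt.show()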
Why Tableau?
Tableau is greatly used because data can be analyzed very quickly with it. Also,
visualizations are generated as dashboards and worksheets. Tableau allows one
to create dashboards that provide actionable insights and drive the business forward.
Tableau products always operate in virtualized environments when they are
configured with the proper underlying operating system and hardware. Tableau
is used by data scientists to explore data with limitless visual analytics.
Features of Tableau
Tableau Dashboard
Collaboration and Sharing
Live and In-memory Data
Data Sources in Tableau
Advanced Visualizations
Mobile View
Revision History
Licensing Views
Subscribe others
ETL Refresh and many more make Tableau one of the most famous Data
Visualization tools.
Figure 3.2: Tableau Product Suite
For a clear understanding, data analytics in the Tableau tool can be classified into two sections.
Developer Tools: The Tableau tools used for development, such as the creation of dashboards, charts, reports and visualizations, fall into this category. The Tableau products under this category are Tableau Desktop and Tableau Public.
Sharing Tools: As the name suggests, the purpose of these Tableau products is
sharing the visualizations, reports, dashboards that were created using the developer
tools. Products that fall into this category are Tableau Online, Server, and Reader.
Tableau Desktop
Tableau Desktop has a rich feature set and allows you to code and customize reports.
Right from creating the charts, reports, to blending them all together to form a
dashboard, all the necessary work is created in Tableau Desktop.
For live data analysis, Tableau Desktop provides connectivity to Data Warehouse, as
well as other various types of files. The workbooks and the dashboards created here
can be either shared locally or publicly.
Based on the connectivity to the data sources and publishing option, Tableau Desktop
is classified into
Tableau Desktop Personal: The development features are similar to Tableau Desktop. The Personal version keeps the workbook private, and access is limited. The workbooks cannot be published online; therefore, they must be distributed either offline or in Tableau Public.
Tableau Desktop Professional: It is pretty much similar to Tableau Desktop. The difference is that work created in Tableau Desktop can be published online or to Tableau Server. Also, the Professional version has full access to all sorts of data types. It is best suited for those who wish to publish their work to Tableau Server.
Tableau Public
It is the Tableau version built specially for cost-conscious users. The word "Public" means that the workbooks created cannot be saved locally; instead, they must be saved to Tableau's public cloud, which can be viewed and accessed by anyone. There is no privacy for the files saved to the cloud, since anyone can download and access them. This version is best for individuals who want to learn Tableau and for those who want to share their data with the general public.
Tableau Server
The software is specifically used to share the workbooks, visualizations that are
created in the Tableau Desktop application across the organization. To share
dashboards in the Tableau Server, you must first publish your work in the Tableau
Desktop. Once the work has been uploaded to the server, it will be accessible only to
the licensed users.
However, it is not necessary for the licensed users to have Tableau Server installed on their machines. They just require the login credentials with which they can check reports via a web browser. Security is high in Tableau Server, and it is well suited for quick and effective sharing of data in an organization.
The admin of the organization will always have full control over the server. The
hardware and the software are maintained by the organization.
Tableau Online
As the name suggests, it is an online sharing tool of Tableau. Its functionalities are
similar to Tableau Server, but the data is stored on servers hosted in the cloud which
are maintained by the Tableau group.
There is no storage limit on the data that can be published in the Tableau Online.
Tableau Online creates a direct link to over 40 data sources that are hosted in the
cloud such as the MySQL, Hive, Amazon Aurora, Spark SQL and many more.
To publish, both Tableau Online and Tableau Server require workbooks created with Tableau Desktop. Data streamed from web applications, such as Google Analytics and Salesforce.com, is also supported by Tableau Server and Tableau Online.
Tableau Reader
Tableau Reader is a free tool which allows you to view the workbooks and
visualizations created using Tableau Desktop or Tableau Public. The data can be
filtered but editing and modifications are restricted. The security level is zero in
Tableau Reader as anyone who gets the workbook can view it using Tableau Reader.
If you want to share the dashboards that you have created, the receiver should have
Tableau Reader to view the document.
3.8.2. How does Tableau work?
Tableau connects to and extracts the data stored in various places. It can pull data from almost any platform: a simple source such as an Excel file or a PDF, a complex database like Oracle, a database in the cloud such as Amazon Web Services, Microsoft Azure SQL Database or Google Cloud SQL, and various other data sources.
When Tableau is launched, ready data connectors are available which allow you to connect to any database. Depending on the version of Tableau that you have purchased, the number of data connectors supported by Tableau will vary.
The pulled data can either be connected live or extracted into Tableau's data engine, Tableau Desktop. This is where data analysts and data engineers work with the data that was pulled and develop visualizations. The created dashboards are shared with users as static files, and the users who receive the dashboards view them using Tableau Reader.
The data from the Tableau Desktop can be published to the Tableau server. This is an
enterprise platform where collaboration, distribution, governance, security model,
automation features are supported. With the Tableau server, the end users have a
better experience in accessing the files from all locations be it a desktop, mobile or
email.
3.8.3. Tableau Uses- Following are the main uses and applications of Tableau:
Business Intelligence
Data Visualization
Data Collaboration
Data Blending
Real-time data analysis
Query translation into visualization
To import large size of data
To create no-code data queries
To manage large size metadata
3.8.4. Excel Vs. Tableau
Both Excel and Tableau are data analysis tools, but each tool has its own approach to data exploration. However, the analysis in Tableau is more powerful than in Excel. Excel works with rows and columns in spreadsheets, whereas Tableau enables exploring Excel data using its drag-and-drop feature and formats the data into graphs and pictures that are easily understandable.
Purpose: Excel is a spreadsheet application used for manipulating the data, while Tableau is a dedicated visualization tool used for analysis.
Connectivity: Excel connects to around 60 applications, while Tableau connects to over 250 applications.
Figure 3.3: Tableau Work page
Source: Local
Menu Bar: Here you’ll find various commands such as File, Data, and Format.
Toolbar Icon: The toolbar contains a number of buttons that enable you to perform
various tasks with a click, such as Save, Undo, and New Worksheet.
Dimension Shelf: This shelf contains all the categorical columns, for example categories, segments, gender, name, etc.
Measure Shelf: This shelf contains all the numerical columns, like profit, total sales, discount, etc.
Page Shelf: This shelf is used for joining pages and creating animations; we will come to it later.
Filter Shelf: You can choose which data to include and exclude using the Filters shelf. For example, you might want to analyze the profit for each customer segment, but only for certain shipping containers and delivery times. You can make a view like this by placing fields on the Filters shelf.
Marks Card: The visualization can be designed using the Marks card. The Marks card can be used to change the data components of the visualization, such as color, size, shape, path, label, and tooltip.
Worksheet: In the workbook, the worksheet is where the real visualization can be seen. The worksheet contains information about the visual's design and functionality.
Data Source: Using the Data Source tab we can add new data, and modify or remove existing data.
Current Sheet: The current sheets are those we have created, and we can give them names.
New Sheet: If we want to create a new worksheet (blank canvas), we can do so using this tab.
New Dashboard: This button is used to create a dashboard canvas.
New Storyboard: It is used to create a new story.
QUESTION BANK

Part-A (Q.No, Question, Competence, BT Level)
9. List the steps involved in data preprocessing. (Understand, BTL 2)
10. Define feature scaling. (Understand, BTL 2)
11. List some popular plotting libraries of Python used in data visualization. (Understand, BTL 2)

Part-B (Q.No, Question, Competence, BT Level)
5. Explain about the different stages of preprocessing. (Analysis, BTL 4)
6. Discuss in detail about preprocessing and data cleaning stages. (Analysis, BTL 4)
7. Explain the following: 1. Univariate non-graphical, 2. Multivariate non-graphical, 3. Univariate graphical, 4. Multivariate graphical. (Analysis, BTL 4)
8. Discuss about the different data visualization tools. (Analysis, BTL 4)
9. Explain about the Tableau product suite. (Analysis, BTL 4)
10. How to analyse the data insights for the iris dataset? (Create, BTL 5)