AD3301 DEV QB (All Units)
AD3301 DATA EXPLORATION AND VISUALIZATION                L T P C
                                                         3 0 2 4
OBJECTIVES:
● To outline an overview of exploratory data analysis.
● To implement data visualization using Matplotlib.
● To perform univariate data exploration and analysis.
● To apply bivariate data exploration and analysis.
● To use Data exploration and visualization techniques for multivariate and time series data.
UNIT I EXPLORATORY DATA ANALYSIS 9
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data – Comparing EDA with classical
and Bayesian analysis – Software tools for EDA - Visual Aids for EDA- Data transformation techniques-merging database,
reshaping and pivoting, Transformation techniques - Grouping Datasets - data aggregation – Pivot tables and cross-tabulations.
UNIT II VISUALIZING USING MATPLOTLIB 9
Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three-dimensional plotting - Geographic Data with Basemap -
Visualization with Seaborn.
UNIT III UNIVARIATE ANALYSIS 9
Introduction to Single variable: Distributions and Variables - Numerical Summaries of Level and Spread - Scaling and Standardizing
– Inequality - Smoothing Time Series.
UNIT IV BIVARIATE ANALYSIS 9
Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling Several Batches -
Scatterplots and Resistant Lines – Transformations.
UNIT V MULTIVARIATE AND TIME SERIES ANALYSIS 9
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and Beyond - Longitudinal Data –
Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time-based indexing – Visualizing – Grouping –
Resampling.

45 PERIODS

PRACTICAL EXERCISES: 30 PERIODS

1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/ Power BI.
2. Perform exploratory data analysis (EDA) with datasets like an email data set. Export all your emails as a dataset, import
them inside a pandas data frame, visualize them, and get different insights from the data.
3. Working with NumPy arrays, Pandas data frames, and basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and
visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user
interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world; states and districts in India
etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and present an analysis report.

TOTAL: 75 PERIODS

TEXT BOOKS:

1. Suresh Kumar Mukhiya, Usman Ahmed, “Hands-On Exploratory Data Analysis with Python”, Packt Publishing, 2020.
(Unit 1)
2. Jake Vander Plas, "Python Data Science Handbook: Essential Tools for Working with Data", Oreilly, 1st Edition, 2016.
(Unit 2)
3. Catherine Marsh, Jane Elliott, “Exploring Data: An Introduction to Data Analysis for Social Scientists”, Wiley
Publications, 2nd Edition, 2008. (Unit 3,4,5)

REFERENCES:
1. Eric Pimpler, Data Visualization and Exploration with R, GeoSpatial Training service, 2017.
2. Claus O. Wilke, “Fundamentals of Data Visualization”, O’reilly publications, 2019.
3. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization: Foundations, Techniques, and
Applications”, 2nd Edition, CRC press, 2015.

UNIT I EXPLORATORY DATA ANALYSIS
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA-
Data transformation techniques-merging database, reshaping and pivoting, Transformation techniques -
Grouping Datasets - data aggregation – Pivot tables and cross-tabulations.

UNIT-I / PART-A
1. What is data?
Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements,
observations, or even descriptions of things.
2. What is a dataset? Give an example.
● A dataset contains many observations about a particular object.
● For instance, a dataset about patients in a hospital can contain many observations.
● A patient can be described by a patient identifier (ID), name, address, weight, date of birth,
address, email, and gender.
● Each of these features that describe a patient is a variable. Each observation can have a
specific value for each of these variables.
● For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
3. What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.
4. Write some of the data summarization techniques.
Some of the techniques used for data summarization are
● summary tables
● graphs
● descriptive statistics
● inferential statistics
● correlation statistics
● searching
● grouping
● mathematical models
5. Write briefly about the data collection phase.
● Data collected from several sources must be stored in the correct format and transferred to the
right information technology personnel within a company.
● Data can be collected from several objects during several events using different types of
sensors and storage tools.
6. Write a short note on data cleaning.
● Data must be correctly transformed for an incompleteness check, duplicates check, error
check, and missing value check.
● These tasks are performed in the data cleaning stage, which involves responsibilities such as
matching the correct record, finding inaccuracies in the dataset, understanding the overall
data quality, removing duplicate items, and filling in the missing values.
7. What is a categorical dataset? Give two examples.
● This type of data represents the characteristics of an object.
● This data is often referred to as qualitative datasets in statistics.
● Examples of categorical data:
o Gender (Male, Female, Other, or Unknown)
o Blood type (A, B, AB, or O)
8. List the methods involved in the data preparation step.
The data preparation step involves
● defining the sources of data
● defining data schemas and tables
● understanding the main characteristics of the data
● cleaning the dataset
● deleting non-relevant datasets
● transforming the data
● dividing the data into required chunks for analysis
9. What is data visualization?
Data visualization deals with information relay techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed result.
.10 Write short notes on the significance of EDA.
● It is practically impossible to make sense of datasets containing more than a handful of data
points without the help of computer programs.
● Exploratory data analysis is key, and usually the first exercise in data mining.
● It allows us to visualize data to understand it as well as to create hypotheses for further
analysis.
● The exploratory analysis centers around creating a synopsis of data or insights for the next
steps in a data mining project.
● EDA actually reveals the ground truth about the content without making any underlying
assumptions.
.11 List the expert tools for exploratory analysis and mention their purpose.
Python provides expert tools for exploratory analysis:
● pandas for summarization
● scipy for statistical analysis
● matplotlib and plotly for visualizations
.12 List the common tasks in the data processing stage.
The common tasks in the data processing stage include
● exporting the dataset
● placing them under the right tables
● structuring them, and
● exporting them in the correct format
13. Write short notes on nominal scales.
● These are used for labeling variables without any quantitative value.
● They are generally referred to as labels.
● These scales are mutually exclusive and do not carry any numerical importance.
● Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.
● No form of arithmetic calculation can be made on nominal measures.
● Examples:
o The languages that are spoken in a particular country
o Biological species
.14 Give an example of an ordinal scale using the Likert scale.
Consider a question: “WordPress is making content managers' lives easier. How do you feel about
this statement?”
The answer to the question is scaled down to five different ordinal values, Strongly
Agree, Agree, Neutral, Disagree, and Strongly Disagree.
.15 Write short notes on interval scales.
● In interval scales, both the order and exact differences between the values are significant.
● Interval scales are widely used in statistics, for example, in measures of central tendency such as
the mean, median, and mode, and in measures of spread such as the standard deviation.
● Examples include location in Cartesian coordinates and direction measured in degrees from
magnetic north.
.16 Give short notes on ratio scales.
● Ratio scales contain order, exact values, and absolute zero.
● They are used in descriptive and inferential statistics.
● These scales provide numerous possibilities for statistical analysis.
● Mathematical operations, the measure of central tendencies, and the measure of
dispersion and coefficient of variation can also be computed from such scales.
● Examples: the measure of energy, mass, length, duration, electrical energy, plan angle, and
volume.
.17 Compare EDA with classical data analysis.
● Classical data analysis: The problem definition and data collection step are followed by
model development, which is followed by analysis and result communication.
● Exploratory data analysis approach: It follows the same approach as classical data analysis
except for the model imposition, and the data analysis steps are swapped. The main focus is
on the data, its structure, outliers, models, and visualizations.
.18 List the software tools available for EDA.
● Python - widely used in data analysis, data mining, and data science
● R programming language - widely utilized in statistical computation and graphical data
analysis
● Weka - involves several EDA tools and algorithms
● KNIME - an open-source tool for data analysis and is based on Eclipse
.19 Write short notes on matplotlib.
● Matplotlib provides a huge library of customizable plots, along with a comprehensive set of
backends.
● It can be utilized to create professional reporting applications, interactive analytical
applications, complex dashboard applications, web/GUI applications, embedded views, etc.
20. Write the Python code for plotting the pie chart for the pokemon dataset.
import matplotlib.pyplot as plt
plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False,
startangle=90, autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
.21 Write the guidelines for choosing the best chart.
● For continuous variables, a histogram would be a good choice. To show ranking, an ordered
bar chart would be a good choice.
● The chart that effectively conveys the right and relevant meaning of the data without actually
distorting the facts should be chosen.
● It is better to draw a simple chart that is comprehensible than to draw sophisticated ones that
require several reports and texts in order to understand them.
● A chart should illustrate abstract information in a clear way (i.e.) A diagram that does not
overload the audience with information should be chosen.
.22 What is data transformation?
Data transformation is a set of techniques used to convert data from one format or structure to
another format or structure.
.23 List some of the transformation activities.
● Data deduplication
● Key restructuring
● Data cleansing
● Data validation
● Format revisioning
● Data derivation
● Data aggregation
● Data integration
● Data filtering
● Data joining
.24 Write two benefits of data transformation.
● Data transformation promotes interoperability between several applications. The main reason
for creating a similar format and structure in the dataset is that it becomes compatible with
other systems.
● Data transformation ensures higher performance and scalability for modern analytical
databases and dataframes.
.25 Give short notes on the challenges of data transformation.
● Data transformation requires a qualified team of experts and state-of-the-art infrastructure.
● The cost of attaining such experts and infrastructure can increase the cost of the operation.
● It requires data cleaning before data transformation and data migration. This process of
cleansing can be expensive and time-consuming.
● The activities of data transformations involve batch processing. This means that sometimes,
we might have to wait for a day before the next batch of data is ready for cleansing. This can
be very slow.
.26 What is data aggregation? List two most frequently used aggregations.
Aggregation is the process of implementing any mathematical operation on a dataset or a subset of
it. It is one of the many techniques in pandas that are used to manipulate the data in the dataframe
for data analysis.
● sum: Returns the sum of the values for the requested axis
● min: Returns the minimum of the values for the requested axis
● max: Returns the maximum of the values for the requested axis
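A minimal pandas sketch of these aggregations (the dataframe and the dept/salary columns are hypothetical, not from the textbook):
import pandas as pd

df = pd.DataFrame({"dept": ["A", "A", "B", "B"],
                   "salary": [100, 120, 90, 110]})
# Apply several aggregations to one column at once
print(df["salary"].agg(["sum", "min", "max"]))
# The same aggregations computed per group
print(df.groupby("dept")["salary"].agg(["sum", "min", "max"]))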
.27 Give two reasons for a value to be NaN.
● Missing values due to data collection errors.
● Reindexing of data can result in incomplete data.
.28 Write the code snippet that transforms the index terms of a dataframe into capital letters.
To transform the index terms of a dataframe into capital letters:
dframe1.index = dframe1.index.map(str.upper)
dframe1

.29 Write the code snippet to create a transformed version of the dataframe.
To create a transformed version of the dataframe, the rename() method can be used. By default this
method returns a transformed copy and does not modify the original data (pass inplace=True to do so).
dframe1.rename(index=str.title, columns=str.upper)

.30 What are outliers? Why should we detect and filter them?
● Outliers are data points that diverge from other observations for several reasons. During the
EDA phase, one of our common tasks is to detect and filter these outliers.
● The main reason for this detection and filtering of outliers is that the presence of such outliers
can cause serious issues in statistical analysis
.31 Write short notes on groupby function.
● The pandas groupby function is one of the most efficient and time-saving features for
categorizing a dataset into multiple categories or groups.
● Groupby provides functionalities that allow us to split-apply-combine throughout the
dataframe.
.32 Write the two essential functions of the groupby function.
● It splits the data into groups based on some criteria.
● It applies a function to each group independently.
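A minimal sketch of this split-apply-combine idea with pandas groupby (the data are hypothetical):
import pandas as pd

df = pd.DataFrame({"city": ["Bergen", "Bergen", "Oslo", "Oslo"],
                   "sales": [10, 15, 7, 12]})
# Split the rows by city, apply mean() to each group, combine into one Series
print(df.groupby("city")["sales"].mean())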
.33 How will you consistently rearrange the data in a dataframe?
The data in a dataframe can be rearranged in some consistent manner with hierarchical indexing
using two actions:
● Stacking: Stack rotates from any particular column in the data to the rows.
● Unstacking: Unstack rotates from the rows into the column.
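A small illustrative sketch of stacking and unstacking on a hypothetical dataframe:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["row1", "row2"], columns=["a", "b", "c"])
stacked = df.stack()        # columns rotate into an inner level of the row index
print(stacked)
print(stacked.unstack())    # the inner row level rotates back into columns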
.34 Write short notes on the pivot table.
● The pandas.pivot_table() function creates a spreadsheet-style pivot table as a dataframe.
● The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the
index and columns of the resulting dataframe.
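An illustrative pivot_table() call on hypothetical sales data:
import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South", "South"],
                      "product": ["pen", "book", "pen", "book"],
                      "amount": [10, 20, 15, 25]})
# Rows are regions, columns are products, cells hold the summed amounts
print(pd.pivot_table(sales, values="amount", index="region",
                     columns="product", aggfunc="sum"))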
.35 Write briefly about types of joins.
● The inner join takes the intersection from two or more dataframes.
● The outer join takes the union from two or more dataframes.
● The left join uses the keys from the left-hand dataframe only.
● The right join uses the keys from the right-hand dataframe only.
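The four join types can be sketched with pandas merge() on two hypothetical dataframes that share a key column:
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})
print(pd.merge(left, right, on="key", how="inner"))   # intersection of keys
print(pd.merge(left, right, on="key", how="outer"))   # union of keys
print(pd.merge(left, right, on="key", how="left"))    # keys from the left frame only
print(pd.merge(left, right, on="key", how="right"))   # keys from the right frame only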
UNIT-I / PART-B
1. Elucidate the steps in EDA.
2. Explain the types of data with suitable examples.
3. Explain the process of creating a) line chart b) bar chart. Show the outputs.
4. Generate scatter plots for the Iris dataset.
5. Consider the following use case. A survey created in vocational training sessions of developers had
100 participants. They had several years of Python programming experience ranging from 0 to 20.
Perform the following:
a) Import the required libraries and create the dataset
b) Plot the histogram chart
c) Plot a normal distribution using the mean and standard deviation of this data to see the
distribution pattern
6. Generate a lollipop chart for the carDF dataset.
7. How to identify and count the missing values? Explain in detail.
8. Describe the data deduplication technique in detail.
9. Describe the various techniques for filling missing values.
10. Describe dropping missing values in detail.
11. Explain discretization and binning in detail with a suitable example.
12. How can we perform permutation and random sampling using the pandas library? Write the steps
involved to compute
a. random sampling without replacement
b. random sampling with replacement
13. Explain reshaping and pivoting with an example.
14. Write python code to perform the following using NumPy:
a) Creation of different types of NumPy arrays
b) Display basic information, such as the data type, shape, size, and strides of a NumPy array
c) Creation of an array using built-in NumPy functions
15. Write briefly about pivot tables. Write python code to perform the following customizations on the
carDF dataset.
a) Create a pivot table of a new dataframe
b) Create a pivot table of a summarized subset of a dataframe by passing the aggregation function,
values, and columns that aggregation will be applied to
c) Apply a different aggregation function to different columns

16. How to customize the pandas dataframe using the cross-tabulations technique? Write the code
snippets to perform the customization.
17. Explain group-wise operations in data aggregation.
18. Write python code to perform the following using pandas:
a) Creation of a dataframe from series, dictionary, and n-dimensional arrays.
b) Load a dataset from an external source into a pandas DataFrame.

UNIT II VISUALIZING USING MATPLOTLIB


Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors – density and
contour plots – Histograms – legends – colors – subplots – text and annotation – customization – three
dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn.

UNIT-II / PART-A
1. Write short notes on plt.show() command.
● The plt.show() command starts an event loop, looks for all currently active figure objects, and
opens one or more interactive windows that display the figures.
● This command should be used only once per python session, and is most often seen at the
very end of the script.
● Multiple show() commands can lead to unpredictable backend-dependent behavior, and
should be avoided.
2. Give an example to create line plots.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--')
plt.show()

3. Define figures and axes in matplotlib.


● In Matplotlib, the figure is a single container that contains all the objects representing axes,
graphics, text, and labels.
● The axes are a bounding box with ticks and labels, which will eventually contain the plot
elements that make up our visualization.
4. Name the library and interface used to plot a chart in python.
● Library – matplotlib

● interface – pyplot
5. What is Density Plot? Give example.
The density plot is the continuous and smoothed version of the Histogram estimated from the data. It
is estimated through Kernel Density Estimation. KDE implementation exists in the scipy.stats
package.
Example:
from scipy.stats import gaussian_kde
# x and y are assumed to be arrays of sample points
data = np.vstack([x, y])
kde = gaussian_kde(data)
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
plt.imshow(Z.reshape(Xgrid.shape), origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6], cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")

6. What is a histogram? How do you plot a simple histogram using matplotlib in python?
The histogram is the graphical representation that organizes a group of data points into the specified
range. A histogram divides the variable into bins, counts the data points in each bin, and shows the
bins on the x-axis and the counts on the y-axis.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);

7. What are legends in data visualization?
Plot legends give meaning to a visualization, assigning meaning to the various plot elements. The
simplest legend can be created with the plt.legend() command, which automatically creates a legend
for any labeled plot elements. The labels are applied to the plot elements to show on the legend.
8. What are the basic elements/components of the chart?
The chart has the following elements/components:
● Chart area or figure
● Axis
● Artist
● Titles
● Legends
9. Mention the methods for setting axis limits while plotting a figure. Give example.
The methods used for adjusting axis limits while plotting a figure are the plt.xlim() and plt.ylim()
methods.
Example:
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);

10 . What types of graphs can be plotted using matplotlib?


The matplotlib provides the following types of charts:
● Line chart
● Bar chart
● Horizontal bar chart
● Histogram
● Scatter chart
● Boxplot
● Pie Chart
11. How do you change the thickness of a line?
To change the thickness of a line, use the linewidth parameter inside matplotlib.pyplot.plot() function
with a numeric value.
Example: plt.plot(x,y,linewidth=2)

12 . How to display the plots? Give an example to display a graph.


plt.show() method is used to display your figure or figures.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()

13 . Write short notes on scatter plot.


The graph where the points are represented individually with a dot, circle, or other shape is called a
scatter plot. Two different methods of creating scatter plots:
● Using plt.plot function
plt.plot(x, y, '-ok');

● Using plt.scatter function


plt.scatter(x, y, marker='o')

14 . What are the three different categories of colormaps?


Sequential colormaps: These are made up of one continuous sequence of colors (e.g., binary or
viridis).
Divergent colormaps: These usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps: These mix colors with no particular sequence (e.g., rainbow or jet).
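A short sketch comparing one colormap from each category on the same image (the sample data and the specific colormap names chosen here are only illustrative):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
I = np.sin(x) * np.cos(x[:, np.newaxis])
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, cmap in zip(axes, ['viridis', 'RdBu', 'jet']):  # sequential, divergent, qualitative
    im = ax.imshow(I, cmap=cmap)
    ax.set_title(cmap)
    fig.colorbar(im, ax=ax)
plt.show()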

15 . Write code to do the following:


Plot the following data on a line chart:

Team        Runs in 10 overs        Runs in 20 overs
MI          110                     224
RCB         85                      210

import matplotlib.pyplot as plt

overs = [10, 20]
mi = [110, 224]
plt.plot(overs, mi, color='blue', label='MI')
rcb = [85, 210]
plt.plot(overs, rcb, color='red', label='RCB')
plt.xlabel('Overs')
plt.ylabel('Runs')
plt.title('Match Summary')
plt.legend()
plt.show()

16 . Mention the use of plt.axis() method. How to set the axis limits with plt.axis() method?
The plt.axis() method is used to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin,ymax].
Example:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);

17 . Plot the following data on a bar graph:


Average Pulse: 80, 85, 90, 95, 100, 105, 110, 115, 120, 125
Calorie Burnage: 240, 250, 260, 270, 280, 290, 300, 310, 320, 330
import numpy as np
import matplotlib.pyplot as plt
x=np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y=np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.bar(x,y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.show()

18. What is the use of the color keyword in plotting? Give examples to control the colors of plot
elements.
The color keyword is used to adjust the color. It accepts a string argument representing any
imaginable color virtually.
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)

19 . How to draw two sets of scatterplots in the same plot?


# Draw two sets of points
plt.plot([1,2,3,4,5], [1,2,3,4,10], 'go') # green dots
plt.plot([1,2,3,4,5], [2,3,4,5,11], 'b*') # blue stars
plt.show()

20 . Write code to plot a line chart to depict the run rate of T20 match from given
data: Overs Runs
5 45
10 79
15 145
20 234
import matplotlib.pyplot as plt
overs = [5,10,15,20]
runs = [45,79,145,234]
plt.plot(overs,runs)
plt.xlabel('Overs')
plt.ylabel('Runs')
plt.show()

21. How do you change the line style?


The linestyle keyword is used to adjust the line style. This linestyle can be one of the following:
● solid
● dashed
● dashdot
● dotted
Example:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 4, linestyle='dashed')
plt.plot(x, x + 8, linestyle='dashdot')
plt.plot(x, x + 12, linestyle='dotted');

22 . What is the use of plt.legend() method?


● When multiple lines are being shown within a single axis, it is useful to create a plot legend
that labels each line type.
● Matplotlib includes plt.legend() method for quickly creating such a
legend. Example:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.legend();

23 . How to set multiple properties of plotting at once using an object-oriented interface approach?
In the object-oriented interface to plotting, rather than calling the functions individually, it is often
more convenient to use the ax.set() method to set all the properties at once.
Example:
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2), xlabel='x', ylabel='sin(x)', title='A Simple Plot');

24 . What is the use of plt.imshow() function?


The plt.imshow() function interprets a two-dimensional grid of data as an image.
Example:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
plt.colorbar()
plt.axis(aspect='image');

25 . What is the use of plt.fill_between function?


The plt.fill_between function is used with a light color to visualize the continuous error.
With the fill_between function: we pass an x value, then the lower y-bound, then the upper y-bound,
and the result is that the area between these regions is filled.
Example:
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit, color='gray', alpha=0.2)
plt.xlim(0, 10);

26 . Give the command for importing matplotlib. Draw a scatterplot with green dots using
matplotlib.
import matplotlib.pyplot as plt
plt.plot([1,2,3,4,5], [1,2,3,4,10], 'go')
plt.show()

27 . What are sharex and sharey?


By specifying sharex and sharey, we will automatically remove inner labels on the grid to make the
plot cleaner. The resulting grid of axes instances is returned within a NumPy array, allowing for
convenient specification of the desired axes using standard array indexing notation.
Example:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

28 . What do you mean by data visualization technique?


The data visualization technique refers to the graphical or pictorial or visual representation of data.
This can be achieved by charts, graphs, diagrams, or maps.
29 . How to plot two subplots using a MATLAB-style interface?
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));

30 . What are the pre-defined transforms when considering the placement of text on a figure?
There are three pre-defined transforms:
ax.transData: Transform associated with data coordinates
ax.transAxes: Transform associated with the axes (in units of axes dimensions)
fig.transFigure: Transform associated with the figure (in units of figure dimensions)
31 . What are the ways of importing matplotlib?
Matplotlib can be imported in the following two ways:
● Using alias name: import matplotlib.pyplot as plt
● Without alias name: import matplotlib.pyplot
32 . What are transData and transAxes coordinates?
● The transData coordinates give the usual data coordinates associated with the x and y-axis
labels.
● The transAxes coordinates give the location from the bottom-left corner of the axes as a
fraction of the axes’ size.
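A minimal sketch showing the three transforms used to place text (the coordinates and labels are chosen arbitrarily for illustration):
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x))
ax.text(5, 0.5, "data coords", transform=ax.transData)           # at x=5, y=0.5 in data units
ax.text(0.1, 0.9, "axes coords", transform=ax.transAxes)         # 10% in, 90% up the axes
fig.text(0.5, 0.02, "figure coords", transform=fig.transFigure,  # relative to the whole figure
         ha='center')
plt.show()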
33 . How do you create a contour plot?
● A contour plot can be created with the plt.contour function.
● It takes three arguments:
o a grid of x values
o a grid of y values
o a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the contour
levels.
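A small contour-plot sketch in the style of the textbook example (the function f below is an arbitrary illustration):
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)             # grids of x and y positions
Z = f(X, Y)                          # grid of z values, shown as contour levels
plt.contour(X, Y, Z, 20, cmap='RdGy')
plt.colorbar()
plt.show()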
34. Write the function for listing the first five available plot styles.
plt.style.available[:5]
Output:
['fivethirtyeight', 'seaborn-pastel', 'seaborn-whitegrid', 'ggplot', 'grayscale']
35 . Write a code to use a gray background and draw solid white grid lines.
(i) use a gray background
ax = plt.axes(facecolor='#E6E6E6')  # facecolor replaces the older axisbg argument
ax.set_axisbelow(True)
(ii) draw solid white grid lines
plt.grid(color='w', linestyle='solid')

36 . Write short notes on ggplot.


The ggplot package in the R language is a very popular visualization tool. Matplotlib's ggplot style
mimics the default styles from that package.
Example:
with plt.style.context('ggplot'):
hist_and_lines()

37 . How do you import three-dimensional plots?


The three-dimensional plots are enabled by importing the mplot3d toolkit, included with the main
Matplotlib installation:
from mpl_toolkits import mplot3d
Once this submodule is imported, a three-dimensional axis can be created by passing the keyword
projection='3d' to any of the normal axes’ creation routines.
fig = plt.figure()
ax = plt.axes(projection='3d')

38 . What are cylindrical projections?


● The simplest of map projections are cylindrical projections in which lines of constant latitude
and longitude are mapped to horizontal and vertical lines respectively.
● This type of mapping represents equatorial regions well but results in extreme distortions near
the poles.

● The spacing of latitude lines varies between different cylindrical projections leading to
different conservation properties and different distortion near the poles.
39 . List the map-specific methods of the Basemap instance.
● contour()/contourf()
● imshow()
● pcolor()/pcolormesh()
● plot()
● scatter()
● quiver()
● barbs()
● drawgreatcircle()
40 . How do you hide ticks or labels?
The most common tick/label formatting operation is the act of hiding ticks or labels. This can be
done using plt.NullLocator() and plt.NullFormatter() as:
ax = plt.axes()
ax.plot(np.random.rand(50))
ax.yaxis.set_major_locator(plt.NullLocator())
ax.xaxis.set_major_formatter(plt.NullFormatter())

41 . What is seaborn?
● Seaborn provides an API on top of Matplotlib that offers the same choices for plot style and
color defaults, defines simple high-level functions for common statistical plot types, and
integrates with the functionality provided by Pandas DataFrames.
● It provides high-level commands to create a variety of plot types useful for statistical data
exploration and even some statistical model fitting.
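A minimal Seaborn sketch; it assumes seaborn 0.11+ for set_theme(), and the data are randomly generated for illustration:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.5 * df["x"] + rng.normal(size=200)
sns.set_theme()                                    # apply seaborn's style defaults
sns.jointplot(data=df, x="x", y="y", kind="kde")   # high-level statistical plot
plt.show()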

UNIT-II / PART-B
1. Show the implementation code describing the below components of the matplotlib module in python
a. Create a simple figure and axes.
b. Control the line colors and styles
c. Adjust the axis limits
d. Label the Plots
2. Describe the scatter plot in detail and demonstrate the scatter plot using plt.plot and plt.scatter.
Differentiate both plots.
3. Describe in detail about Density and Contour Plots.
4. Explain Histograms, Binnings, and Density in detail.
5. Illustrate the three Matplotlib functions that can be useful to display three-dimensional data in two
dimensions.

6. Show how the following customization can be achieved using matplotlib.
a. Plot legends
b. Color bars
7. Explain the four routines for creating subplots in Matplotlib.
8. Describe rcParams in detail.
9. Create a function that will make two basic types of plots: histogram and line plots. Explore how these
plots look using the various built-in styles.
10 . Illustrate Three-Dimensional Plotting in Matplotlib.
a. Plot a trigonometric spiral, along with some points drawn randomly near the line
b. Show a three-dimensional contour diagram of a three-dimensional sinusoidal function.
11. Explain the different map projections of the Basemap package.
12 . Create a plot using Basemap for the California Cities data.
a. Load information such as the location, size, and population of California cities
b. Set up the map projection
c. Scatter the data
d. Create a colorbar and legend
13. Discuss the differences between matplotlib and seaborn.
14. Explain the plot types available in seaborn.
15. Explain the process of using matplotlib for generating data visualization for different kinds of
multidimensional datasets, to correctly capture their latent patterns for data analytics.

UNIT III UNIVARIATE ANALYSIS


Introduction to Single variable: Distributions and Variables - Numerical Summaries of Level and Spread
- Scaling and Standardizing – Inequality - Smoothing Time Series.

UNIT-III/ PART-A

1. What is sampling?
Sampling is a method that allows us to get information about the population based on the
statistics from a subset of the population (sample or case), without having to investigate every
individual.
2. What are the two basic units of data analysis?
● Cases and variables are the two organizing concepts that are considered the basis of data
analysis.
● The cases are the samples about which information is collected.
● The information is collected on certain features of all the cases.
● These features are the variables that vary across different cases.
Example: In a survey of individuals, their income, sex, and age are some of the variables that
might be recorded.
3. What are the measurement scales used by social scientists?
Many of the variables used by social scientists are measured on nominal scales or ordinal scales
(also referred to as categorical variables), rather than interval scales (also referred to as continuous
variables).
4. What are two techniques for reducing the number of digits?

● Rounding and truncating are two methods for reducing the number of digits.
● Rounding method: Digits from zero to four are rounded down, and six to nine are rounded
up. The value 5 can be arbitrarily rounded up or down according to a fixed rule, or it could
be rounded up after an odd digit and down after an even digit.
● Truncating method: Values that are not needed are simply ‘cut off’ or ‘truncated’. Thus,
when cutting, all the numbers from 899.0 to 899.9 become 899. This procedure is much
quicker and does not run the extra risk of large mistakes.
5. Write the drawback of the rounding method.
● The rounding of digit five causes a problem; it can be arbitrarily rounded up or down
according to a fixed rule, or it could be rounded up after an odd digit and down after an even
digit.
● The trouble with such fuzzy rules is that people tend to make mistakes, and often they are
not trivial.
● It is an easy mistake to round 899.6 to 890.
6. What is a bar chart? Give example.
A bar chart is a visual display in which bars are drawn to represent each category of a variable such
that the length of the bar is proportional to the number of cases in the category. For instance, a bar
chart of a drinking classification variable can be drawn with one bar per category, as sketched below.

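The original figure is not reproduced here; a minimal matplotlib sketch with hypothetical category counts conveys the idea:
import matplotlib.pyplot as plt

# Hypothetical counts for a drinking-classification variable
categories = ['Non-drinker', 'Light', 'Moderate', 'Heavy']
counts = [120, 310, 150, 40]
plt.bar(categories, counts)
plt.xlabel('Drinking classification')
plt.ylabel('Number of cases')
plt.show()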
7. When to prefer pie charts?


Pie charts are to be preferred when there are only a few categories and when the sizes of the
categories are very different.
8. What are the two types of distributions in histograms?
The two types of distribution in histograms are unimodal and bimodal, depending on the frequency
of the occurring values.
9. What are the four important aspects of any distribution inspected by histograms?
● Level: What are typical values in the distribution?
● Spread: How widely dispersed are the values? Do they differ very much from one another?
● Shape: Is the distribution flat or peaked? Symmetrical or skewed?
● Outliers: Are there any particularly unusual values?
10. What is SPSS?
● SPSS is an acronym for Statistical Package for the Social Sciences.
● SPSS is a very useful computer package that includes hundreds of different procedures for
displaying and analyzing data.

11. What is unimodal distribution?
● A unimodal distribution is a single-peaked distribution in which one value occurs with greater
frequency than any other value.
● It is a distribution with a single clearly visible peak or a single most frequent value.
● The distribution’s shape in the unimodal distribution has only one main high point.
12. What is bimodal distribution?
● Bimodal distribution is a distribution where two values occur with the greatest frequency
which means two frequent values are separated by a gap in between.
● This type of distribution has two fairly equal high points (or the modes).
● The two modes are usually separated by a big gap in between and the distribution contains
more data than others.
13. What are histograms?
● Histograms are charts that are similar to bar charts that can be used to display interval-level
variables grouped into categories.
● They are constructed in exactly the same way as bar charts except that the ordering of the
categories is fixed.
14. What are the three main windows of SPSS?
SPSS has three main windows:
● The Data Editor window
● The Output window
● The Syntax window
15. What are the advantages and disadvantages of a summary?
Advantages:
● Summaries focus the attention of the data analyst on one thing at a time and prevent
exploring aimlessly over a display of the data.
● They also help focus the process of comparison from one dataset to another and make
it more rigorous.
Disadvantages:
● Summaries always involve some loss of information.
● They do not contain the richness of information that existed in the original picture.
16. How do we define 'typical' value for summarization?
● The value halfway between the extremes might be chosen
● the single most common number
● a summary of the middle portion of the distribution.
17. What is a residual? Give an example.
● A residual can be defined as the difference between a data point and the observed typical, or
average, value.
● For example, if 40 hours a week is chosen as the typical level of men's working hours, using
data from the General Household Survey in 2005, then a man who was recorded in the
survey as working 45 hours a week would have a residual of 5 hours.
● Another way of expressing this is that the residual is the observed data value minus the
predicted value and in this case, 45-40 = 5.
18. What is data in terms of summarization?
● Any data value (such as a measurement of hours worked or income earned) is composed of
two components: a fitted part and a residual part.
● This can be expressed as an equation:
Data = Fit + Residual
19. What are the measures of central tendency?
Mean, median, and mode are the measures of central tendencies.
● Mean: the sum of all values divided by the total number of values.
● Median: the middle number in an ordered dataset.
● Mode: the most frequent value.
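These three measures can be computed directly in Python; the data below are the ranked working hours used later in this unit (note that statistics.mode() on Python 3.8+ returns the first of several tied modes):
import numpy as np
from statistics import mode

data = [30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80]
print(np.mean(data))    # 51.0 - sum of values divided by the number of values
print(np.median(data))  # 48.0 - middle value of the ordered data
print(mode(data))       # 48   - most frequent value (first of the tied modes)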
20. What is arithmetic mean?
● The mean value of a dataset is the average value. (i.e.) a number around which a whole data
is spread out.
● To calculate it, first, all of the values are summed, and then the total is divided by the
number of data points.
● In mathematical terms,
Ȳ = ( Σ Yᵢ ) / N
where Y is conventionally used to refer to an actual variable, the subscript 'i' is an index that
indicates which case is being referred to, and N is the number of data points.
21. What is a median?
● Median is the middle value of the dataset (i.e.) the data is sorted from smallest to biggest (or
biggest to smallest) and then the value in the middle of the set is taken.
● It is the value of the case that has equal numbers of data points above and below it.
● With N data points, the median M is the value at depth (N + 1)/2.
22. What are quartiles and midspread?
● The points which divide the distribution into quarters are called the quartiles (or hinges or
fourths).
● The lower quartile is usually denoted QL and the upper quartile QU. The middle quartile is
the median.
● The distance between QL and QU is called the midspread (dQ or interquartile range).
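A quick numerical check with NumPy; its default (linearly interpolated) percentiles agree with the depth-based quartiles used in question 26 for this particular batch, though other quartile conventions exist:
import numpy as np

data = [30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80]
q_l, median, q_u = np.percentile(data, [25, 50, 75])
print(q_l, median, q_u)              # 42.5  48.0  55.0
print("midspread (dQ):", q_u - q_l)  # 12.5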
23. Why can the range not be recommended as a summary measure of spread?
● Range only uses information from two data points, and these are drawn from the most
unreliable part of the data.
● Therefore, despite its intuitive appeal, it cannot be recommended as a summary measure of
spread.
24. Write a brief note on the usefulness of the mean and the standard deviation measures.
● The mean and the standard deviation are less resistant than other measures.
● So, they are often preferable for much descriptive and exploratory work, especially when
there are measurement errors.
● The mean and the standard deviation measures are used to make very precise statements
about the likely degree of sampling error in any data.
25. State Twyman's law for data analysis.
The more unusual or interesting the data, the more likely they are to have been the result of an error
of one kind or another.
26. There are 15 cases in the small datasets of men's working hours. The median is at depth 8, and
the quartiles are at depth 4.5. Find the midspread for the dataset given below.

30,37,39,40,45,47,48,48,50,54,55,55,67,70,80

The quartiles lie at depth 4.5, i.e. halfway between the 4th and 5th values counting in from each
end of the ordered data.
Lower quartile (QL) = (40 + 45) / 2 = 42.5
Upper quartile (QU) = (55 + 55) / 2 = 55
The distance between them, the midspread (dQ), is therefore 55 − 42.5 = 12.5 hours.
27. What is the general principle in comparing different measures?
The general principle in comparing different measures is: one measure is more resistant than another
if it tends to be less influenced by a change in any small part of the data.
28. How do we decide between the median and mean to summarize a typical value, or between the
range, the midspread, and the standard deviation to summarize the spread?
● Locational statistics such as the range, median, and midspread generally fare better than the
more abstract means and standard deviations.
● Means and standard deviations are more influenced by unusual data values than medians and
midspreads.
● Means and standard deviations are usually more influenced by a change in any individual
data point than the medians and midspreads.
29. What is smoothing?
Smoothing a time series decomposes the data into a smooth component that captures the underlying
signal and a rough component that captures the noise:
Message = Signal + Noise
Data = Smooth + Rough
30. Calculate the standard deviation of the hours worked by the small sample of men given
below: 54,30,47,39,50,48,45,40,37,48,67,55,55,80,70.

s = √( Σ(Yᵢ − Ȳ)² / (N − 1) )
  = √( 2472 / 14 )
  = 13.29
31. Write a brief note on Gaussian distribution.
● Gaussian distributions are bell-shaped and have the convenient property of being
reproducible from their mean and standard deviation.
● Given these two pieces of information, the exact shape of the curve can be reconstructed,
and the proportion of the area under the curve falling between various points can be
calculated.
● Gaussian distribution is the one that, when used to represent a sample, involves the simplest
calculations from sample values.
32. Explain Lorenz curves.
● Lorenz curves have visual appeal because they portray how near total equality or total
inequality a particular distribution falls.
● The degree of inequality in two distributions can be compared by superimposing their
Lorenz curves.
33. Define the Gini Coefficient.
● A measure that summarizes what is happening across all the distribution is the Gini
coefficient.
● The Gini coefficient expresses the ratio between the area between the Lorenz curve and the
line of total equality and the total area in the triangle formed between the perfect equality
and perfect inequality lines.
● It therefore varies between 0 and 1 although it is sometimes multiplied by 100 to express the
coefficient in percentage form.
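A small sketch computing the Gini coefficient from a batch of incomes, using a standard closed-form expression over the sorted values (the income figures are hypothetical):
import numpy as np

def gini(values):
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # 0 = perfect equality; values approaching 1 = extreme inequality
    return (2 * np.sum(np.arange(1, n + 1) * x)) / (n * x.sum()) - (n + 1) / n

incomes = [10, 20, 30, 40, 100]
print(gini(incomes))   # 0.4 for this batch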
34. Explain smoothing in time series.
Smoothing is a technique applied to time series to remove the fine-grained variation between time
steps. The hope of smoothing is to remove noise and better expose the signal of the underlying
causal processes. Moving averages are a simple and common type of smoothing used in time series
analysis and time series forecasting.
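A minimal pandas sketch of a centred moving average (the monthly values below are illustrative only):
import pandas as pd

ts = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
               index=pd.date_range("2022-01-01", periods=12, freq="MS"))
# Each point becomes the mean of itself and its neighbours,
# smoothing out month-to-month noise
smooth = ts.rolling(window=3, center=True).mean()
print(smooth)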
35. What is the effect on distribution aspects of adding or subtracting a constant from every data
value? Why should we add or subtract a constant?
● The change made to the data by adding or subtracting a constant is fairly trivial.
● Only the level is affected; spread, shape, and outliers remain unaltered.
● The reason for adding or subtracting a constant from every data value is to make a division
above and below a particular point.
● This is also done to bring the data within a particular range.
36. List the different smoothing processes used in refinement.
● Endpoint Smoothing
● Breaking the smooth
UNIT-III / PART-B

1. The dataset below shows the gross earnings in pounds per week of twenty men and twenty women
drawn randomly from the 1979 New Earnings Survey. The respondents are all full-time adult
workers. Men are deemed to be adult when they reach age 21; women when they reach age 18.
Men Women
150 58 90 39
55 122 76 47
82 120 87 80
107 83 58 42
102 115 50 40
78 69 46 99
154 99 63 77
85 94 68 67
123 144 116 49
66 55 60 54
Calculate the median and dQ of both male and female earnings, and compare the two distributions.
Write your interpretations with respect to the four aspects of a distribution (level, spread, shape, and outliers).
2. Describe histograms, bar graphs and pie charts in detail. Draw charts for the below table that shows
a specimen case by variable data matrix. It contains the first few cases in a subset of the 2005 GHS.

3. Explain in detail numerical summaries of level and spread.
4. Explain in detail the concepts of Scaling and Standardizing.
5. Write in detail about Inequalities.
6. Write a detailed explanation about time series smoothing.
7. Explain various smoothing techniques.
8. Pick any three numbers and calculate their mean and median. Calculate the residuals and squared
residuals from each, and sum them. Confirm that the median produces smaller absolute residuals
and the mean produces smaller squared residuals.
9. Explain about variables in a data matrix with an example.
10. Calculate the mean and standard deviation of the male earnings of the data given below. Compare
them with the median and midspread you calculated. Why do they differ?

Men's working hours (ranked)
30
37
39
40
45
47
48
Median value 48
50
54
55
55
67
70
80

11. If you were told that the distribution of a test of ability on a set of children was Gaussian, with a
mean of 75 and a standard deviation of 12,
(a) What proportion of children would have scores over 75?
(b) What proportion of children would have got scores between 51 and 99?
(c) What proportion of children would you expect to have scores of less than 39?
12. a) How should income be defined and what should be the unit of measurement?
b) How do we display the income distribution using the Lorenz curve and Gini coefficient?

UNIT IV BIVARIATE ANALYSIS


Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling
Several Batches - Scatterplots and Resistant Lines – Transformations.
UNIT-IV / PART-A
1. Write briefly about the contingency table.
● A contingency table shows the distribution of each variable conditional upon each category
of the other.
● The categories of one of the variables form the rows, and the categories of the other variable
form the columns.
● Each individual case is then tallied in the appropriate cell depending on its value on both
variables.
● The number of cases in each cell is called the cell frequency.
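A contingency table of this kind can be sketched with pandas crosstab() on hypothetical data; margins=True adds the marginals:
import pandas as pd

df = pd.DataFrame({"sex":    ["M", "M", "F", "F", "F", "M"],
                   "smoker": ["yes", "no", "no", "no", "yes", "yes"]})
# Rows = categories of sex, columns = categories of smoker, cells = frequencies
print(pd.crosstab(df["sex"], df["smoker"], margins=True))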
2. What are the two types of variables?
● The two types of variables are explanatory and response variables.
● The variable that is presumed to be the cause is the explanatory variable (denoted as X).
● The one that is presumed to be the effect is the response variable (denoted as Y).
● They are also termed independent and dependent variables respectively.
3. What are bounded numbers?
Proportions and percentages are bounded numbers, in that they have a floor of zero, below which
they cannot go, and a ceiling of 1.0 and 100 respectively.
4. Draw the general diagram of the causal path model.

5. Write short notes on contingency table. Draw a schematic four-by-four contingency table.
● A contingency table shows the distribution of each variable conditional upon each category
of the other.
● The categories of one of the variables form the rows, and the categories of the other variable
form the columns.
● Each individual case is then tallied in the appropriate pigeonhole depending on its value on
both variables.
● The pigeonholes are called as cells, and the number of cases in each cell is called the cell
frequency
● Each row and column can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals.

6. What are three different ways of representing contingency table in percentage form?
The three different ways of representing contingency table in percentage form are
● The table that is constructed by dividing each cell frequency by the grand total.
● Outflow table: The table that is constructed by dividing each cell frequency by its
appropriate row total.
● Inflow table: The table that is constructed by dividing each cell frequency by its appropriate
column total.
7. What are marginals?
● Each row and column in a contingency table can have a total presented at the right-hand end
and at the bottom respectively; these are called the marginals.
● The univariate distributions can be obtained from the marginal distributions.

8. Write a brief note on labeling a table.


● The title of a table should be clear and concise, summarising the contents.
● It should be as short as possible, while at the same time making clear when the data were
collected, the geographical unit covered, and the unit of analysis.
● Numbering tables and figures makes it possible to refer to them more succinctly in the text.
● Other parts of a table also need clear, informative labels.
● The variables included in the rows and columns must be clearly identified.
9. What is the importance of using a layout in a table?
● The effective use of space and grid lines can make the difference between a table that is easy
to read and one which is not.
● Grid lines can help indicate how far a heading or subheading extends in a complex table.
10 . What are the considerations to make a decision about which variable to put in the rows and
which in the columns?
● Closer figures are easier to compare.
● Comparisons are more easily made down a column.
● A variable with more than three categories is best put in the rows so that there is plenty of
room for category labels.
11. What is the difference in proportions?
The difference in proportions, d, is used to summarize the effect of being in a category of one
variable upon the chances of being in a category of another.
12 . Write the properties of difference in proportions?
● Symmetric measures of association have the same value regardless of which way the causal
effect is assumed to run.
● Asymmetric measures have varying values depending on which variable is presumed to be
the cause of the other.
13 . What are inferential statistics?
The analysis of data from samples of individuals to infer information about the population as a
whole is called inferential statistics.
14. Write the equation for the chi-square statistic.
The equation for chi-square is given by
χ² = Σ (O − E)² / E
where O is the observed frequency and E is the expected frequency for each cell.
The difference between the observed and expected frequencies for each cell of the table is
calculated. Then, this value is squared before dividing it by the expected frequency for that cell.
Finally, these values are summed over all the cells of the table.
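A sketch of the same calculation with SciPy on a hypothetical 2x2 table of observed frequencies:
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # chi-square statistic, p-value, degrees of freedom
print(expected)             # expected frequencies under the null hypothesis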
15 . What is a null hypothesis?
The null hypothesis is that the two variables under analysis are not associated in the population
as a whole, and that any relationship observed between the variables in the sample is small enough to
have occurred due to random error. (i.e.) the null hypothesis states that, in the population of
interest, changes in the explanatory variable have no impact on the outcome of the response
variable.
16 . What are outliers?
Some datasets contain points that are a lot higher or lower than the main body of the data. Outliers
are points that are unusually distant from the rest of the data.
17 . What are the elementary details which must always appear in a table?
● Labelling
● Sources
● Sample data
● Missing data
● Layout
● Definitions
● Opinion data
● Ensuring frequencies can be reconstructed
● Showing the way percentages run
18 . What is a degree of freedom? Give example.
The number of degrees of freedom for a table with r rows and c columns is given by the equation
below:
Degrees of freedom (df) = (r − 1) × (c − 1)
Example: A table with two rows and two columns is said to have one degree of freedom. A table
with two columns and three rows is said to have two degrees of freedom.
19 . What is the limitation of the chi-square statistic for tables with more than one degree
of freedom?
● Chi-square only gives an overall measure of whether the two variables are likely to be
associated, but it does not provide information on the locations of the differences within the
table.
● It is therefore necessary to recode the variables or to select specific groups for more detailed
analysis.
20 . What are the limitations of using chi-square statistic?
● The probability associated with a specific value of chi-square can only be calculated reliably
if all the expected frequencies in the table are at least 5. (i.e.) the size of sample required
partly depends on the distribution of the variables of interest.
● It only focuses on the relationship between two categorical variables. It does not examine the
relationship between a number of different categorical variables.
21. How are the outliers in a particular dataset identified?
● To identify the outliers in a particular dataset, a value of 1.5 times the midspread (dQ), called a step, is calculated.
● Fractions other than one-half are ignored.
● Then the points beyond which the outliers fall (the inner fences) and the points beyond
which the far outliers fall (the outer fences) are identified.
● The inner fences lie one step beyond the quartiles and outer fences lie two steps beyond the
quartiles.
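A short Python sketch of this fence rule (the data values are made up for illustration):

# A minimal sketch of inner and outer fences for flagging outliers.
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 25])   # hypothetical batch with one extreme value

q1, q3 = np.percentile(data, [25, 75])
dq = q3 - q1                  # midspread (interquartile range)
step = 1.5 * dq               # one step

inner_low, inner_high = q1 - step, q3 + step          # inner fences
outer_low, outer_high = q1 - 2 * step, q3 + 2 * step  # outer fences

outliers = data[(data < inner_low) | (data > inner_high)]
far_outliers = data[(data < outer_low) | (data > outer_high)]
print(outliers, far_outliers)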
22 . What are the five principal advantages of transforming data?
1. Data batches can be made more symmetrical.
2. The shape of data batches can be made more Gaussian.
3. Outliers that arise simply from the skewness of the distribution can be removed, and previously
hidden outliers may be forced into view.
4. Multiple batches can be made to have more similar spreads.
5. Linear, additive models may be fitted to the data.
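As a brief illustration of the first two advantages, the sketch below applies a log transformation to a made-up, right-skewed batch and compares the skewness before and after:

# A minimal sketch: a log transformation making a skewed batch more symmetrical.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10, sigma=0.8, size=1000)   # hypothetical right-skewed batch

print(skew(incomes))           # strongly positive skew before transforming
print(skew(np.log(incomes)))   # close to zero after the log transformation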
23 . What is a boxplot?
The boxplot is a device for conveying the information in the five-number summary economically and effectively. An example is sketched below.
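A minimal matplotlib sketch that draws a boxplot of a hypothetical data batch:

# A minimal sketch: boxplot of a hypothetical data batch.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
batch = rng.normal(loc=50, scale=10, size=200)   # hypothetical data batch

plt.boxplot(batch)   # box = quartiles, central line = median, whiskers/fliers show the tails
plt.title('Boxplot of a hypothetical data batch')
plt.ylabel('Value')
plt.show()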
24 . Provide two reasons why some data points are outliers.
● Outliers occur when the whole distribution is skewed.
● The particular data points do not really belong substantively to the same data batch.
25 . Draw the anatomy of a boxplot.
26 . What is GNI?
GNI is the sum of the values of all final goods and services (including investment goods) produced by the residents of a country. If one focuses on the production undertaken by the residents of that country, the income earned by nationals from abroad has to be added to the Gross Domestic Product to arrive at the Gross National Income.
27 . What are the two types of hypotheses used by statistical tests?
Whenever researchers use a statistical test, two hypotheses are involved:
● Null hypothesis
● Alternative hypothesis
28 . What is GDP?
If one focuses on all the production that takes place within national boundaries, the measure is
termed the Gross Domestic Product (GDP).
29 . Write the formula for a two-sample t-test.
The formula for a two-sample t-test where the samples are independent is

t = (x̄1 − x̄2) / ( sp √( 1/n1 + 1/n2 ) )

where x̄1 and x̄2 are the means of the two samples, and sp is the pooled standard deviation, calculated as

sp = √[ ( (n1 − 1)s1² + (n2 − 1)s2² ) / ( n1 + n2 − 2 ) ]

Here s1 and s2 are the standard deviations of the first and second samples, and n1 and n2 are the sample sizes of the first and second samples.
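A short sketch of the same test using SciPy (the two samples are hypothetical):

# A minimal sketch of an independent two-sample t-test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
sample1 = rng.normal(loc=52, scale=8, size=40)   # hypothetical first sample
sample2 = rng.normal(loc=48, scale=8, size=35)   # hypothetical second sample

# equal_var=True uses the pooled standard deviation, matching the formula above
t_stat, p_value = ttest_ind(sample1, sample2, equal_var=True)
print(t_stat, p_value)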
UNIT-IV / PART-B
1. The data given below relate to the percentage of households that are headed by a lone parent and
contain dependent children, and the percentage of households that have no car or van for each of the
ten Government Office Regions of England and Wales.
Summarize the linear relationship between the two interval-level variables and explain in detail.
Also discuss the rules to draw the line.
2. Consider three variables: financial circumstances of a family (comfortable vs struggling); marital
status of parents (married and cohabiting vs divorced or separated) and educational outcomes for
children. Draw a causal path diagram indicating the relationships you would assume to exist
between these variables. Suppose the bivariate effect of parents' marital status on educational
outcomes for children was strongly positive. What would you expect to happen to the magnitude of
the effect once you had controlled for financial circumstances?
3. Describe in detail contingency tables and percentage tables with suitable examples.
4. Explain the guidelines to construct a lucid table of numerical data.
5. Examine the given data. Which group of women is most likely to have high levels of worry about
violent crime? And which group is least likely to have high levels of worry about violent crime?
Using the data in the table create a simpler table with just three age groups; 16-34; 35-54; and 55+.
Decide which should be the reference or base category and use the table to construct a causal path
diagram.
Women's          High levels of worry     Not worried          Total
age group        P        N               P        N           P      N
16-24            0.32     686             0.68     1447        1      2133
25-34            0.27     1040            0.73     2815        1      3855
35-44            0.25     1224            0.75     3725        1      4949
45-54            0.24     952             0.76     3008        1      3960
55-64            0.22     943             0.78     3430        1      4373
65-74            0.22     781             0.78     2735        1      3516
75 or older      0.14     511             0.86     3044        1      3555
Total                     6,137                    20,204             26,341
6. Consider an imaginary piece of research in which 100 men and 100 women are asked about their fear of walking alone after dark. Until we conduct the survey, we have no information other than the
number of men and women in our sample (see the table below).
Feeling safe walking alone after dark by gender

           Very safe, fairly safe,     Very unsafe        Total
           or a bit unsafe
           P         N                 P        N         P       N
Male       ?         ?                 ?        ?         1       100
Female     ?         ?                 ?        ?         1       100
Total      ?         ?                 ?        ?         1       200
Find the following:
a. After the survey, imagine that we find that in total 20 individuals i.e., 0.1 of the sample state
that they feel very unsafe when walking alone after dark. Add this information to the given
table.
b. If, in the population as a whole, the proportion of men who feel very unsafe walking alone
after dark is the same as the proportion of women who feel very unsafe walking alone after
dark, we would expect this to be reflected in our sample survey. Find the expected
proportions and frequencies.
c. Find the observed values after carrying out the survey and the fear of walking alone after
dark by gender were cross-tabulated.
d. Compute the chi-square.
Briefly explain the statistically significant relationship between the variables.
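For part (d), a sketch of the chi-square computation using SciPy is given below; the observed male/female split of the 20 'very unsafe' respondents is purely illustrative, since the question leaves those cells to be found from the survey:

# A minimal sketch of the chi-square test for the 2x2 gender table.
# The observed split of the 20 'very unsafe' respondents is hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

#                    safe-ish  very unsafe
observed = np.array([[95,  5],    # Male   (hypothetical counts)
                     [85, 15]])   # Female (hypothetical counts)

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)       # expected frequencies under the null hypothesis
print(chi2, p, dof)   # dof = (2 - 1) * (2 - 1) = 1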
7. Explain the following:
a. Null Hypothesis
b. Type 1 and Type 2 errors
c. Degrees of freedom
8. Describe the essentials of interpreting contingency tables.
9. Consider the problem of feeling safe walking alone after dark by gender. (sample restricted to those
of Black Caribbean ethnic origin)
For the above table, chi-square is calculated using SPSS
Describe the probability of making a Type 1 error.
10. Explain boxplots in detail with an example.
11. Explain fitting a resistant line with the data given below.
12. Describe outliers in detail.
UNIT V MULTIVARIATE AND TIME SERIES ANALYSIS
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and Beyond
- Longitudinal Data – Fundamentals of TSA – Characteristics of time series data – Data Cleaning –
Time-based indexing – Visualizing – Grouping – Resampling.
UNIT-V / PART-A
1. Define cause.
A cause is defined as an object followed by another, where all objects similar to the first are followed by objects similar to the second. In other words, if the first object had not been, the second never would have existed.
2. Define causality.
Although causality can be defined in terms of constant conjunction or statistical association, it is clearly not sufficient for one event invariably to precede another for us to be convinced that the first event causes the second.
3. What is multiple causality?
Multiple causality is a process where many different component causes can combine to produce
a specific outcome.
4. What are direct and indirect causal effects?
Direct causal effects are effects that go directly from one variable to another. Indirect effects occur
when the relationship between two variables is mediated by one or more variables.
5. List the different causal relationships between variables.
The different causal relationships between variables are prior, intervening, and ensuing.
6. What is Simpson's paradox?
Simpson's paradox is the phenomenon whereby any statistical relationship between two variables may be reversed by including additional factors in the analysis.
7. What are enhancer and suppressor variables?
● Test factors which are either positively related to both the other variables or negatively
related to both of them are called enhancer variables.
● Test factors which are positively associated with one variable and negatively with the other
are called suppressor variables.
8. Write the equation for logistic regression.
The logistic regression equation can be written as

log( p / (1 − p) ) = a + bX

where p is the probability of being in the category of interest of the response variable and X denotes the single explanatory variable.
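A short sketch (with simulated data, not part of the original answer) of fitting this equation using statsmodels:

# A minimal sketch: fitting log(p / (1 - p)) = a + bX on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # single explanatory variable X
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))    # true underlying probabilities
y = rng.binomial(1, p)                    # binary response variable

model = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
print(model.params)   # estimated intercept a and slope b on the log-odds scale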
9. What is a panel study?
The participants in a research study are contacted by researchers and asked to provide information
about themselves and their circumstances on a number of different occasions. This is referred to as
a panel study.
10. What are transition tables?
Transition tables have a longitudinal dimension in that the two variables that are being cross-
tabulated can be understood as a single categorical variable that has been measured at two time
points.
11. Define cohort. Give an example.
A cohort has been defined as an 'aggregate of individuals who experienced the same event within the same time interval'. The most obvious type of cohort used in longitudinal quantitative research is the birth cohort, i.e., a sample of individuals born within a relatively short time period.
12. What are the characteristics of time series data?
When working with time series data, there are several unique characteristics that can be observed.
● Trend
● Outliers
● Seasonality
● Abrupt changes
● Constant variance
13. Write short notes on cohort studies.
● Cohort studies allow an explicit focus on the social and cultural context that frames the
experiences, behavior, and decisions of individuals.
● For example, in the case of the 1958 British Birth Cohort study, it is important to
understand the cohort's educational experiences in the context of profound changes in the
organization of secondary education during the 1960s and 1970s, and the rapid expansion
of higher education, which was well underway by the time cohort members left school in the mid-1970s.
14. What is a cross-sectional survey?
The change over time is determined by conducting two surveys asking the same questions at
different points in historical time. This is known as a repeated cross-sectional survey.
15. Write short notes on time-based indexing.
Time-based indexing is a very powerful method of the pandas library when it comes to time series
data. It allows using a formatted string to select data. Example:
df_power.loc['2015-10-02']
Output:
Consumption 1391.05
Wind 81.229
Solar 160.641
Wind+Solar 241.87
Year 2015
Month 10
Weekday Name Friday
Name: 2015-10-02 00:00:00, dtype: object
Here, the pandas dataframe loc accessor is used. The date is used as a string to select a row. All
sorts of techniques can be used to access rows just as we can do with a normal dataframe index.
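For example, a partial date string or a date range can also be used (assuming the same df_power dataframe with a DatetimeIndex):

df_power.loc['2015-10']                   # all rows for October 2015
df_power.loc['2015-10-02':'2015-10-09']   # an inclusive date-range slice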
16. What is the major issue in longitudinal studies?
A major methodological issue in longitudinal studies is the problem of attrition, i.e., the dropout of
participants through successive waves of a prospective study.
17. What is a univariate time series? Give an example.
● The series that captures a sequence of observations for the same variable over a particular
duration of time is referred to as univariate time series.
● In general, the observations are taken over regular time periods, such as the change in
temperature over time throughout a day.
18. What is event history analysis?
Event history analysis focuses on the timing of events or the duration until a particular event
occurs, rather than changes in attributes over time.
19. What are the two main approaches to longitudinal data analysis?
● Repeated measures analysis
● Event history analysis or Event history modeling
20. What is repeated measures analysis?
Repeated measures analysis focuses on the changes in an individual attribute over time. For
example, weight, performance score, attitude, voting behavior, reaction time, depression, etc.
21. What is a time series?
An ordered sequence of timestamp values at equally spaced intervals is referred to as a time series.
It is a collection of observations made sequentially in time.
22. List the applications of time series analysis.
Analysis of a time series is used in many applications such as sales forecasting, utility studies,
budget analysis, economic forecasting, inventory studies, etc.
UNIT-V / PART-B
1. Explain the following:
a. Causality
b. Multiple causality
c. Direct and indirect effects
2. Describe in detail about the assumptions required to infer causes.
3. Consider a hypothetical example of the causes of absenteeism from work. Suppose previous
research had shown a positive bivariate relationship between low social status jobs and
absenteeism. Is there something about such jobs that directly causes the people who do them to go
off sick more than others? Discuss assumptions and possible outcomes.
4. Explain Simpson's paradox with an example.
5. Consider as a test factor, a variable that represents the extent to which the respondent suffers from
chronic nervous disorders, such as sleeplessness, anxiety, and so on. Such conditions would be
likely to lead to absence from work. It is also quite conceivable that they could be caused in part by
stressful, low-status jobs. Therefore, assume that nervous disorders act as an intervening variable.
What will happen to the original relationship if we control for this test factor? Show similar
possible outcomes.
6. Explain in detail longitudinal data collection. Give examples of longitudinal studies.
7. Explain event history modelling. Discuss the various approaches to event history modelling.
8. Perform the following and show the outputs:
(i) Randomly generate a normalized time series dataset using the Numpy library.
(ii) Plot the time series data using the seaborn library
(iii) Generate an array of the cumulative sum of the data
(iv) Plot the data using a time series plot
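A sketch of these four steps is given below; the dates, seed, and series length are illustrative choices, not prescribed by the question:

# A minimal sketch of the four steps: generate, plot, cumulate, plot again.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# (i) randomly generate a normalized (standard normal) time series
rng = np.random.default_rng(0)
dates = pd.date_range('2023-01-01', periods=200, freq='D')
values = rng.standard_normal(200)
series = pd.Series(values, index=dates)

# (ii) plot the time series data using seaborn
sns.lineplot(x=series.index, y=series.values)
plt.show()

# (iii) generate an array of the cumulative sum of the data
cumulative = np.cumsum(values)

# (iv) plot the cumulative sums as a time series plot
sns.lineplot(x=dates, y=cumulative)
plt.show()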
9. Perform TSA with Open Power System Data. Describe the characteristics of time series data.
10. The Open Power System data consists of four columns: 'Consumption', 'Wind', 'Solar', and 'Wind+Solar'. Write the code to visualize the Open Power System dataset.
(i) Generate a line plot of the full time series of Germany's daily electricity consumption
(ii) Plot the data for all the other columns
(iii) Visualize the electricity consumption between 2016-12-23 and 2016-12-30
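A sketch of such code, assuming the dataset has been loaded into a dataframe named df_power with a DatetimeIndex and the four columns listed in the question:

# A minimal sketch; df_power is assumed to be indexed by date with columns
# 'Consumption', 'Wind', 'Solar', and 'Wind+Solar'.
import matplotlib.pyplot as plt

# (i) line plot of the full time series of daily electricity consumption
df_power['Consumption'].plot(linewidth=0.5, figsize=(11, 4))
plt.ylabel('Consumption')
plt.show()

# (ii) plot the data for all the columns as separate subplots
cols = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
df_power[cols].plot(subplots=True, figsize=(11, 8))
plt.show()

# (iii) consumption between 2016-12-23 and 2016-12-30, via time-based indexing
df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o')
plt.show()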