devish all unit
45 PERIODS
1. Install a data analysis and visualization tool: R / Python / Tableau Public / Power BI.
2. Perform exploratory data analysis (EDA) on datasets such as an email dataset. Export all your emails as a dataset, import
them into a pandas data frame, visualize them, and draw insights from the data.
3. Work with NumPy arrays, pandas data frames, and basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and
visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform data analysis and representation on a map using various map datasets with mouse rollover effects, user
interaction, etc.
7. Build cartographic visualizations for multiple datasets involving various countries of the world, and states and districts in India,
etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and present an analysis report.
TOTAL: 75 PERIODS
TEXT BOOKS:
1. Suresh Kumar Mukhiya, Usman Ahmed, “Hands-On Exploratory Data Analysis with Python”, Packt Publishing, 2020.
(Unit 1)
2. Jake VanderPlas, “Python Data Science Handbook: Essential Tools for Working with Data”, O’Reilly, 1st Edition, 2016.
(Unit 2)
3. Catherine Marsh, Jane Elliott, “Exploring Data: An Introduction to Data Analysis for Social Scientists”, Wiley
Publications, 2nd Edition, 2008. (Unit 3,4,5)
REFERENCES:
1. Eric Pimpler, “Data Visualization and Exploration with R”, GeoSpatial Training Services, 2017.
2. Claus O. Wilke, “Fundamentals of Data Visualization”, O’Reilly, 2019.
3. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization: Foundations, Techniques, and
Applications”, 2nd Edition, CRC press, 2015.
UNIT I EXPLORATORY DATA ANALYSIS
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for EDA-
Data transformation techniques-merging database, reshaping and pivoting, Transformation techniques -
Grouping Datasets - data aggregation – Pivot tables and cross-tabulations.
UNIT-I / PART-A
1. What is data?
Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements,
observations, or even descriptions of things.
2. What is a dataset? Give an example.
● A dataset contains many observations about a particular object.
● For instance, a dataset about patients in a hospital can contain many observations.
● A patient can be described by a patient identifier (ID), name, address, weight, date of birth,
email, and gender.
● Each of these features that describe a patient is a variable. Each observation can have a
specific value for each of these variables.
● For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = [email protected]
Weight = 10
Gender = Female
3. What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.
4. List some of the data summarization techniques.
Some of the techniques used for data summarization are
● summary tables
● graphs
● descriptive statistics
● inferential statistics
● correlation statistics
● searching
● grouping
● mathematical models
5. Write briefly about the data collection phase.
● Data collected from several sources must be stored in the correct format and transferred to the
right information technology personnel within a company.
● Data can be collected from several objects during several events using different types of
sensors and storage tools.
6. Write a short note on data cleaning.
● Data must be correctly transformed for an incompleteness check, duplicates check, error
check, and missing value check.
● These tasks are performed in the data cleaning stage, which involves responsibilities such as
matching the correct record, finding inaccuracies in the dataset, understanding the overall
data quality, removing duplicate items, and filling in the missing values.
7. What is a categorical dataset? Give two examples.
● This type of data represents the characteristics of an object.
● Such data is often referred to as qualitative data in statistics.
● Examples of categorical data:
o Gender (Male, Female, Other, or Unknown)
o Blood type (A, B, AB, or O)
8. List the methods involved in the data preparation step.
The data preparation step involves
● defining the sources of data
● defining data schemas and tables
● understanding the main characteristics of the data
● cleaning the dataset
● deleting non-relevant datasets
● transforming the data
● dividing the data into required chunks for analysis
9. What is data visualization?
Data visualization deals with information relay techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed result.
10. Write short notes on the significance of EDA.
● It is practically impossible to make sense of datasets containing more than a handful of data
points without the help of computer programs.
● Exploratory data analysis is key, and usually the first exercise in data mining.
● It allows us to visualize data to understand it as well as to create hypotheses for further
analysis.
● The exploratory analysis centers around creating a synopsis of data or insights for the next
steps in a data mining project.
● EDA actually reveals the ground truth about the content without making any underlying
assumptions.
11. List the expert tools for exploratory analysis and mention their purpose.
Python provides expert tools for exploratory analysis:
● pandas for summarization
● scipy for statistical analysis
● matplotlib and plotly for visualizations
12. List the common tasks in the data processing stage.
The common tasks in the data processing stage include
● exporting the dataset
● placing them under the right tables
● structuring them, and
● exporting them in the correct format
29. Write the code snippet to create a transformed version of the dataframe.
To create a transformed version of the dataframe, the rename() method can be used. This method
returns a new dataframe and leaves the original data unmodified unless inplace=True is passed.
dframe1.rename(index=str.title, columns=str.upper)
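A minimal sketch of how rename() behaves, assuming a small made-up dframe1 (the column names, index labels, and values below are hypothetical and only serve the illustration):

```python
import pandas as pd

# Hypothetical data; only the rename() call comes from the answer above.
dframe1 = pd.DataFrame(
    {"rainfall": [10, 20], "temperature": [30, 25]},
    index=["bergen", "oslo"],
)

# str.title is applied to every index label, str.upper to every column name.
dframe2 = dframe1.rename(index=str.title, columns=str.upper)

print(dframe2)
print(dframe1.columns.tolist())  # the original dataframe keeps its lowercase names
```

Because rename() returns a new object, the result has to be assigned to a name (here dframe2) to be kept.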
30. What are outliers? Why should we detect and filter them?
● Outliers are data points that diverge from other observations for several reasons. During the
EDA phase, one of our common tasks is to detect and filter these outliers.
● The main reason for detecting and filtering outliers is that their presence
can cause serious issues in statistical analysis.
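One possible way to detect and filter outliers is sketched below. It assumes the common 1.5 x interquartile-range rule, which is a choice made for this illustration and is not prescribed by the answer above; the values are made up.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]    # detected outliers
filtered = values[~is_outlier]   # data with the outliers filtered out

print(outliers.tolist())
print(len(filtered))
```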
31. Write short notes on the groupby function.
● The pandas groupby function is one of the most efficient and time-saving features for
categorizing a dataset into multiple categories or groups.
● Groupby provides functionalities that allow us to split-apply-combine throughout the
dataframe.
32. Write the two essential functions of the groupby function.
● It splits the data into groups based on some criteria.
● It applies a function to each group independently.
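The split and apply steps can be sketched as follows, assuming a small hypothetical table of car prices (the column names and figures are invented for the example):

```python
import pandas as pd

# Hypothetical dataset: body style and price of five cars.
cars = pd.DataFrame({
    "body_style": ["sedan", "sedan", "wagon", "wagon", "hatchback"],
    "price": [10000, 14000, 9000, 11000, 8000],
})

# Split: one group per distinct body_style value.
groups = cars.groupby("body_style")

# Apply: the mean is computed for each group independently,
# and pandas combines the per-group results into one Series.
mean_price = groups["price"].mean()

print(mean_price)
```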
33. How will you consistently rearrange the data in a dataframe?
The data in a dataframe can be rearranged in a consistent manner with hierarchical indexing
using two actions:
● Stacking: stack() rotates the data from the columns into the rows.
● Unstacking: unstack() rotates the data from the rows into the columns.
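A short sketch of both actions, assuming a hypothetical 2 x 3 dataframe (the city and column labels are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with named row and column indexes.
dframe = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=pd.Index(["Bergen", "Oslo"], name="city"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)

# stack(): the column labels become an inner level of the row index,
# giving a Series with a hierarchical (city, number) index.
stacked = dframe.stack()

# unstack(): the inner row-index level is rotated back into columns.
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```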
UNIT-II / PART-A
1. Write short notes on plt.show() command.
● The plt.show() command starts an event loop, looks for all currently active figure objects, and
opens one or more interactive windows that display the figures.
● This command should be used only once per Python session, and is most often seen at the
very end of the script.
● Multiple show() commands can lead to unpredictable backend-dependent behavior, and
should be avoided.
2. Give an example to create line plots.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), '-')   # solid line
plt.plot(x, np.cos(x), '--')  # dashed line
plt.show()
6. What is a histogram? How do you plot a simple histogram using matplotlib in Python?
The histogram is the graphical representation that organizes a group of data points into the specified
range. A histogram divides the variable into bins, counts the data points in each bin, and shows the
bins on the x-axis and the counts on the y-axis.
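# In a Jupyter notebook, first run: %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Matplotlib 3.6 and later name this style 'seaborn-v0_8-white'; older releases use 'seaborn-white'
plt.style.use('seaborn-v0_8-white')
data = np.random.randn(1000)
plt.hist(data)
plt.show()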
7. What are legends in data visualization?
Plot legends give meaning to a visualization, assigning meaning to the various plot elements. The
simplest legend can be created with the plt.legend() command, which automatically creates a legend
for any labeled plot elements.
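A minimal sketch of a legend, assuming two curves with the illustrative labels 'Sine' and 'Cosine'; it uses the object-oriented ax.legend() call, which is the axes-level counterpart of plt.legend(). The output file name is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
# The label= keyword supplies the text that the legend will show.
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')

# Builds the legend from every labeled plot element on these axes.
legend = ax.legend(loc='upper right')

fig.savefig('legend_example.png')  # arbitrary file name for the sketch
```

In an interactive session, plt.show() can be called instead of saving the figure.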
Runs in overs   10    20
MI              110   224
RCB             85    210
16. Mention the use of the plt.axis() method. How do you set the axis limits with plt.axis()?
The plt.axis() method is used to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin, ymax].
Example:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);
18. What is the use of the color keyword in plotting? Give examples to control the colors of plot
20. Write code to plot a line chart depicting the run rate of a T20 match from the given data:
Overs   Runs
5       45
10      79
15      145
20      234
import matplotlib.pyplot as plt
overs = [5, 10, 15, 20]
runs = [45, 79, 145, 234]  # runs scored by each over mark, as given in the table
plt.plot(overs, runs)
plt.xlabel('Overs')
plt.ylabel('Runs')
plt.show()
23. How do you set multiple plot properties at once using the object-oriented interface?
In the object-oriented interface to plotting, rather than calling the functions individually, it is often
more convenient to use the ax.set() method to set all the properties at once.
Example:
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2), xlabel='x', ylabel='sin(x)', title='A Simple Plot');
26. Give the command for importing matplotlib. Draw a scatter plot with green dots using
matplotlib.
import matplotlib.pyplot as plt
plt.plot([1,2,3,4,5], [1,2,3,4,10], 'go')
plt.show()
30. What are the pre-defined transforms when considering the placement of text on a figure?
There are three pre-defined transforms:
ax.transData: Transform associated with data coordinates
ax.transAxes: Transform associated with the axes (in units of axes dimensions)
fig.transFigure: Transform associated with the figure (in units of figure dimensions)
31. What are the ways of importing matplotlib?
Matplotlib can be imported in the following two ways:
● Using alias name: import matplotlib.pyplot as plt
● Without alias name: import matplotlib.pyplot
32. What are transData and transAxes coordinates?
● The transData coordinates give the usual data coordinates associated with the x and y-axis
labels.
● The transAxes coordinates give the location from the bottom-left corner of the axes as a
fraction of the axes’ size.
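The difference between the coordinate systems can be sketched by placing text with each transform. The axis limits, text strings, positions, and output file name below are arbitrary choices for the illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axis([0, 10, 0, 10])

# transData: (1, 5) is read in data coordinates (x = 1, y = 5 on the axis labels).
t_data = ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)

# transAxes: (0.5, 0.1) is a fraction of the axes size, measured from its bottom-left corner.
t_axes = ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)

# transFigure: (0.2, 0.2) is a fraction of the whole figure size.
t_fig = ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure)

fig.savefig("transforms_example.png")  # arbitrary file name for the sketch
```

If the axis limits are changed afterwards, only the transData text moves relative to the axes frame; the other two stay where they are.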
33. How do you create a contour plot?
● A contour plot can be created with the plt.contour function.
● It takes three arguments:
o a grid of x values
o a grid of y values
o a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the contour
levels.
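A minimal sketch of the three grids, assuming an arbitrary example function z = f(x, y) chosen only for illustration; np.meshgrid builds the x and y grids from one-dimensional arrays, and ax.contour is the axes-level counterpart of plt.contour.

```python
import numpy as np
import matplotlib.pyplot as plt


def f(x, y):
    # Arbitrary smooth function used only to generate z values.
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)


x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)

# Two-dimensional grids of x and y positions, and the z value at each position.
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig, ax = plt.subplots()
contours = ax.contour(X, Y, Z, colors='black')

fig.savefig('contour_example.png')  # arbitrary file name for the sketch
```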
34. Write the expression for listing the first five available plot styles.
plt.style.available[:5]
Output (the exact names depend on the installed Matplotlib version):
['fivethirtyeight', 'seaborn-pastel', 'seaborn-whitegrid', 'ggplot', 'grayscale']
35. Write code to use a gray background and draw solid white grid lines.
(i) Use a gray background
ax = plt.axes(facecolor='#E6E6E6')  # the older 'axisbg' keyword was removed from Matplotlib
ax.set_axisbelow(True)
(ii) Draw solid white grid lines
plt.grid(color='w', linestyle='solid')
41. What is seaborn?
● Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and
color defaults, defines simple high-level functions for common statistical plot types, and
integrates with the functionality provided by Pandas DataFrames.
● It provides high-level commands to create a variety of plot types useful for statistical data
exploration and even some statistical model fitting.
UNIT-II / PART-B
1. Show the implementation code describing the below components of the matplotlib module in python
a. Create a simple figure and axes.
b. Control the line colors and styles
c. Adjust the axis limits
d. Label the Plots
2. Describe the scatter plot in detail and demonstrate the scatter plot using plt.plot and plt.scatter.
Differentiate both plots.
3. Describe in detail about Density and Contour Plots.
4. Explain Histograms, Binnings, and Density in detail.
5. Illustrate the three Matplotlib functions that can be useful to display three-dimensional data in two
dimensions.
UNIT-III / PART-A
1. What is sampling?
Sampling is a method that allows us to get information about the population based on the
statistics from a subset of the population (sample or case), without having to investigate every
individual.
2. What are the two basic units of data analysis?
● Cases and variables are the two organizing concepts that are considered the basis of data
analysis.
● The cases are the samples about which information is collected.
● The information is collected on certain features of all the cases.
● These features are the variables that vary across different cases.
Example: In a survey of individuals, their income, sex, and age are some of the variables that
might be recorded.
3. What are the measurement scales used by social scientists?
Many of the variables used by social scientists are measured on nominal scales or ordinal scales
(also referred to as categorical variables), rather than interval scales (also referred to as continuous
variables).
4. What are two techniques for reducing the number of digits?
Where Y is conventionally used to refer to an actual variable. The subscript ‘i’ is an index that
indicates which case is being referred to, and N is the number of data points.
21. What is a median?
● The median is the middle value of the dataset: the data are sorted from smallest to largest (or
largest to smallest) and the value in the middle of the set is taken.
● It is the value of the case that has equal numbers of data points above and below it.
● With N data points, the median M is the value at depth $(N + 1)/2$.
22. What are quartiles and midspread?
● The points which divide the distribution into quarters are called the quartiles (or hinges or
fourths).
● The lower quartile is usually denoted QL and the upper quartile QU. The middle quartile is
the median.
● The distance between QL and QU is called the midspread (dQ or interquartile range).
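The median, quartiles, and midspread can be sketched with depths, using the fifteen working-hours values that appear in this unit. The quartile depth rule used here, (truncated median depth + 1)/2, and the helper value_at_depth are assumptions of this sketch rather than something stated in the answers above.

```python
def value_at_depth(sorted_values, depth):
    """Return the value at a 1-based depth; a depth ending in .5 averages the two neighbours."""
    lower = int(depth)
    if depth == lower:
        return float(sorted_values[lower - 1])
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2


hours = sorted([30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80])
n = len(hours)

median_depth = (n + 1) / 2                    # 15 cases give depth 8
quartile_depth = (int(median_depth) + 1) / 2  # assumed rule: depth 4.5

median = value_at_depth(hours, median_depth)
q_lower = value_at_depth(hours, quartile_depth)        # counted from the smallest value
q_upper = value_at_depth(hours[::-1], quartile_depth)  # counted from the largest value
midspread = q_upper - q_lower

print(median, q_lower, q_upper, midspread)
```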
23. Why can the range not be recommended as a summary measure of spread?
● Range only uses information from two data points, and these are drawn from the most
unreliable part of the data.
● Therefore, despite its intuitive appeal, it cannot be recommended as a summary measure of
spread.
24. Write a brief note on the usefulness of the mean and the standard deviation measures.
● The mean and the standard deviation are less resistant to extreme values than the median and
the midspread.
● The median and the midspread are therefore often preferable for much descriptive and
exploratory work, especially when there are measurement errors.
● The mean and the standard deviation, however, can be used to make very precise statements
about the likely degree of sampling error in any data.
25. State Twyman's law for data analysis.
The more unusual or interesting the data, the more likely they are to have been the result of an error
of one kind or another.
26. There are 15 cases in the small dataset of men's working hours. The median is at depth 8, and
30,37,39,40,45,47,48,48,50,54,55,55,67,70,80
$s = \sqrt{\dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}} = \sqrt{\dfrac{2472}{14}} = 13.29$
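The value 13.29 is consistent with the sample standard deviation (N - 1 in the denominator) of the fifteen working-hours values listed in this unit. A quick check, assuming those values are the data the calculation refers to:

```python
from statistics import mean, stdev

hours = [30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80]

y_bar = mean(hours)
# Sum of squared deviations from the mean.
sum_sq = sum((y - y_bar) ** 2 for y in hours)

# statistics.stdev divides the sum of squares by N - 1 before taking the square root.
s = stdev(hours)

print(y_bar, sum_sq, round(s, 2))
```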
31. Write a brief note on Gaussian distribution.
● Gaussian distributions are bell-shaped and have the convenient property of being
reproducible from their mean and standard deviation.
● Given these two pieces of information, the exact shape of the curve can be reconstructed,
and the proportion of the area under the curve falling between various points can be
calculated.
● Gaussian distribution is the one that, when used to represent a sample, involves the simplest
calculations from sample values.
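As a sketch of how the proportion of the area between two points follows from the mean and standard deviation alone, the example below borrows the mean of 75 and standard deviation of 12 from the test-of-ability question in Part B; the cut-off points 63, 87, and 99 are chosen here for illustration.

```python
from statistics import NormalDist

# Gaussian distribution fully described by its mean and standard deviation.
ability = NormalDist(mu=75, sigma=12)

# Proportion of cases within one standard deviation of the mean (63 to 87).
within_one_sd = ability.cdf(87) - ability.cdf(63)

# Proportion of cases more than two standard deviations above the mean (above 99).
above_two_sd = 1 - ability.cdf(99)

print(round(within_one_sd, 3), round(above_two_sd, 3))
```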
32. Explain Lorenz curves.
● Lorenz curves have visual appeal because they portray how near total equality or total
inequality a particular distribution falls.
● The degree of inequality in two distributions can be compared by superimposing their
Lorenz curves.
33. Define the Gini Coefficient.
● A measure that summarizes what is happening across all the distribution is the Gini
coefficient.
● The Gini coefficient expresses the ratio between the area between the Lorenz curve and the
line of total equality and the total area in the triangle formed between the perfect equality
and perfect inequality lines.
● It therefore varies between 0 and 1 although it is sometimes multiplied by 100 to express the
coefficient in percentage form.
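The ratio of areas can be sketched numerically. The functions lorenz_curve and gini below are hypothetical helpers written for this illustration, the incomes are made up, and the area under the Lorenz curve is approximated with the trapezoid rule.

```python
import numpy as np


def lorenz_curve(values):
    """Cumulative population share and cumulative income share, both starting at 0."""
    v = np.sort(np.asarray(values, dtype=float))
    income_share = np.insert(np.cumsum(v) / v.sum(), 0, 0.0)
    population_share = np.linspace(0.0, 1.0, len(income_share))
    return population_share, income_share


def gini(values):
    """Area between the line of equality and the Lorenz curve, divided by the triangle area."""
    pop, inc = lorenz_curve(values)
    # Trapezoid rule for the area under the Lorenz curve.
    area_under_lorenz = np.sum((inc[1:] + inc[:-1]) / 2 * np.diff(pop))
    triangle_area = 0.5
    return (triangle_area - area_under_lorenz) / triangle_area


print(gini([1, 1, 1, 1]))  # everyone has the same income
print(gini([0, 0, 0, 1]))  # one case holds all the income
```

With only four cases the second value is 0.75 rather than 1, because the discrete Lorenz curve cannot reach the perfect-inequality line; the coefficient approaches 1 as the number of cases grows.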
34. Explain smoothing in time series.
Smoothing is a technique applied to time series to remove the fine-grained variation between time
steps. The hope of smoothing is to remove noise and better expose the signal of the underlying
causal processes. Moving averages are a simple and common type of smoothing used in time series
analysis and time series forecasting.
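A moving average can be sketched with pandas, assuming a short made-up series and a window of three time steps (both are choices made for the illustration):

```python
import pandas as pd

# Hypothetical observations at eight consecutive time steps.
series = pd.Series([10, 12, 9, 14, 13, 18, 15, 20])

# Centered 3-point moving average: each value is replaced by the mean of
# itself and its two neighbours; the two end points have no full window.
smoothed = series.rolling(window=3, center=True).mean()

print(smoothed)
```

The first and last entries are missing (NaN) because a full three-point window is not available there.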
35. What is the effect on the features of a distribution of adding or subtracting a constant from every
data value? Why should we add or subtract a constant?
● The change made to the data by adding or subtracting a constant is fairly trivial.
● Only the level is affected; spread, shape, and outliers remain unaltered.
● The reason for adding or subtracting a constant from every data value is to make a division
above and below a particular point.
● This is also done to bring the data within a particular range.
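The effect can be checked on the working-hours values from this unit. Subtracting 40 is an arbitrary choice for the sketch, and the median and the standard deviation stand in for level and spread.

```python
import numpy as np

hours = np.array([30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80])

# Subtract a constant from every data value.
shifted = hours - 40

# Level: the median moves by exactly the constant.
level_before, level_after = np.median(hours), np.median(shifted)

# Spread: the standard deviation (N - 1 denominator) is unchanged.
spread_before, spread_after = np.std(hours, ddof=1), np.std(shifted, ddof=1)

print(level_before, level_after)
print(round(spread_before, 2), round(spread_after, 2))
```

Values below zero in the shifted data are the cases that fall under the chosen point of 40, which is the division the answer above describes.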
36. List the different smoothing processes used in refinement.
● Endpoint Smoothing
● Breaking the smooth
UNIT-III / PART-B
1. The dataset below shows the gross earnings in pounds per week of twenty men and twenty women
drawn randomly from the 1979 New Earnings Survey. The respondents are all full-time adult
workers. Men are deemed to be adult when they reach age 21; women when they reach age 18.
If you were told that the distribution of a test of ability on a set of children was Gaussian, with a
mean of 75 and a standard deviation of 12,
5. Write short notes on contingency table. Draw a schematic four-by-four contingency table.
● A contingency table shows the distribution of each variable conditional upon each category
of the other.
● The categories of one of the variables form the rows, and the categories of the other variable
form the columns.
● Each individual case is then tallied in the appropriate pigeonhole depending on its value on
both variables.
● The pigeonholes are called cells, and the number of cases in each cell is called the cell
frequency.
● Each row and column can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals.
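A contingency table with marginals can be sketched with pandas.crosstab. The eight survey responses below are invented for the illustration and do not come from any dataset in these notes.

```python
import pandas as pd

# Hypothetical survey: one row per individual case, two categorical variables.
survey = pd.DataFrame({
    "sex": ["Men", "Men", "Men", "Women", "Women", "Women", "Women", "Men"],
    "fear": ["Not afraid", "Afraid", "Not afraid", "Afraid",
             "Afraid", "Not afraid", "Afraid", "Not afraid"],
})

# Rows: categories of one variable; columns: categories of the other.
# margins=True adds the row and column totals (the marginals).
table = pd.crosstab(survey["sex"], survey["fear"], margins=True, margins_name="Total")

print(table)
```

Each inner entry is a cell frequency; the "Total" row and column hold the marginals.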
6. What are three different ways of representing contingency table in percentage form?
23. What is a boxplot?
The boxplot is a device for conveying the information in the five number summaries economically
and effectively.
Example:
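A boxplot can be sketched with Matplotlib as follows, reusing the fifteen working-hours values from Unit III as sample data; the axis label and output file name are assumptions made for the illustration.

```python
import matplotlib.pyplot as plt

hours = [30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80]

fig, ax = plt.subplots()
# The box spans the lower and upper quartiles and the line inside it marks the median.
# By default the whiskers stop at the most extreme points within 1.5 midspreads
# of the box, and points beyond that are drawn individually.
box = ax.boxplot(hours)
ax.set_ylabel("Weekly working hours")

fig.savefig("boxplot_example.png")  # arbitrary file name for the sketch
```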
26. What is GNI?
GNI is the sum of values of both final goods and services and investment goods in a country. If one
focuses on the production that is undertaken by the residents of that country, the income earned by
nationals from abroad has to be added to the gross domestic product, to arrive at the gross national
income.
Summarize the linear relationship between the two interval-level variables and explain in detail.
Also discuss the rules to draw the line.
2. Consider three variables: financial circumstances of a family (comfortable vs struggling); marital
status of parents (married and cohabiting vs divorced or separated) and educational outcomes for
children. Draw a causal path diagram indicating the relationships you would assume to exist
between these variables. Suppose the bivariate effect of parents' marital status on educational
outcomes for children was strongly positive. What would you expect to happen to the magnitude of
the effect once you had controlled for financial circumstances?
3. Describe in detail contingency tables and percentage tables with suitable examples.
4. Explain the guidelines to construct a lucid table of numerical data.
5. Examine the given data. Which group of women is most likely to have high levels of worry about
violent crime? And which group is least likely to have high levels of worry about violent crime?
Using the data in the table create a simpler table with just three age groups; 16-34; 35-54; and 55+.
Decide which should be the reference or base category and use the table to construct a causal path
diagram.
Women's age group   High levels of worry   Not worried       Total
                    P       N              P       N         P     N
16-24               0.32    686            0.68    1447      1     2133
25-34               0.27    1040           0.73    2815      1     3855
35-44               0.25    1224           0.75    3725      1     4949
45-54               0.24    952            0.76    3008      1     3960
55-64               0.22    943            0.78    3430      1     4373
65-74               0.22    781            0.78    2735      1     3516
75 or older         0.14    511            0.86    3044      1     3555
Total                       6,137                  20,204          26,341
6. Consider an imaginary piece of research in which 100 men and 100 women are asked about their
fear of walking alone after dark. Until we conduct the survey, we have no information other than the