Lecture-1 Descriptive Statistics

APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Statistics for Data Science(14CSHB-233)

Faculty: Prof. (Dr.) Madan Lal Saini(E13485)
Lecture – 1
Descriptive Statistics
DISCOVER . LEARN . EMPOWER
Statistics for Data Science: Course Objectives
The course aims to:
1. Equip students with the skills to summarize and interpret data using descriptive statistics and visualization techniques.
2. Develop a foundational understanding of probability and its applications in data science.
3. Enable students to perform hypothesis testing and construct confidence intervals for statistical inference.
4. Teach students how to build and assess linear and logistic regression models for predictive analysis.
5. Provide hands-on experience with statistical software for data manipulation, analysis, and visualization.
COURSE OUTCOMES
On completion of this course, the students shall be able to:
CO1: Summarize and describe the main features of a dataset using measures such as mean, median, mode, variance, and standard deviation, as well as graphical representations like histograms, box plots, and scatter plots.
CO2: Understand probability theory, including concepts such as random variables, probability distributions, and the law of large numbers, enabling them to model and reason about uncertainty in data.
CO3: Perform statistical inference, including hypothesis testing, confidence interval estimation, and p-value computation, to draw valid conclusions from sample data about larger populations.
CO4: Apply linear and logistic regression techniques to identify relationships between variables, make predictions, and evaluate model performance.
CO5: Utilize statistical software tools to perform data analysis, including data cleaning, transformation, visualization, and implementing various statistical methods.
Unit-1 Syllabus
Unit-1 Descriptive Statistics

Descriptive Statistics and Bayes Theorem
• Descriptive statistics basics
• Mean, median, and mode
• Standard deviation
• Use of the central tendency measures
• Bayes Theorem
Data Visualization
• Types of visualization
• Calculation and interpretation of graphs, plot, and measures
SUGGESTIVE READINGS
TEXT BOOKS:
• T1. Hastie, Trevor, et al., The elements of statistical learning. Vol. 2. No. 1. New York:
Publisher: Springer, Edition: Second Edition (2009), ISBN: 978-0387848570
• T2. Montgomery, Douglas C., and George C. Runger. Applied statistics and probability for
engineers. John Wiley & Sons, 2010.
• T3. Probability and Statistics The Science of Uncertainty Second Ed., Michael J. Evans and
Jeffrey S. Rosenthal.
REFERENCE BOOKS:
• R1. Practical Statistics for Data Scientists: 50 Essential Concepts, Authors: Peter Bruce, et al,
Publisher: O'Reilly Media, Edition: Second Edition (2020), ISBN: 978-1492072942
• R2. An Introduction to Statistical Learning: with Applications in R, Authors: Gareth James, et
al, Publisher: Springer, Edition: Second Edition (2021), ISBN: 978-1071614174
• R3. Think Stats: Exploratory Data Analysis in Python, Author: Allen B. Downey, Publisher:
O'Reilly Media, Publication Year: 2014 (2nd Edition), ISBN: 978-1491907337
Table of Contents
• Introduction to Statistics
• Measures of Central Tendency
• Mean, Median, Mode
• Examples
• Solved Questions
3. Descriptive Statistics
• Describing data with tables and graphs (quantitative or categorical variables)
• Numerical descriptions of center, variability, position (quantitative variables)
• Bivariate descriptions (in practice, most studies have several variables)

1. Tables and Graphs
Frequency distribution: Lists the possible values of a variable and the number of times each occurs
Example: Student survey (n = 60), www.stat.ufl.edu/~aa/social/data.html
“Political ideology” measured as an ordinal variable with 1 = very liberal, …, 4 = moderate, …, 7 = very conservative
Histogram: Bar graph of frequencies or percentages
Shapes of histograms (for quantitative variables)
• Bell-shaped (IQ, SAT, political ideology in the U.S. as a whole)
• Skewed right (annual income, number of times arrested)
• Skewed left (score on an easy exam)
• Bimodal (polarized opinions)
Ex. GSS data on sex before marriage in Exercise 3.73: always wrong, almost always wrong, wrong only sometimes, not wrong at all
Category counts: 238, 79, 157, 409
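As a small illustration of a frequency distribution, the sketch below turns the four GSS category counts listed above into percentages. The variable names are chosen here for illustration only; the counts are the ones given on the slide.

```python
# Category counts as listed on the slide (GSS, sex before marriage).
counts = {
    "always wrong": 238,
    "almost always wrong": 79,
    "wrong only sometimes": 157,
    "not wrong at all": 409,
}

total = sum(counts.values())

# Relative frequency (percentage) of each category.
percentages = {category: 100 * count / total for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category:<22} {count:>4} {percentages[category]:5.1f}%")
print(f"{'Total':<22} {total:>4}")
```

The same percentages are what a bar graph of this categorical variable would display on its vertical axis.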
Stem-and-leaf plot (John Tukey, 1977)
Example: Exam scores (n = 40 students)
Stem Leaf
3 6
4
5 37
6 235899
7 011346778999
8 00111233568889
9 02238
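The display above can be rebuilt from the raw scores. The following is a minimal sketch, assuming the 40 scores are exactly those read back off the stems and leaves shown; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Exam scores read back off the stem-and-leaf display above (n = 40).
scores = [36, 53, 57, 62, 63, 65, 68, 69, 69,
          70, 71, 71, 73, 74, 76, 77, 77, 78, 79, 79, 79,
          80, 80, 81, 81, 81, 82, 83, 83, 85, 86, 88, 88, 88, 89,
          90, 92, 92, 93, 98]

def stem_and_leaf(values):
    """Return (stem, leaves) rows; stem = tens digit, leaf = units digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append((stem, "".join(str(d) for d in leaves.get(stem, []))))
    return rows

rows = stem_and_leaf(scores)
print("Stem Leaf")
for stem, leaf in rows:
    print(stem, leaf)
```

Keeping the empty stem 4 in the output preserves the gap between the lowest score and the rest, which is part of what the plot is meant to show.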
2. Numerical descriptions
Let y denote a quantitative variable, with observations $y_1, y_2, y_3, \ldots, y_n$
a. Describing the center
Median: Middle measurement of ordered sample
Mean:
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum y_i}{n}$$
Example: Annual per capita carbon dioxide emissions (metric tons) for the n = 8 largest nations in population size
Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4, Pakistan 0.7, Russia 9.9, U.S. 20.1
Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
Median = (1.4 + 1.8)/2 = 1.6
Mean $\bar{y}$ = (0.3 + 0.7 + 1.2 + … + 20.1)/8 = 4.7
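The median and mean of the emissions example can be checked with Python's standard `statistics` module. This is only a sketch of the calculation; the dictionary name `emissions` is chosen here for illustration.

```python
import statistics

# Per capita CO2 emissions (metric tons) from the example above.
emissions = {
    "Bangladesh": 0.3, "Brazil": 1.8, "China": 2.3, "India": 1.2,
    "Indonesia": 1.4, "Pakistan": 0.7, "Russia": 9.9, "U.S.": 20.1,
}

ordered = sorted(emissions.values())
median = statistics.median(ordered)  # averages the two middle values when n is even
mean = statistics.mean(ordered)

print("Ordered sample:", ordered)
print("Median =", round(median, 1))
print("Mean   =", round(mean, 1))
```

The mean lands well above the median because the two largest values (Russia and the U.S.) pull it toward the long right tail.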
Properties of mean and median
• For symmetric distributions, mean = median
• For skewed distributions, mean is drawn in direction of longer tail, relative to median
• Mean valid for interval scales, median for interval or ordinal scales
• Mean sensitive to “outliers” (median often preferred for highly skewed distributions)
• When distribution symmetric or mildly skewed or discrete with few values, mean preferred because it uses numerical values of observations
Examples:
• New York Yankees baseball team, 2006: mean salary = $7.0 million, median salary = $2.9 million
How is this possible? Direction of skew?
• Give an example for which you would expect mean < median

b. Describing variability
Range: Difference between largest and smallest observations (but highly sensitive to outliers, insensitive to shape)
Standard deviation: A “typical” distance from the mean
The deviation of observation i from the mean is $y_i - \bar{y}$
The variance of the n observations is
$$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$$
The standard deviation s is the square root of the variance,
$$s = \sqrt{s^2}$$
Example: Political ideology
• For those in the student sample who attend religious services at least once a week (n = 9 of the 60),
• y = 2, 3, 7, 5, 6, 7, 5, 6, 4
$\bar{y} = 5.0$,
$$s^2 = \frac{(2 - 5)^2 + (3 - 5)^2 + \cdots + (4 - 5)^2}{9 - 1} = \frac{24}{8} = 3.0$$
$$s = \sqrt{3.0} = 1.7$$
For the entire sample (n = 60), mean = 3.0 and standard deviation = 1.6; the entire sample tends to have similar variability but be more liberal
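The variance and standard deviation in the political ideology example can be reproduced step by step. The sketch below follows the n - 1 formula from this section; the variable names are illustrative.

```python
import math
import statistics

# Political ideology scores for the n = 9 weekly attenders in the example.
y = [2, 3, 7, 5, 6, 7, 5, 6, 4]

n = len(y)
y_bar = sum(y) / n                               # sample mean
sum_sq_dev = sum((yi - y_bar) ** 2 for yi in y)  # sum of squared deviations
variance = sum_sq_dev / (n - 1)                  # divide by n - 1, not n
s = math.sqrt(variance)                          # standard deviation

print("mean =", y_bar)
print("sum of squared deviations =", sum_sq_dev)
print("variance =", variance)
print("s =", round(s, 1))

# The standard library's stdev uses the same n - 1 divisor.
assert math.isclose(s, statistics.stdev(y))
```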
• Properties of the standard deviation:
• s ≥ 0, and only equals 0 if all observations are equal
• s increases with the amount of variation around the mean
• Division by n - 1 (not n) is due to technical reasons (later)
• s depends on the units of the data (e.g. measure euro vs $)
• Like mean, affected by outliers
• Empirical rule: If distribution is approx. bell-shaped,
  ◦ about 68% of data within 1 standard dev. of mean
  ◦ about 95% of data within 2 standard dev. of mean
  ◦ all or nearly all data within 3 standard dev. of mean
Example: SAT with mean = 500, s = 100
(sketch picture summarizing data)
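For the SAT example, the empirical rule translates into three intervals around the mean. The helper name `empirical_rule_intervals` below is hypothetical, and the percentages apply only when the distribution is roughly bell-shaped.

```python
def empirical_rule_intervals(mean, s):
    """Intervals of 1, 2, and 3 standard deviations around the mean."""
    return {k: (mean - k * s, mean + k * s) for k in (1, 2, 3)}

# Approximate shares from the empirical rule (bell-shaped data only).
shares = {1: "about 68%", 2: "about 95%", 3: "all or nearly all"}

intervals = empirical_rule_intervals(mean=500, s=100)
for k, (low, high) in intervals.items():
    print(f"{shares[k]} of scores between {low} and {high}")
```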
Example: y = number of close friends you have
GSS: The variable ‘frinum’ has mean 7.4, s = 11.0
Probably highly skewed: right or left?
Empirical rule fails; in fact, median = 5, mode = 4
Example: y = selling price of home in Syracuse, NY.
If mean = $130,000, which is realistic?
s = 0, s = 1000, s = 50,000, s = 1,000,000

c. Measures of position
pth percentile: p percent of observations below it, (100 - p)% above it.
• p = 50: median
• p = 25: lower quartile (LQ)
• p = 75: upper quartile (UQ)
• Interquartile range IQR = UQ - LQ
Quartiles portrayed graphically by box plots (John Tukey)
Example: weekly TV watching for n = 60 from student survey data file, 3 outliers
Box plots have a box from LQ to UQ, with median marked. They portray a five-number summary of the data:
Minimum, LQ, Median, UQ, Maximum
except for outliers identified separately
Outlier = observation falling below LQ - 1.5(IQR) or above UQ + 1.5(IQR)
Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers are observations above 10 + 1.5(8) = 22
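The 1.5(IQR) outlier criterion can be written as a short function. This is a sketch using the quartiles from the example; the names `outlier_fences` and `is_outlier` are hypothetical.

```python
def outlier_fences(lq, uq):
    """Return (lower, upper) cutoffs from the 1.5 * IQR rule."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr

def is_outlier(value, lq, uq):
    """True if value falls below the lower cutoff or above the upper one."""
    lower, upper = outlier_fences(lq, uq)
    return value < lower or value > upper

lower, upper = outlier_fences(lq=2, uq=10)
print("IQR =", 10 - 2)
print("Outliers fall below", lower, "or above", upper)
```

With LQ = 2 and UQ = 10 the lower cutoff is negative, so for a variable that cannot be negative (such as hours of TV watched) only the upper cutoff of 22 can flag outliers.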

3. Bivariate description
• Usually we want to study associations between two or more variables (e.g., how does number of close friends depend on gender, income, education, age, working status, rural/urban, religiosity…)
• Response variable: the outcome variable
• Explanatory variable(s): defines groups to compare
Ex.: number of close friends is a response variable, while gender, income, … are explanatory variables
Response var. also called “dependent variable”
Explanatory var. also called “independent variable”
Summarizing associations:
• Categorical var’s: show data using contingency tables
• Quantitative var’s: show data using scatterplots
• Mixture of categorical var. and quantitative var. (e.g., number of close friends and gender) can give numerical summaries (mean, standard deviation) or side-by-side box plots for the groups
• Ex. General Social Survey (GSS) data
Men: mean = 7.0, s = 8.4
Women: mean = 5.9, s = 6.0
Shape? Inference questions for later chapters?
Example: Income by highest degree
Contingency Tables
• Cross classifications of categorical variables in which rows (typically) represent categories of explanatory variable and columns represent categories of response variable.
• Counts in “cells” of the table give the numbers of individuals at the corresponding combination of levels of the two variables
Happiness and Family Income (GSS 2008 data: “happy,” “finrela”)
                      Happiness
Income         Very   Pretty   Not too   Total
-----------------------------------------------
Above Aver.     164      233        26     423
Average         293      473       117     883
Below Aver.     132      383       172     687
-----------------------------------------------
Total           589     1089       315    1993
Can summarize by percentages on response variable (happiness)
Example: Percentage “very happy” is
39% for above aver. income (164/423 = 0.39)
33% for average income (293/883 = 0.33)
19% for below average income (132/687 = 0.19)
                           Happiness
Income     Very         Pretty       Not too      Total
--------------------------------------------------------
Above      164 (39%)    233 (55%)     26 (6%)       423
Average    293 (33%)    473 (54%)    117 (13%)      883
Below      132 (19%)    383 (56%)    172 (25%)      687
--------------------------------------------------------
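The row percentages in the table can be recomputed from the cell counts. The sketch below conditions on the explanatory variable (income) and reports the distribution of the response (happiness) within each row; the names `table` and `row_percentages` are illustrative.

```python
# Counts from the happiness by family income table (GSS 2008).
table = {
    "Above average": {"Very": 164, "Pretty": 233, "Not too": 26},
    "Average":       {"Very": 293, "Pretty": 473, "Not too": 117},
    "Below average": {"Very": 132, "Pretty": 383, "Not too": 172},
}

def row_percentages(counts):
    """Return the row total and each cell as a percentage of that total."""
    total = sum(counts.values())
    return total, {level: 100 * n / total for level, n in counts.items()}

summary = {income: row_percentages(counts) for income, counts in table.items()}
for income, (total, pct) in summary.items():
    cells = ", ".join(f"{level} {value:.0f}%" for level, value in pct.items())
    print(f"{income:<14} n = {total:>3}: {cells}")
```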
Inference questions for later chapters? (i.e., what can we conclude about the corresponding population?)
Scatterplots (for quantitative variables)
Plot response variable on vertical axis, explanatory variable on horizontal axis
Example: Table 9.13 (p. 294) shows UN data for several nations on many variables, including fertility (births per woman), contraceptive use, literacy, female economic activity, per capita gross domestic product (GDP), cell-phone use, CO2 emissions
Data available at http://www.stat.ufl.edu/~aa/social/data.html
Example: Survey in Alachua County, Florida, on predictors of mental health (data for n = 40 on p. 327 of text and at www.stat.ufl.edu/~aa/social/data.html)
y = measure of mental impairment (incorporates various dimensions of psychiatric symptoms, including aspects of depression and anxiety) (min = 17, max = 41, mean = 27, s = 5)
x = life events score (events range from severe personal disruptions such as death in family, extramarital affair, to less severe events such as new job, birth of child, moving) (min = 3, max = 97, mean = 44, s = 23)
Bivariate data from 2000 Presidential election
Butterfly ballot, Palm Beach County, FL, text p. 290
Example: The Massachusetts Lottery (data for 37 communities)
[Scatterplot: % income spent on lottery (vertical axis) vs. per capita income (horizontal axis)]

Correlation describes strength of association
• Falls between -1 and +1, with sign indicating direction of association (formula later in Chapter 9)
• The larger the correlation in absolute value, the stronger the association (in terms of a straight line trend)
Examples: (positive or negative, how strong?)
Mental impairment and life events, correlation = 0.37
GDP and fertility, correlation = -0.56
GDP and percent using Internet, correlation = 0.89
Regression analysis gives line predicting y using x
Example:
y = mental impairment, x = life events
Predicted y = 23.3 + 0.09x
e.g., at x = 0, predicted y = 23.3
at x = 100, predicted y = 23.3 + 0.09(100) = 32.3
Inference questions for later chapters? (i.e., what can we conclude about the population?)
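The prediction equation from the mental impairment example can be evaluated directly. This sketch only plugs values of x into the fitted line given above; the function name `predict_impairment` is hypothetical.

```python
INTERCEPT = 23.3  # predicted y when x = 0
SLOPE = 0.09      # change in predicted y per one-point increase in life events score

def predict_impairment(life_events):
    """Predicted mental impairment from the line y = 23.3 + 0.09x."""
    return INTERCEPT + SLOPE * life_events

for x in (0, 100):
    print(f"x = {x}: predicted y = {predict_impairment(x):.1f}")
```

Since the observed life events scores range from 3 to 97, both x = 0 and x = 100 sit just outside the data, so these two predictions are slight extrapolations.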
Example: student survey
y = college GPA, x = high school GPA
(data at www.stat.ufl.edu/~aa/social/data.html)
What is the correlation?
What is the estimated regression equation?
We’ll see later in the course the formulas for finding the correlation and the “best fitting” regression equation (with possibly several explanatory variables), but for now, try using software such as SPSS to find the answers.
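As an alternative to SPSS, the correlation and least squares line can be sketched in plain Python using the standard Pearson and least squares formulas. The slides defer these formulas to later in the course, so this is shown only as a preview. The GPA values below are hypothetical placeholders, not the survey data; the real values would come from the data file at the URL above. The function names are also chosen here for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line predicting y from x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# HYPOTHETICAL values for illustration only (not from the student survey).
high_school_gpa = [2.8, 3.0, 3.2, 3.5, 3.8, 4.0]
college_gpa = [2.6, 3.1, 2.9, 3.4, 3.5, 3.9]

r = pearson_r(high_school_gpa, college_gpa)
intercept, slope = least_squares_line(high_school_gpa, college_gpa)
print(f"correlation = {r:.2f}")
print(f"predicted college GPA = {intercept:.2f} + {slope:.2f} * (high school GPA)")
```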
Sample statistics / Population parameters
• We distinguish between summaries of samples (statistics) and summaries of populations (parameters).
• Common to denote statistics by Roman letters, parameters by Greek letters:
Population mean = $\mu$, standard deviation = $\sigma$, proportion = $\pi$ are parameters.
In practice, parameter values are unknown; we make inferences about their values using sample statistics.
• The sample mean $\bar{y}$ estimates the population mean $\mu$ (quantitative variable)
• The sample standard deviation s estimates the population standard deviation $\sigma$ (quantitative variable)
• A sample proportion p estimates a population proportion $\pi$ (categorical variable)
Questions?
• What is the purpose of descriptive statistics?
• Define the term "mean" in the context of descriptive statistics.
• How is the median of a dataset calculated?
• What does the mode of a dataset represent?
• Explain the concept of "range" in descriptive statistics.
• What is a "standard deviation" and what does it indicate about a dataset?
• How is variance related to standard deviation?
• Describe what a histogram is used for in descriptive statistics.
• What is a "box plot" and what information does it convey?
• Explain the difference between a population and a sample in statistics.
References
Books:
• Hastie, Trevor, et al., The elements of statistical learning. Vol. 2. No. 1. New York: Publisher:
Springer, Edition: Second Edition (2009), ISBN: 978-0387848570
• Practical Statistics for Data Scientists: 50 Essential Concepts, Authors: Peter Bruce, et al,
Publisher: O'Reilly Media, Edition: Second Edition (2020), ISBN: 978-1492072942
Research Papers:
• Carmichael, Iain, and J. S. Marron. "Data science vs. statistics: two cultures?." Japanese Journal of
Statistics and Data Science 1.1 (2018): 117-138.
• Hardin, Johanna, et al. "Data science in statistics curricula: Preparing students to “think with data”." The
American Statistician 69.4 (2015): 343-353.
Websites:
• https://365datascience.com/resources-center/course-notes/statistics/
• https://www.geeksforgeeks.org/7-basic-statistics-concepts-for-data-science/
Videos:
• https://www.youtube.com/playlist?list=PLZ2ps__7DhBYrMs3zybOqr1DzMFCX49xG
THANK YOU
For queries
Email: [email protected]
