Statistical Foundations - Intro 64zlf

Predictive Analytics
David Pratap, Lead Data Scientist

David Pratap
Lead Data Scientist
Senior Consultant Biostatistician
Predictive Data Analyst
Certified in AI & Advanced Analytics

Certifications
• Artificial Intelligence & Advanced Analytics
• Python for Data Science
• Statistics in Medicine

Achievements
• Principal Investigator for a city-wide epidemiological survey in Muscat, Oman

Experience
• Five years as Predictive Data Analyst
• 3 years as Senior Data Scientist
• 1 year as Lead Data Scientist

Strength
• Inferential Statistics
• Deep Learning

Module 1: Introduction to Statistics
Introductory Statistics-Part1
Nature of Data
Objectives for Today
-- Understand the Big Picture of Statistics
-- What are data types?
-- What are the branches of statistics?
-- How are samples selected?
-- Organizing data into tables and charts
-- Summarizing data: Descriptive Statistics
What is Predictive Analytics?
• Predictive Analytics is the use of data, statistical
methods/algorithms and machine learning to predict
future outcomes based on historical data.
• Why is this so hot now?
• Three reasons:
- Growing volumes of data
- Faster and cheaper computers
- Easy-to-use software
The Big Picture of Predictive Analytics
What is Statistics ?
Introduction
• Statistics is the science of conducting studies to
collect,
organize,
summarize,
analyze, and
draw conclusions from data.
Variables and Types of Data
Data
• Qualitative (categorical)
• Quantitative (numerical)
- Discrete: countable, e.g. 5, 29, 8000
- Continuous: can be decimals, e.g. 2.59, 312.1
Levels of Measurement: based on how variables are categorized, counted or measured
1. Nominal – categorical (names). No ranking possible, e.g. blood types, gender.
2. Ordinal – nominal, plus can be ranked (order). No precise difference between ranks, e.g. grades.
3. Interval – ordinal, plus intervals are consistent. Ranking possible but no true zero, e.g. temperature.
4. Ratio – interval, plus ratios are consistent, true zero. Ranking possible and true zero, e.g. height, age, time, number of children.
Determine the measurement level.

Variable          Nominal   Ordinal   Interval   Ratio   Level
Hair Color        Yes       No        -          -       Nominal
Postal Code       Yes       No        -          -       Nominal
Letter Grade      Yes       Yes       No         -       Ordinal
IQ Score          Yes       Yes       Yes        No      Interval
Height            Yes       Yes       Yes        Yes     Ratio
Age               Yes       Yes       Yes        Yes     Ratio
Temperature (F)   Yes       Yes       Yes        No      Interval
Descriptive vs Inferential Statistics
• Descriptive statistics consists of the collection, organization, summarization, and presentation of data. Example: a national census.
• Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. Examples: estimating the prevalence of smoking; is smoking related to lung cancer?
Sampling
[Diagram labels: Representative Sample, Descriptive, Inferential]
Some Probability Sampling Techniques
• Random (simple random sample, SRS) – random number generator
• Systematic – every kth subject, k = N/n
• Stratified – divide population into “layers”
• Cluster – use intact (geographical) groups
• Convenience (non-probability) – mall surveys
Note: Whenever we use the word “sample” we mean a “random & representative sample”.
Simple Random Sampling(SRS)
• Randomly sample cases from the population
• random number generator
• Equal probability of being selected
• Limitation: time-consuming for large populations
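A minimal sketch of simple random sampling, assuming the population is a list of subject IDs (the IDs 1..100 and the helper name `simple_random_sample` are made up for illustration). It relies on Python's standard `random.sample`, which draws without replacement so that every case is equally likely to be selected.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases; each case has the same chance of being selected."""
    rng = random.Random(seed)          # a seed makes the draw repeatable
    return rng.sample(list(population), n)

# Hypothetical population: subject IDs 1..100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```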
Systematic Sampling
• Every kth subject, k =N/n
• First subject from 1 through k
• Easy to implement
• Can pick up hidden patterns
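The steps above can be sketched as follows, assuming the population is an ordered list and that N is a multiple of n (otherwise this sketch simply uses the integer part of N/n). The helper name `systematic_sample` and the IDs are illustrative only.

```python
import random

def systematic_sample(population, n, seed=None):
    """Pick a random start among the first k subjects, then take every k-th subject."""
    N = len(population)
    k = N // n                         # sampling interval k = N/n (integer part)
    rng = random.Random(seed)
    start = rng.randrange(k)           # 0-based position of the first subject
    return [population[start + i * k] for i in range(n)]

# Hypothetical population: subject IDs 1..100, so k = 10
population = list(range(1, 101))
sample = systematic_sample(population, 10, seed=1)
print(sample)
```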
Stratified sampling
• Divide population into layers (strata)
• SRS from each homogeneous layer
• Ensures representation
• Difficult for large populations
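A small sketch of stratified sampling, assuming each unit carries a label that identifies its stratum. The professor list, the rank labels and the helper name `stratified_sample` are hypothetical; the same number of units is drawn by SRS from every stratum, which is only one of several possible allocation choices.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, per_stratum, seed=None):
    """Group units into strata, then draw an SRS of fixed size from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):        # fixed order keeps the result repeatable
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

# Hypothetical population: 30 professors, 10 in each rank
ranks = ["instructor", "assistant", "professor"] * 10
professors = [(f"P{i}", rank) for i, rank in enumerate(ranks)]
sample = stratified_sample(professors, lambda p: p[1], 2, seed=0)
print(sample)
```

Every rank is guaranteed to appear in the sample, which is what "ensures representation" means here.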
Cluster sampling
• Use intact geographical groups
• Randomly select a few clusters by SRS
• Select all (or an SRS) from each cluster
• Cheap, simple & convenient
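A minimal sketch of cluster sampling, assuming the population is already grouped into intact clusters. The ward and house names and the helper name `cluster_sample` are invented for illustration; this version keeps every member of each selected cluster rather than taking an SRS within it.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters by SRS, then keep every member of the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical population: 6 wards with 5 households each
clusters = {
    f"ward_{w}": [f"ward_{w}_house_{h}" for h in range(5)]
    for w in range(6)
}
chosen, sample = cluster_sample(clusters, 2, seed=3)
print(chosen)
print(sample)
```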
Pop Quiz
Fill in the blanks
• Two major branches of Statistics are
________________and _______________
• The group of all subjects under study is called
__________________
• A group of subjects selected from the group of all
subjects under study is called _____________
• The number of students in a classroom is an example
of ________________data .
• For a sample to be unbiased we should
use_________methods of sampling.
State ‘True’ or ‘False’
• The variable age is an example of qualitative variable.
• The weight of pumpkins is considered to be a continuous
variable.
• “Drinking Decaffeinated coffee can increase cholesterol by
7%” . In this statement descriptive statistics is used.
• The level of measurement of the number of students in various
colleges of Hyderabad is ratio.
• When the population of college professors is divided into
groups according to their rank (instructor, assistant, professor,
etc.) and then several are selected from each group to make up
a sample, the sample is called a cluster sample.
Classify each variable as discrete or continuous.
• Ages of people in a factory

• Number of coffee cups served in a restaurant
• The time it takes a student to drive to his/her college
• Water temperatures of six swimming pools in Pittsburgh on a given
day
• The number of gallons of milk sold each day at a grocery store
Organizing Data
• Data collected in original form is called raw data.
• A frequency distribution is the organization of raw
data in table form, using classes and frequencies.
Uses of Frequency Tables
• Easy to eyeball the data

• Center of the distribution
• Outliers(extreme values)
For Categorical Data (nominal or ordinal)
• Use Frequency tables & Bar charts
• Shows frequency(count) or proportion(%) of
observations in each category
Class   Frequency   Percent (%)
A       5           20
B       7           28
O       9           36
AB      4           16
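The table above can be reproduced with a few lines of Python. This is a sketch that assumes the raw data are 25 blood-type labels matching the counts in the table; the list is rebuilt from those counts because the raw observations are not given.

```python
from collections import Counter

# Raw data rebuilt from the table: 5 A, 7 B, 9 O, 4 AB (25 observations)
blood_types = ["A"] * 5 + ["B"] * 7 + ["O"] * 9 + ["AB"] * 4

counts = Counter(blood_types)
total = sum(counts.values())

# Map each class to (frequency, percent of all observations)
table = {cls: (freq, 100 * freq / total) for cls, freq in counts.items()}

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for cls, (freq, pct) in table.items():
    print(f"{cls:<6}{freq:>10}{pct:>9.0f}")
```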
For Quantitative Data
• Use binned frequency tables and histograms
• Shows the frequency or relative frequency of each class interval

Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
Presenting Quantitative data-Histograms
The Most Common Graphs in Research
1. Histograms contain all the data
2. Histograms show the shape of the distribution
3. Histograms show the center and spread of the
distribution
4. Histograms show outliers if any
Histograms
Histograms use class boundaries and
frequencies of the classes.
Histograms
Use the class boundaries and the
relative frequencies of the classes.
• 25% of runners ran between 20.5 and 25.5 miles
• 5% of runners ran less than 10.5 miles
Shapes of Distributions
• Feelings about math (0=lowest, 100=highest): closest to a normal distribution!
• Example data: Optimism… left skew or bimodal distribution!
• Fruit and vegetable consumption (servings/day)… right skew distribution!
Pop Quiz: Distribution Type
• Which of the following is a bimodal distribution?
(A) (B)
Other Types of Graphs
• Horizontal Bar Graphs
• Pareto Charts
• Time Series Graphs
• Pie Graphs
Recap- Presenting Data(Univariate Analysis)
• Categorical Data
- Proportions & Bar charts
• Quantitative Data
- Binned frequency distributions & Histograms
• Distribution shapes
Summarizing Data( Get descriptive statistics)
Traditional Statistics
•Average(Center)- Measure of central tendency
•Variation(Spread)- Measures of spread
•Position- Measures of position
Measures of Central Tendency
What Do We Mean By Average?
• Mean
• Median
• Mode
Measures of Central Tendency: Mean
• The mean is the quotient of the sum of the values
and the total number of values.
• The symbol X̄ (X-bar) is used for the sample mean.
• For a population, the Greek letter μ (mu) is used for the mean.
Example : Days Off per Year
The data represent the number of days off per year for
a sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
The mean number of days off is 276/9 ≈ 30.7 days.
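As a quick check of the arithmetic, the mean of the days-off data can be computed with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are not from the slides.

```python
from statistics import mean

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

total = sum(days_off)          # sum of the values: 276
n = len(days_off)              # number of values: 9
sample_mean = mean(days_off)   # same as total / n

print(round(sample_mean, 1))   # 30.7
```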
Mean is a Balancing Point
[Figure: the mean (24.5) shown as the balancing point of the data]

Measures of Central Tendency: Median
• The median is the midpoint of the data array. The
symbol for the median is MD.
• The median will be one of the data values if there is
an odd number of values.
• The median will be the average of two data values if
there is an even number of values.
Example: Hotel Rooms
The number of rooms in the seven hotels in
downtown Pittsburgh is 713, 300, 618, 595, 311,
401, and 292. Find the median.
Step 1: Sort in ascending order.
292, 300, 311, 401, 595, 618, 713
Step 2: Select the middle value.
MD = 401
The median is 401 rooms.

Example : Tornadoes in the U.S.
The number of tornadoes that have occurred in
the United States over an 8-year period follows.
Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303
Sort in ascending order.
656, 684, 702, 764, 856, 1132, 1133, 1303
Find the average of the two middle values: MD = (764 + 856)/2 = 810.
The median number of tornadoes is 810.
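Both median examples (an odd and an even number of values) can be checked with `statistics.median`, which sorts the data and averages the two middle values when the count is even. A minimal sketch using the data from the two examples:

```python
from statistics import median

hotel_rooms = [713, 300, 618, 595, 311, 401, 292]            # 7 values (odd)
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]      # 8 values (even)

md_rooms = median(hotel_rooms)       # middle value of the sorted data
md_tornadoes = median(tornadoes)     # average of the two middle values

print(md_rooms)        # 401
print(md_tornadoes)    # 810.0
```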

Measures of Central Tendency: Mode
• The mode is the value that occurs most often in a
data set.
• It is sometimes said to be the most typical case.
• There may be no mode, one mode (unimodal), two
modes (bimodal), or many modes (multimodal).
Example : NFL Signing Bonuses
Find the mode of the signing bonuses of eight
NFL players for a specific year. The bonuses in
millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
You may find it easier to sort first.

10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5
Select the value that occurs the most.
The mode is 10 million dollars.
Example: Coal Employees in PA
Find the mode for the number of coal employees per
county for 10 selected counties in southwestern
Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
No value occurs more than once.
There is no mode.
Example : Licensed Nuclear Reactors
The data show the number of licensed nuclear
reactors in the United States for a recent 15-year
period. Find the mode.
104 104 104 104 104 107 109 109 109 110
109 111 112 111 109
104 and 109 both occur the most. The data set is
said to be bimodal.
The modes are 104 and 109.
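The three mode examples can be reproduced with a small helper. This is a sketch: the function name `modes` is made up, and it follows the convention used in these examples that a data set in which no value repeats has no mode (Python's built-in `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] when no value occurs more than once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                      # no value repeats: no mode
    return [value for value, count in counts.items() if count == top]

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

print(modes(bonuses))    # [10]
print(modes(coal))       # []
print(modes(reactors))   # [104, 109]
```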

Mode for Categorical(Nominal) Data
• most typical case when data are nominal
Properties of the Mean
• Uses all data values.
• Gives the balancing point.
• Used in computing other statistics, such as the variance.
• Unique, usually not one of the data values.
• Affected by extremely high or low values, called outliers; should generally not be used for highly skewed data.
Which diet is better?
Latest weight loss diets:
• Diet 1: mean weight loss = 34.5 kg
• Diet 2: mean weight loss = 18.5 kg
Diet 1: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
Diet 2: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
Pop Quiz: Means vs Medians
• The change in weight (lbs) of two dieting groups is shown:
• Diet 1: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
• Diet 2: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
• Calculate the average weight loss in each group and decide which is better.
• Answer: Diet 1: mean = -34.5, median = -4.5
• Diet 2: mean = -18.5, median = -19
Diet 2 is better since there are no outliers and the mean and median agree with each other.
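The quiz answer can be verified with a short sketch that computes the mean and median for each diet. It shows how the single extreme value (-300) drags the mean of Diet 1 far away from its median, while Diet 2's mean and median stay close together.

```python
from statistics import mean, median

diet1 = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
diet2 = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

for name, changes in (("Diet 1", diet1), ("Diet 2", diet2)):
    print(name, "mean =", mean(changes), "median =", median(changes))

# Gap between mean and median: large for Diet 1 (outlier), small for Diet 2
gap1 = abs(mean(diet1) - median(diet1))
gap2 = abs(mean(diet2) - median(diet2))
print(gap1, gap2)
```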
When to use mean, median,mode
Type of data               Best measure of central tendency
Quantitative (symmetric)   Mean
Quantitative (skewed)      Median
Ordinal                    Median/Mode
Nominal/Bimodal            Mode
Distributions
• Positively skewed or right skewed
• Symmetric
• Negatively skewed or left skewed
Note how the mean gets dragged towards the skew.
QUIZ
• Which measure of Central Tendency would you use:
• 1. Salaries of doctors in a hospital. (Hint: salaries are typically skewed)
• 2. Test scores of all students in USMLE Step 1
• 3. Disease stages in a group of patients with Reye’s syndrome. (Hint: ordering?)
• 4. The blood type of 25 army cadets
Answers: 1) median, 2) mean, 3) median, 4) mode
Introductory Statistics-Part2
Summarizing data and EDA
Objectives for Today
-- Summarizing data- Measures of Spread

-- Summarizing data – Measures of Position
-- EDA – Exploratory Data Analysis
Measures of Variation
How Can We Measure Variability?
• Range
• Variance
• Standard Deviation
• Empirical Rule (Normal)
Measures of Variation: Range
• The range is the difference between the highest and
lowest values in a data set.
• Very useful to get the spread quickly
Example : Outdoor Paint
Two experimental brands of outdoor paint are tested
to see how long each will last before fading. Six cans
of each brand constitute a small population. The
results (in months) are shown. Find the mean and
range of each group.
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
The average for both brands is the same, but the range
for Brand A is much greater than the range for Brand B.
Which brand would you buy?
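The comparison above can be checked with a brief sketch that computes the mean and range for each brand. The helper name `value_range` is illustrative.

```python
from statistics import mean

brand_a = [10, 60, 50, 30, 40, 20]   # months before fading
brand_b = [35, 45, 30, 35, 40, 25]

def value_range(data):
    """Range = highest value minus lowest value."""
    return max(data) - min(data)

print(mean(brand_a), value_range(brand_a))   # 35 50
print(mean(brand_b), value_range(brand_b))   # 35 20
```

Both brands last 35 months on average, but Brand A's results are spread over 50 months while Brand B's span only 20.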

Examine Data Sets Graphically
• Brand B is less variable and hence more consistent.
• Range is affected by outliers.
Challenge
• Find the average distance from the mean !
• Hint: Distance from the mean = X − X̄ (each value minus the mean)

Measures of Variation: Variance & Standard Deviation
• The variance is the average of the squares of the distance each value is from the mean (mean squared distance).
• The standard deviation is the square root of the variance.
• The standard deviation is a measure of the average spread of the data (not the exact average distance).
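Measures of Variation: Variance & Standard Deviation
• The population variance and population standard deviation are
$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$ and $\sigma = \sqrt{\sigma^2}$
• The sample variance and sample standard deviation are
$s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$ and $s = \sqrt{s^2}$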
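A short sketch applying these definitions to the outdoor paint data from the earlier example. Since the six cans of each brand were described as a small population, the population versions (`pvariance`, `pstdev`, dividing by N) are used; the sample versions (`variance`, `stdev`, dividing by n − 1) are shown for comparison. All four functions come from Python's standard `statistics` module.

```python
from statistics import pvariance, pstdev, variance, stdev

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, data in (("Brand A", brand_a), ("Brand B", brand_b)):
    print(
        name,
        "population variance =", round(pvariance(data), 1),
        "population SD =", round(pstdev(data), 1),
        "sample variance =", round(variance(data), 1),
        "sample SD =", round(stdev(data), 1),
    )
```

Brand A's sum of squared distances from the mean is 1750 and Brand B's is 250, so Brand A has the larger variance and standard deviation under either definition.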
Uses of the Variance and Standard Deviation
• To determine the spread of the data (if the variance or standard deviation is large, the data are more dispersed).
• To determine the consistency of a variable.
• Used in inferential statistics.
• To determine the number of data values that fall within a specified interval in a distribution (Empirical rule).
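Measures of Variation: Empirical Rule (Normal)
For a bell-shaped (normal) distribution, approximately:
• 68% of the data values fall within 1 standard deviation of the mean
• 95% fall within 2 standard deviations of the mean
• 99.7% fall within 3 standard deviations of the mean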
Pop Quiz !
• The mean weight of a group of children is 49 kg with a standard deviation of 3 kg. Assuming that the distribution of weights is bell-shaped, approximately what percent of children weigh:
• A) between 46 and 52 kg?
• B) between 46 and 55 kg?
• C) more than 55 kg?
• Answer: A) 68%, B) 81.5%, C) 2.5%
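The quiz answers follow from the empirical rule (about 68%, 95% and 99.7% of values within 1, 2 and 3 standard deviations of the mean for a bell-shaped distribution). The sketch below is a hypothetical helper that only handles cut-offs lying a whole number of standard deviations (0 to 3) from the mean, which is all this quiz needs.

```python
# Empirical-rule percentages within k standard deviations of the mean
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def pct_below(k):
    """Approximate percent of values below mean + k*SD, for integer k in -3..3."""
    if k == 0:
        return 50.0
    half = WITHIN[abs(k)] / 2          # the rule is symmetric about the mean
    return 50.0 + half if k > 0 else 50.0 - half

mean_w, sd_w = 49, 3

def sds_from_mean(weight):
    """Whole number of SDs from the mean (exact for the cut-offs in this quiz)."""
    return (weight - mean_w) // sd_w

answer_a = pct_below(sds_from_mean(52)) - pct_below(sds_from_mean(46))
answer_b = pct_below(sds_from_mean(55)) - pct_below(sds_from_mean(46))
answer_c = 100.0 - pct_below(sds_from_mean(55))

print(answer_a, answer_b, answer_c)    # 68.0 81.5 2.5
```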
Measures of Position: Z-score
• A z-score or standard score for a value is obtained by
subtracting the mean from the value and dividing the
result by the standard deviation.
• A z-score represents the number of standard deviations a value is above or below the mean.
Example : Test Scores
A student scored 65 on a calculus test that had a
mean of 50 and a standard deviation of 10; she scored
30 on a history test with a mean of 25 and a standard
deviation of 5. Compare her relative positions on the
two tests.
Calculus: z = (65 – 50)/10 = 1.5. History: z = (30 – 25)/5 = 1.0.
She has a higher relative position in the calculus class.
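The comparison can be reproduced with a one-line z-score function: subtract the mean from the value and divide by the standard deviation. The function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_calculus = z_score(65, mean=50, sd=10)
z_history = z_score(30, mean=25, sd=5)

print(z_calculus, z_history)           # 1.5 1.0
print("calculus" if z_calculus > z_history else "history")
```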
Measures of Position: Percentiles
• Percentiles separate the data set into 100 equal
groups.
• A percentile rank for a datum represents the
percentage of data values below the datum.
• They are denoted as P1 , P2 , …, P99
Measures of Position: Example of a Percentile Graph
[Percentile graph: percentile on the vertical axis, blood pressure on the horizontal axis]
• What is the percentile rank for a blood pressure of 130? 70th percentile
• What is the 40th percentile BP? 118 mm Hg
• What is the 60th percentile BP?
Measures of Position: Quartiles
• Quartiles separate the data set into 4 equal groups.
Q1=P25, Q2=MD, Q3=P75
• Q2 = median(Low,High)
Q1 = median(Low,Q2)
Q3 = median(Q2,High)
• The Interquartile Range (IQR) represents the middle 50% of the data: IQR = Q3 – Q1.
Example : Quartiles
Find Q1, Q2, and Q3 for the data set. Also find the IQR
15, 13, 6, 5, 12, 50, 22, 18
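Sort in ascending order.
5, 6, 12, 13, 15, 18, 22, 50
Q2 = median of all values = (13 + 15)/2 = 14
Q1 = median of the lower half (5, 6, 12, 13) = (6 + 12)/2 = 9
Q3 = median of the upper half (15, 18, 22, 50) = (18 + 22)/2 = 20
IQR = Q3 – Q1 = 20 – 9 = 11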
Outliers
• An outlier is an extremely high or low data value.
• Any data value less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR) is considered an outlier.
• Example in previous case: IQR = 20 – 9 = 11
• Q1 – 1.5(IQR) = 9 – 16.5 = -7.5
• Q3 + 1.5(IQR) = 20 + 16.5 = 36.5
• 50 is an outlier since it is greater than 36.5
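A sketch of the quartile and outlier calculation for the same data set. It uses the "median of each half" method from the quartiles slide; for an odd number of values the sketch leaves the median out of both halves, which is an assumption (other conventions exist, and library functions such as `statistics.quantiles` may give different quartiles).

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 as medians of the lower half, the whole data set, and the upper half."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]   # assumption: drop the median when n is odd
    return median(lower), median(xs), median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3, iqr)           # 9.0 14.0 20.0 11.0
print(low_fence, high_fence)     # -7.5 36.5
print(outliers)                  # [50]
```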
Exploratory Data Analysis(EDA)
• The Five-Number Summary is composed
of the following numbers: Low, Q1, MD,
Q3, High
• The Five-Number Summary can be
graphically represented using a Boxplot.
Example : Meteorites
The number of meteorites found in 10 U.S. states is shown.
Find the 5 number summary and construct a boxplot for the
data. 89, 47, 164, 296, 30, 215, 138, 78, 48, 39
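Sort the data:
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
Five-Number Summary (Low, Q1, MD, Q3, High): 30, 47, 83.5, 164, 296
[Boxplot: box from Q1 = 47 to Q3 = 164 with the median line at 83.5; lines extend to Low = 30 and High = 296]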
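The five-number summary for the meteorite data can be computed with a short sketch. The helper name `five_number_summary` is made up, and quartiles are taken as the medians of the lower and upper halves of the sorted data (with the median left out of both halves when the count is odd, which is an assumption of this sketch).

```python
from statistics import median

def five_number_summary(data):
    """Return (Low, Q1, MD, Q3, High) using medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(meteorites)
print(summary)    # (30, 47, 83.5, 164, 296)
```

These five values are exactly what a boxplot draws: the ends of the lines, the edges of the box, and the line inside the box.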
Information from Boxplots
• To find the average, use the median (line inside the box)
• To find variability, use the IQR (length of the box)
• If the median is near the centre of the box and the lines are about equal → symmetric
• If the median is at the left side of the box and/or the right line is longer → right skew
• If the median is at the right side of the box and/or the left line is longer → left skew
Example: Cheese substitute has a higher median, but real cheese shows greater variability.
TRY THIS !
• Which area has highest age ?

• Which area has highest variability?
• Which area has highest range ?
Symbols
• s² = Sample variance
• s = Sample standard deviation
• σ² = Population (true or theoretical) variance
• σ = Population standard deviation
• X̄ = Sample mean
• µ = Population mean
• IQR = Interquartile range (middle 50%)
• r = Sample correlation coefficient
• ρ = Population correlation coefficient
• n = Sample size
• N = Population size
Reference Book
• OpenIntro Statistics: https://www.openintro.org/book/os/
• Link for intro google sheet:
https://docs.google.com/spreadsheets/d/1sxvDzkgJUdLmYwPJD0lMw2MGI-HG_qPkp6ROhzwxC7I/edit?usp=sharing
