Statistical Foundations - Intro 64zlf
Senior Consultant
• Five years as Biostatistician
• 3 years as Predictive Data Analyst
• 1 year as Lead Data Scientist
Predictive Data Analyst • Senior Data Scientist
• Inferential Statistics
• Deep Learning
The Big Picture of Predictive Analytics
What is Statistics?
Introduction
• Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
Variables and Types of Data
Data
• Qualitative (categorical)
• Quantitative (numerical)
   - Discrete: countable, e.g. 5, 29, 8000, etc.
   - Continuous: can be decimals, e.g. 2.59, 312.1, etc.
Variables and Types of Data
Levels of Measurement: based on how variables are categorized, counted or measured
1. Nominal – categorical (names). No ranking possible, e.g. blood types, gender, etc.
2. Ordinal – ranked categories. No precise difference between ranks, e.g. grades.
Descriptive vs Inferential Statistics
• Descriptive statistics consists of the collection, organization, summarization, and presentation of data.
  Example: a national census.
• Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
  Examples: estimating the prevalence of smoking; is smoking related to lung cancer?
Sampling
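[Diagram: a representative sample is drawn from the population; descriptive statistics describe the sample, inferential statistics generalize from it to the population]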
Some Probability Sampling Techniques
Simple Random Sample (SRS)
Note: Whenever we use the word “sample” we mean a “random & representative sample”.
Simple Random Sampling (SRS)
• Randomly sample cases from the population
• Use a random number generator
• Every case has an equal probability of being selected
• Limitation: time consuming for large populations
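A minimal sketch of simple random sampling, assuming the population is available as a list of subject IDs. The population of 100 IDs and the function name `simple_random_sample` are hypothetical, chosen only for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n cases without replacement; every case is equally likely."""
    rng = random.Random(seed)  # seeded generator so the draw can be repeated
    return rng.sample(population, n)

population = list(range(1, 101))  # hypothetical subject IDs 1..100
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```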
Systematic Sampling
• Select every kth subject, where k = N/n
• The first subject is chosen at random from 1 through k
• Easy to implement
• Can pick up hidden (periodic) patterns in the list, which biases the sample
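The steps above can be sketched as follows, assuming N is a multiple of n so that k = N/n is a whole number. The population and the function name `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(population, n, seed=None):
    """Take every kth subject after a random start between 1 and k."""
    k = len(population) // n          # sampling interval k = N/n
    rng = random.Random(seed)
    start = rng.randint(1, k)         # first subject from 1 through k (1-based)
    positions = range(start, start + k * n, k)
    return [population[p - 1] for p in positions]  # convert to 0-based indexing

population = list(range(1, 101))      # hypothetical subject IDs 1..100
sample = systematic_sample(population, 10, seed=7)
print(sample)
```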
Stratified sampling
• Divide the population into layers (strata)
• Take an SRS from each homogeneous layer
• Ensures representation
• Difficult for large populations
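A rough sketch of stratified sampling, assuming each subject carries a label that identifies its stratum. The student records, the department labels, and the function name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(subjects, stratum_of, per_stratum, seed=None):
    """Group subjects into strata, then draw an SRS from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject in subjects:
        strata[stratum_of(subject)].append(subject)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population: 30 students spread over three departments.
departments = ["Math", "Physics", "Biology"]
students = [(i, departments[i % 3]) for i in range(30)]
sample = stratified_sample(students, stratum_of=lambda s: s[1],
                           per_stratum=3, seed=1)
print(sample)
```

Every stratum contributes the same number of subjects here; sampling in proportion to stratum size is a common alternative.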
Cluster sampling
• Use intact geographical groups
• Randomly select a few clusters by SRS
• Select all subjects (or an SRS of subjects) from each chosen cluster
• Cheap, simple & convenient
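A small sketch of cluster sampling, assuming the clusters are already known and stored as a mapping from cluster name to its members. The neighbourhood names, household IDs, and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick a few whole clusters by SRS and keep every member of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # SRS of cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical geographical clusters with household IDs.
clusters = {
    "North": [1, 2, 3],
    "South": [4, 5],
    "East": [6, 7, 8, 9],
    "West": [10, 11],
}
sample = cluster_sample(clusters, n_clusters=2, seed=3)
print(sample)
```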
Pop Quiz
Fill in the blanks
• Two major branches of Statistics are ________________ and _______________.
• The group of all subjects under study is called __________________.
• A group of subjects selected from the group of all subjects under study is called _____________.
• The number of students in a classroom is an example of ________________ data.
• For a sample to be unbiased we should use _________ methods of sampling.
State ‘True’ or ‘False’
• The variable age is an example of a qualitative variable.
• The weight of pumpkins is considered to be a continuous variable.
• “Drinking decaffeinated coffee can increase cholesterol by 7%.” In this statement descriptive statistics is used.
• The level of measurement of the number of students in various colleges of Hyderabad is ratio.
• When the population of college professors is divided into groups according to their rank (instructor, assistant, professor, etc.) and then several are selected from each group to make up a sample, the sample is called a cluster sample.
Classify each variable as discrete or continuous.
Uses of Frequency Tables
Histograms
Histograms use class boundaries and frequencies of the classes.
Histograms
Histograms can also use the class boundaries and the relative frequencies of the classes.
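To make the two kinds of histogram concrete, here is a rough sketch that bins values into classes and reports both the frequency and the relative frequency of each class. The scores are hypothetical data, and the classes use simple lower limits of the form [low, low + width) rather than textbook class boundaries such as 49.5 to 59.5.

```python
def frequency_table(values, low, width, n_classes):
    """Return (class start, class end, frequency, relative frequency) rows."""
    counts = [0] * n_classes
    for v in values:
        counts[int((v - low) // width)] += 1   # index of the class containing v
    total = len(values)
    return [(low + i * width, low + (i + 1) * width, c, c / total)
            for i, c in enumerate(counts)]

scores = [52, 58, 61, 64, 67, 68, 71, 73, 75, 88]   # hypothetical data
table = frequency_table(scores, low=50, width=10, n_classes=4)
for start, end, freq, rel in table:
    # One '#' per observation gives a text histogram of the frequencies.
    print(f"{start}-{end}: {'#' * freq}  freq={freq}  rel={rel:.2f}")
```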
Shapes of Distributions
[Figure: histogram of feelings about math (0 = lowest, 100 = highest); closest to a normal distribution]
[Figure: example data, Optimism…, panels (A) and (B)]
Other Types of Graphs
• Horizontal Bar Graphs
• Pareto Charts
• Time Series Graphs
• Pie Graphs
Recap: Presenting Data (Univariate Analysis)
• Categorical Data
   - Proportions & bar charts
• Quantitative Data
   - Binned frequency distribution & histograms
• Distribution shapes
Summarizing Data (Get Descriptive Statistics)
Traditional Statistics
• Average (Center): measures of central tendency
• Variation (Spread): measures of spread
• Position: measures of position
Measures of Central Tendency
What Do We Mean By Average?
• Mean
• Median
• Mode
Measures of Central Tendency: Mean
• The mean is the quotient of the sum of the values
and the total number of values.
• The symbol $\bar{X}$ is used for the sample mean: $\bar{X} = \frac{\sum X}{n}$.
Example: Days Off per Year
The data represent the number of days off per year for
a sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
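As a quick check of this example, the mean can be computed directly from its definition (sum of the values divided by the number of values). The variable names are illustrative only.

```python
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

total = sum(days_off)                 # sum of the values
sample_mean = total / len(days_off)   # divided by the number of values
print(total, round(sample_mean, 1))   # 276 30.7
```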
Mean is a Balancing Point
Mean = 24.5
Example: Hotel Rooms
The number of rooms in the seven hotels in
downtown Pittsburgh is 713, 300, 618, 595, 311,
401, and 292. Find the median.
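A short sketch of this example: sort the data and take the middle value. With seven values the median is the fourth value of the sorted list.

```python
from statistics import median

rooms = [713, 300, 618, 595, 311, 401, 292]

ordered = sorted(rooms)
print(ordered)        # [292, 300, 311, 401, 595, 618, 713]
print(median(rooms))  # 401
```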
Example: NFL Signing Bonuses
Find the mode of the signing bonuses of eight
NFL players for a specific year. The bonuses in
millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
Example: Coal Employees in PA
Find the mode for the number of coal employees per
county for 10 selected counties in southwestern
Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
There is no mode.
Example: Licensed Nuclear Reactors
The data show the number of licensed nuclear
reactors in the United States for a recent 15-year
period. Find the mode.
104 104 104 104 104 107 109 109 109 110
109 111 112 111 109
104 and 109 both occur the most. The data set is
said to be bimodal.
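The three mode examples (one mode, no mode, two modes) can be reproduced with a small helper. This is a sketch under one stated convention: when every value occurs exactly once, the helper reports no mode by returning an empty list. The function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

print(modes(bonuses))   # [10]
print(modes(coal))      # []
print(modes(reactors))  # [104, 109]
```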
Which diet is better?
Pop Quiz: Means vs Medians
• The change in weight (lbs) of two dieting groups is shown:
• Diet 1: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
• Diet 2: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
• Calculate the average weight loss in each group and
decide which is better.
• Answer: Diet 1: mean = -34.5, median = -4.5
• Diet 2: mean = -18.5, median = -19
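The quiz answers can be verified with a few lines. The sketch also shows why the two averages disagree for Diet 1: the single value of -300 pulls the mean far below the median, while Diet 2 has no such extreme value.

```python
from statistics import mean, median

diet1 = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
diet2 = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

for name, changes in (("Diet 1", diet1), ("Diet 2", diet2)):
    print(name, "mean =", mean(changes), "median =", median(changes))

# Mean of Diet 1 without its most extreme value, for comparison.
diet1_trimmed = [x for x in diet1 if x != -300]
print("Diet 1 without -300: mean =", mean(diet1_trimmed))
```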
Distributions
Note how the mean gets dragged toward the tail of a skewed distribution.
Measures of Variation: Range
• The range is the difference between the highest and
lowest values in a data set.
Example: Outdoor Paint
Two experimental brands of outdoor paint are tested
to see how long each will last before fading. Six cans
of each brand constitute a small population. The
results (in months) are shown. Find the mean and
range of each group.
Brand A   Brand B
10        35
60        45
50        30
30        35
40        40
20        25
The mean for both brands is the same (35 months), but the range
for Brand A (60 – 10 = 50 months) is much greater than the range for Brand B (45 – 25 = 20 months).
Brand B is less variable and hence more consistent.
Range is affected by outliers.
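A short sketch that recomputes the mean and range for the two paint brands. The helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, months in (("Brand A", brand_a), ("Brand B", brand_b)):
    mean_months = sum(months) / len(months)
    print(name, "mean =", mean_months, "range =", value_range(months))
```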
Challenge
• Find the average distance from the mean!
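Measures of Variation: Variance & Standard Deviation
• The population variance and population standard
deviation are
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
where $\mu$ is the population mean and $N$ is the population size.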
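As an illustration, the population variance and standard deviation can be computed for the outdoor paint data, which the example describes as a small population. The sketch uses `pvariance` and `pstdev` from Python's `statistics` module, which divide by N.

```python
from statistics import pstdev, pvariance

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, months in (("A", brand_a), ("B", brand_b)):
    # Population formulas: mean squared distance from the mean, and its root.
    print(name, round(pvariance(months), 1), round(pstdev(months), 1))
# A 291.7 17.1
# B 41.7 6.5
```

Both brands have the same mean, yet Brand A's standard deviation is more than twice Brand B's, which matches the comparison of their ranges.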
• Uses of the Variance and Standard Deviation
Measures of Variation: Empirical Rule (Normal)
Pop Quiz!
• The mean weight of a group of children is 49 kg
with a standard deviation of 3 kg. Assuming that
the distribution of weights is bell-shaped,
approximately what percent of children weigh:
• A) between 46 and 52 kg?
• B) between 46 and 55 kg?
• C) more than 55 kg?
• Answer: A) 68%, B) 81.5%, C) 2.5%
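The quiz answers follow from the empirical rule, which says that roughly 68%, 95% and 99.7% of a bell-shaped distribution lies within 1, 2 and 3 standard deviations of the mean. The sketch below applies those approximate shares; the helper names are hypothetical.

```python
# Approximate share of data within k standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def share_between(k_below, k_above):
    """Share between k_below SDs under the mean and k_above SDs over it."""
    return WITHIN[k_below] / 2 + WITHIN[k_above] / 2

def share_above(k):
    """Share more than k SDs over the mean (one tail)."""
    return (1 - WITHIN[k]) / 2

mean, sd = 49, 3
a = share_between((mean - 46) // sd, (52 - mean) // sd)  # 46 to 52 kg
b = share_between((mean - 46) // sd, (55 - mean) // sd)  # 46 to 55 kg
c = share_above((55 - mean) // sd)                       # more than 55 kg
print(f"{a:.1%} {b:.1%} {c:.1%}")  # 68.0% 81.5% 2.5%
```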
Measures of Position: Z-score
• A z-score or standard score for a value is obtained by
subtracting the mean from the value and dividing the
result by the standard deviation.
Example: Test Scores
A student scored 65 on a calculus test that had a
mean of 50 and a standard deviation of 10; she scored
30 on a history test with a mean of 25 and a standard
deviation of 5. Compare her relative positions on the
two tests.
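A minimal sketch of this comparison, using the z-score definition (value minus mean, divided by standard deviation). The function name `z_score` is hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value lies from the mean."""
    return (value - mean) / sd

calculus = z_score(65, mean=50, sd=10)
history = z_score(30, mean=25, sd=5)
print(calculus, history)  # 1.5 1.0
```

The larger z-score for calculus means her relative position is higher on the calculus test than on the history test.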
Measures of Position: Percentiles
• Percentiles separate the data set into 100 equal
groups.
• A percentile rank for a datum represents the
percentage of data values below the datum.
• They are denoted as P1, P2, …, P99
Measures of Position: Example of a Percentile Graph
[Figure: percentile graph; x-axis: blood pressure, y-axis: percentile]
• What is the percentile rank for a blood pressure of 130? 70th percentile
• What is the 40th percentile BP? 118 mm Hg
• What is the 60th percentile BP?
Measures of Position: Quartiles
• Quartiles separate the data set into 4 equal groups: Q1 = P25, Q2 = MD, Q3 = P75
• Q2 = median(Low, High)
  Q1 = median(Low, Q2)
  Q3 = median(Q2, High)
• The interquartile range (IQR) represents the middle 50%
of the data: IQR = Q3 – Q1.
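Example: Quartiles
Find Q1, Q2, and Q3 for the data set. Also find the IQR.
15, 13, 6, 5, 12, 50, 22, 18
Sort the data: 5, 6, 12, 13, 15, 18, 22, 50
Q2 = (13 + 15)/2 = 14, Q1 = (6 + 12)/2 = 9, Q3 = (18 + 22)/2 = 20
IQR = Q3 – Q1 = 20 – 9 = 11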
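The quartiles in this example can be sketched with the median-of-halves method: Q2 is the median of all values, Q1 the median of the lower half, Q3 the median of the upper half. One assumption is made for odd-sized data sets, where the middle value is left out of both halves. Library routines that interpolate percentiles may return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]          # values below the median position
    upper = data[(n + 1) // 2 :]    # values above the median position
    return median(lower), median(data), median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9.0 14.0 20.0 11.0
```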
Outliers
• An outlier is an extremely high or low data value.
• Any data value less than Q1 – 1.5(IQR) or greater
than Q3 + 1.5(IQR) is considered an outlier.
• Example in previous case: IQR = 20 – 9 = 11
• Q1 – 1.5(IQR) = 9 – 16.5 = -7.5
• Q3 + 1.5(IQR) = 20 + 16.5 = 36.5
• 50 is greater than 36.5, so 50 is an outlier
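A small sketch of the 1.5(IQR) rule applied to the same data set, taking Q1 = 9 and Q3 = 20 from the worked example. The function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper cut-offs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [15, 13, 6, 5, 12, 50, 22, 18]
low_fence, high_fence = outlier_fences(q1=9, q3=20)
outliers = [x for x in data if x < low_fence or x > high_fence]
print(low_fence, high_fence, outliers)  # -7.5 36.5 [50]
```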
Exploratory Data Analysis(EDA)
• The Five-Number Summary is composed of the following numbers: Low, Q1, MD, Q3, High
• The Five-Number Summary can be graphically represented using a Boxplot.
Example: Meteorites
The number of meteorites found in 10 U.S. states is shown.
Find the five-number summary and construct a boxplot for the data.
89, 47, 164, 296, 30, 215, 138, 78, 48, 39
Sort the data:
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
Low = 30, Q1 = 47, MD = 83.5, Q3 = 164, High = 296
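The five-number summary for this example can be sketched as below, again assuming the median-of-halves method for Q1 and Q3. The function name `five_number_summary` is hypothetical, and drawing the boxplot itself is left out.

```python
from statistics import median

def five_number_summary(values):
    """Return (Low, Q1, MD, Q3, High) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[(n + 1) // 2 :]
    return data[0], median(lower), median(data), median(upper), data[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
summary = five_number_summary(meteorites)
print(summary)  # (30, 47, 83.5, 164, 296)
```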
Information from Boxplots
• To find the average, use the median (line inside the box)
• To find the variability, use the IQR (length of the box)
• If the median is near the centre of the box and the lines are about equal: symmetric
• If the median is at the left side of the box and/or the right line is longer: right skew
• If the median is at the right side of the box and/or the left line is longer: left skew
Example: Cheese substitute has a higher median, but real cheese shows greater variability
TRY THIS!