Statistical Foundations - Intro


Predictive Analytics

David Pratap, Lead Data Scientist

David Pratap
Lead Data Scientist, Senior Consultant Biostatistician, Predictive Data Analyst
Certified in AI & Advanced Analytics

Certifications
• Artificial Intelligence & Advanced Analytics
• Python for Data Science
• Statistics in Medicine

Achievements
• Principal Investigator for city wide Epidemiological Survey in Muscat, Oman

Experience
• Five years as Predictive Data Analyst
• 3 years as Senior Data Scientist
• 1 year as Lead Data Scientist

Strength
• Inferential Statistics
• Deep Learning



Module 1: Introduction to Statistics
Introductory Statistics-Part1
Nature of Data
Objectives for Today

-- Understand the Big Picture of Statistics
-- What are data types?
-- What are the branches of statistics?
-- How are samples selected?
-- Organizing data into tables and charts
-- Summarizing data: Descriptive Statistics
What is Predictive Analytics?
• Predictive Analytics is the use of data, statistical
methods/algorithms and machine learning to predict
future outcomes based on historical data.
• Why is this so hot now?
• Three reasons:
  - Growing volumes of data
  - Faster and cheaper computers
  - Easy-to-use software
The Big Picture of Predictive Analytics
What is Statistics?
Introduction
• Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
Variables and Types of Data
Data
• Qualitative (categorical)
• Quantitative (numerical)
  - Discrete: countable, e.g. 5, 29, 8000, etc.
  - Continuous: can be decimals, e.g. 2.59, 312.1, etc.
Variables and Types of Data
Levels of Measurement: based on how variables are categorized, counted or measured

1. Nominal – categorical (names). No ranking possible, e.g. blood types, gender etc.
2. Ordinal – nominal, plus can be ranked (order). No precise difference between ranks, e.g. grades.
3. Interval – ordinal, plus intervals are consistent. Ranking possible but no true zero, e.g. temperature.
4. Ratio – interval, plus ratios are consistent, true zero. Ranking possible and true zero, e.g. height, age, time, number of children etc.
Variables and Types of Data

Determine the measurement level.


Variable          Nominal   Ordinal   Interval   Ratio   Level
Hair Color        Yes       No        -          -       Nominal
Postal Code       Yes       No        -          -       Nominal
Letter Grade      Yes       Yes       No         -       Ordinal
IQ Score          Yes       Yes       Yes        No      Interval
Height            Yes       Yes       Yes        Yes     Ratio
Age               Yes       Yes       Yes        Yes     Ratio
Temperature (F)   Yes       Yes       Yes        No      Interval
Descriptive vs Inferential Statistics
• Descriptive statistics consists of the collection, organization, summarization, and presentation of data. Example: a national census.
• Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. Examples: estimating the prevalence of smoking; is smoking related to lung cancer?
Sampling

[Figure: a representative sample drawn from the population, with the labels "Descriptive" and "Inferential"]
Some Probability Sampling Techniques
• Simple random sample (SRS) – random number generator
• Systematic – every kth subject, k = N/n
• Stratified – divide population into “layers”
• Cluster – use intact (geographical) groups
• Convenience (non-probability) – mall surveys

Note: Whenever we use the word “sample” we mean a “random & representative sample”.
Simple Random Sampling (SRS)
• Randomly sample cases from the population
• Use a random number generator
• Equal probability of being selected
• Limitation: time-consuming for large populations
Systematic Sampling
• Every kth subject, k = N/n
• First subject chosen at random from 1 through k
• Easy to implement
• Can pick up hidden patterns in the ordering of the population, which may bias the sample
Stratified Sampling
• Divide population into layers (strata)
• SRS from each homogeneous layer
• Ensures representation
• Difficult for large populations
Cluster Sampling
• Use intact geographical groups
• Randomly select a few clusters by SRS
• Select all (or an SRS) from each cluster
• Cheap, simple & convenient
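The four probability sampling techniques above can be sketched in Python as follows. This is a minimal illustration, not a survey-grade implementation: the function names, the population of subjects numbered 1 to 100, the odd/even strata and the four named clusters are all hypothetical and chosen only for the demonstration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """SRS: every subject has an equal chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Every k-th subject with k = N // n, starting at a random spot among the first k."""
    rng = random.Random(seed)
    k = len(population) // n
    start = rng.randrange(k)  # index 0..k-1, i.e. subject 1 through k
    return population[start::k][:n]

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Divide the population into strata, then take an SRS from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every subject inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]

# Hypothetical population: subjects numbered 1..100
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
systematic = systematic_sample(population, 10, seed=1)
stratified = stratified_sample(population, lambda x: x % 2, 5, seed=1)

# Hypothetical geographical clusters
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8], "west": [9]}
clustered = cluster_sample(clusters, 2, seed=1)

print(srs, systematic, stratified, clustered, sep="\n")
```

The fixed seed is only there to make the run repeatable; in practice the choice of strata and clusters matters far more than the code.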
Pop Quiz
Fill in the blanks
• Two major branches of Statistics are ________________ and _______________
• The group of all subjects under study is called __________________
• A group of subjects selected from the group of all subjects under study is called _____________
• The number of students in a classroom is an example of ________________ data.
• For a sample to be unbiased we should use _________ methods of sampling.
State ‘True’ or ‘False’
• The variable age is an example of a qualitative variable.
• The weight of pumpkins is considered to be a continuous variable.
• “Drinking decaffeinated coffee can increase cholesterol by 7%”. In this statement descriptive statistics is used.
• The level of measurement of the number of students in various colleges of Hyderabad is ratio.
• When the population of college professors is divided into groups according to their rank (instructor, assistant, professor etc.) and then several are selected from each group to make up a sample, the sample is called a cluster sample.
Classify each variable as discrete or continuous.
• Ages of people in a factory
• Number of coffee cups served in a restaurant
• The time it takes a student to drive to his/her college
• Water temperatures of six swimming pools in Pittsburgh on a given day
• The number of gallons of milk sold each day at a grocery store
Organizing Data
• Data collected in original form is called raw data.
• A frequency distribution is the organization of raw
data in table form, using classes and frequencies.

Uses of Frequency Tables
• Easy to eyeball the data
• Center of the distribution
• Outliers (extreme values)
For Categorical Data (nominal or ordinal)
• Use frequency tables & bar charts
• Shows frequency (count) or proportion (%) of observations in each category

Class   Frequency   Percent
A       5           20
B       7           28
O       9           36
AB      4           16
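A frequency table like the one above can be built from raw categories with a counter. The sketch below is only illustrative: the raw list `blood_types` is hypothetical data constructed to reproduce the counts in the table (5, 7, 9 and 4 out of 25).

```python
from collections import Counter

# Hypothetical raw data that reproduces the counts in the table
blood_types = ["A"] * 5 + ["B"] * 7 + ["O"] * 9 + ["AB"] * 4

counts = Counter(blood_types)   # frequency of each class
total = sum(counts.values())    # number of observations

# Map each class to (frequency, percent of all observations)
table = {cls: (freq, 100 * freq / total) for cls, freq in counts.items()}

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for cls, (freq, pct) in table.items():
    print(f"{cls:<6}{freq:>10}{pct:>9.0f}")
```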
For Quantitative Data
• Use binned frequency tables and histograms
• Shows the frequency or relative frequency of each class interval

Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
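The relationship between class limits, class boundaries and relative frequencies in the table above can be checked with a short sketch. It assumes whole-number data, so each boundary sits 0.5 below or above a limit; the helper `classify` is a hypothetical name used only here.

```python
# Class limits and frequencies copied from the table above
class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# For whole-number data, boundaries sit halfway between adjacent limits
boundaries = [(lower - 0.5, upper + 0.5) for lower, upper in class_limits]

total = sum(frequencies)
relative = [freq / total for freq in frequencies]  # share of observations per class

def classify(value):
    """Return the index of the class whose boundaries contain the value."""
    for i, (lower, upper) in enumerate(boundaries):
        if lower <= value < upper:
            return i
    raise ValueError(f"{value} falls outside all classes")

for (lower, upper), freq, rel in zip(boundaries, frequencies, relative):
    print(f"{lower:>6} - {upper:<6} {freq:>3} {rel:>6.2f}")
```

A histogram of these data would draw one bar per class, spanning the class boundaries, with height equal to the frequency (or the relative frequency).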
Presenting Quantitative data-Histograms
The Most Common Graphs in Research
1. Histograms contain all the data
2. Histograms show the shape of the distribution
3. Histograms show the center and spread of the
distribution
4. Histograms show outliers if any

Histograms
• A frequency histogram uses the class boundaries and the frequencies of the classes.
• A relative frequency histogram uses the class boundaries and the relative frequencies of the classes.
• Example readings from a relative frequency histogram of runners: 25% of runners ran between 20.5 and 25.5 miles; 5% of runners ran less than 10.5 miles.
Shapes of Distributions

Feelings about math (0=lowest, 100=highest)
Closest to a normal distribution!
Example data: Optimism…
Left skew or bimodal distribution!
Fruit and vegetable consumption (servings/day)…
Right skew distribution!

Pop Quiz: Distribution Type
• Which of the following is a bimodal distribution?

(A) (B)
Other Types of Graphs
• Horizontal Bar Graphs
• Pareto Charts
• Time Series Graphs
• Pie Graphs
Recap: Presenting Data (Univariate Analysis)
• Categorical Data
  - Proportions & bar charts
• Quantitative Data
  - Binned frequency distribution & histograms
• Distribution shapes
Summarizing Data (Get Descriptive Statistics)

Traditional Statistics
• Average (Center): Measures of central tendency
• Variation (Spread): Measures of spread
• Position: Measures of position
Measures of Central Tendency
What Do We Mean By Average?
• Mean
• Median
• Mode

Measures of Central Tendency: Mean
• The mean is the quotient of the sum of the values and the total number of values.
• The symbol X̄ (X-bar) is used for the sample mean: X̄ = ΣX / n.
• For a population, the Greek letter μ (mu) is used for the mean: μ = ΣX / N.
Example : Days Off per Year
The data represent the number of days off per year for
a sample of individuals selected from nine different
countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30

The mean number of days off is 30.7 days.
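The arithmetic behind this example can be checked with a few lines of Python. This is a small sketch using the standard library; the variable names are illustrative only.

```python
from statistics import mean

# Days off per year for the nine sampled individuals
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

total = sum(days_off)          # sum of the values
n = len(days_off)              # number of values
sample_mean = mean(days_off)   # same as total / n

print(total, n, round(sample_mean, 1))
```

The sum is 276 over 9 values, so the mean is 276 / 9, which rounds to 30.7.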

Mean is a Balancing Point
Mean = 24.5 (the balancing point)

Measures of Central Tendency: Median
• The median is the midpoint of the data array. The
symbol for the median is MD.
• The median will be one of the data values if there is
an odd number of values.
• The median will be the average of two data values if
there is an even number of values.

Example: Hotel Rooms
The number of rooms in the seven hotels in
downtown Pittsburgh is 713, 300, 618, 595, 311,
401, and 292. Find the median.

Step 1: Sort in ascending order.
292, 300, 311, 401, 595, 618, 713

Step 2: Select the middle value.
MD = 401

The median is 401 rooms.
Example : Tornadoes in the U.S.
The number of tornadoes that have occurred in
the United States over an 8-year period follows.
Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303

Sort in ascending order.
656, 684, 702, 764, 856, 1132, 1133, 1303

Find the average of the two middle values.
MD = (764 + 856) / 2 = 810

The median number of tornadoes is 810.
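Both median cases (odd and even number of values) can be reproduced with a short sketch. The helper `median_by_hand` is a hypothetical name that spells out the sort-then-pick-the-middle steps; the standard library `statistics.median` is used as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Sort, then take the middle value (odd n) or average the two middle values (even n)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

hotel_rooms = [713, 300, 618, 595, 311, 401, 292]             # 7 values (odd)
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]       # 8 values (even)

print(median_by_hand(hotel_rooms), median(hotel_rooms))
print(median_by_hand(tornadoes), median(tornadoes))
```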


Measures of Central Tendency: Mode
• The mode is the value that occurs most often in a
data set.
• It is sometimes said to be the most typical case.
• There may be no mode, one mode (unimodal), two
modes (bimodal), or many modes (multimodal).

Example : NFL Signing Bonuses
Find the mode of the signing bonuses of eight
NFL players for a specific year. The bonuses in
millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10

You may find it easier to sort first.
10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5

Select the value that occurs the most.

The mode is 10 million dollars.
Example: Coal Employees in PA
Find the mode for the number of coal employees per
county for 10 selected counties in southwestern
Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752

No value occurs more than once.

There is no mode.

Example : Licensed Nuclear Reactors
The data show the number of licensed nuclear
reactors in the United States for a recent 15-year
period. Find the mode.
104 104 104 104 104 107 109 109 109 110
109 111 112 111 109

104 and 109 both occur the most. The data set is said to be bimodal.

The modes are 104 and 109.
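The three mode examples (one mode, no mode, two modes) can be sketched with a counter. The helper `modes` is a hypothetical name; it follows the convention used in these slides that a data set in which no value repeats has no mode, which differs from some library functions that would return every value in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # slide convention: no value occurs more than once, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]                # NFL signing bonuses
coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]          # coal employees
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]                                # licensed reactors

print(modes(bonuses))    # one mode (unimodal)
print(modes(coal))       # no mode
print(modes(reactors))   # two modes (bimodal)
```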


Mode for Categorical(Nominal) Data
• Most typical case when data are nominal
Properties of the Mean
• Uses all data values.
• Gives the balancing point.
• Used in computing other statistics, such as the variance.
• Unique, and usually not one of the data values.
• Affected by extremely high or low values, called outliers; generally it should not be used for highly skewed data.
Which diet is better?

Latest weight loss diets:
DIET-1: Mean weight loss = 34.5 Kg
DIET-2: Mean weight loss = 18.5 Kg

Diet 1: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
Diet 2: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
Pop Quiz: Means vs Medians
• The change in weight (lbs) of two dieting groups is shown:
• Diet 1: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
• Diet 2: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
• Calculate the average weight loss in each group and decide which is better.
• Answer: Diet 1: mean = -34.5, median = -4.5
• Diet 2: mean = -18.5, median = -19

Diet 2 is better since there are no outliers and the mean and median agree with each other.
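The quiz answer can be verified with a short sketch that computes the mean and median of each group, showing how the single value of -300 drags the Diet 1 mean away from its median. Variable names are illustrative.

```python
from statistics import mean, median

# Change in weight for each participant (negative = weight lost)
diet1 = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
diet2 = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

for name, changes in (("Diet 1", diet1), ("Diet 2", diet2)):
    gap = abs(mean(changes) - median(changes))  # large gap hints at outliers or skew
    print(f"{name}: mean={mean(changes)}, median={median(changes)}, gap={gap}")
```

For Diet 1 the mean and median differ by 30, while for Diet 2 they differ by only 0.5, which is why the median is the safer summary when an outlier is present.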
When to use mean, median, mode

Type of data               Best measure of central tendency
Quantitative (symmetric)   Mean
Quantitative (skewed)      Median
Ordinal                    Median/Mode
Nominal/Bimodal            Mode
Distributions
Note how the mean gets dragged towards the skew (the longer tail).
[Figure panels: Positively Skewed or Right Skewed; Negatively Skewed or Left Skewed; Symmetric]
QUIZ
• Which measure of Central Tendency would you use:
• 1. Salaries of doctors in a hospital. (Hint: salaries are typically skewed)
• 2. Test scores of all students in USMLE Step 1
• 3. Disease stages in a group of patients with Reye’s syndrome. (Hint: ordering?)
• 4. The blood type of 25 army cadets

Answers: 1) median, 2) mean, 3) median, 4) mode
Introductory Statistics-Part2
Summarizing data and EDA

Objectives for Today

-- Summarizing data: Measures of Spread
-- Summarizing data: Measures of Position
-- EDA: Exploratory Data Analysis
Measures of Variation
How Can We Measure Variability?
• Range
• Variance
• Standard Deviation
• Empirical Rule (Normal)

Measures of Variation: Range
• The range is the difference between the highest and lowest values in a data set.
• Very useful to get the spread quickly
Example : Outdoor Paint
Two experimental brands of outdoor paint are tested
to see how long each will last before fading. Six cans
of each brand constitute a small population. The
results (in months) are shown. Find the mean and
range of each group.
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25

The average for both brands is the same, but the range
for Brand A is much greater than the range for Brand B.
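
The comparison can be reproduced with a short sketch that computes the mean and range of each brand from the table. The helper `value_range` is a hypothetical name used only for this illustration.

```python
from statistics import mean

# Months before fading for six cans of each brand
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

for name, months in (("Brand A", brand_a), ("Brand B", brand_b)):
    print(f"{name}: mean={mean(months)}, range={value_range(months)}")
```

Both brands average 35 months, but the range is 50 months for Brand A and 20 months for Brand B.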

Which brand would you buy?
Examine Data Sets Graphically

Brand B is less variable and hence more consistent.

Range is affected by outliers.
Challenge
• Find the average distance from the mean!
• Hint: Distance from the mean = X − μ (value minus mean)
Measures of Variation: Variance & Standard Deviation
• The variance is the average of the squares of the distance each value is from the mean (mean squared distance).
• The standard deviation is the square root of the variance.
• The standard deviation is a measure of the average spread of the data (not the exact average distance).
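Measures of Variation: Variance & Standard Deviation
• The population variance and population standard deviation are
  σ² = Σ(X − μ)² / N and σ = √( Σ(X − μ)² / N )
• The sample variance and sample standard deviation are
  s² = Σ(X − X̄)² / (n − 1) and s = √( Σ(X − X̄)² / (n − 1) )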
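
As a hedged illustration of the population and sample definitions, the sketch below applies them to the outdoor paint data from the earlier example, which the slides treat as a small population. The hand-written helper `population_variance` is a hypothetical name; the `statistics` functions are used as a cross-check, and the sample versions are shown only to contrast the n − 1 divisor.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, variance

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

def population_variance(values):
    """Average of the squared distances from the mean (divide by N)."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

for name, months in (("Brand A", brand_a), ("Brand B", brand_b)):
    var_pop = population_variance(months)
    print(f"{name}: population variance={var_pop:.1f}, "
          f"population sd={sqrt(var_pop):.1f}, "
          f"sample variance (n - 1)={variance(months)}")
```

Brand A has the larger variance and standard deviation, which agrees with the earlier observation that Brand B is more consistent.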

Uses of the Variance and Standard Deviation
• To determine the spread of the data (if the variance or standard deviation is large, the data are more dispersed).
• To determine the consistency of a variable.
• Used in inferential statistics.
• To determine the number of data values that fall within a specified interval in a distribution (Empirical rule).
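Measures of Variation: Empirical Rule (Normal)
For a bell-shaped (normal) distribution, approximately:
• 68% of the data values fall within 1 standard deviation of the mean
• 95% of the data values fall within 2 standard deviations of the mean
• 99.7% of the data values fall within 3 standard deviations of the mean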
Pop Quiz!
• The mean weight of a group of children is 49 kg with a standard deviation of 3 kg. Assuming that the distribution of weights is bell-shaped, approximately what percent of children weigh:
• A) between 46 and 52 kg?
• B) between 46 and 55 kg?
• C) more than 55 kg?
• Answer: A) 68%, B) 81.5%, C) 2.5%
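The quiz answers follow from the empirical rule (about 68%, 95% and 99.7% of a bell-shaped distribution lie within 1, 2 and 3 standard deviations of the mean). The sketch below works them out band by band and, as a cross-check, compares them with exact normal probabilities from `statistics.NormalDist`; the `BAND` dictionary is a hypothetical helper holding the rounded percentage in each one-standard-deviation band on one side of the mean.

```python
from statistics import NormalDist

mean_kg, sd_kg = 49, 3

# Approximate percent of data in each 1-SD-wide band on ONE side of the mean
BAND = {1: 34.0, 2: 13.5, 3: 2.35}

answer_a = 2 * BAND[1]                  # 46..52 kg: mean -1 SD to mean +1 SD
answer_b = BAND[1] + BAND[1] + BAND[2]  # 46..55 kg: mean -1 SD to mean +2 SD
answer_c = 50 - BAND[1] - BAND[2]       # above 55 kg: beyond mean +2 SD

# Exact values under a normal model, for comparison with the rounded rule
weights = NormalDist(mean_kg, sd_kg)
exact_a = 100 * (weights.cdf(52) - weights.cdf(46))
exact_b = 100 * (weights.cdf(55) - weights.cdf(46))
exact_c = 100 * (1 - weights.cdf(55))

print(answer_a, answer_b, answer_c)
print(round(exact_a, 2), round(exact_b, 2), round(exact_c, 2))
```

The exact figures differ slightly from the rule-of-thumb answers because the empirical rule uses rounded percentages.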
Measures of Position: Z-score
• A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation: z = (X − X̄) / s for a sample, or z = (X − μ) / σ for a population.
• A z-score represents the number of standard deviations a value is above or below the mean.
Example : Test Scores
A student scored 65 on a calculus test that had a
mean of 50 and a standard deviation of 10; she scored
30 on a history test with a mean of 25 and a standard
deviation of 5. Compare her relative positions on the
two tests.

Calculus: z = (65 − 50) / 10 = 1.5. History: z = (30 − 25) / 5 = 1.0.
She has a higher relative position in the calculus class.
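
The comparison can be reproduced with a short sketch; the function name `z_score` is illustrative only.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_calculus = z_score(65, mean=50, sd=10)
z_history = z_score(30, mean=25, sd=5)

better = "calculus" if z_calculus > z_history else "history"
print(z_calculus, z_history, better)
```

A score 1.5 standard deviations above the mean in calculus beats a score 1.0 standard deviation above the mean in history, even though the raw scores are on different scales.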

Measures of Position: Percentiles
• Percentiles separate the data set into 100 equal
groups.
• A percentile rank for a datum represents the
percentage of data values below the datum.
• They are denoted as P1 , P2 , …, P99

Measures of Position: Example of a Percentile Graph
[Figure: percentile graph with percentile on the vertical axis and blood pressure on the horizontal axis]
• What is the percentile rank for a blood pressure of 130? 70th percentile
• What is the 40th percentile BP? 118 mm Hg
• What is the 60th percentile BP?
Measures of Position: Quartiles
• Quartiles separate the data set into 4 equal groups: Q1 = P25, Q2 = MD, Q3 = P75
• Q2 = median(Low, High)
  Q1 = median(Low, Q2)
  Q3 = median(Q2, High)
• The interquartile range (IQR) represents the middle 50% of the data: IQR = Q3 – Q1.
Example : Quartiles
Find Q1, Q2, and Q3 for the data set. Also find the IQR.
15, 13, 6, 5, 12, 50, 22, 18

Sort in ascending order.
5, 6, 12, 13, 15, 18, 22, 50

Q2 = (13 + 15) / 2 = 14
Q1 = (6 + 12) / 2 = 9
Q3 = (18 + 22) / 2 = 20
IQR = Q3 – Q1 = 20 – 9 = 11
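The quartiles in this example come from the median-of-halves method: Q2 is the median of all the data, Q1 is the median of the values below Q2, and Q3 is the median of the values above Q2. The sketch below implements that method in a hypothetical helper `quartiles`. It assumes the middle value is left out of both halves when the number of values is odd, which the slides do not specify, and it can give different results from library routines that interpolate between values.

```python
from statistics import median

def quartiles(values):
    """Median-of-halves quartiles; the middle value is excluded from both halves when n is odd."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                      # values below the median
    upper = data[half:] if n % 2 == 0 else data[half + 1:]   # values above the median
    return median(lower), median(data), median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```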
Outliers
• An outlier is an extremely high or low data value.
• Any data value less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR) is considered an outlier.
• Example in previous case: IQR = 20 – 9 = 11
• Q1 – 1.5(IQR) = 9 – 16.5 = –7.5
• Q3 + 1.5(IQR) = 20 + 16.5 = 36.5
• 50 is greater than 36.5, so 50 is an outlier
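The 1.5 × IQR rule can be sketched as a small function. The name `find_outliers` is hypothetical, and the quartile values Q1 = 9 and Q3 = 20 are taken directly from the previous example rather than recomputed.

```python
def find_outliers(values, q1, q3):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, flagged

data = [15, 13, 6, 5, 12, 50, 22, 18]
lower_fence, upper_fence, outliers = find_outliers(data, q1=9, q3=20)

print(lower_fence, upper_fence, outliers)
```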

Exploratory Data Analysis (EDA)
• The Five-Number Summary is composed of the following numbers: Low, Q1, MD, Q3, High
• The Five-Number Summary can be graphically represented using a Boxplot.
Example : Meteorites
The number of meteorites found in 10 U.S. states is shown.
Find the 5 number summary and construct a boxplot for the
data. 89, 47, 164, 296, 30, 215, 138, 78, 48, 39
Sort the data
30, 39, 47, 48, 78, 89, 138, 164, 215, 296

Five-Number Summary (Low-Q1-MD-Q3-High): 30-47-83.5-164-296

[Figure: boxplot with whiskers at 30 and 296, box from 47 to 164, and median line at 83.5]
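
The five-number summary for the meteorite data can be reproduced with a short sketch. The helper `five_number_summary` is a hypothetical name, and it assumes the median-of-halves method for Q1 and Q3 (middle value excluded from both halves when the count is odd); other quartile conventions may give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (Low, Q1, MD, Q3, High) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return data[0], median(lower), median(data), median(upper), data[-1]

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
low, q1, md, q3, high = five_number_summary(meteorites)

print(low, q1, md, q3, high)
```

These five values are exactly what a boxplot draws: the whisker ends (Low, High), the box edges (Q1, Q3) and the line inside the box (MD).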

Information from Boxplots
• To find the average, use the median (line inside the box)
• To find variability, use the IQR (length of the box)
• If the median is near the centre of the box and the lines are about equal → symmetric
• If the median is at the left side of the box and/or the right line is longer → right skew
• If the median is at the right side of the box and/or the left line is longer → left skew
Example boxplot: cheese substitute has a higher median, but real cheese shows greater variability.
TRY THIS!
• Which area has highest age?
• Which area has highest variability?
• Which area has highest range?
Symbols
• s² = Sample variance
• s = Sample standard deviation
• σ² = Population (true or theoretical) variance
• σ = Population standard deviation
• X̄ = Sample mean
• µ = Population mean
• IQR = interquartile range (middle 50%)
• r = Sample correlation coefficient
• ρ = Population correlation coefficient
• n = Sample size
• N = Population size
Reference Book
• OpenIntro Statistics https://www.openintro.org/book/os/
• Link for intro Google sheet: https://docs.google.com/spreadsheets/d/1sxvDzkgJUdLmYwPJD0lMw2MGI-HG_qPkp6ROhzwxC7I/edit?usp=sharing
