Chapter Two
Statistics
Objectives
At the end of this course students will be able to:
Purpose
– Make decisions about population characteristics (inferential statistics)
Inferential Statistics
– Estimation: point estimation and interval estimation
– Hypothesis testing: one sample and two samples
Inferential process
Statistical Estimation
Estimation is the process of determining a likely value
of a population parameter, based on information
collected from the sample.
Estimation is the use of sample statistics to estimate the
corresponding population parameters.
The objective of estimation is to determine the
approximate value of an unknown population parameter
on the basis of a sample statistic.
Sample Statistics as Estimators of Population
Parameters
A statistic computed from a random sample serves as an
estimator of the corresponding population parameter.
Estimation is of two kinds: point estimation and interval
estimation.
Point and Interval Estimates
A point estimate is a single value used as an estimate of a population
parameter
An interval estimate is a range from a lower confidence
limit to an upper confidence limit, with the point estimate
lying between them.
Estimation Process
The population mean μ is unknown. From a random sample
we obtain a point estimate (e.g., X̄ = 50) and an interval
estimate (e.g., "I am 95% confident that μ is between 40
and 60").
Point estimation
C) Consistency: An estimator is said to be consistent if its
probability of being close to the parameter it estimates
increases as the sample size increases.
Consistency: the sampling distribution of the estimator
concentrates around the parameter as the sample size
grows (e.g., n = 10 vs. n = 100).
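The consistency idea can be illustrated with a small simulation (illustrative only, not from the slides; the population mean 50 and standard deviation 10 are made-up numbers): sample means from larger samples cluster more tightly around the population mean.

```python
import random
import statistics

random.seed(0)

def spread_of_sample_means(n, reps=2000, mu=50, sigma=10):
    """Standard deviation of `reps` sample means, each from a sample of size n."""
    means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(10)   # roughly sigma / sqrt(10)
spread_large = spread_of_sample_means(100)  # roughly sigma / sqrt(100): much tighter
print(spread_small, spread_large)
```

The sample mean is consistent: its spread shrinks like σ/√n as n grows.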
Interval estimation
A confidence interval, or interval estimate, is a range or
interval of numbers believed to include an unknown
population parameter.
A confidence interval provides a range of values of the
estimate likely to include the "true" population parameter
with a given probability.
For the mean with unknown σ, the interval is
X̄ ± t(α/2, n−1) · s/√n
(e.g., (168.04, 171.96)).
Example: The average earnings per share (EPS)
for 10 industrial stocks randomly selected from
those listed on the Dow-Jones Industrial
Average was found to be X̄ = 1.85 with a
standard deviation of S=0.395. Calculate a 99%
confidence interval for the average EPS of all
the industrials listed on the DJIA.
Solution:
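The slide leaves the solution blank; a sketch of the computation in Python (the t critical value 3.250 for α/2 = 0.005, df = 9 is taken from a t-table):

```python
import math

# 99% CI for mean EPS: n = 10, X-bar = 1.85, s = 0.395.
# Population sigma is unknown and n is small, so use the t distribution.
n, xbar, s = 10, 1.85, 0.395
t_crit = 3.250                       # t(0.005, df = n - 1 = 9) from a t-table
margin = t_crit * s / math.sqrt(n)   # margin of error
lower, upper = xbar - margin, xbar + margin
print(f"99% CI: ({lower:.2f}, {upper:.2f})")  # -> 99% CI: (1.44, 2.26)
```

So we are 99% confident the mean EPS of all industrials on the DJIA lies between about 1.44 and 2.26.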
Example: A random sample of 900 workers
showed an average height of 67 inches with a
standard deviation of 5 inches.
Solution:
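The slide states the data but leaves the task implicit; assuming a 95% confidence interval for the mean height is wanted, the computation is (with n = 900 the normal z critical value 1.96 applies):

```python
import math

# Assumed task: 95% CI for mean height; n = 900, X-bar = 67 in, s = 5 in.
n, xbar, s = 900, 67, 5
z_crit = 1.96                        # z(0.025) for 95% confidence
margin = z_crit * s / math.sqrt(n)
lower, upper = xbar - margin, xbar + margin
print(f"95% CI: ({lower:.2f}, {upper:.2f}) inches")  # -> (66.67, 67.33)
```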
For a single proportion, the interval estimate is
p̂ ± z(α/2) · √(p̂(1 − p̂)/n)
For planning, values such as the standard deviation s or the
proportion p may be taken from similar studies.
Estimation of single proportion
n = (z(α/2))² · p · q / d²
Where:
n = sample size
p = estimated proportion
q = 1 − p
d = desired degree of precision
z = the standard normal value at the desired level of
confidence (usually 95% confidence, z = 1.96)
Example
A) Suppose that you are interested to know the proportion of
infants who breastfed >18 months of age in a rural area.
Suppose that in a similar area, the proportion (p) of breastfed
infants was found to be 0.20. What sample size is required to
estimate the true proportion within ±3% with 95% confidence?
n = (1.96² × 0.20 × 0.80) / 0.03² ≈ 683
With the finite-population correction for N = 3000:
n_final = n / (1 + n/N) = 683 / (1 + 683/3000) ≈ 557
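The same calculation as a short Python sketch, including the finite-population correction:

```python
import math

# Sample size for estimating a proportion: p = 0.20 (from a similar area),
# precision d = 0.03, z = 1.96 for 95% confidence, population N = 3000.
z, p, d, N = 1.96, 0.20, 0.03, 3000
q = 1 - p
n0 = math.ceil(z**2 * p * q / d**2)      # initial sample size
n_final = math.ceil(n0 / (1 + n0 / N))   # finite-population correction
print(n0, n_final)  # -> 683 557
```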
An estimate of p is not always available
However, the formula may also be used for sample size
calculation based on various assumptions for the values of p.
Is X̄ = 20 likely if H0 is true?
Take a sample → X̄ = 20
No, not likely! REJECT H0
Steps in hypothesis testing
State the null and alternative hypotheses:
H0: μ = μ0
HA: μ ≠ μ0
Two-tailed
Example:
A. Is the mean SBP of the population different from 120
mmHg?
2. Type II Error
– Probability of failing to reject a false null hypothesis
– Equivalently, the probability of failing to accept a true alternative hypothesis
– Probability of Type II Error is β (beta)
Type I & II Errors Have an Inverse
Relationship
If you reduce the probability of one
error, the other one increases, provided
everything else is unchanged.
Factors Affecting Type II Error (β)
Significance level
– β increases when α decreases
Population standard deviation
– β increases when σ increases
Sample size
– β increases when n decreases
Controlling Type I and
Type II Errors
For any fixed α, an increase in the sample
size n will cause a decrease in β.
For any fixed sample size n, a decrease in α
will cause an increase in β. Conversely, an
increase in α will cause a decrease in β.
To decrease both α and β, increase the
sample size.
Power of a statistical test
The power of a statistical test is the probability of
rejecting Ho, when Ho is really false. Thus power =
1-β.
Clearly, if the test maximizes power, it minimizes the
probability of Type II error, β.
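Power = 1 − β can be computed directly for a one-sided z-test. The numbers below (μ0 = 50, μ1 = 52, σ = 5, n = 25, α = 0.05) are hypothetical, chosen only to illustrate the formula:

```python
from statistics import NormalDist

# Power of a right-tailed z-test: H0: mu = 50 vs true mean mu1 = 52,
# sigma = 5, n = 25, alpha = 0.05 (all made-up illustrative values).
Z = NormalDist()
mu0, mu1, sigma, n, alpha = 50, 52, 5, 25, 0.05
z_alpha = Z.inv_cdf(1 - alpha)              # critical value, about 1.645
shift = (mu1 - mu0) / (sigma / n ** 0.5)    # standardized true effect
power = 1 - Z.cdf(z_alpha - shift)          # P(reject H0 | HA true) = 1 - beta
print(round(power, 3))
```

Increasing n (or the effect size) raises the power, consistent with the points above.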
Summary:
Elements of a Hypothesis Test
Null Hypothesis (H0)
– A theory about the values of one or more population
parameters. The status quo.
Alternative Hypothesis (Ha)
– A theory that contradicts the null hypothesis. The theory
generally represents that which we will accept only when
sufficient evidence exists to establish its truth.
Test Statistic
– A sample statistic used to decide whether to reject the null
hypothesis. In general,
test statistic = (Estimate − Hypothesized parameter) / Standard error
Summary:
Elements of a Hypothesis Test
Critical Value
– A value to which the test statistic is compared at some
particular significance level (usually α = .01, .05, or .10)
Rejection Region
– The numerical values of the test statistic for which the null
hypothesis will be rejected.
– The probability is α that the rejection region will contain the
test statistic when the null hypothesis is true, leading to a
Type I error. α is usually chosen to be small (.01, .05, .10)
and is called the level of significance of the test.
Summary of One- and Two-Tail Tests
One-tail test (left tail): H0: μ ≥ μ0, HA: μ < μ0
Two-tail test: H0: μ = μ0, HA: μ ≠ μ0
One-tail test (right tail): H0: μ ≤ μ0, HA: μ > μ0
Summary: Rejection Regions
Two-tailed (HA: μ ≠ μ0): reject H0 if |z| > z(α/2), with area α/2 in each tail.
Left-tailed (HA: μ < μ0): reject H0 if z < −z(α).
Right-tailed (HA: μ > μ0): reject H0 if z > z(α).
(z = value of the sample statistic on the standard normal scale.)
p-Value Solution
Since p-value = 0.0668 ≥ α = 0.05, do not reject H0.
The test statistic z = 1.50 is below the critical value
z = 1.645, i.e., in the non-rejection region.
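The p-value decision above can be reproduced with the standard normal CDF (right-tailed test, observed z = 1.50, α = 0.05):

```python
from statistics import NormalDist

# p-value for a right-tailed z-test with observed z = 1.50.
Z = NormalDist()
z_obs, alpha = 1.50, 0.05
p_value = 1 - Z.cdf(z_obs)        # upper-tail area beyond z_obs
z_crit = Z.inv_cdf(1 - alpha)     # critical value, about 1.645
reject = p_value < alpha          # equivalent to: z_obs > z_crit
print(round(p_value, 4), reject)  # -> 0.0668 False
```

Both routes give the same decision: 0.0668 ≥ 0.05 and 1.50 < 1.645, so H0 is not rejected.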
Example: One-tailed Test
The statistic (n − 1)s²/σ² has a distribution called
chi-squared with d.f. = n − 1, if the population is
normally distributed. Its shape depends on the degrees
of freedom (e.g., d.f. = 1, 5, 10).
Properties of Chi-Square Distribution
Variable A × Variable B contingency table:

            B1    B2    B3    B4    Totals
A1
A2
A3
Totals                              Grand total

where:
r = number of rows (number of categories of variable A)
c = number of columns (number of categories of variable B)
Chi-Square
Hypothesis to be tested:
H0: There is no association between the
row and column variables
HA: There is an association
or
H0: The row and column variables are
independent
HA: The two variables are dependent
Test statistic: χ²-test with df = (r − 1) × (c − 1)
Chi-Square (χ²) test
where:
O_ij = observed frequency of the i-th row and j-th column
E_ij = (i-th row total × j-th column total) / grand total = (R_i × C_j) / n
R_i = marginal total of the i-th row
C_j = marginal total of the j-th column
n = grand total
An alternative method to calculate Chi-Square for a 2×2 table:

            Outcome
Exposure    Yes   No    Total
Yes         a     b     r1
No          c     d     r2
Total       c1    c2    n

χ² = n(ad − bc)² / (r1 × r2 × c1 × c2)
Remember that Chi-Square test should be applied
to counts and not percentages
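The 2×2 shortcut gives exactly the same value as the general (O − E)²/E formula; a quick check on a hypothetical table of counts (a, b, c, d below are made-up numbers):

```python
# Hypothetical 2x2 table of counts (not from the slides).
a, b, c, d = 10, 20, 30, 40
r1, r2 = a + b, c + d          # row totals
c1, c2 = a + c, b + d          # column totals
n = a + b + c + d              # grand total

# Shortcut formula for a 2x2 table.
chi_shortcut = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

# General formula: sum of (O - E)^2 / E over the four cells,
# with E = row total * column total / n.
cells = [(a, r1, c1), (b, r1, c2), (c, r2, c1), (d, r2, c2)]
chi_general = sum((o - r * col / n) ** 2 / (r * col / n) for o, r, col in cells)

print(round(chi_shortcut, 3))  # -> 0.794, identical for both formulas
```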
Characteristics of the Chi-Square
Distribution
1. It is not symmetric.
2.The shape of the chi-square distribution depends upon
the degrees of freedom, just like Student’s t-distribution.
Steps
(2) no more than 20% of the expected frequencies are less than 5.
χ² = Σ (O − E)² / E
Alcohol consumption

Gender         Low   Moderate   High   Row total
Male           10    9          8      27
Female         13    16         12     41
Column total   23    25         20     68
Solution
Step 1 State the hypotheses:
H0: alcohol consumption is independent of gender
HA: alcohol consumption is associated with gender
Step 2 Find the critical value: the critical value is
4.605 (at α = 0.10), since the degrees of freedom are (2−1)(3−1) = 2
Step 3 Compute the expected frequencies from the row and
column totals of the alcohol-consumption table, then the
test value:
χ² = Σ over all cells of (O − E)²/E = 0.283
Step 4 Make the decision: Do not reject the null
hypothesis, since 0.283 < 4.605
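The whole test can be reproduced in a few lines; the tiny difference from the slide's 0.283 comes from rounding the expected frequencies by hand:

```python
# Gender x alcohol-consumption table from the slides.
observed = [[10, 9, 8],    # Male
            [13, 16, 12]]  # Female

row_tot = [sum(row) for row in observed]        # [27, 41]
col_tot = [sum(col) for col in zip(*observed)]  # [23, 25, 20]
n = sum(row_tot)                                # 68

# chi-square = sum of (O - E)^2 / E, with E = row total * column total / n.
chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(2) for j in range(3))

print(round(chi2, 3))  # -> 0.281; 0.281 < 4.605, so do not reject H0
```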
Scatter Plot
[Scatter plot of Grade (%) vs. Hours Studied]
The graph suggests a positive relationship between hours
of study and grades.
Correlation Coefficient
Measures the strength and direction of a relationship
between two variables.

r = covariance(x, y) / √(var x × var y)
  = [Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)]
    / [√(Σᵢ(xᵢ − x̄)²/(n − 1)) × √(Σᵢ(yᵢ − ȳ)²/(n − 1))]
An alternative formula:

r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² × Σᵢ(yᵢ − ȳ)²)
Correlation Coefficient
Unit-less; ranges from −1 to +1, with 0 indicating no
linear relationship.
Strength of relationship
Correlations from 0 to 0.25 (or 0 to −0.25) indicate
little or no relationship;
those from 0.25 to 0.50 (or −0.25 to −0.50)
indicate a fair degree of relationship;
those from 0.50 to 0.75 (or −0.50 to −0.75) a
moderate to good relationship; and
those greater than 0.75 (or −0.75 to −1.00) a very
good to excellent relationship.
Scatter Plots of Data with Various
Correlation Coefficients
[Six panels of Y vs. X: r = −1, r = −.6, r = 0, r = +1, r = +.3, r = 0]
Linear Correlation
[Panels of Y vs. X contrasting linear vs. curvilinear
relationships, strong vs. weak relationships, and no
relationship]
Example
The following table shows the Systolic blood pressure and body
weight from a sample of twenty young adults
Subject SBP(mmHg) Weight(kg) Subject SBP(mmHg) Weight(kg)
1 106 60 11 94 48
2 111 42 12 97 46
3 115 53 13 96 39
4 102 49 14 115 66
5 126 67 15 79 39
6 85 47 16 108 51
7 125 62 17 97 57
8 103 48 18 96 37
9 108 49 19 95 49
10 98 55 20 95 37
Correlation coefficient ?
Subject  Weight(x)  SBP(y)  (x − x̄)  (y − ȳ)  (x − x̄)(y − ȳ)  (x − x̄)²  (y − ȳ)²
1 60 106 9.95 3.45 34.33 99.00 11.90
2 42 111 -8.05 8.45 -68.02 64.80 71.40
3 53 115 2.95 12.45 36.73 8.70 155.00
4 49 102 -1.05 -0.55 0.58 1.10 0.30
5 67 126 16.95 23.45 397.48 287.30 549.90
6 47 85 -3.05 -17.55 53.53 9.30 308.00
7 62 125 11.95 22.45 268.28 142.80 504.00
8 48 103 -2.05 0.45 -0.92 4.20 0.20
9 49 108 -1.05 5.45 -5.72 1.10 29.70
10 55 98 4.95 -4.55 -22.52 24.50 20.70
Cont…
Subject  Weight(x)  SBP(y)  (x − x̄)  (y − ȳ)  (x − x̄)(y − ȳ)  (x − x̄)²  (y − ȳ)²
11 48 94 -2.05 -8.55 17.53 4.20 73.10
12 46 97 -4.05 -5.55 22.48 16.40 30.80
13 39 96 -11.05 -6.55 72.38 122.10 42.90
14 66 115 15.95 12.45 198.58 254.40 155.00
15 39 79 -11.05 -23.55 260.23 122.10 554.60
16 51 108 0.95 5.45 5.18 0.90 29.70
17 57 97 6.95 -5.55 -38.57 48.30 30.80
18 37 96 -13.05 -6.55 85.48 170.30 42.90
19 49 95 -1.05 -7.55 7.93 1.10 57.00
20 37 95 -13.05 -7.55 98.53 170.30 57.00
r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² × Σᵢ(yᵢ − ȳ)²)
  = 1423.45 / √(1552.90 × 2724.90)
  = 0.69
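The hand computation in the table can be checked directly from the raw data:

```python
import math

# Weight and SBP for the 20 subjects in the example table.
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
xbar = sum(weight) / n   # 50.05
ybar = sum(sbp) / n      # 102.55

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(weight, sbp))  # ~1423.45
sxx = sum((x - xbar) ** 2 for x in weight)                        # ~1552.95
syy = sum((y - ybar) ** 2 for y in sbp)                           # ~2724.95

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))  # -> 0.69
```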
2. Regression
Regression analysis is used to predict the value of one variable
(the dependent variable) on the basis of other variables (the
independent variables).
o Dependent variable: denoted Y
o Independent variables: denoted X1, X2, …, Xk
Prediction
If you know something about X, this knowledge helps you
predict something about Y.
Linear Regression
a is the intercept of the systematic component of the regression
relationship.
β is the slope of the systematic component.
Picturing the Simple Linear Regression Model
[Plot of the regression line with intercept a]
Actual observed values of Y (y) differ from the expected value
(μ_y|x) by an unexplained or random error (ε):
y = μ_y|x + ε = a + b·x + ε
Simple Linear Regression equation…
E(y/x) = α + βx
Regression coefficient β
– Measures association between y and x
– Amount by which y changes on average when x
changes by one unit
Assumptions of the Simple Linear Regression Model
The LINE assumptions of the simple linear regression model:
– The relationship between X and Y is a straight-line (linear)
relationship: μ_y|x = α + βx.
– The observations are independent; the errors are
uncorrelated (i.e., independent) in successive observations.
– The errors are normally distributed with mean 0 and
variance σ² (equal variance). That is: ε ~ N(0, σ²).
[Figure: identical normal distributions of errors, N(μ_y|x, σ_y|x²),
all centered on the regression line.]
Regression Picture
For each observation, the total deviation of yᵢ from ȳ (A) splits
into an explained part (B, from the fitted value ŷᵢ to ȳ) and a
residual (C, from yᵢ to ŷᵢ):

Σᵢ(yᵢ − ȳ)² = Σᵢ(ŷᵢ − ȳ)² + Σᵢ(ŷᵢ − yᵢ)²
A²          = B²           + C²
SStotal     = SSreg        + SSresidual

SStotal: total squared distance of observations from the mean of y
(total variation).
SSreg: distance from the regression line to the mean of y
(variability due to x, the regression).
SSresidual: variance around the regression line (additional
variability not explained by x); this is what the least squares
method aims to minimize.
Least squares estimation gave us the line (β) that minimized C².
The equation of straight line
The equation of a straight line is y = α + βx, where

β = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²    and    α = ȳ − βx̄
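Applying the slope and intercept formulas to the weight/SBP example data gives the fitted line:

```python
# Least-squares line SBP = alpha + beta * weight for the 20 example subjects.
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
xbar, ybar = sum(weight) / n, sum(sbp) / n

# beta = Sxy / Sxx, alpha = ybar - beta * xbar
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(weight, sbp))
        / sum((x - xbar) ** 2 for x in weight))
alpha = ybar - beta * xbar

print(f"SBP = {alpha:.1f} + {beta:.2f} * weight")  # -> SBP = 56.7 + 0.92 * weight
```

So mean SBP is estimated to rise by about 0.92 mmHg per extra kg of body weight.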
How Good is the Regression?
The coefficient of determination, R², is a descriptive measure of
the strength of the regression relationship, a measure of how well
the regression line fits the data.
R²: coefficient of determination.
Total deviation (y − ȳ) = Unexplained deviation (y − ŷ, error)
+ Explained deviation (ŷ − ȳ, regression):

Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
SST       = SSE       + SSR

R² = SSR/SST = 1 − SSE/SST: the percentage of total variation
explained by the regression.
Coefficient of Determination

R² = Regression sum of squares / Total sum of squares
   = Σ(ŷ − ȳ)² / Σ(y − ȳ)²

R² is a measure of linear association between x
and y (0 ≤ R² ≤ 1).
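A quick numerical check of the SST = SSE + SSR decomposition and of R², again using the weight/SBP example data (α and β are the least-squares estimates computed from the same formulas):

```python
# Verify SST = SSE + SSR and R^2 for the weight/SBP example.
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
xbar, ybar = sum(weight) / n, sum(sbp) / n
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(weight, sbp))
        / sum((x - xbar) ** 2 for x in weight))
alpha = ybar - beta * xbar
yhat = [alpha + beta * x for x in weight]    # fitted values

sst = sum((y - ybar) ** 2 for y in sbp)               # total
sse = sum((y - yh) ** 2 for y, yh in zip(sbp, yhat))  # unexplained (error)
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained (regression)

r_squared = ssr / sst
print(round(r_squared, 2))  # -> 0.48, the square of r = 0.69
```

About 48% of the variation in SBP is explained by body weight, matching r² for the correlation found earlier.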
The residual for observation i, eᵢ = Yᵢ − Ŷᵢ, is the difference
between its observed and predicted value.
Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)
– Evaluate normal distribution assumption
– Evaluate independence assumption
Residual analysis for linearity: [residual plots vs. x — a curved
pattern in the residuals indicates Not Linear; random scatter
indicates Linear]
Residual Analysis for
Homoscedasticity
[Residual plots vs. x — a fan-shaped spread indicates
Non-constant variance; an even spread indicates Constant variance]
Residual Analysis for
Independence
[Residual plots vs. X — a systematic pattern across observations
indicates Not Independent; patternless scatter indicates Independent]
Logistic Regression