Hypothesis Testing

By
Hirbo Shore (MPH, Assistant Professor)
School of Public Health, CHMS-HU
Contact: [email protected]
Hypothesis testing
Objectives:
• Calculate p-values and test statistics
• Understand the role of significance and the difference between Type I and Type II
errors
• Explain the connection between hypothesis testing and confidence intervals
• Perform parametric hypothesis tests relating to:
a) A sample mean
b) The difference between two sample means (independent samples)
c) The mean difference between two dependent samples
d) A sample proportion
e) The difference between two sample proportions (independent samples)
Hypothesis Testing
• A research question is typically posed as a hypothesis
• Sometimes based on prior research, preliminary observations, or an
“educated guess”
– e.g. in a randomised controlled trial the hypothesis might be “the
proportion of patients with disease X who survive after receiving
a drug treatment is greater than the proportion of patients with
disease X who survive after receiving a placebo”
• The hypothesis often states a difference between two groups
• Hypothesis: a statement about one or more populations.
• The majority of statistical analyses involve comparison, most
obviously between treatments or procedures or between groups of
subjects.
• The numerical value corresponding to the comparison of interest is
often called the effect.
• The purpose of hypothesis testing is to aid the researcher in
reaching a decision concerning a population by examining a sample
from that population.
Steps involved in testing a hypothesis
1. State the research question in terms of statistical hypotheses.
• The null hypothesis, H0, is a statement claiming that there is no
difference between the hypothesized value and the population value.
– (The effect of interest is zero)
• The alternative hypothesis, H1 (also written HA), is a statement that
disagrees with the null hypothesis.
– (The effect of interest is not zero)
• Example
        two-tailed      one-tailed      one-tailed
H0:     μ = μ0          μ ≤ μ0          μ ≥ μ0
H1:     μ ≠ μ0          μ > μ0          μ < μ0
Null hypothesis: things are what they are said to be (the status quo).
H0: μ = μ0      e.g. H0: μ = 1.6 m (mean height)
Research hypotheses: the thing we are primarily interested in “proving”.
HA: μ ≠ μ0      e.g. HA: μ ≠ 1.6 m
HA: μ > μ0      e.g. HA: μ > 1.6 m
HA: μ < μ0      e.g. HA: μ < 1.6 m
2. Select a sample and collect data
3. Decide on the appropriate test statistic for the hypothesis (Z,
t, χ2, F, etc.)
– The test statistic is a function of the data that uses estimates of the
parameters we are interested in and whose sampling distribution
is known when we assume the null hypothesis is true.
– It is a value computed from the sample data that is used in
making the decision about the rejection of the null hypothesis
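4. Select the level of significance for the statistical test (α = 0.05,
0.01, 0.001, etc.)
– Determine the critical value: the value the test statistic must attain
to be declared significant.
[Figure: sampling distributions of the test statistic when H0 is true (centred at μ0) and when HA is true (centred at μ1), with three locations marked (1), (2) and (3)]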
• What would you conclude if the calculated value of the test statistic
fell in location (1)?
• How about location (2)?
• Location (3)?
• Which is most likely 1, 2, or 3 when H0 is true?
6. Perform the calculation
$$\text{Test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error}}$$
7. Decision: reject or do not reject H0
– If the numerical value of the test statistic falls in the rejection
region, we reject the null hypothesis
– If the test statistic does not fall in the rejection region, we do
not reject H0.
8. Draw and state the conclusion.
– If H0 is not rejected, state the conclusion in terms of H0; otherwise
state it in terms of HA
• Another way to make the decision
– Reject the null hypothesis if P < α
– Do not reject the null hypothesis if P ≥ α
• P is the probability of getting a sample statistic at least as extreme
as the calculated statistic if the null hypothesis is true.
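The p-value rule above can be sketched in a few lines of Python. This is only an illustration, assuming a test statistic that follows a standard normal distribution when H0 is true; the function names `two_sided_p_value` and `decide` are made up for this sketch and the z values are arbitrary.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. computed assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the rule: reject H0 if P < alpha, otherwise do not reject H0."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# Arbitrary illustrative z statistics (not taken from the lecture)
for z in (1.50, 1.96, 2.58):
    p = two_sided_p_value(z)
    print(f"z = {z:.2f}  p = {p:.4f}  ->  {decide(p)}")
```

Note that the function returns "do not reject H0" rather than "accept H0": a large p-value is an absence of evidence against H0, not proof that it is true.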
Level of Significance
Example
• Suppose we are interested in finding evidence that there
is a difference in average height between adult
Ethiopian men and women
• We collected some sample data (randomly and with
attention to power, sample size, and potential biases)
• What would the hypothesis testing procedure look like?
Types of Errors
Why does this happen?
• Due to natural random variation among subjects in a
population, sometimes the sample data will lead us to
an incorrect conclusion
• This can happen even if the researcher has managed to
avoid every conceivable source of bias in the design and
implementation of the study; it is an inevitable
consequence of random variation.
• There are four possible situations:
                                     In the population
Decision based on sample             H0 is true         H0 is false (HA is true)
Do not reject H0                     √ Correct          X Type II error
Reject H0 (have evidence for HA)     X Type I error     √ Correct
Type I Error
• A Type I error occurs if, based on the sample data, it is decided
to reject the null hypothesis when in fact (i.e., in the
population) the null hypothesis is true
• The level of significance (α) is the probability of making a
Type I error
• It is the probability of incorrectly rejecting the null hypothesis
• The probability of a Type I error is also called the “false
positive rate”
Type II Error
• A Type II error occurs if, based on the sample data, we do not
reject the null hypothesis when in fact (i.e., in the population) the
null hypothesis is false
• The probability of a Type II error, β, is the probability of incorrectly
not rejecting the null hypothesis
• It is also called the “false negative rate”
• If the probability of a Type II error is β, then the power of the test is
1 - β
• Power is the probability of rejecting the null hypothesis when it is
false
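To make the relation power = 1 - β concrete, here is a rough Python sketch for a two-sided one-sample z-test. It assumes a known population standard deviation (so a z-test rather than a t-test), and the effect size, standard deviation and sample sizes are hypothetical values chosen for illustration; `power_two_sided_z` is not a function from the lecture.

```python
from statistics import NormalDist

def power_two_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean is mu0 + delta.

    Assumes sigma (the population standard deviation) is known.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value, e.g. 1.96
    shift = delta * n ** 0.5 / sigma             # true effect in standard-error units
    # Probability that the test statistic lands in either rejection region
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

# Hypothetical scenario: true difference of 5 units, sigma = 10
for n in (10, 20, 40, 80):
    power = power_two_sided_z(delta=5, sigma=10, n=n)
    print(f"n = {n:3d}  power = {power:.3f}  beta = {1 - power:.3f}")
```

When `delta` is 0 the null hypothesis is true and the function returns α, the Type I error rate; for a non-zero `delta`, power rises (and β falls) as n increases.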
CIs and Hypothesis testing
There is a strong link between (two-tailed) hypothesis testing and
confidence intervals:
• Suppose H0 is that a treatment effect is zero and the significance level is
set at α = 0.05
• Then H0 can be rejected if the 95% confidence interval for the
treatment effect does not include zero.
• This means that the observed estimate lies outside the range of values
that would be expected if the null hypothesis were true
• i.e. the probability of observing a test statistic at least this extreme
when H0 is true is less than α = 0.05
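The link between the two-tailed test and the confidence interval can be checked numerically. The sketch below assumes a normally distributed estimate with a known standard error (a z-based interval); the estimates 1.0 and 2.5 and the standard error 1.0 are hypothetical, and `z_test_and_ci` is an illustrative helper name.

```python
from statistics import NormalDist

def z_test_and_ci(estimate, se, null_value=0.0, alpha=0.05):
    """Return the z statistic, two-sided p-value and (1 - alpha) CI for an estimate."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z = (estimate - null_value) / se
    p = 2 * (1 - std_normal.cdf(abs(z)))
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    return z, p, ci

results = {}
for estimate in (1.0, 2.5):  # hypothetical treatment effects, each with se = 1.0
    z, p, (lo, hi) = z_test_and_ci(estimate, se=1.0)
    ci_excludes_zero = not (lo <= 0.0 <= hi)
    results[estimate] = (p < 0.05, ci_excludes_zero)
    print(f"effect = {estimate}: p = {p:.4f}, 95% CI = ({lo:.2f}, {hi:.2f}), "
          f"reject H0: {p < 0.05}, CI excludes 0: {ci_excludes_zero}")
```

In both cases the two flags agree: the test rejects H0 at α = 0.05 exactly when the 95% interval excludes zero.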
Hypothesis tests
A. A sample mean
B. The difference between two sample means (independent samples)
C. The difference between two sample means (dependent/paired samples)
D. A sample proportion
E. The difference between two sample proportions (independent samples)
a) Hypothesis test for a single mean EXAMPLE
• Researchers collected serum amylase values from a random
sample of 50 apparently healthy subjects.
• They want to know whether they can conclude that the mean
of the population from which the sample was drawn is
different from 120 units/100 ml, at a significance level of 0.05.
• The mean and standard deviation computed from the sample
are 96 and 35 units/100 ml respectively.
• We don’t have the data but will assume a normal distribution
for sample values
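As a worked sketch of this example (computed here from the summary statistics given above, for H0: μ = 120 against HA: μ ≠ 120), the one-sample t statistic and 95% CI are shown below. The critical value 2.01 is the approximate tabulated value of t with 49 df for a two-sided α of 0.05.

```latex
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
  = \frac{96 - 120}{35/\sqrt{50}}
  = \frac{-24}{4.95} \approx -4.85, \qquad df = n - 1 = 49

95\%\ \text{CI}: \quad \bar{x} \pm t_{0.975,\,49}\,\frac{s}{\sqrt{n}}
  = 96 \pm 2.01 \times 4.95 \approx (86.1,\ 105.9)
```

Since |t| = 4.85 exceeds 2.01, H0 would be rejected at α = 0.05; equivalently, the 95% CI does not contain 120.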
[Stata output not reproduced. Annotations: 95% CI for the mean; test statistic; df; p-value for two-sided test]
b) Hypothesis test for difference between two means (independent
samples)
Assumptions:
1. The two populations are normally distributed.
2. The two populations have the same variance.
3. The two samples are independent of each other.
4. Each sample is obtained using independent random sampling from its
corresponding population
Notes about the normality assumption:
• Hypothesis tests for means are fairly robust to departures from the assumption
that the underlying population distributions are normally distributed.
• That is, as long as the underlying population distributions are approximately
symmetrical and mound shaped, the use of the t-test is valid.
• Check whether it is plausible to assume that the underlying population
distributions are approximately mound shaped and symmetrical by examining
the distribution of the data in your sample.
Notes about the equality of variance assumption:
• If one is unwilling to assume equality of population variances for the independent
sample t-test then a version based on unequal variances can be used.
• Stata performs this version of the t-test when you specify the unequal option.
• The degrees of freedom for this test are not the usual n1 + n2 - 2 and may not even
be an integer.
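To see why the degrees of freedom need not be an integer, here is a small Python sketch of the Satterthwaite approximation, which is understood to be what Stata uses by default with the unequal option (treat that as an assumption). The inputs are the summary statistics from Exercise 1b later in this deck; `satterthwaite_df` is an illustrative name.

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the unequal-variance two-sample t-test."""
    v1 = s1 ** 2 / n1   # squared standard error of the first sample mean
    v2 = s2 ** 2 / n2   # squared standard error of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Summary statistics from Exercise 1b: sd 2.5 (n = 10) and sd 2.0 (n = 9)
df_unequal = satterthwaite_df(2.5, 10, 2.0, 9)
df_pooled = 10 + 9 - 2
print(f"equal-variance df = {df_pooled}, unequal-variance df = {df_unequal:.2f}")
```

Because the two sample standard deviations are similar here, the approximate df is only slightly below the pooled value of 17.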
Checking assumptions of Normality:
histogram pulse1, by(sex) norm
graph hbox pulse1, by(sex)
pnorm pulse1 if sex==1
pnorm pulse1 if sex==2
[Figures: histograms with normal curve and horizontal box plots of pulse1 by sex; normal probability plots of pulse1 for sex==1 and sex==2]
Checking the equality of variance assumption:
[Stata output not reproduced; annotation: variances are similar]
Step 4: using Stata to calculate the test statistic and p-value
Allowing unequal variances affects:
• the standard error of the difference in means and the CI
• the test statistic t
• the p-value
• the df
c) Hypothesis test for difference between two means (dependent
samples)
Assumptions
1. The population of difference scores is normally distributed.
2. The two samples are dependent (e.g. before and after, pairs of knees)
3. Each sample is obtained using independent random sampling.
Just like for confidence intervals, we start by calculating the difference scores
(e.g. BP_after - BP_before) for each pair. Then perform a t-test on the new
variable.
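The difference-score approach can be sketched as follows. The blood pressure readings are hypothetical values invented for this illustration (they are not the lecture's data), and only the Python standard library is used, so the sketch stops at the t statistic and its degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical readings for five patients (not from the lecture)
bp_before = [150, 142, 160, 155, 148]
bp_after = [144, 140, 151, 150, 147]

# Step 1: one difference score per pair (after - before)
diffs = [after - before for before, after in zip(bp_before, bp_after)]

# Step 2: one-sample t-test of H0: mean difference = 0 on the new variable
n = len(diffs)
d_bar = mean(diffs)
se = stdev(diffs) / sqrt(n)   # stdev() uses the n - 1 denominator
t = d_bar / se
df = n - 1

print(f"mean difference = {d_bar:.2f}, SE = {se:.3f}, t = {t:.3f}, df = {df}")
```

The resulting t would then be compared with a t-distribution with n - 1 degrees of freedom, where n is the number of pairs, not the number of individual measurements.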
Two paired variables → one difference variable d
EXAMPLE
• 13 apnea patients received aminophylline treatment
• The number of apneic episodes per hour was measured 24 hours
before treatment and 16 hours after treatment
• Test the difference in mean apneic episodes per hour before and
after the treatment
• The groups are not independent because they are
paired observations on the same people
• So the appropriate test is a paired t-test
Checking normality of diff = before - after
hist diff, bin(8)
pnorm diff
[Figures: histogram of diff and normal probability plot of diff]
Step 4: calculate test statistic and p-value
OR using a Stata dataset: ttest varname == 0 (varname being the difference variable)
Step 5: conclusion
The test statistic is t=5.278 compared to a t-distribution with df=12.
The p-value is p=0.0002 which is less than the pre-specified alpha level 0.05. We
therefore reject H0 and conclude that there is a significant difference between
apneic episodes before and after treatment.
Because the mean difference of the “before – after” scores was positive, this means that
on average the number of apneic episodes was higher before the treatment.
That is, treatment with aminophylline seems to influence (reduce) the frequency of
apneic episodes.
[Stata output not reproduced. Annotations: Z-statistic; p-value for two-sided test. Don’t worry that the estimate is labelled “mean”, it’s still a test of proportion.]
e) Hypothesis test for difference between two proportions
(independent samples) EXAMPLE
From Bland (1990):
• In a field trial of Salk poliomyelitis vaccine 200,745 children received the vaccine,
of whom 33 developed paralytic polio.
• Placebo was given to 201,229 children, of whom 115 contracted paralytic polio.
• Is there a significant difference in proportion of polio cases between groups?
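One way to approach this question is a two-sample z-test for proportions. The sketch below assumes the usual pooled standard error under H0: p1 = p2 and a normal approximation; `two_proportion_z` is an illustrative helper, and the result may differ in the last digits from the lecture's own output.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z-test for H0: p1 = p2 using the pooled proportion for the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return p1, p2, z, p_value

# Counts from the Salk vaccine field trial example above
p_vaccine, p_placebo, z, p_value = two_proportion_z(33, 200745, 115, 201229)
print(f"vaccine: {p_vaccine:.6f}, placebo: {p_placebo:.6f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

The negative z reflects the lower proportion of cases in the vaccine group; with samples this large, even proportions well below 0.1% can be compared precisely.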
[Stata output not reproduced. Annotations: test statistic; p-value (two-sided test); proportions (row percentages) for the vaccine group and the placebo group]
• There is a strong link between (two-tailed) hypothesis testing and
confidence intervals
Hypothesis test for single mean (two-tailed):
– Test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ is compared to a t-distribution with df = n-1
Confidence interval for single mean:
– 95% CI is $\bar{x} \pm t \, \frac{s}{\sqrt{n}}$, where t is a value from the t-distribution with df = n-1
How the two correspond:
– If the p-value is small (<5%), the test is statistically significant (reject H0):
the interval does not include the hypothesised value (0 when testing for no effect)
– If p is not significant: the interval includes the hypothesised value (crosses 0)
Reporting considerations
• Reporting the results of hypothesis tests as “significant” or “not significant”
is not very informative
• The results of a medical study don’t simplify into a “yes” or “no” answer
• It is better to provide the actual p-value for the hypothesis test. Medical
journals require that hypothesis tests in submitted papers be reported with
p-values and not as “S” or “NS”.
• The best approach is to also provide confidence intervals as they focus
attention on the effect size and its precision.
Statistical vs. Clinical significance
• Reporting results as “significant” or “not significant” encourages the
possibly erroneous interpretation that statistical significance is the same as
medical or practical importance.
• Remember that we want to minimise the chance of making a Type I error
(false positive), i.e. finding a result when there really isn’t one
• But the standard error in the denominator of most test statistics shrinks as
the sample size n grows, which means…
• With very large sample sizes it is EASY for the test statistic to be large
enough to give a small/significant p-value, even when the effect is too small
to matter clinically
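The effect of sample size on significance can be illustrated with a short Python sketch. The numbers are hypothetical: a fixed observed difference of 0.5 units with a standard deviation of 10, analysed with a z-test at three sample sizes; `one_sample_z_p` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_p(effect, sd, n):
    """z statistic and two-sided p-value for a mean difference `effect` from zero."""
    z = effect / (sd / sqrt(n))   # the standard error sd/sqrt(n) shrinks as n grows
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Same hypothetical effect (0.5 units, sd = 10) at increasing sample sizes
for n in (100, 1_000, 10_000):
    z, p = one_sample_z_p(0.5, 10, n)
    print(f"n = {n:6d}  z = {z:.2f}  p = {p:.4g}")
```

The effect size is identical in all three rows; only n changes. Whether a 0.5-unit difference matters is a clinical judgement that the p-value cannot make.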
Power considerations
• If the null hypothesis is not rejected, it does not mean that it should be
accepted as true.
• We may have made a Type II error (failed to detect a difference when one
really exists).
• In this case one should consider the power of the test (i.e., its ability to
detect a difference that would be considered to be clinically important or
worth detecting).
• Often the power is not sufficiently large to detect a difference even if it
really exists, due to small sample size.
Thank you!
• For more exercises on t-tests and proportions see Pagano
Exercise 1a
single proportion – one-tailed test
• In a sample of 1500 residents of an inner city
neighbourhood who participated in a health screening
program, 125 tests yielded positive results for sickle-cell
anaemia.
• Do these data provide sufficient evidence to indicate
that the proportion of individuals with sickle-cell
anaemia in the population from which the sample was
drawn is greater than 0.06?
• Let α = 0.05
Ho: The proportion of individuals in the population with sickle cell
anaemia is less than or equal to 0.06 (P ≤ 0.06)
Ha: The proportion of individuals in the population with sickle cell
anaemia is greater than 0.06 (P > 0.06), one-sided test
Assumptions:
i) must assume random sample, independent observations.
ii) nP0 = 90, n(1 − P0) = 1410, both > 5, so the normal approximation
to the binomial is appropriate (P0 = 0.06).
Set α = 0.05
Where:
$$\hat{p} = \frac{125}{1500} = 0.0833, \qquad P_0 = 0.06$$
$$z = \frac{\hat{p} - P_0}{\sqrt{\dfrac{P_0(1 - P_0)}{n}}} = \frac{0.0833 - 0.06}{\sqrt{\dfrac{0.06(1 - 0.06)}{1500}}} = \frac{0.0833 - 0.06}{0.0061} = 3.82$$
(Without rounding the intermediate values, z = 3.8052, which is the value Stata reports.)
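The hand calculation can be reproduced without intermediate rounding using Python's standard library. This is a sketch of the same one-sided z-test, assuming the normal approximation stated above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, positives, p0 = 1500, 125, 0.06

p_hat = positives / n
se_null = sqrt(p0 * (1 - p0) / n)        # standard error under H0: P = 0.06
z = (p_hat - p0) / se_null
p_one_sided = 1 - NormalDist().cdf(z)    # Ha: P > 0.06, so use the upper tail

print(f"p_hat = {p_hat:.4f}, SE = {se_null:.5f}, z = {z:.4f}")
print(f"one-sided p = {p_one_sided:.6f}")
```

The small gap between 3.82 and the unrounded z comes from rounding the standard error to 0.0061 in the hand calculation; the conclusion is the same either way.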
Exercise 1a
single proportion – one sided test
display 1-normprob(3.82)
.00006673 (one-sided test)
P-value = 0.0001 (4 decimal places)
display 2*(1-normprob(3.82))
.00013345 (two-sided test)
prtesti 1500 125 0.06, count
One-sample test of proportion                      x: Number of obs =     1500
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0833333   .0071362                      .0693466    .0973201
------------------------------------------------------------------------------
    p = proportion(x)                                             z =   3.8052
Ho: p = 0.06
    Ha: p < 0.06                 Ha: p != 0.06                  Ha: p > 0.06
 Pr(Z < z) = 0.9999         Pr(|Z| > |z|) = 0.0001          Pr(Z > z) = 0.0001
Set the level of significance (α) = 1% and thus also consider 99% confidence intervals
prtesti 1500 125 0.06, count level(99)
-------------+----------------------------------------------------------------
           x |   .0833333   .0071362                      .0649516    .1017151
------------------------------------------------------------------------------
Ho: p = 0.06
    Ha: p < 0.06                 Ha: p != 0.06                  Ha: p > 0.06
 Pr(Z < z) = 0.9999         Pr(|Z| > |z|) = 0.0001          Pr(Z > z) = 0.0001
Exercise 1b
Comparing two means (independent)
Test the hypothesis that the vaccines differ in effectiveness with α = 0.05
(assume that antibody responses are normally distributed).
From the question:
Ho: The antibody responses of the two vaccines are the same (μ1 = μ2)
Ha: The antibody responses of the two vaccines are different (μ1 ≠ μ2)
(two-tailed test)
Assumptions:
We are told in the question that the populations are normally distributed.
The sample standard deviations are similar.
We would need to check with the investigator that the two groups were chosen at random
and that the observations in the two groups are independent.
Set α = 0.05
$$\bar{x}_1 - \bar{x}_2 = 4.5 - 2.5 = 2.0$$
$$s_p^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2} = \frac{2.5^2 \times 9 + 2.0^2 \times 8}{10 + 9 - 2} = 5.1912$$
$$t = \frac{2.0}{\sqrt{5.1912 \left( \frac{1}{10} + \frac{1}{9} \right)}} = 1.91$$
To obtain the p-value we would type the following Stata command:
display tprob(17, 1.91)
0.07315349
Alternatively you could get Stata to do all the calculations by typing:
ttesti 10 4.5 2.5 9 2.5 2.0
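For readers without Stata, the same pooled-variance calculation can be sketched in plain Python. The two-sided p-value is obtained here by numerically integrating the t density with Simpson's rule, which is a substitute technique chosen so that only the standard library is needed (it is not how Stata computes it); `t_pdf` and `t_two_sided_p` are illustrative helper names.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """P(|T| > |t|), integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                   # P(0 < T < |t|)
    return 2 * (0.5 - area)

# Summary statistics from the exercise
n1, mean1, sd1 = 10, 4.5, 2.5
n2, mean2, sd2 = 9, 2.5, 2.0

df = n1 + n2 - 2
pooled_var = (sd1 ** 2 * (n1 - 1) + sd2 ** 2 * (n2 - 1)) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean1 - mean2) / se_diff
p = t_two_sided_p(t, df)

print(f"pooled variance = {pooled_var:.4f}, SE = {se_diff:.5f}")
print(f"t = {t:.4f}, df = {df}, two-sided p = {p:.4f}")
```

The sketch uses the unrounded t statistic, so its p-value may differ slightly in the later decimals from the hand calculation based on t = 1.91.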
Exercise 1b
Comparing two means (independent)
ttesti 10 4.5 2.5 9 2.5 2.0
Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      10         4.5    .7905694         2.5    2.711608    6.288392
       y |       9         2.5    .6666667           2    .9626639    4.037336
---------+--------------------------------------------------------------------
combined |      19    3.552632    .5598594    2.440371    2.376411    4.728853
---------+--------------------------------------------------------------------
    diff |                   2     1.04686               -.2086807    4.208681
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9105
Ho: diff = 0                                     degrees of freedom =       17
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9635         Pr(|T| > |t|) = 0.0731          Pr(T > t) = 0.0365
Conclusion:
There is no statistically significant difference in the mean antibody response
from the two vaccines since p-value > 0.05 (p = 0.073).
What else can we say from these results?
Exercise 1c - Paired t-test
Ho: The mean difference in dexterity scores is less than or equal
to zero (μD ≤ 0)
Ha: The mean difference in dexterity scores is greater than zero
(μD > 0) (one-tailed test)
• Assumptions:
• that the population of differences is normally distributed
(histogram, box plot and normal probability plot (not shown) do not indicate gross
skewness)
• that the two samples are dependent (we are told that we have matched pairs)
• that the pairs were obtained by independent random sampling (need to check with
investigator)
• that individuals within pairs were randomly allocated to new or standard treatment (we
are told that this is so)
• Set α = 0.05
• You could calculate the mean and standard deviation of the within pair differences by
hand.
• Alternatively you could enter the data into Stata and get Stata to do this for you:
gen diff = new - standard
sum diff
    Variable |  Obs      Mean   Std. Dev.   Min   Max
-------------+---------------------------------------
        diff |   24     5.375    5.64772     -5    17
One tail test result
ttesti 24 5.375 5.64772 0
One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      24       5.375    1.152836     5.64772    2.990177    7.759823
------------------------------------------------------------------------------
Degrees of freedom: 23
Ho: mean(x) = 0
    Ha: mean < 0             Ha: mean != 0              Ha: mean > 0
    t = 4.6624               t = 4.6624                 t = 4.6624
    P < t = 0.9999           P > |t| = 0.0001           P > t = 0.0001
Note: Stata labels this a one-sample t test. A paired t-test of
Ho: mean difference = 0 is a one-sample t-test on the within-pair
differences, which is why the output looks the same.
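As a check on the output, the test statistic can be reproduced by hand from the summary of the differences (n = 24, mean 5.375, standard deviation 5.64772):

```latex
SE(\bar{d}) = \frac{s_d}{\sqrt{n}} = \frac{5.64772}{\sqrt{24}} = 1.1528,
\qquad
t = \frac{\bar{d} - 0}{SE(\bar{d})} = \frac{5.375}{1.1528} = 4.66,
\qquad
df = n - 1 = 23
```

Because Ha is one-sided (μD > 0), the relevant p-value is the upper-tail probability, reported in the output under Ha: mean > 0.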
Conclusion:
• Since p < 0.05, we reject Ho and conclude that the
dexterity scores are statistically significantly higher
on the new therapy as compared to the standard
treatment.
• Therefore the new procedure is more effective than
the standard procedure.
THANK YOU FOR YOUR ATTENTION!!
