Hypothesis Testing


By
Hirbo Shore (MPH, Assistant Professor)
School of Public Health, CHMS-HU
Contact: [email protected]
Hypothesis testing
Objectives:
•Calculate p-values and test statistics
•Understand the role of significance and the difference between Type I and Type II
errors
•Explain the connection between hypothesis testing and confidence intervals
•Perform parametric hypothesis tests relating to:
a) A sample mean
b) The difference between two sample means (independent samples)
c) The mean difference between two dependent samples.
d) A sample proportion
e) The difference between two sample proportions (independent samples)

Hypothesis Testing
• A research question is typically posed as a hypothesis
• Sometimes based on prior research, preliminary observations, or an
“educated guess”
– e.g. in a randomised controlled trial the hypothesis might be “the
proportion of patients with disease X who survive after receiving
a drug treatment is greater than the proportion of patients with
disease X who survive after receiving a placebo”

• The hypothesis often states a difference between two groups

• Hypothesis: a statement about one or more populations.

• The majority of statistical analyses involve comparison, most obviously between treatments or procedures or between groups of subjects.

• The numerical value corresponding to the comparison of interest is often called the effect.

• The purpose of hypothesis testing is to aid the researcher in reaching a decision concerning a population by examining a sample from that population.

Steps involved in testing a hypothesis
1. State the research question in terms of statistical hypotheses.

• The null hypothesis, H0, is a statement claiming that there is no difference between the hypothesized value and the population value.

– (The effect of interest is zero)

• The alternative hypothesis, H1, is a statement that disagrees with the null hypothesis.

– (The effect of interest is not zero)

• Example

        two-tailed      one-tailed      one-tailed
H0:     μ = μ0          μ ≤ μ0          μ ≥ μ0
H1:     μ ≠ μ0          μ > μ0          μ < μ0

Null hypothesis: things are what they are said to be (the status quo).
H0: μ = μ0, e.g. H0: μ = 1.6 m (mean height)

Research hypothesis: the thing we are primarily interested in “proving”.
HA: μ ≠ μ0, e.g. HA: μ ≠ 1.6 m
HA: μ > μ0, e.g. HA: μ > 1.6 m
HA: μ < μ0, e.g. HA: μ < 1.6 m

2. Select a sample and collect data
3. Decide on the appropriate test statistic for the hypothesis (Z, t, χ2, F, etc.)
– The test statistic is a function of the data that uses estimates of the parameters we are interested in and whose sampling distribution is known when we assume the null hypothesis is true.

– It is a value computed from the sample data that is used in deciding whether to reject the null hypothesis.

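4. Select the level of significance for the statistical test (α = 0.05, 0.01, 0.001, etc.)
– Determine the critical value: the value the test statistic must attain to be declared significant.

[Figure: distributions of the test statistic when H0 is true (centred at μ0) and when HA is true (centred at μ1), with three locations marked (1), (2) and (3).]
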
• What would you conclude if the calculated value of the test statistic fell in location (1)?

• How about location (2)?

• Location (3)?

• Which is most likely, (1), (2) or (3), when H0 is true?

6. Perform the calculation

$$\text{Test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error}}$$

7. Decision: reject or do not reject H0

– If the numerical value of the test statistic falls in the rejection region, we reject the null hypothesis
– If the test statistic does not fall in the rejection region, we do not reject H0.

8. Draw and state the conclusion.

– If H0 is rejected, state the conclusion in terms of HA; otherwise state that the data do not provide sufficient evidence against H0

• Another way to make the decision

– Reject the null hypothesis if P < α

– Do not reject the null hypothesis if P ≥ α

• P is the probability of getting a sample statistic at least as extreme as the calculated statistic if the null hypothesis is true.

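The p-value decision rule can be sketched in a few lines of Python (not part of the original Stata material). This is only an illustration for a test statistic that follows the standard normal distribution under H0; the function names `z_p_value` and `decide` and the observed value z = 2.10 are made up for the example.

```python
import math

def z_p_value(z, tail="two-sided"):
    """p-value for a test statistic z that is standard normal under H0."""
    upper = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z > z)
    if tail == "greater":
        return upper
    if tail == "less":
        return 1.0 - upper                       # P(Z < z)
    return math.erfc(abs(z) / math.sqrt(2))      # P(|Z| > |z|)

def decide(p, alpha=0.05):
    """Apply the rule: reject H0 if P < alpha, otherwise do not reject."""
    return "reject H0" if p < alpha else "do not reject H0"

z = 2.10                                         # hypothetical observed statistic
p = z_p_value(z)
print(round(p, 4), decide(p))
```

Note that the rule never returns "accept H0": a p-value at or above α only means the data are compatible with the null hypothesis.
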
Level of Significance

Example
• Suppose we are interested in finding evidence that there is a difference in average height between adult Ethiopian men and women

• We collected some sample data (randomly and with attention to power, sample size, and potential biases)

• What would the hypothesis testing procedure look like?

Types of Errors
Why does this happen?
• Due to natural random variation among subjects in a
population, sometimes the sample data will lead us to
an incorrect conclusion

• This can happen even if the researcher has managed to avoid every conceivable source of bias in the design and implementation of the study; it is an inevitable consequence of random variation.

• There are four possible situations:

Decision based on sample            In the population
                                    H0 is true         H0 is false (HA is true)
Do not reject H0                    Correct            Type II error
Reject H0 (have evidence for HA)    Type I error       Correct

[Figure: illustration of Type I and Type II errors. Img src: https://effectsizefaq.com/2010/05/31/i-always-get-confused-about-type-i-and-ii-errors-can-you-show-me-something-to-help-me-remember-the-difference/]

Type I Error
• A Type I error occurs if, based on the sample data, it is decided to reject the null hypothesis when in fact (i.e., in the population) the null hypothesis is true

• The level of significance (α) is the probability of making a Type I error

• It is the probability of incorrectly rejecting the null hypothesis

• The probability of a Type I error is also called the “false positive rate”

Type II Error
• A Type II error occurs if, based on the sample data, we do not reject the null hypothesis when in fact (i.e., in the population) the null hypothesis is false

• Its probability (β) is the probability of incorrectly failing to reject the null hypothesis

• The probability of a Type II error is also called the “false negative rate”
• If the probability of a Type II error is β, then the power of the test is 1 − β
• Power is the probability of rejecting the null hypothesis when it is false

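The relationship between α, β and power can be made concrete with a short Python sketch. It assumes the simplest setting, a two-sided one-sample z-test with known standard deviation; the function name `z_test_power` and the numbers (a true difference of 5 units, sd 10, n = 25) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Probability of rejecting H0 in a two-sided one-sample z-test
    when the true mean lies `delta` away from the hypothesized value."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = 0.05
    shift = delta / (sigma / n ** 0.5)           # true difference in standard-error units
    # Chance that the statistic lands in the upper or the lower rejection region
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

alpha = 0.05
type_i_rate = z_test_power(0.0, 10.0, 25, alpha)   # H0 true: rejection rate equals alpha
power = z_test_power(5.0, 10.0, 25, alpha)         # H0 false: probability of detecting the difference
beta = 1 - power                                   # Type II error probability
print(round(type_i_rate, 3), round(power, 3), round(beta, 3))
```

With delta = 0 the function returns α itself, which is the Type I error rate; increasing n in the second call raises the power and lowers β.
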
CI’s and Hypothesis testing
There is a strong link between (two-tailed) hypothesis testing and confidence intervals:
• Suppose 𝐻0 is that a treatment effect is zero and the significance level is set at 𝛼 = 0.05
• Then 𝐻0 can be rejected if the 95% confidence interval for the treatment effect does not include zero.
• This means that the observed estimate lies further from zero than the critical number of standard errors, so the test statistic falls in the rejection region
• i.e. the probability of observing a test statistic at least this extreme if 𝐻0 is true is less than 𝛼 = 0.05

Hypothesis tests
A. A sample mean

B. The difference between two sample means (independent samples)

C. The difference between two sample means (dependent/paired samples)

D. A sample proportion

E. The difference between two sample proportions (independent samples)

Hypothesis tests
a) Hypothesis test for a single mean EXAMPLE
• Researchers collected serum amylase values from a random
sample of 50 apparently healthy subjects.
• They want to know whether they can conclude that the mean
of the population from which the sample was drawn is
different from 120 units/100 ml, at a significance level of 0.05.
• The mean and standard deviation computed from the sample
are 96 and 35 units/100 ml respectively.
• We don’t have the data but will assume a normal distribution
for sample values
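
The calculation for this example can be sketched in Python using only the summary statistics given above (n = 50, mean 96, SD 35, hypothesized mean 120). The helper `t_upper_tail` is not a library function: it is a small numerical integration of the Student t density, written here so the sketch needs nothing beyond the standard library.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, by Simpson's rule on the Student t density."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    pdf = lambda x: math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    h = t / steps                                   # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3                      # upper half minus the area from 0 to t

n, mean, sd, mu0 = 50, 96.0, 35.0, 120.0            # values from the example
se = sd / math.sqrt(n)                              # standard error of the mean
t_stat = (mean - mu0) / se
df = n - 1
p_two_sided = 2 * t_upper_tail(abs(t_stat), df)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_two_sided:.6f}")
```

The test statistic is about −4.85 on 49 degrees of freedom, so the two-sided p-value is far below 0.05 and H0: μ = 120 is rejected.
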

[Stata output not reproduced; the slide annotations mark the 95% CI for the mean, the test statistic, the df and the p-value for the two-sided test.]

Hypothesis tests
b) Hypothesis test for difference between two means (independent samples)

Assumptions:
1. The two populations are normally distributed.

2. The two populations have the same variance.

3. The two samples are independent of each other.

4. Each sample is obtained using independent random sampling from its corresponding population

Notes about the normality assumption:

• Hypothesis tests for means are fairly robust to departures from the assumption that the underlying population distributions are normal.

• That is, as long as the underlying population distributions are approximately symmetrical and mound shaped, the use of the t-test is valid.

• Check whether it is plausible to assume that the underlying population distributions are approximately mound shaped and symmetrical by examining the distribution of the data in your sample.

Notes about the equality of variance assumption:

• If one is unwilling to assume equality of population variances for the independent samples t-test, then a version based on unequal variances can be used.

• Stata performs this version of the t-test when you specify the unequal option.

• The degrees of freedom for this test are not the usual n1 + n2 - 2 and may not even be an integer.

Checking assumptions of Normality:

[Figure: for pulse1 by sex (groups 1 and 2), histograms with a normal density overlay (histogram pulse1, by(sex) norm), horizontal box plots (graph hbox pulse1, by(sex)) and normal probability plots (pnorm pulse1 if sex==1, pnorm pulse1 if sex==2).]

Hypothesis tests
b) Hypothesis test for difference between two means (independent samples) EXAMPLE
Checking the equality of variance assumption:

[Stata output not reproduced; the slide annotation notes that the variances are similar.]

Step 4: using Stata to calculate the test statistic and p-value

[Stata output not reproduced.]

Allowing unequal variances affects:

• the standard error of the difference in means and the CI
• the test statistic t
• the p-value
• the df

Hypothesis tests
c) Hypothesis test for difference between two means (dependent
samples)

Assumptions
1. The population of difference scores is normally distributed.

2. The two samples are dependent (e.g. before and after, pairs of knees)

3. Each sample is obtained using independent random sampling.

Just like for confidence intervals, we start by calculating the difference scores
(e.g. BP_after - BP_before) for each pair. Then perform a t-test on the new
variable.
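
As a rough sketch of that two-step procedure in Python (the blood pressure readings below are invented for illustration and are not data from this course):

```python
import math
from statistics import mean, stdev

# Hypothetical blood pressure readings (mmHg) for the same six people
bp_before = [142, 150, 138, 160, 155, 147]
bp_after = [135, 148, 130, 151, 150, 146]

# Step 1: one difference score per pair (BP_after - BP_before)
diffs = [after - before for after, before in zip(bp_after, bp_before)]

# Step 2: one-sample t-test of H0: mean difference = 0 on the new variable
n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)                       # sample SD (n - 1 in the denominator)
t_stat = d_bar / (s_d / math.sqrt(n))
df = n - 1
print(diffs, round(d_bar, 3), round(s_d, 3), round(t_stat, 3), df)
```

For these made-up data the statistic is t = −4.0 on 5 degrees of freedom, which would be compared with the t-distribution in the usual way.
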

[Figure: the two paired variables are reduced to one difference variable d.]

Hypothesis tests
c) Hypothesis test for difference between two means (dependent
samples) EXAMPLE
• 13 apnea patients received aminophylline treatment
• The number of apneic episodes per hour was measured 24 hours
before treatment and 16 hours after treatment
• Test the difference in mean apneic episodes per hour before and
after the treatment

• The groups are not independent because they are paired observations on the same people
• So the appropriate test is a paired t-test

Checking normality of diff = before − after

[Figure: histogram of diff (hist diff, bin(8)) and normal probability plot of diff (pnorm diff).]

Step 4: calculate test statistic and p-value
OR using the Stata dataset: ttest varname==0 (where varname is the variable holding the differences)

Step 5: conclusion
The test statistic is t=5.278 compared to a t-distribution with df=12.
The p-value is p=0.0002 which is less than the pre-specified alpha level 0.05. We
therefore reject H0 and conclude that there is a significant difference between
apneic episodes before and after treatment.
Because the mean difference of the “before – after” scores was positive, this means that
on average the number of apneic episodes was higher before the treatment.
That is, treatment with Aminophylline seems to influence (reduce) the frequency of
apneic episodes.

[Stata output not reproduced; the slide annotations mark the Z-statistic and the p-value for the two-sided test.]
Don’t worry that the estimate is labelled “mean” in the output; it’s still a test of proportion.

Hypothesis tests
e) Hypothesis test for difference between two proportions
(independent samples) EXAMPLE

From Bland (1990):

• In a field trial of Salk poliomyelitis vaccine 200,745 children received the vaccine,
of whom 33 developed paralytic polio.

• Placebo was given to 201,229 children, of whom 115 contracted paralytic polio.

• Is there a significant difference in proportion of polio cases between groups?
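
One way to work this example is a two-sample z-test with a pooled proportion, sketched below in Python with the counts quoted above. The function name `two_proportion_z` is made up for this sketch, and the normal approximation is assumed to be adequate given the very large samples.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic and two-sided p-value for H0: p1 = p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))           # P(|Z| > |z|)
    return p1, p2, z, p_value

# Vaccine group: 33 cases in 200,745 children; placebo group: 115 cases in 201,229
p_vaccine, p_placebo, z, p_value = two_proportion_z(33, 200745, 115, 201229)
print(f"vaccine: {p_vaccine:.6f}, placebo: {p_placebo:.6f}, z = {z:.2f}, p = {p_value:.2e}")
```

The proportions are roughly 16 and 57 cases per 100,000, the z statistic is about −6.7, and the two-sided p-value is far below 0.05.
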

[Stata output not reproduced. The first output is annotated with the test statistic and the p-value (two-sided test); the second shows the proportions in the vaccine group and the placebo group (row percentages) with the p-value (two-sided test).]

CI’s and Hypothesis testing
• There is a strong link between (two-tailed) hypothesis testing and confidence intervals

Hypothesis test for single mean (two-tailed):
Test statistic $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ is compared to the t-distribution with df = n − 1

Confidence interval for single mean:
95% CI is $\bar{x} \pm t \, \frac{s}{\sqrt{n}}$, where t is a value from the t-distribution with df = n − 1

• If the p-value is small (< 5%) the test is statistically significant (reject 𝐻0): the 95% interval does not include the hypothesized value (0 when the hypothesized effect is zero)
• If p is not significant: the interval includes (crosses) the hypothesized value

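The equivalence can be checked numerically with the serum amylase example from earlier in this lecture (n = 50, mean 96, SD 35, hypothesized mean 120). In this Python sketch the critical value 2.0096 is an assumption taken from t tables for df = 49 rather than computed.

```python
import math

n, mean, sd, mu0 = 50, 96.0, 35.0, 120.0   # serum amylase example
t_crit = 2.0096                            # assumed table value: 97.5th percentile of t, df = 49

se = sd / math.sqrt(n)
t_stat = (mean - mu0) / se
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

reject_by_test = abs(t_stat) > t_crit                # two-tailed test at alpha = 0.05
reject_by_ci = not (ci_low <= mu0 <= ci_high)        # 95% CI excludes the hypothesized mean
print(round(ci_low, 2), round(ci_high, 2), reject_by_test, reject_by_ci)
```

Both routes give the same decision because the interval is built from the same standard error and the same critical value as the test.
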
CI’s and Hypothesis testing
Reporting considerations
• Reporting the results of hypothesis tests as “significant” or “not significant”
is not very informative
• The results of a medical study don’t simplify into a “yes” or “no” answer
• It is better to provide the actual p-value for the hypothesis test. Medical
journals require that hypothesis tests in submitted papers be reported with
p-values and not as “S” or “NS”.

• The best approach is to also provide confidence intervals as they focus attention on the effect size and its precision.

CI’s and Hypothesis testing
Statistical vs. Clinical significance
• Reporting results as “significant” or “not significant” encourages the
possibly erroneous interpretation that statistical significance is the same as
medical or practical importance.

• Remember that we want to minimise the chance of making a Type I error (false positive), i.e. finding a result when there really isn’t one
• But the standard error in the denominator of the test statistic shrinks as the sample size n grows, which means…
• In very large samples it is EASY for the test statistic to be large enough to have a small/significant p-value, even when the effect is too small to matter clinically

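A small Python sketch makes the sample size effect visible. The numbers are hypothetical: the same observed difference of 0.5 units with a standard deviation of 10 is tested with a z-test at three sample sizes.

```python
import math

def z_and_p(diff, sd, n):
    """z statistic and two-sided p-value for an observed mean difference `diff`."""
    z = diff / (sd / math.sqrt(n))               # standard error shrinks as n grows
    return z, math.erfc(abs(z) / math.sqrt(2))

for n in (100, 2_500, 10_000):
    z, p = z_and_p(0.5, 10.0, n)
    print(n, round(z, 2), f"{p:.2e}")
```

The effect size never changes, yet the result moves from clearly non-significant at n = 100 to highly significant at n = 10,000, which is why the size of the effect and its confidence interval should be reported alongside the p-value.
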
CI’s and Hypothesis testing
Power considerations
• If the null hypothesis is not rejected, it does not mean that it should be
accepted as true.
• We may have made a Type II error (failed to detect a difference when one
really exists).
• In this case one should consider the power of the test (i.e., its ability to
detect a difference that would be considered to be clinically important or
worth detecting).

• Often the power is not sufficiently large to detect a difference even if it really exists, due to small sample size.

Thank you!

• For more exercises on t-tests and proportions see Pagano

Exercise 1a
single proportion – one-tailed test
• In a sample of 1500 residents of an inner city neighbourhood who participated in a health screening program, 125 tests yielded positive results for sickle-cell anaemia.

• Do these data provide sufficient evidence to indicate that the proportion of individuals with sickle-cell anaemia in the population from which the sample was drawn is greater than 0.06?

• Let α = 0.05
Ho: The proportion of individuals in the population with sickle cell anaemia is less than or equal to 0.06.
(P ≤ 0.06)
Ha: The proportion of individuals in the population with sickle cell anaemia is greater than 0.06.
(P > 0.06) one-sided test

Assumptions:
i) must assume random sample, independent observations.
ii) nP0 = 90, n(1 − P0) = 1410, both > 5 so normal approximation to the binomial is appropriate (P0 = 0.06).
Set α = 0.05

Where:

$$\hat{p} = \frac{125}{1500} = 0.0833, \qquad P_0 = 0.06$$

$$z = \frac{\hat{p} - P_0}{\sqrt{\dfrac{P_0 (1 - P_0)}{n}}} = \frac{0.0833 - 0.06}{\sqrt{\dfrac{0.06 (1 - 0.06)}{1500}}} = \frac{0.0833 - 0.06}{0.0061} = 3.82$$

(Without rounding the intermediate values, z = 3.8052, which is the value Stata reports.)

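The hand calculation can be cross-checked with a few lines of Python (a sketch using the figures from the exercise; no rounding of intermediate values):

```python
import math

n, positives, p0 = 1500, 125, 0.06
p_hat = positives / n

# Check the normal approximation conditions: both counts should exceed 5
expected_pos, expected_neg = n * p0, n * (1 - p0)

se0 = math.sqrt(p0 * (1 - p0) / n)                 # standard error under H0
z = (p_hat - p0) / se0
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z > z) for Ha: P > 0.06
print(round(expected_pos), round(expected_neg), round(p_hat, 4), round(z, 4), f"{p_one_sided:.6f}")
```

The unrounded statistic is z = 3.8052 rather than 3.82; either way the one-sided p-value is below 0.0001 and H0 is rejected at α = 0.05.
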
display 1-normprob(3.82)
0.00006673 (one-sided test)

P-value = 0.0001 (4 decimal places)

display 2*(1-normprob(3.82))
.00013345 (two-sided test)

prtesti 1500 125 0.06, count

One-sample test of proportion                 x: Number of obs = 1500
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0833333   .0071362                      .0693466    .0973201
------------------------------------------------------------------------------
    p = proportion(x)                                             z =   3.8052
Ho: p = 0.06

     Ha: p < 0.06                 Ha: p != 0.06                  Ha: p > 0.06
 Pr(Z < z) = 0.9999         Pr(|Z| > |z|) = 0.0001          Pr(Z > z) = 0.0001

Set the level of significance α = 1% and thus also consider 99% confidence intervals

prtesti 1500 125 0.06, count level(99)

One-sample test of proportion                 x: Number of obs = 1500
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [99% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0833333   .0071362                      .0649516    .1017151
------------------------------------------------------------------------------
    p = proportion(x)                                             z =   3.8052
Ho: p = 0.06

     Ha: p < 0.06                 Ha: p != 0.06                  Ha: p > 0.06
 Pr(Z < z) = 0.9999         Pr(|Z| > |z|) = 0.0001          Pr(Z > z) = 0.0001

Exercise 1b
Comparing two means (independent)
Test the hypothesis that the vaccines differ in effectiveness with α = 0.05 (assume that antibody responses are normally distributed).

From the question:

Ho: The mean antibody responses of the two vaccines are the same (μ1 = μ2)
Ha: The mean antibody responses of the two vaccines are different (two-tailed test) (μ1 ≠ μ2)
Assumptions:
We are told in the question that the populations are normally distributed.
The sample standard deviations are similar.
We would need to check with the investigator that the two groups were chosen at random and that the observations in the two groups are independent.

Set α = 0.05

$$\bar{x}_1 - \bar{x}_2 = 4.5 - 2.5 = 2.0$$

$$s_p^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2} = \frac{2.5^2 \times 9 + 2.0^2 \times 8}{10 + 9 - 2} = 5.1912$$

$$t = \frac{2.0}{\sqrt{5.1912 \left( \frac{1}{10} + \frac{1}{9} \right)}} = 1.91$$

To obtain the p-value we would type the following Stata command:

display tprob(17, 1.91)

0.07315349

Alternatively you could get Stata to do all the calculations by typing:

ttesti 10 4.5 2.5 9 2.5 2.0

Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      10         4.5    .7905694         2.5    2.711608    6.288392
       y |       9         2.5    .6666667           2    .9626639    4.037336
---------+--------------------------------------------------------------------
combined |      19    3.552632    .5598594    2.440371    2.376411    4.728853
---------+--------------------------------------------------------------------
    diff |                   2     1.04686               -.2086807    4.208681
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =   1.9105
Ho: diff = 0                                     degrees of freedom =       17

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9635         Pr(|T| > |t|) = 0.0731          Pr(T > t) = 0.0365
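
The pooled-variance calculation can also be reproduced outside Stata. The Python sketch below uses the summary statistics from the exercise; `t_two_sided_p` is a hand-written helper (numerical integration of the Student t density), not a library function.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """P(|T| > |t|) by Simpson's rule on the Student t density."""
    t = abs(t)
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    pdf = lambda x: math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    h = t / steps                                   # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 2 * (0.5 - total * h / 3)

n1, m1, s1 = 10, 4.5, 2.5                           # first vaccine: n, mean, SD
n2, m2, s2 = 9, 2.5, 2.0                            # second vaccine: n, mean, SD
df = n1 + n2 - 2
sp2 = (s1 ** 2 * (n1 - 1) + s2 ** 2 * (n2 - 1)) / df    # pooled variance
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t_stat = (m1 - m2) / se
p_value = t_two_sided_p(t_stat, df)
print(round(sp2, 4), round(se, 5), round(t_stat, 4), round(p_value, 4))
```

The pooled variance, standard error and t statistic agree with the Stata table, and the two-sided p-value is about 0.073.
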

Conclusion:
There is no statistically significant difference in the mean antibody response from the two vaccines since the p-value > 0.05 (p = 0.073).

What else can we say from these results?

Exercise 1c- Paired t-test
Ho: The mean difference in dexterity scores is less than or equal to zero (μD ≤ 0)

Ha: The mean difference in dexterity scores is greater than zero (one-tailed test) (μD > 0)

• Assumptions:

• that the population of differences is normally distributed (histogram, box plot and normal probability plot (not shown) do not indicate gross skewness)

• that the two samples are dependent (we are told that we have matched pairs)

• that the pairs were obtained by independent random sampling (need to check with investigator)

• that individuals within pairs were randomly allocated to new or standard treatment (we are told that this is so)

• Set α = 0.05
• You could calculate the mean and standard deviation of the within-pair differences by hand.

• Alternatively you could enter the data into Stata and get Stata to do this for you:
gen diff = new - standard
sum diff
Variable |  Obs     Mean   Std.Dev.   Min   Max
---------+--------------------------------------
    diff |   24    5.375   5.64772     -5    17

One-tailed test result

ttesti 24 5.375 5.64772 0

One-sample t test
--------------------------------------------------------------------
        |  Obs     Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
--------+-----------------------------------------------------------
      x |   24    5.375   1.152836    5.64772     2.990177   7.759823
--------------------------------------------------------------------
Degrees of freedom: 23
Ho: mean(x) = 0
 Ha: mean < 0          Ha: mean != 0         Ha: mean > 0
   t = 4.6624            t = 4.6624            t = 4.6624
 P < t = 0.9999      P > |t| = 0.0001      P > t = 0.0001

Note: Stata labels this a one-sample t test. A paired t-test with Ho: diff = 0 is a one-sample t-test on the within-pair differences, which is why the output matches that of the one-sample test.
Conclusion:
• Since p < 0.05, we reject Ho and conclude that the dexterity scores are statistically significantly higher on the new therapy as compared to the standard treatment.

• Therefore the new procedure is more effective than the standard procedure.

!!
THANK YOU FOR YOUR ATTENTION!!
