Applied Statistics II-2 and III
Jingwen Gu ([email protected])
Clinical Statistician
Bioinformatics and Computational Biosciences Branch (BCBB) /OCICB /NIAID
2. Strength of association
• Odds ratio
• Relative Risk
3. Test of independence
• Cases with nominal large sample size and small sample size
• Cases with stratified and paired data
• Cases with ordinal data
Introduction
A categorical variable is one whose measurement scale consists of a set of categories.
Categorical data type
Contingency table
Cases: A data frame where each row represents one case (e.g. patient-level data).
Count: one row per combination of categories, with its frequency.
Smoking   Lung Cancer   Count
Yes       Case          688
Yes       Control       650
No        Case          21
No        Control       59
Contingency table: the same counts cross-classified.
          Lung Cancer
Smoking   Case   Control
Yes       688    650
No        21     59
A table whose cells represent the IJ possible outcomes of two categorical variables (X with I categories, Y with J categories) is called a contingency table when the cells contain frequency counts of outcomes for a sample.
Sensitivity and Specificity
Sensitivity of a test is its ability to correctly identify those who have the disease: the proportion of patients with disease in whom the test is positive.
Specificity of a test is its ability to correctly identify those who do not have the disease: the proportion of patients without disease in whom the test is negative.
Type I error is the rejection of a true null hypothesis, also known as a false positive.
Type II error is the failure to reject a false null hypothesis, also known as a false negative.
Example – Screening test
Table: Screening Test Result (rows) by disease status (columns: Diseased, Not Diseased, Total).
PPV and NPV
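Positive predictive value (PPV) of a test is the probability that an individual with a positive test has the disease.
Negative predictive value (NPV) of a test is the probability that an individual with a negative test does not have the disease.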
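The four screening measures can be computed directly from the cells of a 2x2 table of test result by disease status. The sketch below is illustrative only: the function name `diagnostic_metrics` and the counts are hypothetical and do not come from the slides.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from 2x2 screening counts.

    tp/fn: diseased subjects with a positive/negative test
    fp/tn: non-diseased subjects with a positive/negative test
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(test + | diseased)
        "specificity": tn / (tn + fp),  # P(test - | not diseased)
        "ppv": tp / (tp + fp),          # P(diseased | test +)
        "npv": tn / (tn + fn),          # P(not diseased | test -)
    }


# Hypothetical counts, chosen only for illustration.
metrics = diagnostic_metrics(tp=90, fp=40, fn=10, tn=860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the true disease status, while PPV and NPV condition on the test result, so the latter two depend on how common the disease is in the sample.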
Partial table
Probability distribution
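Joint distribution
– Let $\pi_{ij}$ denote the probability that (X, Y) occurs in the cell in row i and column j. $\{\pi_{ij}\}$ is the joint distribution of X and Y.
Marginal distribution
– The marginal distributions are the row and column totals of the joint probabilities: $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$.
– Each marginal distribution sums to 1: $\sum_i \pi_{i+} = \sum_j \pi_{+j} = 1$.
Conditional distribution
– Given that a subject is classified in row i of X, we use $\pi_{j|i}$ to denote the probability of classification in column j of Y, j = 1, ..., J. Then $\pi_{j|i} = \pi_{ij} / \pi_{i+}$ and $\sum_j \pi_{j|i} = 1$.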
Example - Probability distribution
A new drug is being tested on a group of 800 people (400 men and 400
women) with a particular disease. We wish to establish whether there is a
link between taking the drug and recovery from the disease.
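Pooled over gender, 200 of the 400 people who took the drug recovered (50%), against 160 of the 400 who did not (40%), so we might conclude that the drug has a positive effect. But if the results are broken down by gender…
Cont
Gender           Male          Female
Drug taken       Yes    No     Yes    No
Recovered: Yes   180    70     20     90
Recovered: No    120    30     80     210
Recovery rate    60%    70%    20%    30%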
For both males and females, the recovery rate is better without the drug. Gender influences whether the drug was taken, because men in this study were much more likely than women to have been given the drug.
The result that a marginal association can have a different direction from each conditional association is called Simpson's paradox.
Can it be avoided? Yes, if we are certain that we know every variable that can affect the outcome variable. If we are not certain (and in general we simply cannot be), then Simpson's paradox is theoretically unavoidable.
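The reversal can be checked numerically by comparing recovery rates within each gender (conditional) with the rates after pooling the genders (marginal). This is a minimal sketch; it assumes the female no-drug group is read as 90 recovered and 210 not recovered, the reading that agrees with the stated 30% recovery rate and with 400 women in total.

```python
# Counts as (recovered, not recovered) for each gender and treatment group.
# Assumption: female / no drug is 90 recovered, 210 not recovered (30% rate).
counts = {
    ("male", "drug"): (180, 120),
    ("male", "no drug"): (70, 30),
    ("female", "drug"): (20, 80),
    ("female", "no drug"): (90, 210),
}


def recovery_rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)


# Conditional association: recovery rate within each gender.
conditional = {key: recovery_rate(*cell) for key, cell in counts.items()}

# Marginal association: pool the two genders before computing the rate.
marginal = {}
for treatment in ("drug", "no drug"):
    recovered = sum(counts[(g, treatment)][0] for g in ("male", "female"))
    not_recovered = sum(counts[(g, treatment)][1] for g in ("male", "female"))
    marginal[treatment] = recovery_rate(recovered, not_recovered)

print("conditional:", conditional)
print("marginal:", marginal)
```

Within each gender the drug group does worse, yet the pooled drug group does better, because most drug takers are men and men recover more often regardless of treatment.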
Confounding
Strength of association
To measure the strength of association, use other methods like:
• Odds ratio
• Relative risk
Odds ratio
Confidence intervals
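The general form of a confidence interval is: estimate ± critical value × standard error, e.g. $\hat{\theta} \pm z_{\alpha/2} \cdot SE(\hat{\theta})$ for a large-sample normal interval.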
Example – Odds ratio
In a study, patients admitted with lung cancer in the preceding year were queried about their smoking behavior.
For each of the 709 patients admitted, the investigators recorded the smoking behavior of a noncancer patient at the same hospital, of the same gender and within the same 5-year age group (a smoker was defined as a person who had smoked at least one cigarette a day for at least a year).
Evaluate the strength of association by calculating the odds ratio and its 95% confidence interval. Is the odds ratio significantly different from 1?
Cont
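From the example, $n_{11} = 688$, $n_{12} = 650$, $n_{21} = 21$, $n_{22} = 59$, so
$\hat{\theta} = \frac{n_{11} n_{22}}{n_{12} n_{21}} = \frac{688 \times 59}{650 \times 21} \approx 2.97$
The odds of lung cancer for smokers are about three times the odds for non-smokers.
$SE(\log\hat{\theta}) = \sqrt{\frac{1}{688} + \frac{1}{650} + \frac{1}{21} + \frac{1}{59}} \approx 0.260$
95% CI for $\log\theta$ is $\log(2.97) \pm 1.96 \times 0.260 = 1.090 \pm 0.509 = (0.58, 1.60)$
95% CI for $\theta$ is $(e^{0.58}, e^{1.60}) = (1.79, 4.95)$
The 95% confidence interval is (1.79, 4.95). Since it does not include 1, the odds ratio is significantly different from 1. The odds of lung cancer in the smoking group are between 1.79 and 4.95 times the odds in the non-smoking group.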
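The interval can be reproduced with a few lines of code. The sketch below uses the large-sample (Wald) interval on the log scale, which is consistent with the reported limits 1.79 and 4.95; the function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio of a 2x2 table with a Wald interval built on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper


# Smoking (rows: yes, no) by lung cancer (columns: case, control).
theta, lower, upper = odds_ratio_ci(688, 650, 21, 59)
print(f"odds ratio = {theta:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is built for $\log\theta$ and then exponentiated because the sampling distribution of the log odds ratio is closer to normal than that of the odds ratio itself.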
Relative risk
Relative risk is the ratio of the probability of an outcome in an exposed group to the probability of the outcome in an unexposed group: $RR = \pi_1 / \pi_2$.
Example – Relative risk
In the smoking and lung cancer example, the proportions having lung cancer were $688/1338 = 0.514$ for smokers and $21/80 = 0.263$ for non-smokers, so $RR = (688/1338)/(21/80) \approx 1.96$.
Similarity between OR and RR
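The relationship between odds ratio and relative risk: $\text{odds ratio} = \text{relative risk} \times \frac{1 - \pi_2}{1 - \pi_1}$, where $\pi_1$ and $\pi_2$ are the outcome probabilities in the two groups. When both probabilities are close to 0, the odds ratio and the relative risk take similar values.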
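As a quick numerical check, the sketch below computes the relative risk and the odds ratio from the row proportions of the smoking table (688/1338 for smokers and 21/80 for non-smokers) and confirms the identity odds ratio = relative risk × (1 − π2)/(1 − π1). The helper names are illustrative.

```python
def relative_risk(p1, p2):
    return p1 / p2


def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


# Row proportions with lung cancer in the smoking example.
p_smoker = 688 / 1338
p_nonsmoker = 21 / 80

rr = relative_risk(p_smoker, p_nonsmoker)
theta = odds_ratio(p_smoker, p_nonsmoker)

# Right-hand side of the identity: relative risk times (1 - p2) / (1 - p1).
identity_rhs = rr * (1 - p_nonsmoker) / (1 - p_smoker)

print(f"relative risk = {rr:.3f}")
print(f"odds ratio    = {theta:.3f}")
print(f"RR * (1 - p2) / (1 - p1) = {identity_rhs:.3f}")
```

Here the two proportions are far from 0, so the odds ratio (about 2.97) is clearly larger than the relative risk (about 1.96); the two only agree closely for rare outcomes.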
Test of Independence
Cases: Large sample size / Small sample size
          Lung Cancer
Smoking   Case   Control   Total
Yes       688    650       1338
No        21     59        80
Test of Independence
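Pearson’s chi-square test and the likelihood ratio test are used for testing independence, $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all i and j, by evaluating the closeness between the observed frequencies $n_{ij}$ and the expected frequencies $\hat{\mu}_{ij} = n_{i+} n_{+j} / n$.
– Assumption: large samples and independence of individual observations.
$X^2 = \sum_{i,j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$, $G^2 = 2 \sum_{i,j} n_{ij} \log\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right)$
Degrees of freedom: $df = (I - 1)(J - 1)$
Under $H_0$, $X^2$ and $G^2$ approximately follow $\chi^2_{df}$. The larger the values of $X^2$ and $G^2$ are, the more evidence exists against independence. If the p-value is less than the significance level, reject the null hypothesis.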
Example – Chi-square test
Cont – Pearson Chi-square test
Degrees of freedom: $df = (I - 1)(J - 1)$
Cont – Likelihood ratio test
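The likelihood ratio statistic equals $G^2 = 2 \sum_{i,j} n_{ij} \log(n_{ij} / \hat{\mu}_{ij})$, where $n_{ij}$ are the observed and $\hat{\mu}_{ij}$ the expected frequencies.
Degrees of freedom: $df = (I - 1)(J - 1)$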
Properties
$X^2$ and $G^2$ have the same limiting chi-squared distribution (asymptotically equivalent).
Reordering the rows or columns does not change the result of a chi-square or likelihood ratio test of independence.
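Both statistics can be computed from observed and expected frequencies with the standard library. The sketch below applies them to the smoking by lung cancer table as an assumed illustration, since the slides' own chi-square example table is not reproduced in this text. The p-value helper covers only df = 1, where the chi-squared upper tail equals `erfc(sqrt(x / 2))`.

```python
import math


def independence_tests(table):
    """Pearson X^2, likelihood-ratio G^2 and df for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes 0 to G^2
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with df = 1."""
    return math.erfc(math.sqrt(x / 2))


# Rows: smoker yes/no; columns: lung cancer case/control.
x2, g2, df = independence_tests([[688, 650], [21, 59]])
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}")
print(f"p-values: {chi2_sf_df1(x2):.2e}, {chi2_sf_df1(g2):.2e}")
```

Swapping the two rows or the two columns of the input leaves both statistics unchanged, which is the ordering property noted above.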
Small sample size
Fisher’s exact test can be used to test independence when the sample size $n$ is small.
Example – Fisher’s test for small size data
Example: Lady Tasting Tea
R. A. Fisher described the following experiment from his days working at Rothamsted Experimental Station. His colleague, Dr. Muriel Bristol, declared that by tasting a cup of tea made with milk she could discriminate whether the milk or the tea infusion was added to the cup first.
Cont – Fisher’s test for small size data
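Distinguishing the order of pouring better than with pure guessing corresponds to $\theta > 1$, reflecting a positive association between order of pouring and the prediction.
Test $H_0: \theta = 1$ against $H_a: \theta > 1$.
Possible tables with all margins fixed at 4, from all incorrect to all correct:
All incorrect             Even                  All correct
0 4        1 3        2 2        3 1        4 0
4 0        3 1        2 2        1 3        0 4
The probability of the observed table, with $n_{11} = 3$ correct, is $P(3) = \binom{4}{3}\binom{4}{1} / \binom{8}{4} = 16/70 = 0.229$. The only table more extreme in the direction of $H_a$ has $n_{11} = 4$ correct, with $P(4) = 1/70 = 0.014$.
The P-value is $P(3) + P(4) = 0.243$.
Cannot reject $H_0$ at significance level 0.05, which means that the result does not establish an association between the actual order of pouring and her predictions.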
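With all margins fixed, the count in the first cell follows a hypergeometric distribution, so the one-sided exact P-value is a short sum. The sketch below assumes the classic outcome of 3 correct guesses out of 4 for each pouring order, as reported in Agresti's account of the experiment; the function names are illustrative.

```python
from math import comb


def hypergeom_prob(n11, row1, col1, n):
    """P(first cell = n11) for a 2x2 table with fixed margins."""
    return comb(row1, n11) * comb(n - row1, col1 - n11) / comb(n, col1)


def fisher_one_sided(n11, row1, col1, n):
    """P-value for H_a: theta > 1, summing over tables at least as extreme."""
    largest = min(row1, col1)
    return sum(hypergeom_prob(k, row1, col1, n) for k in range(n11, largest + 1))


# 8 cups: 4 with milk first, 4 guessed as milk first, 3 guessed correctly (assumed).
p_observed = hypergeom_prob(3, 4, 4, 8)
p_value = fisher_one_sided(3, 4, 4, 8)
print(f"P(3) = {p_observed:.3f}, one-sided P-value = {p_value:.3f}")
```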
Stratified data
The Cochran-Mantel-Haenszel (CMH) test is used for the analysis of stratified or matched categorical data.
– Often used in observational studies where random assignment of subjects to different treatments cannot be controlled, but confounding covariates can be measured.
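For 2x2 tables in K strata, the CMH statistic compares the first-cell count in each stratum with its expected value under conditional independence and pools over strata. The sketch below is illustrative: it omits the continuity correction and reuses the drug and recovery counts from the Simpson's paradox example, stratified by gender, with the female no-drug group read as 90 recovered and 210 not recovered (an assumption consistent with the stated 30% rate).

```python
import math


def cmh_statistic(strata):
    """CMH statistic (no continuity correction) for a list of 2x2 tables."""
    diff_sum = var_sum = 0.0
    for (n11, n12), (n21, n22) in strata:
        n = n11 + n12 + n21 + n22
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        expected = row1 * col1 / n  # E(n11) under conditional independence
        variance = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
        diff_sum += n11 - expected
        var_sum += variance
    return diff_sum ** 2 / var_sum


# Rows: drug yes/no; columns: recovered yes/no.
strata = [
    [[180, 120], [70, 30]],  # male
    [[20, 80], [90, 210]],   # female (assumed reading, see above)
]
cmh = cmh_statistic(strata)
p_value = math.erfc(math.sqrt(cmh / 2))  # chi-squared upper tail, df = 1
print(f"CMH = {cmh:.2f}, p-value = {p_value:.4f}")
```

Because the comparison is made within each stratum before pooling, the test detects the association that holds after controlling for gender rather than the misleading pooled one.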
Example – CMH test for stratified data
Null hypothesis: there is no association between obesity and CVD.
Paired data
McNemar’s test is used for comparing categorical responses for two samples that are statistically dependent.
Paired data commonly occur in studies with repeated measurement of subjects.
McNemar statistic with continuity correction:
$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$
where $b$ and $c$ are the counts in the two discordant cells of the paired 2x2 table.
For large samples, $\chi^2$ has a chi-squared distribution with $df = 1$. If the p-value is less than the significance level, reject the null hypothesis of marginal homogeneity (equal marginal proportions for the two samples).
Example – McNemar’s test for matched pair data
In the 2010 General Social Survey, subjects were asked whether they voted for the Democrat or the Republican in the 2004 and 2008 Presidential elections. Was there a shift toward one party between the two elections?
The McNemar statistic, referred to a chi-squared distribution with $df = 1$, gives a very small p-value: extremely strong evidence of a shift in the Democrat direction.
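The continuity-corrected statistic depends only on the two discordant counts. The sketch below uses hypothetical counts, because the survey table itself is not reproduced in this text; `b` and `c` stand for the numbers of subjects who switched in each direction.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and its chi-squared (df = 1) p-value."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-squared upper tail, df = 1
    return stat, p_value


# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar_corrected(b=25, c=10)
print(f"McNemar statistic = {stat:.2f}, p-value = {p_value:.4f}")
```

Subjects who gave the same response both times do not enter the statistic, since they carry no information about a shift.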
Ordinal data
Example – Linear trend test for ordinal data
Cont
Assign scores to the levels of the ordinal variables
• Average of the category interval
– Obesity: low, medium, high are assigned 1, 2, 3
– High BP: no, yes are assigned 0, 1
– Alcohol: 0, 1-2, 3-5, 6+ are assigned 0, 1.5, 4, 7
(The graphic on the left shows the recoded data.)
• Midrank
Rank the observations and use the midranks as scores.
– Obesity: 83 for low (average of ranks 1-165), 246 for medium (ranks 166-326), 409 for high (ranks 327-491)
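Midrank scores follow directly from the category counts: each category receives the average of the ranks its observations occupy. The sketch below uses the obesity counts implied by the rank ranges above (165, 161 and 165 subjects); the function name is illustrative.

```python
def midrank_scores(category_counts):
    """Average rank of the observations in each ordered category."""
    scores = []
    first_rank = 1
    for count in category_counts:
        last_rank = first_rank + count - 1
        scores.append((first_rank + last_rank) / 2)
        first_rank = last_rank + 1
    return scores


# Obesity categories occupy ranks 1-165, 166-326 and 327-491.
obesity_counts = [165, 161, 165]
obesity_scores = midrank_scores(obesity_counts)
print(obesity_scores)
```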
Compare results with other statistics
Obesity vs Alcohol 0.325 0.317 0.022
High BP vs Alcohol 0.026 0.022 0.003
Obesity vs High BP 0.003 0.003 0.001
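Summary of Statistical Testing Applied to Specific Cases
Large and independent data: Pearson chi-square test / likelihood ratio test
Small sample size: Fisher's exact test
Stratified data: Cochran-Mantel-Haenszel test
Paired data: McNemar's test
Ordinal data: linear trend test
** Many more statistical tests than those listed here can be applied to categorical data. Compare them and select the one that best fits your case, and check its assumptions before applying it.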
Bernoulli and Binomial distribution
Bernoulli trial: two possible outcomes for one trial (success, failure).
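Let $y = 1$ denote a success, with probability $P(Y = 1) = \pi$, and $y = 0$ a failure, with probability $P(Y = 0) = 1 - \pi$.
Binomial PMF for the number of successes $y$ in $n$ independent trials: $P(y) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}$, $y = 0, 1, \ldots, n$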
Example – Binomial distribution
There are three students registered for a class. The decision whether to attend class is independent for each student. Assume the probability of attending class is 0.5. What is the probability that none, one, two, or three students attend the class?
From the example, $\pi = 0.5$. For sample size $n = 3$, let $Y$ be the number of students who attend. The probability mass function is: $P(y) = \binom{3}{y} (0.5)^y (0.5)^{3 - y}$, giving $P(0) = 0.125$, $P(1) = 0.375$, $P(2) = 0.375$, $P(3) = 0.125$.
Since the sample size is 3, the probabilities of 0, 1, 2, and 3 students attending sum to 1.
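The attendance probabilities can be verified with a direct implementation of the binomial probability mass function; the function name is illustrative.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """P(Y = y) for Y ~ binomial(n, pi)."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)


# Three students, each attending independently with probability 0.5.
probs = [binomial_pmf(y, 3, 0.5) for y in range(4)]
for y, p in enumerate(probs):
    print(f"P({y} attend) = {p}")
print("total =", sum(probs))
```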
Properties – Binomial distribution
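• $Y \sim \text{binomial}(n, \pi)$ has mean $n\pi$ and variance $n\pi(1 - \pi)$
• The sample proportion of successes $p = Y/n$, also denoted $\hat{\pi}$, has mean $\pi$ and standard deviation $\sqrt{\pi(1 - \pi)/n}$
• If each trial has more than two possible outcomes, say $c$ categories, the counts $(n_1, \ldots, n_c)$ follow a multinomial distribution. The multinomial probability mass function is: $P(n_1, \ldots, n_c) = \frac{n!}{n_1! \cdots n_c!} \pi_1^{n_1} \cdots \pi_c^{n_c}$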
Poisson distribution
The Poisson distribution expresses the probability of a number of events occurring randomly over a fixed interval of time or space, when outcomes in disjoint periods or regions are independent.
Examples: the number of emails you get in an hour; the number of red cars that pass by in an hour; the number of earthquakes in a year in some region.
$P(y) = \frac{e^{-\mu} \mu^y}{y!}, \quad y = 0, 1, 2, \ldots$
Where $\mu > 0$ is the expected number of events in the interval.
Cont
Property: $E(Y) = \mathrm{Var}(Y) = \mu$ (the mean and the variance are equal).
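The equality of mean and variance can be checked numerically from the Poisson probability mass function. The sketch below uses a hypothetical mean of 2.5 events per interval and truncates the infinite support at 60, beyond which the probabilities are negligible for this mean.

```python
import math


def poisson_pmf(y, mu):
    """P(Y = y) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


mu = 2.5  # hypothetical mean, for illustration only
support = range(60)  # truncation of 0, 1, 2, ...
probs = [poisson_pmf(y, mu) for y in support]

mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))
print(f"total = {sum(probs):.6f}, mean = {mean:.4f}, variance = {variance:.4f}")
```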
Limitation
Overdispersion: count observations often exhibit variability exceeding that predicted by the binomial or Poisson distribution.
When the mean varies across conditions, the event counts display more variation than the model predicts.
The negative binomial is a related distribution for count data that has a second parameter and permits the variance to exceed the mean.
Generalized Linear Models
Generalized linear models (GLMs): extend ordinary regression models to
encompass non-normal response distributions and modeling functions of
the mean.
• Random component
– Consists of a response variable Y with independent observations
from a distribution in the natural exponential family.
• Systematic component
– Specify explanatory variables used in a linear predictor function.
• Link function
– A function $g(\mu)$ of the mean $\mu = E(Y)$ that equals a linear function of the explanatory variables: $g(\mu) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p$.
Introduction to Logistic Regression (LR)
Coefficient trend
Multiplicative effect
For the model $\text{logit}[\pi(x)] = \alpha + \beta x$, the odds multiply by $e^{\beta}$ for every 1-unit increase in $x$.
In other words, $e^{\beta}$ is an odds ratio: the odds at $x + 1$ divided by the odds at $x$.
Example – logistic model
The data come from an epidemiological survey investigating snoring as a risk factor for heart disease. Scores (0, 2, 4, 5) are used for the levels of snoring, treating the last two levels as closer together than the others. Build a logistic model to describe the relationship.
                             Heart Disease
Snoring              Score   Yes    No
Never                0       24     1355
Occasionally         2       35     603
Nearly every night   4       21     192
Every night          5       30     224
Cont
Software reports the logistic regression ML fit: $\text{logit}[\hat{\pi}(x)] = -3.87 + 0.40x$
Logistic model interpretation
Interpret:
• The positive $\hat{\beta}$ reflects the increased incidence of heart disease at higher snoring levels.
• The estimated probability of heart disease is about 0.02 for non-snorers (calculated using $\hat{\pi}(x) = \frac{e^{\hat{\alpha} + \hat{\beta}x}}{1 + e^{\hat{\alpha} + \hat{\beta}x}}$ at $x = 0$); it increases to 0.04 for occasional snorers, to 0.09 for those who snore nearly every night, and to 0.13 for those who always snore.
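The ML fit for the snoring data can be reproduced without a statistics package. The sketch below runs Newton-Raphson on the grouped binomial log-likelihood for a model with an intercept and one predictor; it is one possible implementation, not necessarily what the software behind the slides does, and the function name `fit_logistic` is illustrative.

```python
import math

# Grouped snoring data: (score, heart disease yes, heart disease no).
data = [(0, 24, 1355), (2, 35, 603), (4, 21, 192), (5, 30, 224)]


def fit_logistic(data, iterations=25):
    """Newton-Raphson ML fit of logit[pi(x)] = alpha + beta * x for grouped data."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, yes, no in data:
            n = yes + no
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
            w = n * p * (1.0 - p)
            g0 += yes - n * p
            g1 += x * (yes - n * p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system information * step = score.
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


alpha_hat, beta_hat = fit_logistic(data)
fitted = [1.0 / (1.0 + math.exp(-(alpha_hat + beta_hat * x))) for x, _, _ in data]
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
print("fitted probabilities:", [round(p, 2) for p in fitted])
```

The fitted probabilities at scores 0, 2, 4 and 5 correspond to the values 0.02, 0.04, 0.09 and 0.13 quoted in the interpretation.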
Example: Multiple logistic regression
The following data are from a study on the effects of AZT in slowing
the development of AIDS symptoms. In the study, 338 veterans
whose immune systems were beginning to falter after infection with
HIV were randomly assigned either to receive AZT immediately or to
wait until their T cells showed severe immune weakness.
                    Symptoms
Race    AZT Use     Yes    No
White   Yes         14     93
White   No          32     81
Black   Yes         11     52
Black   No          12     43
Cont
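Logistic regression for binary response: $\text{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z$, where $x$ and $z$ are 0/1 indicator variables for the two predictors, AZT use and race.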
Cont
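Multiple logistic model Interpretation
Parameter interpretation for $\text{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z$ with 0/1 indicators $x$ and $z$:
x   z   logit
1   1   $\alpha + \beta_1 + \beta_2$
1   0   $\alpha + \beta_1$
0   1   $\alpha + \beta_2$
0   0   $\alpha$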
Cont
Loglinear model
Poisson GLMs are used to model count or rate data for a single nonnegative integer-valued response variable.
The Poisson loglinear GLM assumes a Poisson distribution for $Y$ and uses the log link: $\log\mu = \alpha + \beta x$.
Example – Loglinear model
In this example, the response outcome for each of 173 female crabs is her number of satellites. Explanatory variables are the female crab's color, spine condition, weight, and carapace width. The table below shows a small subset of the data.
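Create a model to predict the number of satellites using carapace width. Let $\mu$ be the expected number of satellites at width $x$. The ML fit of the Poisson loglinear model is $\log\hat{\mu} = -3.305 + 0.164x$ (Agresti, 2018).
$e^{\hat{\beta}} = e^{0.164} = 1.18$ is the multiplicative effect on $\hat{\mu}$ for a 1-unit increase in $x$. For example, when the fitted value at width $x$ is 2.81, the fitted value at width $x + 1$ equals the multiplicative effect 1.18 times 2.81, about 3.32.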
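The multiplicative interpretation follows from the log link: raising the width by one unit adds the slope to the log of the mean, so the mean is multiplied by the exponential of the slope. The sketch below assumes the coefficients of the fit published in Agresti (2018), intercept −3.305 and slope 0.164; the width of 26.3 is a hypothetical value chosen only for illustration.

```python
import math

# Assumed coefficients, taken from the fit published in Agresti (2018).
ALPHA_HAT = -3.305
BETA_HAT = 0.164


def fitted_mean(width):
    """Fitted expected number of satellites under the log link."""
    return math.exp(ALPHA_HAT + BETA_HAT * width)


width = 26.3  # hypothetical carapace width, for illustration only
ratio = fitted_mean(width + 1) / fitted_mean(width)
print(f"fitted mean at width {width}: {fitted_mean(width):.2f}")
print(f"fitted mean at width {width + 1}: {fitted_mean(width + 1):.2f}")
print(f"ratio = {ratio:.2f}")
```

The ratio is the same at every width, which is what "multiplicative effect" means for a loglinear model.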
References
Agresti, A. (2018). An introduction to categorical data analysis. Wiley.
Knuiman, M.W. & Speed, T.P. (1988). Incorporating Prior Information into the Analysis of Contingency Tables. Biometrics, 44 (4), 1061–1071.
Zeng, P. (2012). Categorical Data Analysis - More discussions on Logistic Regression. http://webhome.auburn.edu/~carpedm/courses/stat7040/Review/06-logistic-more.pdf
Fenton, N., Neil, M. & Constantinou, A. (2015). Simpson’s Paradox and the implications for medical trials. https://www.eecs.qmul.ac.uk/~norman/papers/simpson.pdf