Applied Statistics II-2 and III

BCBB Workshop

Jingwen Gu ([email protected])
Clinical Statistician
Bioinformatics and Computational Biosciences Branch (BCBB) /OCICB /NIAID 

1. Contingency table
• Sensitivity, Specificity, Type I/II error, Positive/Negative Predictive Value
• Joint, Marginal, Conditional probability distribution

2. Strength of association
• Odds ratio
• Relative Risk

3. Test of independence
• Cases with nominal large sample size and small sample size
• Cases with stratified and paired data
• Cases with ordinal data

4. Generalized linear model

• Logistic regression model
• Loglinear model

A categorical variable is one for which the measurement scale
consists of a set of categories.

Categorical data is the statistical data type consisting of
categorical variables or of data that has been converted into that form.

In categorical data analysis,

• Response or dependent variable Y: two or more categories
• Explanatory or independent variables X: discrete or continuous
or both

Example:
Y = vote in election (Democrat, Republican, Independent)
X’s = income, education, gender, race

Categorical data type

Nominal and ordinal are categorical data.

• Nominal: unordered categories
– Examples: Gender, race, hair color
– Measures: counts, frequency, mode

• Ordinal: ordered categories
– Examples: Highest education degree, levels of satisfaction
– Measures: counts, frequency, mode, median

Contingency table

Cases: A data frame where each row represents one case (e.g.
patient-level data).

Count: A data frame where each row gives the frequency of one combination of categories.

Smoking   Lung Cancer   Count
Yes       Case          688
Yes       Control       650
No        Case          21
No        Control       59

Contingency table:

          Lung Cancer
Smoking   Case   Control
Yes       688    650
No        21     59

A table with cells that represent the IJ possible outcomes of two categorical
variables with I and J categories. When the cells contain frequency counts of
outcomes for a sample, it is called a contingency table.
Sensitivity and Specificity

Sensitivity of a test is the ability to identify correctly those who have the
disease. It is the proportion of patients with disease in whom the test
is positive.

Specificity of a test is the ability to identify correctly those who do not
have the disease. It is the proportion of patients without disease in
whom the test is negative.

Type I error is the rejection of a true null hypothesis, also known as a false
positive.

Type II error is the failure to reject a false null hypothesis, also known as a false
negative.

Example – Screening test

Screening
Test Result   Diseased   Not Diseased   Total
Positive      10         400            410
Negative      1          4500           4501
Total         11         4900           4911

The sensitivity of the screening test: $P(\text{positive} \mid \text{diseased}) = 10/11 \approx 0.909$.

The specificity of the screening test: $P(\text{negative} \mid \text{not diseased}) = 4500/4900 \approx 0.918$.

A good screening exam has both high sensitivity and high specificity.

Positive predictive value (PPV) of a test is the probability that an
individual with a positive test has the disease.

Negative predictive value (NPV) of a test is the probability that an
individual with a negative test does not have the disease.

                Diseased   Not Diseased   Total
Test Positive   100        900            1000
Test Negative   50         5000           5050
Total           150        5900           6050

For this table, PPV $= 100/1000 = 0.10$ and NPV $= 5000/5050 \approx 0.990$.
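
The four measures defined above can be computed directly from the cell counts of a 2×2 table. The following is a minimal sketch in Python; the function name `diagnostic_measures` and its argument names (true/false positives and negatives) are illustrative choices, not part of the original slides.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from 2x2 counts.

    tp: test positive and diseased      fp: test positive, not diseased
    fn: test negative but diseased      tn: test negative, not diseased
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(test + | diseased)
        "specificity": tn / (tn + fp),  # P(test - | not diseased)
        "ppv": tp / (tp + fp),          # P(diseased | test +)
        "npv": tn / (tn + fn),          # P(not diseased | test -)
    }


# Counts from the screening test table
screening = diagnostic_measures(tp=10, fp=400, fn=1, tn=4500)
# Counts from the predictive value table
predictive = diagnostic_measures(tp=100, fp=900, fn=50, tn=5000)

for name, result in (("screening", screening), ("predictive", predictive)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Note that PPV and NPV depend on how common the disease is in the sample, while sensitivity and specificity are conditional on the true disease status.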

Partial table

In a three-way contingency table that cross-classifies X, Y, and Z, we
control for Z by studying the XY relationship at fixed levels of Z.

• A partial table splits the original three-way table according to levels of Z.
• The associations in partial tables are called conditional associations. A
conditional association refers to the association between X and Y conditional
on fixing Z at some level.
• The two-way contingency table obtained by combining the partial
tables is called the XY marginal table.

Layout of the three-way table:

Z        X       Y
Gender   Smoke   Case   Control
Male     Yes
         No
Female   Yes
         No
Total    Yes
         No
Probability distribution

Joint distribution
– Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that (X, Y) occurs in the cell in row i and column j.
$\{\pi_{ij}\}$ is the joint distribution of X and Y.

Marginal distribution
– The marginal distributions are the row totals $\pi_{i+} = \sum_j \pi_{ij}$ or the column totals $\pi_{+j} = \sum_i \pi_{ij}$.
– The sum of each marginal distribution is 1.

Conditional distribution
– Given that a subject is classified in row i of X, we use $\pi_{j|i}$ to denote the
probability of classification in column j of Y, j = 1, . . . , J. Then $\pi_{j|i} = \pi_{ij} / \pi_{i+}$ and $\sum_j \pi_{j|i} = 1$.
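
As an arithmetic illustration of these three definitions, the sketch below turns the smoking by lung cancer counts (688, 650, 21, 59) into sample joint, marginal, and conditional proportions. The variable names are illustrative; the code only shows how the quantities relate to each other.

```python
# Rows: smoking (yes, no); columns: lung cancer (case, control)
counts = [[688, 650], [21, 59]]
n = sum(sum(row) for row in counts)

# Joint distribution: each cell count divided by the grand total
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: row sums and column sums of the joint distribution
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Conditional distribution of the column variable within each row
conditional = [[joint[i][j] / row_marginal[i] for j in range(2)] for i in range(2)]

print("joint:", [[round(p, 3) for p in row] for row in joint])
print("row marginal:", [round(p, 3) for p in row_marginal])
print("column marginal:", [round(p, 3) for p in col_marginal])
print("conditional on row:", [[round(p, 3) for p in row] for row in conditional])
```

Each row of `conditional` sums to 1, and so does each marginal distribution.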

Example - Probability distribution
A new drug is being tested on a group of 800 people (400 men and 400
women) with a particular disease. We wish to establish whether there is a
link between taking the drug and recovery from the disease.

Drug trial results:

                Drug taken
Recovered       Yes    No
Yes             200    160
No              200    240
Recovery rate   50%    40%

From the pooled table we might conclude that the drug has a positive effect. But if the results are broken
down by gender…

Gender          Male          Female
Drug taken      Yes    No     Yes    No
Recovered Yes   180    70     20     90
Recovered No    120    30     80     210
Recovery rate   60%    70%    20%    30%

For both males and females, the recovery rate is better without the drug. Gender
influences who takes the drug, because men in this study were much more likely than women to have been
given the drug.

The result that a marginal association can have a different direction from each
conditional association is called Simpson's paradox.

Can it be avoided? Only if we are certain that we know every variable that can affect
the outcome variable. In general we cannot be certain, so
Simpson’s paradox is theoretically unavoidable.
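
The reversal can be checked numerically. The sketch below uses the drug trial counts broken down by gender, taking 90 recovered and 210 not recovered for women without the drug (the counts that agree with the pooled table and the 30% recovery rate). The data layout and names are illustrative.

```python
# (recovered, not recovered) for each gender and treatment arm
data = {
    "male": {"drug": (180, 120), "no_drug": (70, 30)},
    "female": {"drug": (20, 80), "no_drug": (90, 210)},
}


def recovery_rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)


# Marginal (pooled) recovery rates, ignoring gender
pooled = {}
for arm in ("drug", "no_drug"):
    recovered = sum(data[gender][arm][0] for gender in data)
    not_recovered = sum(data[gender][arm][1] for gender in data)
    pooled[arm] = recovery_rate(recovered, not_recovered)

# Conditional recovery rates within each gender
by_gender = {
    gender: {arm: recovery_rate(*cells) for arm, cells in arms.items()}
    for gender, arms in data.items()
}

print("pooled:", pooled)
print("by gender:", by_gender)
```

The pooled rates favor the drug (0.5 against 0.4), while within each gender the rate is higher without the drug, which is exactly the change of direction that defines Simpson's paradox.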

Reference: Simpson’s Paradox and the implications for medical trials


A confounding variable is a variable that influences both
the dependent and the independent variable, causing a
spurious association.

To reduce the effects of confounding variables:

• In experimental studies: randomly assign subjects to
different treatment levels.
• In observational studies: control for confounding variables that
can influence the relationship.
• Statistical control: collect data on the potential
confounders and include them as variables in your model.

Strength of association
To measure the strength of association, use measures such as:
• Odds ratio
• Relative risk

These are also measures of risk, so they can be useful in safety
and efficacy studies.

These measures are most informative when confounding variables are controlled.

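Odds ratio

Odds is the probability of an outcome divided by the probability of not
having that outcome. If $\pi$ is the probability of the outcome, the odds equals $\pi / (1 - \pi)$.

The odds of the outcome when the exposure is present is: $\text{odds}_1 = \pi_1 / (1 - \pi_1)$

Similarly, the odds of the outcome when the exposure is absent is: $\text{odds}_2 = \pi_2 / (1 - \pi_2)$

Odds ratio is the ratio of the odds of two groups:

$$\theta = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}, \qquad \hat{\theta} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$$

The asymptotic standard error of $\log \hat{\theta}$:

$$SE(\log \hat{\theta}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$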

Confidence intervals

The general form of a confidence interval is: estimate ± critical value × standard error.

The confidence level of a confidence interval is the long-run proportion of such
intervals, over repeated samples, that contain the true parameter.

• Usually use a 95% confidence interval, $\alpha = 0.05$. In this lecture, the critical value is
usually $z_{\alpha/2}$. For a 95% confidence interval, $z_{0.025}$ = 1.96.
• Sometimes a confidence interval for $\theta$ can be obtained indirectly: we
first calculate a confidence interval $(L, U)$ for $\log \theta$, and then a confidence
interval for $\theta$ is obtained as $(e^L, e^U)$.

Example – Odds ratio
In a study, patients admitted with lung cancer in the preceding year were queried
about their smoking behavior.
For each of the 709 patients admitted, they recorded the smoking behavior of a
noncancer patient at the same hospital of the same gender and within the same 5-
year grouping on age (smoker was defined as a person who had smoked at least one
cigarette a day for at least a year).

Evaluate the strength of association by calculating odds ratio and 95% confidence
interval. Is the odds ratio significantly different from 1?

In this example, $\hat{\theta} = \frac{688 \times 59}{650 \times 21} \approx 2.97$.
The odds of lung cancer for a patient who smokes are about three times the odds for a
patient who does not smoke.

The asymptotic standard error of the log odds ratio is:
$SE = \sqrt{1/688 + 1/650 + 1/21 + 1/59} \approx 0.26$

95% CI for $\log \theta$ is $\log(2.97) \pm 1.96 \times 0.26 = (0.58, 1.60)$

95% CI for $\theta$ is $(e^{0.58}, e^{1.60}) = (1.79, 4.95)$

The 95% confidence interval is (1.79, 4.95). Since it does not include 1, the odds ratio is
significantly different from 1. The odds of lung cancer in the smoking group are
between 1.79 and 4.95 times the odds in the non-smoking group.
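
The calculation can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `odds_ratio_ci` and the cell labels n11, n12, n21, n22 (row 1 = smokers, column 1 = cases) are illustrative.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio with a Wald confidence interval built on the log scale."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, se_log, (lower, upper)


# Smoking (rows) by lung cancer case/control (columns)
odds_ratio, se_log, (lower, upper) = odds_ratio_ci(688, 650, 21, 59)
print(round(odds_ratio, 2), round(se_log, 2), round(lower, 2), round(upper, 2))
```

Building the interval on the log scale and exponentiating the endpoints keeps both limits positive, which a symmetric interval around the odds ratio itself would not guarantee.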

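Relative risk

Relative risk is the ratio of the probability of an outcome in an
exposed group to the probability of the outcome in an unexposed group.

Relative risk equals: $RR = \pi_1 / \pi_2$, estimated by $p_1 / p_2$.

An estimated standard error for $\log(p_1 / p_2)$:

$$SE = \sqrt{\frac{1 - p_1}{n_1 p_1} + \frac{1 - p_2}{n_2 p_2}}$$

Example – Relative risk

In the smoking and lung cancer example, the proportions having lung cancer were $p_1 = 688/1338 = 0.5142$ for smokers and
$p_2 = 21/80 = 0.2625$ for non-smokers.

The sample relative risk is $p_1 / p_2 = 0.5142 / 0.2625 \approx 1.96$

Participants who smoke are 1.96 times as likely to develop lung cancer as non-smokers.

An estimated standard error for $\log(p_1 / p_2)$ is $\sqrt{\frac{1 - 0.5142}{688} + \frac{1 - 0.2625}{21}} \approx 0.19$

95% confidence interval of $\log RR$ is $\log(1.96) \pm 1.96 \times 0.19 \approx (0.30, 1.04)$

95% confidence interval of $RR$ is $(e^{0.30}, e^{1.04}) \approx (1.35, 2.84)$.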
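
A short Python sketch of the same calculation follows. It assumes the usual large-sample standard error of the log relative risk, $\sqrt{(1 - p_1)/(n_1 p_1) + (1 - p_2)/(n_2 p_2)}$; the function name `relative_risk_ci` is illustrative.

```python
import math


def relative_risk_ci(n11, n12, n21, n22, z=1.96):
    """Sample relative risk with a Wald confidence interval built on the log scale."""
    n1, n2 = n11 + n12, n21 + n22      # row totals (exposed, unexposed)
    p1, p2 = n11 / n1, n21 / n2        # proportion with the outcome in each row
    rr = p1 / p2
    se_log = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, se_log, (lower, upper)


# Smoking (rows) by lung cancer case/control (columns)
rr, se_log_rr, (rr_lower, rr_upper) = relative_risk_ci(688, 650, 21, 59)
print(round(rr, 2), round(se_log_rr, 2), round(rr_lower, 2), round(rr_upper, 2))
```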

Similarity between OR and RR

The relationship between odds ratio and relative risk:

$$OR = RR \times \frac{1 - \pi_2}{1 - \pi_1}$$

When $\pi_1$ and $\pi_2$ are small, the odds ratio is approximately equal to the relative risk.

OR and RR are useful for prospective study designs.

When dealing with small probabilities, RR is easier to interpret.

Test of Independence

Cases covered: large sample size, small sample size, stratified data, paired data, ordinal data.

Example of a large-sample table:

         Lung Cancer
Smoker   Case   Control   Total
Yes      688    650       1338
No       21     59        80
Total    709    709       1418

Example of stratified data:

            CVD   Non-CVD   Total
Obese       10    90        100
Not Obese   35    465       500
Total       45    555       600

            CVD   Non-CVD   Total
Obese       36    164       200
Not Obese   25    175       200
Total       61    339       400

Test of Independence
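Pearson’s chi-square test and the likelihood ratio test are used for testing
independence by evaluating the closeness between observed and expected frequencies.
– Assumption: large samples and independence of individual observations.

– $H_0$: Two variables are independent. $H_a$: They are not independent.

– $\hat{\mu}_{ij}$ is the expected frequency, $n_{ij}$ is the observed frequency.

Pearson chi-square statistic: $X^2 = \sum_{i,j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$ where $\hat{\mu}_{ij} = \frac{n_{i+} n_{+j}}{n}$

Likelihood ratio statistic: $G^2 = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{\hat{\mu}_{ij}}$

Degrees of freedom: $df = (I - 1)(J - 1)$

$X^2$ and $G^2$ approximately follow $\chi^2_{df}$ under $H_0$. The larger the values of $X^2$ and $G^2$ are, the more evidence exists
against independence. If the p-value is less than the significance level, then reject the null hypothesis.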

Example – Chi-square test

From the example of the case-control study of lung cancer and smoking:

Is there a significant association between smoking and lung cancer?

$H_0$: Smoking and lung cancer are independent. $H_a$: They are not independent.

Assume significance level $\alpha = 0.05$.

Observed table

         Lung Cancer
Smoker   Case   Control   Total
Yes      688    650       1338
No       21     59        80
Total    709    709       1418

Expected table

         Lung Cancer
Smoker   Case   Control
Yes      669    669
No       40     40
Total    709    709

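Cont – Pearson Chi-square test

The Pearson statistic equals

$$X^2 = \frac{(688 - 669)^2}{669} + \frac{(650 - 669)^2}{669} + \frac{(21 - 40)^2}{40} + \frac{(59 - 40)^2}{40} \approx 19.13$$

Degrees of freedom: $df = (2 - 1)(2 - 1) = 1$

The P-value is about $1.2 \times 10^{-5}$, less than $\alpha = 0.05$, so we reject the null hypothesis (calculated in R).

At the 5% significance level, we reject the null hypothesis that smoking
and lung cancer are independent.

Cont – Likelihood ratio test

The likelihood ratio statistic equals

$$G^2 = 2 \left[ 688 \log\frac{688}{669} + 650 \log\frac{650}{669} + 21 \log\frac{21}{40} + 59 \log\frac{59}{40} \right] \approx 19.88$$

Degrees of freedom: $df = 1$

The P-value is about $8 \times 10^{-6}$, less than $\alpha = 0.05$, so we reject the null hypothesis.

At the 5% significance level, we reject the null hypothesis that
smoking and lung cancer are independent.

$X^2$ and $G^2$ have the same limiting chi-squared distribution
(asymptotically equivalent).

Under $H_0$, $X^2 - G^2$ converges in probability to zero.

The order of the rows or columns does not change the
result of a chi-square or likelihood ratio test of independence.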
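
Both statistics can be computed from the observed table with the standard library alone. The sketch below is illustrative: the function names are made up for this example, and the p-value helper relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal, so its upper tail probability is `erfc(sqrt(x / 2))`. That shortcut only applies when df = 1.

```python
import math


def independence_statistics(table):
    """Pearson X^2 and likelihood ratio G^2 for a two-way table of positive counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            g2 += 2 * observed * math.log(observed / expected)
    return x2, g2


def chi2_upper_tail_df1(statistic):
    """P(chi-squared with 1 df > statistic); valid for df = 1 only."""
    return math.erfc(math.sqrt(statistic / 2))


observed_table = [[688, 650], [21, 59]]  # smoking by lung cancer
x2, g2 = independence_statistics(observed_table)
print(round(x2, 2), chi2_upper_tail_df1(x2))
print(round(g2, 2), chi2_upper_tail_df1(g2))
```

Swapping the two rows or the two columns of `observed_table` leaves both statistics unchanged, since each cell contributes the same term regardless of its position.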

Small sample size
Fisher’s exact test can be used for a test of independence when $n$ is small.

– Assumption: independence of individual observations and fixed totals (the
row and column totals are fixed, or “conditioned”). When the row or column totals are
not fixed, the test is less powerful.

The probability mass function is (hypergeometric):

$$P(n_{11}) = \frac{\binom{n_{1+}}{n_{11}} \binom{n_{2+}}{n_{+1} - n_{11}}}{\binom{n}{n_{+1}}}$$

This gives the exact probability assigned to each of the possible outcomes.

Calculate the p-value as the total probability of the observed table
and all more extreme tables. Reject the null hypothesis if the p-value is less than the
significance level.

Example – Fisher’s test for small size data
Example: Lady Tasting Tea
R. A. Fisher described the following experiment from his
days working at Rothamsted Experimental Station. His
colleague, Dr. Muriel Bristol declares that by tasting a
cup of tea made with milk she can discriminate whether
the milk or the tea infusion was first added to the cup.

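Experiment design: eight cups of tea, four
with milk poured first and four with tea poured first, served in a
random order. She knew there were four cups of each
type and had to predict which four had the milk added first.

Cont – Fisher’s test for small size data
Distinguishing the order of pouring better than with pure guessing corresponds to $\theta > 1$,
reflecting a positive association between order of pouring and the prediction. $H_0$: $\theta = 1$, $H_a$: $\theta > 1$.

Possible tables given the fixed margins (rows: actual order; columns: predicted order), from all wrong to all correct:

All wrong             Even              All correct
0 4       1 3       2 2       3 1       4 0
4 0       3 1       2 2       1 3       0 4

The probability of the observed table, with three correct predictions of milk first, is $P(3) = \binom{4}{3}\binom{4}{1} / \binom{8}{4} = 16/70 = 0.229$. The only more extreme table in the direction of $H_a$ has all four correct, with $P(4) = 1/70 = 0.014$.
The P-value is $P(3) + P(4) = 0.243$.
We cannot reject $H_0$ at significance level 0.05, which means that the result does not establish
an association between actual order of pouring and her predictions.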
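
The exact probabilities for the tea-tasting design follow from the hypergeometric formula. The sketch below assumes the observed outcome is three correct "milk first" predictions, which is the outcome consistent with the non-rejection reported above; the function name is illustrative.

```python
from math import comb


def hypergeometric_pmf(k, row1_total, col1_total, n):
    """P(n11 = k) for a 2x2 table with fixed row and column totals."""
    return comb(row1_total, k) * comb(n - row1_total, col1_total - k) / comb(n, col1_total)


# Eight cups: four with milk first (row 1 total), four predicted as milk first (column 1 total)
probabilities = {k: hypergeometric_pmf(k, 4, 4, 8) for k in range(5)}

observed_correct = 3  # assumed observed count of correct "milk first" predictions
p_value = sum(p for k, p in probabilities.items() if k >= observed_correct)

print({k: round(p, 3) for k, p in probabilities.items()})
print(round(p_value, 3))
```

The one-sided p-value adds the probability of the observed table and of the single more extreme table (all four correct), giving 17/70.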

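Stratified data
The Cochran-Mantel-Haenszel (CMH) test is used for the analysis of stratified or matched
categorical data.
– Often used in observational studies where random assignment of subjects to different
treatments cannot be controlled, but confounding covariates can be measured.

– $H_0$: There is no association between the two inner variables.

According to the stratification, create $2 \times 2$ contingency tables. Assume there
are $K$ strata; the table for stratum $k$ is:

        Y1          Y2          Total
X1      $n_{11k}$   $n_{12k}$   $n_{1+k}$
X2      $n_{21k}$   $n_{22k}$   $n_{2+k}$
Total   $n_{+1k}$   $n_{+2k}$   $n_{++k}$

The test statistic can be calculated by:

$$CMH = \frac{\left[ \sum_k (n_{11k} - \mu_{11k}) \right]^2}{\sum_k \mathrm{Var}(n_{11k})}, \quad \mu_{11k} = \frac{n_{1+k} n_{+1k}}{n_{++k}}, \quad \mathrm{Var}(n_{11k}) = \frac{n_{1+k} n_{2+k} n_{+1k} n_{+2k}}{n_{++k}^2 (n_{++k} - 1)}$$

It follows a chi-squared distribution asymptotically with one degree of freedom under $H_0$.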

Example – CMH test for stratified data

To examine the association between obesity and cardiovascular diseases (CVD),
data are stratified into two age groups, split at age 50:

Stratum 1
            CVD   Non-CVD   Total
Obese       10    90        100
Not Obese   35    465       500
Total       45    555       600

Stratum 2
            CVD   Non-CVD   Total
Obese       36    164       200
Not Obese   25    175       200
Total       61    339       400

$H_0$: There is no association between obesity and CVD.

p-value = 0.08, so we fail to reject the null hypothesis that there is no association between
obesity and CVD.
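
The CMH statistic for the two age strata can be sketched as follows. The code assumes the standard form of the statistic (sum of observed minus expected counts in the first cell, squared, divided by the sum of hypergeometric variances) and offers an optional continuity correction of 0.5, which is what, for instance, R's `mantelhaen.test` applies by default. With the correction the p-value is about 0.08, matching the value reported above; without it the p-value is slightly smaller. Function and variable names are illustrative.

```python
import math


def cmh_statistic(strata, continuity_correction=False):
    """Cochran-Mantel-Haenszel statistic for a list of 2x2 tables ((a, b), (c, d))."""
    total_difference = 0.0
    total_variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        total_difference += a - row1 * col1 / n          # observed minus expected n11
        total_variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    numerator = abs(total_difference) - (0.5 if continuity_correction else 0.0)
    return numerator ** 2 / total_variance


def chi2_upper_tail_df1(statistic):
    """P(chi-squared with 1 df > statistic); valid for df = 1 only."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: obese, not obese; columns: CVD, non-CVD; one table per age stratum
strata = [((10, 90), (35, 465)), ((36, 164), (25, 175))]

cmh_plain = cmh_statistic(strata)
cmh_corrected = cmh_statistic(strata, continuity_correction=True)
p_plain = chi2_upper_tail_df1(cmh_plain)
p_corrected = chi2_upper_tail_df1(cmh_corrected)
print(round(cmh_plain, 2), round(p_plain, 3))
print(round(cmh_corrected, 2), round(p_corrected, 3))
```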

Paired data

McNemar’s test is used for comparing two samples that are statistically dependent.
Such data commonly occur in studies with repeated measurement of subjects.

McNemar statistic with continuity correction:

$$\chi^2 = \frac{(|n_{12} - n_{21}| - 1)^2}{n_{12} + n_{21}}$$

where $n_{12}$ and $n_{21}$ are the counts of discordant pairs.

For large samples, $\chi^2$ has a chi-squared distribution with $df = 1$. If the
p-value is less than the significance level, reject the null hypothesis of
marginal homogeneity (no shift between the paired responses).
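
As a minimal sketch, the continuity-corrected McNemar statistic needs only the two discordant counts. The counts used below (25 and 10) are hypothetical values chosen for illustration; they are not taken from any study in these slides. The p-value again uses the df = 1 identity `erfc(sqrt(x / 2))`.

```python
import math


def mcnemar_test(n12, n21):
    """Continuity-corrected McNemar statistic and its chi-squared (df = 1) p-value."""
    statistic = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant pairs: 25 changed one way, 10 changed the other way
mcnemar_stat, mcnemar_p = mcnemar_test(n12=25, n21=10)
print(round(mcnemar_stat, 2), round(mcnemar_p, 4))
```

Only the discordant pairs enter the statistic; pairs with the same response on both occasions carry no information about a shift.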

Example – McNemar’s test for matched pair

In the 2010 General Social Survey, subjects were asked whether they voted for
the Democrat or the Republican in the 2004 and 2008 Presidential elections. Was
there a shift in the Democrat direction?

The McNemar statistic has $df = 1$ and a very small p-value: extremely strong evidence of a
shift in the Democrat direction.

Ordinal data

The Pearson and likelihood ratio chi-square tests ignore some information when used to test
independence between ordinal classifications. Tests that take the ordering
into account are usually more powerful.

In ordinal data analysis, we can assign scores to the levels of
ordinal variables by using:
• Average of category interval
• Midrank

Then the linear trend test statistic can be used:

$$M^2 = (n - 1) r^2$$

where $r$ is the Pearson correlation between the two variables and $n$ is the sample size.

For large samples, $M^2$ is approximately chi-squared with df = 1.

Example – Linear trend test for ordinal data

491 subjects are cross-classified according to three
factors: hypertension (hyp; 2 levels), obesity (obe; 3 levels)
and alcohol (alc; 4 levels).
• Alc: the classification of alcohol intake in drinks per day
(0, 1-2, 3-5, 6+)
• Obe: the classification of obesity (low, average, high)
• Hyp: the classification of hypertension (yes, no)
Objective: test whether there is a correlation between two ordinal variables.
• Use the linear trend test to test for independence between
two variables.
• Check the p-value of the $M^2$ statistic with 1 degree of freedom.

Source: Knuiman, M.W. & Speed, T.P. (1988) Incorporating
Prior Information into the Analysis of Contingency
Tables. Biometrics, 44 (4), 1061–1071.


Assign scores to the levels of the ordinal variables
• Average of the category interval
– Obesity: low, average, high are assigned 1, 2, 3
– High BP: no, yes are assigned 0, 1
– Alcohol: 0, 1-2, 3-5, 6+ are assigned 0, 1.5, 4, 7

• Midrank
Rank the observations and apply the midranks as scores.
– Obesity: 83 for low (average of ranks 1-165), 246 for average (ranks 166-326),
409 for high (ranks 327-491)

Create contingency table

(rows for obesity and columns for alcohol)

Compute the Pearson correlation $r$ between the obesity and alcohol scores, then $M^2 = (n - 1) r^2$ and its p-value with $df = 1$.

Compare the p-value with the significance level and draw a conclusion.

At the 5% significance level, we conclude that there is an association
between obesity and alcohol.

Compare result with other statistic

Obesity vs Alcohol 0.325 0.317 0.022
High BP vs Alcohol 0.026 0.022 0.003
Obesity vs High BP 0.003 0.003 0.001

Sensitivity to choice of scores

• Scores that are linear transforms of each other, such as (1, 2,
3, 4) and (0, 2, 4, 6), have the same absolute correlation and
hence the same $M^2$.
• Results may depend on the scores when the data are highly
unbalanced.
Summary of Statistical Testing Applied
to Specific Cases

Large and independent data   ->  Pearson Chi-square test /
                                 Likelihood ratio test
Small and conditioned data   ->  Fisher’s exact test
Stratified data              ->  Cochran-Mantel-Haenszel test
Paired data                  ->  McNemar’s test / CMH test
Ordinal data                 ->  Linear trend test

** Many more statistical tests than those listed can be applied to categorical data. Compare them and select the one that best fits
your case. Check the assumptions before applying a test.

Bernoulli and Binomial distribution
Bernoulli trial: two possible outcomes for one trial (success, failure).
Assume $Y = 1$ is a success, with probability of success $\pi$. Assume $Y = 0$ is a failure, with probability of
failure $1 - \pi$.

Bernoulli probability mass function (PMF): $P(Y = y) = \pi^y (1 - \pi)^{1 - y}, \quad y = 0, 1$

Binomial distribution with parameters $n$ and $\pi$ is the discrete probability distribution of the number
of successes in a sequence of n independent experiments.

• Binomial: n Bernoulli trials, with two possible outcomes for each trial.

• Y follows a binomial distribution, $Y \sim \text{Bin}(n, \pi)$.

Binomial PMF: $P(y) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}, \quad y = 0, 1, \ldots, n$

It represents the probability of having y successes in n independent trials.

Example – Binomial distribution

Three students are registered for a class. The decision whether to attend class
is independent for each student. Assume the probability of attending class is
0.5. What is the probability that none, one, two, or three students attend the class?

From the example, $\pi = 0.5$. For sample size $n = 3$, let $Y$ be the number of students who attend. The probability mass function is:

$$P(y) = \binom{3}{y} 0.5^y (1 - 0.5)^{3 - y}$$

The probability that no student attends class is: $P(0) = 0.5^3 = 0.125$

The probability of having 1, 2, 3 students attend is: $P(1) = 0.375$, $P(2) = 0.375$, $P(3) = 0.125$

Since these are all the possible outcomes for a sample of size 3, the probabilities of none, 1, 2, and 3 students attending sum to 1.
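
The four probabilities can be checked with a direct implementation of the binomial PMF. This is a minimal sketch; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """Probability of y successes in n independent trials with success probability pi."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)


# Three students, each attending independently with probability 0.5
attendance_probabilities = [binomial_pmf(y, n=3, pi=0.5) for y in range(4)]
print(attendance_probabilities)
print(sum(attendance_probabilities))
```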

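Properties – Binomial distribution

• $Y \sim \text{Bin}(n, \pi)$ has mean $n\pi$ and variance $n\pi(1 - \pi)$.
• The sample proportion $p = Y/n$, which estimates the probability of success and is also denoted $\hat{\pi}$, has mean $\pi$ and standard
deviation $\sqrt{\pi(1 - \pi)/n}$.

• If each trial has more than two possible outcomes, say $c$ categories, the
counts follow a multinomial distribution. The multinomial probability
mass function is:

$$P(n_1, \ldots, n_c) = \frac{n!}{n_1! \cdots n_c!} \pi_1^{n_1} \cdots \pi_c^{n_c}$$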

Poisson distribution
The Poisson distribution expresses the probability of a number of events that
occur randomly over a fixed interval of time or space, when outcomes in
disjoint periods or regions are independent.

Examples: the number of emails you get in an hour; the number of red cars
that pass by in an hour; the number of earthquakes in a year in some region.

Poisson probabilities depend on a single parameter, the mean $\mu$.

Probability mass function: $P(y) = \frac{e^{-\mu} \mu^y}{y!}, \quad y = 0, 1, 2, \ldots$

$\mu$: mean number of occurrences in the given interval of time or space

$e$: Euler’s number, approximately 2.71828

Source: Poisson distribution
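
A small sketch of the Poisson PMF is given below, using a hypothetical mean of 2 events per interval (for example, 2 emails per hour; the value is chosen only for illustration). Summing over a long range of counts shows numerically that the mean and the variance both equal $\mu$.

```python
import math


def poisson_pmf(y, mu):
    """Probability of observing y events when the mean number of events is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


mu = 2.0  # hypothetical mean number of events per interval
support = range(60)  # probabilities beyond this range are negligible for mu = 2

total = sum(poisson_pmf(y, mu) for y in support)
mean = sum(y * poisson_pmf(y, mu) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)

print([round(poisson_pmf(y, mu), 4) for y in range(5)])
print(round(total, 6), round(mean, 6), round(variance, 6))
```

The equality of mean and variance is the property that overdispersed count data violate.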



Overdispersion: count observations often exhibit variability exceeding that
predicted by the binomial or Poisson.

When the mean $\mu$ varies across conditions, the counts display more
variability than the Poisson predicts.

The negative binomial is a related distribution for count data that has a
second parameter and permits the variance to exceed the mean:

$E(Y) = \mu, \quad \mathrm{Var}(Y) = \mu + \mu^2 / k$, where $\mu$ and $k$ are parameters.

Or use other methods, such as generalized linear models…

Generalized Linear Models

Generalized linear models (GLMs) extend ordinary regression models to
encompass non-normal response distributions and modeling functions of
the mean.
• Random component
– Consists of a response variable Y with independent observations
from a distribution in the natural exponential family.
• Systematic component
– Specifies the explanatory variables used in a linear predictor function.
• Link function
– A function $g(\mu)$ of the mean $\mu = E(Y)$ that equals a linear function of the explanatory
variables: $g(\mu) = \alpha + \beta_1 x_1 + \cdots + \beta_p x_p$.

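Introduction to Logistic Regression (LR)

The logistic regression model is used to describe data and to explain the relationship between one
dependent binary variable and one or more categorical or continuous independent
variables.

Suppose we have a binary response variable, denoted $Y$, and a single explanatory
variable $x$. The distribution of $Y$ is Bernoulli with success probability $\pi(x) = P(Y = 1 \mid x)$.

Logistic regression model:

$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$

As $x$ increases, $\pi(x)$ increases when $\beta > 0$ and decreases when $\beta < 0$.

The log odds has the linear relationship:

$$\text{logit}[\pi(x)] = \log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$$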

Coefficient trend

A fixed change in $x$ often has less impact when $\pi(x)$ is near 0
or 1 than when $\pi(x)$ is near 0.5.

Multiplicative effect
The odds multiply by $e^{\beta}$ for every 1-unit increase in $x$.
In other words, $e^{\beta}$ is an odds ratio: the
odds at $x + 1$ divided by the odds at $x$.

Example – logistic model
Data come from an epidemiological survey investigating snoring as a risk factor
for heart disease. Scores (0, 2, 4, 5) are used for the levels of snoring,
treating the last two levels as closer together. Build a logistic model to demonstrate
the relationship between snoring and heart disease.

                             Heart Disease
Snoring              Score   Yes   No
Never                0       24    1355
Occasionally         2       35    603
Nearly every night   4       21    192
Every night          5       30    224

Software reports the logistic regression ML fit: $\text{logit}[\hat{\pi}(x)] = -3.866 + 0.397x$

Logistic model interpretation

• The positive $\hat{\beta}$ reflects the increased incidence of heart disease
at higher snoring levels.
• The estimated probability of heart disease is about 0.02 for
non-snorers (calculated using $\hat{\pi}(x) = \exp(\hat{\alpha} + \hat{\beta} x) / [1 + \exp(\hat{\alpha} + \hat{\beta} x)]$ at $x = 0$); it increases to 0.04 for
occasional snorers, to 0.09 for those who snore nearly every
night, and to 0.13 for those who always snore.
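
The ML fit for the snoring data can be reproduced without a statistics package. The sketch below is one possible implementation, not the software used for the slides: it applies Newton-Raphson to the grouped binomial log-likelihood with two parameters and then evaluates the fitted probabilities at the four snoring scores. The function names and the starting values are illustrative choices.

```python
import math

# (snoring score, heart disease yes, total subjects) from the snoring table
snoring_groups = [(0, 24, 1379), (2, 35, 638), (4, 21, 213), (5, 30, 254)]


def fit_simple_logistic(groups, iterations=25):
    """Newton-Raphson ML fit of logit(pi) = alpha + beta * x for grouped binary data."""
    cases = sum(y for _, y, _ in groups)
    total = sum(n for _, _, n in groups)
    alpha = math.log(cases / (total - cases))  # start at the overall sample logit
    beta = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in groups:
            p = 1 / (1 + math.exp(-(alpha + beta * x)))
            weight = n * p * (1 - p)
            g0 += y - n * p            # score for alpha
            g1 += x * (y - n * p)      # score for beta
            h00 += weight              # information matrix entries
            h01 += weight * x
            h11 += weight * x * x
        determinant = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / determinant
        beta += (h00 * g1 - h01 * g0) / determinant
    return alpha, beta


def fitted_probability(alpha, beta, x):
    return 1 / (1 + math.exp(-(alpha + beta * x)))


alpha_hat, beta_hat = fit_simple_logistic(snoring_groups)
fitted = [round(fitted_probability(alpha_hat, beta_hat, x), 2) for x in (0, 2, 4, 5)]
print(round(alpha_hat, 3), round(beta_hat, 3))
print(fitted)
```

The exponentiated slope, `math.exp(beta_hat)`, is the estimated multiplicative effect on the odds of heart disease for each one-unit increase in the snoring score.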

Example: Multiple logistic regression
The following data are from a study on the effects of AZT in slowing
the development of AIDS symptoms. In the study, 338 veterans
whose immune systems were beginning to falter after infection with
HIV were randomly assigned either to receive AZT immediately or to
wait until their T cells showed severe immune weakness.

                  Symptoms
Race    AZT Use   Yes   No
White   Yes       14    93
        No        32    81
Black   Yes       11    52
        No        12    43

Multiple logistic regression for binary response:

$$\text{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z$$

where x represents AZT treatment and z represents race, predicting the
probability that AIDS symptoms develop.

In this model, we assume there is no interaction between race (z)
and AZT treatment (x): the effect of one factor is the same at each
level of the other factor.

$e^{\beta_i}$ is the multiplicative effect on the odds of a 1-unit increase in the corresponding explanatory variable, when
the levels of the other explanatory variables are kept fixed.

$z$: 1 for white race, 0 for black race

$x$: 1 for immediate AZT use, otherwise 0

Multiple logistic model Interpretation

x   z   logit
1   1   $\alpha + \beta_1 + \beta_2$
1   0   $\alpha + \beta_1$
0   1   $\alpha + \beta_2$
0   0   $\alpha$

• $\alpha$ is the log odds of developing AIDS symptoms for black
subjects without immediate AZT use.
• $\beta_1$ is the increment to the log odds for those with immediate AZT use.
• $\beta_2$ is the increment to the log odds for white subjects.

At a fixed level of z, the effect on the logit of changing
categories of x is:

$$[\alpha + \beta_1 (1) + \beta_2 z] - [\alpha + \beta_1 (0) + \beta_2 z] = \beta_1$$

The estimated odds ratio between immediate AZT use and
development of AIDS symptoms equals $e^{\hat{\beta}_1}$.

For each race, the estimated odds of symptoms are about half as
high for those who took AZT immediately.

The Wald confidence interval for this effect is $\exp[\hat{\beta}_1 \pm 1.96 \, SE(\hat{\beta}_1)]$.
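
The estimated AZT odds ratio and its Wald interval can be obtained by fitting the no-interaction model to the four rows of the AZT table. The sketch below is one way to do this with the standard library only: Newton-Raphson on the grouped binomial log-likelihood, with a small Gauss-Jordan solver for the 3×3 system. It is an illustration under these choices, not the software output used for the slides, and all function names are made up for the example.

```python
import math

# (x = immediate AZT, z = white race, symptoms yes, total subjects)
azt_groups = [(1, 1, 14, 107), (0, 1, 32, 113), (1, 0, 11, 63), (0, 0, 12, 55)]


def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [value - factor * p for value, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def score_and_information(groups, coefficients):
    """Score vector and information matrix of the grouped binomial log-likelihood."""
    score = [0.0, 0.0, 0.0]
    information = [[0.0] * 3 for _ in range(3)]
    for x, z, y, n in groups:
        design = [1.0, x, z]  # intercept, AZT indicator, race indicator
        linear = sum(b * d for b, d in zip(coefficients, design))
        p = 1 / (1 + math.exp(-linear))
        for i in range(3):
            score[i] += design[i] * (y - n * p)
            for j in range(3):
                information[i][j] += n * p * (1 - p) * design[i] * design[j]
    return score, information


coefficients = [0.0, 0.0, 0.0]
for _ in range(25):  # Newton-Raphson iterations
    score, information = score_and_information(azt_groups, coefficients)
    step = solve_linear_system(information, score)
    coefficients = [b + s for b, s in zip(coefficients, step)]

# Variance of the AZT coefficient: (1, 1) entry of the inverse information matrix
_, information = score_and_information(azt_groups, coefficients)
se_azt = math.sqrt(solve_linear_system(information, [0.0, 1.0, 0.0])[1])

azt_odds_ratio = math.exp(coefficients[1])
wald_lower = math.exp(coefficients[1] - 1.96 * se_azt)
wald_upper = math.exp(coefficients[1] + 1.96 * se_azt)
print([round(b, 3) for b in coefficients])
print(round(azt_odds_ratio, 2), round(wald_lower, 2), round(wald_upper, 2))
```

The fitted odds ratio is close to one half, in line with the interpretation above, and the Wald interval excludes 1.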

Loglinear model

Poisson GLMs are used to model count or rate data for a single
nonnegative integer-valued response variable.

A Poisson loglinear GLM assumes a Poisson distribution for $Y$ and uses the
log link.

The Poisson loglinear model with explanatory variable $x$ is

$$\log \mu = \alpha + \beta x$$

The mean satisfies the exponential relationship

$$\mu = \exp(\alpha + \beta x) = e^{\alpha} (e^{\beta})^x$$

A 1-unit increase in $x$ has a multiplicative impact of $e^{\beta}$ on $\mu$: the mean at $x + 1$ equals
the mean at $x$ times $e^{\beta}$.

Example – Loglinear model

In this example, the response outcome for each of 173 female crabs is her
number of satellites. Explanatory variables are the female crab's color, spine
condition, weight, and carapace width. The table below shows a small subset of the
data.

Create a model to predict the number of satellites using carapace width. Let $\mu$ be the
expected number of satellites at width $x$. The ML fit of the Poisson loglinear model
has the form $\log \hat{\mu} = \hat{\alpha} + \hat{\beta} x$.

$e^{\hat{\beta}} = 1.18$ is the multiplicative effect on $\hat{\mu}$ of a 1-unit increase in $x$. For example, if the fitted mean at width $x$ is 2.81, the fitted mean at width $x + 1$
equals the multiplicative effect 1.18 times 2.81, about 3.32.

Agresti, A. (2018). An introduction to categorical data analysis. Wiley.

Sullivan, L. M. (2011). Essentials of biostatistics in public health. Jones & Bartlett Publishers.

Knuiman, M.W. & Speed, T.P. (1988) Incorporating Prior Information into the Analysis of
Contingency Tables. Biometrics, 44 (4), 1061–1071.

Peng Zeng (2012) Categorical Data Analysis - More discussions on Logistic Regression

Fenton, N., Neil, M. and Constantinou, A. (2015) Simpson’s Paradox and the implications for
medical trials

Wikipedia. Cochran–Mantel–Haenszel statistics.


