Applied Statistics II-2 and III


BCBB Workshop

Applied Statistics II-2 & III

Jingwen Gu ([email protected])
Clinical Statistician
Bioinformatics and Computational Biosciences Branch (BCBB) /OCICB /NIAID 

Contact us: [email protected]


Outline
1. Contingency table
• Sensitivity, Specificity, Type I/II error, Positive/Negative Predictive Value
• Joint, Marginal, Conditional probability distribution

2. Strength of association
• Odds ratio
• Relative Risk

3. Test of independence
• Cases with large and small sample sizes
• Cases with stratified and paired data
• Cases with ordinal data

4. Generalized linear model


• Logistic regression model
• Loglinear model

2
Introduction
Categorical variable is one for which the measurement scale
consists of a set of categories.

Categorical data is the statistical data type consisting of
categorical variables, or of data that has been converted into
that form.

In categorical data analysis,


• Response or dependent variable Y: two or more categories
• Explanatory or independent variables X: discrete, continuous, or both
Example
Y = vote in election (Democrat, Republican, Independent)
X’s - income, education, gender, race

3
Categorical data type

Nominal and ordinal variables are both categorical.

• Nominal: unordered categories
  – Examples: gender, race, hair color
  – Measures: counts, frequency, mode

• Ordinal: ordered categories
  – Examples: highest education degree, level of satisfaction
  – Measures: counts, frequency, mode, median

4
Contingency table

Cases: a data frame where each row represents one case (e.g.
patient-level data).

Case-level data:
Smoking   Lung Cancer   Count
Yes       Case          688
Yes       Control       650
No        Case          21
No        Control       59

Contingency table:
           Lung Cancer
Smoking    Case   Control
Yes        688    650
No         21     59

A table whose cells represent the I × J possible outcomes, where the
cells contain frequency counts of outcomes for a sample, is called a
contingency table.

5
Sensitivity and Specificity

Sensitivity of a test is its ability to identify correctly those who
have the disease: the proportion of patients with the disease in whom
the test is positive, Sensitivity = TP / (TP + FN).

Specificity of a test is its ability to identify correctly those who
do not have the disease: the proportion of patients without the
disease in whom the test is negative, Specificity = TN / (TN + FP).

Type I error is the rejection of a true null hypothesis, also known
as a false positive.

Type II error is the failure to reject a false null hypothesis, also
known as a false negative.

6
Example – Screening test

Screening Test Result   Diseased   Not Diseased   Total
Positive                10         400            410
Negative                1          4500           4501
Total                   11         4900           4911

The sensitivity of the screening test: 10 / 11 ≈ 0.91.

The specificity of the screening test: 4500 / 4900 ≈ 0.92.
A good screening exam has both high sensitivity and specificity.
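The two calculations above can be sketched in a few lines (Python used here for illustration; the slides themselves do not include code):

```python
# Screening-test table from the slide:
#             Diseased   Not diseased
# Positive       10          400
# Negative        1         4500
tp, fp = 10, 400    # true positives, false positives
fn, tn = 1, 4500    # false negatives, true negatives

sensitivity = tp / (tp + fn)   # P(test positive | diseased)
specificity = tn / (tn + fp)   # P(test negative | not diseased)

print(round(sensitivity, 2))   # 10/11     ≈ 0.91
print(round(specificity, 2))   # 4500/4900 ≈ 0.92
```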

7
PPV and NPV

Positive predictive value (PPV) of a test is the probability that an
individual with a positive test has the disease: PPV = TP / (TP + FP).

Negative predictive value (NPV) of a test is the probability that an
individual with a negative test does not have the disease:
NPV = TN / (TN + FN).

Diseased Not Diseased Total


Test Positive 100 900 1000
Test Negative 50 5000 5050
Total 150 5900 6050
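From the table above, the PPV and NPV follow directly (a minimal Python sketch, not part of the original slides):

```python
# PPV/NPV table from the slide:
#                 Diseased   Not diseased
# Test positive     100          900
# Test negative      50         5000
tp, fp = 100, 900
fn, tn = 50, 5000

ppv = tp / (tp + fp)   # P(diseased | test positive)
npv = tn / (tn + fn)   # P(not diseased | test negative)

print(round(ppv, 2))   # 100/1000  = 0.10
print(round(npv, 2))   # 5000/5050 ≈ 0.99
```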

8
Partial table

In a three-way contingency table cross-classifying X, Y, and Z, we
control for Z by studying the XY relationship at fixed levels of Z.

• A partial table splits the original three-way table according to
  the levels of Z.

• The associations in partial tables are called conditional
  associations: the association between X and Y conditional on
  fixing Z at some level.

• The two-way contingency table obtained by combining the partial
  tables is called the XY marginal table.

Example layout (Z = Gender, X = Smoke, Y = Case/Control):
Gender   Smoke   Case   Control
Male     Yes
         No
Female   Yes
         No
Total    Yes
         No

9
Probability distribution

Joint distribution
– Let πij denote the probability that (X, Y) falls in the cell in row i and column j.
  {πij} is the joint distribution of X and Y.

Marginal distribution
– The marginal distributions are the row totals πi+ = Σj πij and the column totals π+j = Σi πij.
– Each marginal distribution sums to 1.

Conditional distribution
– Given that a subject is classified in row i of X, let πj|i denote the
  probability of classification in column j of Y, j = 1, . . . , J. Then πj|i = πij / πi+.
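These three distributions can be computed from any table of counts; a short Python sketch using the smoking/lung-cancer counts shown earlier (illustration only, not from the slides):

```python
# Joint, marginal, and conditional proportions for the
# smoking / lung-cancer counts used earlier in the slides.
counts = [[688, 650],   # row 0: smoker
          [21,  59]]    # row 1: non-smoker
n = sum(sum(row) for row in counts)   # 1418

# joint distribution pi_ij
joint = [[c / n for c in row] for row in counts]

# marginal distributions pi_i+ (rows) and pi_+j (columns)
row_marg = [sum(r) for r in joint]
col_marg = [sum(joint[i][j] for i in range(2)) for j in range(2)]

# conditional distribution of Y given row 0: pi_ij / pi_i+
cond_row0 = [joint[0][j] / row_marg[0] for j in range(2)]

print(round(sum(row_marg), 10))    # marginals sum to 1
print(round(sum(cond_row0), 10))   # each conditional distribution sums to 1
```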

10
Example - Probability distribution
A new drug is being tested on a group of 800 people (400 men and 400
women) with a particular disease. We wish to establish whether there is a
link between taking the drug and recovery from the disease.

Drug trial results:


Drug taken
Recovered Yes No
Yes 200 160
No 200 240
Recovery rate 50% 40%

We might conclude that the drug has a positive effect. But if the
results are broken down by gender…

11
Cont
Gender          Male            Female
Drug taken      Yes     No      Yes     No
Recovered
  Yes           180     70      20      90
  No            120     30      80      210
Recovery rate   60%     70%     20%     30%

For both males and females, the recovery rates are better without the drug. Gender
confounds the drug effect because, in this study, men were much more likely than
women to have been given the drug.

The phenomenon that a marginal association can have a different direction from each
conditional association is called Simpson's paradox.

Can it be avoided? Yes, if we are certain that we know every variable that can affect
the outcome. If we are not certain – and in general we simply cannot be –
then Simpson’s paradox is theoretically unavoidable.
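The reversal can be verified directly from the counts. A Python sketch (not in the original slides; the per-gender counts below follow the recovery rates stated on the slide, with the pooled no-drug group recovering 160 of 400):

```python
# Drug-trial data split by gender: (recovered, not recovered).
male   = {"drug": (180, 120), "no_drug": (70, 30)}
female = {"drug": (20, 80),   "no_drug": (90, 210)}

def rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)

# Conditional (per-gender) rates: the drug looks WORSE in both strata.
print(rate(*male["drug"]), rate(*male["no_drug"]))       # 0.6 vs 0.7
print(rate(*female["drug"]), rate(*female["no_drug"]))   # 0.2 vs 0.3

# Marginal (pooled) rates: the drug looks BETTER -- Simpson's paradox.
pooled_drug = (180 + 20, 120 + 80)   # (200, 200)
pooled_no   = (70 + 90, 30 + 210)    # (160, 240)
print(rate(*pooled_drug), rate(*pooled_no))              # 0.5 vs 0.4
```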

Reference: Simpson’s Paradox and the implications for medical trials

12
Confounding

A confounding variable is a variable that influences both
the dependent and independent variables, causing a
spurious association.

To reduce the effects of a confounding variable:

• In experimental studies: randomly assign subjects to the
  different treatment levels.
• In observational studies: control for confounding variables that
  can influence the relationship (e.g., by matching or restriction).
• Statistical control: collect data on the potential confounders and
  include them as variables in your model.

13
Strength of association
To measure the strength of association, we use:
• Odds ratio
• Relative risk

Both are also measures of risk, so they are useful in safety
and efficacy studies.

These measures are most meaningful when confounding variables are controlled.

14
Odds ratio

The odds is the probability of an outcome divided by the probability of not
having that outcome. If π is the probability of the outcome, the odds equal π / (1 − π).

The odds of the outcome when the exposure is present: odds1 = π1 / (1 − π1), estimated by n11 / n12.

Similarly, the odds of the outcome when the exposure is absent: odds2 = π2 / (1 − π2), estimated by n21 / n22.

The odds ratio is the ratio of the odds of the two groups:

θ = odds1 / odds2, estimated by θ̂ = (n11 n22) / (n12 n21)

The asymptotic standard error of log θ̂:

ASE(log θ̂) = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
15
Confidence intervals

The general form of a confidence interval is:
estimate ± (critical value) × (standard error)

The confidence level of a confidence interval is the long-run
proportion of such intervals that contain the true parameter.

• We usually use a 95% confidence interval, for which the critical
  value is z(0.025) = 1.96.
• Sometimes a confidence interval for θ is obtained indirectly: first
  compute a confidence interval (l, u) for log θ, and then the
  confidence interval for θ is (e^l, e^u).

16
Example – Odds ratio
In a study, patients admitted with lung cancer in the preceding year were queried
about their smoking behavior.
For each of the 709 patients admitted, the investigators recorded the smoking
behavior of a noncancer patient at the same hospital of the same gender and within
the same 5-year age group (a smoker was defined as a person who had smoked at least
one cigarette a day for at least a year).

Evaluate the strength of association by calculating the odds ratio and its 95%
confidence interval. Is the odds ratio significantly different from 1?

17
Cont

From the example,

θ̂ = (688 × 59) / (650 × 21) ≈ 2.97

The odds of lung cancer for smokers are about three times the odds for
non-smokers.

The asymptotic standard error of the log odds ratio is:

ASE(log θ̂) = sqrt(1/688 + 1/650 + 1/21 + 1/59) ≈ 0.26

The 95% CI for log θ is 1.09 ± 1.96 × 0.26 = (0.58, 1.60).

The 95% CI for θ is (e^0.58, e^1.60) = (1.79, 4.95).

Since the 95% confidence interval does not include one, the odds ratio is
significantly different from 1. The odds of lung cancer for the smoking group are
between 1.79 and 4.95 times those of the non-smoking group.
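The calculation above can be reproduced in a few lines of Python (illustration only; the slides suggest R for such computations):

```python
import math

# Smoking / lung-cancer counts: rows = smoker yes/no, cols = case/control.
n11, n12 = 688, 650
n21, n22 = 21, 59

or_hat = (n11 * n22) / (n12 * n21)                  # sample odds ratio
se_log = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)   # ASE of log(OR)

lo = math.exp(math.log(or_hat) - 1.96 * se_log)
hi = math.exp(math.log(or_hat) + 1.96 * se_log)

print(round(or_hat, 2))            # ≈ 2.97
print(round(lo, 2), round(hi, 2))  # ≈ 1.79 4.95
```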

18
Relative risk

Relative risk is the ratio of the probability of an outcome in an
exposed group to the probability of the outcome in an unexposed
group.

The relative risk equals:

RR = π1 / π2, estimated by (n11 / n1+) / (n21 / n2+)

An estimated standard error for log RR̂:

SE(log RR̂) = sqrt((1 − p1)/(n1+ p1) + (1 − p2)/(n2+ p2))
19
Example – Relative risk

In the smoking and lung cancer example, the proportions having lung cancer were
p1 = 688/1338 ≈ 0.51 for smokers and p2 = 21/80 ≈ 0.26 for non-smokers.

The sample relative risk is RR̂ = 0.514 / 0.263 ≈ 1.96.

Participants who smoke are 1.96 times as likely to develop lung cancer as
non-smokers.

An estimated standard error for log RR̂ is
sqrt((1 − 0.514)/(1338 × 0.514) + (1 − 0.263)/(80 × 0.263)) ≈ 0.19.

The 95% confidence interval for log RR is 0.67 ± 1.96 × 0.19 = (0.30, 1.04).

The 95% confidence interval for RR is (e^0.30, e^1.04) ≈ (1.35, 2.84).
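As with the odds ratio, this is straightforward to check in code (a Python sketch, not part of the slides):

```python
import math

# Smoking / lung-cancer counts again.
n11, n1 = 688, 1338   # smokers: cases / row total
n21, n2 = 21, 80      # non-smokers: cases / row total

p1, p2 = n11 / n1, n21 / n2
rr = p1 / p2
se_log = math.sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))

lo = math.exp(math.log(rr) - 1.96 * se_log)
hi = math.exp(math.log(rr) + 1.96 * se_log)

print(round(rr, 2))                # ≈ 1.96
print(round(lo, 2), round(hi, 2))  # ≈ 1.35 2.84
```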

20
Similarity between OR and RR

The relationship between the odds ratio and the relative risk:

OR = RR × (1 − π2) / (1 − π1)

When π1 and π2 are small, the odds ratio is approximately equal to the relative risk.

Both OR and RR can be estimated in prospective studies; in retrospective
(case-control) designs only the OR can be estimated directly.

When dealing with small probabilities, RR is easier to interpret.

21
Test of Independence

Tests of independence are needed for several kinds of data: large
sample size, small sample size, stratified data, paired data, and
ordinal data. Example tables:

           Lung Cancer
Smoker    Case   Control   Total
Yes       688    650       1338
No        21     59        80
Total     709    709       1418

          CVD   Non-CVD   Total
Obese     10    90        100
Not Obese 35    465       500
Total     45    555       600

          CVD   Non-CVD   Total
Obese     36    164       200
Not Obese 25    175       200
Total     61    339       400
22
Test of Independence

Pearson's chi-square test and the likelihood-ratio test are used for testing
independence by evaluating the closeness between observed and expected
frequencies.
– Assumption: large samples and independence of the individual observations.

– H0: the two variables are independent. Ha: they are not independent.

– μij denotes the expected frequencies, nij the observed.

Pearson chi-square statistic: X² = Σ (nij − μ̂ij)² / μ̂ij, where
μ̂ij = (row total × column total) / n

Likelihood-ratio statistic: G² = 2 Σ nij log(nij / μ̂ij)

Degrees of freedom: (I − 1)(J − 1)

X² and G² follow a chi-squared distribution asymptotically. The larger the values
of X² and G², the more evidence exists against independence. If the p-value is
less than the significance level, reject the null hypothesis.

23
Example – Chi-square test

In the example of the case-control study of lung cancer and smoking:

Is there a significant association between smoking and lung cancer?

H0: smoking and lung cancer are independent. Ha: they are not independent.

Assume significance level = 0.05.

Observed table                     Expected table

         Lung Cancer                        Lung Cancer
Smoker   Case   Control   Total    Smoker   Case   Control
Yes      688    650       1338     Yes      669    669
No       21     59        80       No       40     40
Total    709    709       1418     Total    709    709

24
Cont – Pearson Chi-square test

The Pearson statistic equals

X² = (688 − 669)²/669 + (650 − 669)²/669 + (21 − 40)²/40 + (59 − 40)²/40 ≈ 19.13

The degrees of freedom are (2 − 1)(2 − 1) = 1.

The p-value is about 1.2 × 10⁻⁵, less than 0.05, so we reject the null
hypothesis (calculated using 1 - pchisq(19.13, 1) in R).

At the 0.05 significance level, we reject the null hypothesis that smoking
and lung cancer are independent.

25
Cont – Likelihood ratio test

The likelihood-ratio statistic equals

G² = 2 [688 log(688/669) + 650 log(650/669) + 21 log(21/40) + 59 log(59/40)] ≈ 19.88

The degrees of freedom are (2 − 1)(2 − 1) = 1.

The p-value is less than 0.0001, below 0.05, so we reject the null hypothesis.

At the 0.05 significance level, we reject the null hypothesis that
smoking and lung cancer are independent.
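Both statistics can be computed directly from the observed table (a self-contained Python sketch; the slides use R, and the `erfc` identity below is just the df = 1 chi-square tail probability):

```python
import math

obs = [[688, 650], [21, 59]]
n = sum(map(sum, obs))
row = [sum(r) for r in obs]
col = [sum(obs[i][j] for i in range(2)) for j in range(2)]

# Expected counts under independence: row total * column total / n.
exp_ = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

x2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
         for i in range(2) for j in range(2))
g2 = 2 * sum(obs[i][j] * math.log(obs[i][j] / exp_[i][j])
             for i in range(2) for j in range(2))

# Chi-square tail probability for df = 1: P(Z^2 > x) = erfc(sqrt(x/2)).
p_x2 = math.erfc(math.sqrt(x2 / 2))

print(round(x2, 2), round(g2, 2))   # ≈ 19.13 19.88
print(p_x2 < 0.05)                  # True -> reject independence
```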

26
Properties

X² and G² have the same limiting chi-squared distribution (they are
asymptotically equivalent).

X² − G² converges in probability to zero.

Reordering the rows or columns does not change the result of a
chi-square or likelihood-ratio test of independence; the tests ignore
any ordering of the categories.

27
Small sample size

Fisher's exact test can be used for testing independence when the sample size is small.

– Assumptions: independence of the individual observations and fixed totals (the
  row and column totals are fixed, or "conditioned"). When the row or column totals
  are not actually fixed by design, the test is conservative and less powerful.

Under H0, the count n11 follows the hypergeometric probability mass function:

P(n11 = t) = [C(n1+, t) × C(n2+, n+1 − t)] / C(n, n+1)

This assigns an exact probability to each of the possible outcomes.

Calculate the p-value as the total probability of observing the data
and all more extreme tables. Reject the null hypothesis if the p-value is less
than the significance level.

28
Example – Fisher’s test for small size data
Example: Lady Tasting Tea
R. A. Fisher described the following experiment from his
days working at Rothamsted Experimental Station. His
colleague, Dr. Muriel Bristol declares that by tasting a
cup of tea made with milk she can discriminate whether
the milk or the tea infusion was first added to the cup.

Experimental design: eight cups of tea, four poured milk
first and four poured tea first, served in random order.
She knew there were four cups of each type and had to
predict which four had the milk added first.

29
Cont – Fisher’s test for small size data

Distinguishing the order of pouring better than pure guessing corresponds to θ > 1,
reflecting a positive association between the true order of pouring and her
prediction. Test H0: θ = 1 against Ha: θ > 1.

The possible tables (number of milk-first cups identified correctly) range from
all incorrect to all correct:

Correct:    0 4 | 1 3 | 2 2 | 3 1 | 4 0
Incorrect:  4 0 | 3 1 | 2 2 | 1 3 | 0 4

She identified three cups correctly. The probability of this table is 16/70; the
only table more extreme in the direction of Ha has all four correct, with
probability 1/70.
The p-value is p = 16/70 + 1/70 = 17/70 ≈ 0.243.
We cannot reject H0 at significance level 0.05, which means the result does not
establish an association between the actual order of pouring and her predictions.
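The hypergeometric probabilities behind this p-value are easy to enumerate (a Python sketch for illustration):

```python
from math import comb

# Lady-tasting-tea: 4 "milk first" cups among 8; she picks 4.
# P(k correct) follows the hypergeometric distribution.
def p_correct(k, m=4, n=8, picks=4):
    return comb(m, k) * comb(n - m, picks - k) / comb(n, picks)

probs = [p_correct(k) for k in range(5)]
print([round(p, 3) for p in probs])   # [0.014, 0.229, 0.514, 0.229, 0.014]

# One-sided p-value for an observed 3 correct: P(3) + P(4).
p_value = p_correct(3) + p_correct(4)
print(round(p_value, 3))              # 17/70 ≈ 0.243
```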

30
Stratified data

The Cochran-Mantel-Haenszel (CMH) test is used for the analysis of stratified or
matched categorical data.
– It is often used in observational studies where random assignment of subjects to
  different treatments cannot be controlled, but confounding covariates can be measured.

– H0: there is no association between the two inner variables in any stratum.

According to the stratification, create K 2×2 contingency tables, one per stratum:

       Y1   Y2   Total
X1
X2
Total

The test statistic is

CMH = [Σk (n11k − μ11k)]² / Σk Var(n11k)

where μ11k = n1+k n+1k / nk and Var(n11k) = n1+k n2+k n+1k n+2k / (nk²(nk − 1)).

Under H0, CMH asymptotically follows a chi-squared distribution with one degree of freedom.

31
Example – CMH test for stratified data

To examine the association between obesity and cardiovascular disease (CVD),
the data are stratified into two age categories, age < 50 and age ≥ 50:

Stratum 1                           Stratum 2
          CVD   Non-CVD   Total               CVD   Non-CVD   Total
Obese     10    90        100       Obese     36    164       200
Not Obese 35    465       500       Not Obese 25    175       200
Total     45    555       600       Total     61    339       400

H0: there is no association between obesity and CVD.

p-value = 0.08, so we fail to reject the null hypothesis that there is no
association between obesity and CVD.
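The CMH statistic can be assembled stratum by stratum; a Python sketch (not from the slides; note that a continuity correction is applied here, as R's `mantelhaen.test` does by default, which reproduces the p ≈ 0.08 quoted above):

```python
import math

# Obesity vs CVD, stratified by age. Each stratum: [[n11, n12], [n21, n22]].
strata = [[[10, 90], [35, 465]],
          [[36, 164], [25, 175]]]

num, var = 0.0, 0.0
for (a, b), (c, d) in strata:
    nk = a + b + c + d
    mu = (a + b) * (a + c) / nk                  # E(n11k) under H0
    num += a - mu
    var += (a + b) * (c + d) * (a + c) * (b + d) / (nk**2 * (nk - 1))

# CMH statistic with continuity correction.
cmh = (abs(num) - 0.5) ** 2 / var
p = math.erfc(math.sqrt(cmh / 2))   # chi-square df = 1 tail probability

print(round(cmh, 2), round(p, 2))   # ≈ 3.0, p ≈ 0.08 -> fail to reject
```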

32
Paired data

McNemar's test is used for comparing categorical responses for two
samples that are statistically dependent.
Such data commonly occur in studies with repeated measurements of subjects.

McNemar statistic with continuity correction:

χ² = (|n12 − n21| − 1)² / (n12 + n21)

For large samples, χ² has a chi-squared distribution with df = 1. If the
p-value is less than the significance level, reject the null hypothesis
that the two marginal distributions are the same.

33
Example – McNemar’s test for matched pair data

In the 2010 General Social Survey, subjects were asked whether they voted
Democrat or Republican in the 2004 and 2008 Presidential elections. Was
there a shift toward the Democrats?

The McNemar statistic is large, with df = 1 and a very small p-value:
extremely strong evidence of a shift in the Democratic direction.
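The statistic depends only on the two discordant cell counts. A Python sketch with hypothetical counts (the slide's actual numbers are not shown in this extract):

```python
import math

def mcnemar(n12, n21):
    """McNemar chi-square with continuity correction, df = 1."""
    stat = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
    p = math.erfc(math.sqrt(stat / 2))   # df = 1 tail probability
    return stat, p

# Hypothetical discordant counts for illustration only:
# n12 = switched Rep -> Dem, n21 = switched Dem -> Rep.
stat, p = mcnemar(54, 16)
print(round(stat, 2), p < 0.001)   # 19.56 True
```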

34
Ordinal data

The chi-square tests ignore some information when used to test
independence between ordinal classifications. Tests that take the
ordering into account are usually more powerful.

In ordinal data analysis, we can assign scores to the levels of the
ordinal variables using:
• the average of the category interval
• midranks

Then the linear trend test statistic can be used:

M² = (n − 1) r²

where r is the Pearson correlation between the two scored variables.

For large samples, M² is approximately chi-squared with df = 1.

35
Example – Linear trend test for ordinal data

491 subjects are cross-classified according to three
factors: hypertension (hyp; 2 levels), obesity (obe; 3 levels),
and alcohol (alc; 4 levels).
• Alc: classification of alcohol intake in drinks per day
  (0, 1-2, 3-5, 6+)
• Obe: classification of obesity (low, average, high)
• Hyp: classification of hypertension (yes, no)
Objective: test whether two ordinal variables are correlated.
H0: independence versus Ha: association.
• Use the linear trend test to test for independence between
  two variables.
• Check the p-value of the M² statistic with one degree of freedom.

Source: Knuiman, M.W. & Speed, T.P. (1988) Incorporating


Prior Information into the Analysis of Contingency
Tables. Biometrics, 44 (4), 1061–1071.

36
Cont

Assign scores to the levels of the ordinal variables:
• Average of the category interval
  – Obesity: low, average, high assigned to 1, 2, 3
  – High BP: no, yes assigned to 0, 1
  – Alcohol: 0, 1-2, 3-5, 6+ assigned to 0, 1.5, 4, 7

• Midrank
  Rank the observations and use the midranks as scores.
  – Obesity: 83 for low (average of ranks 1-165), 246 for average
    (ranks 166-326), 409 for high (ranks 327-491)

Create the contingency table (rows for obesity, columns for alcohol).

Compute the Pearson correlation r between the obesity and alcohol scores
and the statistic M² = (n − 1) r², then compare its p-value (0.022 here)
with the significance level and draw a conclusion.

At the 0.05 significance level, we conclude that there is an association
between obesity and alcohol.

37
Compare result with other statistics

p-values:
                     X²      G²      M²
Obesity vs Alcohol   0.325   0.317   0.022
High BP vs Alcohol   0.026   0.022   0.003
Obesity vs High BP   0.003   0.003   0.001

Sensitivity to the choice of scores:
• Scores that are linear transforms of each other, such as (1, 2,
  3, 4) and (0, 2, 4, 6), have the same absolute correlation and
  hence the same M².
• Results may depend on the scores when the data are highly
  unbalanced.

38
Summary of Statistical Testing Applied to Specific Cases

Large-sample independent data  Pearson chi-square test /
likelihood-ratio test

Small-sample conditioned data  Fisher’s exact test

Stratified data  Cochran-Mantel-Haenszel test

Paired data  McNemar’s test / CMH test

Ordinal data  Linear trend test

** Many more statistical tests than those listed can be applied to categorical data; compare them and
select the one that best fits your case. Check the assumptions before applying a test.

39
Bernoulli and Binomial distribution

Bernoulli trial: two possible outcomes for one trial (success, failure).
Let the probability of a success be p, so the probability of a failure is 1 − p.

Bernoulli probability mass function (PMF): P(Y = y) = p^y (1 − p)^(1−y), y = 0, 1

The binomial distribution with parameters n and p is the discrete probability
distribution of the number of successes in a sequence of n independent trials.

• Binomial: n Bernoulli trials, two possible outcomes for each trial.

• If Y follows a binomial distribution, we write Y ~ Bin(n, p).

Binomial PMF: P(Y = y) = C(n, y) p^y (1 − p)^(n−y), y = 0, 1, . . . , n

This represents the probability of having y successes in n independent trials.

40
Example – Binomial distribution

Three students are registered for a class. The decisions of whether to attend
class are independent across students. Assume the probability of attending class
is 0.5. What is the probability that none, one, two, or three students attend the class?

From the example, n = 3 and p = 0.5. Let Y be the number of students who attend.
The probability mass function is P(Y = y) = C(3, y) (0.5)^y (0.5)^(3−y).

The probability that no student attends class is P(Y = 0) = (0.5)³ = 0.125.

The probabilities of having 1, 2, 3 students attend are P(Y = 1) = 0.375,
P(Y = 2) = 0.375, P(Y = 3) = 0.125.

Since the sample size is 3, the probabilities of none, 1, 2, and 3 students
attending sum to 1.
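The full PMF for this example takes one line of Python (illustration only):

```python
from math import comb

# Three students each attend independently with probability 0.5.
n, p = 3, 0.5
pmf = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

print(pmf)        # [0.125, 0.375, 0.375, 0.125]
print(sum(pmf))   # 1.0
```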

41
Properties – Binomial distribution

• Y has mean np and variance np(1 − p).
• The sample proportion of successes, p̂ = Y/n, has mean p and standard
  deviation sqrt(p(1 − p)/n).

• If a trial has more than two possible outcomes, say c categories, the
  counts follow a multinomial distribution. The multinomial probability
  mass function is:

P(n1, . . . , nc) = [n! / (n1! · · · nc!)] π1^(n1) · · · πc^(nc)

42
Poisson distribution

The Poisson distribution expresses the probability of the number of events that
occur randomly over a fixed interval of time or space, when outcomes in
disjoint periods or regions are independent.

Examples: the number of emails you get in an hour; the number of red cars
passing by in an hour; the number of earthquakes in a year in some region.

Poisson probabilities depend on a single parameter, the mean μ.

Probability mass function:

P(Y = y) = e^(−μ) μ^y / y!,  y = 0, 1, 2, . . .

where

μ = mean number of occurrences in the given interval or region

e = the base of the natural logarithm ≈ 2.71828
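The PMF can be evaluated directly from this definition (a Python sketch; μ = 2 is an arbitrary illustrative rate):

```python
import math

def poisson_pmf(y, mu):
    # P(Y = y) = exp(-mu) * mu**y / y!
    return math.exp(-mu) * mu**y / math.factorial(y)

# e.g. an average of mu = 2 emails per hour
probs = [poisson_pmf(y, 2.0) for y in range(4)]
print([round(p, 3) for p in probs])   # [0.135, 0.271, 0.271, 0.18]
```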

43
Cont

Property: the Poisson distribution has equal mean and variance,
E(Y) = Var(Y) = μ.

Source: Poisson distribution
44
Limitation

Overdispersion: count observations often exhibit variability exceeding that
predicted by the binomial or Poisson.

When the mean varies across different conditions, the counts display more
variation than a single-parameter Poisson model allows.

The negative binomial is a related distribution for count data that has a
second parameter and permits the variance to exceed the mean:

E(Y) = μ,  Var(Y) = μ + Dμ²

where μ and the dispersion parameter D are parameters.

Alternatively, other methods such as generalized linear models can be used.

45
Generalized Linear Models

Generalized linear models (GLMs) extend ordinary regression models to
encompass non-normal response distributions and model functions of
the mean.
• Random component
  – Consists of a response variable Y with independent observations
    from a distribution in the natural exponential family.
• Systematic component
  – Specifies the explanatory variables used in a linear predictor function.
• Link function
  – A function g(μ) of the mean that is equated to the linear predictor.
46
Introduction to Logistic Regression (LR)

A logistic model describes data and explains the relationship between one
binary dependent variable and one or more categorical or continuous independent
variables.

Suppose we have a binary response variable Y and a single explanatory
variable x; the distribution of Y is Bernoulli with success probability π(x).

Logistic regression model:

π(x) = exp(α + βx) / [1 + exp(α + βx)]

As x increases, π(x) increases when β > 0 and decreases when β < 0.

The log odds has a linear relationship with x:

logit[π(x)] = log[π(x) / (1 − π(x))] = α + βx
47
Coefficient trend

A fixed change in x often has less impact on π when π is near 0
or 1 than when π is near 0.5.

Multiplicative effect
The odds multiply by e^β for every 1-unit increase in x.
In other words, e^β is an odds ratio: the odds at x + 1
divided by the odds at x.
48
Example – logistic model
From an epidemiological survey investigating snoring as a risk factor
for heart disease. The scores (0, 2, 4, 5) are used for the levels of
snoring, treating the last two levels as closer together. Build a
logistic model to describe the relationship.

Heart Disease
Snoring/ score Yes No
Never 0 24 1355
Occasionally 2 35 603
Nearly every night 4 21 192
Every night 5 30 224

49
Cont

Software reports the ML fit of the logistic regression model; for these
data the reported fit (Agresti) is logit(π̂) = −3.866 + 0.397x.

50
Logistic model interpretation

Interpretation:
• The positive β̂ reflects the increased incidence of heart disease
  at higher snoring levels.
• The estimated probability of heart disease is about 0.02 for
  non-snorers (calculated from π̂ = e^α̂ / (1 + e^α̂)); it increases to
  0.04 for occasional snorers, to 0.09 for those who snore nearly every
  night, and to 0.13 for those who always snore.
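These fitted probabilities follow from the reported coefficients (assuming the Agresti fit logit(π̂) = −3.866 + 0.397x; a Python sketch for illustration):

```python
import math

# Reported ML fit for the snoring data: logit(pi) = -3.866 + 0.397x.
a, b = -3.866, 0.397

def pi_hat(x):
    # inverse logit
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

for x in (0, 2, 4, 5):            # snoring scores
    print(x, round(pi_hat(x), 2))  # 0.02, 0.04, 0.09, 0.13
```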

51
Example: Multiple logistic regression
The following data are from a study on the effects of AZT in slowing
the development of AIDS symptoms. In the study, 338 veterans
whose immune systems were beginning to falter after infection with
HIV were randomly assigned either to receive AZT immediately or to
wait until their T cells showed severe immune weakness.

Symptoms
Race AZT Use Yes No
White Yes 14 93
No 32 81
Black Yes 11 52
No 12 43

52
Cont

Logistic regression for a binary response:

logit[P(Y = 1)] = α + β1 x + β2 z

where x represents AZT treatment and z represents race, predicting the
probability that AIDS symptoms develop.

In this model, we assume there is no interaction between race (z)
and AZT treatment (x): the effect of one factor is the same at each
level of the other factor.

e^β is the multiplicative effect on the odds of a 1-unit increase in
that predictor, keeping the other predictors fixed.

53
Cont

z = 1 for white race, 0 for black race
x = 1 for immediate AZT use, 0 otherwise
54
Multiple logistic model Interpretation

Parameter interpretation:
x   z   logit
1   1   α + β1 + β2
1   0   α + β1
0   1   α + β2
0   0   α

• α is the log odds of developing AIDS symptoms for black
  subjects without immediate AZT use.
• β1 is the increment to the log odds for those with immediate AZT
  use.
• β2 is the increment to the log odds for white subjects.
55
Cont

At a fixed level of z, the effect on the logit of changing
categories of x is:

[α + β1(1) + β2 z] − [α + β1(0) + β2 z] = β1

The estimated odds ratio between immediate AZT use and
development of AIDS symptoms equals e^β̂1 ≈ e^(−0.72) ≈ 0.49
(using the reported ML fit).

For each race, the estimated odds of symptoms are about half as
high for those who took AZT immediately.

The Wald confidence interval for this effect is e^(β̂1 ± 1.96 SE) ≈ (0.28, 0.84).
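Exponentiating the coefficient and its Wald limits can be sketched as follows (the coefficient and standard error below are the values reported in Agresti for this example, quoted here as an assumption):

```python
import math

# Reported AZT coefficient and its standard error (Agresti):
b1, se = -0.7195, 0.279

or_hat = math.exp(b1)                              # estimated odds ratio
ci = (math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))

print(round(or_hat, 2))                  # ≈ 0.49
print(round(ci[0], 2), round(ci[1], 2))  # ≈ 0.28 0.84
```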

56
Loglinear model

Poisson GLMs are used to model count or rate data for a single
nonnegative integer-valued response variable.

A Poisson loglinear GLM assumes a Poisson distribution for Y and uses the
log link.

The Poisson loglinear model with explanatory variable x is

log μ = α + βx

The mean satisfies the exponential relationship

μ = e^α (e^β)^x

A 1-unit increase in x has a multiplicative impact of e^β: the mean at x + 1 equals
the mean at x times e^β.
57
Example – Loglinear model

In this example, the response outcome for each of 173 female crabs is her
number of satellites. Explanatory variables are the female crab's color, spine
condition, weight, and carapace width. The table below shows a small set of the
data.

Create a model to predict the number of satellites using carapace width x. Let μ
be the expected number of satellites at width x. The ML fit of the Poisson
loglinear model (as reported by Agresti) is

log μ̂ = −3.305 + 0.164x

e^β̂ = e^0.164 ≈ 1.18 is the multiplicative effect on μ̂ of a 1-unit increase in
width: for example, at a width one cm larger than one with fitted mean 2.81, the
fitted mean equals 1.18 × 2.81 ≈ 3.32.
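The multiplicative-effect interpretation is easy to verify numerically (assuming the Agresti-reported fit −3.305 + 0.164x; Python sketch for illustration):

```python
import math

# Reported Poisson loglinear fit for the crab data (Agresti):
# log(mu-hat) = -3.305 + 0.164 * width
a, b = -3.305, 0.164

def mu_hat(width):
    return math.exp(a + b * width)

# A 1-unit increase in width multiplies the fitted mean by exp(b) ≈ 1.18,
# regardless of the starting width.
print(round(math.exp(b), 2))                  # 1.18
print(round(mu_hat(27.3) / mu_hat(26.3), 2))  # 1.18
```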

58
References
Agresti, A. (2018). An introduction to categorical data analysis. Wiley.

Sullivan, L. M. (2011). Essentials of biostatistics in public health. Jones & Bartlett Publishers.

Knuiman, M.W. & Speed, T.P. (1988) Incorporating Prior Information into the Analysis of
Contingency Tables. Biometrics, 44 (4), 1061–1071.

Peng Zeng (2012) Categorical Data Analysis - More discussions on Logistic Regression
http://webhome.auburn.edu/~carpedm/courses/stat7040/Review/06-logistic-more.pdf

Fenton, N., Neil, M. and Constantinou, A. (2015) Simpson’s Paradox and the implications for
medical trials
https://www.eecs.qmul.ac.uk/~norman/papers/simpson.pdf

Wikipedia: Cochran–Mantel–Haenszel statistics
https://en.wikipedia.org/wiki/Cochran–Mantel–Haenszel_statistics

59
