
Introduction to Statistical Analysis

Changyu Shen
Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology
Beth Israel Deaconess Medical Center
Harvard Medical School
Objectives
• Descriptive versus inferential procedures
• Parametric versus non‐parametric inferential
procedures
• Central limit theorem (CLT) and pivot quantity
• Inferential procedures
– Categories
– Univariate analysis
– Bivariate analysis
– Multivariate analysis
Descriptive versus Inferential
Procedures
• A descriptive statistic usually is the “sample
version” of the corresponding population
parameter
– Mean age in a sample is the sample version of the
mean age in the population
– The purpose is to get an approximate sense of the
population
• An inference procedure formally addresses the
uncertainty when using statistics to infer
parameters
– Interval estimation
– Hypothesis testing
Descriptive Statistics
• Continuous variables
– Measures of the center of the distribution
• Mean
• Median
– Measures of the dispersion of the distribution
• Standard deviation
• Interquartile range
• Range
– Measures of symmetry
• Skewness
– Measures of “fatness” of the tail
• Kurtosis
Descriptive Statistics

• Binary variables
– Proportion (proportion can also be viewed as
mean if event is coded as 1 and non‐event is
coded as 0)

• Categorical variables
– Proportions
Before data collection
– Parameter: determine the parameter of interest
– Experiment: determine the study design
– Procedure: determine the statistical method(s)

After data collection
– Statistic(s): compute the statistic(s)
– Interpretation: draw conclusion(s)


Formulating a Research Question
• Parameter
– Translate the research question into estimation/testing of
population parameter(s)
e.g. For the evaluation of a medical intervention, how to
measure efficacy?
– If there are multiple choices, which to pick?
• Interpretability
• Variability
• Practicality
• Sometimes we need to transform the parameter to another quantity
– It is easier to draw statistical inference on the transformed quantity
– Then transform back to the original scale (see the sketch below)
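A brief illustration of this idea (not from the slides), using the odds ratio as an example: inference is easier on the log scale, where the estimator is approximately normal, and the confidence limits are then exponentiated back.

```latex
% Illustrative sketch: 95% CI built for ln(OR-hat), then transformed back.
\[
\ln(\widehat{OR}) \pm 1.96\,\widehat{SE}\bigl(\ln \widehat{OR}\bigr)
\;\;\Longrightarrow\;\;
\Bigl(\widehat{OR}\,e^{-1.96\,\widehat{SE}},\;\widehat{OR}\,e^{+1.96\,\widehat{SE}}\Bigr)
\]
```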
Parametric Inferential Methods

• Assumptions about the underlying distributions of the variables (collectively called a model)
– Shape of distributions, e.g. normal distribution,
Poisson distribution
– Relationship between the mean of a variable and the
values of other variables, e.g. linear, quadratic
– Correlation structure among variables
• Advantages
– Convenience in defining the parameters of interest;
e.g. slope in a linear regression
– More efficient (better precision)
Parametric Inferential Methods

• Disadvantages
– Model misspecification renders the parameter not meaningful
– Bias
Nonparametric Inferential Methods

• Fewer assumptions
• Many of them are based on ranks
– Signed rank test (NP version of paired T)
– Wilcoxon rank‐sum/Mann‐Whitney U test (NP version
of two‐sample T)
– Kruskal‐Wallis test (NP version of ANOVA)
– Friedman test (NP version of repeated ANOVA)
• Advantage
– Robust
• Disadvantage
– Less efficient
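A minimal sketch (not from the slides) of how these rank-based tests are run in practice, assuming NumPy and SciPy are available; the data below are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(120, 10, size=30)          # e.g. SBP before treatment
after = before - rng.normal(3, 8, size=30)     # e.g. SBP after treatment
group_a = rng.normal(120, 10, size=40)
group_b = rng.normal(115, 10, size=40)
group_c = rng.normal(118, 10, size=40)

# Paired data: Wilcoxon signed-rank test (NP analogue of the paired T test)
print(stats.wilcoxon(before, after))

# Two independent groups: Wilcoxon rank-sum / Mann-Whitney U test
print(stats.mannwhitneyu(group_a, group_b))

# Three or more independent groups: Kruskal-Wallis test (NP analogue of ANOVA)
print(stats.kruskal(group_a, group_b, group_c))
```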
Central Limit Theorem (CLT)

• The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets large
• It is the foundational theorem of most of the
widely adopted statistical inference
procedures that rely on large sample sizes
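A minimal simulation sketch (not from the slides, NumPy only): even when the underlying variable is skewed, the sampling distribution of the sample mean is close to normal for a moderately large sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sim = 50, 10_000

# Draw 10,000 samples of size 50 from a skewed (exponential) distribution
# and keep the sample mean of each sample.
sample_means = rng.exponential(scale=2.0, size=(n_sim, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())  # close to 2.0
print("SD of sample means  :", sample_means.std())   # close to 2.0 / sqrt(50)
# A histogram of sample_means would look approximately normal.
```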
Pivot Quantity
• θ: parameter of interest
• θ̂: a statistic that serves as a point estimator of θ (usually the sample version of θ)
• SE(θ̂): the estimated standard error of θ̂
• Pivot: T = (θ̂ − θ) / SE(θ̂)
• Unique feature: the sampling distribution of the pivot does not depend on θ.
Example (sample size = 21 in each sample):
• Scenario 1: population mean SBP = 120 mmHg
– T = (sample mean of SBP − 120) / estimated standard error
– Sample 1: T1 = 1.32; Sample 2: T2 = 0.27; Sample 3: T3 = −0.89; …; Sample m: Tm = −0.11
• Scenario 2: population mean SBP = 115 mmHg
– T = (sample mean of SBP − 115) / estimated standard error
– Sample 1: T1 = 0.87; Sample 2: T2 = 1.84; Sample 3: T3 = −1.14; …; Sample m: Tm = −0.33
• Scenarios 1 and 2 have the same sampling distribution for T
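A minimal simulation sketch of the SBP example above (not from the slides, NumPy only); the population standard deviation is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim, sd = 21, 20_000, 15  # sample size 21 as in the slides; SD of 15 assumed

def pivot_draws(true_mean):
    """Draw many samples and return T = (sample mean - true mean) / estimated SE."""
    samples = rng.normal(true_mean, sd, size=(n_sim, n))
    xbar = samples.mean(axis=1)
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)
    return (xbar - true_mean) / se

t1 = pivot_draws(120.0)   # scenario 1
t2 = pivot_draws(115.0)   # scenario 2

# The quantiles are essentially identical: the pivot's distribution does not
# depend on the population mean (here it is a t distribution with 20 df).
print(np.percentile(t1, [2.5, 50, 97.5]))
print(np.percentile(t2, [2.5, 50, 97.5]))
```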
Pivot Quantity
• Often the common distribution of T = (θ̂ − θ)/SE(θ̂) can be approximated by the standard normal distribution based on the CLT; then
– We can make statements like Pr(−1.96 ≤ (θ̂ − θ)/SE(θ̂) ≤ 1.96) ≈ 0.95 → Pr(θ̂ − 1.96·SE(θ̂) ≤ θ ≤ θ̂ + 1.96·SE(θ̂)) ≈ 0.95 (confidence interval)
– We can compute the null distribution of the test statistic (hypothesis testing)
Three Foundational Elements of
Statistical Inference

• Sampling distribution: defines how uncertainty is measured
• CLT: allows computation of the sampling distribution
• Pivot: bridges statistics and parameters
Procedures: Category “Normal” (“N”)

• Sampling distribution of the pivot is standard normal or approximately standard normal
– Estimation: θ̂ ± z(α/2)·SE(θ̂), where z(α/2) = 1.96 for a 95% CI
– Hypothesis testing: Z = (θ̂ − θ0)/SE(θ̂); two-sided: reject H0 if |Z| > z(α/2); one-sided: reject H0 if Z > z(α) (or Z < −z(α))
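A minimal sketch (not from the slides) of a generic category "N" procedure, assuming SciPy is available: given a point estimate θ̂ (`theta_hat`) and its estimated standard error (`se`), form a normal-approximation confidence interval and a Z test.

```python
from scipy import stats

def z_interval_and_test(theta_hat, se, theta_null=0.0, conf=0.95):
    """Normal-approximation CI and two-sided Z test for a single parameter."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)          # 1.96 for 95%
    ci = (theta_hat - z * se, theta_hat + z * se)
    z_stat = (theta_hat - theta_null) / se
    p_two_sided = 2 * stats.norm.sf(abs(z_stat))
    return ci, z_stat, p_two_sided

# Hypothetical numbers: estimate 1.8, SE 0.7, null value 0
print(z_interval_and_test(1.8, 0.7))
```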
Procedures: Category “T”

• Sampling distribution of the pivot is a T distribution
– Estimation: θ̂ ± t(α/2, df)·SE(θ̂)
– Hypothesis testing: T = (θ̂ − θ0)/SE(θ̂); two-sided: reject H0 if |T| > t(α/2, df); one-sided: reject H0 if T > t(α, df) (or T < −t(α, df))
– t(α/2, df) depends on the confidence level/type I error and the sample size (through the degrees of freedom)
– When the sample size is large, the T distribution is essentially the same as the standard normal distribution
Procedure: Category “Other” (“O”)
• Alternative pivots and sampling distributions of the
pivots
– Chi‐square distribution for the pivot used to infer
population variance
– F distribution for the pivot used to infer ratio of the
variances of two populations
–…
• Non‐pivot based approaches
– When the sampling distribution of a statistic only depends
on one parameter (which is of interest), exact method is
possible
– Bootstrap
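A minimal sketch (not from the slides, NumPy only) of a nonparametric percentile bootstrap, here for the median of a skewed sample; the data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=5.0, size=80)      # hypothetical skewed data

# Resample the data with replacement many times and recompute the median
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])

ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"sample median = {np.median(data):.2f}, "
      f"95% bootstrap CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```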
Univariate Analysis for Continuous
Variables
• Population mean μ
– x̄: sample mean; SE(x̄) = s/√n: estimated standard error of the mean (SEM)
a) If the variable is normally distributed: T test and
T interval (category T)
b) If the sample size is large: Z test and Z interval
(category N)
c) Other (category O)
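A minimal sketch (not from the slides) of case (a), assuming SciPy is available: a one-sample T test and T interval for a population mean, with made-up SBP data and a hypothetical null value of 120.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sbp = rng.normal(122, 14, size=25)              # hypothetical SBP measurements

# Category T: test H0: mean = 120
t_res = stats.ttest_1samp(sbp, popmean=120)
print(t_res.statistic, t_res.pvalue)

# 95% T interval for the population mean
n, xbar, sem = sbp.size, sbp.mean(), stats.sem(sbp)
lo, hi = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=sem)
print(f"95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```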
Univariate Analysis for Continuous
Variables
• Population median (for skewed distributions)
– Sign test
e.g. H0: median = 10 versus Ha: median > 10
H0: the proportion of units with value > 10 is 0.5
Ha: the proportion of units with value > 10 is greater than 0.5
Essentially a test of a proportion (see the sketch below)
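A minimal sketch (not from the slides) of the sign test carried out as an exact binomial test on the proportion of values above 10; it assumes SciPy ≥ 1.7 (for `binomtest`) and uses made-up data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.exponential(scale=14.0, size=40)        # hypothetical skewed data

x = x[x != 10]                                  # drop values exactly at the null median
k = int(np.sum(x > 10))                         # number of values above 10

# H0: median = 10  <=>  H0: Pr(value > 10) = 0.5, tested against "greater"
res = stats.binomtest(k, n=x.size, p=0.5, alternative="greater")
print(k, x.size, res.pvalue)
```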
Univariate Analysis for Binary Variables

• Proportion: a proportion can be viewed as a mean by coding "event" as 1 and "non‐event" as 0
a) If the sample size is large: Z test and Z interval
(category N)
b) Exact method based on binomial distribution for
small sample sizes (category O)
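A minimal sketch (not from the slides) contrasting the two cases, assuming NumPy and SciPy are available; the counts (18 events out of 120 subjects) and the null value 0.10 are made up.

```python
import numpy as np
from scipy import stats

events, n = 18, 120
p_hat = events / n

# (a) Category N: Wald (normal-approximation) 95% interval for the proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)
print("Wald 95% CI:", (p_hat - 1.96 * se, p_hat + 1.96 * se))

# (b) Category O: exact test of H0: p = 0.10 based on the binomial distribution
print("exact p-value:", stats.binomtest(events, n, p=0.10).pvalue)
```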
Univariate Analysis for Time to Event
Variables
• Time to event variables
– Time to death
– Time to MI
• Feature: potential right censoring
e.g. follow‐up is not available after time t, so we know the subject is event‐free at time t but do not know exactly when the event occurs after t
• Need special technique for analysis
Univariate Analysis for Time to Event
Variables
• Survival probability beyond time t: S(t)
– Ŝ(t): Kaplan‐Meier estimator (category N)
– Key assumption: at any given time point, those who are censored and those who remain under follow‐up have the same survival probability
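A minimal sketch (not from the slides) of the Kaplan‐Meier estimator, assuming the `lifelines` package is installed; the durations and event indicators are made up, with event = 0 meaning the subject was right-censored.

```python
import numpy as np
from lifelines import KaplanMeierFitter

durations = np.array([5, 8, 12, 12, 15, 21, 30, 42, 42, 55])   # e.g. months of follow-up
events    = np.array([1, 1,  0,  1,  1,  0,  1,  0,  1,  0])   # 1 = event, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)

print(kmf.survival_function_)   # estimated S(t) at each observed time
print(kmf.predict(24))          # estimated probability of surviving beyond t = 24
```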
Bivariate Analysis: Two Continuous
Variables
• Pearson correlation coefficient (PCC)
– Takes value in [‐1,1]
– Measures linear relationships
– The two variables do NOT necessarily need to be
normal
– PCC=0 does not necessarily mean the two
variables are independent
– r: sample correlation coefficient
– Inference is category N
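A minimal sketch (not from the slides) of the sample Pearson correlation and its test, assuming SciPy is available; the data are simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=1.0, size=100)   # linearly related with noise

r, p_value = stats.pearsonr(x, y)               # sample PCC and its p-value
print(f"r = {r:.2f}, p = {p_value:.3g}")
```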
Bivariate Analysis: Two Continuous
Variables
• Simple linear regression
– Model: Y = β0 + β1·X + ε
• ε is the error term, independent of X
• Mean of ε is 0
• Sometimes ε is assumed to be normal
– Examines how changes in X affect the mean of Y
– Slope (β1): the amount of change in the mean value of Y for every one‐unit increase in X
– Point estimate: least squares method
– Inference is category N or T
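A minimal sketch (not from the slides) of simple linear regression by least squares, assuming SciPy is available; the variables (age, SBP) and their relationship are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(40, 80, size=60)                    # e.g. age
y = 100 + 0.6 * x + rng.normal(scale=8, size=60)    # e.g. SBP

fit = stats.linregress(x, y)                        # least-squares fit of Y on X
print(f"intercept = {fit.intercept:.1f}, slope = {fit.slope:.2f}")
print(f"SE(slope) = {fit.stderr:.2f}, p-value for the slope = {fit.pvalue:.3g}")
```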
Bivariate Analysis: Two Binary
Variables
• Odds ratio: OR = [Pr(Y=1|X=1)/Pr(Y=0|X=1)] / [Pr(Y=1|X=0)/Pr(Y=0|X=0)]
– Example: the odds of death in one group divided by the odds of death in the other group
– Always greater than 0
– A measure of the strength of association when the event rate is relatively low (e.g. the death rate is low)
– Point estimate: the sample odds ratio
– Inference is category N or category O (exact method for small sample sizes)
Bivariate Analysis: Two Binary
Variables
• Relative risk: RR = Pr(Y=1|X=1) / Pr(Y=1|X=0)
– Example: the probability of death in one group divided by the probability of death in the other group
– Always greater than 0
– A measure of the strength of association when the event rate is relatively high (e.g. the disease‐free rate is high)
– Point estimate: the sample relative risk
– Inference is category N
Bivariate Analysis: Two Binary
Variables
• Risk difference: RD = Pr(Y=1|X=1) − Pr(Y=1|X=0)
– Example: the probability of death in one group minus the probability of death in the other group
– Takes values in [−1, 1]
– Point estimate: the sample risk difference
– Inference is category N or O (equivalent)
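A minimal sketch (not from the slides, NumPy only) computing the sample OR, RR, and RD from a hypothetical 2×2 table, with a 95% CI for the OR built on the log scale and transformed back (category N).

```python
import numpy as np

# Hypothetical 2x2 table: rows exposed / unexposed, columns event / no event
a, b = 30, 70      # exposed:   events, non-events
c, d = 15, 85      # unexposed: events, non-events

odds_ratio = (a / b) / (c / d)
relative_risk = (a / (a + b)) / (c / (c + d))
risk_difference = a / (a + b) - c / (c + d)

# Normal-approximation CI for ln(OR), then exponentiate back
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)

print(f"OR = {odds_ratio:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"RR = {relative_risk:.2f}, RD = {risk_difference:.2f}")
```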
Bivariate Analysis: One Continuous and
One Binary Variable
• Treat the continuous variable as outcome (the
binary variable defines the group)
– Paired data
• Mean of the difference: paired T test/T CI (category T) or Z
test/Z CI (category N)
• Median of the difference: Wilcoxon signed‐rank test
(category N)
– Non‐paired data
• Mean of the difference: T test/T CI (category T) or Z test/Z CI
(category N)
• Median of the difference: Wilcoxon rank‐sum test (category
N)
Bivariate Analysis: One Continuous and
One Binary Variable
• Treat the binary variable as outcome, the
analysis becomes logistic regression
(discussed later)
Bivariate Analysis: One of the Two is
Time to Event Variable
• Usually the time to event variable is the
outcome
• When the other variable is binary
– Log‐rank test to test whether the two survival curves are the same at every time point (category N and O); see the sketch below
• Cox proportional hazard regression model
(discussed later)
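A minimal sketch (not from the slides) of the log‐rank test comparing survival between two groups, assuming the `lifelines` package is installed; the durations and event indicators are made up.

```python
import numpy as np
from lifelines.statistics import logrank_test

t_a = np.array([5, 8, 12, 20, 25, 31, 40])   # group A follow-up times
e_a = np.array([1, 1,  0,  1,  1,  0,  1])   # 1 = event, 0 = censored
t_b = np.array([3, 6,  7, 11, 15, 22, 28])   # group B follow-up times
e_b = np.array([1, 1,  1,  1,  0,  1,  1])

result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(result.test_statistic, result.p_value)
```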
Multivariate Analysis: Regression
Model
• A statistical model is a set of assumptions about
the underlying stochastic process that generated
the data
• A regression model describes the relationship between dependent random variables and independent variables (observed or unobserved)
• Regression models form a general framework that covers many statistical methods (e.g. T test, ANOVA)
Objectives of Regression Models

• Understand association between a set of independent variables and the dependent variables
• Causal inference
• Predictions
Continuous Outcome: Linear
Regression
• Model: Y = β0 + β1·X1 + β2·X2 + ⋯ + βp·Xp + ε
• ε is the error term, independent of all the X's
• Mean of ε is 0
• Sometimes ε is assumed to be normal
• βk measures the amount of change in the mean value of Y for every one‐unit increase in Xk when the other X's are fixed
• Point estimate: least squares method
• Inference is category N or T
• Once the β's are estimated, the model can be used for prediction
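A minimal sketch (not from the slides) of a multiple linear regression fit by least squares, assuming the `statsmodels` package is installed; the variables (age, BMI, SBP) and coefficients are made up.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 200
age = rng.uniform(40, 80, size=n)
bmi = rng.normal(27, 4, size=n)
sbp = 90 + 0.5 * age + 0.8 * bmi + rng.normal(scale=10, size=n)

X = sm.add_constant(np.column_stack([age, bmi]))   # intercept + two covariates
model = sm.OLS(sbp, X).fit()                       # least-squares fit

print(model.params)      # estimated betas (intercept, age, bmi)
print(model.conf_int())  # confidence intervals (category N/T inference)
```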
Binary Outcome: Logistic Regression

• Model: ln[Pr(Y=1)/(1 − Pr(Y=1))] = β0 + β1·X1 + ⋯ + βp·Xp
• βk measures the amount of change in the odds of Y=1 (on the logarithm scale) for every one‐unit increase in Xk when the other X's are fixed; when Xk is binary, exp(βk) is the odds ratio
• Point estimate: maximum likelihood estimation
• Inference is category N or O
• Once the β's are estimated, the model can be used for prediction
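A minimal sketch (not from the slides) of a logistic regression fit by maximum likelihood, assuming `statsmodels` is installed; the covariates (age, treatment) and coefficients are made up.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 500
age = rng.uniform(40, 80, size=n)
treated = rng.integers(0, 2, size=n)
logit = -6 + 0.06 * age - 0.5 * treated
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))      # simulated binary outcome

X = sm.add_constant(np.column_stack([age, treated]))
fit = sm.Logit(y, X).fit(disp=0)                   # maximum likelihood fit

print(fit.params)                # betas on the log-odds scale
print(np.exp(fit.params[1:]))    # exp(beta): odds ratios for age and treatment
```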
Time to Event Outcome: Cox Model

• Hazard at time t, h(t): roughly, the number of subjects who have the event at time t divided by the number of subjects still at risk at time t
• Model: h(t | X) = h0(t) · exp(β1·X1 + ⋯ + βp·Xp)
• Interpretation
– h0(t) is the "baseline hazard", corresponding to the group of subjects with all X equal to 0
– The essence of the model is that the X's affect the hazard (on the logarithm scale) in a linear manner
Time to Event Outcome: Cox Model
• Cox model is also called the proportional hazard model
Consider the model with one binary covariate X: h(t | X) = h0(t)·exp(β·X)
The hazard ratio of the group with X = 1 to the group with X = 0 is exp(β), which does not depend on t
• βk measures the amount of change in the hazard (on the logarithm scale) for every one‐unit increase in Xk when the other X's are fixed; when Xk is binary, exp(βk) is the hazard ratio
• Point estimate: maximum partial likelihood estimation
• Inference is category N
• Once the β's are estimated, the model can be used for prediction
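A minimal sketch (not from the slides) of a Cox proportional hazards fit, assuming the `lifelines` package is installed; it uses the Rossi recidivism dataset that ships with `lifelines` ('week' is follow-up time, 'arrest' is the event indicator).

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                 # example data bundled with lifelines

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")   # partial likelihood fit

# exp(coef) is the estimated hazard ratio for a one-unit increase in each covariate
print(cph.summary[["coef", "exp(coef)", "p"]])
```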
Summary

• Descriptive versus inferential procedures
– Descriptive: "rough" idea of the population
– Inferential: formally address uncertainty
• Parametric versus non‐parametric inference
procedures
– Parametric: more efficient; less robust
– Nonparametric: more robust; less efficient
Summary

• Three elements of statistical inference
– Sampling distribution
– Central limit theorem
– Pivot
• Analysis
– Parameter(s)
– Assumptions
– Procedures
– Interpretations
