Research Analysis
Changyu Shen
Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology
Beth Israel Deaconess Medical Center
Harvard Medical School
Objectives
• Descriptive versus inferential procedures
• Parametric versus non‐parametric inferential
procedures
• Central limit theorem (CLT) and pivot quantity
• Inferential procedures
– Categories
– Univariate analysis
– Bivariate analysis
– Multivariate analysis
Descriptive versus Inferential
Procedures
• A descriptive statistic usually is the “sample
version” of the corresponding population
parameter
– Mean age in a sample is the sample version of the
mean age in the population
– The purpose is to get an approximate sense of the
population
• An inference procedure formally addresses the
uncertainty when using statistics to infer
parameters
– Interval estimation
– Hypothesis testing
Descriptive Statistics
• Continuous variables
– Measures of the center of the distribution
• Mean
• Median
– Measures of the dispersion of the distribution
• Standard deviation
• Interquartile range
• Range
– Measures of symmetry
• Skewness
– Measures of “fatness” of the tail
• Kurtosis
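The measures listed above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `describe` and the blood pressure readings are hypothetical, and skewness and kurtosis use the simple moment-based definitions (other software may apply small-sample corrections).

```python
import statistics

def describe(values):
    """Return common descriptive statistics for a continuous variable."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles (default method)
    # Central moments used for the moment-based skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": statistics.stdev(values),             # sample SD (n - 1 denominator)
        "iqr": q3 - q1,
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,                 # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,                   # 3 for a normal distribution
    }

# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
sbp = [112, 118, 120, 121, 125, 130, 134, 150]
summary = describe(sbp)
```

The single large reading pulls the mean above the median and gives a positive skewness, which is why the median and interquartile range are often reported for skewed data.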
Descriptive Statistics
• Binary variables
– Proportion (proportion can also be viewed as
mean if event is coded as 1 and non‐event is
coded as 0)
• Categorical variables
– Proportions
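As a small illustration of the 0/1 coding remark, the sketch below shows that the proportion of events equals the mean of the codes, and that a categorical variable is summarized by one proportion per level. All data and variable names are hypothetical.

```python
import statistics

# Hypothetical binary outcome: 1 = event, 0 = non-event.
events = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

proportion = sum(events) / len(events)      # share of units with the event
mean_of_codes = statistics.fmean(events)    # mean of the 0/1 codes

# Hypothetical categorical variable: one proportion per level.
blood_type = ["A", "O", "O", "B", "A", "O", "AB", "O"]
proportions = {level: blood_type.count(level) / len(blood_type)
               for level in sorted(set(blood_type))}
```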
Before data collection: determine the parameter of interest
• Disadvantages of parametric methods
– Model misspecification renders the parameter not
meaningful
– Bias
Nonparametric Inferential Methods
• Fewer assumptions
• Many of them are based on ranks
– Signed rank test (NP version of paired T)
– Wilcoxon rank‐sum/Mann‐Whitney U test (NP version
of two‐sample T)
– Kruskal‐Wallis test (NP version of ANOVA)
– Friedman test (NP version of repeated‐measures ANOVA)
• Advantage
– Robust
• Disadvantage
– Less efficient
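To show what "based on ranks" means in practice, here is a minimal sketch of the Wilcoxon rank‐sum test using a normal approximation. The helper names `midranks` and `rank_sum_test` are hypothetical, no tie correction is applied to the variance, and the approximation is rough for very small samples; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
import math
from statistics import NormalDist

def midranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2           # E[W] under the null hypothesis
    var_w = n1 * n2 * (n1 + n2 + 1) / 12      # Var[W] under the null hypothesis
    z = (w - mean_w) / math.sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Hypothetical data in which every value of x is below every value of y.
w, z, p = rank_sum_test([1, 2, 3], [4, 5, 6])
```

Only the ordering of the observations enters the statistic, so an extreme outlier changes the result no more than any other largest value would. This is the source of the robustness noted above.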
Central Limit Theorem (CLT)
(Figure: samples 1, ..., m are drawn repeatedly from a population and a statistic T is computed from each. Population with mean SBP = 120 mmHg: T2 = 0.27, T3 = -0.89, ..., Tm = -0.11. Population with mean SBP = 115 mmHg: T2 = 1.84, T3 = -1.14, ..., Tm = -0.33.)
• Sampling distribution: defines how uncertainty is measured
• CLT: computation of the sampling distribution
• Pivot: bridges statistics and parameters
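The role of the CLT can be checked by simulation. The sketch below is an illustration under stated assumptions: the population is exponential (clearly non-normal) with known mean 1, the pivot is the standardized sample mean, and the sample size, number of repetitions, and seed are arbitrary choices.

```python
import math
import random
import statistics

def standardized_mean(sample, mu):
    """Pivot T = (sample mean - mu) / (s / sqrt(n))."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))

rng = random.Random(0)      # fixed seed so the sketch is reproducible
mu = 1.0                    # mean of an Exponential(rate = 1) population
n, m = 200, 2000            # sample size and number of repeated samples

# One value of T per repeated sample, as in repeated sampling from a population.
pivots = [standardized_mean([rng.expovariate(1.0) for _ in range(n)], mu)
          for _ in range(m)]

# If the CLT approximation is good, about 95% of pivots fall within +/- 1.96.
coverage = sum(abs(t) <= 1.96 for t in pivots) / m
```

Although each observation is far from normal, the collection of T values should look roughly standard normal, which is what allows the pivot to link the statistic to the unknown parameter.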
Procedures: Category “Normal” (“N”)
• Two‐sided and one‐sided versions
Procedures: Category “T”
$H_0^*$: proportion of units with value > 10 = 0.5
$H_1^*$: proportion of units with value > 10 > 0.5
Essentially a test of proportion
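A test of this form is commonly called a sign test. The sketch below computes an exact one-sided p-value from the binomial distribution; the function name, the data, and the choice to drop values equal to the threshold are assumptions made for illustration.

```python
from math import comb

def sign_test_greater(values, threshold):
    """Exact one-sided test of H0: P(value > threshold) = 0.5 vs H1: > 0.5."""
    kept = [v for v in values if v != threshold]    # values equal to the threshold are dropped
    n = len(kept)
    k = sum(v > threshold for v in kept)            # number of units above the threshold
    # Under H0, k follows Binomial(n, 0.5); the p-value is P(K >= k).
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return k, n, p_value

# Hypothetical measurements, for illustration only.
data = [12, 15, 9, 11, 14, 13, 8, 16, 17, 12]
k, n, p = sign_test_greater(data, 10)
```

Here 8 of 10 values exceed 10, so the p-value is (45 + 10 + 1) / 1024, roughly 0.055.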
Univariate Analysis for Binary Variables
– Example: $p_1 - p_2$
– Takes value in [‐1,1]
– $\hat{p}_1 - \hat{p}_2$: sample risk difference
– Inference is category N or O (equivalent)
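A category N procedure for the risk difference can be sketched as a Wald (normal approximation) confidence interval. The function name and the counts are hypothetical, and the Wald interval is only one of several intervals in use.

```python
import math
from statistics import NormalDist

def risk_difference_ci(events1, n1, events2, n2, level=0.95):
    """Sample risk difference with a Wald (normal approximation) interval."""
    p1, p2 = events1 / n1, events2 / n2
    rd = p1 - p2                                    # always lies in [-1, 1]
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + level / 2)       # about 1.96 when level = 0.95
    return rd, (rd - z * se, rd + z * se)

# Hypothetical counts: 30/100 events in group 1 and 18/100 in group 2.
rd, (lower, upper) = risk_difference_ci(30, 100, 18, 100)
```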
Bivariate Analysis: One Continuous and
One Binary Variable
• Treat the continuous variable as outcome (the
binary variable defines the group)
– Paired data
• Mean of the difference: paired T test/T CI (category T) or Z
test/Z CI (category N)
• Median of the difference: Wilcoxon signed‐rank test
(category N)
– Non‐paired data
• Difference in means: T test/T CI (category T) or Z test/Z CI
(category N)
• Difference in medians: Wilcoxon rank‐sum test (category
N)
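The two T statistics above differ in what is averaged: within-pair differences for paired data, and group means for non-paired data. The sketch below computes both; the function names and data are hypothetical, and the two-sample version assumes equal variances (pooled estimate).

```python
import math
import statistics

def paired_t(before, after):
    """T statistic for the mean of within-pair differences (df = n - 1)."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    t = statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1

def two_sample_t(x, y):
    """Pooled-variance T statistic for the difference in means (df = n1 + n2 - 2)."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.fmean(x) - statistics.fmean(y)) / se
    return t, n1 + n2 - 2

# Hypothetical paired readings (same subjects measured twice).
t_paired, df_paired = paired_t([120, 130, 125, 140], [118, 125, 124, 133])

# Hypothetical independent groups.
t_two, df_two = two_sample_t([1, 2, 3], [2, 3, 4])
```

Each statistic is compared with a T distribution on the stated degrees of freedom (category T), or with the standard normal distribution when the sample is large (category N).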
Bivariate Analysis: One Continuous and
One Binary Variable
• When the binary variable is treated as the outcome,
the analysis becomes logistic regression
(discussed later)
Bivariate Analysis: One of the Two is a
Time to Event Variable
• Usually the time to event variable is the
outcome
• When the other variable is binary
– Log rank test to test if the two survival curves are
the same at every time point (category N and O).
• Cox proportional hazards regression model
(discussed later)
Multivariate Analysis: Regression
Model
• A statistical model is a set of assumptions about
the underlying stochastic process that generated
the data
• A regression model describes the relationship
between dependent random variables and
independent variables (observed or unobserved)
• The regression model is a general framework that
covers many statistical methods (e.g. T test,
ANOVA)
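One way to see that regression covers the two-sample comparison of means: with a single 0/1 covariate, the least-squares slope equals the difference in group means and the intercept equals the mean of the group coded 0. The sketch below checks this on hypothetical data; `ols_fit` is an illustrative helper, not a library function.

```python
import statistics

def ols_fit(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Hypothetical data: group indicator (0/1) and a continuous outcome.
group = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [118, 122, 120, 124, 128, 131, 127, 134]

b0, b1 = ols_fit(group, outcome)
mean0 = statistics.fmean([o for g, o in zip(group, outcome) if g == 0])
mean1 = statistics.fmean([o for g, o in zip(group, outcome) if g == 1])
```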
Objectives of Regression Models
• Model (logistic regression):
$\ln\dfrac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
• $\beta_j$ measures the amount of change in the odds of
Y=1 (at the logarithm scale) for every one unit
increase in $X_j$ when other X’s are fixed; when $X_j$
is binary, $\exp(\beta_j)$ is the odds ratio
• Point estimate: maximum likelihood estimation
• Inference is category N or O
• Once $\beta$’s are estimated, the model can be used
for prediction
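For a single binary covariate the logistic model can be worked by hand, which makes the odds ratio interpretation concrete. In the sketch below the 2x2 counts are hypothetical; because the model is saturated in this special case, the maximum likelihood estimates reproduce the observed group proportions, so no iterative fitting is needed. With several covariates a numerical optimizer would be required instead.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))

def expit(z):
    """Inverse of logit: maps a log odds value back to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical 2x2 table: counts of Y = 1 and Y = 0 within X = 1 and X = 0.
y1_x1, y0_x1 = 30, 70
y1_x0, y0_x0 = 15, 85

p1 = y1_x1 / (y1_x1 + y0_x1)        # observed P(Y = 1 | X = 1)
p0 = y1_x0 / (y1_x0 + y0_x0)        # observed P(Y = 1 | X = 0)

beta0 = logit(p0)                   # log odds in the X = 0 group
beta1 = logit(p1) - logit(p0)       # change in log odds from X = 0 to X = 1
odds_ratio = math.exp(beta1)

def predict(x):
    """Predicted P(Y = 1 | X = x) from the fitted coefficients."""
    return expit(beta0 + beta1 * x)
```

The resulting odds ratio equals the familiar cross-product ratio of the table, (30 × 85) / (70 × 15).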
Time to Event Outcome: Cox Model
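• Hazard
– Instantaneous event rate at time t among subjects still at risk:
(# of events at time t) / (# of subjects at risk at time t)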
• Model
$h(t \mid X) = h_0(t)\exp(\beta_1 X_1 + \cdots + \beta_p X_p)$
• Interpretation
– $h_0(t)$ is the “baseline hazard” corresponding to the
group of subjects with all X equal to 0
– The essence of the model is that the X’s affect the
hazard (at the logarithm scale) in a linear manner
Time to Event Outcome: Cox Model
• Cox model is also called the proportional hazards model
Consider the model with one binary covariate:
$h(t \mid X) = h_0(t)\exp(\beta X)$
The hazard ratio of the group with $X = 1$ to the group
with $X = 0$ is $\exp(\beta)$, which does not depend on t.
• $\beta_j$ measures the amount of change in the hazard (at the
logarithm scale) for every one unit increase in $X_j$ when
other X’s are fixed; when $X_j$ is binary, $\exp(\beta_j)$ is the hazard
ratio
• Point estimate: maximum partial likelihood estimation
• Inference is category N
• Once $\beta$’s are estimated, the model can be used for
prediction
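The proportional hazards property can be checked numerically. In the sketch below the Weibull form of the baseline hazard, its parameters, and the coefficient value are all hypothetical choices; the Cox model itself leaves the baseline hazard unspecified.

```python
import math

def baseline_hazard(t, shape=1.5, scale=10.0):
    """Hypothetical Weibull baseline hazard; any positive function of t would do."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def hazard(t, x, beta):
    """Hazard under the Cox model with one covariate: baseline times exp(beta * x)."""
    return baseline_hazard(t) * math.exp(beta * x)

beta = 0.7      # hypothetical log hazard ratio
times = (1.0, 5.0, 20.0)

# Ratio of the hazard in the X = 1 group to the X = 0 group at several times.
ratios = [hazard(t, 1, beta) / hazard(t, 0, beta) for t in times]
```

The hazard itself changes with t, but every ratio equals exp(0.7), about 2.01, because the baseline hazard cancels.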
Summary