Firth Method (PMLE)

Logistic regression with rare events:
problems and solutions
Georg Heinze
Medical University of Vienna
Supported by the Austrian Science Fund FWF (I2276-N33)
Barcelona, 27 September 2017
[email protected] @Georg__Heinze http://prema.mf.uni-lj.si http://cemsiis.meduniwien.ac.at/en/kb
Georg Heinze – Logistic regression with rare events

CeMSIIS-Section for Clinical Biometrics
Rare events: examples
Medicine:
• Side effects of treatment: 1/1000s to fairly common
• Hospital-acquired infections: 9.8/1000 patient-days
• Epidemiologic studies of rare diseases: 1/1000 to 1/200,000
Engineering:
• Rare failures of systems: 0.1-1/year
Economy:
• E-commerce click rates: 1-2/1000 impressions
Political science:
• Wars, election surprises, vetoes: 1/dozens to 1/1000s
…
Problems with rare events
• 'Big' studies needed to observe enough events
• Difficult to attribute events to risk factors
• Low absolute number of events
• Low event rate
Our interest
• Models
• for prediction of binary outcomes
• should be interpretable,
i.e., betas should have a meaning
→ explanatory models
Logistic regression
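$\Pr(Y = 1 \mid x) = [1 + \exp(-x\beta)]^{-1}$
• Leads to odds ratio interpretation of $\exp(\beta_1)$:
$\exp(\beta_1) = \frac{\Pr(Y = 1 \mid x_1 + 1) / \Pr(Y = 0 \mid x_1 + 1)}{\Pr(Y = 1 \mid x_1) / \Pr(Y = 0 \mid x_1)}$
• Likelihood: $L(\beta \mid X) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$, where $\pi_i = \Pr(Y_i = 1 \mid x_i)$
• Its nth root: probability of correct prediction
• How well can we estimate $\beta$ if events ($y_i = 1$) are rare?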
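As a small illustration of the likelihood and its nth root, the sketch below evaluates both for four hypothetical predicted probabilities (the numbers and function names are assumptions for illustration, not taken from the talk). The nth root is the geometric mean of the probabilities the model assigns to the outcomes actually observed.

```python
import math

def likelihood(pi, y):
    """Bernoulli likelihood: product of pi_i^y_i * (1 - pi_i)^(1 - y_i)."""
    return math.prod(p if yi == 1 else 1.0 - p for p, yi in zip(pi, y))

# Hypothetical predicted probabilities and observed outcomes (illustration only)
pi = [0.9, 0.8, 0.1, 0.2]
y = [1, 1, 0, 0]

L = likelihood(pi, y)                 # 0.9 * 0.8 * 0.9 * 0.8
nth_root = L ** (1.0 / len(y))        # geometric mean probability of the observed outcome
print(f"L = {L:.4f}, nth root = {nth_root:.4f}")
```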
Rare event problems…
Not much gain!
Logistic regression with 5 variables:

• estimates are unstable (large MSE) because of few events
• removing some 'non-events' does not affect precision
Penalized likelihood regression
$\log L^*(\beta) = \log L(\beta) + A(\beta)$
Imposes priors on model coefficients, e.g.
• $A(\beta) = -\lambda \sum_j \beta_j^2$ (ridge: normal prior)
• $A(\beta) = -\lambda \sum_j |\beta_j|$ (LASSO: double exponential)
• $A(\beta) = \frac{1}{2} \log \det I(\beta)$ (Firth-type: Jeffreys prior)
in order to
• avoid extreme estimates and stabilize variance (ridge)
• perform variable selection (LASSO)
• correct small-sample bias in $\beta$ (Firth-type)
Firth's penalization for logistic regression
In exponential family models with canonical parametrization the Firth-type penalized likelihood is given by
$L^*(\beta) = L(\beta) \det(I(\beta))^{1/2}$,
where $I(\beta)$ is the Fisher information matrix and $L(\beta)$ is the likelihood. The factor $\det(I(\beta))^{1/2}$ is Jeffreys invariant prior.
Firth-type penalization
• removes the first-order bias of the ML-estimates of $\beta$,
• is bias-preventive rather than corrective,
• is available in software packages such as SAS, R, Stata…

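In logistic regression, the penalized likelihood is given by
$L^*(\beta) = L(\beta) \det(X^T W X)^{1/2}$, with
$W = \mathrm{diag}(\mathrm{expit}(X_i \beta)(1 - \mathrm{expit}(X_i \beta))) = \mathrm{diag}(\pi_i (1 - \pi_i))$.
• Firth-type estimates always exist.
$W$ is maximised at $\pi_i = 1/2$, i.e. at $\beta = 0$, thus
• predictions are usually pulled towards $1/2$,
• coefficients towards zero.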
• Separation of outcome classes by covariate values (Figs. from Mansournia et al 2017)
• Firth's bias reduction method was proposed as a solution to the problem of separation in logistic regression (Heinze and Schemper, 2002)
• Penalized likelihood has a unique mode
• It prevents infinite coefficients from occurring

Bias reduction also leads to reduction in MSE:
• Rainey, 2017: Simulation study of LogReg for political science
'Firth's method dominates ML in bias and MSE'
However, the predictions get biased…
• Elgmati et al, 2015
… and anti-shrinkage could occasionally arise:
• Greenland and Mansournia, 2015

Firth's logistic regression
For logistic regression with one binary regressor*, Firth's bias correction amounts to adding 1/2 to each cell:

                   original        augmented (Firth-type penalization)
                   A      B        A      B
Y=0                44     4        44.5   4.5
Y=1                1      1        1.5    1.5
event rate         0.04            ~0.058
OR (B vs. A)       11              9.89
av. pred. prob.                    0.054

* Generally: for saturated models
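The numbers in this table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration (the dictionary layout and function names are assumptions): it adds 1/2 to each cell and recomputes the event rate, the odds ratio and the average predicted probability, the latter by applying the augmented-data risks to the original group sizes.

```python
def odds_ratio(table):
    """Odds ratio of B vs. A for a 2x2 table stored as {(y, group): count}."""
    return (table[(1, "B")] / table[(0, "B")]) / (table[(1, "A")] / table[(0, "A")])

def event_rate(table):
    """Share of all observations with Y=1."""
    events = sum(n for (y, _), n in table.items() if y == 1)
    return events / sum(table.values())

original = {(0, "A"): 44, (0, "B"): 4, (1, "A"): 1, (1, "B"): 1}
# Firth-type augmentation for a saturated model: add 1/2 to every cell
augmented = {cell: n + 0.5 for cell, n in original.items()}

or_ml = odds_ratio(original)
or_firth = odds_ratio(augmented)

# Average predicted probability: group risks from the augmented table,
# averaged over the group sizes of the original table
groups = ("A", "B")
size = {g: original[(0, g)] + original[(1, g)] for g in groups}
risk = {g: augmented[(1, g)] / (augmented[(0, g)] + augmented[(1, g)]) for g in groups}
avg_pred = sum(size[g] * risk[g] for g in groups) / sum(size.values())

print(f"OR: {or_ml:.2f} -> {or_firth:.2f}")
print(f"event rate: {event_rate(original):.3f} -> {event_rate(augmented):.3f}")
print(f"average predicted probability: {avg_pred:.3f}")
```

The average predicted probability (about 0.054) exceeds the observed event rate of 0.04, which is the prediction bias mentioned on the earlier slide.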

Example of Greenland 2010

                   original                augmented
                   A      B      total     A      B      total
Y=0                315    5      320       315.5  5.5    321
Y=1                31     1      32        31.5   1.5    33
total              346    6      352       347    7      354
event rate         0.091                   0.093
OR (B vs. A)       2.03                    2.73

Greenland, AmStat 2010

Greenland example: likelihood, prior, posterior

Bayesian non-collapsibility: anti-shrinkage from penalization
• Prior and likelihood modes do not 'collapse': the posterior mode exceeds both
• The 'shrunken' estimate is larger than the ML estimate
• How can that happen?

An even more extreme example from Greenland 2010
• 2x2 table:

                   X=0    X=1    total
Y=0                25     5      30
Y=1                5      1      6
total              30     6      36

• Here we immediately see that the odds ratio = 1 ($\beta_1 = 0$)
• But the estimate from augmented data: odds ratio = 1.26 (try it out!)

Greenland, AmStat 2010
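One way to try it out is to maximise the Jeffreys-penalized likelihood numerically and compare the result with the add-1/2 shortcut. The sketch below assumes NumPy and uses a plain Fisher-scoring iteration on Firth's modified score without step-halving; the function name is hypothetical, and this is not the implementation of any particular package.

```python
import numpy as np

def firth_logistic(X, y, max_iter=200, tol=1e-10):
    """Firth-type logistic regression via Fisher scoring on the modified score."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        info_inv = np.linalg.inv(X.T @ (w[:, None] * X))      # (X'WX)^(-1)
        # diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        step = info_inv @ (X.T @ (y - pi + h * (0.5 - pi)))   # modified score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Expand the 2x2 table (X=0: 25 non-events, 5 events; X=1: 5 non-events, 1 event)
counts = [25, 5, 5, 1]
x = np.repeat([0.0, 0.0, 1.0, 1.0], counts)
y = np.repeat([0.0, 1.0, 0.0, 1.0], counts)
X = np.column_stack([np.ones_like(x), x])

or_firth = float(np.exp(firth_logistic(X, y)[1]))
or_add_half = (1.5 / 5.5) / (5.5 / 25.5)      # odds ratio after adding 1/2 to each cell
print(f"penalized fit: {or_firth:.3f}, add-1/2 shortcut: {or_add_half:.3f}")
```

Both routes give an odds ratio of about 1.26 although the observed odds ratio is exactly 1.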

Simulating the example of Greenland
• We should distinguish Bayesian non-collapsibility (BNC) in a single data set from a systematic increase in bias of a method (in simulations)

                   X=0    X=1    total
Y=0                315    5      320
Y=1                31     1      32
total              346    6      352

• Simulation of the example:
• Fixed groups x=0 and x=1, P(Y=1|X) as observed in example
• True value: log OR = 0.709

Parameter                       ML     Jeffreys-Firth
Bias of $\beta_1$               *      +18%
RMSE of $\beta_1$               *      0.86
Bayesian non-collapsibility            63.7%

* Separation causes $\hat\beta_1$ to be undefined (= ∞) in 31.7% of the cases

• To overcome Bayesian non-collapsibility, Greenland and Mansournia (2015) proposed not to impose a prior on the intercept
• They suggest a log-F(1,1) prior for all other regression coefficients
• The method can be used with conventional frequentist software because it uses a data-augmentation prior
Greenland and Mansournia, StatMed 2015

logF(1,1) prior (Greenland and Mansournia, 2015)
Penalizing by log-F(1,1) prior gives $L^*(\beta) = L(\beta) \cdot \prod_{j} \frac{e^{\beta_j/2}}{1 + e^{\beta_j}}$, where the product runs over all coefficients except the intercept.
This amounts to the following modification of the data set (the first column is the constant):

const  x1  x2  y
1      ∗   ∗   ∗
1      ∗   ∗   ∗
1      ∗   ∗   ∗
1      ∗   ∗   ∗     original data, each row assigned a weight of 1
1      ∗   ∗   ∗
1      ∗   ∗   ∗
1      ∗   ∗   ∗
0      1   0   0
0      1   0   1     pseudo data, each row assigned a weight of ½
0      0   1   0
0      0   1   1

• No shrinkage for the intercept, no rescaling of the variables
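The data-augmentation recipe can be sketched with an ordinary weighted maximum-likelihood fit. The example below assumes NumPy, applies the recipe to Greenland's 2x2 table (315/31 in group A, 5/1 in group B) stored as aggregated rows, and uses a hand-written Newton-Raphson routine; the function name and data layout are assumptions for illustration.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def weighted_logistic_ml(X, y, w, max_iter=200, tol=1e-10):
    """Weighted maximum-likelihood logistic regression by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = expit(X @ beta)
        info = X.T @ ((w * pi * (1.0 - pi))[:, None] * X)
        step = np.linalg.solve(info, X.T @ (w * (y - pi)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Aggregated rows: constant, x, y, weight (= cell count of the original table)
original = np.array([[1, 0, 0, 315], [1, 0, 1, 31], [1, 1, 0, 5], [1, 1, 1, 1]], dtype=float)
# Pseudo data for the penalized coefficient: constant 0, x = 1, y = 0 and 1, weight 1/2 each
pseudo = np.array([[0, 1, 0, 0.5], [0, 1, 1, 0.5]])
augmented = np.vstack([original, pseudo])

beta_ml = weighted_logistic_ml(original[:, :2], original[:, 2], original[:, 3])
beta_lf = weighted_logistic_ml(augmented[:, :2], augmented[:, 2], augmented[:, 3])

# Check: score of the original data plus gradient of the log-F(1,1) log-prior is zero
pi = expit(original[:, :2] @ beta_lf)
score = original[:, :2].T @ (original[:, 3] * (original[:, 2] - pi))
penalized_score = score + np.array([0.0, 0.5 - expit(beta_lf[1])])
print(f"log OR: ML {beta_ml[1]:.3f}, log-F(1,1) {beta_lf[1]:.3f}")
```

Because the pseudo rows carry a constant of 0, they contribute nothing to the intercept's score equation, which is how the intercept stays unpenalized.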

• Re-running the simulation with the log-F(1,1) method yields:

Parameter                       ML     Jeffreys-Firth   logF(1,1)
Bias of $\beta_1$               *      +18%             -52%
RMSE of $\beta_1$               *      0.86             1.05
Bayesian non-collapsibility            63.7%            0%

* Separation causes $\hat\beta_1$ to be undefined (= ∞) in 31.7% of the cases

Other, more subtle occurrences of Bayesian non-collapsibility
• Ridge regression: normal prior around 0
• usually implies bias towards zero,
• But:
• With correlated predictors with different effect sizes, for some predictors the bias can be away from zero

Simulation of bivariable log reg models
• $X_1, X_2 \sim \mathrm{Bin}(0.5)$ with correlation $r = 0.8$, $n = 50$
• $\beta_1 = 1.5$, $\beta_2 = 0.1$, ridge parameter $\lambda$ optimized by cross-validation

Parameter                       ML                Ridge (CV λ)   Log-F(1,1)   Jeffreys-Firth
Bias of β1                      +40% (+9%*)       -26%           -2.5%        +1.2%
RMSE of β1                      3.04 (1.02*)      1.01           0.73         0.79
Bias of β2                      -451% (+16%*)     +48%           +77%         +16%
RMSE of β2                      2.95 (0.81*)      0.73           0.68         0.76
Bayesian non-collapsibility                       25%            28%          23%

* excluding 2.7% separated samples
Anti-shrinkage from penalization?
Bayesian non-collapsibility/anti-shrinkage
• can be avoided in univariable models, but there is no general rule to avoid it in multivariable models
• Likelihood penalization can often decrease RMSE (even with occasional anti-shrinkage)
• Likelihood penalization ≠ guaranteed shrinkage

Reason for anti-shrinkage
• We look at the association of X and Y
• We could treat the source of data as a 'ghost factor' G
• G=0 for original table
• G=1 for pseudo data
• We ignore that the conditional association of X and Y is confounded by G

Example of Greenland 2010 revisited

                   original                augmented
                   A      B      total     A      B      total
Y=0                315    5      320       315.5  5.5    321
Y=1                31     1      32        31.5   1.5    33
total              346    6      352       347    7      354

To overcome both the overestimation and anti-shrinkage problems:
• We propose to adjust for the confounding by including the 'ghost factor' G in a logistic regression model

FLAC: Firth's Logistic regression with Added Covariate
Split the augmented data into the original and pseudo data, indexed by the ghost factor G:

        augmented           original (G=0)     pseudo (G=1)
        A       B           A       B          A       B
Y=0     315.5   5.5         315     5          0.5     0.5
Y=1     31.5    1.5         31      1          0.5     0.5

OR (B vs. A) in the original data: 2.03
Define Firth-type Logistic regression with Added Covariate as an analysis including the ghost factor as added covariate:
OR (B vs. A) = 1.84

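Beyond 2x2 tables:
Firth-type penalization can be obtained by solving modified score equations:
$\sum_{i=1}^{n} \left( y_i - \pi_i + h_i \left( \frac{1}{2} - \pi_i \right) \right) x_{ir} = 0; \quad r = 0, \ldots, p$
where the $h_i$'s are the diagonal elements of the hat matrix $H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}$.
They are equivalent to:
$\sum_{i=1}^{n} (y_i - \pi_i) x_{ir} + \sum_{i=1}^{n} h_i \left( \frac{1}{2} - \pi_i \right) x_{ir} = 0$
$\sum_{i=1}^{n} (y_i - \pi_i) x_{ir} + \sum_{i=1}^{n} \frac{h_i}{2} (y_i - \pi_i) x_{ir} + \sum_{i=1}^{n} \frac{h_i}{2} (1 - y_i - \pi_i) x_{ir} = 0$
• A closer inspection yields:
  first sum: the original data
  second sum: original data, weighted by $h_i/2$
  third sum: data with reversed outcome, weighted by $h_i/2$
The second and third sums together form the pseudo data.
Ghost factor ('added covariate'): G=0 for the original data, G=1 for the pseudo data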
FLAC estimates can be obtained by the following steps:
1) Define an indicator variable discriminating between original and pseudo data.
2) Apply ML on the augmented data including the indicator.
→ unbiased predicted probabilities
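The steps above can be sketched as follows, assuming NumPy. The pseudo data are built from the hat-matrix diagonals of a Firth fit: a copy of the data weighted by h_i/2 and a copy with reversed outcome weighted by h_i/2. All function names are hypothetical, and the Firth fit is a simple Fisher-scoring loop without step-halving. Applied to Greenland's 2x2 example, the sketch gives an odds ratio of about 1.84, as on the FLAC slide.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def hat_diagonal(X, beta):
    """Diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2), plus the inverse information."""
    pi = expit(X @ beta)
    w = pi * (1.0 - pi)
    info_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    return w * np.einsum("ij,jk,ik->i", X, info_inv, X), info_inv

def firth_fit(X, y, max_iter=200, tol=1e-10):
    """Solve Firth's modified score equations by Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        h, info_inv = hat_diagonal(X, beta)
        pi = expit(X @ beta)
        step = info_inv @ (X.T @ (y - pi + h * (0.5 - pi)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def weighted_ml(X, y, w, max_iter=200, tol=1e-10):
    """Weighted maximum-likelihood logistic regression by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = expit(X @ beta)
        info = X.T @ ((w * pi * (1.0 - pi))[:, None] * X)
        step = np.linalg.solve(info, X.T @ (w * (y - pi)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def flac_fit(X, y):
    """ML on original + pseudo data with the ghost factor G as added covariate."""
    n = len(y)
    h, _ = hat_diagonal(X, firth_fit(X, y))
    ghost = np.concatenate([np.zeros(n), np.ones(2 * n)])    # G=0 original, G=1 pseudo
    X_aug = np.column_stack([np.vstack([X, X, X]), ghost])
    y_aug = np.concatenate([y, y, 1.0 - y])                  # third copy: reversed outcome
    w_aug = np.concatenate([np.ones(n), h / 2.0, h / 2.0])
    return weighted_ml(X_aug, y_aug, w_aug)[:-1]             # drop the ghost coefficient

# Greenland's example: group A 315 non-events / 31 events, group B 5 / 1
counts = [315, 31, 5, 1]
x = np.repeat([0.0, 0.0, 1.0, 1.0], counts)
y = np.repeat([0.0, 1.0, 0.0, 1.0], counts)
X = np.column_stack([np.ones_like(x), x])
or_flac = float(np.exp(flac_fit(X, y)[1]))
print(f"FLAC odds ratio (B vs. A): {or_flac:.2f}")
```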

FLIC
Firth-type Logistic regression with Intercept Correction:
1. Fit a Firth logistic regression model
2. Modify the intercept in Firth-type estimates such that the average predicted probability becomes equal to the observed proportion of events.
→ unbiased predicted probabilities
→ effect estimates are the same as in Firth-type logistic regression
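A minimal sketch of the intercept correction, using the earlier 2x2 example (44/4/1/1) where the Firth estimates are available in closed form from the add-1/2 table. The bisection search and the function name are choices made for this illustration, not necessarily how any package computes the corrected intercept.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def flic_intercept(linear_part, y, lo=-30.0, hi=30.0, tol=1e-12):
    """Intercept for which the sum of predicted probabilities equals the number of events."""
    target = sum(y)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(expit(mid + lp) for lp in linear_part) < target:
            lo = mid      # predictions too low: raise the intercept
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Original data: group A (x=0) 44 non-events, 1 event; group B (x=1) 4 non-events, 1 event
x = [0] * 45 + [1] * 5
y = [0] * 44 + [1] + [0] * 4 + [1]

# Firth estimates from the table with 1/2 added to each cell
b0_firth = math.log(1.5 / 44.5)
b1_firth = math.log((1.5 / 4.5) / (1.5 / 44.5))

mean_before = sum(expit(b0_firth + b1_firth * xi) for xi in x) / len(x)
b0_flic = flic_intercept([b1_firth * xi for xi in x], y)
mean_after = sum(expit(b0_flic + b1_firth * xi) for xi in x) / len(x)
print(f"average predicted probability: {mean_before:.3f} -> {mean_after:.3f}")
```

The slope, and hence the odds ratio of 9.89, is untouched; only the intercept moves, so the average predicted probability drops from about 0.054 to the observed 0.04.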

Simulation study: the set-up
We investigated the performance of FLIC and FLAC, simulating 1000 data sets for 45 scenarios with:
• 500, 1000 or 1400 observations,
• event rates of 1%, 2%, 5% or 10%,
• 10 covariables (6 categorical, 4 continuous), see Binder et al., 2011,
• none, moderate and strong effects of positive and mixed signs
Main evaluation criteria: bias and RMSE of predictions and effect estimates

Other methods for accurate prediction
In our simulation study, we compared FLIC and FLAC to the following methods:
• weakened Firth-type penalization (Elgmati 2015), $L^*(\beta) = L(\beta) \det(X^T W X)^{\tau}$ with $\tau = 0.1$ (WF)
• ridge regression (RR)
• penalization by log-F(1,1) priors (LF)
• penalization by Cauchy priors with scale parameter 2.5 (CP)

Cauchy priors (CP)
Cauchy priors (scale=2.5) have heavier tails than log-F(1,1)-priors.
We follow Gelman 2008:
• all variables are centered,
• binary variables are coded to have a range of 1,
• all other variables are scaled to have standard deviation 0.5,
• the intercept is penalized by Cauchy(0,10).
This is implemented in the function bayesglm in the R-package arm.
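The rescaling rules listed above can be sketched in a few lines, assuming NumPy. This is only an illustration of the rules as stated on the slide, not the code of bayesglm; the function name is hypothetical, the sample standard deviation (ddof=1) is an assumption, and constant columns are not handled.

```python
import numpy as np

def rescale_covariate(column):
    """Center a covariate; binary ones get a range of 1, all others a standard deviation of 0.5."""
    column = np.asarray(column, dtype=float)
    levels = np.unique(column)
    if levels.size == 2:
        scaled = (column - levels[0]) / (levels[1] - levels[0])   # recode to 0/1
    else:
        scaled = column / (2.0 * column.std(ddof=1))              # sd becomes 0.5
    return scaled - scaled.mean()

# Hypothetical covariates for illustration
binary = rescale_covariate([1, 2, 2, 1, 2])
continuous = rescale_covariate([3.0, 7.5, 1.2, 9.9, 4.4, 6.1])
print(binary, continuous.std(ddof=1))
```

Because the covariates are rescaled before the prior is applied, the fit depends on how the variables are coded, which is one reason the later comparison lists CP as not transformation-invariant.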

Simulation results
• Bias of $\hat\beta$: clear winner is Firth method; FLAC, logF, CP: slight bias towards 0
• RMSE of $\hat\beta$:
  equal effect sizes: ridge the winner
  unequal effect sizes: very good performance of FLAC and CP, closely followed by logF(1,1)
• Calibration: often FLAC the winner; considerable instability of ridge
• Bias and RMSE of $\hat\pi$: see following slides

[Figure: bias and RMSE of predictions]

Comparison
FLAC
• No tuning parameter
• Transformation-invariant
• Often best MSE, calibration
Ridge
• Standardization is standard
• Tuning parameter: no confidence intervals
• Not transformation-invariant
• Performance decreases if effects are very different
Bayesian methods (CP, logF)
• CP: in-built standardization, no tuning parameter
• logF(m,m): choose m by '95% prior region' for parameter of interest; m=1 for wide prior, m=2 less vague
• (in principle, m could be tuned as in ridge)
• logF: easily implemented
• CP and logF are not transformation-invariant

Confidence intervals
It is important to note that:
• With penalized (=shrinkage) methods one cannot achieve nominal coverage over all possible parameter values
• But one can achieve nominal coverage averaging over the implicit prior
• Prior-penalty correspondence can be established a priori if there is no tuning parameter
• Important to use profile penalized likelihood method
• Wald method ($\hat\beta \pm 1.96 \cdot \mathrm{SE}$) depends on unbiasedness of estimate
Gustafson & Greenland, StatScience 2009
Conclusion
• We recommend FLAC for:
• Good performance
• Invariance to transformations or coding
• Cannot be ‘outsmarted’ by creative coding

References
• Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statistics in Medicine 2002.
• Mansournia M, Geroldinger A, Greenland S, Heinze G. Separation in logistic regression: causes, consequences and control. American Journal of Epidemiology 2017.
• Puhr R, Heinze G, Nold M, Lusa L, Geroldinger A. Firth's logistic regression with rare events: accurate effect estimates and predictions? Statistics in Medicine 2017.
Please cf. the reference lists therein for all other citations of this presentation.
Further references:
• Gustafson P, Greenland S. Interval estimation for messy observational data. Statistical Science 2009, 24:328-342.
• Rainey C. Estimating logit models with small samples. www.carlislerainey.com/papers/small.pdf (27 March 2017)

