Discrimination and Calibration by Terry Therneau
Terry Therneau
August 5, 2022
1 Introduction
This vignette is in progress and not yet complete.
2 Discrimination
2.1 Pseudo $R^2$ measures
There have been many attempts to define an overall goodness-of-fit criterion for the Cox model
that would parallel the widely used $R^2$ of linear models. A direct analog is hampered
by the issues of censoring and scale. Censoring is the major technical impediment: how do we
define error for an observation known only to be $> t_i$? A potentially larger issue is that of scale:
if we have two $(t, \hat t)$ pairs of (3 months, 6 months) and (9 years, 10 years), which one
represents the greater error? On an absolute scale the second is the larger difference, but in a
clinical study the first may be more important. Simon xxx has a nice discussion of the issue.
Rather than a direct extension of $R^2$, subject to the issues raised above, the most common
approach has been based on pseudo-$R^2$ measures. These rewrite the linear model statistic in
another way, and then evaluate the alternate formula.
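\begin{align}
R^2 &= 1 - \frac{\sum (y_i - \hat y_i)^2}{\sum (y_i - \bar y)^2} \nonumber \\
    &= 1 - \left( \frac{LL(\text{intercept})}{LL(\text{full})} \right)^{2/n} \tag{1} \\
    &= \frac{\mathrm{var}(\hat y)}{\mathrm{var}(\hat y) + \sigma^2} \tag{2}
\end{align}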
Equation (1) is the Cox and Snell formula. The ratio in it is a ratio of likelihoods, so in terms
of log-likelihoods $\ell$ it reads $1 - \exp\{2[\ell(\text{intercept}) - \ell(\text{full})]/n\}$.
For a Cox model, replace the linear model likelihood with the partial likelihood of the null and
fitted models. This gives the measure proposed by Nagelkerke [?], which was part of the standard
printout of coxph for many years. It has, however, been recognized as overly sensitive to censoring.
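As a minimal sketch of formula (1) on the log-likelihood scale, the following Python fragment evaluates the Cox and Snell measure from two log partial likelihoods. The function names and the numeric inputs are hypothetical, chosen only for illustration. The division by the largest attainable value is a commonly used rescaling and is included here as an assumption, not as something the text above specifies.

```python
import math

def cox_snell_r2(loglik_null, loglik_full, n):
    """Cox and Snell pseudo-R^2 from two log (partial) likelihoods.

    Equivalent to 1 - (L_null / L_full)^(2/n), with L the likelihood itself.
    """
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_full) / n)

def max_cox_snell_r2(loglik_null, n):
    """Largest value the measure can take, reached when L_full = 1."""
    return 1.0 - math.exp(2.0 * loglik_null / n)

# Hypothetical inputs, not taken from any real fit
ll_null, ll_full, n_obs = -200.0, -180.0, 100

r2 = cox_snell_r2(ll_null, ll_full, n_obs)
r2_max = max_cox_snell_r2(ll_null, n_obs)
r2_rescaled = r2 / r2_max  # assumed rescaling so that the upper limit is 1

print(r2, r2_max, r2_rescaled)
```

Because only the difference of the two log-likelihoods enters, the measure is a monotone function of the likelihood ratio test statistic divided by n.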
The measure of Kent and O'Quigley [?] is based on (2). Replace $\hat y$ with the Cox model linear
predictor $\eta = X\hat\beta$ and $\sigma^2$ by $\pi^2/6$. The latter is based on an extreme value
distribution, and the equivalence of the Cox model to a transformation model.
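The substitution just described can be sketched in a few lines of Python. This follows only the simplified form given above, var(η) / (var(η) + π²/6); the function name is hypothetical, and the use of the population variance (divisor n) is an assumption.

```python
import math
from statistics import pvariance

def r2_from_linear_predictor(eta, sigma2=math.pi ** 2 / 6):
    """Formula (2) with y-hat replaced by eta and sigma^2 by pi^2/6."""
    v = pvariance(eta)  # assumption: divisor n rather than n - 1
    return v / (v + sigma2)

# Hypothetical linear predictor values
eta_example = [-1.0, 0.0, 1.0]
print(r2_from_linear_predictor(eta_example))
```

A constant linear predictor has zero variance and so gives a value of 0, while the measure approaches 1 only as the spread of η grows without bound.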
Royston and Sauerbrei replace the risk scores $\eta$ with a normal-scores transform
\begin{align*}
s_i &= \Phi^{-1}\left( \frac{r_i - 3/8}{n + 1/4} \right) \\
r_i &= \mathrm{rank}(\eta_i)
\end{align*}
then re-fit the Cox model using $s$ as the single covariate. Since var(s) = 1 by design, the variance
of the normalized risk score will be captured by the coefficient $\beta^2$ from the re-fit, while the
Cox model estimate of the variance of $\beta$ is used as an estimate of variance. They then further
define a measure $D = \beta\sqrt{8/\pi}$. The rationale is that $E(s \mid s > 0) = \sqrt{2/\pi}$, so the
mean of the top half of the $s$ distribution lies $2\sqrt{2/\pi} = \sqrt{8/\pi}$ above the mean of the
bottom half, and D represents the log hazard ratio between a random draw from the bottom half of
the $s$ distribution and a random draw from the top half. This value is then "comparable" to the
log hazard ratio for a binary covariate such as treatment. (We prefer to use the 25th and 75th
quantiles of the risk score, untransformed, for this purpose, i.e., the middle of the top half versus
the middle of the bottom half, rather than mean(bottom) vs. mean(top).)
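The transform and the scaling constant can be checked numerically. The sketch below is an illustration in Python using only the standard library; the function names are hypothetical, ties in the risk scores are assumed absent, and the re-fit of the Cox model itself is not shown, so `royston_d` simply rescales a coefficient supplied by the caller.

```python
import math
from statistics import NormalDist, mean

def normal_scores(eta):
    """Replace each risk score by Phi^{-1}((rank - 3/8) / (n + 1/4))."""
    n = len(eta)
    order = sorted(range(n), key=lambda i: eta[i])  # assumes no ties
    scores = [0.0] * n
    for rank0, i in enumerate(order):
        rank = rank0 + 1
        scores[i] = NormalDist().inv_cdf((rank - 3 / 8) / (n + 1 / 4))
    return scores

def royston_d(beta):
    """D = beta * sqrt(8 / pi), with beta the coefficient from the re-fit."""
    return beta * math.sqrt(8 / math.pi)

# Hypothetical right-skewed risk scores
eta = [(i / 100.0) ** 2 for i in range(1000)]
s = normal_scores(eta)

ordered = sorted(s)
half = len(ordered) // 2
gap = mean(ordered[half:]) - mean(ordered[:half])
print(gap, math.sqrt(8 / math.pi))  # the two values should be close
```

Whatever the shape of the original risk scores, the transformed scores are symmetric about zero, which is exactly the property the following paragraph questions.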
An issue with the Royston and Sauerbrei approach is that although the risk scores from a fitted
Cox model will sometimes be approximately symmetric, there is no reason to assume that this
should be so. In medical data the risks are often right skewed: it is easier to have an extremely
high risk of death than an extremely low one (there are no immortals). For the well known
PBC data set, for instance, whose risk score has been validated in several independent studies,
the median-centered risk scores range from -2 to 5.4. Remember that even in the classic linear
model, $\hat\beta$ and the residuals are assumed to be Gaussian, but no such assumption is needed
for $y$ or $\hat y$; in fact such a distribution is uncommon. See Health, Normality and the Ghost of
Gauss [?] for a good discussion of this topic.
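Gönen and Heller [?] create a pseudo-concordance that also uses only the risk scores. It is
based on the fact that, if proportional hazards holds, then the times to event $t_i$ and $t_j$ for two
subjects satisfy
\begin{equation}
P(t_j > t_i) = \frac{1}{1 + \exp(\eta_j - \eta_i)} \tag{3}
\end{equation}
so that the subject with the larger risk score is the more likely to fail first.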
They then propose the estimate
\begin{equation}
C_{GH} = \frac{2}{n(n-1)} \sum_{\eta_i > \eta_j} \frac{1}{1 + \exp(\eta_j - \eta_i)} \tag{4}
\end{equation}
which can be translated from the (0, 1) concordance scale to a (-1, 1) range via
$R^2_{GH} = 2C_{GH} - 1$, if desired. If there is no relationship between X and t then β = 0 and
η = 0, leading to a concordance of 1/2. A claimed advantage of the GH measure over the usual
concordance is that it is not affected by censoring. A primary disadvantage is that it is based on
the assumption that the model is completely correct; the assumption that proportional hazards
holds for all time, even well beyond any observed data, is particularly unlikely.
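Estimate (4) depends only on the risk scores, which makes it easy to sketch. The Python function below is an illustration with a hypothetical name. One assumption goes beyond the formula as written: a pair with tied risk scores is included and contributes 1/(1 + exp(0)) = 1/2, so that identical scores give a concordance of 1/2, as the text describes for η = 0.

```python
import math
from itertools import combinations

def gonen_heller_c(eta):
    """Average of 1 / (1 + exp(eta_low - eta_high)) over all pairs.

    Assumption: tied pairs are kept and contribute exactly 1/2.
    """
    pairs = list(combinations(eta, 2))
    total = sum(1.0 / (1.0 + math.exp(-abs(a - b))) for a, b in pairs)
    return total / len(pairs)  # len(pairs) equals n(n-1)/2

# Hypothetical risk scores
print(gonen_heller_c([0.0, 0.0, 0.0]))
print(gonen_heller_c([-1.0, 0.0, 2.5]))
```

No event times or censoring indicators appear anywhere in the computation, which is both the claimed advantage and the source of the model dependence noted above.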
All of the above measures are computed by the royston command.
2.2 Concordance
The most commonly used measure of discrimination is the concordance. [merge with concordance
vignette]
3 Calibration
4 Computing IPC weights
4.1 Ties
Modification of the redistribute to the right (RTTR) algorithm for counting process data is
straightforward. First, only actual censorings cause a redistribution. For example, assume we
have a data set with time-dependent covariates, and a subject with three (time1, time2, status)
intervals of (0, 5, 0), (5, 18, 0) and (18, 25, 0). Only the last of these, time 25, is an actual
censoring time. For counting process data the rttright function will insist on an id statement to
correctly group rows for a subject, and makes use of survcheck to ensure that there are no gaps
or overlaps. (The survfit routine imposes the same restriction.)
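The distinction between an interval that merely ends and an actual censoring can be illustrated with a toy Python function. It is not the logic of rttright or survcheck; the function name and the row layout (id, time1, time2, status) are assumptions made for this sketch. Rows are grouped by id, checked for gaps or overlaps, and only the end of a subject's final interval is reported as a censoring time.

```python
from collections import defaultdict

def censoring_times(rows):
    """Map each censored subject id to its single actual censoring time.

    rows: iterable of (id, time1, time2, status), with status 0 meaning
    that no event occurred at time2.
    """
    by_id = defaultdict(list)
    for sid, t1, t2, status in rows:
        by_id[sid].append((t1, t2, status))

    result = {}
    for sid, intervals in by_id.items():
        intervals.sort()
        # consecutive intervals must join exactly: no gaps, no overlaps
        for (_, end, _), (start, _, _) in zip(intervals, intervals[1:]):
            if end != start:
                raise ValueError(f"gap or overlap for subject {sid!r}")
        _, last_time, last_status = intervals[-1]
        if last_status == 0:
            result[sid] = last_time
    return result

rows = [
    (1, 0, 5, 0), (1, 5, 18, 0), (1, 18, 25, 0),  # the subject from the text
    (2, 0, 10, 0), (2, 10, 12, 1),                # hypothetical subject with an event
]
print(censoring_times(rows))
```

Without the id grouping, times 5 and 18 for the first subject would be indistinguishable from censorings, which is why the grouping is required.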
A general form of the RTTR algorithm that allows for multiple states has the following
features:
1. Censoring weights $c_i(t)$ for each observation are explicitly a function of time, and sum to 1.
3. At any time, for any given state, the non-zero weights for all observations in that state are
proportional to prior case weights $w_i$.
The probability of being in state j is estimated as the sum of the weights of those currently in
state j:
\begin{equation*}
p_j(t) = \sum_{s_i = j} c_i(t)
\end{equation*}
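The sum above amounts to grouping the current weights by current state. A minimal Python sketch follows; the function name, the state labels and the weights are hypothetical values for a single time point t, with a censored observation carrying weight 0 after its weight has been redistributed.

```python
from collections import defaultdict

def state_probabilities(weights, states):
    """p_j(t): sum of the weights c_i(t) over observations with s_i = j."""
    p = defaultdict(float)
    for c, s in zip(weights, states):
        p[s] += c
    return dict(p)

# Hypothetical weights c_i(t) and states s_i at one time point
weights_t = [0.2, 0.0, 4 / 15, 4 / 15, 4 / 15]
states_t = ["dead", "alive", "dead", "alive", "alive"]
print(state_probabilities(weights_t, states_t))
```

Since the weights sum to 1 at every time, the estimated state probabilities do as well.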
For an absorbing state, i.e., one with no departures, the rebalancing step is not technically
necessary since it does not change the sum. As a practical matter in the code, there is actually
no need to rebalance until there is a departure of an observation to another state. This fits well
with a design decision of the survival package to not force declaration of the absorbing states by
the user, but to simply notice them as those states where no one departs.
For the case of a simple alive-dead or competing risks model, the weights generated by the
RTTR approach also have a simple representation as 1/G(t) for some censoring distribution
G. The keys that make it work for these two cases are that everyone starts in the same state
and that subjects are only censored from that state. There is in essence only one censoring
distribution to keep track of. There is no natural way (that we have seen) to reflect RTTR
weights that involve rebalancing as the inverse of a censoring distribution G.
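For the simple alive-dead case the agreement between RTTR weights and inverse censoring weights can be verified on a small example. The Python sketch below is an illustration under stated assumptions: equal case weights, no tied times, right-censored data without delayed entry, and hypothetical function names. It is not the rttright implementation. G is taken to be the Kaplan-Meier estimate of the censoring distribution, evaluated just before each event time.

```python
def rttr_weights(times, status):
    """Redistribute-to-the-right weights; status 1 = event, 0 = censored."""
    n = len(times)
    w = [1.0 / n] * n
    order = sorted(range(n), key=lambda i: times[i])  # assumes no ties
    for pos, i in enumerate(order):
        if status[i] == 0:
            later = order[pos + 1:]
            total = sum(w[k] for k in later)
            if total > 0:  # a final censoring has nobody to its right
                for k in later:
                    w[k] += w[i] * w[k] / total  # proportional to current weight
                w[i] = 0.0
    return w

def censoring_km_before(times, status, t):
    """G(t-): Kaplan-Meier estimate of the censoring distribution before t."""
    g = 1.0
    for u, d in sorted(zip(times, status)):
        if u >= t:
            break
        if d == 0:
            at_risk = sum(1 for v in times if v >= u)
            g *= 1.0 - 1.0 / at_risk
    return g

# Hypothetical data: events at 1, 3, 5 and censorings at 2, 4
times = [1, 2, 3, 4, 5]
status = [1, 0, 1, 0, 1]

w = rttr_weights(times, status)
ipc = [(1.0 / len(times)) / censoring_km_before(times, status, t) if d == 1 else 0.0
       for t, d in zip(times, status)]
print(w)
print(ipc)  # agrees with w up to rounding error
```

The cumulative sum of the event weights reproduces one minus the Kaplan-Meier estimate of survival, which is the single-state instance of the state probability sum given earlier.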