Categorical Data Analysis
Categorical Data Analysis
Categorical Data Analysis
Spring 2006
In this session we cover the basics of modeling time-to-event (TTE) data using the R
software package. R is the free clone of S and can be downloaded from http://www.rproject.org/. The following topics will be addressed:
We consider a subset of the data from a study designed to assess the effect of a new
treatment on the time to infection of burn patients. The analyses presented are for
illustrative purposes only. The covariates included in the analyses we will consider
include (i) treatment (new=1; standard=0), (ii) female (1=female, 0=male), (iii) white
(1=white, 0=other), (iv) surface area burned, (v) burntype (1=chemical, 2=scald,
3=electric, 4=flame).
We need to load the survival library and the MASS library in R. Do this by running the
lines
library(survival)
library(MASS)
1.
Data are often stored in text files or in Excel files. If the data are in an Excel file then I
usually save the data in a .txt file in order to import it into R. We can import Excel files
into R but this can require some fussing (this is easy to do in Splus though). To import
the burn data I used the command
burndata <- read.table(file="C:\\Stat
222\\BurnData\\subsetoftheburndata.txt",header=T,sep="\t")
attach(burndata)
options(contrasts=c("contr.treatment","contr.poly"))
Of course, you need to change the path in order to import your data. The header=T
option tells R that the variable names are stored in the first row of my data set. If you
dont have variable names in the first row, set header=F and you can use the col.names
function to provide variable names.
2.
AFT models
Alternatively we could have created indicator variables for the burntype variable.
chemical
scald <electric
flame <-
<- burntype==1
burntype==2
<- burntype==3
burntype==4
We can compare the Weibull, log logistic, and log normal fits using AIC, where
AIC=-2loglik(MLE) +2p where p is the number of parameters in the model. The
ANOVA function will provide 2loglik(MLE) for each model (this can also be obtained
from the summaries). The log normal model yields the smallest AIC so well proceed
with that model.
anova(WeibullReg,LogNormalReg,LogLogisticReg)
-2loglik = 474.9971 Weibull
-2loglik = 465.2968 Log normal
-2loglik = 469.0688 Log logistic
There are no built in functions that will provide inferences for median TTEs or plot
survivor functions for the AFT model. To get inferences for median TTEs and for the
survivor function at a point, we need the covariance matrix Cov( , ). Unfortunately R
gives the covariance matrix for Cov( , * log ). To get this use,
LogNormalReg$var
To get the covariance matrix we want, Cov( , ), simply multiply all the elements in the
last row and the last column by .
covhatstar <- LogNormalReg$var
covhat <- covhatstar
covhat[,9] <- covhatstar[,9]* LogNormalReg$scale
covhat[9,] <- covhatstar[9,]* LogNormalReg$scale
Median TTEs:
Consider two burn patients with the following covariate combinations
x1=(1, treatment=1, female=0, white=1, surface area burned=20, burntype=4)
x2=(1, treatment=0, female=0, white=1, surface area burned=20, burntype=4).
The median TTE for an individual with covariate vector x under the log normal
regression model is exp(x).
x1 <- c(1,1,0,1,20,0,0,1)
x2 <- c(1,0,0,1,20,0,0,1)
medx1 <- exp(x1%*%LogNormalReg$coeff)
c(exp(x1%*%LogNormalReg$coeff-1.96*sqrt(matrix(c(x1,0),1,9)%*%covhat%*
%matrix(c(x1,0),9,1))), medx1,
exp(x1%*%LogNormalReg$coeff+1.96*sqrt(matrix(c(x1,0),1,9)%*%covhat%*
%matrix(c(x1,0),9,1))))
medx2 <- exp(x2%*%LogNormalReg$coeff)
c(exp(x2%*%LogNormalReg$coeff-1.96*sqrt(matrix(c(x2,0),1,9)%*%covhat%*
%matrix(c(x2,0),9,1))), medx2,
exp(x2%*%LogNormalReg$coeff+1.96*sqrt(matrix(c(x2,0),1,9)%*%covhat%*
%matrix(c(x2,0),9,1))))
For subject 1:
L
Median
31.29894
56.31643 101.33061
For subject 2:
L
Median
0.0
0.2
0.4
Surv Prob
0.6
0.8
1.0
t <- seq(0,50,0.1)
survx1 <- 1-pnorm((log(t)-x1%*%LogNormalReg$coeff)/LogNormalReg$scale)
survx2 <- 1-pnorm((log(t)-x2%*%LogNormalReg$coeff)/LogNormalReg$scale)
plot(t,survx1,type='l',xlab="Time",ylab="Surv Prob",ylim=c(0,1))
lines(t,survx2,lty=2)
10
20
30
40
50
Time
Fig 1: Survivor functions based on the log normal model for white males that had 20%
surface area burned by flame and that received the new treatment (solid line) or the
standard treatment (dashed line).
3.
The R function used to fit the PH model is coxph() and to fit the burn data we use
> PHmodel <- coxph(Surv(time, status) ~ newtrt + female + white +
saburned + factor(burntype), data = burndata)
> summary(PHmodel)
Call:
coxph(formula = Surv(time, status) ~ newtrt + female + white + saburned
+ factor(burntype), data = burndata)
n= 154
coef exp(coef) se(coef)
z
p
newtrt -0.6166
0.540 0.30333 -2.033 0.042
female -0.5341
0.586 0.39862 -1.340 0.180
white 2.2316
9.314 1.02956 2.167 0.030
saburned 0.0045
1.005 0.00727 0.619 0.540
factor(burntype)2 1.5213
4.578 1.09518 1.389 0.160
factor(burntype)3 2.0679
7.908 1.08898 1.899 0.058
factor(burntype)4 0.9651
2.625 1.02064 0.946 0.340
newtrt
female
white
saburned
factor(burntype)2
factor(burntype)3
factor(burntype)4
Rsquare= 0.147
(max possible=
Likelihood ratio test= 24.4 on
Wald test
= 19.8 on
Score (logrank) test = 23.3 on
0.942 )
7 df,
p=0.000952
7 df,
p=0.00594
7 df,
p=0.0015
We extract the covariance matrix for the MLE of as in the AFT model. Use
PHmodel$var
Only here R gives the covariance matrix we want. Relative risks for individuals with
covariate combinations x and x* are obtained as previously described in the AFT setting.
The relative risk is exp{(x-x*)} so we get a 95% CI for (x-x*) and then exponentiate
the endpoints.
We can use the built in function stepAIC in the MASS library for automatic model
selection. First we must run the following code
library(MASS)
The procedure removed the saburned variable first and then stopped. The final result was
<none>
- female
- factor(burntype)
+ saburned
+ newtrt:female
+ newtrt:factor(burntype)
- white
1
3
1
1
3
1
426.5009
426.7209
427.8148
428.1300
428.2391
429.7343
434.8294
0.4
Surv Prob
0.6
0.8
1.0
0.0
0.2
New treatment
Routine Treatment
10
20
30
40
50
Time to Infection
Fig 2: Survivor functions based on the Cox model for white males that were burned by
flame and that received the new treatment (solid line) or the standard treatment (dashed
line).
We can try to get inferences for the median TTE based on the Cox model for these two
individuals the survfit function. In my experience the 0.95Ucl is usually missing
and often the median is as well. This, of course, is data dependent.
> PHmodelx1
Call: survfit.coxph(object = PHmodel1, newdata = list(newtrt = 1, female
= 0,
white = 1, scald = 0, electric = 0, flame = 1))
n
154
events
48
> PHmodelx2
Call: survfit.coxph(object = PHmodel1, newdata = list(newtrt = 0, female
= 0,
white = 1, scald = 0, electric = 0, flame = 1))
n
154
events
48
A graphical summary
0.8
0.6
0.2
0.0
20
30
40
50
10
20
30
40
Time to Infection
Time to Infection
50
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
Surv Prob
0.6
0.8
1.0
10
1.0
Surv Prob
0.4
Surv Prob
0.6
0.4
0.0
0.2
Surv Prob
0.8
1.0
Lognormal Model
1.0
Cox Model
10
20
30
Time to Infection
40
50
10
20
30
Time to Infection
40
50