Nonparametric Statistics, 2003, Vol. 15(2), pp. 253–265
NONPARAMETRIC ESTIMATION FOR MIDDLE-CENSORED DATA
S. RAO JAMMALAMADAKA^{a,*} and VASUDEVAN MANGALAM^{b}
^{a} Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106, USA; ^{b} Department of Mathematics, Universiti Brunei Darussalam, Brunei
* Corresponding author.
(Received 2 November 1999; In final form 22 August 2000)
This paper provides the self-consistent estimator (SCE) and the nonparametric maximum likelihood estimator
(NPMLE) for ‘‘middle-censored’’ data, in which a data value becomes unobservable if it falls within a random
interval. We provide an algorithm to find the SCE and show that the NPMLE satisfies the self-consistency
equation. We find a sufficient condition for the SCE to be concentrated on the uncensored observations. In
addition, we find sufficient conditions for the consistency of the SCE and prove that consistency holds for the
special case when one of the ends is a constant. Some simulation results and an illustrative example, using a
Danish melanoma data set, are provided.
Keywords: Survival function; Middle-censoring; Self-consistency; Nonparametric maximum likelihood estimation
AMS 1991 Subject Classifications: Primary: 62G05, 62G30; Secondary: 62G99
1 INTRODUCTION
Estimation of the unknown distribution of a random variable is of fundamental importance in
statistics. In areas such as reliability, biometry, general medical follow-up studies and clinical
trials, the distribution function of the underlying lifetime, or more specifically the survival
function, is of paramount interest.
In these situations, the random variable of interest is the lifetime and the observations refer
to times of occurrence of an event such as death due to a certain cause under study, or times
for equipment failure. When complete data are available, the Empirical Distribution Function
(EDF) is used and has many desirable properties. However, in many practical situations it
is quite common to have incomplete data, making the standard EDF unavailable. Often, such incomplete observation of the data results from a
random censoring mechanism. When observations are censored to the right, the product
limit estimator due to Kaplan and Meier (1958) is used in place of the EDF, and similar
estimators exist for the left-censored case. Gehan (1965), Turnbull (1974) and others considered doubly-censored data (where both left and right censoring occur simultaneously) and
estimators for the distribution function have been developed. Groeneboom and Wellner
(1992) and Geskus and Groeneboom (1996) studied the case of ‘‘interval-censored’’ data
where one can only observe a censoring event and whether the time of the event of interest,
say death, occurred before or after the occurrence of the censoring event. Nonparametric
Maximum Likelihood Estimators (NPMLE) for the distribution of interest have been studied
by various authors for all these cases. A Self Consistent Estimator (SCE) is usually obtained
by solving a set of equations called the self-consistency equations (see Efron, 1967; Tarpey
and Flury, 1996), and under some conditions this coincides with the NPMLE. Tsai and
Crowley (1985) have shown that many of these cases can be unified by applying a generalized maximum likelihood principle. They also point out that solving the self-consistency
equation is essentially equivalent to applying the EM algorithm for the corresponding
missing data problem. See Dempster et al. (1977) and McLachlan and Krishnan
(1997) on the EM algorithm.
In this paper we consider an important variation and generalization of censoring where a
data point becomes unobservable if it falls inside a random interval. When that happens we
observe a censorship indicator and the interval of censorship. We will refer to this as
‘‘middle-censoring’’. Left censoring, right censoring and double censoring are special
cases of this ‘‘middle censorship’’ by suitable choice of this censoring interval, which can
be infinite. Middle-censoring, where a random middle part is missing, appears at first glance
to be complementary to the idea of double-censoring, where the middle is what is actually
observed. However, if one considers these two schemes carefully along with the resulting
data sets (see the next section), they turn out to be quite distinct ideas.
Before discussing the estimator, we consider some situations where this type of censoring
may arise. In general, in any lifetime study, if the subject is temporarily withdrawn from the
study we will have an interval of censorship. It could be an equipment failure that occurs
during a period when observation is not possible or is not being made. In biomedical studies,
the patient under observation may be absent from the study for a short period during which time
the event of interest may occur. As an example of double censoring, Turnbull (1974) refers to
a study of African infant precocity by Leiderman et al. (1973), whose purpose was to establish
norms for infant development in a community in Kenya. A sample of 65 children
was considered and each child was tested monthly to see if (s)he had learned to accomplish
certain tasks. The time from birth to the learning time was the variable of interest. In their
analysis, double-censoring occurred due to late arrivals (the child had already learned the
skills before entering the study) and losses (the child had not acquired the skill by the end
of the study). We envisage a scenario where there are no late entries or losses as such,
but where observation is not possible during a fixed time interval (this fixed interval is indeed
a random interval relative to the individual's lifetime), for instance because of the temporary
closure of the clinic due to an outbreak of, say, war. If some children, of varying ages, developed
the skill during this time, we do not observe the exact ages of these children at the time of
skill development, but only the information that the event of interest occurred during a
certain time interval. These ideas can, of course, be extended to more general random sets
of censorship such as union of intervals or even more complicated sets but we have not
explored this in detail.
In Section 2, we derive the self-consistency equation for the middle censored case and
show that the NPMLE indeed satisfies the self-consistency equation. A simple example
which shows how one computes the NPMLE is also given. In Section 3 we explore conditions under which the self-consistent estimator (SCE) is consistent and prove consistency
in an important special case. Section 4 illustrates the SCE for the middle-censored case for a
simulated data set as well as for a real data set on melanoma survival, from Andersen
et al. (1993). A computer program which allows the computation of the SCE is available
by writing to the authors.
2 SELF-CONSISTENCY AND THE NPMLE
Let $X_i$, $i = 1, \ldots, n$, be a sequence of independent identically distributed (i.i.d.) random variables with unknown distribution $F_0$. Let $Y_i = (L_i, R_i)$ be a sequence of i.i.d. random vectors, independent of the $X_i$'s, with unknown bivariate distribution $G$ such that $P(L_i < R_i) = 1$. While $X$ denotes the variable of interest, $Y$ represents the censoring mechanism. Using the notation

$$\delta_i = I[X_i \notin (L_i, R_i)],$$

we observe

$$Z_i = \begin{cases} X_i & \text{when } \delta_i = 1, \text{ i.e., if } X_i \notin (L_i, R_i), \\ (L_i, R_i) & \text{when } \delta_i = 0, \text{ i.e., if } X_i \in (L_i, R_i). \end{cases}$$

That is, we either observe the original value $X_i$ if there is no censoring, or the interval of censoring $(L_i, R_i)$ when there is censoring.
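To make the observation scheme concrete, the following sketch (in Python; the function and variable names are ours and hypothetical, not from the paper) shows how the observed data $(Z_i, \delta_i)$ are formed from the latent lifetimes $X_i$ and the censoring intervals $(L_i, R_i)$.

import numpy as np

def middle_censor(x, left, right):
    # x, left, right: arrays of latent lifetimes X_i and censoring endpoints (L_i, R_i), with L_i < R_i.
    # Returns (z, delta): delta_i = 1 means X_i is observed exactly (z_i is the number X_i);
    # delta_i = 0 means only the censoring interval is observed (z_i is the pair (L_i, R_i)).
    x, left, right = map(np.asarray, (x, left, right))
    delta = ~((left < x) & (x < right))          # 1 iff X_i falls outside (L_i, R_i)
    z = [xi if d else (li, ri) for xi, d, li, ri in zip(x, delta, left, right)]
    return z, delta.astype(int)

# Tiny illustration with made-up numbers: the third lifetime falls inside its interval and is censored.
z, delta = middle_censor([2.0, 4.0, 6.0], [1.0, 5.0, 5.5], [1.5, 5.5, 6.5])
print(z)       # first two entries are the observed values; the third is the interval (5.5, 6.5)
print(delta)   # [1 1 0]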
In many censoring situations, if we were to try to estimate the distribution function via the EM algorithm, the resulting equation takes the form

$$\hat F(t) = E_{\hat F}[E_n(t) \mid Z]$$

as described by Tsai and Crowley (1985), where $E_n$ is the empirical distribution function. This equation was first introduced, and referred to as the self-consistency equation, by Efron (1967). In the middle-censored case the SCE $F_n$ satisfies the equation

$$F_n(t) = \frac{1}{n} \sum_{i=1}^n \left\{ \delta_i I(X_i \le t) + \bar\delta_i I(R_i \le t) + \bar\delta_i I[t \in (L_i, R_i)]\, \frac{F_n(t) - F_n(L_i)}{F_n(R_i) - F_n(L_i)} \right\}, \qquad (1)$$

where $\bar\delta_i = 1 - \delta_i$. (For the rest of the paper we follow the convention that $\bar x$, for any variable or function $x$, denotes $1 - x$.) As in the case of doubly-censored data, there is no explicit closed-form solution to this equation, and it has to be computed by the iterative formula

$$\hat F^{(m+1)}(t) = E_{\hat F^{(m)}}[E_n(t) \mid Z].$$

The convergence of the algorithm is assured by Theorem 2.1 of Tsai and Crowley (1985) provided that the initial estimator gives positive mass to all observed points. See the discussion following Proposition 1 below regarding the choice of the initial estimator. For a general discussion of self-consistency and its relation to the EM algorithm, see Tarpey and Flury (1996).
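In practice the iteration is carried out on a finite support. The sketch below (our own illustration in Python; names are hypothetical and not from the paper) assumes the support of $F_n$ consists of the distinct uncensored observations, which, by Proposition 1 below, carries all the mass of any self-consistent solution whenever every censored interval contains at least one uncensored point. Each pass redistributes the unit mass of every censored interval over the support points it contains, in proportion to the current masses, exactly as in Eq. (1).

import numpy as np

def sce(uncensored, intervals, tol=1e-10, max_iter=10000):
    # Self-consistency / EM iteration for middle-censored data (minimal sketch).
    # uncensored : values observed exactly (delta_i = 1)
    # intervals  : list of (L_i, R_i) pairs for the censored cases (delta_i = 0)
    # Returns (support, prob): support points and fitted point masses of F_n.
    uncensored = np.asarray(uncensored, dtype=float)
    support = np.unique(uncensored)                      # assumed support of F_n
    n = len(uncensored) + len(intervals)
    counts = np.array([(uncensored == s).sum() for s in support], dtype=float)
    inside = [np.array([(l < s < r) for s in support]) for (l, r) in intervals]
    prob = np.full(len(support), 1.0 / len(support))     # initial estimator: equal mass everywhere

    for _ in range(max_iter):
        new = counts.copy()
        for idx in inside:                               # spread each censored unit of mass
            mass_in = prob[idx].sum()                    # current mass of F_n inside (L_i, R_i)
            if mass_in > 0:
                new[idx] += prob[idx] / mass_in
        new /= n
        if np.max(np.abs(new - prob)) < tol:
            return support, new
        prob = new
    return support, prob

For the five observations used in the worked example later in this section (2, 4, 6 uncensored, with censored intervals (1, 5) and (3, 7)), this iteration converges quickly to masses of roughly 0.276, 0.447 and 0.276, in agreement with the closed-form NPMLE derived there.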
Let $\mathcal{F}$ denote the set of all distribution functions on the line. For $F \in \mathcal{F}$ the likelihood of the sample is given by

$$L(F) = \prod_{i=1}^n [F(X_i) - F(X_i-)]^{\delta_i}\, [F(R_i) - F(L_i)]^{1 - \delta_i}.$$

Denoting $\Delta F(x) = F(x) - F(x-)$, the quantity $\phi(F) = (1/n) \log L(F)$ is given by

$$\phi(F) = \frac{1}{n} \sum_{i=1}^n \left\{ \delta_i \log \Delta F(X_i) + \bar\delta_i \log [F(R_i) - F(L_i)] \right\}
= \int \left\{ I[x \notin (l, r)] \log \Delta F(x) + I[x \in (l, r)] \log [F(r) - F(l)] \right\} dP_n(x, l, r),$$
where $P_n$ is the empirical measure of $\{(X_i, L_i, R_i) : 1 \le i \le n\}$. The maximizer of $\phi$ is clearly
the NPMLE. In the next theorem, we show that the NPMLE for middle censored data
satisfies the self-consistency equation. But before that we need the following lemma.
LEMMA 1 Define

$$A_t(x) = \begin{cases} \dfrac{F(t \wedge x)}{F(t)} - F(x) & \text{if } F(t) > 0, \\[4pt] 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where $t \wedge x = \min(t, x)$. Then

$$K(x) = F(x) + h\, A_t(x)$$

defines a class of distribution functions for $h$ sufficiently close to zero.
Proof Note that we need to show this only for $F(t) > 0$, since $A_t \equiv 0$ when $F(t) = 0$. Now

$$K(x) = (1 - h) F(x) + h\, \frac{F(t \wedge x)}{F(t)}$$

is a convex combination of two cdf's and hence is a cdf for $0 \le h < 1$. For negative $h$, write $K = F - h A_t$ with $h > 0$, so that

$$K(x) = (1 + h) F(x) - h\, \frac{F(t \wedge x)}{F(t)}.$$

Clearly $K(-\infty) = 0$ and $K(\infty) = 1$. It is also right-continuous, so it remains to show that it is monotone. We check this separately on $(-\infty, t]$ and $[t, \infty)$. For $x$ in $(-\infty, t]$,

$$K(x) = (1 + h) F(x) - h\, \frac{F(x)}{F(t)} = F(x) \left[ 1 - h\, \frac{1 - F(t)}{F(t)} \right].$$

This is clearly bounded by 1 and is non-negative if $h \le F(t)/(1 - F(t))$, and in this case $K$ is monotone non-decreasing. Similarly, on $[t, \infty)$,

$$K(x) = (1 + h) F(x) - h = F(x) - h (1 - F(x)).$$

Again this is bounded by 1 and is non-negative so long as $h \le F(x)/(1 - F(x))$, which is assured if $h \le F(t)/(1 - F(t))$ since $t \le x$. Monotonicity of $K$ is clear. ∎
THEOREM 1 The NPMLE satisfies the equation

$$F(t) = \frac{1}{n} \sum_{i=1}^n \left\{ \delta_i I(X_i \le t) + \bar\delta_i I(R_i \le t) + \bar\delta_i I[t \in (L_i, R_i)]\, \frac{F(t) - F(L_i)}{F(R_i) - F(L_i)} \right\}. \qquad (3)$$
Proof If $F$ maximizes $\phi$, then the directional derivative of $\phi$ in the direction of $A_t$ should be zero at $F$; i.e., $F$ satisfies

$$0 = \lim_{h \to 0} \frac{\phi(F + h A_t) - \phi(F)}{h}
= \int \left\{ I[x \notin (l, r)] \lim_{h \to 0} \frac{\log[\Delta F(x) + h\, \Delta A_t(x)] - \log \Delta F(x)}{h} + I[x \in (l, r)] \lim_{h \to 0} \frac{\log[F(r) - F(l) + h(A_t(r) - A_t(l))] - \log[F(r) - F(l)]}{h} \right\} dP_n(x, l, r),$$

as the integral involved is really a finite sum and hence the interchange of limit and integration is valid. When $F(t) > 0$, the first of the two limits inside the integral is

$$\Delta A_t(x) \lim_{\epsilon \to 0} \frac{\log[\Delta F(x) + \epsilon] - \log \Delta F(x)}{\epsilon} = \frac{\Delta A_t(x)}{\Delta F(x)} = \frac{I(x \le t)}{F(t)} - 1.$$

The second limit is similarly equal to

$$\frac{A_t(r) - A_t(l)}{F(r) - F(l)} = \frac{F(t) \wedge F(r) - F(t \wedge l)}{F(t)\,[F(r) - F(l)]} - 1,$$

where $F(t) \wedge F(r)$ stands for $F(t)$ if $t < r$ and $F(r)$ otherwise. Thus we get

$$1 = \int \left\{ I[x \notin (l, r)]\, \frac{I(x \le t)}{F(t)} + I[x \in (l, r)]\, \frac{F(t) \wedge F(r) - F(t \wedge l)}{F(t)\,[F(r) - F(l)]} \right\} dP_n(x, l, r)$$

or

$$F(t) = \int \left\{ I[x \notin (l, r)]\, I(x \le t) + I(l < x < r \le t) + I[x, t \in (l, r)]\, \frac{F(t) - F(l)}{F(r) - F(l)} \right\} dP_n(x, l, r).$$

The right-hand side of the above is the same as the right-hand side of (3). ∎
It is a question of considerable interest to ask whether the NPMLE will have all its mass on the uncensored observations. The answer is yes, provided all censored intervals contain at least one uncensored observation. When there is a censoring interval empty of uncensored observations, clearly some mass has to be attached to that interval or the likelihood would be zero. That the weights are concentrated on the uncensored observations when all censoring intervals are non-empty is a consequence of the following proposition.

PROPOSITION 1 If each observed censored interval $(L_i, R_i)$ contains at least one uncensored observation $X_j$, $j \ne i$, then any distribution function that satisfies (3) attaches all its mass to the uncensored observations.
Proof Let $F$ be a distribution satisfying (3) and let $x_1, x_2, \ldots, x_m$ be the uncensored observations. For any $x$ let $\Delta F(x) = F(x) - F(x-)$ be the weight $F$ associates to $x$. We need to show that $\sum_{j=1}^m \Delta F(x_j) = 1$. From (3) it follows that

$$\Delta F(x_j) = \frac{1}{n} + \frac{1}{n} \sum_{i=1}^n (1 - \delta_i)\, I[x_j \in (L_i, R_i)]\, \frac{\Delta F(x_j)}{F(R_i) - F(L_i)}. \qquad (4)$$

Summing (4) over all uncensored observations, we get

$$\sum_{j=1}^m \Delta F(x_j) = \frac{m}{n} + \frac{1}{n} \sum_{i=1}^n (1 - \delta_i)\, \frac{\sum_j I[x_j \in (L_i, R_i)]\, \Delta F(x_j)}{F(R_i) - F(L_i)}. \qquad (5)$$

For each censoring interval $(L_i, R_i)$, let $a_i$ be the slack between the mass associated to the interval and the sum of weights of all uncensored observations in the interval. Then

$$a_i = F(R_i) - F(L_i) - \sum_{j=1}^m I[x_j \in (L_i, R_i)]\, \Delta F(x_j) \qquad (6)$$

and the $a_i$'s are all non-negative. From (5) and (6), it follows that

$$1 - \sum a_i = 1 - \frac{1}{n} \sum \frac{a_i}{F(R_i) - F(L_i)}$$

or

$$\sum a_i = \frac{1}{n} \sum \frac{a_i}{F(R_i) - F(L_i)}, \qquad (7)$$

where the sum is over all censored observations. As every interval contains at least one uncensored observation, it follows from (4) that $F(R_i) - F(L_i) \ge 1/n$, and hence (7) implies that

$$a_i = \frac{a_i}{n(F(R_i) - F(L_i))} \qquad (8)$$

for each $i$. Now if there exists $i$ such that $a_i > 0$, then $F(R_i) - F(L_i) \ge 1/n + a_i > 1/n$, contradicting (8). ∎
We have now proved that the NPMLE will have all its mass on the uncensored observations except when it so happens that a censored interval contains no uncensored observation.
If this happens, we are in a situation similar to that of right censored data where the largest
observation is censored. While in the right censored case the extra mass is usually left
unassigned, for middle-censored data there is a natural way of handling this. When a
censored interval contains no uncensored points, we let the mass that corresponds to that
interval be assigned to its midpoint. Thus our initial estimator may give equal mass to all
uncensored observations and to the midpoints of those finite censored intervals that contain
no uncensored observations. If an infinite censoring interval happens to be empty of uncensored observations, one can then assign the mass to any arbitrary point inside this interval for
the estimator to have a maximum.
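For completeness, the initial assignment just described can be written down as follows (a sketch under our own naming; the paper describes the rule only in words): equal mass on every distinct uncensored observation and on the midpoint of every finite censored interval that contains no uncensored observation.

import numpy as np

def initial_support(uncensored, intervals):
    # Initial estimator for the SCE/EM iteration: equal mass on the uncensored
    # observations plus the midpoints of finite censored intervals that contain
    # no uncensored observation.  (An empty *infinite* interval would instead
    # need an arbitrary interior point; that case is not handled here.)
    uncensored = np.asarray(uncensored, dtype=float)
    points = list(np.unique(uncensored))
    for l, r in intervals:
        empty = not np.any((uncensored > l) & (uncensored < r))
        if empty and np.isfinite(l) and np.isfinite(r):
            points.append((l + r) / 2.0)                 # midpoint of an empty finite interval
    points = np.array(sorted(points))
    mass = np.full(len(points), 1.0 / len(points))       # positive mass on every support point
    return points, mass

Such a support and starting mass can replace the uncensored-only support assumed in the earlier iteration sketch when some censored intervals happen to be empty; it gives positive mass to all observed points, as required for the convergence result of Tsai and Crowley (1985) cited above.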
Consider the following example where $n = 5$ and $z_1 = 2$, $z_2 = 4$, $z_3 = 6$, $z_4 = (1, 5)$ and $z_5 = (3, 7)$. Let $p_1, p_2, p_3$ be the masses to be assigned to $z_1, z_2, z_3$ respectively. The likelihood function is given by

$$p_1\, p_2\, p_3\, (p_1 + p_2)(p_2 + p_3)$$

and, as the $p_i$'s add up to 1 and the roles of $p_1$ and $p_3$ are interchangeable, we can simplify the problem to that of maximizing $x^2 (1 - 2x)(1 - x)^2$ with $p_1 = p_3 = x$ and $p_2 = 1 - 2x$. The solution, then, is given by $x = (5 - \sqrt 5)/10$, so that $p_1 = p_3 = (5 - \sqrt 5)/10$ and $p_2 = 1/\sqrt 5$ give the NPMLE. In this example the iterations of the self-consistency equation rapidly converged to the NPMLE.
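A quick numerical check of this maximization (our own verification, not part of the paper) recovers the same value of $x$:

import numpy as np

# Profile likelihood of the example above: with p1 = p3 = x and p2 = 1 - 2x,
# L = p1*p2*p3*(p1 + p2)*(p2 + p3) = x^2 (1 - 2x)(1 - x)^2.
x = np.linspace(1e-6, 0.5 - 1e-6, 2000001)
L = x**2 * (1 - 2 * x) * (1 - x) ** 2
print(x[np.argmax(L)], (5 - np.sqrt(5)) / 10)   # both are approximately 0.27639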
The SCE, being a result of convergence of the EM algorithm, provides a local maximum of
the likelihood [see, for example, Mykland and Ren, 1996] and may not coincide
with the NPMLE. Examples of cases when an SCE is not the NPMLE can be constructed
by considering situations where two empty censoring intervals overlap. For instance, if we
have 1, 2, (3, 6), (4, 7) as the data, we could assign 0.25 mass to 1, 2, 4.5 and 5.5 to get
an SCE. The NPMLE will assign 0.25 each on 1 and 2, but assign 0.5 on some point, say
5, on the overlap area (4, 6). Both estimators are self-consistent, but the latter has higher likelihood. This happens whenever there are empty, overlapping intervals. In the next section we
shall show the strong consistency of self-consistent estimators for certain cases. This will
demonstrate that SCE and NPMLE are, at least for these special cases, asymptotically
equivalent and hence will be approximately the same for large samples.
3 CONSISTENCY OF SELF-CONSISTENT ESTIMATORS
Define $P$ and $Q$, sub-distribution functions on $\mathbb{R}$ and $\mathbb{R}^2$ respectively, by

$$P(t) = P(X \le t,\ \delta = 1), \qquad Q(l, r) = P(L \le l,\ R \le r,\ \delta = 0)$$

and their empirical versions $P_n$ and $Q_n$ by

$$P_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \le t,\ \delta_i = 1), \qquad Q_n(l, r) = \frac{1}{n} \sum_{i=1}^n I(L_i \le l,\ R_i \le r,\ \delta_i = 0).$$

Then by the Glivenko-Cantelli lemma, $P_n$ and $Q_n$ converge almost surely to $P$ and $Q$ respectively, and the convergence in each case is uniform on the respective domain. Also, (1) can be written in terms of $P_n$ and $Q_n$ as follows:

$$F_n(t) = P_n(t) + \int \frac{F_n(t) \wedge F_n(r) - F_n(t \wedge l)}{F_n(r) - F_n(l)}\, dQ_n(l, r). \qquad (9)$$

By Helly's theorem, there exist a subsequence $n_k$ and a non-decreasing function $F$ bounded by 0 and 1 such that, on a set of probability 1, $F_{n_k}(t) \to F(t)$ for each $t$.
PROPOSITION 2 If $\{\varphi_n\}$ is a sequence of functions on $\mathbb{R}^2$ which converges uniformly to a bounded continuous function $\varphi$, then

$$\int \varphi_n(l, r)\, dQ_n(l, r) \to \int \varphi(l, r)\, dQ(l, r).$$

Proof Note that

$$\left| \int \varphi_n(l, r)\, dQ_n(l, r) - \int \varphi(l, r)\, dQ(l, r) \right| \le \left| \int (\varphi_n(l, r) - \varphi(l, r))\, dQ_n(l, r) \right| + \left| \int \varphi(l, r)\, dQ_n(l, r) - \int \varphi(l, r)\, dQ(l, r) \right| \le \|\varphi_n - \varphi\| \int dQ_n + \left| \int \varphi(l, r)\, dQ_n(l, r) - \int \varphi(l, r)\, dQ(l, r) \right|,$$

where $\|\cdot\|$ represents the supremum norm. Now the first term on the right-hand side of the inequality goes to zero since $\int dQ_n \le 1$, while the second term goes to zero because the sequence of empirical measures $Q_n$ converges weakly to $Q$ and $\varphi$ is a bounded continuous function. ∎
LEMMA 2 Any subsequential limit $F$ of $F_n$ satisfies the equation

$$F(t) = P(t) + \int \frac{F(t) \wedge F(r) - F(t \wedge l)}{F(r) - F(l)}\, dQ(l, r). \qquad (10)$$

Proof For a fixed $t$, taking limits in (9) through the subsequence $n_k$ as $k \to \infty$ and using Proposition 2 with

$$\varphi_n(l, r) = \frac{F_n(t) \wedge F_n(r) - F_n(t \wedge l)}{F_n(r) - F_n(l)},$$

the result follows. ∎
When $P$ and $Q$ are written in terms of $F_0$ and $G$, (10) is equivalent to

$$F(t) - F_0(t) = \int_{l < t < r} \left[ \frac{F(t) - F(l)}{F(r) - F(l)}\,(F_0(r) - F_0(l)) + F_0(l) - F_0(t) \right] dG(l, r). \qquad (11)$$

From (11), it follows that $F(\infty) = 1$ and $F(-\infty) = 0$. Note that if $F = F_0$, (11) is automatically satisfied. If we were able to show that (11) has a unique solution, it would follow that $F_0$ is the only limit point of $\{F_n\}$. Then, on a set of probability 1, $F_n(x) \to F_0(x)$ for each $x$, and by continuity of $F_0$ the almost sure convergence is uniform.
A necessary condition for consistency is what we call ``identifiability''. Let $A(t) = P(L < t < R)$. The condition is that $A$ not be identically 1 on any interval $[a, b]$, $a \le b$, for which $F_0(b) > F_0(a)$. Observe that if $A \equiv 1$ on an interval where $F_0$ has positive mass, then censoring occurs with probability 1 on that interval. As a consequence, there will be no observations on this interval, and that prevents us from distinguishing between two distributions which are identical outside $[a, b]$ but differ on $[a, b]$. This condition will be referred to as the ``identifiability condition'' and is a requirement for consistent estimation of $F_0$.
LEMMA 3 Let $h = F - F_0$ and

$$g(t) = \int_{\mathbb{R}^2} \chi(l, r)\, I(l < t < r)\, dG(l, r),$$

where $\chi(l, r) = (h(r) - h(l))/(F(r) - F(l))$. Then

$$\bar A\, dh = -g\, dF. \qquad (12)$$

Proof
From (11) we get
$$h(t) = \int_{l < t < r} \left[ h(t) - h(l) - (F(t) - F(l))\,\frac{h(r) - h(l)}{F(r) - F(l)} \right] dG(l, r)
= -\int_{l < t < r} \left[ (F(t) - F(l))\,\chi(l, r) + h(l) \right] dG(l, r) + h(t)\, A(t).$$

So,

$$h(t)\,\bar A(t) = -\int_{l < t < r} \left[ (F(t) - F(l))\,\chi(l, r) + h(l) \right] dG(l, r) = -[F(t)\, g(t) + C(t)], \qquad (13)$$

where

$$C(t) = \int_{l < t < r} [h(l) - F(l)\,\chi(l, r)]\, dG(l, r).$$

``Differentiating'' both sides of (13) with respect to $t$, we get

$$\bar A(t)\, dh(t) - h(t)\, dA(t) = -[g(t)\, dF(t) + F(t)\, dg(t) + dC(t)].$$

To show that (12) holds, it is clearly sufficient to show that

$$F(t)\, dg(t) + dC(t) - h(t)\, dA(t) = 0. \qquad (14)$$

If $B(t) = \int_{l < t < r} H(l, r)\, dG(l, r)$ for some function $H$, then it can be shown that

$$dB(t) = \left( \int_t^\infty H(t, r)\, dF_{R|L}(r \mid t) \right) dF_L(t) - \left( \int_{-\infty}^t H(l, t)\, dF_{L|R}(l \mid t) \right) dF_R(t),$$

where $F_{R|L}(\cdot \mid t)$ is the conditional distribution function of $R$ given $L = t$. Hence, applying this to $g$ and $C$, and using the identity $(F(t) - F(l))\,\chi(l, t) + h(l) = h(t)$,

$$\begin{aligned}
\text{LHS of (14)} &= F(t) \left( \int_t^\infty \chi(t, r)\, dF_{R|L}(r \mid t) \right) dF_L(t) - F(t) \left( \int_{-\infty}^t \chi(l, t)\, dF_{L|R}(l \mid t) \right) dF_R(t) \\
&\quad + \left( \int_t^\infty [h(t) - F(t)\,\chi(t, r)]\, dF_{R|L}(r \mid t) \right) dF_L(t) - \left( \int_{-\infty}^t [h(l) - F(l)\,\chi(l, t)]\, dF_{L|R}(l \mid t) \right) dF_R(t) - h(t)\, dA(t) \\
&= dF_L(t) \int_t^\infty h(t)\, dF_{R|L}(r \mid t) - dF_R(t) \int_{-\infty}^t h(t)\, dF_{L|R}(l \mid t) - h(t)\, dA(t) \\
&= h(t)\,(dF_L(t) - dF_R(t)) - h(t)\, dA(t) = 0
\end{aligned}$$

because $\int_{-\infty}^t dF_{L|R}(l \mid t) = \int_t^\infty dF_{R|L}(r \mid t) = 1$ and $A(t) = F_L(t) - F_R(t)$. ∎
Thus, if the only function $h$ satisfying (12) is the zero function, then we would have proved the strong consistency of the SCE. We have not yet been able to show that (12) admits only the zero solution in the general case, but we give below a proof for the special case when one of the end points of the censoring interval is degenerate. Although the result is stated for the case when $L$ is degenerate (for instance, censoring, if it occurs, starts on a certain birthday of the individual), the proof works equally well when $R$ is degenerate.
THEOREM 2 Assume that $F_0$ and $F_R$ are continuous and $L \equiv l_0$. Assume that the identifiability condition is satisfied. Then the only function $F$ that satisfies (11) is $F_0$, and hence the SCE is uniformly strongly consistent.

Proof In this special case (13) becomes

$$h(t)\,\bar A(t) = -I(t > l_0) \int_t^\infty [(F(t) - F(l_0))\,\chi(l_0, r) + h(l_0)]\, dF_R(r). \qquad (15)$$

As $\bar A(t) = 1 - P[t \in (l_0, R)] = 1 - I(t > l_0)\,\bar F_R(t)$, we have $\bar A(t) = 1$ for all $t \le l_0$ and $\bar A(t) = F_R(t)$ for all $t > l_0$; hence from (15), $h(t) = 0$ for all $t \le l_0$. In particular, $h(l_0) = 0$. Thus (15) becomes

$$h(t)\,\bar A(t) = -I(t > l_0) \int_t^\infty \frac{F(t) - F(l_0)}{F(r) - F(l_0)}\, h(r)\, dF_R(r).$$

Similarly, (12) holds with

$$g(t) = I(t > l_0) \int_t^\infty \frac{h(r)}{F(r) - F(l_0)}\, dF_R(r). \qquad (16)$$

Note that from the assumptions of the theorem it follows that $F$, $h$ and $g$ are continuous on $(l_0, \infty)$. We aim to show that $h \equiv 0$ on $(l_0, \infty)$. Assuming that there exists $t_0 > l_0$ such that $h(t_0) > 0$, we will arrive at a contradiction. The proof is similar if $h(t_0) < 0$.
As $h(l_0) = h(\infty) = 0$, there exist $t_1 < t_2$ such that $t_1 \ge l_0$, $t_2 \le \infty$, $h(t_1) = h(t_2) = 0$ and $h(t) > 0$ on $(t_1, t_2)$. From (16), on $(l_0, \infty)$, $g(t) = \int_t^\infty \chi(l_0, r)\, dF_R(r)$, where $\chi(l_0, r) = h(r)/(F(r) - F(l_0))$.

CLAIM 1 $g(t_1) \le 0$.

Suppose not. Then there exists $t^*$ such that $g > 0$ on $(t_1, t^*)$. We shall now show that $\bar A(t) > 0$ on $(t_1, t^*)$. If there exists $\tilde t \in (t_1, t^*)$ such that $\bar A(\tilde t) = 0$, then $\bar A(t) = 0$ for all $t \in (l_0, \tilde t)$, so that by the identifiability condition $dF_0(t) = 0$ for all $t \in (l_0, \tilde t)$. From (12), $dF(t) = 0$ on $(t_1, \tilde t)$ (since $g > 0$ there); so $dh(t) = dF(t) - dF_0(t) = 0$ on $(t_1, \tilde t)$, which implies $h \equiv 0$ there, contrary to our assumption. From (12), $dh(t) = -(g(t)/\bar A(t))\, dF(t)$, so $\int_{t_1}^{t^*} -(g(t)/\bar A(t))\, dF(t) = h(t^*) - h(t_1) = h(t^*)$. Now the left-hand side is $\le 0$, contradicting the fact that $h(t^*) > 0$. This proves Claim 1.

CLAIM 2 $g(t_2) \ge 0$.

Suppose not. Then there exists $t^* \in (t_0, t_2)$ such that $g < 0$ on $(t^*, t_2)$. Similar to the previous situation, we have $\bar A(t) > 0$ on $(t^*, t_2)$. As earlier, $h(t_2) - h(t^*) = \int_{t^*}^{t_2} -(g(t)/\bar A(t))\, dF(t) \ge 0$, implying $h(t^*) \le 0$, which is a contradiction. Thus Claim 2 is proved.

On $(t_1, t_2)$, $dg(t) = -\chi(l_0, t)\, dF_R(t) = -h(t)\, dF_R(t)/(F(t) - F(l_0)) \le 0$, so $g$ is non-increasing there. Thus from Claim 1 and Claim 2 it follows that $g \equiv 0$ on $(t_1, t_2)$. (Note that the argument goes through even if $t_2 = \infty$.) From (12), $\bar A\, dh \equiv 0$ on $(t_1, t_2)$. As $g(t) = \int_t^\infty \chi(l_0, r)\, dF_R(r)$ and $\chi(l_0, \cdot) > 0$ on $(t_1, t_2)$, $F_R$ is constant on $(t_1, t_2)$ and hence $\bar A$ is constant on $(t_1, t_2)$; call this constant $c$.

If $c > 0$, then $dh \equiv 0$ on $(t_1, t_2)$, so $h \equiv h(t_1) = 0$ on $(t_1, t_2)$, which is a contradiction. If $c = 0$, then $A \equiv 1$ on $(t_1, t_2)$ and hence, by the identifiability condition, $F_0$ is constant on $(t_1, t_2)$. As $h(t_1) = h(t_2) = 0$, $F(t_1) = F(t_2)$, so $F$ is constant on $(t_1, t_2)$. So $h$ is constant on $(t_1, t_2)$, which means $h \equiv 0$ on $(t_1, t_2)$. This is a contradiction. ∎
4 ILLUSTRATIVE EXAMPLES
A simulation study was performed to measure the performance of self-consistent estimators, where an exponential (mean = 10) random variable was middle-censored by random intervals with left end points being exponential (mean = 5) and interval widths being an independent exponential (mean = 5). A sample of size 100 was taken, and 22 of the observations were censored. Figure 1 shows the SCE along with the original exponential distribution function $F_0$. The maximum distance $\|F_n - F_0\|$ is 0.0827, which compares well with the Kolmogorov–Smirnov distance, namely the maximum distance of the EDF $E_n$ of the uncensored data from the true distribution, which is $\|E_n - F_0\| = 0.0715$. The authors also tried out various other survival distributions such as gamma and Weibull that were censored by intervals whose left ends were distributed as exponential, gamma, Weibull or uniform and whose interval widths were a positive random variable or a constant. In all these cases, the resulting estimators for middle censoring were in very close agreement with the EDF of the original uncensored data. It is clear that the amount of censoring in any of these cases is approximately $P(L \le X \le R)$.

FIGURE 1 The EDF and SCE for the simulated exponential data.
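The simulation setting can be reproduced along the following lines (a sketch with our own seed and variable names; the censoring fraction and distances vary with the seed and will not exactly match the numbers quoted above).

import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed, not from the paper
n = 100
x = rng.exponential(10.0, n)             # lifetimes, exponential with mean 10
left = rng.exponential(5.0, n)           # left end points, exponential with mean 5
right = left + rng.exponential(5.0, n)   # widths, independent exponential with mean 5
observed = ~((left < x) & (x < right))   # delta_i = 1 iff X_i falls outside its interval

print("fraction censored:", 1 - observed.mean())

# Distance of the uncensored-data EDF from the true F_0(t) = 1 - exp(-t/10),
# evaluated at the uncensored points (a benchmark like the 0.0715 quoted above).
t = np.sort(x[observed])
edf = np.arange(1, t.size + 1) / t.size
print("max |E_n - F_0| at uncensored points:", np.max(np.abs(edf - (1 - np.exp(-t / 10.0)))))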
Finally we applied our techniques to an actual data set on melanoma survival collected at Odense University Hospital, Denmark [see Andersen et al., 1993]. The sample contains 205 data points, ranging from 10 to 5565. The data were censored by a random interval whose left end was an exponential random variable with mean 2000 and whose width was exponential with mean 1000. Over 23% of the data were censored, resulting in 157 uncensored observations. The SCE $F_n$ is given in Figure 2 while the EDF $E_n$ of the survival data is in Figure 3. They are shown superimposed in Figure 4, to see how close they are. Indeed, the maximum distance $\|F_n - E_n\|$ between them is 0.0604, while the maximum relative distance $\|(F_n - E_n)/E_n\|$ turns out to be a still small 0.153.
FIGURE 2 SCE for the censored melanoma survival data.
FIGURE 3 EDF for the uncensored melanoma survival data.
FIGURE 4 EDF and SCE superimposed.
Acknowledgements
We would like to thank an anonymous referee whose persistence led to a much more readable
paper.
References
Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes.
Springer-Verlag, New York.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39, 1–38 (with discussion).
Efron, B. (1967). The two-sample problem with censored data. Proc. Fifth Berkeley Symp. Math. Stat. Probab.,
Vol. 4. University of California Press, Berkeley, pp. 831–853.
Gehan, E. A. (1965). A generalized two-sample Wilcoxon test for doubly censored data. Biometrika, 52, 650–653.
Geskus, R. B. and Groeneboom, P. (1996). Asymptotically optimal estimation of smooth functionals for interval censoring. Statist. Neerlandica, 50, 69–88.
Groeneboom, P. and Wellner, J. A. (1992). Information Bounds and Non-parametric Maximum Likelihood
Estimation. DMV Seminar, Vol. 19. Birkhäuser Verlag, Basel.
Gu, M. G. and Zhang, C.-H. (1993). Asymptotic properties of self-consistent estimators based on doubly censored
data. Ann. Statist., 21, 611–624.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc.,
53, 457–481.
Leiderman, P. H., Babu, D., Kagia, J., Kraemer, H. C. and Leiderman, G. F. (1973). African infant precocity and
some social influences during the first year. Nature, 242, 247–249.
McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. John Wiley and Sons, Inc., New York.
Mykland, P. A. and Ren, J. (1996). Algorithms for computing self-consistent and maximum likelihood estimators
with doubly censored data. Ann. Statist., 24, 1740–1764.
Tarpey, T. and Flury, B. (1996). Self-consistency: A fundamental concept in statistics. Statistical Science, 11,
229–243.
Tsai, W. Y. and Crowley, J. (1985). A large sample study of generalized maximum likelihood estimators from
incomplete data via self-consistency. Ann. Statist., 13, 1317–1334.
Turnbull, B. W. (1974). Nonparametric estimation of a survivorship function with doubly censored data. J. Amer.
Statist. Assoc., 69, 169–173.