Censoring (Statistics)
In statistics, censoring is a condition in which the value of a measurement or observation is only partially
known.
For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a
study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a
situation could occur if the individual withdrew from the study at age 75, or if the individual is currently
alive at the age of 75.
Censoring also occurs when a value occurs outside the range of a measuring instrument. For example, a
bathroom scale might only measure up to 140 kg. If a 160-kg individual is weighed using the scale, the
observer would only know that the individual's weight is at least 140 kg.
The problem of censored data, in which the observed value of some variable is partially known, is related
to the problem of missing data, where the observed value of some variable is unknown.
Censoring should not be confused with the related idea of truncation. With censoring, observations result
either in knowing the exact value that applies, or in knowing that the value lies within an interval. With
truncation, observations never result in values outside a given range: values in the population outside the
range are never seen or never recorded if they are seen. Note that in statistics, truncation is not the same as
rounding.
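The distinction can be illustrated with the bathroom-scale example above. The following is a minimal sketch, not a standard API: the names `SCALE_LIMIT_KG`, `censor_reading` and `truncate_sample` and the sample weights are hypothetical, chosen only for illustration. Censoring keeps every individual but records only a lower bound for those over the limit; truncation drops them from the data set altogether.

```python
from typing import List, Tuple

SCALE_LIMIT_KG = 140.0  # assumed upper limit of the hypothetical scale


def censor_reading(true_weight: float, limit: float = SCALE_LIMIT_KG) -> Tuple[float, bool]:
    """Return (recorded value, is_right_censored) for one weighing."""
    if true_weight > limit:
        # Only the lower bound "at least `limit`" is known.
        return limit, True
    return true_weight, False


def truncate_sample(weights: List[float], limit: float = SCALE_LIMIT_KG) -> List[float]:
    """Truncation: values above the limit never enter the data set."""
    return [w for w in weights if w <= limit]


true_weights = [72.5, 160.0, 95.0, 141.0]  # made-up values
censored = [censor_reading(w) for w in true_weights]
truncated = truncate_sample(true_weights)

print(censored)   # four records, two of them flagged as censored
print(truncated)  # only the two values within range remain
```

The censored sample still has four records, so the analyst knows that two individuals weighed at least 140 kg; the truncated sample carries no trace of them.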
Types
Left censoring – a data point is below a certain value but it is unknown by how much.
Interval censoring – a data point is somewhere on an interval between two values.
Right censoring – a data point is above a certain value but it is unknown by how much.
Type I censoring occurs if an experiment has a set number of subjects or items and stops the
experiment at a predetermined time, at which point any subjects remaining are right-
censored.
Type II censoring occurs if an experiment has a set number of subjects or items and stops
the experiment when a predetermined number are observed to have failed; the remaining
subjects are then right-censored.
Random (or non-informative) censoring is when each subject has a censoring time that is
statistically independent of their failure time. The observed value is the minimum of the
censoring and failure times; subjects whose failure time is greater than their censoring time
are right-censored.
Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right
censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at
infinity, respectively.
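The three right-censoring schemes listed above differ only in how the stopping or censoring time is chosen. The sketch below makes that concrete under simple assumptions: failure times are given as plain numbers with no ties, and each observation is a pair (observed time, failure observed?). The function names `type_i`, `type_ii` and `random_censoring` and the example data are hypothetical.

```python
from typing import List, Sequence, Tuple

Observation = Tuple[float, bool]  # (observed time, True if the failure was observed)


def type_i(failure_times: Sequence[float], stop_time: float) -> List[Observation]:
    """Type I: stop at a predetermined time; survivors are right-censored."""
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def type_ii(failure_times: Sequence[float], n_failures: int) -> List[Observation]:
    """Type II: stop when a predetermined number of failures has been observed."""
    if not 1 <= n_failures <= len(failure_times):
        raise ValueError("n_failures must be between 1 and the number of subjects")
    # The experiment ends at the n_failures-th smallest failure time.
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def random_censoring(failure_times: Sequence[float],
                     censoring_times: Sequence[float]) -> List[Observation]:
    """Random censoring: observe the minimum of failure and censoring time."""
    return [(min(t, c), t <= c) for t, c in zip(failure_times, censoring_times)]


failure_times = [2.0, 9.0, 4.5, 7.0, 12.0]  # made-up failure times
print(type_i(failure_times, stop_time=8.0))
print(type_ii(failure_times, n_failures=2))
print(random_censoring(failure_times, [5.0, 5.0, 3.0, 10.0, 20.0]))
```

In a real study the failure times of censored subjects are of course unknown; they appear here only so that the three schemes can be applied to the same underlying data.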
Estimation methods for left-censored data vary, and not every method is applicable to, or the most
reliable for, every data set.[1]
A common misconception with time-interval data is to classify intervals whose start time is unknown as
left-censored. In such cases we have a lower bound on the length of the interval, so the data are
right-censored (even though the missing start point lies to the left of the known interval on a timeline).
Analysis
Special techniques may be used to handle censored data. Tests with specific failure times are coded as
actual failures; censored data are coded for the type of censoring and the known interval or limit. Special
software programs (often reliability oriented) can conduct a maximum likelihood estimation for summary
statistics, confidence intervals, etc.
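One possible way to code such data is to store every result as an interval, so that exact, left-censored, right-censored and interval-censored values share one representation. The sketch below is only an assumption about how this might be done; the class name `CodedObservation`, its fields and the example records are hypothetical and do not correspond to any particular software package.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedObservation:
    """One coded result: the true value is known to lie in [lower, upper]."""
    lower: float
    upper: float

    def __post_init__(self) -> None:
        if self.lower < 0 or self.lower > self.upper:
            raise ValueError("need 0 <= lower <= upper")

    @property
    def kind(self) -> str:
        """Label the type of censoring implied by the interval."""
        if self.lower == self.upper:
            return "exact"      # an actual failure time
        if math.isinf(self.upper):
            return "right"      # known only to exceed `lower`
        if self.lower == 0:
            return "left"       # known only to be at most `upper`
        return "interval"       # known to lie between two inspections


records = [
    CodedObservation(120.0, 120.0),     # failed at exactly 120 h
    CodedObservation(500.0, math.inf),  # still running when the test ended
    CodedObservation(0.0, 24.0),        # already failed at the first inspection
    CodedObservation(48.0, 72.0),       # failed between two inspections
]
for record in records:
    print(record.kind, record.lower, record.upper)
```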
Epidemiology
One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's
1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of inoculation.[2] An
early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al.
(1989).[3] However, Lin et al.[4] found this approach to be invalid unless all patients accumulated costs
with a common deterministic rate function over time, and proposed an alternative estimation technique
known as the Lin estimator.[5]
An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the
time-of-test-termination for those that did not fail.
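As an illustration of how failure times and termination times are used together, the sketch below implements the usual product-limit form of the Kaplan–Meier estimator mentioned above, applied to survival times (its standard use) rather than to costs. The function name `kaplan_meier` and the data are hypothetical, and the code is a teaching sketch rather than a replacement for a survival-analysis package.

```python
from typing import List, Tuple


def kaplan_meier(observations: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return [(failure time, estimated survival probability just after it)].

    Each observation is (time, failed), where failed is False for an item
    whose test was terminated before it failed (right-censored).
    """
    failure_times = sorted({t for t, failed in observations if failed})
    survival = 1.0
    curve = []
    for t in failure_times:
        # Items still under observation just before time t.
        at_risk = sum(1 for u, _ in observations if u >= t)
        # Items that failed exactly at time t.
        failures = sum(1 for u, failed in observations if failed and u == t)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Made-up data: three failures, two terminated tests.
data = [(2.0, True), (3.0, False), (4.0, True), (5.0, True), (6.0, False)]
curve = kaplan_meier(data)
for time, surv in curve:
    print(f"t = {time}: S(t) = {surv:.4f}")
```

Censored items never trigger a drop in the estimated curve, but they do count in the number at risk until their test is terminated.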
Censored regression
An earlier model for censored regression, the tobit model, was proposed by James Tobin in 1958.[6]
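To give a sense of what such a model involves, the sketch below evaluates a tobit log-likelihood in its common textbook form. The assumptions are mine and are not stated in the text above: a latent response that is normal around a linear predictor, a single regressor, and left-censoring at a known limit (zero by default). The names `tobit_loglik`, `norm_logpdf` and `norm_cdf` are hypothetical helpers.

```python
import math
from typing import Sequence


def norm_logpdf(z: float) -> float:
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y: Sequence[float], x: Sequence[float],
                 intercept: float, slope: float, sigma: float,
                 limit: float = 0.0) -> float:
    """Log-likelihood when responses at or below `limit` are left-censored."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    total = 0.0
    for yi, xi in zip(y, x):
        mu = intercept + slope * xi
        if yi > limit:
            # Uncensored: density of the observed value.
            total += norm_logpdf((yi - mu) / sigma) - math.log(sigma)
        else:
            # Censored: probability that the latent value is at most the limit.
            total += math.log(norm_cdf((limit - mu) / sigma))
    return total


# Made-up data: two responses are recorded at the limit of 0.
y = [0.0, 1.2, 0.0, 2.5]
x = [0.1, 1.0, 0.3, 2.0]
print(tobit_loglik(y, x, intercept=0.0, slope=1.0, sigma=1.0))
```

A numerical optimiser would normally be used to maximise this function over the intercept, slope and sigma. For parameter values far from the data the censored term can underflow to the logarithm of zero, which this sketch does not guard against.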
Likelihood
The likelihood is the probability or probability density of what was observed, viewed as a function of
the parameters of an assumed model. A censored data point enters the likelihood through the probability
of the event that was actually observed (that the value lies in the known range), which is a function of
the CDF $F$ rather than of the density or probability mass. Writing $S = 1 - F$ for the survival function:
left censoring (the value is known only to be at most $b$): $\Pr(x \le b) = F(b) = 1 - S(b)$
right censoring (the value is known only to exceed $a$): $\Pr(x > a) = 1 - F(a) = S(a)$
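The way each kind of observation contributes to the likelihood can be sketched as follows. An exponential model is assumed purely for illustration, and the helper names `exp_cdf`, `exp_pdf` and `log_contribution` are hypothetical. An observation is passed as a pair of bounds: equal bounds mean an exact value, a lower bound of zero means left censoring, and an infinite upper bound means right censoring.

```python
import math


def exp_cdf(t: float, rate: float) -> float:
    """CDF of the exponential distribution, F(t) = 1 - exp(-rate * t)."""
    if t <= 0:
        return 0.0
    if math.isinf(t):
        return 1.0
    return 1.0 - math.exp(-rate * t)


def exp_pdf(t: float, rate: float) -> float:
    """Density of the exponential distribution."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0


def log_contribution(lower: float, upper: float, rate: float) -> float:
    """Log-likelihood contribution of one observation known to lie in [lower, upper]."""
    if lower == upper:
        # Exact observation: use the density.
        return math.log(exp_pdf(lower, rate))
    # Censored observation: use the probability F(upper) - F(lower).
    return math.log(exp_cdf(upper, rate) - exp_cdf(lower, rate))


rate = 0.5
print(log_contribution(2.0, 2.0, rate))       # exact value 2
print(log_contribution(0.0, 2.0, rate))       # left-censored at 2: log F(2)
print(log_contribution(2.0, math.inf, rate))  # right-censored at 2: log S(2)
print(log_contribution(1.0, 3.0, rate))       # interval-censored between 1 and 3
```

Summing these contributions over all observations gives the log-likelihood to be maximised over the model parameters.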
Example
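Suppose we are interested in survival times, $T_1, T_2, \ldots, T_n$, but we don't observe $T_i$ for all $i$. Instead, we
observe
$(U_i, \delta_i)$, with $U_i = T_i$ and $\delta_i = 1$ if $T_i$ is actually observed, and
$(U_i, \delta_i)$, with $U_i < T_i$ and $\delta_i = 0$ if all we know is that $T_i$ is longer than $U_i$.
When $T_i > U_i$, $U_i$ is called the censoring time.
If the censoring times are all known constants, then the likelihood is
$$L = \prod_{i:\,\delta_i = 1} f(u_i) \prod_{i:\,\delta_i = 0} S(u_i),$$
where $f(u_i)$ = the probability density function evaluated at $u_i$,
and $S(u_i)$ = the probability that $T_i$ is greater than $u_i$, called the survival function.
This can be simplified by defining the hazard function, the instantaneous force of mortality, as
$$\lambda(u) = \frac{f(u)}{S(u)},$$
so
$$f(u) = \lambda(u)\, S(u).$$
Then
$$L = \prod_{i} \lambda(u_i)^{\delta_i}\, S(u_i).$$
For the exponential distribution, this becomes even simpler, because the hazard rate, $\lambda$, is constant, and
$S(u) = \exp(-\lambda u)$. Then:
$$L(\lambda) = \lambda^{k} \exp\Bigl(-\lambda \sum_i u_i\Bigr),$$
where $k = \sum_i \delta_i$.
From this we easily compute $\hat{\lambda}$, the maximum likelihood estimate (MLE) of $\lambda$, as follows:
$$l(\lambda) = \log L(\lambda) = k \log \lambda - \lambda \sum_i u_i.$$
Then
$$\frac{dl}{d\lambda} = \frac{k}{\lambda} - \sum_i u_i.$$
Setting this to zero and solving for $\lambda$ gives
$$\hat{\lambda} = \frac{k}{\sum_i u_i}.$$
Equivalently, the estimated mean time to failure is
$$\frac{1}{\hat{\lambda}} = \frac{\sum_i u_i}{k}.$$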
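This differs from the standard MLE for the exponential distribution in that any censored observations
contribute only to the numerator of the estimated mean time to failure (the total observed time), not to
the count of failures in the denominator.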
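The estimate for the exponential case can be computed directly from the observed times and the failure indicators. The sketch below assumes right-censored data given as pairs (observed time, failure observed?); the function name `exponential_mle` and the data are hypothetical.

```python
from typing import List, Tuple


def exponential_mle(observations: List[Tuple[float, bool]]) -> float:
    """MLE of the exponential hazard rate from right-censored data.

    Each observation is (u, failed): u is the observed time and failed is
    True for an observed failure, False for a censored observation.
    """
    total_time = sum(u for u, _ in observations)        # all subjects contribute time
    k = sum(1 for _, failed in observations if failed)  # only failures are counted
    if k == 0 or total_time <= 0:
        raise ValueError("need at least one observed failure and positive total time")
    return k / total_time


# Made-up data: three failures and two subjects censored at time 8.
data = [(2.0, True), (8.0, False), (4.5, True), (7.0, True), (8.0, False)]
rate = exponential_mle(data)
print("estimated hazard rate:", rate)
print("estimated mean time to failure:", 1.0 / rate)

# Dropping the censored subjects instead would ignore their 16 units of
# failure-free time and overstate the hazard rate.
failures_only = [u for u, failed in data if failed]
print("rate if censored subjects are dropped:", len(failures_only) / sum(failures_only))
```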
See also
Data analysis
Detection limit
Imputation (statistics)
Inverse probability weighting
Sampling bias
Saturation arithmetic
Survival analysis
Winsorising
References
1. Helsel, D. (2010). "Much Ado About Next to Nothing: Incorporating Nondetects in Science"
(https://doi.org/10.1093%2Fannhyg%2Fmep092). Annals of Occupational Hygiene. 54 (3):
257–262. doi:10.1093/annhyg/mep092 (https://doi.org/10.1093%2Fannhyg%2Fmep092).
PMID 20032004 (https://pubmed.ncbi.nlm.nih.gov/20032004).
2. Bernoulli, D. (1766). "Essai d'une nouvelle analyse de la mortalité causée par la petite
vérole". Mem. Math. Phy. Acad. Roy. Sci. Paris, reprinted in Bradley (1971) 21 and Blower
(2004).
3. Quesenberry, C. P., Jr.; et al. (1989). "A survival analysis of hospitalization among patients
with acquired immunodeficiency syndrome"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1349769). American Journal of Public Health.
79 (12): 1643–1647. doi:10.2105/AJPH.79.12.1643 (https://doi.org/10.2105%2FAJPH.79.12.1643).
PMC 1349769 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1349769).
PMID 2817192 (https://pubmed.ncbi.nlm.nih.gov/2817192).
4. Lin, D. Y.; et al. (1997). "Estimating medical costs from incomplete follow-up data".
Biometrics. 53 (2): 419–434. doi:10.2307/2533947 (https://doi.org/10.2307%2F2533947).
JSTOR 2533947 (https://www.jstor.org/stable/2533947).
PMID 9192444 (https://pubmed.ncbi.nlm.nih.gov/9192444).
5. Wijeysundera, H. C.; et al. (2012). "Techniques for estimating health care costs with
censored data: an overview for the health services researcher"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3377439). ClinicoEconomics and Outcomes
Research. 4: 145–155. doi:10.2147/CEOR.S31552 (https://doi.org/10.2147%2FCEOR.S31552).
PMC 3377439 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3377439).
PMID 22719214 (https://pubmed.ncbi.nlm.nih.gov/22719214).
6. Tobin, James (1958). "Estimation of relationships for limited dependent variables"
(http://cowles.yale.edu/sites/default/files/files/pub/d00/d0003-r.pdf) (PDF). Econometrica.
26 (1): 24–36. doi:10.2307/1907382 (https://doi.org/10.2307%2F1907382).
JSTOR 1907382 (https://www.jstor.org/stable/1907382).
7. Lu Tian, Likelihood Construction, Inference for Parametric Survival Distributions
(https://web.stanford.edu/~lutian/coursepdf/unit2.pdf) (PDF), Wikidata Q98961801.
Further reading
Blower, S. (2004). "D. Bernoulli's 'An attempt at a new analysis of the mortality caused by
smallpox and of the advantages of inoculation to prevent it'"
(https://web.archive.org/web/20170808033709/https://www.semel.ucla.edu/sites/all/files/biomedicalmodeling/pdf/Bernoulli%26Blower.pdf)
(PDF, 146 KiB). Reviews of Medical Virology. 14: 275–288. Archived from the original
(http://www.semel.ucla.edu/sites/all/files/biomedicalmodeling/pdf/Bernoulli&Blower.pdf) (PDF)
on 2017-08-08. Retrieved 2019-06-25.
Bradley, L. (1971). Smallpox Inoculation: An Eighteenth Century Mathematical Controversy.
Nottingham. ISBN 0-902031-23-6.
Mann, N. R.; et al. (1975). Methods for Statistical Analysis of Reliability and Life Data
(https://archive.org/details/methodsforstatis00mann). New York: Wiley. ISBN 047156737X.
Bagdonavicius, V.; Kruopis, J.; Nikulin, M. S. (2011). Non-parametric Tests for Censored
Data. London: ISTE/Wiley. ISBN 9781848212893.
External links
"Engineering Statistics Handbook", NIST/SEMATECH (http://www.itl.nist.gov/div898/handbook/)