Censoring

Data censoring occurs when the value of the response variable is only partially known. For example, right censoring occurs when we are modeling time to an event but the study ends prior to that event (e.g., death) occurring. In this case, we only know that the event happened after a certain date, but we do not know exactly when. Right censoring is very common in biomedical studies and typically is dealt with using survival analysis.

In ecology, interval censoring is also common. This refers to the situation where we know that an event happened between two specific time points but we do not know exactly when. For example, trees in permanent forest plots might be measured in 2020 and again in 2025; as a result, we do not have the exact death date of any tree that died during this period. Below we will discuss an example of censoring associated with detection limits.

The model

Say we have some measurements that come from a device with a detection limit (dl), assumed to be known, below which the results are not reliable. As a result, this device returns a reading equal to dl for any measurement below dl. This is a form of censoring: for these readings we know only that the true value is somewhere below dl.

To analyze this data, we have the following options:

  1. Analyze the data as is. The problem with this option is that we know the detection limit problem is occurring but we are ignoring it, potentially biasing our results.

  2. Throw away data that are equal to dl. The problem with this option is that you might be throwing a substantial amount of information away (these data tell us that the outcome was small, just not how small) and your results are likely to be biased.

  3. We can try to modify our model to accommodate this type of censoring. More specifically, we can model these data by assuming that:

\[\text{If } z_i>dl \text{, then } y_i=z_i\] \[\text{If } z_i \leq dl \text{, then } y_i=dl\] \[z_i \sim N(\beta_0+\beta_1 x_i,\ \sigma^2)\]

The important thing to note here is that, if we know what all the \(z_i\)’s are, then the rest of the model is easy (i.e., it is the simple regression that we have dealt with in past lectures and assignments). But we only know \(z_i\) if \(y_i>dl\). When \(y_i=dl\), then we have some partial information on \(z_i\) (i.e., we only know that it is less than dl). Thus, we have to estimate \(z_i\) if \(y_i=dl\).
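This partial information can also be written directly into the likelihood: uncensored observations contribute the usual normal density, while censored observations contribute the probability that \(z_i\) falls below dl. Here is a minimal sketch in R (the function name `negloglik` and the log-scale parameterization of the standard deviation are our choices, not part of the model above):

```r
# Negative log-likelihood for regression with left-censoring at dl.
# Points with y > dl contribute the normal density dnorm();
# points with y == dl only tell us that z < dl, so they
# contribute P(z < dl) = pnorm(dl, mean, sd).
negloglik = function(par, x, y, dl) {
  mu  = par[1] + par[2] * x     # b0 + b1*x
  sig = exp(par[3])             # optimize on the log scale so sig > 0
  cens = (y == dl)
  ll = sum(dnorm(y[!cens], mean = mu[!cens], sd = sig, log = TRUE)) +
       sum(pnorm(dl, mean = mu[cens], sd = sig, log.p = TRUE))
  -ll
}

# Tiny example call on made-up data: returns a finite value
negloglik(c(1.5, -0.1, log(0.45)), x = c(0, 5, 10), y = c(2, 1, 1), dl = 1)
```

Note the contrast with the approach described above: there, the censored \(z_i\)'s are treated as quantities to be estimated (latent variables), whereas this likelihood integrates them out.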

Simulating data

Here is how we generate some fake data for this problem:

rm(list=ls(all=TRUE))
set.seed(1)

n=100
b0=1.5
b1=-0.1
x=seq(from=0,to=10,length.out=n)
sig2=0.2
mean1=b0+b1*x
z=rnorm(n,mean=mean1,sd=sqrt(sig2))

#this is what things look like prior to the detection limit problem
plot(x,z,col='grey',main='True data and relationship')
lines(x,mean1,col='red')

#below the detection limit, y=dl
dl=1
y=z
cond=z<dl
y[cond]=dl

#this is the actual data
plot(x,z,col='grey',main='Observed data and estimated relationships')
lines(x,mean1,col='red',lty=3)
points(x,y)
abline(h=dl,col='grey')

dat=data.frame(x=x,y=y)

#regression results with all the data
res=lm(y~x,data=dat) 
lines(x,res$coef[1]+res$coef[2]*x,col='blue')

#regression results after throwing away the "bad data"
res=lm(y~x,data=dat[dat$y>dl,]) 
lines(x,res$coef[1]+res$coef[2]*x,col='green')

setwd('U:/uf/courses/bayesian course/rasc/2016/detection limit')
write.csv(dat,'fake data.csv',row.names=F)

In this figure, the observed points are shown in black while the true values (i.e., the values prior to the censoring process) are shown in grey. As you can see, there seems to be some relationship between y and x, but the data are censored at the detection limit (the horizontal grey line). The true relationship between x and y is depicted with the red line. The relationship estimated using all the data is shown with the blue line. While this is bad, things are even worse if we only use the data above the detection limit, depicted with the green line! What should we do?



Comments?

Send me an email at