The model
Say we have some measurements that come from a device that has a detection limit (dl), assumed to be known, below which the results are not reliable. This is a form of truncation or censoring (here, strictly speaking, censoring: values below dl are recorded as dl rather than dropped altogether).
To analyze these data, one option would be to throw away the observations that are equal to dl, but then you would be discarding a substantial amount of information (these data tell us that the outcome was small, just not how small) and your results would be biased. Alternatively, we can modify our model to accommodate this type of truncation/censoring. We can model these data by assuming that:
\[y_i = \begin{cases} z_i &amp; \text{if } z_i \geq dl \\ dl &amp; \text{if } z_i &lt; dl \end{cases}\] \[z_i \sim N(\beta_0+\beta_1 x_i,\,\sigma^2)\]
The important thing to note here is that, if we know what all the \(z_i\)’s are, then the rest of the model is easy (i.e., it is the simple regression that we have dealt with in past lectures and assignments). But we only know \(z_i\) if \(y_i>dl\). When \(y_i=dl\), then we have some partial information on \(z_i\) (i.e., we only know that it is less than dl). Thus, we have to estimate \(z_i\) if \(y_i=dl\).
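One way to see that the censored observations still carry information is to write down the likelihood: points above dl contribute the usual normal density, while censored points contribute the probability that \(z_i\) fell below dl. Here is a minimal sketch in R (the function name `cens.loglik` and the parameterization are my own, not from the lecture):

```r
#log-likelihood for the censored regression model
#y: observed data (censored at dl), x: covariate, dl: detection limit
#par=c(b0,b1,log.sig2); sigma^2 enters on the log scale to stay positive
cens.loglik=function(par,y,x,dl){
  b0=par[1]; b1=par[2]; sig=sqrt(exp(par[3]))
  mu=b0+b1*x
  cens=(y==dl)
  #uncensored points: usual normal density
  ll1=sum(dnorm(y[!cens],mean=mu[!cens],sd=sig,log=TRUE))
  #censored points: probability that z_i was below dl
  ll2=sum(pnorm(dl,mean=mu[cens],sd=sig,log.p=TRUE))
  ll1+ll2
}
```

This could be maximized with `optim`, but in a Bayesian setting the same two pieces show up in the posterior, and the censored \(z_i\)'s can instead be treated as latent variables to be sampled.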
Simulating data
Here is how we generate some fake data for this problem:
rm(list=ls(all=TRUE))
set.seed(1)
n=100
b0=1.5
b1=-0.1
x=seq(from=0,to=10,length.out=n)
sig2=0.2
z=rnorm(n,mean=b0+b1*x,sd=sqrt(sig2))
#this is what things look like prior to the detection limit problem
plot(x,z,col='grey')
#below the detection limit, y=dl
dl=1
y=z
cond=z<dl
y[cond]=dl
#this is the actual data
points(x,y)
abline(h=dl,col='grey')
dat=data.frame(x=x,y=y)
#true relationship that we are trying to estimate
lines(x,b0+b1*x,col='red')
#regression results with all the data
res=lm(y~x,data=dat)
lines(x,res$coef[1]+res$coef[2]*x,col='blue',lty=3)
#regression results after throwing away the "bad data"
res=lm(y~x,data=dat[dat$y>dl,])
lines(x,res$coef[1]+res$coef[2]*x,col='green',lty=1)
setwd('U:/uf/courses/bayesian course/2016/detection limit')
write.csv(dat,'fake data.csv',row.names=F)
In this figure, the observed points are shown in black while the true values (i.e., the values prior to the censoring process) are shown in grey. As you can see, it seems there is some relationship between y and x, but the data are truncated by the detection limit (given by the grey horizontal line). The true relationship between x and y is depicted with a red line. The estimated relationship using all the data is shown with the dotted blue line. While this is bad, it seems that things are even worse if we just use the data above the detection limit, depicted with the green line! What should we do?
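One route, hinted at above, is to treat the censored \(z_i\)'s as unknowns and estimate them: in a Gibbs sampler, this amounts to drawing each censored \(z_i\) from a normal distribution truncated above at dl, given the current parameter values. A minimal sketch of that draw in R, using the inverse-CDF trick (the function name `rtnorm.below` is my own):

```r
#draw n values from N(mu,sig^2) truncated to (-Inf,dl) via the inverse CDF:
#u ~ Unif(0, P(z<dl)), then z = qnorm(u)
rtnorm.below=function(n,mu,sig,dl){
  u=runif(n,min=0,max=pnorm(dl,mean=mu,sd=sig))
  qnorm(u,mean=mu,sd=sig)
}
#example: impute the censored z's given current b0, b1, sig2
#cond=(y==dl)
#z[cond]=rtnorm.below(sum(cond),mu=b0+b1*x[cond],sig=sqrt(sig2),dl=dl)
```

With the \(z_i\)'s filled in this way, the remaining conditional updates are exactly those of the simple regression model from past lectures.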
Comments?
Send me an email at