The model

Say we have some measurements that come from a device with a detection limit (dl), assumed to be known, below which the results are not reliable. This is a form of censoring (or, if the affected values are discarded entirely, truncation).

To analyze these data, one option would be to throw away the observations that are equal to dl, but you might be discarding a substantial amount of information (these data tell us that the outcome was small, even if not how small) and your results would be biased. Alternatively, we can modify our model to accommodate this type of censoring. We can model these data by assuming that:

\[y_i = \begin{cases} z_i & \text{if } z_i > dl \\ dl & \text{if } z_i \le dl \end{cases}\] \[z_i \sim N(\beta_0+\beta_1 x_i,\ \sigma^2)\]

The important thing to note here is that, if we knew all the \(z_i\)’s, the rest of the model would be easy (i.e., it would be the simple regression that we have dealt with in past lectures and assignments). But we only know \(z_i\) if \(y_i>dl\). When \(y_i=dl\), we have only partial information on \(z_i\) (i.e., we know just that it is at most dl). Thus, we have to estimate \(z_i\) whenever \(y_i=dl\).
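One way to make this concrete is to write down the likelihood directly: uncensored points contribute a normal density, while censored points contribute the probability of falling at or below dl. Below is a minimal sketch of this censored (Tobit-style) likelihood maximized with `scipy`; the parameter values, seed, and optimizer choice are illustrative assumptions, and the course may instead estimate the \(z_i\)'s with a Bayesian data-augmentation approach.

```python
import numpy as np
from scipy import stats, optimize

# Simulate censored data (hypothetical parameter values).
rng = np.random.default_rng(1)
n = 200
beta0, beta1, sigma = 1.0, 2.0, 1.5
dl = 2.0  # detection limit, assumed known

x = rng.uniform(0, 2, size=n)
z = beta0 + beta1 * x + rng.normal(0, sigma, size=n)  # latent values
y = np.where(z > dl, z, dl)                           # observed values
censored = y == dl

def neg_log_lik(params):
    b0, b1, log_sigma = params
    s = np.exp(log_sigma)  # optimize log(sigma) so sigma stays positive
    mu = b0 + b1 * x
    # Uncensored points: normal log-density.
    ll_obs = stats.norm.logpdf(y[~censored], mu[~censored], s)
    # Censored points: log P(z_i <= dl) = log Phi((dl - mu_i) / sigma).
    ll_cens = stats.norm.logcdf((dl - mu[censored]) / s)
    return -(ll_obs.sum() + ll_cens.sum())

fit = optimize.minimize(neg_log_lik, x0=[0.0, 1.0, 0.0], method="Nelder-Mead")
b0_hat, b1_hat, sigma_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
print(f"estimates: beta0={b0_hat:.2f}, beta1={b1_hat:.2f}, sigma={sigma_hat:.2f}")
```

Because the censored points still enter the likelihood, the estimates recover the true parameters (up to sampling noise), unlike the naive fits discussed below.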

Simulating data

Here is how we generate some fake data for this problem:
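The original code chunk is not shown here, so the following is a minimal sketch of one way to simulate such data; the parameter values, sample size, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
beta0, beta1, sigma = 1.0, 2.0, 1.5  # hypothetical true parameters
dl = 2.0                             # detection limit, assumed known

x = rng.uniform(0, 2, size=n)
z = beta0 + beta1 * x + rng.normal(0, sigma, size=n)  # true (latent) values
y = np.where(z > dl, z, dl)  # values at or below dl are reported as dl

print(f"{np.mean(y == dl):.0%} of observations are censored at the detection limit")
```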

In this figure, the observed points are shown in black while the true values (i.e., the values prior to censoring) are shown in grey. As you can see, there seems to be some relationship between y and x, but the data are censored at the detection limit (the grey horizontal line). The true relationship between x and y is depicted with a red line. The estimated relationship using all the data, censored values included, is shown with the dashed blue line. While that fit is biased, things are even worse if we use only the data above the detection limit, depicted with the green line! What should we do?
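The bias behind both the dashed blue and the green lines can be reproduced numerically. The sketch below, using the same simulated-data assumptions as before, compares the naive slope (censored values treated as real measurements) with the slope from discarding everything at the detection limit; both are attenuated relative to the true slope.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
beta0, beta1, sigma = 1.0, 2.0, 1.5  # hypothetical true parameters
dl = 2.0

x = rng.uniform(0, 2, size=n)
z = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
y = np.where(z > dl, z, dl)

# Naive fit: treat values stuck at dl as if they were real measurements.
slope_naive = np.polyfit(x, y, 1)[0]

# Discard fit: keep only observations above the detection limit.
keep = y > dl
slope_discard = np.polyfit(x[keep], y[keep], 1)[0]

print(f"true slope: {beta1}, naive: {slope_naive:.2f}, discard: {slope_discard:.2f}")
```

Censoring replaces small outcomes (which occur mostly at small x) with the larger value dl, flattening the fitted line; discarding those points truncates the low tail at small x, which flattens the line as well.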



Comments?

Send me an email at