Introduction to models with latent continuous responses
There are several Bayesian models that rely on the assumption that there is an underlying continuous variable that is, for one reason or another, latent (i.e., not observed/observable). This assumption is often adopted:
- because it matches well to the way we conceptualize how the world works; and/or
- because of computational reasons (i.e., the algorithm to fit the model is more efficient if we assume the existence of these latent continuous variables).
Examples of these models include:
- models for censored data (also known as tobit regression models): for example, imagine that data on the concentration of certain chemicals in the water are collected with a measurement instrument that has a particular detection limit. In this case, whenever the actual concentration is below the detection limit, it is not directly observed. This is an example for which the concentration is our partially latent variable (i.e., it is observable above the detection limit but not below it).
- probit regression models: these models are similar to logistic regression models in that they are used to model binary data. The main difference is that, instead of assuming the logistic link function \(\frac{\exp(\beta_0 + \beta_1 x_i )}{1+\exp(\beta_0 + \beta_1 x_i )}\), we assume a probit link function \(\Phi(\beta_0 + \beta_1 x_i )\). In this last expression, \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution. To implement this model, we assume that there is a continuous latent variable \(z_i\) such that we observe \(y_i=1\) if \(z_i>0\) and \(y_i=0\) if \(z_i \le 0\).
- models for ordered categorical variables: examples of ordered categorical variables include Likert-scale data (e.g., strongly disagree, disagree, indifferent, agree, strongly agree) or other categorized responses with a natural ordering (e.g., letter grades). In this model, we assume that there is a latent continuous scale that maps to these ordered categories. This is implemented by assuming a continuous latent variable \(z_i\) and that we observe \(y_i=j\) if \(z_i\) falls between thresholds \(t_{j-1}\) and \(t_j\).
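The probit and ordinal constructions above can be checked with a quick Monte Carlo simulation. The sketch below is in Python rather than R, and the numbers are made up for illustration: a linear predictor \(\eta = 0.7\) and ordinal thresholds \(-0.5\) and \(1\) are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

eta = 0.7  # hypothetical linear predictor beta_0 + beta_1 * x_i
z = rng.normal(loc=eta, scale=1.0, size=1_000_000)  # latent z_i ~ N(eta, 1)

# Probit: y_i = 1 exactly when z_i > 0, so P(y_i = 1) = Phi(eta)
p_hat = (z > 0).mean()
print(p_hat, norm.cdf(eta))  # Monte Carlo estimate vs. the probit link

# Ordinal: thresholds t_1 < t_2 carve the latent scale into 3 categories;
# y_i = j when z_i falls between t_j and t_{j+1}
t = np.array([-0.5, 1.0])
y = np.searchsorted(t, z)
probs = np.bincount(y) / z.size
print(probs)  # category probabilities implied by the latent scale
```

The empirical frequency of \(z_i > 0\) matches \(\Phi(\eta)\), and the category frequencies match the areas of the normal density between consecutive thresholds.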
In all of these models, we have an observed response variable \(y_i\) and a latent continuous response \(z_i\), and we build a regression for the \(z_i\)'s. Our goal is to estimate the regression parameters \(\beta_0,\beta_1,\dots\), but we will have to infer the \(z_i\)'s along the way to be able to do this. Here is a figure depicting our overall modeling approach:
The regression that we build for the \(z_i\)'s is given by:
\[z_i \sim N(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}+...,\sigma^2) \]
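To make this concrete, the sketch below (Python, with made-up covariates, coefficients, and a hypothetical detection limit of 0.2) draws latent \(z_i\)'s from this regression and then derives the observed \(y_i\) for each of the three models above, emphasizing that all three share the same latent layer.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical covariate and parameter values, for illustration only
n = 8
x1 = rng.normal(size=n)
beta0, beta1, sigma = 0.5, 1.5, 1.0

# Latent regression: z_i ~ N(beta_0 + beta_1 * x_{1i}, sigma^2)
z = rng.normal(loc=beta0 + beta1 * x1, scale=sigma)

# Each model observes a different function of the same latent z_i:
y_censored = np.maximum(z, 0.2)               # tobit: detection limit at 0.2
y_probit = (z > 0).astype(int)                # probit: indicator of z_i > 0
y_ordinal = np.searchsorted([-1.0, 1.0], z)   # ordinal: thresholds -1 and 1

print(z.round(2))
print(y_censored.round(2), y_probit, y_ordinal)
```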
There are two important things to notice about this model:
- In some of the models we will see, \(\sigma^2\) will not be estimated. This is because, in these models, we can only estimate the ratio \(\frac{\beta}{\sigma}\) rather than the regression slope \(\beta\) itself and, as a result, the convention has been to set \(\sigma\) to 1.
- Notice that this is identical to a standard Gaussian regression model if we assume that we know all the \(z_i\)'s. This is an important observation when creating our customized Gibbs sampler in R. Because the full conditional distributions (FCDs) condition on all the other parameters and latent variables in the model, the FCDs for the regression parameters \(\sigma^2,\beta_0,\beta_1,\dots\) are the same as the FCDs we have already derived for a standard Gaussian regression. The only tricky part is figuring out how to sample the latent \(z_i\)'s. Once we know how to do that, sampling the other regression parameters is straightforward.
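As a sketch of how these pieces fit together, the code below implements a Gibbs sampler for probit regression on simulated data. It is written in Python rather than R, assumes a flat prior on \(\beta\) with \(\sigma\) fixed at 1, and the data-generating values are made up. The key steps are exactly the two described above: the FCD of each \(z_i\) is a truncated normal (truncated to be positive when \(y_i=1\) and negative when \(y_i=0\)), and the FCD of \(\beta\) is the familiar Gaussian-regression FCD treating the sampled \(z_i\)'s as the response.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

# Simulate probit data with hypothetical true parameters
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

# With a flat prior and sigma = 1, beta | z ~ N(beta_hat, (X'X)^{-1})
XtX_inv = np.linalg.inv(X.T @ X)

beta = np.zeros(2)
draws = []
for it in range(2000):
    # Step 1: sample z_i | beta, y_i from a truncated normal.
    # Bounds are standardized (relative to mean mu_i, sd 1):
    # z_i > 0 when y_i = 1, z_i <= 0 when y_i = 0.
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)
    hi = np.where(y == 1, np.inf, -mu)
    z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)

    # Step 2: sample beta | z as in a standard Gaussian regression
    beta_hat = XtX_inv @ X.T @ z
    beta = rng.multivariate_normal(beta_hat, XtX_inv)

    if it >= 500:  # discard burn-in draws
        draws.append(beta)

print(np.mean(draws, axis=0))  # posterior mean, should be near beta_true
```

The design choice worth noting is that step 2 reuses the standard Gaussian-regression FCD unchanged; only step 1 is new, which is the point of the latent-variable construction.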