Introduction

One way to think about models for ordinal data is that they are simply extensions of the regression models we typically use for binary data (logistic or probit regression). The difference is that, instead of having just two categories (0 and 1), we now have more than two ordered categories.

From binary logistic regression to ordinal logistic regression

Following these ideas, it is natural to think about modeling ordinal data using a series of logistic regressions. For example, we can use separate logistic regressions in the following way:

  1. We create a new binary variable \(w_i\) such that, if the observed outcome belongs to the first category (\(y_i=1\)), then \(w_i=1\). If the observed outcome does not belong to the first category (\(y_i > 1\)), then \(w_i=0\). In this case, we assume that

\[p(w_i=1|x_i)=\frac{exp(\beta_{01}+\beta_1 x_i)}{1+exp(\beta_{01}+\beta_1 x_i)}\]

This is equivalent to assuming that:

\[log(\frac{p(y_i \leq 1|x_i)}{p(y_i>1|x_i)})=\beta_{01}+\beta_1 x_i\]

  2. We create a new binary variable \(q_i\) such that, if the observed outcome belongs to the first or second categories (\(y_i \leq 2\)), then \(q_i=1\). If the observed outcome does not belong to these categories (\(y_i > 2\)), then \(q_i=0\). In this case, we assume that:

\[p(q_i=1|x_i)=\frac{exp(\beta_{02}+\beta_1 x_i)}{1+exp(\beta_{02}+\beta_1 x_i)}\]

This is equivalent to assuming that:

\[log(\frac{p(y_i \leq 2|x_i)}{p(y_i>2|x_i)})=\beta_{02}+\beta_1 x_i\]

  3. We create a new binary variable \(r_i\) such that, if the observed outcome belongs to the first, second, or third categories (\(y_i \leq 3\)), then \(r_i=1\). If the observed outcome does not belong to these categories (\(y_i > 3\)), then \(r_i=0\). In this case, we assume that:

\[p(r_i=1|x_i)=\frac{exp(\beta_{03}+\beta_1 x_i)}{1+exp(\beta_{03}+\beta_1 x_i)}\]

This is equivalent to assuming that:

\[log(\frac{p(y_i \leq 3|x_i)}{p(y_i>3|x_i)})=\beta_{03}+\beta_1 x_i\]

  4. And so on and so forth.

Notice that, despite these regressions having different intercepts, we assume the same slope parameter in each regression. This is called the proportional odds assumption. Because the slope is shared across all of these equations, when actually fitting this model we would not run a series of separate logistic regressions; instead, all the cumulative logits are fit jointly as a single model.
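To make this concrete, here is a minimal numerical sketch in Python (with made-up values for the intercepts \(\beta_{01},\beta_{02},\beta_{03}\) and the common slope \(\beta_1\)) showing how the cumulative probabilities \(p(y_i \leq k|x_i)\) are computed and how the individual category probabilities follow by differencing consecutive cumulative probabilities:

```python
import numpy as np

def cumulative_logit_probs(x, intercepts, slope):
    """Cumulative probabilities p(y <= k | x) under the proportional odds model.

    `intercepts` holds beta_{01}, ..., beta_{0,K-1}; `slope` is the common beta_1.
    """
    eta = intercepts + slope * x            # one linear predictor per cutoff
    return 1.0 / (1.0 + np.exp(-eta))       # inverse logit gives p(y <= k | x)

# Made-up parameter values for a 4-category outcome (increasing intercepts).
intercepts = np.array([-1.0, 0.5, 2.0])     # beta_{01}, beta_{02}, beta_{03}
slope = 0.8                                  # common slope beta_1

cum = cumulative_logit_probs(x=1.2, intercepts=intercepts, slope=slope)
# Category probabilities are differences of consecutive cumulative probabilities.
probs = np.diff(np.concatenate(([0.0], cum, [1.0])))
print(cum)    # p(y <= 1 | x), p(y <= 2 | x), p(y <= 3 | x)
print(probs)  # p(y = 1 | x), ..., p(y = 4 | x); these sum to 1
```

Because the intercepts increase with the category index while the slope is shared, the cumulative probabilities also increase, so their differences are valid (non-negative) category probabilities.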

From binary probit regression to ordinal probit regression

Another approach is to start with a binary probit regression and then extend it to ordinal data. Recall that, in the binary probit regression, we assumed that a latent continuous variable \(z_i\) exists but that we only get to observe \(y_i=1\) (if \(z_i>0\)) or \(y_i=0\) (if \(z_i<0\)). Before you ask, notice that \(z_i\) cannot be equal to zero because we assume that:

\[z_i \sim N(\beta_0+\beta_1 x_i,1)\]

As a result, the probability that \(z_i\) is exactly zero is zero. In this model, there is a single cutoff/threshold value and it is natural to assume that this is known and equal to zero. Because of this assumption, we can estimate the intercept \(\beta_0\).
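As a quick sanity check on this latent-variable construction, here is a small Monte Carlo sketch in Python (with made-up values for \(\beta_0\), \(\beta_1\), and \(x_i\)) showing that simulating \(z_i\) and thresholding it at zero reproduces \(p(y_i=1|x_i)=\Phi(\beta_0+\beta_1 x_i)\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Made-up values for the binary probit construction.
beta0, beta1, x = -0.3, 0.7, 1.5

# Latent variable z_i ~ N(beta0 + beta1 * x_i, 1); we observe y_i = 1 exactly when z_i > 0.
z = rng.normal(loc=beta0 + beta1 * x, scale=1.0, size=1_000_000)
y = (z > 0).astype(int)

print(y.mean())                      # Monte Carlo estimate of p(y_i = 1 | x_i)
print(norm.cdf(beta0 + beta1 * x))   # closed form: Phi(beta0 + beta1 * x_i)
```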

For the ordinal probit regression, we still assume the same underlying regression structure for \(z_i\), but now we assume that there are multiple cutoffs/threshold values \(t_1<...<t_{K-1}\), where \(K\) is the number of categories. As a result, we have that:

\(y_i=1\) if \(z_i<t_1\)

\(y_i=2\) if \(t_1<z_i<t_2\)

\(y_i=3\) if \(t_2<z_i<t_3\)

...

\(y_i=K-1\) if \(t_{K-2}<z_i<t_{K-1}\)

\(y_i=K\) if \(t_{K-1}<z_i\)

In this context, it is not entirely clear which threshold should be set to zero. Unfortunately, if we do not fix one of the thresholds, we cannot estimate the intercept term \(\beta_0\) (the model would not be identifiable). For this reason, we end up assuming that \(\beta_0=0\) and therefore we rely on the following model for \(z_i\):

\[z_i \sim N(\beta_1 x_i,1)\]
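Here is a minimal sketch in Python (with made-up thresholds, slope, and covariate values) of how the latent variable \(z_i\) gets mapped to the observed ordinal categories:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up thresholds t_1 < t_2 < t_3 for K = 4 categories and a made-up slope.
thresholds = np.array([-1.0, 0.0, 1.5])
beta1 = 0.6
x = rng.normal(size=10)

# Latent variable z_i ~ N(beta_1 * x_i, 1), with the intercept fixed at zero.
z = rng.normal(loc=beta1 * x, scale=1.0)

# y_i = k whenever t_{k-1} < z_i < t_k (with t_0 = -inf and t_K = +inf).
y = np.searchsorted(thresholds, z) + 1   # categories coded 1, ..., K
print(np.column_stack((z.round(2), y)))
```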

Another way to think about this model is to eliminate the latent variable \(z_i\) by integrating it out. This enables us to write the likelihood in a much more compact way. Before jumping into this derivation, notice that

\[p(t_{j-1} < z_i<t_j)=p(z_i < t_j) - p(z_i < t_{j-1})\]

This is true because

  1. \(p(t_{j-1} < z_i<t_j)\) is given by the grey area below:

  2. \(p(z_i < t_j)\) is given by the blue area below:

  3. \(p(z_i < t_{j-1})\) is given by the red area below:

As a result, if we subtract the red area from the blue area, we obtain the grey area.
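If you prefer a numerical check over the pictures, the following Python sketch (with made-up values for the mean of \(z_i\) and for two consecutive thresholds) compares a Monte Carlo estimate of \(p(t_{j-1} < z_i < t_j)\) with the difference of the two normal CDFs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Made-up values: the mean of z_i and a pair of consecutive thresholds.
mu, t_lo, t_hi = 0.4, -0.5, 1.0

# Monte Carlo estimate of p(t_{j-1} < z_i < t_j) for z_i ~ N(mu, 1) ...
z = rng.normal(loc=mu, scale=1.0, size=1_000_000)
mc = np.mean((z > t_lo) & (z < t_hi))

# ... versus the difference of the two normal CDFs (the "blue area" minus the "red area").
exact = norm.cdf(t_hi, loc=mu) - norm.cdf(t_lo, loc=mu)
print(mc, exact)
```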

Deriving the likelihood for the ordinal probit model

Let \(e_i \sim N(0,1)\), so that \(z_i=\beta_1 x_i + e_i\). Similar to the derivation of the binary probit likelihood, we have that:

\[p(y_i=1|x_i)=p(z_i<t_1)=p(\beta_1 x_i + e_i<t_1)=p(e_i<t_1-\beta_1 x_i)=\Phi(t_1-\beta_1 x_i)\]

\[p(y_i=2|x_i)=p(t_1 < z_i<t_2)=p(z_i < t_2) - p(z_i < t_1) = p(e_i < t_2-\beta_1 x_i) - p(e_i < t_1-\beta_1 x_i) = \Phi(t_2-\beta_1 x_i) - \Phi(t_1-\beta_1 x_i)\]

\[p(y_i=3|x_i)=p(t_2 < z_i<t_3)=p(z_i < t_3) - p(z_i < t_2) = p(e_i < t_3-\beta_1 x_i) - p(e_i < t_2 -\beta_1 x_i) =\Phi(t_3-\beta_1 x_i) - \Phi(t_2-\beta_1 x_i)\]

We can keep on writing these equations, one for each possible category. Alternatively, we can simply write the likelihood for a single observation as:

\[p(y_i|x_i) = \prod_{k=1}^K [\Phi(t_k - \beta_1 x_i ) - \Phi(t_{k-1} - \beta_1 x_i )]^{I(y_i=k)}\]

where we assume that \(t_0=-\infty\) and \(t_K=\infty\).
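Putting everything together, here is a minimal sketch in Python (with made-up data and parameter values; the function name is just for illustration) that evaluates the log of this likelihood by picking, for each observation, the term \(\Phi(t_{y_i}-\beta_1 x_i)-\Phi(t_{y_i-1}-\beta_1 x_i)\) and summing the logs:

```python
import numpy as np
from scipy.stats import norm

def ordinal_probit_loglik(y, x, thresholds, beta1):
    """Sum over i of log[Phi(t_{y_i} - beta1*x_i) - Phi(t_{y_i - 1} - beta1*x_i)].

    `y` holds categories coded 1, ..., K; `thresholds` holds t_1 < ... < t_{K-1}.
    """
    t = np.concatenate(([-np.inf], thresholds, [np.inf]))  # t_0 = -inf, t_K = +inf
    upper = norm.cdf(t[y] - beta1 * x)       # Phi(t_{y_i}     - beta1 * x_i)
    lower = norm.cdf(t[y - 1] - beta1 * x)   # Phi(t_{y_i - 1} - beta1 * x_i)
    return np.sum(np.log(upper - lower))

# Made-up data and parameter values for a 4-category outcome.
y = np.array([1, 2, 2, 3, 4])
x = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])
print(ordinal_probit_loglik(y, x, thresholds=np.array([-1.0, 0.0, 1.5]), beta1=0.6))
```

In practice, this is the quantity one would maximize (or embed in a Bayesian sampler) over \(\beta_1\) and the thresholds, subject to the ordering constraint \(t_1 < ... < t_{K-1}\).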

Why am I showing all these equations?

I like to explicitly represent the latent continuous variable \(z_i\) because:

  1. it is easier to understand how the model was conceptualized, rather than having to make sense of the equations shown above on their own;

  2. modeling results are easier to interpret if we think about this latent continuous variable.

Comments?

Send me an email at
