Shrinkage in mixed models

To illustrate what is meant by shrinkage in mixed-effects models, we will start with a simpler model that does not contain any covariates. Each county \(k\) has its own intercept \(\beta_{0k}\), and \(y_{ik}\) denotes the i-th observation from county k. The posterior distribution for this model is given by:

\[p(\beta_{01},\ldots,\beta_{0K},\gamma,\sigma^2,\tau^2|y)\propto [\prod_{k=1}^K \prod_{i=1}^{n_k} N(y_{ik}|\beta_{0k},\sigma^2)][\prod_{k=1}^K N(\beta_{0k}|\gamma,\tau^2)]N(\gamma|0,10)Gamma(\frac{1}{\tau^2}|a_{\tau^2},b_{\tau^2})Gamma(\frac{1}{\sigma^2}|a_{\sigma^2},b_{\sigma^2})\]

The full conditional distribution (FCD) for \(\beta_{0k}\) is given by:

\[p(\beta_{0k}|...)\propto [\prod_{i=1}^{n_k} N(y_{ik}|\beta_{0k},\sigma^2)]N(\beta_{0k}|\gamma,\tau^2)\]

\[p(\beta_{0k}|...)= N([\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}]^{-1}[\frac{\sum_i y_{ik}}{\sigma^2}+\frac{\gamma}{\tau^2}],[\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}]^{-1})\]
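As a minimal sketch of how this FCD might be used, for instance inside a Gibbs sampler, the Python snippet below computes its mean and variance and draws one value of \(\beta_{0k}\). The function names `fcd_beta0k` and `draw_beta0k`, as well as the data and parameter values, are hypothetical and chosen only for illustration; the second argument of the normal distribution is treated as a variance, as in the formula above.

```python
import math
import random


def fcd_beta0k(y_k, sigma2, gamma, tau2):
    """Return (mean, variance) of the normal full conditional of beta_0k.

    y_k    : list of observations from county k (may be empty)
    sigma2 : within-county variance sigma^2
    gamma  : global intercept
    tau2   : between-county variance tau^2
    """
    n_k = len(y_k)
    precision = n_k / sigma2 + 1.0 / tau2      # n_k/sigma^2 + 1/tau^2
    variance = 1.0 / precision
    mean = variance * (sum(y_k) / sigma2 + gamma / tau2)
    return mean, variance


def draw_beta0k(y_k, sigma2, gamma, tau2, rng=random):
    """Draw one value of beta_0k from its full conditional distribution."""
    mean, variance = fcd_beta0k(y_k, sigma2, gamma, tau2)
    # random.gauss expects a standard deviation, not a variance
    return rng.gauss(mean, math.sqrt(variance))


# Hypothetical observations and parameter values, for illustration only
y_k = [2.1, 1.7, 2.6]
rng = random.Random(0)
print(fcd_beta0k(y_k, sigma2=1.0, gamma=0.0, tau2=4.0))
print(draw_beta0k(y_k, sigma2=1.0, gamma=0.0, tau2=4.0, rng=rng))
```

In a full sampler, this draw would be repeated for every county at each iteration, using the current values of \(\gamma\), \(\sigma^2\) and \(\tau^2\).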

Although we do not provide the step-by-step derivation of this result, it nevertheless helps to illustrate what shrinkage is. In particular, notice that:

\[E[\beta_{0k}|...]=[\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}]^{-1}[\frac{\sum_i y_{ik}}{\sigma^2}+\frac{\gamma}{\tau^2}]=\frac{\frac{n_k}{\sigma^2}\bar{y_k}+\frac{1}{\tau^2}\gamma}{\frac{n_k}{\sigma^2}+\frac{1}{\tau^2}}\]

In other words, this is a weighted average of the county mean \(\bar{y_k}\) and the global intercept \(\gamma\), where the weights are equal to \(\frac{n_k}{\sigma^2}\) and \(\frac{1}{\tau^2}\), respectively. It is instructive to look at two possible cases.
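The weighted-average form can be made concrete by normalizing the two weights so that they sum to one. The sketch below is only an illustration: the names `shrinkage_weight` and `conditional_mean` and all numeric values are hypothetical.

```python
def shrinkage_weight(n_k, sigma2, tau2):
    """Normalized weight placed on the county mean ybar_k (between 0 and 1)."""
    data_precision = n_k / sigma2     # weight on ybar_k
    prior_precision = 1.0 / tau2      # weight on gamma
    return data_precision / (data_precision + prior_precision)


def conditional_mean(ybar_k, n_k, sigma2, gamma, tau2):
    """E[beta_0k | ...] written as w * ybar_k + (1 - w) * gamma."""
    w = shrinkage_weight(n_k, sigma2, tau2)
    return w * ybar_k + (1.0 - w) * gamma


# Hypothetical values: 4 observations, sigma^2 = 2, tau^2 = 1
w = shrinkage_weight(n_k=4, sigma2=2.0, tau2=1.0)
mean = conditional_mean(ybar_k=10.0, n_k=4, sigma2=2.0, gamma=4.0, tau2=1.0)
print(f"weight on county mean: {w:.3f}")
print(f"conditional mean:      {mean:.3f}")
```

With these values the weight on \(\bar{y_k}\) is \(2/(2+1)=2/3\), so the conditional mean sits two thirds of the way from \(\gamma\) to \(\bar{y_k}\).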

Case 1: we have a lot of data from the k-th county (\(n_k\) is large).

In this case, \(E[\beta_{0k}|...]\) will be much closer to \(\bar{y_k}\) (the mean of the observations for county k) than to \(\gamma\) (i.e., there is little shrinkage of the estimate \(\beta_{0k}\) towards \(\gamma\)). In an extreme situation where \(n_k\) is very large, we have that

\[E[\beta_{0k}|...] \approx \frac{\frac{n_k}{\sigma^2}\bar{y_k}}{\frac{n_k}{\sigma^2}}=\bar{y_k}\]

In other words, when we have a lot of information, we place a lot of trust in the county-specific mean, and thus it makes sense that \(E[\beta_{0k}|...] \approx \bar{y_k}\).

Case 2: we have little data for county k (\(n_k\) is small).

In this case, \(E[\beta_{0k}|...]\) will be closer to \(\gamma\) (i.e., the estimate \(\beta_{0k}\) will be shrunk towards the global intercept \(\gamma\)). In the extreme scenario where \(n_k=0\), we have

\[E[\beta_{0k}|...]=\frac{\frac{1}{\tau^2}\gamma}{\frac{1}{\tau^2}}=\gamma\]

In other words, when we don’t have much information for a particular county k, it seems reasonable to rely more heavily on the overall mean and thus it makes sense that \(E[\beta_{0k}|...]\approx \gamma\).

In general, \(E[\beta_{0k}|...]\) lies between \(\gamma\) and \(\bar{y_k}\), and its proximity to each of these end points depends on the number of observations in the county, weighed against the variances \(\sigma^2\) and \(\tau^2\).
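To see both cases at once, the following sketch evaluates \(E[\beta_{0k}|...]\) for a range of sample sizes while holding everything else fixed. All numeric values (\(\gamma=0\), \(\bar{y_k}=10\), \(\sigma^2=4\), \(\tau^2=1\)) are hypothetical and serve only to illustrate the pattern.

```python
def conditional_mean(ybar_k, n_k, sigma2, gamma, tau2):
    """E[beta_0k | ...] as the precision-weighted average of ybar_k and gamma."""
    numerator = (n_k / sigma2) * ybar_k + (1.0 / tau2) * gamma
    denominator = n_k / sigma2 + 1.0 / tau2
    return numerator / denominator


# Hypothetical values, for illustration only
gamma, ybar_k, sigma2, tau2 = 0.0, 10.0, 4.0, 1.0
sample_sizes = [0, 1, 5, 20, 100, 1000]

means = [conditional_mean(ybar_k, n, sigma2, gamma, tau2) for n in sample_sizes]
for n, m in zip(sample_sizes, means):
    print(f"n_k = {n:4d}   E[beta_0k | ...] = {m:.3f}")
```

The printed values start at \(\gamma\) when \(n_k=0\) and move steadily towards \(\bar{y_k}\) as \(n_k\) grows, without ever leaving the interval between the two.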


