Why do we need the MH algorithm?

The workhorse algorithm for Bayesian statistics is the Gibbs sampler, which consists of sampling each parameter (or set of parameters) from its full conditional distribution (FCD) until all parameters have been updated, and repeating this cycle until the chain has converged. The Metropolis-Hastings (MH) algorithm is useful as part of a Gibbs sampler because sometimes it is not straightforward to sample from a given FCD, typically because the FCD is not a standard distribution (e.g., a normal, beta, or gamma distribution).

For example, recall the problem of estimating how the number of species S is related to area A. In this problem, we are interested in estimating the parameters c and z in the species-area curve given by:

\[S=cA^z\]

One common solution to this problem is to take logs on both sides and then use a Gaussian regression. If we did that, we would have the following model:

\[log(S_i) \sim N(log(c)+z\times log(A_i),\sigma^2)\]

We can re-write this as:

\[log(S_i) \sim N(\beta_0+\beta_1 \times log(A_i),\sigma^2)\] where \(\beta_0=log(c)\), \(\beta_1=z\), and our covariate is \(log(A_i)\).
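To make this concrete, here is a minimal R sketch of the log-log approach. The species counts S and areas A below are made-up numbers, used purely for illustration:

```r
# Hypothetical species counts and island areas (invented for illustration only)
S <- c(20, 35, 55, 90, 140)
A <- c(2, 10, 50, 200, 1000)

# Gaussian regression on the log-log scale: log(S) = beta0 + beta1*log(A) + error
fit_log <- lm(log(S) ~ log(A))
coef(fit_log)  # intercept estimates log(c), the log(A) coefficient estimates z
```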

Instead of taking the log on both sides and then using a Gaussian regression, perhaps we want to model our data on its original scale. In particular, we might want to appropriately acknowledge that the number of species is a discrete variable. In this case, we can use a Poisson distribution to model the number of species:

\[S_i \sim Poisson(exp(\beta_0+\beta_1 \times log(A_i)))\]
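As a sketch of what such data look like, we can simulate species counts from this Poisson model in R. The "true" parameter values and areas below are invented for illustration, and this simulated data set is reused in the sampler sketch further down:

```r
# Simulate species counts from the Poisson species-area model
set.seed(1)
A    <- c(1, 5, 20, 80, 300, 1000)   # hypothetical island areas
logA <- log(A)
beta0_true <- 2                      # log(c), chosen arbitrarily
beta1_true <- 0.3                    # z, chosen arbitrarily
S <- rpois(length(A), lambda = exp(beta0_true + beta1_true * logA))
S
```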

It is hard, if not impossible, to come up with suitable priors for \(\beta_0\) and \(\beta_1\) so that the FCDs are available in closed form. In other words, we might not be able to find conjugate priors. It is for these situations that the Metropolis-Hastings algorithm is useful. Assuming independent priors for \(\beta_0\) and \(\beta_1\), each iteration of our Gibbs sampler will consist of:

  1. Sample from \(p(\beta_0|...)\propto [\prod_i Poisson(S_i|exp(\beta_0+\beta_1 \times log(A_i)))]p(\beta_0)\)
  2. Sample from \(p(\beta_1|...)\propto [\prod_i Poisson(S_i|exp(\beta_0+\beta_1 \times log(A_i)))]p(\beta_1)\)

where samples from each of these FCDs will be generated by applying the MH algorithm to these expressions.
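To illustrate what this looks like in practice, here is a minimal Metropolis-within-Gibbs sketch in R for the Poisson model above, using the simulated S and logA from the previous sketch. The priors (independent Normal(0, 10^2) for \(\beta_0\) and \(\beta_1\)), the random-walk normal proposals, the starting values, and all function names are assumptions made for illustration, not something prescribed by the model:

```r
# Log of the unnormalized posterior (Poisson log-likelihood + log priors);
# each FCD is proportional to this, viewed as a function of one beta at a time
log_post_beta <- function(b0, b1, S, logA) {
  sum(dpois(S, lambda = exp(b0 + b1 * logA), log = TRUE)) +
    dnorm(b0, mean = 0, sd = 10, log = TRUE) +
    dnorm(b1, mean = 0, sd = 10, log = TRUE)
}

mh_gibbs <- function(S, logA, n_iter = 5000, sd_prop = 0.1) {
  res <- matrix(NA, nrow = n_iter, ncol = 2,
                dimnames = list(NULL, c("beta0", "beta1")))
  b0 <- 0; b1 <- 0                      # arbitrary starting values
  for (t in 1:n_iter) {
    # 1. MH update for beta0, holding beta1 fixed
    b0_new <- rnorm(1, mean = b0, sd = sd_prop)
    log_ratio <- log_post_beta(b0_new, b1, S, logA) -
                 log_post_beta(b0,     b1, S, logA)
    if (log(runif(1)) < log_ratio) b0 <- b0_new

    # 2. MH update for beta1, holding beta0 fixed
    b1_new <- rnorm(1, mean = b1, sd = sd_prop)
    log_ratio <- log_post_beta(b0, b1_new, S, logA) -
                 log_post_beta(b0, b1,     S, logA)
    if (log(runif(1)) < log_ratio) b1 <- b1_new

    res[t, ] <- c(b0, b1)
  }
  res
}

samples <- mh_gibbs(S, logA)
colMeans(samples[-(1:1000), ])  # posterior means after discarding burn-in
```

Because the normal random-walk proposal is symmetric, the Hastings correction cancels and the acceptance ratio is simply the ratio of the unnormalized FCDs at the proposed and current values, which is why the unknown proportionality constant is never needed.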

If you think about it, this is almost magical. How can we draw from a distribution that we only know up to a proportionality constant? This distribution might not be any standard or known distribution!

Why should I learn about the MH algorithm if I only use JAGS?

  1. Up to this point, we have only talked about conjugate likelihood-prior pairs. If you recall, this requires us to make very specific choices so that the math works. How can JAGS (and other software) run a Gibbs sampler for arbitrary likelihood and prior combinations? There are several algorithms that enable sampling from a distribution that is only known up to a normalizing constant. The MH algorithm is one of the simplest, but it is still commonly used to this day. I believe it is useful to know how this algorithm works because it can help us gain a deeper understanding of how JAGS (and other software) ultimately does its magic.

  2. Say we are interested in estimating the parameters of a more complex dynamic model (e.g., a hydrological model, a disease dynamics model, or a matrix population model). You could try to program this dynamic model within JAGS, but it will probably be too slow and too complicated. In this situation, it becomes handy to know how to program your own Gibbs sampler and use the MH algorithm to generate samples from the FCDs.

  3. Some software (e.g., NIMBLE) allows you to pick which sampler you want to use and which parameters you want to sample as a block. Knowing how the MH algorithm works, including its advantages and disadvantages, can help you make these decisions.

  4. Many software packages (e.g., NIMBLE and JAGS) rely on algorithms that need to be tuned/adapted to perform efficiently. For example, the JAGS manual states that:

“When a model is initialized, it may be in adaptive mode, meaning that the Samplers used by the model may modify their behaviour for increased efficiency. Since this adaptation may depend on the entire sample history, the sequence generated by an adapting sampler is no longer a Markov chain, and is not guaranteed to converge to the target distribution. Therefore, adaptive mode must be turned off at some point during burn-in, and a sufficient number of iterations must take place after the adaptive phase to ensure successful burn-in”

Even if the software relies on a different algorithm, understanding the MH algorithm can give you some insight into what is meant by adaptation.
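For intuition, here is one common flavor of adaptation (not necessarily what JAGS itself does): during burn-in, the proposal standard deviation of a random-walk MH update is tuned so that the acceptance rate approaches some target value, and the tuning is switched off afterwards so that the chain is a genuine Markov chain again. The function and variable names below are hypothetical:

```r
# Sketch of adapting the proposal sd toward a target acceptance rate
adapt_sd <- function(sd_prop, accept_rate, target = 0.44) {
  # accepting too often means proposals are too timid, so widen them (and vice versa)
  if (accept_rate > target) sd_prop * 1.1 else sd_prop * 0.9
}

# Hypothetical use inside the burn-in part of an MH loop, checking every 50 iterations:
# if (t %% 50 == 0 && t <= n_burn) {
#   sd_prop    <- adapt_sd(sd_prop, n_accepted / 50)
#   n_accepted <- 0
# }
```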


