Introduction to Bayesian statistics

Introduction

Much has already been said about the differences and commonalities between the Frequentist and Bayesian approaches to statistics. While the two approaches may provide similar inference for simple models, their inference often diverges as models become more complex. Here I will try to highlight some of the key differences.

Definition of probability

One would naively think that statisticians would agree on the definition of a concept as fundamental as probability. However, there are important differences in how probability is defined in Frequentist and Bayesian statistics.

In Frequentist statistics, the probability of a particular event is often defined as the long-term frequency of that event (this is the origin of the name “Frequentist” statistics). This seems to be a useful definition for phenomena that occur repeatedly and are very well defined (e.g., the probability of rolling a 6 on a fair die) but is not necessarily a good definition for other types of events (e.g., the probability of nuclear war or of Obama being elected president).
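To make the long-term frequency idea concrete, here is a minimal sketch that simulates rolls of a fair die and reports how often a 6 comes up. The function name `relative_frequency_of_six` and the fixed seed are my own choices for illustration; the only point is that the relative frequency should settle near 1/6 as the number of rolls grows.

```python
import random


def relative_frequency_of_six(n_rolls, seed=0):
    """Fraction of n_rolls simulated fair-die rolls that come up 6."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    sixes = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 6)
    return sixes / n_rolls


# The relative frequency is noisy for few rolls and stabilizes for many.
for n in (10, 1_000, 100_000):
    print(n, round(relative_frequency_of_six(n), 4))
```

Notice that this only works because we can roll the die as many times as we like under identical conditions, which is exactly what is missing for one-off events.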

Critically, this definition of probability might seem to be very objective when in fact there is some subjectivity to it. For any event of interest (beyond coin flipping and die rolling), the analyst still has to define exactly how the long-term frequency is going to be calculated. For instance, the probability that it is going to rain tomorrow can be calculated based on the frequency of rain on days with similar conditions from the historical record. However, different definitions of what counts as “similar conditions” might lead to different results!

Another way of viewing probability is as a measure of our belief regarding the likelihood of a particular event. This is known as subjective probability (or degrees of belief) and is the basis of Bayesian statistics. One way to think about this is to think about a bet.

Say that we are interested in somebody’s subjective probability of event E (e.g., Brazil wins the World Cup). As suggested by (Winkler 1967), we can ask this person to choose between 2 bets:

Bet 1:

Win $A if E occurs.
Lose $B if E doesn’t occur.

Bet 2:

Win $B if E doesn’t occur.
Lose $A if E occurs.

If we keep asking this person to choose from similar bets in which we vary the amount of money that is earned or lost (i.e., A and B), we can calculate their subjective probability that event E will happen. Notice that there is no underlying idea of long-term relative frequency and it is likely that other folks will have very different probabilities than this person.
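One way to see how the bets reveal a probability is the sketch below. It assumes a risk-neutral person, i.e., somebody who picks Bet 1 whenever its expected payoff $p \times A - (1-p) \times B$ is positive. Under that assumption, the person is indifferent between the two bets when $p = B/(A+B)$, so each answer tells us whether their probability is above or below that value. The function names, the simulated respondent, and the bisection scheme are all hypothetical choices of mine, not something taken from Winkler.

```python
def indifference_probability(a, b):
    """Probability of E at which a risk-neutral person is indifferent
    between Bet 1 and Bet 2: p*a - (1 - p)*b = 0, so p = b / (a + b)."""
    return b / (a + b)


def prefers_bet_1(belief, a, b):
    """Simulated respondent: picks Bet 1 when its expected payoff is positive."""
    return belief * a - (1 - belief) * b > 0


def elicit_probability(choose_bet_1, total_stake=100.0, n_questions=30):
    """Bisection over offers (a, b) with a + b = total_stake.
    Each answer halves the interval that must contain the person's belief."""
    low, high = 0.0, 1.0
    for _ in range(n_questions):
        mid = (low + high) / 2
        b = mid * total_stake      # this offer has indifference point mid
        a = total_stake - b
        if choose_bet_1(a, b):
            low = mid              # belief is above mid
        else:
            high = mid             # belief is at or below mid
    return (low + high) / 2


hidden_belief = 0.3  # unknown to the interviewer in a real elicitation
estimate = elicit_probability(lambda a, b: prefers_bet_1(hidden_belief, a, b))
print(round(estimate, 4))
```

Real people are rarely risk-neutral, so in practice this kind of elicitation is only approximate.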

Probabilistic statements given the data that we have collected

95% confidence intervals vs 95% credible intervals

Scientists are often interested in making probabilistic statements given the data that they have collected. For instance, we might want to calculate an interval (say from L [lower bound] to U [upper bound]) that, given the data, has a 95% probability of encompassing the true parameter. We can write this as:

\[p(L<\theta<U|D)=0.95\]

where $\theta$ is the parameter you are interested in and $D$ are the data.

It turns out that this is the correct interpretation of a 95% credible interval generated from a Bayesian model BUT it is not the correct interpretation of a 95% confidence interval that arises from Frequentist methods.

The correct way of interpreting a 95% confidence interval generated from a Frequentist method is the following: under repeated sampling, 95% of the intervals created in this way will contain the true parameter. In other words, the 95% in Frequentist methods is the long-term success rate of the algorithm under repeated sampling. Unfortunately, nothing more specific can be said about the particular numbers L and U that were calculated based on the single dataset you collected.
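The "long-term success rate" can be checked by simulation. The sketch below is only an illustration under assumptions that I picked for convenience: normally distributed data with a known standard deviation, so that the usual interval is the sample mean plus or minus 1.96 standard errors. It generates many data sets from a known true mean and counts how often the interval covers it.

```python
import random


def coverage_of_95_interval(true_mean=10.0, sigma=2.0, n=25,
                            n_datasets=4000, seed=1):
    """Fraction of simulated data sets whose 95% interval contains true_mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # known-sigma z interval
    hits = 0
    for _ in range(n_datasets):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        lower, upper = xbar - half_width, xbar + half_width
        if lower < true_mean < upper:
            hits += 1
    return hits / n_datasets


print(coverage_of_95_interval())  # should be close to 0.95
```

Each individual interval either contains the true mean or it does not; the 95% describes the procedure, not any one pair of numbers.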

To calculate the probability that we actually want, $p(L<\theta<U|D)$, we will clearly need a prior because we can re-express this quantity using Bayes theorem as:

\[p(L<\theta<U|D)=\frac{p(D|L<\theta<U) \color{Red}{p(L<\theta<U)}}{p(D)}\]

In this expression, $p(L<\theta<U)$ is our prior belief that the parameter $\theta$ is between L and U.
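To illustrate how the prior enters this calculation, here is a rough sketch for a made-up example: 14 successes in 20 binomial trials, with $\theta$ being the success probability. It approximates $p(L<\theta<U|D)$ on a grid by multiplying the prior by the likelihood and normalizing. The data, the two priors, and all function names are assumptions of mine for illustration only.

```python
def posterior_interval_probability(successes, trials, lower, upper,
                                   prior_pdf, n_grid=20_000):
    """Grid approximation of p(lower < theta < upper | D) for binomial data.
    prior_pdf may be unnormalized; constants cancel in the ratio."""
    step = 1.0 / n_grid
    thetas = [(i + 0.5) * step for i in range(n_grid)]  # midpoints in (0, 1)
    weights = [
        prior_pdf(t) * t ** successes * (1 - t) ** (trials - successes)
        for t in thetas
    ]
    total = sum(weights)  # proportional to p(D)
    inside = sum(w for t, w in zip(thetas, weights) if lower < t < upper)
    return inside / total


def flat_prior(theta):
    """Uniform prior on (0, 1)."""
    return 1.0


def skeptical_prior(theta):
    """Unnormalized Beta(20, 20) prior, concentrated around 0.5."""
    return theta ** 19 * (1 - theta) ** 19


p_flat = posterior_interval_probability(14, 20, 0.5, 0.9, flat_prior)
p_skeptical = posterior_interval_probability(14, 20, 0.5, 0.9, skeptical_prior)
print(round(p_flat, 3), round(p_skeptical, 3))
```

The same data and the same interval give different posterior probabilities under the two priors, which is why the prior cannot be avoided if this is the quantity we want.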

P-values

Scientists are often interested in making other types of probabilistic statements (given the collected data) as well. For instance, we might want to calculate the probability of the null hypothesis given the data. We can write this quantity as:

\[p(H_0|D)\]

Unfortunately, people often interpret p-values as being this quantity. Under this interpretation, a small p-value would mean that the null hypothesis is improbable, so we could reject it and accept the alternative hypothesis. Again, although this is a very natural way to interpret p-values, it is incorrect. The correct interpretation of a p-value is the long-term relative frequency, if the null hypothesis were true and we were to collect a large number of data sets, of observing a statistic that is equal to or more extreme than the one actually observed. What a mouthful!
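The definition is easier to digest with a small example. The sketch below assumes a made-up experiment (14 heads in 20 coin flips, null hypothesis of a fair coin, one-sided test) and computes the p-value in two ways: exactly from the binomial distribution, and as the relative frequency of "at least as extreme" results among many data sets simulated under the null hypothesis. The function names and settings are hypothetical.

```python
import math
import random


def exact_p_value(successes, trials, theta0=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, theta0)."""
    return sum(
        math.comb(trials, k) * theta0 ** k * (1 - theta0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def simulated_p_value(successes, trials, theta0=0.5,
                      n_datasets=20_000, seed=2):
    """Fraction of data sets generated under H0 that are at least as extreme."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_datasets):
        heads = sum(1 for _ in range(trials) if rng.random() < theta0)
        if heads >= successes:
            extreme += 1
    return extreme / n_datasets


print(round(exact_p_value(14, 20), 4))
print(round(simulated_p_value(14, 20), 4))
```

Both numbers describe how data behave when the null hypothesis is assumed to be true; neither is the probability that the null hypothesis is true.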

To be able to calculate $p(H_0|D)$, we again will need a prior. To show this, notice that we can re-write this expression in the following way using Bayes theorem:

\[p(H_0|D)=\frac{p(D|H_0) \color{Red} {p(H_0)}}{p(D)}\]

In this expression, $p(H_0)$ is our prior probability regarding the null hypothesis.
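As a toy illustration of this formula, the sketch below assumes that there are only two candidate hypotheses for a coin: $H_0$ with success probability 0.5 and a single alternative $H_1$ with success probability 0.7. With only two hypotheses, $p(D)=p(D|H_0)p(H_0)+p(D|H_1)p(H_1)$. The data (14 heads in 20 flips), the alternative value 0.7, and the function names are all assumptions made for the sake of the example.

```python
import math


def binomial_likelihood(successes, trials, theta):
    """p(D | theta) for binomial data."""
    return (math.comb(trials, successes)
            * theta ** successes * (1 - theta) ** (trials - successes))


def posterior_prob_h0(successes, trials, prior_h0, theta0=0.5, theta1=0.7):
    """p(H0 | D) from Bayes theorem with two point hypotheses."""
    like_h0 = binomial_likelihood(successes, trials, theta0)
    like_h1 = binomial_likelihood(successes, trials, theta1)
    evidence = like_h0 * prior_h0 + like_h1 * (1 - prior_h0)  # p(D)
    return like_h0 * prior_h0 / evidence


# Same data, different prior probabilities for the null hypothesis.
for prior in (0.5, 0.9):
    print(prior, round(posterior_prob_h0(14, 20, prior), 3))
```

The answer changes with $p(H_0)$, so there is no way of reporting $p(H_0|D)$ without committing to a prior.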

Priors

Bayesians are concerned with determining what you know about a problem prior to collecting data and then updating these beliefs based on the data that were gathered. This is formally done through Bayes theorem (thus the name of Bayesian statistics). Although Bayes theorem is not controversial for either camp since it simply follows from accepted laws of probability, its use in statistical modeling is controversial (or at least used to be). The controversy arises from the fact that the analyst is forced to specify a prior (i.e., what you know before collecting the data). Depending on the prior that one picks, conclusions will be different, introducing subjectivity into a discipline typically seen as “objective”.

Some have argued that it is natural to include prior information since science is almost never done in a complete vacuum of knowledge. After all, scientists already rely on past work to formulate their questions and create their models. For example, even in the absence of priors, considerable prior information is present in Frequentist models (e.g., in the selection of covariates that are used for model building as well as in model specifications).

Likelihood

The likelihood function, denoted by $p(D|\theta)$, plays a central role in both the Bayesian and frequentist paradigms. However, as noted in (Bishop 2009), “the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, $\theta$ is considered to be a fixed parameter, whose value is determined by an estimator, and error bars on this estimate are obtained by considering the distribution of possible data sets $D$. By contrast, from the Bayesian viewpoint, there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over $\theta$”.

Summary

In short, scientists often want to make probabilistic statements given the data that they have collected but do not like to use priors. As a result, when using Frequentist methods, we unfortunately often give up measuring what we actually want and settle for very awkward statements, just to avoid relying on priors.

Importantly, many applied scientists have turned to Bayesian statistics for more pragmatic reasons: for many models, fitting and obtaining uncertainty estimates is easier in a Bayesian framework than in a Frequentist framework. For example, I have relied heavily on Bayesian models for my work due to their flexibility in tackling all sorts of problems, allowing me to customize the model to the problem at hand rather than the other way around. The need for customization arises because data often have multiple pathologies (e.g., temporal and/or spatial correlation, zero-inflation, over-dispersion, etc.). Furthermore, some interesting biological hypotheses are better represented by customizing the model (e.g., by allowing for non-linearity or an explicit representation of hidden biological processes) rather than by transforming the data so that a standard statistical model can be used (as we are often taught in introductory statistics courses).

Finally, many of the modern tools that scientists are currently using (e.g., Generalized Additive Models [GAMs], LASSO, and ridge regression) are really Bayesian models in disguise, as these methods rely on penalization terms that can be interpreted as priors (Wood 2020; Hooten and Hobbs 2015).
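For ridge regression, the link between penalty and prior can be checked numerically. The sketch below uses a deliberately tiny model that I made up for illustration: $y = \beta x$ plus Gaussian noise with standard deviation $\sigma$, no intercept, and a Normal(0, $\tau^2$) prior on $\beta$. Under these assumptions, the posterior mode should coincide with the ridge estimate when the penalty is $\lambda = \sigma^2/\tau^2$. The data values and function names are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y = beta * x (no intercept):
    minimizes sum((y - beta * x)^2) + lam * beta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def log_posterior(beta, xs, ys, sigma, tau):
    """Log posterior up to a constant: Gaussian likelihood with noise sd sigma
    and a Normal(0, tau^2) prior on beta."""
    log_lik = -sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    log_prior = -beta ** 2 / (2 * tau ** 2)
    return log_lik + log_prior


def map_slope(xs, ys, sigma, tau, lo=-10.0, hi=10.0, n_grid=200_001):
    """Posterior mode found by brute-force grid search over [lo, hi]."""
    step = (hi - lo) / (n_grid - 1)
    grid = (lo + i * step for i in range(n_grid))
    return max(grid, key=lambda b: log_posterior(b, xs, ys, sigma, tau))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]
sigma, tau = 1.0, 0.5
beta_ridge = ridge_slope(xs, ys, lam=sigma ** 2 / tau ** 2)
beta_map = map_slope(xs, ys, sigma, tau)
print(round(beta_ridge, 3), round(beta_map, 3))
```

A tighter prior (smaller $\tau$) corresponds to a larger penalty and therefore to more shrinkage of the slope towards zero.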

The downside of Bayesian statistics is that it requires some level of understanding of probability and distributional theory and a lot of computation, since almost everything nowadays is done through simulations. In this course, I will try to separate the theory behind Bayesian models from computation as much as possible because learning both at the same time makes for quite a steep learning curve.

For more background information regarding the pros and cons of Bayesian statistics, I recommend (Ellison 1996), (Ellison 2004), and pages 1-13 and 31-45 in (McCarthy 2007). Many of the ideas on this page come from (Chechile 2020).

References

Bishop, C. M. 2009. Pattern Recognition and Machine Learning. Springer.

Chechile, R. A. 2020. Bayesian Statistics for Experimental Scientists. The MIT Press.

Ellison, A. M. 1996. “An Introduction to Bayesian Inference for Ecological Research and Environmental Decision-Making.” Ecological Applications 6: 1036–46.

Ellison, A. M. 2004. “Bayesian Inference in Ecology.” Ecology Letters 7: 509–20.

Hooten, M. B., and N. T. Hobbs. 2015. “A Guide to Bayesian Model Selection for Ecologists.” Ecological Monographs 85: 3–28.

McCarthy, M. A. 2007. Bayesian Methods for Ecology. Cambridge University Press.

Winkler, R. L. 1967. “The Quantification of Judgment: Some Methodological Suggestions.” Journal of the American Statistical Association 62: 1105–20.

Wood, S. N. 2020. “Inference and Computation with Generalized Additive Models and Their Extensions.” TEST 29: 307–39.