Learning objectives

The goal of this example is to:

  • introduce the concept of conjugate pairs
  • illustrate the concept of conjugate pairs using a binomial-beta pair
  • distinguish uninformative priors from informative priors

1. Conjugate pairs

Recall that we want to make statements about the parameters after we have learned all we could from our data. In other words, we are after the posterior distribution:

\[p(\theta|D)\propto p(D|\theta)p(\theta) \] The idea of conjugate pairs is that we want to pick a prior \(p(\theta)\) such that, when it is combined with the likelihood \(p(D|\theta)\), we get a posterior \(p(\theta|D)\) that belongs to the same family of distributions as our prior, albeit with different parameters.

Examples of conjugate pairs are:

  • a beta prior with a binomial (or Bernoulli) likelihood
  • a gamma prior with a Poisson likelihood
  • a normal prior with a normal likelihood (known variance)
  • a Dirichlet prior with a multinomial likelihood

Before the advent of fast computers and MCMC algorithms, Bayesian statistics relied almost entirely on these conjugate pairs because, in these special cases, the posterior distribution (i.e., the main object of interest for Bayesian statistics) is available in closed form. Historically, in the absence of conjugacy, we could not analyze the posterior distribution and the analysis stopped there. This was the major bottleneck for Bayesian statistics for a long time because it meant that Bayesian statistics could only be used for relatively simple and uninteresting models.

Nowadays, even if things are not conjugate, we can still fit all sorts of models within a Bayesian framework, some of which are too complex to fit within other frameworks. So why do we spend time talking about conjugacy? Conditionally conjugate pairs are used a lot as part of more complex models and are key for computationally efficient algorithms. Furthermore, I strongly believe that working with the equations that describe how the likelihood and the prior come together for conjugate pairs to form the posterior distribution provides a level of understanding and intuition that is hard to provide in a different manner.

2. A basketball example

Let’s put this into concrete terms. Say that I am the owner of a basketball team and I am recruiting players. A guy called Daniel arrives and wants to play on my team. To determine if he is a good basketball player, I ask him to take 10 free throws. Let \(y_i\) denote the outcome of throw \(i\), equal to 1 if he scores and 0 otherwise. Because this is a binary outcome, I assume that:

\[y_i \sim Bernoulli(\pi_i)\]

If I further assume that these throws are independent and that all the probabilities \(\pi_i\) are the same from one throw to the next, I can write the likelihood as:

\[p({y_1,...,y_{10}}|\pi)=\prod_{i=1}^{10} Bernoulli(y_i|\pi)\]

This is equivalent to assuming that the total number of successful throws \(x=\sum_{i=1}^{10} y_i\) comes from a binomial distribution:

\[X \sim Binomial(n=10,\pi) \] Alternatively, we can write this as:

\[p(X|\pi,n=10)=\frac{10!}{x!(10-x)!}\pi^x(1-\pi)^{10-x}\]
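As a quick sanity check, the pmf above can be written out by hand in R and compared against the built-in dbinom function (the particular numbers below are just an illustration):

```r
#evaluate the binomial pmf by hand and compare with R's dbinom
n=10; x=3; p=0.5
by.hand=factorial(n)/(factorial(x)*factorial(n-x))*p^x*(1-p)^(n-x)
built.in=dbinom(x,size=n,prob=p)
by.hand-built.in #should be (essentially) zero

#the product of 10 Bernoulli pmfs times the number of possible orderings
#also recovers the binomial pmf
y=c(1,0,0,1,0,1,0,0,0,0) #one particular sequence with x=3 successes
prod(dbinom(y,size=1,prob=p))*choose(n,x)
```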

Recall that, to get our posterior distribution of \(\pi\) (the parameter that summarizes how good the player is), we need a prior. Thus, we need to come up with a prior that summarizes how good we think this guy is prior to seeing him shoot.

2.1. Specifying the prior

We can specify our prior beliefs in two ways:

  • we can discretize \(\pi\) and assign probabilities to individual values of \(\pi\); or
  • we can assign a pdf to \(\pi\) so that all possible values of \(\pi\) have a prior density.

2.1.1. Assign probabilities to discrete values of \(\pi\)

One way of doing this is to think about discrete values for \(\pi\) and assign probabilities to these values. For instance, based on Daniel’s credentials as a basketball player, say that we were willing to assume that:

\[p(\pi=0.7)=0.2\] \[p(\pi=0.5)=0.75\] \[p(\pi=0.1)=0.05\]

With this prior, we are assuming that Daniel has some probability of being very good (\(\pi=0.7\)), being medium (\(\pi=0.5\)) and a relatively low probability of being horrible (\(\pi=0.1\)). Notice that this prior states that these are the only 3 values that \(\pi\) can take (i.e., other values for \(\pi\) have zero probability), which is admittedly a relatively artificial example because we know that \(\pi\) can be any real number between 0 and 1.

But… how can we specify our prior beliefs for all the infinitely many values between 0 and 1 that \(\pi\) can take?

2.1.2. Use a pdf to represent prior beliefs on continuous values of \(\pi\)

We can rely on a pdf to specify our prior beliefs such that all the infinitely many values between 0 and 1 that \(\pi\) can take have some weight. A natural choice of prior for \(\pi\) is the beta distribution (why is that?). A beta distribution has two parameters (\(a\) and \(b\)), but how do we decide which \(a\) and \(b\) we should use?

There are multiple ways of trying to determine what \(a\) and \(b\) ought to be. Here is one way of doing this if we have the following two pieces of information:

  1. What is the average number of times this player would score out of 1,000 shots?
  2. What is the value Z for which we believe that it is extremely unlikely that this player will score less than Z out of 1,000 shots?

Here I assume that “extremely unlikely” in (2) means the probability of this happening is equal to 0.01.

Say that the answer to question (1) was 500 (grey line in figure below) while the answer to question (2) was 100 (red line in figure below). I will translate this information into the following expressions:

\[E[\pi]=\mu=\frac{500}{1000}\] \[p(\pi<\frac{100}{1000})=0.01\]

Given these numbers, I can say that our collective belief regarding the skills of this player is given by the following beta distribution:

\[p(\pi)=Beta(\pi|2.86,2.86)\]

I come up with this result by trying different sets of a and b values and picking the combination that comes closest to satisfying the equations above:

#calculate what the beta parameters a and b ought to be to match our prior beliefs

#prior beliefs are expressed as lo (it is extremely unlikely he scores less than a proportion lo) and prior.mean (his expected scoring proportion)
lo=100/1000
prior.mean=500/1000

#generate combinations of parameters a and b that have the corresponding prior mean
a=seq(from=0.001,to=10,length.out=1000) #this range was chosen arbitrarily
b=a*(1-prior.mean)/prior.mean

#out of these combinations of a and b, choose the one that results in p(theta<lo)=0.01 (i.e., very low probability of being worse than lo)
prob=pbeta(lo,shape1=a,shape2=b)  #prob is equal to p(theta<lo) for different values of a and b

#let's see which prob is closest in absolute value to 0.01
calc=abs(prob-0.01) 
ind=which(calc==min(calc))

a1=a[ind]; a1 #this is the selected a parameter
## [1] 2.863577
b1=b[ind]; b1 #this is the selected b parameter
## [1] 2.863577
seq1=seq(from=0,to=1,length.out=1000)
plot(seq1,dbeta(seq1,shape1=a1,shape2=b1),type='l',ylab='Density',xlab=expression(pi))
abline(v=lo,col='red')
abline(v=prior.mean,col='grey')

In this code, I use the following relationships:

\[E(\pi)=\mu=\frac{a}{a+b}\]

If we solve for b, we get: \[b=\frac{a(1-\mu)}{\mu}\]
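We can check this relationship numerically (a small sketch using the a value selected above):

```r
#check that b=a*(1-mu)/mu yields a beta distribution with mean mu
mu=0.5
a=2.863577
b=a*(1-mu)/mu
a/(a+b) #theoretical mean of Beta(a,b); should equal mu
```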

2.2. Posterior distribution

Now our player Daniel is going to shoot. He scores 1 out of 10. How have our prior beliefs changed given that we have seen some data on Daniel’s performance? Below I will show the posterior under the prior for discrete values of \(\pi\) as well as the posterior under the Beta prior for \(\pi\).

2.2.1. Discrete values for \(\pi\)

We calculate the posterior distribution under the discrete prior for \(\pi\) using the table below:

x=1
n=10
pi=c(0.7,0.5,0.1)

#calculate prior
prior=c(0.2,0.75,0.05)

#calculate likelihood
likel=dbinom(x,size=n,prob=pi)

#calculate posterior
likel_prior=likel*prior
soma=sum(likel_prior)
post=likel_prior/soma

#bring everything together into a single data frame
fim=data.frame(param.values=pi,
               prior=prior,
               likelihood=likel,
               likel.times.prior=likel_prior,
               posterior=post)
round(fim,3)
##   param.values prior likelihood likel.times.prior posterior
## 1          0.7  0.20      0.000             0.000     0.001
## 2          0.5  0.75      0.010             0.007     0.274
## 3          0.1  0.05      0.387             0.019     0.725

Why do we calculate the posterior distribution in this way? Recall that:

\[p(\pi|X)=\frac{p(X|\pi)p(\pi)}{p(X)}=\frac{p(X|\pi)p(\pi)}{\sum_j p(X|\pi=j)p(\pi=j)}\]

Notice how our prior is different from our posterior. Before seeing the data, we put a lot of weight on \(\pi=0.5\), some weight on \(\pi=0.7\), and very little weight on \(\pi=0.1\). However, after seeing the data, we are more inclined to believe that Daniel is not that good of a basketball player because \(\pi=0.1\) has a lot more weight than the other \(\pi\) values.

Here is a depiction of the prior and the posterior distributions. Notice that, because the prior is discrete, the posterior will also be discrete:

#plot prior
fim1=fim[order(fim$param.values),]
plot(1:3,fim1$prior,type='h',xaxt='n',ylim=c(0,1),xlab='',ylab='Probability',xlim=c(0.5,3.5))
axis(side=1,at=1:3,fim1$param.values)

#plot posterior
for (i in 1:3){
  lines(rep(i+0.1,2),c(0,fim1$posterior[i]),col='blue')
}

#legend
legend(2,1,col=c('black','blue'),lty=1,c('Prior','Posterior'))

2.2.2. Continuous values for \(\pi\)

Now, instead of using this discrete prior for \(\pi\), let’s use the beta distribution prior which allows for \(\pi\) to be any continuous number between 0 and 1. Recall that the beta distribution is given by:

\[p(\pi)= \frac{1}{B(\alpha,\beta)}\pi^{\alpha-1} (1-\pi)^{\beta-1}\] What we want is to combine the likelihood with this prior to find the posterior distribution of \(\pi\). Here we are only interested in \(\pi\), and everything that does not involve \(\pi\) can be regarded as a constant. Therefore:

\[p(\pi|X)\propto p(X|\pi)p(\pi)\] \[\propto Binom(X|\pi,n)Beta(\pi|\alpha,\beta)\]

\[\propto K_1 \pi^x (1-\pi)^{n-x}K_2 \pi^{\alpha-1} (1-\pi)^{\beta-1}\]

where \(K_1=\frac{n!}{x!(n-x)!}\) and \(K_2=\frac{1}{B(\alpha,\beta)}\). We can ignore these guys because of our proportionality sign, yielding

\[\propto\pi^x (1-\pi)^{n-x} \pi^{\alpha-1} (1-\pi)^{\beta-1}\] \[\propto\pi^{(x+\alpha)-1}(1-\pi)^{(n-x+\beta)-1}\] If we compare this last expression with that of a beta distribution, you will (hopefully) notice that they are very similar. Indeed, we can make some replacements:

\[p(\pi|X)\propto \pi^{\alpha_1-1}(1-\pi)^{\beta_1-1}\]

where \(\alpha_1=x+\alpha\) and \(\beta_1=n-x+\beta\). It turns out that our posterior distribution is also a beta distribution, but with different parameters from our prior.

More specifically, the posterior distribution of \(\pi\) for Daniel is given by:

\[p(\pi|X)=Beta(x+\alpha,n-x+\beta)=Beta(1+2.86,10-1+2.86)\]

In this example, the binomial likelihood and the beta prior for the probability \(\pi\) form a conjugate pair because the posterior belongs to the same family of distributions as the prior, just with different parameters. In general, we can deduce the posterior based solely on \(p(\pi|X)\) up to a proportionality constant, without having to go through the mathematical formalism shown in the “Miscellanea” section.
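Numerically, the conjugate update just adds the number of successes to \(\alpha\) and the number of failures to \(\beta\). Here is a small sketch using the prior parameters derived earlier:

```r
#conjugate beta-binomial update: add successes to a and failures to b
a=2.86; b=2.86 #prior parameters
x=1; n=10      #Daniel scored 1 out of 10
a1=a+x         #posterior parameter
b1=b+n-x       #posterior parameter
a/(a+b)        #prior mean (0.5)
a1/(a1+b1)     #posterior mean, pulled towards the observed proportion 1/10
```

Notice that the posterior mean lands between the prior mean (0.5) and the sample proportion (0.1) because it is a weighted average of the two, with the data receiving more weight as \(n\) grows relative to \(\alpha+\beta\).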

Here is a depiction of the prior and the posterior distributions. Notice that, because the prior is continuous, the posterior will also be continuous:

seq1=seq(from=0,to=1,length.out=1000)

#plot prior
plot(seq1,dbeta(seq1,2.86,2.86),type='l',ylim=c(0,4),xlab='',ylab='Density')

#plot posterior
lines(seq1,dbeta(seq1,1+2.86,10-1+2.86),col='blue')

#legend
legend(0.5,4,col=c('black','blue'),lty=1,c('Prior','Posterior'))

Final remarks

It can be challenging to determine the prior distribution based on expert feedback because humans are not particularly good at providing coherent statements. For example, as described in Robert (2007), “a study in the New England Journal of Medicine showed that 44% of the questioned individuals were ready to undertake a treatment against lung cancer when told that the survival probability was 68%. However, only 18% were still willing to undertake it when told that the probability of death was 32%”. Notice that the survival probability is the same in both scenarios, but we get different results because the statement is framed differently.

A nice review of some of the methods and common problems associated with eliciting informative prior distributions can be found in (Falconer et al. 2022).




References

Falconer, J. R., E. Frank, D. L. L. Polaschek, and C. Joshi. 2022. “Methods for Eliciting Informative Prior Distributions: A Critical Review.” Decision Analysis 19: 189–204.
Robert, C. P. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New Jersey: Springer Texts in Statistics.