Informative prior
In the quest for a non-informative prior for \(\pi\) in our earlier basketball example, people often choose a uniform distribution. However, does this prior really capture our beliefs? For instance, if we choose \(p(\pi)\) to be the uniform distribution, this implies that we think that
\[p(0.9<\pi<1)=0.1\]
Do we really think that Daniel has a 10% probability of being an extremely good basketball player (i.e., an NBA player waiting to be discovered)? This is important to think about because it illustrates that we are actually making many implicit assumptions (which we may or may not be aware of) even when using standard “vague” priors.
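As a quick sanity check, this probability can be computed directly in R (pbeta with both shape parameters equal to 1 is just the uniform distribution):

#probability that pi falls between 0.9 and 1 under a uniform (i.e., beta(1,1)) prior
pbeta(1,shape1=1,shape2=1)-pbeta(0.9,shape1=1,shape2=1) #0.1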
If we have a uniform prior \(p(\pi)\) (which turns out to be \(Beta(1,1)\)), then the posterior distribution is a beta distribution with \(\alpha_1=x+1\) and \(\beta_1=n-x+1\). Recall that in our example, Daniel scored 1 out of 10 (i.e., \(x=1\) and \(n=10\)). Let’s compare these two posterior distributions:
- under our “informative” prior \(p(\pi)=Beta(2.86,2.86)\), the posterior becomes \(Beta(1+2.86,10-1+2.86)\) (black line in the figure below);
- under the uniform prior \(p(\pi)=Beta(1,1)\), the posterior distribution becomes \(Beta(1+1,10-1+1)\) (red line in the figure below).
seq1=seq(from=0,to=1,length.out=1000)

#draw the "informative" prior (i.e., beta(2.86,2.86))
a=b=2.86
plot(seq1,dbeta(seq1,a,b),type='l',ylab='Density',
     xlab='Outcomes',main='Beta distribution',col='grey',
     ylim=c(0,5),xlim=c(0,1))

#draw the posterior distribution under our informative prior: beta(1+a,10-1+b)
a1=1+a
b1=10-1+b
lines(seq1,dbeta(seq1,shape1=a1,shape2=b1))

#draw the posterior distribution under the uniform beta(1,1) prior: beta(1+1,10-1+1)
a2=1+1
b2=10-1+1
lines(seq1,dbeta(seq1,shape1=a2,shape2=b2),col='red',lty=1)

legend(0.3,5,c('Posterior (uniform prior)','Posterior (informative prior)',
               'Prior beta(2.86,2.86)'),lty=1,col=c('red','black','grey'))
These comparisons reveal that, because we believed our basketball player to be better than his scores indicate, the resulting posterior (black) is shifted further to the right than the posterior obtained with the relatively uninformative flat prior (red).
Some general results for the Binomial-Beta conjugate pair
More generally, we can write: \[p(\pi|x)\propto Binom(x|n,\pi)Beta(\pi|a,b)\] \[\propto \pi^x(1-\pi)^{n-x} \pi^{a-1} (1-\pi)^{b-1}\] \[\propto \pi^{(x+a)-1}(1-\pi)^{(n-x+b)-1}\] which we recognize as the kernel of a beta distribution, so \[\pi|x\sim Beta(x+a,n-x+b)\] An important observation is that the posterior mean is given by: \[E[\pi|x]=\frac{x+a}{(x+a)+(n-x+b)}=\frac{x+a}{n+a+b}\] This observation is important for several reasons:
- when we used a uniform prior, this implied \(a=b=1\). However, we can make this prior even less informative by choosing smaller values for a and b (e.g., \(a=b=0.001\)). In the limit, as a and b go to zero, \(E(\pi|x)=x/n\), which is the MLE for \(\pi\).
- if a and b are large relative to x and \(n-x\), then we have a relatively strong prior. This is important because it reveals that the same prior can be highly informative in one situation (i.e., with few data) but inconsequential in another (i.e., with lots of data). For instance, if x and \(n-x\) are large relative to a and b, then again \(E(\pi|x)\approx x/n\).
- notice that if we have no data at all (i.e., \(x=n=0\)), then \(E(\pi|x)=a/(a+b)\), which is the prior mean. Indeed, the posterior distribution is then \(Beta(a,b)\). This makes sense: if we don’t have any data, we just get the prior back from our analysis (i.e., our beliefs do not change given that we have not seen any data). The short sketch after this list illustrates these limiting behaviors.
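To make this concrete, here is a small numerical illustration in R for Daniel’s data (\(x=1\), \(n=10\)); the helper function post.mean and the particular prior values are just for illustration:

#posterior mean for the binomial-beta conjugate pair
post.mean=function(x,n,a,b) (x+a)/(n+a+b)
x=1; n=10
post.mean(x,n,a=0.001,b=0.001) #nearly flat prior: close to the MLE x/n=0.1
post.mean(x,n,a=2.86,b=2.86)   #our informative prior: pulled towards 0.5
post.mean(x,n,a=100,b=100)     #very strong prior: stays near the prior mean 0.5
post.mean(0,0,a=2.86,b=2.86)   #no data: we get the prior mean a/(a+b) back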
A note on multiple binomial trials
Say Daniel was upset that he scored only 1 out of 10 and came back to do two additional sets of throws. Say we had the following results:
- 1st set of throws: \(x=1, n=10\)
- 2nd set of throws: \(x=2, n=10\)
- 3rd set of throws: \(x=3, n=5\)
There are two ways of analyzing these data:
- Sequential analysis: In this case, we have shown that the posterior distribution after the first set of throws is \(Beta(1+2.86,10-1+2.86)\). We can then use this posterior distribution as the prior for the second set of throws, yielding a new posterior distribution. This new posterior then becomes the prior for the third (and last) set of throws.
- All-at-once analysis: In this case, we pool the data into \(x=1+2+3=6\) and \(n=10+10+5=25\) and use our prior \(Beta(2.86,2.86)\) directly.
Do we get different answers if we do this analysis one way or another?
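One way to answer this is to carry out both updates numerically. Here is a minimal sketch in R; the update rule (add x to a, add n-x to b) follows directly from the conjugacy result above:

#sequential analysis: update the beta parameters one set of throws at a time
a=b=2.86
x=c(1,2,3); n=c(10,10,5)
for (i in 1:3){
  a=a+x[i]
  b=b+n[i]-x[i]
}
c(a,b) #posterior parameters after three sequential updates

#all-at-once analysis: pool the data and update once
a.pool=2.86+sum(x)
b.pool=2.86+sum(n)-sum(x)
c(a.pool,b.pool) #compare with the sequential result above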
Some things to think about
Would you select a 90% free throw shooter (based on 100 attempts) or a person that made 1 out of 1 free throw for your basketball team? Why?
Would it make a difference if you knew that the 90% free throw shooter was a professional and the other person was a random person selected off of the street? Why?