Why are we studying Monte Carlo integration?

  1. In Bayesian statistics, the main object of interest is the posterior distribution, which reflects what we now know about the parameters after looking at the data.
  2. Based on this posterior distribution, we report our results using several summary measures of this distribution, like the posterior mean and 95% credible intervals. Most of these summary measures are the result of integrals.
  3. There are multiple ways of calculating these integrals. For instance, we can try to find an analytical or a numerical solution. Within the numerical option, we can use either a deterministic algorithm (quadrature) or a stochastic algorithm (Monte Carlo integration); a short sketch contrasting these two numerical routes is given right after this list.
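
To make the distinction concrete, here is a minimal sketch (not part of the original example) that evaluates the same toy integral, \(\int_0^1 e^{-x^2}dx\), first with a deterministic quadrature routine (R's integrate()) and then with Monte Carlo integration based on uniform draws; the integrand was chosen purely for illustration.

#deterministic solution: adaptive quadrature
integrate(function(x) exp(-x^2),lower=0,upper=1)

#stochastic solution: Monte Carlo integration based on uniform draws on (0,1)
set.seed(1)
u=runif(100000)
mean(exp(-u^2))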

Bayesian statisticians rely on Monte Carlo integration because:

  1. We typically do not have a closed-form expression for the posterior distribution (i.e., it cannot be derived mathematically), which precludes analytical solutions.
  2. It is hard (if not impossible) to solve high-dimensional integrals using deterministic methods.

Why is integration important for statistics?

Although rarely acknowledged, the type of results/summaries that we want from a Bayesian statistical model can often be viewed as the result of an integral. For instance, after running their regression models involving several covariates (e.g., rainfall, NDVI, and temperature), people often want to create the following table:

              Estimate   Standard error   95% CI
Intercept        NA            NA           NA
Rainfall         NA            NA           NA
NDVI             NA            NA           NA
Temperature      NA            NA           NA

In relation to parameter estimates for \(\beta_{intercept},\beta_{rainfall},\beta_{NDVI},\beta_{temperature}\) (i.e., \(1^{st}\) column in the table), Bayesian statisticians would typically report the posterior mean. This is given by:

\[E(\beta_i|Y,X)=\int \beta_i f(\beta_i|Y,X) d\beta_i\]

where \(f(\beta_i|Y,X)\) is the marginal posterior distribution.

Furthermore, we might be interested in calculating the variability of this parameter, rather than only a point estimate (i.e., large variability might indicate that the results are not “significant”). A measure of variability is the standard error (\(2^{nd}\) column in the table), which can be calculated as the square root of the posterior variance:

\[Var(\beta_i|Y,X)=\int (\beta_i-E(\beta_i))^2 f(\beta_i|Y,X) d\beta_i\]

Similarly, a 95% posterior credible interval (\(3^{rd}\) column in the table) consists of the numbers L and U such that:

\[p(L < \beta_i <U)=\int_L^U f(\beta_i|Y,X)d\beta_i=0.95\]

In summary, integration helps us summarize various aspects of our modeling results.
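
To make this concrete, here is a minimal sketch of how each entry of the table corresponds to an integral that can be evaluated by deterministic quadrature. The Normal(0.3, sd = 0.1) marginal posterior used below is purely an assumption for illustration; it is not derived from any model in the text.

#assume (for illustration only) that the marginal posterior of one coefficient is Normal(0.3,sd=0.1)
post=function(x) dnorm(x,mean=0.3,sd=0.1)

#posterior mean: integral of x*f(x) ('Estimate' column)
m=integrate(function(x) x*post(x),lower=-Inf,upper=Inf)$value

#posterior standard deviation: square root of the posterior variance integral ('Standard error' column)
s=sqrt(integrate(function(x) (x-m)^2*post(x),lower=-Inf,upper=Inf)$value)

#95% equal-tailed credible interval: the L and U that leave 2.5% in each tail ('95% CI' column)
ci=qnorm(c(0.025,0.975),mean=0.3,sd=0.1)

c(estimate=m,std.error=s,lower=ci[1],upper=ci[2])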

How do we do integration with simulation?

Let’s go back to the basketball example. Say we know that the posterior distribution of \(\pi\) (our measure of the skillfulness of our globetrotter player) is a beta distribution with parameters a and b. Then we can algebraically calculate:

\[E(\pi)=\int x f_\pi (x)dx = \int x \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}dx\]

\[Var(\pi)=\int(x-\frac{a}{a+b})^2 f_\pi(x)dx=\int (x-\frac{a}{a+b})^2 \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}dx\]

Thankfully somebody has already shown that the results of these integrals are \(\frac{a}{a+b}\) and \(\frac{ab}{(a+b)^2 (a+b+1)}\), respectively.

Other quantities are harder to calculate algebraically, such as credible intervals. Perhaps we even want to calculate the probability that \(\pi\) is above a given NBA threshold \(T_{NBA}\), to see how likely it is that our basketball player will make it into the pros.

\[p(\pi>T_{NBA})=\int_{T_{NBA}}^1 \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}dx\]
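
For this particular integral, a deterministic evaluation is still straightforward because the beta cumulative distribution function is available in R. The values a = 45, b = 55, and \(T_{NBA}=0.5\) below are purely illustrative assumptions, not numbers from the basketball example.

#illustrative values for the posterior Beta(a,b) of pi and for the NBA threshold
a=45; b=55
T.nba=0.5

#quadrature: integrate the beta density from the threshold up to 1
integrate(function(x) dbeta(x,a,b),lower=T.nba,upper=1)$value

#same quantity obtained directly from the beta CDF
1-pbeta(T.nba,a,b)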

Instead of going into the weeds of finding analytical solutions to these integrals, another way of evaluating them is through simulation. Say the parameter that we are interested in has a posterior distribution given by a beta distribution (i.e., \(x \sim Beta(1,3)\)). If we want to calculate the probability \(p(0.5<x<0.8)=\int_{0.5}^{0.8} Beta(x|1,3)dx\), then we are essentially calculating the grey area in the figure below:

#draw beta distribution
x=seq(from=0,to=1,length.out=1000)
plot(x,dbeta(x,1,3),type='l',ylab='',xlab=expression(pi))

#draw area of interest
y=seq(from=0.5,to=0.8,length.out=1000)
polygon(c(y,y[length(y):1]),c(dbeta(y,1,3),rep(0,length(y))),col='grey')
abline(v=c(0.5,0.8),lty=3)
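
The grey area can be computed in two ways: exactly, from the beta cumulative distribution function, or by simulation, as the proportion of draws from Beta(1,3) that fall between 0.5 and 0.8. The short sketch below previews the simulation recipe that is formalized next; the number of draws (100,000) is arbitrary.

#exact value of the grey area from the beta CDF
pbeta(0.8,1,3)-pbeta(0.5,1,3)

#Monte Carlo approximation: proportion of posterior draws falling in (0.5,0.8)
set.seed(1)
x=rbeta(100000,1,3)
mean(x>0.5 & x<0.8)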

Here is the main way we solve the integrals that show up in Bayesian statistics. Say that we can generate samples from this posterior distribution, \(x_i \sim f_\pi (x)\) (this is an important assumption). Then the posterior mean, the posterior variance, and the probability that our parameter of interest x lies between L and U can be approximated by:

\[E(x)=\int_{-\infty}^{\infty}xf(x)dx\approx \frac{1}{n} \sum_{i=1}^n x_i = \bar{x}\]

\[Var(x)=\int_{-\infty}^{\infty} (x-E(x))^2 f(x)dx\approx \frac{1}{n} \sum_{i=1}^n (x_i-\bar{x})^2\]

\[p(L<x<U)=\int_L^U f_\pi(x)dx \approx \frac{1}{n} \sum_{i=1}^n I(L<x_i<U)\]

where I is the indicator function, equal to 1 if the condition is satisfied and 0 otherwise, and L and U denote the lower and upper bounds, respectively. We might also want to calculate the probability that x is bigger than a particular threshold T:

\[p(x>T)=\int_T^{\infty} f(x) dx \approx \frac{1}{n} \sum_{i=1}^n I(x_i>T)\]

This should make intuitive sense. If we want to estimate \(E(x)\), we use the sample mean. Similarly, if we want to estimate \(Var(x)\), we use the sample variance. If we want to estimate the probability that x lies between L and U, we calculate the proportion of times that our samples fall between these numbers.
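
Before looking at how these approximations behave as a function of the number of samples, here is a quick sanity check using a distribution whose summaries are known in closed form. The Exponential(rate = 1) distribution and the values L = 0.5, U = 2, and T = 1 below are arbitrary choices for illustration.

#draws from an Exponential(1) distribution (true mean=1 and true variance=1)
set.seed(1)
x=rexp(100000,rate=1)
L=0.5; U=2; T1=1

#Monte Carlo approximations versus the true values
mean(x)               #E(x); true value 1
mean((x-mean(x))^2)   #Var(x); true value 1
mean(x>L & x<U)       #p(L<x<U); true value pexp(U,1)-pexp(L,1)
mean(x>T1)            #p(x>T); true value 1-pexp(T1,1)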

To make this more concrete, here are some plots of these quantities as a function of sample size (note that sample size here refers to the number of samples from the posterior distribution, not the number of observations in your dataset). We assume that \(x \sim Beta(1,3)\) and that L=0.2, U=0.8, T=0.5. I will only show part of the code here so that you can have fun with the homework.

#number of draws from the posterior distribution (not the number of observations in the data)
n=10000
a=1; b=3
x=rbeta(n,a,b)
L=0.2
U=0.8
T1=0.5 #called T1 (rather than T) to avoid clashing with R's shorthand for TRUE

#running Monte Carlo estimates as a function of the number of samples used
#columns 2 (variance) and 3 (p(L<x<U)) are left for the homework
res=matrix(NA,n,4)
for (i in 2:n){
  res[i,1]=mean(x[1:i])    #running estimate of the posterior mean
  res[i,4]=mean(x[1:i]>T1) #running estimate of p(x>T)
}

par(mfrow=c(2,2),mar=c(4,1,4,1))
nomes=c('Mean','Variance','p(L<x<U)','p(x>T)') #panel titles
true.vals=c(a/(a+b),a*b/(((a+b)^2)*(a+b+1)),pbeta(U,a,b)-pbeta(L,a,b),1-pbeta(T1,a,b))

for (i in 1:4){
  #y-axis range based on the running estimates and the true value
  cond=!is.na(res[,i])
  rango=quantile(res[cond,i],c(0.001,0.999))
  rango1=range(c(rango,true.vals[i]))

  #panels 2 and 3 (variance and p(L<x<U)) are left blank for the homework
  cond=i==1 | i==4
  if (!cond) plot(NA,NA,xlim=c(0,1),ylim=c(0,1))
  if (cond) {
    plot(res[,i],type='l',xlab='Sample size',main=nomes[i],ylim=rango1,ylab='')
    abline(h=true.vals[i],col='blue')
  }
}

More generally, we can approximate the expectation of any function that depends on x (i.e., g(x)) as:

\[E(g(x))=\int g(x) f(x) dx \approx \frac{1}{n} \sum_{i=1}^n g(x_i) \]

This is called Monte Carlo integration. This method works because all of these quantities are, in a way, averages (i.e., we sum a bunch of numbers and divide by the total sample size n). I am not going to cover the theoretical reasons why this works, but the key point is that we can calculate all sorts of integrals if we can generate samples from the posterior distribution of the parameter of interest \(f_\pi (x)\). These results are remarkably useful because sometimes we cannot derive an analytical expression for E[g(x)] but we can approximate it by simulation. The main idea is that we can use stochastic simulations to calculate things that are otherwise hard to solve analytically. Notice that we are only talking about the Monte Carlo part, not the Markov chain part.
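
As a concrete example of approximating \(E(g(x))\), take \(g(\pi)=\pi/(1-\pi)\), the posterior odds of a success, with \(\pi \sim Beta(1,3)\) as above (the choice of g is mine, purely for illustration). The expected odds are easy to approximate from posterior draws and can be checked against numerical integration of \(g(x)f(x)\):

#posterior draws of pi and the Monte Carlo approximation of E(pi/(1-pi))
set.seed(1)
a=1; b=3
x=rbeta(100000,a,b)
mean(x/(1-x))

#check: numerical integration of g(x)*f(x) over (0,1)
integrate(function(x) (x/(1-x))*dbeta(x,a,b),lower=0,upper=1)$value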

Another important point is that knowing the mathematical expression of the pdf or pmf is, in practice, equivalent to having many random draws from that distribution. In other words, these are two sides of the same coin. For instance, in the figure below, we approximate the normal distribution \(N(\mu=0,\sigma=2)\) using 10,000 random draws from this distribution. This comparison reveals that the approximation is quite good.

#plot theoretical distribution
seq1=seq(from=-5,to=5,length.out=1000)
plot(seq1,dnorm(seq1,mean=0,sd=2),type='l',col='grey',xlab='x',
     ylab='Density',main='Normal distribution')

#plot our approximation based on samples from this distribution
x=rnorm(10000,mean=0,sd=2)
lines(density(x))
legend(-4.5,0.2,legend=c('true pdf','approx. pdf'),col=c('grey','black'),lty=1)

The mathematical expression is nice because we can clearly see how things work. For instance, when deriving the posterior distribution for the conjugate pairs, knowing the mathematical expression of the posterior was useful for understanding how the prior influences the posterior. However, there are many integrals that are hard or impossible to solve analytically, but which we can approximate through simulations. Simulations are particularly handy because frequently we do not know the complete mathematical expression for the pdf or pmf (e.g., not knowing that we have a beta distribution in the above example), but we can still generate samples from these distributions (e.g., through a Gibbs sampler)! Bayesian statisticians typically devise algorithms to generate samples from the posterior distribution, which are then used to make statistical inferences.


