Does centering the covariate improve our original algorithm?

To answer this question, I will use code that is almost identical to what I had before except that now I am centering the covariate. The results of the model fitted to data with centered covariates are stored in the file poisreg results centering.csv. Let’s see if centering the covariate results in much lower auto-correlation.
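For reference, the only change relative to the earlier sampler is that the covariate is centered before the MCMC is run, roughly as sketched below (the rest of the sampler code is unchanged and not repeated here):

#center the covariate (log island area) before running the Gibbs sampler;
#everything else in the sampler stays exactly as before
x.centered=x-mean(x)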

The first thing we will do is to determine if and when the algorithm has converged. From the plots below, it seems that our algorithm converges almost immediately. Therefore, I assume that the algorithm has converged after 50 iterations.

#get data
setwd('U:\\uf\\courses\\bayesian course\\rmarkdown\\improving MCMC')
param=read.csv('poisreg results centering.csv',as.is=T)

ngibbs=nrow(param)
nburn=50
#trace plots: full chains on top, post-burn-in samples below
par(mfrow=c(2,2))
plot(param$b0,type='l',main='Beta0 all',xlab='',ylab='')
plot(param$b1,type='l',main='Beta1 all',xlab='',ylab='')
plot(param$b0[nburn:ngibbs],type='l',main='Beta0 after convergence',xlab='',ylab='')
plot(param$b1[nburn:ngibbs],type='l',main='Beta1 after convergence',xlab='',ylab='')

b0=param$b0[nburn:ngibbs]
b1=param$b1[nburn:ngibbs]

Importantly, it seems that we now have much lower correlation between \(\beta_0\) and \(\beta_1\).

plot(b0,b1)

Auto-correlation function (ACF) and effective sample size

The ACF reveals that the autocorrelation decays much more rapidly for both \(\beta_0\) and \(\beta_1\) than it did with the uncentered covariate. For example, for both parameters the autocorrelation is below 0.1 by lag 20, so retaining every 20th sample would yield approximately independent draws.

seq1=seq(from=0,to=1,by=0.2)
par(mfrow=c(2,1),mar=c(4,4,4,1))
acf(b0)
abline(h=seq1,col='grey',lty=3)
acf(b1)
abline(h=seq1,col='grey',lty=3)
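
To make the thinning claim concrete, here is a minimal sketch that keeps every 20th post-burn-in draw and re-checks the autocorrelation of the thinned chains (the lag of 20 comes from the ACF plots above):

#retain every 20th sample after the burn-in
thin=20
keep=seq(from=1,to=length(b0),by=thin)
b0.thin=b0[keep]
b1.thin=b1[keep]

#the thinned chains should show negligible autocorrelation at lag 1
par(mfrow=c(2,1),mar=c(4,4,4,1))
acf(b0.thin)
acf(b1.thin)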

In terms of effective sample size, we find that having 9,951 samples from the posterior distribution for \(\beta_0\) is equivalent to having approximately 962 independent samples. This is far more than what we had before, when the covariate had not been centered.

library(coda)
effectiveSize(b0); length(b0)
##     var1 
## 961.7831
## [1] 9951

Generalizing the problem

More generally, observing highly correlated parameters is indicative of a log-likelihood surface with ridges, and algorithms like the Gibbs sampler are unfortunately not well suited to such surfaces. To understand why, it is important to take a step back and revisit how the Gibbs sampler works and how it explores parameter space.

Recall that, within each iteration of our Gibbs sampler, we sequentially update one parameter at a time while holding the remaining parameters fixed. In our example, the Gibbs sampler therefore explores parameter space by iteratively making vertical (\(\beta_1\)) and horizontal (\(\beta_0\)) moves, as shown in the figure below. This figure also shows, in the background, the log-likelihood for different values of \(\beta_0\) and \(\beta_1\).
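
Before building that figure, it may help to sketch the overall structure of such a sampler. The code below is an illustrative Metropolis-within-Gibbs skeleton, not the original sampler code: the flat priors, starting values, and jump.sd are assumptions made for illustration, and x and y are the covariate and response from the original (uncentered) analysis.

#illustrative Metropolis-within-Gibbs sketch (not the original sampler code)
#Poisson regression log-likelihood for a given (b0,b1) pair
llk=function(b0,b1,x,y) sum(dpois(y,exp(b0+b1*x),log=T))

jump.sd=0.01         #sigma_jump: step size of the random-walk proposals
b0.now=0; b1.now=0   #arbitrary starting values
ngibbs=1000
res=matrix(NA,ngibbs,2)
for (i in 1:ngibbs){
  #1) update b0 while holding b1 fixed (a horizontal move in the figure below)
  b0.prop=rnorm(1,mean=b0.now,sd=jump.sd)
  if (runif(1)<exp(llk(b0.prop,b1.now,x,y)-llk(b0.now,b1.now,x,y))) b0.now=b0.prop
  #2) update b1 while holding b0 fixed (a vertical move in the figure below)
  b1.prop=rnorm(1,mean=b1.now,sd=jump.sd)
  if (runif(1)<exp(llk(b0.now,b1.prop,x,y)-llk(b0.now,b1.now,x,y))) b1.now=b1.prop
  res[i,]=c(b0.now,b1.now)
}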

#get estimated parameters without centering
setwd('U:\\uf\\courses\\bayesian course\\rmarkdown\\improving MCMC')
param=read.csv('poisreg results no centering.csv',as.is=T)
ngibbs=nrow(param)
nburn=500
b0=param$b0[nburn:ngibbs]
b1=param$b1[nburn:ngibbs]

#get data
setwd('U:\\uf\\courses\\bayesian course\\group activities\\7 example Poisson_Logistic reg')
dat=read.csv('islands.csv',as.is=T,fileEncoding="latin1")
y=dat$Native.bird.species
x=log(as.numeric(dat$island.area))
## Warning: NAs introduced by coercion
cond=!is.na(x)
y=y[cond]
x=x[cond]

#evaluate the Poisson log-likelihood on a grid of (b0,b1) values
seq.b0=seq(from=round(min(b0),2),to=round(max(b0),2),length.out=100)
seq.b1=seq(from=round(min(b1),2),to=round(max(b1),2),length.out=100)
combo=expand.grid(b0=seq.b0,b1=seq.b1)
combo$llk=NA
for (i in 1:nrow(combo)) {
  media=exp(combo$b0[i]+combo$b1[i]*x)
  combo$llk[i]=sum(dpois(y,media,log=T)) 
}

library('ggplot2')
res=ggplot() + 
  geom_tile(data = combo, alpha = 1,aes(x = b0, y = b1,fill = llk)) +
  scale_fill_gradient2(low = "cyan", mid = "yellow",high='red',limits=quantile(combo$llk,c(0.5,1)),midpoint=quantile(combo$llk,0.75))

#reconstruct the path of the first n Gibbs iterations as alternating
#horizontal (b0) and vertical (b1) line segments
n=100
oo=1
coord1=numeric()
for (i in 2:n){
  #sample b0 first while keeping b1 fixed
  b0past=b0[i-1]; b0now=b0[i]
  b1past=b1[i-1]; b1now=b1[i-1]
  tmp.b0=c(b0past,b0now)
  tmp.b1=c(b1past,b1now)
  tmp.oo=rep(oo,2)
  coord1=rbind(coord1,cbind(tmp.b0,tmp.b1,tmp.oo))
  oo=oo+1
  
  #sample b1 second while keeping b0 fixed
  b0past=b0[i];   b0now=b0[i]
  b1past=b1[i-1]; b1now=b1[i]
  tmp.b0=c(b0past,b0now)
  tmp.b1=c(b1past,b1now)
  tmp.oo=rep(oo,2)
  coord1=rbind(coord1,cbind(tmp.b0,tmp.b1,tmp.oo))
  oo=oo+1
}
coord2=data.frame(b0=coord1[,1],b1=coord1[,2],pair1=coord1[,3])
res=res+geom_line(data=coord2,aes(x=b0,y=b1,group=pair1))
res

Notice how the likelihood surface looks like a ridge that runs diagonally across the graph. This is an important observation: an ideal algorithm would propose new values along this diagonal, because such proposals are more likely to fall in a high-probability region and therefore more likely to be accepted.

Unfortunately, our algorithm explores parameter space with horizontal and vertical moves, which makes navigating this surface tricky: we will often propose values that fall in low-probability regions and have to reject them. To achieve a decent acceptance rate, we are forced to take small steps within our MH algorithm (i.e., a small \(\sigma^2_{jump}\)), resulting in very slow exploration of parameter space.
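
To make this concrete, we can compare the log-likelihood cost of a purely horizontal move with that of a move along the ridge. The sketch below uses an arbitrary step size and starts at the posterior means of the uncentered chains; the compensating change in \(\beta_1\) is simply the one that keeps \(\beta_0 + \beta_1\bar{x}\) roughly constant.

#Poisson log-likelihood on the uncentered covariate scale
llk.fun=function(b0,b1) sum(dpois(y,exp(b0+b1*x),log=T))

#start near the posterior means of the uncentered chains
b0.start=mean(b0); b1.start=mean(b1)
step=0.05

llk.fun(b0.start,b1.start)                    #starting point on the ridge
llk.fun(b0.start+step,b1.start)               #horizontal move: falls off the ridge
llk.fun(b0.start+step,b1.start-step/mean(x))  #move roughly along the ridge: much smaller drop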

On the other hand, if we plot the log-likelihood surface for the centered covariate, we find that the diagonal ridge is gone. As a result, the Gibbs sampler should have a much easier time exploring this surface.

#calculate centered x
x.centered=x-mean(x)

#log-likelihood grid for the centered covariate
seq.b0=seq(from=3,to=3+0.25,length.out=100)
seq.b1=seq(from=0.22,to=0.26,length.out=100)
combo=expand.grid(b0=seq.b0,b1=seq.b1)
combo$llk=NA
for (i in 1:nrow(combo)) {
  media=exp(combo$b0[i]+combo$b1[i]*x.centered)
  combo$llk[i]=sum(dpois(y,media,log=T)) 
}

library('ggplot2')
res=ggplot() + 
  geom_tile(data = combo, alpha = 1,aes(x = b0, y = b1,fill = llk)) +
  scale_fill_gradient2(low = "cyan", mid = "yellow",high='red',limits=quantile(combo$llk,c(0.5,1)),midpoint=quantile(combo$llk,0.75))
res

Summary and final remarks

We observed a high correlation between \(\beta_0\) and \(\beta_1\). This high correlation was indicative of the existence of ridges in our log-likelihood surface, a situation that is poorly handled by Gibbs samplers. We were able to mitigate this problem by centering our covariate.

Notice, however, that highly correlated parameters arise in many different situations, and the problem cannot always be solved by centering covariates. For example:

  • regression slope parameters might be highly correlated if the corresponding covariates are highly correlated. In this case, we would have to find a way to decorrelate our covariates or eliminate highly correlated covariates prior to the analysis (a quick diagnostic is sketched after this list).

  • occupancy models typically have parameters associated with detection and parameters associated with occupancy. These parameters might be highly correlated if we do not have many re-visits to each site. To deal with this issue, we might have to impose informative priors on some of the parameters and/or simply collect more data (i.e., have more re-visits to each site).
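
As a quick sketch of the diagnostic mentioned in the first bullet, one can inspect the pairwise correlations among covariates before fitting the model (here X is a hypothetical data frame holding the covariates; it is not part of the island data set used above):

#X is a hypothetical data frame with one column per covariate
round(cor(X),2)   #flag pairs of covariates with high correlation

#centering and scaling is often a useful first step...
X.std=scale(X)

#...but covariates that remain strongly correlated may need to be
#decorrelated or dropped before running the MCMC

As with the occupancy example, whether to transform covariates, impose informative priors, or collect more data is ultimately a modeling decision that depends on the specific application.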


