Unidentifiable parameters

The WinBUGS manual starts by stating “Beware: MCMC sampling can be dangerous!” and “If there is a problem, WinBUGS might just crash, which is not very good, but it might well carry on and produce answers that are wrong, which is even worse.” Parameter identifiability has been recognized as one of the big challenges in hierarchical statistical modeling (Cressie et al. 2009), partly because it can be hard to diagnose: it is a problem of the second kind (i.e., JAGS does not crash and instead carries on to produce answers that are wrong).

Parameters are said to be unidentifiable if different combinations of them yield the same likelihood value. As a simple illustration, say that we are interested in estimating demographic parameters and we observed 6 bears at time 1 and 8 bears at time 2. Perhaps 2 bears were born between these time periods, or perhaps 4 bears were born but 2 of the 6 bears present at time 1 died. Based on these data alone, we simply cannot tell what really happened, and therefore we cannot reliably estimate fecundity and mortality rates. You can read more about this issue in Gross, Craig, and Hutchison (2002).

This example illustrates why we should not rely on parameter estimates from models with unidentifiable parameters. Furthermore, our algorithms (regardless of whether we use JAGS or our own customized Gibbs sampler) will often have a hard time converging, with high correlation among parameters and poor mixing. The reason that I, and others (Cressie et al. 2009), associate this problem with tools like JAGS/WinBUGS is that their users sometimes do not fully understand the modeling assumptions they are making and may inadvertently create models with unidentifiable parameters.

An important observation is that non-identifiability can often be overcome by placing a strong prior on at least one of the unidentifiable parameters. This is true for many models, not only the ones described here. For instance, going back to our original bear example, a prior study might have generated considerable information regarding fecundity rates of bears. This would enable us to create an informative prior on fecundity rates, which might consequently make bear mortality rates identifiable. The reason one has to be careful with this approach is that the estimated mortality rates would be highly dependent on the fecundity-rate prior. If this prior information came from a different population or from a severe-drought year, it might not be representative of the focal population during this particular time period.

1) Simple example

Say that we have the following model:

\[y_i \sim N(\mu_1+\mu_2,\sigma^2)\]

In this model, \(\mu_1\) and \(\mu_2\) are unidentifiable regardless of whether we have a few or thousands of observations. For instance, if \(\bar{y}=4\), then we get the same likelihood for any combination of \(\mu_1\) and \(\mu_2\) as long as \(\mu_1+\mu_2=4\), as illustrated with the code below:

rm(list=ls(all=T))
set.seed(1)

#generate some data
mu1=1
mu2=3
n=1000
sig2=3
y=rnorm(n,mean=mu1+mu2,sd=sqrt(sig2))

setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif mu1 and mu2')
write.csv(y,'data mu1 and mu2.csv',row.names=F)

#likelihood based on the true parameter values
sum(dnorm(y,mean=mu1+mu2,sd=sqrt(sig2),log=T))
## [1] -2003.302
#likelihood based on different combinations of parameter values
ncombo=20
mu1=seq(from=-5,to=10,length.out=ncombo)
mu2=4-mu1
llk.comb=rep(NA,ncombo)
for (i in 1:ncombo){
  llk.comb[i]=sum(dnorm(y,mean=mu1[i]+mu2[i],sd=sqrt(sig2),log=T))
}
llk.comb
##  [1] -2003.302 -2003.302 -2003.302 -2003.302 -2003.302 -2003.302 -2003.302
##  [8] -2003.302 -2003.302 -2003.302 -2003.302 -2003.302 -2003.302 -2003.302
## [15] -2003.302 -2003.302 -2003.302 -2003.302 -2003.302 -2003.302

This problem typically manifests itself in JAGS through a very high correlation between the parameters in the posterior distribution (and sometimes poor mixing). To show this, I first define the model:

model{
  for (i in 1:nobs){
    y[i] ~ dnorm(mu1+mu2,tau)   #note that JAGS parameterizes the normal with a precision, not a variance
  }
  mu1 ~ dnorm(0,1/10)   #vague priors (precision 1/10, i.e., variance 10)
  mu2 ~ dnorm(0,1/10)
  tau ~ dgamma(1,1)
}

Now let’s fit this model:

rm(list=ls(all=T))
library(jagsUI)
## Loading required package: lattice
## 
## Attaching package: 'jagsUI'
## The following object is masked from 'package:utils':
## 
##     View
set.seed(1)

# MCMC settings 
ni <- 10000 #number of iterations
nt <- 50    #interval to thin 
nb <- ni/2  #number of iterations to discard as burn-in
nc <- 3     #number of chains

#set parameters to monitor
params=c('mu1','mu2','tau')

#get data
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif mu1 and mu2')
y=read.csv('data mu1 and mu2.csv',as.is=T)$x

model2=jags(model.file="mu1 and mu2 model.R",
            parameters.to.save=params,data=list(y=y,nobs=length(y)),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
par(mfrow=c(2,2))
plot(model2$sims.list$mu1,type='l',ylab=expression(mu[1]))
plot(model2$sims.list$mu2,type='l',ylab=expression(mu[2]))
plot(model2$sims.list$tau,type='l',ylab=expression(tau))
plot(model2$sims.list$mu1,model2$sims.list$mu2,xlab=expression(mu[1]),ylab=expression(mu[2]))
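The scatter plot of the posterior samples of \(\mu_1\) against \(\mu_2\) makes the problem obvious. A quick numeric check, using the samples already stored in model2, is the posterior correlation between the two parameters, which should be very close to -1 here:

#posterior correlation between mu1 and mu2; a value near -1 (or +1) indicates
#that only a combination of the two parameters is well determined by the data
cor(model2$sims.list$mu1,model2$sims.list$mu2)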

The interesting thing about this is that the marginal posterior distributions for \(\mu_1\) and \(\mu_2\) have large variance (suggesting that we have very little information on these individual parameters) despite the fact that the marginal posterior distribution for \(\mu_1+\mu_2\) is very tight, indicating that we have a lot of information on this sum!

par(mfrow=c(3,1))
plot(density(model2$sims.list$mu1),type='l',main=expression(mu[1]),xlim=c(-5,10),xlab='')
plot(density(model2$sims.list$mu2),type='l',main=expression(mu[2]),xlim=c(-5,10),xlab='')
plot(density(model2$sims.list$mu1+model2$sims.list$mu2),type='l',main=expression(mu[1]+mu[2]),xlim=c(-5,10),xlab='')
abline(v=4,col='red')
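As mentioned at the beginning of this section, one way to overcome this problem is to place an informative prior on at least one of the unidentifiable parameters. As a minimal sketch, suppose (hypothetically) that a previous study told us that \(\mu_1\) is close to 1; we could then replace its vague prior with a tight one. The posterior for \(\mu_2\) would concentrate around \(\bar{y}-1\), but keep in mind that this estimate would be driven largely by the prior choice:

model{
  for (i in 1:nobs){
    y[i] ~ dnorm(mu1+mu2,tau)
  }
  mu1 ~ dnorm(1,100)    #informative prior (precision 100, i.e., sd=0.1), based on hypothetical prior information
  mu2 ~ dnorm(0,1/10)   #vague prior, as before
  tau ~ dgamma(1,1)
}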

2) Regression example

The example above is very simple and clearly unrealistic: who would create such a model? Nevertheless, it is not too hard to extend it to a situation that is likely to be much more common. Say that we have two covariates \(x_{1i}\) and \(x_{2i}\) that are strongly correlated and that we formulate the following regression model:

\[y_i \sim N(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i},\sigma^2)\]

In the extreme situation in which these covariates are perfectly linearly related (e.g., \(x_{1i}=\gamma_1 x_{2i}\)), we would have:

\[y_i \sim N(\beta_0 + \beta_1 (\gamma_1 x_{2i}) + \beta_2 x_{2i},\sigma^2)\]

\[y_i \sim N(\beta_0 + (\beta_1 \gamma_1 + \beta_2) x_{2i},\sigma^2)\]

In other words, any combination of \(\beta_1\) and \(\beta_2\) that keeps \(\beta_1 \gamma_1 + \beta_2\) constant yields the same likelihood! Here is some code that shows this:

rm(list=ls(all=T))
set.seed(1)

#generate some data
b0=1
b1=2
b2=-1
n=1000
sig2=3
x2=rnorm(n)
gamma1=3
x1=gamma1*x2

y=rnorm(n,mean=b0+b1*x1+b2*x2,sd=sqrt(sig2))

#export dataset
dat=data.frame(y=y,x1=x1,x2=x2)
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif regression')
write.csv(dat,'data regression.csv',row.names=F)

#likelihood based on the true parameter values
sum(dnorm(y,mean=b0+b1*x1+b2*x2,sd=sqrt(sig2),log=T))
## [1] -2008.617
#likelihood based on different combinations of parameter values
res=b1*gamma1+b2 # we have to generate combinations such that we get this result
ncombo=20
b1=seq(from=-5,to=10,length.out=ncombo)
b2=res-b1*gamma1
llk.comb=rep(NA,ncombo)
for (i in 1:ncombo){
  llk.comb[i]=sum(dnorm(y,mean=b0+b1[i]*x1+b2[i]*x2,sd=sqrt(sig2),log=T))
}
llk.comb
##  [1] -2008.617 -2008.617 -2008.617 -2008.617 -2008.617 -2008.617 -2008.617
##  [8] -2008.617 -2008.617 -2008.617 -2008.617 -2008.617 -2008.617 -2008.617
## [15] -2008.617 -2008.617 -2008.617 -2008.617 -2008.617 -2008.617

Let’s define our JAGS model:

model{
  for (i in 1:nobs){
    y[i] ~ dnorm(beta[1]+beta[2]*x1[i]+beta[3]*x2[i],tau)
  }
  tau ~ dgamma(1,1)
  
  for (j in 1:3){
    beta[j] ~ dnorm(0,1/10)
  }
}

Here we show how this problem leads to high correlation (and sometimes mixing issues) in JAGS:

rm(list=ls(all=T))
library(jagsUI)
set.seed(1)

# MCMC settings 
ni <- 10000 #number of iterations
nt <- 50    #interval to thin 
nb <- ni/2  #number of iterations to discard as burn-in
nc <- 3     #number of chains

#set parameters to monitor
params=c('beta','tau')

#get data
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif regression')
dat=read.csv('data regression.csv',as.is=T)

model2=jags(model.file="regression model.R",
            parameters.to.save=params,data=list(y=dat$y,x1=dat$x1,x2=dat$x2,nobs=nrow(dat)),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
par(mfrow=c(3,2))
plot(model2$sims.list$beta[,1],type='l',ylab=expression(beta[0]))
plot(model2$sims.list$beta[,2],type='l',ylab=expression(beta[1]))
plot(model2$sims.list$beta[,3],type='l',ylab=expression(beta[2]))
plot(model2$sims.list$tau,type='l',ylab=expression(tau))
plot(model2$sims.list$beta[,2],model2$sims.list$beta[,3],xlab=expression(beta[1]),ylab=expression(beta[2]))

Similar to what we showed previously, the marginal posterior distributions for \(\beta_1\) and \(\beta_2\) are very wide despite the fact that the posterior distribution for \(\beta_1 \gamma_1 + \beta_2\) is very tight.

gamma1=3
par(mfrow=c(3,1))
plot(density(model2$sims.list$beta[,2]),type='l',xlab='',xlim=c(-5,10),main=expression(beta[1]))
plot(density(model2$sims.list$beta[,3]),type='l',xlab='',xlim=c(-5,10),main=expression(beta[2]))
plot(density(model2$sims.list$beta[,2]*gamma1+model2$sims.list$beta[,3]),type='l',
     xlab='',xlim=c(-5,10),main=expression(beta[1]*gamma+beta[2]))

To avoid this type of problem, it can be very useful to calculate Variance Inflation Factors (VIF) for the covariates in your regression model before fitting it. This should immediately reveal if you are likely to run into these issues, regardless of whether you are fitting a Bayesian or a frequentist model.
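Here is a minimal sketch of how a VIF can be computed by hand. I use covariates that are almost, but not exactly, collinear, since lm() would simply drop a perfectly redundant covariate; the exact numbers below are illustrative only:

#VIF sketch with two covariates that are nearly perfectly collinear
rm(list=ls(all=T))
set.seed(1)
n=1000
x2=rnorm(n)
x1=3*x2+rnorm(n,sd=0.01) #x1 is almost an exact multiple of x2

#the VIF for covariate j is 1/(1-R2_j), where R2_j is the R-squared from
#regressing covariate j on all the remaining covariates
r2=summary(lm(x1~x2))$r.squared
1/(1-r2) #enormous VIF; values above ~10 are a common rule-of-thumb warning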

3) Mixed models example

Say that we have data on the hemoglobin level \(y_{ij}\) of individual i in village j and we want to see how this is related to malaria status \(x_{ij}\). To account for the fact that individuals are clustered within villages and that there might be systematic differences between villages that we are not accounting for, we can use a mixed-effects model with a village-level random effect \(\beta_{2j}\), given by:

\[y_{ij} \sim N(\beta_0 +\beta_1 x_{ij} + \beta_{2j},\sigma^2)\] where

\[\beta_{2j} \sim N(\gamma,\tau^2)\]

It turns out that \(\beta_0\) and \(\gamma\) are unidentifiable parameters. To see why, notice that:

\[\beta_{2j}=\gamma+e_j\]

where \(e_j \sim N(0,\tau^2 )\). Therefore, we can re-write our original equation as:

\[y_{ij}\sim N([\beta_0+\gamma]+\beta_1 x_{ij}+e_j,\sigma^2 )\] where \[e_j \sim N(0,\tau^2 )\]

Written this way, it becomes clear that we have two intercepts (\(\beta_0\) and \(\gamma\)) and therefore different combinations of these parameters lead to the same marginal likelihood.
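Here is a minimal sketch that makes this explicit. It assumes the mvtnorm package is available and uses its dmvnorm() function: once the village effects \(e_j\) are integrated out, the observations within a village are jointly normal with covariance \(\sigma^2 I + \tau^2 J\), so the marginal likelihood depends on \(\beta_0\) and \(\gamma\) only through their sum.

rm(list=ls(all=T))
library(mvtnorm) #assumes this package is installed
set.seed(1)

#same parameter values as in the data-generation code below
nvil=40; nobs.vil=5; b1=2; sig2=3; tau2=1
x1=matrix(rnorm(nvil*nobs.vil),nvil,nobs.vil) #one row of covariate values per village

#within-village covariance after integrating out e_j
Sigma=sig2*diag(nobs.vil)+tau2*matrix(1,nobs.vil,nobs.vil)

#marginal log-likelihood: depends on b0 and gamma only through b0+gamma
marg.llk=function(b0,gamma,y){
  res=0
  for (j in 1:nvil){
    res=res+dmvnorm(y[j,],mean=b0+gamma+b1*x1[j,],sigma=Sigma,log=T)
  }
  res
}

#simulate data with b0=1 and gamma=-1
b2=rnorm(nvil,mean=-1,sd=sqrt(tau2))
y=1+b1*x1+b2+matrix(rnorm(nvil*nobs.vil,sd=sqrt(sig2)),nvil,nobs.vil)

marg.llk(b0=1,gamma=-1,y=y) #true values
marg.llk(b0=5,gamma=-5,y=y) #different combination, identical marginal likelihood

Now let's fit the original model in JAGS and see what happens with our estimates of \(\beta_0\) and \(\gamma\).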

We first generate some fake data:

rm(list=ls(all=T))
set.seed(1)

#generate some data
gamma1=-1
nvil=40
tau2=1
b2=rnorm(nvil,mean=gamma1,sd=sqrt(tau2))

nobs.vil=5
ind.vil=rep(1:nvil,each=nobs.vil)
b0=1
b1=2
n=length(ind.vil)
x1=rnorm(n)
sig2=3

y=rnorm(n,mean=b0+b1*x1+b2[ind.vil],sd=sqrt(sig2))

#export dataset
dat=data.frame(y=y,x1=x1,ind.vil=ind.vil)
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif mixed model')
write.csv(dat,'data mixed model.csv',row.names=F)

Now we define our model:

model{
  for (j in 1:nvil){
    b2[j] ~ dnorm(gamma,prec2)
  }
  
  for (i in 1:nobs){
    y[i] ~ dnorm(b0+b1*x1[i]+b2[ind.vil[i]],prec1)
  }
  b0 ~ dnorm(0,1/10)
  b1 ~ dnorm(0,1/10)
  gamma ~ dnorm(0,1/10)
  prec2 ~ dgamma(1,1)
  prec1 ~ dgamma(1,1)
}

Finally, we fit this model:

rm(list=ls(all=T))
library(jagsUI)
set.seed(1)

# MCMC settings 
ni <- 10000 #number of iterations
nt <- 50    #interval to thin 
nb <- ni/2  #number of iterations to discard as burn-in
nc <- 3     #number of chains

#set parameters to monitor
params=c('b0','b1','b2','prec1','prec2','gamma')

#run model
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter15 jags problems\\identif mixed model')
dat=read.csv('data mixed model.csv',as.is=T)
nvil=length(unique(dat$ind.vil))

model2=jags(model.file="mixed model.R",
            parameters.to.save=params,data=list(y=dat$y,x1=dat$x1,nobs=nrow(dat),nvil=nvil,ind.vil=dat$ind.vil),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
par(mfrow=c(3,2))
plot(model2$sims.list$b0,type='l',ylab=expression(beta[0]))
plot(model2$sims.list$b1,type='l',ylab=expression(beta[1]))
plot(model2$sims.list$gamma,type='l',ylab=expression(gamma))
plot(model2$sims.list$prec1,type='l',ylab=expression(1/(sigma^2)))
plot(model2$sims.list$prec2,type='l',ylab=expression(1/(tau^2)))
plot(model2$sims.list$b0,model2$sims.list$gamma,xlab=expression(beta[0]),ylab=expression(gamma))

Notice that most parameters seem to be mixing well except for \(\beta_0\) and \(\gamma\). Finally, as expected, we see wide marginal posterior distributions for \(\beta_0\) and \(\gamma\) individually but a very tight marginal posterior distribution for \(\beta_0+\gamma\):

par(mfrow=c(3,1))
plot(density(model2$sims.list$b0),type='l',xlab='',xlim=c(-5,10),main=expression(beta[0]))
plot(density(model2$sims.list$gamma),type='l',xlab='',xlim=c(-5,10),main=expression(gamma))
plot(density(model2$sims.list$b0+model2$sims.list$gamma),type='l',
     xlab='',xlim=c(-5,10),main=expression(beta[0]+gamma))
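A quick numeric check, using the posterior samples stored in model2, makes the same point: the posterior standard deviations of \(\beta_0\) and \(\gamma\) individually should be large, while the posterior standard deviation of their sum should be small.

#posterior standard deviations: b0 and gamma are poorly determined individually,
#but their sum is estimated precisely
sd(model2$sims.list$b0)
sd(model2$sims.list$gamma)
sd(model2$sims.list$b0+model2$sims.list$gamma)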

To avoid this problem, we will have to specify our model differently. We can adopt one of the following two model specifications (a JAGS sketch of the first one is shown after the list):

  1. Model specification 1:

\[y_{ij} \sim N(\beta_0 +\beta_1 x_{ij} + \beta_{2j},\sigma^2)\]

\[\beta_{2j} \sim N(0,\tau^2)\]

  2. Model specification 2:

\[y_{ij} \sim N(\beta_1 x_{ij} + \beta_{2j},\sigma^2)\]

\[\beta_{2j} \sim N(\gamma,\tau^2)\]
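As a sketch, the first specification in JAGS is simply the model above with \(\gamma\) removed, so that the random effects are centered at zero and \(\beta_0\) is the single overall intercept (specification 2 is obtained analogously by dropping b0 and keeping gamma):

model{
  for (j in 1:nvil){
    b2[j] ~ dnorm(0,prec2) #random effects now have mean zero
  }
  
  for (i in 1:nobs){
    y[i] ~ dnorm(b0+b1*x1[i]+b2[ind.vil[i]],prec1)
  }
  b0 ~ dnorm(0,1/10) #single overall intercept; gamma has been dropped
  b1 ~ dnorm(0,1/10)
  prec1 ~ dgamma(1,1)
  prec2 ~ dgamma(1,1)
}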

Final remarks

There are two final observations that I want to make regarding parameter identifiability:

  • All of our examples have relied on the normal distribution, but do not be misled into thinking that identifiability issues arise only for this distribution. We will go through examples in class of models that run into similar problems but that rely on other distributions.

  • The problems in the examples I have given are immediately evident when we display a scatter plot of the posterior samples of the different parameters. However, some identifiability problems can be harder to diagnose, particularly when more than 2 parameters are jointly unidentifiable. My opinion is that the main way to avoid these problems is either to have a deep understanding of the model we are using or to restrict ourselves to well-known model formulations.




References

Cressie, N., C. A. Calder, J. S. Clark, J. M. ver Hoef, and C. K. Wikle. 2009. “Accounting for Uncertainty in Ecological Analysis: The Strengths and Limitations of Hierarchical Statistical Modeling.” Ecological Applications 19.
Gross, K., B. A. Craig, and W. D. Hutchison. 2002. “Bayesian Estimation of a Demographic Matrix Model from Stage-Frequency Data.” Ecology 83: 3285–98.