Covariates that sum to one

Ecologists who study the impact of land use/land cover (LULC) often end up fitting regressions with covariates that sum to one. For example, we were studying how animal speed is influenced by LULC. To this end, we calculated the proportion of each LULC class within a buffer area around the animal, ending up with covariates that sum to one (e.g., 0.3 grassland, 0.5 forest, and 0.2 urban).

Regression model

Let \(y_i\) be the speed of the animal in movement segment \(i\). Furthermore, let \(x_{i1}\), \(x_{i2}\), and \(x_{i3}\) be the proportions of grassland, forest, and urban area, respectively, surrounding segment \(i\). For our regression, we will assume that:

\[E[y_i]=exp(\beta_0+\beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})\]

Notice that \(x_{i1}+x_{i2}+x_{i3}=1\).

Identifiability problem

To understand why we have an identifiability problem, notice that we can add an arbitrary constant \(\gamma\) to each of the slope parameters, yielding:

\[E[y_i]=exp(\beta_0+(\beta_1+\gamma) x_{i1} + (\beta_2+\gamma) x_{i2} + (\beta_3+\gamma) x_{i3})\] \[=exp(\beta_0+\gamma \times (x_{i1}+x_{i2}+x_{i3}) + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})\] \[=exp(\beta_0+\gamma + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3})\]

What this shows is that if:

  1. the slope parameters \(\beta_1,\beta_2,\beta_3\) are changed by adding \(\gamma\); and
  2. the intercept \(\beta_0\) is changed to \(\beta_0-\gamma\);

then we still end up with exactly the same \(E[y_i]\): the data cannot distinguish between these two sets of parameters.
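This invariance is easy to verify numerically. Here is a minimal sketch, using arbitrary made-up values for the proportions, the coefficients, and \(\gamma\):

```r
#numerical check of the invariance (arbitrary illustrative values)
x=c(0.3,0.5,0.2)        #proportions that sum to one
betas=c(1,0,0.5,-0.5)   #b0, b1, b2, b3
gamma1=0.7
mu1=exp(betas[1]+sum(betas[-1]*x))
mu2=exp((betas[1]-gamma1)+sum((betas[-1]+gamma1)*x))
mu1-mu2 #essentially zero
```

Any value of \(\gamma\) gives the same answer, which is exactly why the likelihood has a flat ridge along this direction.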

Illustration

To illustrate this unidentifiability issue, we will simulate some data based on the following generative model, where the gamma distribution uses the shape/rate parameterization so that \(E[y_i]\) equals the regression expression above and \(b\) governs the dispersion:

\[y_i \sim Gamma(b\times exp(\beta_0+\beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}),b)\]

rm(list=ls(all=T))
set.seed(1)

#generate covariates that sum to one
ncov=3
nobs=1000
tmp=matrix(rgamma(nobs*ncov,1,1),nobs,ncov)
xmat=tmp/apply(tmp,1,sum)
unique(apply(xmat,1,sum))
## [1] 1 1 1
#generate response variable
b=3
b0=1
b1=0
b2=0.5
b3=-0.5
media=exp(b0+b1*xmat[,1]+b2*xmat[,2]+b3*xmat[,3])
y=rgamma(nobs,b*media,b)

We can show that we will run into problems by calculating the log-likelihood based on the true parameters and based on parameters modified by \(\gamma\). The lack of identifiability is demonstrated by both sets of parameters yielding exactly the same log-likelihood.

#calculate log-likelihood for true parameters
sum(dgamma(y,b*media,b,log=T))
## [1] -1330.19
#calculate log-likelihood for modified parameters
gamma1=0.5
b0m=b0-gamma1
b1m=b1+gamma1
b2m=b2+gamma1
b3m=b3+gamma1
mediam=exp(b0m+b1m*xmat[,1]+b2m*xmat[,2]+b3m*xmat[,3])
sum(dgamma(y,b*mediam,b,log=T))
## [1] -1330.19
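Frequentist software runs into the same wall. Because the intercept column equals the sum of the three proportion columns, the design matrix is rank deficient, and `glm()` responds by aliasing (dropping) one of the coefficients. A minimal, self-contained sketch that re-simulates the same data as above:

```r
#the design matrix [1,x1,x2,x3] is rank deficient, so glm() aliases one slope
set.seed(1)
tmp=matrix(rgamma(1000*3,1,1),1000,3)
xmat=tmp/apply(tmp,1,sum)
media=exp(1+0*xmat[,1]+0.5*xmat[,2]-0.5*xmat[,3])
y=rgamma(1000,3*media,3)
fit=glm(y~xmat[,1]+xmat[,2]+xmat[,3],family=Gamma(link='log'))
coef(fit) #one slope is returned as NA (aliased)
```

In other words, maximum-likelihood software silently imposes an identifying constraint for us; a Bayesian sampler does not, which is why the problem surfaces as non-convergence in JAGS below.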

Let’s now see how this would manifest itself in JAGS. To fit this model in JAGS, I need to specify the prior for each parameter. I will assume the following vague priors (note that JAGS parameterizes the normal distribution with a precision, hence dnorm(0,1/100) below corresponds to a variance of 100):

\[b \sim Unif(0,100) \] \[\beta_p \sim N(0,100) \]

Here is the corresponding JAGS model definition (saved in the file ‘jags_identif_sum_one.R’):

model{
  for (i in 1:nobs){
    media[i] <- exp(b0+b1*xmat[i,1]+b2*xmat[i,2]+b3*xmat[i,3])
    y[i] ~ dgamma(b*media[i],b)
  }
  b0 ~ dnorm(0,1/100)
  b1 ~ dnorm(0,1/100)
  b2 ~ dnorm(0,1/100)
  b3 ~ dnorm(0,1/100)      
  b ~ dunif(0,100)
}

Now let’s fit this model:

library(jagsUI)
set.seed(1)

# MCMC settings 
ni <- 10000 #number of iterations
nt <- 50    #interval to thin 
nb <- ni/2  #number of iterations to discard as burn-in
nc <- 3     #number of chains

#set parameters to monitor
params=c('b','b0','b1','b2','b3')

#get model
setwd('U:\\uf\\courses\\bayesian course\\identifiability covariates sum to one')

model2=jags(model.file="jags_identif_sum_one.R",
            parameters.to.save=params,
            data=list(y=y,nobs=length(y),xmat=xmat),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)

Even after running 10,000 iterations, this algorithm has not converged. In fact, we could run this algorithm as long as we want and it would still have problems: the chains simply drift along the flat ridge in the likelihood created by the \(\gamma\) trade-off shown above.

Lack of convergence is also evident in the traceplots of each parameter and in the very high posterior correlations among \(\beta_0,\beta_1,\beta_2,\beta_3\).

#plot results
par(mfrow=c(3,2),mar=c(2,2,4,1))
plot(model2$sims.list$b,type='l',main='b')
plot(model2$sims.list$b0,type='l',main='b0')
plot(model2$sims.list$b1,type='l',main='b1')
plot(model2$sims.list$b2,type='l',main='b2')
plot(model2$sims.list$b3,type='l',main='b3')

param=cbind(model2$sims.list$b,
            model2$sims.list$b0,
            model2$sims.list$b1,
            model2$sims.list$b2,
            model2$sims.list$b3)
colnames(param)=c('b','b0','b1','b2','b3')
round(cor(param),2)
##        b    b0    b1    b2    b3
## b   1.00 -0.02  0.02  0.02  0.01
## b0 -0.02  1.00 -1.00 -1.00 -1.00
## b1  0.02 -1.00  1.00  0.99  0.99
## b2  0.02 -1.00  0.99  1.00  0.99
## b3  0.01 -1.00  0.99  0.99  1.00
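One common way out of this problem is to drop one of the proportions, making it the implicit baseline. Since \(x_{i3}=1-x_{i1}-x_{i2}\), the model \(E[y_i]=\exp(\beta_0^*+\beta_1^* x_{i1}+\beta_2^* x_{i2})\) with \(\beta_0^*=\beta_0+\beta_3\), \(\beta_1^*=\beta_1-\beta_3\), and \(\beta_2^*=\beta_2-\beta_3\) is equivalent to the original model but identifiable, with the slopes now interpreted as contrasts relative to urban area. A sketch of the corresponding reduced JAGS model, assuming the same data and priors as before:

```
model{
  for (i in 1:nobs){
    #urban (xmat[i,3]) is the implicit baseline category
    media[i] <- exp(b0+b1*xmat[i,1]+b2*xmat[i,2])
    y[i] ~ dgamma(b*media[i],b)
  }
  b0 ~ dnorm(0,1/100)
  b1 ~ dnorm(0,1/100)
  b2 ~ dnorm(0,1/100)
  b ~ dunif(0,100)
}
```

Because this parameterization has one fewer parameter and no ridge in the likelihood, the same MCMC settings should now yield well-mixed chains.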


