Introduction
Having a large number of approximately independent samples from the posterior distribution (i.e., a large effective sample size) is important for generating reliable inference. However, sometimes your algorithm (JAGS or your own customized Gibbs sampler) is very slow to converge and to generate enough approximately independent posterior samples, resulting in long wait times.
A common way to deal with this problem is to reduce the dimensionality of the model by integrating out some of the parameters / latent variables instead of estimating them explicitly. Models that integrate out multiple parameters / latent variables are sometimes called marginal models or collapsed models. The catch is that this often leads to a likelihood that does not correspond to a standard distribution. Therefore, if we want to use collapsed/marginal models in JAGS, we need to know how to define our own likelihood in JAGS. To do this, we will rely on the so-called “ones” trick.
Example with a zero-inflated Poisson model (ZIP)
In the standard zero-inflated Poisson (ZIP) model, we assume that there is a latent variable \(z_i\) for each site i, which is given by:
\[z_i\sim Bern(\pi)\] We further assume that the data we observe at each site \(y_i\) is given by:
\[y_i \sim Poisson(exp(\beta_0 + \beta_1 x_i)\times z_i)\]
In this model, if \(z_i=0\), then \(y_i\) is guaranteed to be 0 because it will be drawn from a Poisson with mean 0. On the other hand, if \(z_i=1\), then \(y_i\) may or may not be zero: it is a regular Poisson draw, and the probability of observing a zero depends on how small the mean \(exp(\beta_0 + \beta_1 x_i)\) is. This ZIP model can be fit in two ways:
Model 1: Explicitly representing all the latent variables \(z_i\). If we have \(n\) sites, this means we will be estimating \(n\) latent variables + 3 parameters (\(\beta_0,\beta_1,\pi\)).
Model 2: Integrating out the latent variables \(z_i\) (i.e., we would not be explicitly representing these variables). In this case, we would only be estimating 3 parameters (\(\beta_0,\beta_1,\pi\)).
Both models should yield the same results, but they have important differences. Model 1 is easier to interpret and will likely be easier to implement. However, collapsed models like model 2 often converge faster and generate a greater number of approximately independent posterior samples per unit of run time.
Integrating out latent variables / parameters
How do we integrate out the latent variables \(z_i\)? Using basic properties of probabilities, we have:
\[p(y_i )=p(y_i,z_i=0)+p(y_i,z_i=1)\] \[=p(y_i |z_i=0)p(z_i=0)+p(y_i |z_i=1)p(z_i=1)\]
Using the assumptions of the ZIP model described above, this is equal to:
\[=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\]
In this expression, we have eliminated the latent variables \(z_i\). Assuming independence, the likelihood is simply defined as:
\[=\prod_i [Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi]\]
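To make this expression concrete, here is how the per-site marginal likelihood could be evaluated directly in R (just a sketch, with arbitrary placeholder values for the observations and for \(\pi\), \(\beta_0\), and \(\beta_1\)):
#per-site marginal (collapsed) likelihood of the ZIP model
#delta_i = Poisson(y_i|0)*(1-pi) + Poisson(y_i|exp(b0+b1*x_i))*pi
zip.marg.lik <- function(y,x1,b0,b1,pi1){
  media <- exp(b0+b1*x1)
  dpois(y,lambda=0)*(1-pi1)+dpois(y,lambda=media)*pi1 #dpois(y,0) equals I(y=0)
}

#marginal log-likelihood for a few made-up observations and parameter values
y.demo <- c(0,0,2,5)
x1.demo <- c(-1,0.5,0,1.2)
sum(log(zip.marg.lik(y=y.demo,x1=x1.demo,b0=-0.5,b1=0.6,pi1=0.3)))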
However, we end up with an expression for the likelihood that does not resemble any standard distribution. How do we implement this likelihood in JAGS?
“Ones” trick
The “ones” trick is very useful when we want to define a likelihood in JAGS that is not based on a standard distribution. The key idea is that we create a vector of ones (hence the name) and treat this vector as our “response” variable. Specifically, let \(k_i=1\) for all \(i\). We model this variable as:
\[k_i\sim Bern(\delta_i )\]
where
\[\delta_i=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\]
After defining this in JAGS, all we need to do is define the priors for \(\pi,\beta_0\) and \(\beta_1\).
Why does this work? The likelihood for this model is given by
\[=\prod_i Bern(k_i |\delta_i ) \] \[=\prod_i \delta_i^{k_i} (1-\delta_i )^{1-k_i}\] Because all \(k_i\) are equal to 1, this simplifies to:
\[=\prod_i \delta_i\]
which is exactly the expression for our likelihood. Notice that this trick only works if \(\delta_i\) is between 0 and 1. While for our example this is not a problem, depending on the likelihood that you have, you might have to re-scale \(\delta_i\) (i.e., divide by a large constant C) to ensure that it is always between 0 and 1.
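Here is a quick numerical check of both points (the identity and the effect of rescaling) in R, using arbitrary values for \(\delta_i\):
#the Bernoulli density evaluated at k=1 is exactly delta
delta <- c(0.37,0.002,0.81)
dbinom(1,size=1,prob=delta)     #returns delta itself

#if delta could exceed 1, dividing by a large constant C keeps it a valid
#probability; the log-likelihood only shifts by the constant n*log(C),
#so the posterior is unaffected
C <- 1000
sum(log(dbinom(1,size=1,prob=delta/C)))-sum(log(delta))  #equals -length(delta)*log(C)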
Generating some fake data
We start by generating some fake data from the ZIP model described above.
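Below is a minimal sketch of this simulation. It assumes 1,000 sites (consistent with the number of observed stochastic nodes reported by JAGS below) and a standard normal covariate x1; the true parameter values are illustrative assumptions rather than the exact values used to produce the output shown later. The simulated data are saved to 'fake data.csv', the file read by the model-fitting code.
#generate fake data from the ZIP model (a sketch; parameter values are assumed)
set.seed(1)
n <- 1000        #number of sites
b0.true <- -0.5  #assumed regression intercept
b1.true <- 0.6   #assumed regression slope
pi.true <- 0.3   #assumed probability that a site is "on"

x1 <- rnorm(n)                       #site-level covariate
z <- rbinom(n,size=1,prob=pi.true)   #latent indicators
media <- exp(b0.true+b1.true*x1)     #Poisson mean when z=1
y <- rpois(n,lambda=media*z)         #observed counts (0 whenever z=0)

#export the fake data
write.csv(data.frame(y=y,x1=x1),'fake data.csv',row.names=FALSE)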
Fitting the model with latent variables (model 1)
We specify model 1 in JAGS with the following code:
#model with z.R
model{
  for (i in 1:nobs){
    media[i] <- exp(b0+b1*x1[i])
    z[i] ~ dbern(pi)            #latent indicator
    y[i] ~ dpois(media[i]*z[i]) #likelihood
  }
  pi ~ dbeta(1,1)
  b0 ~ dnorm(0,1/10)
  b1 ~ dnorm(0,1/10)
}
We fit this model with the code below:
library(jagsUI)
set.seed(1)
# MCMC settings
ni <- 1000 #number of iterations
nt <- 1 #interval to thin
nb <- ni/2 #number of iterations to discard as burn-in
nc <- 3 #number of chains
#set parameters to monitor
params=c('pi','b0','b1')
#run model
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter16 ones trick')
dat=read.csv('fake data.csv',as.is=T)
inits <- function(){
  list(z=rep(1,nrow(dat)))
}
model1=jags(model.file="model with z.R",inits=inits,
            parameters.to.save=params,
            data=list(y=dat$y,nobs=nrow(dat),x1=dat$x1),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
##
## Processing function input.......
##
## Done.
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 1003
## Total graph size: 7008
##
## Initializing model
##
## Adaptive phase.....
## Adaptive phase complete
##
##
## Burn-in phase, 500 iterations x 3 chains
##
##
## Sampling from joint posterior, 500 iterations x 3 chains
##
##
## Calculating statistics.......
##
## Done.
## JAGS output for model 'model with z.R', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 500 iterations and thin rate = 1,
## yielding 1500 total samples from the joint posterior.
## MCMC ran for 0.14 minutes at time 2020-07-15 09:31:50.
##
##             mean     sd    2.5%     50%   97.5% overlap0 f  Rhat n.eff
## pi         0.302  0.031   0.245   0.301   0.362    FALSE 1 1.020   102
## b0        -0.551  0.122  -0.802  -0.548  -0.315    FALSE 1 1.028    78
## b1         0.607  0.088   0.441   0.603   0.791    FALSE 1 1.011   195
## deviance 610.233 33.752 545.124 610.345 675.158    FALSE 1 1.024    90
##
## Successful convergence based on Rhat values (all < 1.1).
## Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## For each parameter, n.eff is a crude measure of effective sample size.
##
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
##
## DIC info: (pD = var(deviance)/2)
## pD = 557.6 and DIC = 1167.806
## DIC is an estimate of expected predictive error (lower is better).
Fitting the model without latent variables (model 2)
Now, we implement the JAGS model without the latent variables (model 2) using the “ones” trick. Notice that \(\delta_i\) can be written as:
\[\delta_i=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\] \[=I(y_i=0)\times(1-\pi)+\frac{exp(\beta_0 + \beta_1 x_i)^{y_i}exp(-exp(\beta_0 + \beta_1 x_i))}{y_i!}\times \pi\]
Some pieces of this expression do not depend on parameters (i.e., \(I(y_i=0)\) and \(y_i!\)). Therefore, I will pre-calculate these quantities (i.e., isy0[i] and yfact[i], respectively) and feed them into JAGS. Here is the code for the JAGS model.
#model without z.R
model{
  for (i in 1:nobs){
    media[i] <- exp(b0+b1*x1[i])
    prob[i] <- isy0[i]*(1-pi)+(media[i]^y[i])*exp(-media[i])*pi/yfact[i] #likelihood
    k[i] ~ dbern(prob[i]) #"ones" trick
  }
  pi ~ dbeta(1,1)
  b0 ~ dnorm(0,1/10)
  b1 ~ dnorm(0,1/10)
}
Here is the code to fit this model.
rm(list=ls(all=T))
library(jagsUI)
set.seed(1)
# MCMC settings
ni <- 1000 #number of iterations
nt <- 1 #interval to thin
nb <- ni/2 #number of iterations to discard as burn-in
nc <- 3 #number of chains
#set parameters to monitor
params=c('pi','b0','b1')
#run model
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter16 ones trick')
dat=read.csv('fake data.csv',as.is=T)
nobs=nrow(dat)
k=rep(1,nobs)
inits <- function(){
  list(b0=0,b1=0,pi=0.5)
}
#pre-calculate useful stuff
yfact=factorial(dat$y)
isy0=ifelse(dat$y==0,1,0)
#run model
model2=jags(model.file="model without z.R",
            parameters.to.save=params,inits=inits,
            data=list(y=dat$y,nobs=nobs,x1=dat$x1,isy0=isy0,
                      k=k,yfact=yfact),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
##
## Processing function input.......
##
## Done.
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 3
## Total graph size: 14011
##
## Initializing model
##
## Adaptive phase.....
## Adaptive phase complete
##
##
## Burn-in phase, 500 iterations x 3 chains
##
##
## Sampling from joint posterior, 500 iterations x 3 chains
##
##
## Calculating statistics.......
##
## Done.
## JAGS output for model 'model without z.R', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 500 iterations and thin rate = 1,
## yielding 1500 total samples from the joint posterior.
## MCMC ran for 0.261 minutes at time 2020-07-15 09:31:58.
##
##              mean    sd     2.5%      50%    97.5% overlap0 f  Rhat n.eff
## pi          0.297 0.032    0.237    0.296    0.365    FALSE 1 1.001  1500
## b0         -0.539 0.128   -0.798   -0.534   -0.297    FALSE 1 1.002  1312
## b1          0.607 0.088    0.445    0.604    0.783    FALSE 1 1.004   808
## deviance 1011.842 2.446 1009.061 1011.240 1017.995    FALSE 1 1.006  1258
##
## Successful convergence based on Rhat values (all < 1.1).
## Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## For each parameter, n.eff is a crude measure of effective sample size.
##
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
##
## DIC info: (pD = var(deviance)/2)
## pD = 3 and DIC = 1014.834
## DIC is an estimate of expected predictive error (lower is better).
Comparison of model 1 vs model 2
The first thing to notice is that both models recovered the true parameter values well. This is expected, given that both specifications ultimately represent the same statistical model.
Also notice that model 2 yielded a much larger effective sample size than model 1. In other words, we would likely have to run model 1 for much longer than model 2 to obtain the same effective sample size. As a result, depending on how long each model takes to fit, it might be computationally advantageous to use the specification of model 2.
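To quantify this trade-off, we can compute the number of effective samples per minute of run time directly from the jagsUI summaries shown above (run times of 0.14 and 0.261 minutes, respectively):
#effective sample sizes and run times copied from the output above
neff.model1 <- c(pi=102,b0=78,b1=195);    time.model1 <- 0.14  #minutes
neff.model2 <- c(pi=1500,b0=1312,b1=808); time.model2 <- 0.261 #minutes

#approximate effective samples per minute for each parameter
round(neff.model1/time.model1)  #model 1
round(neff.model2/time.model2)  #model 2
Even though model 2 took almost twice as long to run, it generated substantially more effective samples per minute for every parameter.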
Comments?
Send me an email at