Introduction
Having a large number of approximately independent samples from the posterior distribution (i.e., a large effective sample size) is important for generating reliable inference. However, sometimes your algorithm (JAGS or your own customized Gibbs sampler) is very slow to converge and to generate enough approximately independent posterior samples, resulting in long wait times.
A common way to deal with this problem is to reduce the dimensionality of the model by integrating out some of the parameters / latent variables instead of estimating them explicitly. Models that integrate out multiple parameters / latent variables are sometimes called marginal models or collapsed models. The catch is that this often leads to a likelihood that does not correspond to a standard distribution. Therefore, if we want to use collapsed/marginal models in JAGS, we need to know how to define our own likelihood in JAGS. To do this, we will rely on the so-called “ones” trick.
Example with a zero-inflated Poisson model (ZIP)
In the standard zero-inflated Poisson (ZIP) model, we assume that there is a latent variable \(z_i\) for each site i, which is given by:
\[z_i\sim Bern(\pi)\] We further assume that the data we observe at each site \(y_i\) is given by:
\[y_i \sim Poisson(exp(\beta_0 + \beta_1 x_i)\times z_i)\]
In this model, if \(z_i=0\), then \(y_i\) is guaranteed to be 0 because it will be drawn from a Poisson with mean 0. On the other hand, if \(z_i=1\), then \(y_i\) may or may not be zero: it is a regular Poisson draw, and the probability of observing a zero depends on how small the mean \(exp(\beta_0 + \beta_1 x_i)\) is. This ZIP model can be fit in two ways:
Model 1: Explicitly representing all the latent variables \(z_i\). If we have \(n\) sites, this means we will be estimating \(n\) latent variables + 3 parameters (\(\beta_0,\beta_1,\pi\)).
Model 2: Integrating out the latent variables \(z_i\) (i.e., we would not be explicitly representing these variables). In this case, we would only be estimating 3 parameters (\(\beta_0,\beta_1,\pi\)).
Both models should yield the same results, but they have important differences. Model 1 is easier to interpret and will likely be easier to implement. However, collapsed models like model 2 often converge faster and generate a greater number of approximately independent posterior samples per unit of run time.
Integrating out latent variables / parameters
How do we integrate out the latent variables \(z_i\)? Using basic properties of probabilities, we have:
\[p(y_i )=p(y_i,z_i=0)+p(y_i,z_i=1)\] \[=p(y_i |z_i=0)p(z_i=0)+p(y_i |z_i=1)p(z_i=1)\]
Using the assumptions of the ZIP model described above, this is equal to:
\[=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\]
In this expression, we have eliminated the latent variables \(z_i\). Assuming independence, the likelihood is simply defined as:
\[=\prod_i [Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi]\]
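To make this expression concrete, here is how the per-site marginal likelihood could be evaluated directly in R (just a sketch, with arbitrary placeholder values for the observations and for \(\pi\), \(\beta_0\), and \(\beta_1\)):
#per-site marginal (collapsed) likelihood of the ZIP model
#delta_i = Poisson(y_i|0)*(1-pi) + Poisson(y_i|exp(b0+b1*x_i))*pi
zip.marg.lik <- function(y,x1,b0,b1,pi1){
  media <- exp(b0+b1*x1)
  dpois(y,lambda=0)*(1-pi1)+dpois(y,lambda=media)*pi1 #dpois(y,0) equals I(y=0)
}

#marginal log-likelihood for a few made-up observations and parameter values
y.demo <- c(0,0,2,5)
x1.demo <- c(-1,0.5,0,1.2)
sum(log(zip.marg.lik(y=y.demo,x1=x1.demo,b0=-0.5,b1=0.6,pi1=0.3)))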
However, we end up with an expression for the likelihood that does not resemble any standard distribution. How do we implement this likelihood in JAGS?
“Ones” trick
The “ones” trick is very useful when we want to define a likelihood in JAGS that is not based on a standard distribution. The key idea is that we create a vector of ones (hence the name) and treat this vector as our “response” variable. Specifically, let \(k_i=1\) for all \(i\). We model this variable as:
\[k_i\sim Bern(\delta_i )\]
where
\[\delta_i=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\]
After defining this in JAGS, all we need to do is define the priors for \(\pi,\beta_0\) and \(\beta_1\).
Why does this work? The likelihood for this model is given by
\[=\prod_i Bern(k_i |\delta_i ) \] \[=\prod_i \delta_i^{k_i} (1-\delta_i )^{1-k_i}\] Because all \(k_i\) are equal to 1, this simplifies to:
\[=\prod_i \delta_i\]
which is exactly the expression for our likelihood. Notice that this trick only works if \(\delta_i\) is between 0 and 1. While for our example this is not a problem, depending on the likelihood that you have, you might have to re-scale \(\delta_i\) (i.e., divide by a large constant C) to ensure that it is always between 0 and 1.
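Here is a quick numerical check of both points (the identity and the effect of rescaling) in R, using arbitrary values for \(\delta_i\):
#the Bernoulli density evaluated at k=1 is exactly delta
delta <- c(0.37,0.002,0.81)
dbinom(1,size=1,prob=delta)     #returns delta itself

#if delta could exceed 1, dividing by a large constant C keeps it a valid
#probability; the log-likelihood only shifts by the constant n*log(C),
#so the posterior is unaffected
C <- 1000
sum(log(dbinom(1,size=1,prob=delta/C)))-sum(log(delta))  #equals -length(delta)*log(C)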
Generating some fake data
We start by generating some fake data from the ZIP model described above.
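Below is a minimal sketch of this simulation. It assumes 1,000 sites (consistent with the number of observed stochastic nodes reported by JAGS below) and a standard normal covariate x1; the true parameter values are illustrative assumptions rather than the exact values used to produce the output shown later. The simulated data are saved to 'fake data.csv', the file read by the model-fitting code.
#generate fake data from the ZIP model (a sketch; parameter values are assumed)
set.seed(1)
n <- 1000        #number of sites
b0.true <- -0.5  #assumed regression intercept
b1.true <- 0.6   #assumed regression slope
pi.true <- 0.3   #assumed probability that a site is "on"

x1 <- rnorm(n)                       #site-level covariate
z <- rbinom(n,size=1,prob=pi.true)   #latent indicators
media <- exp(b0.true+b1.true*x1)     #Poisson mean when z=1
y <- rpois(n,lambda=media*z)         #observed counts (0 whenever z=0)

#export the fake data
write.csv(data.frame(y=y,x1=x1),'fake data.csv',row.names=FALSE)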
Fitting the model with latent variables (model 1)
We specify model 1 in JAGS with the following code:
#model with z.R
model{
  for (i in 1:nobs){
    media[i] <- exp(b0+b1*x1[i])
    z[i] ~ dbern(pi)            #latent indicator
    y[i] ~ dpois(media[i]*z[i]) #likelihood
  }
  pi ~ dbeta(1,1)
  b0 ~ dnorm(0,1/10)
  b1 ~ dnorm(0,1/10)
}
We fit this model with the code below:
library(jagsUI)
set.seed(1)
# MCMC settings
ni <- 1000 #number of iterations
nt <- 1 #interval to thin
nb <- ni/2 #number of iterations to discard as burn-in
nc <- 3 #number of chains
#set parameters to monitor
params=c('pi','b0','b1')
#run model
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter16 ones trick')
dat=read.csv('fake data.csv',as.is=T)
inits <- function(){
  list(z=rep(1,nrow(dat)))
}
model1=jags(model.file="model with z.R",inits=inits,
            parameters.to.save=params,
            data=list(y=dat$y,nobs=nrow(dat),x1=dat$x1),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
##
## Processing function input.......
##
## Done.
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 1003
## Total graph size: 7008
##
## Initializing model
##
## Adaptive phase.....
## Adaptive phase complete
##
##
## Burn-in phase, 500 iterations x 3 chains
##
##
## Sampling from joint posterior, 500 iterations x 3 chains
##
##
## Calculating statistics.......
##
## Done.
## JAGS output for model 'model with z.R', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 500 iterations and thin rate = 1,
## yielding 1500 total samples from the joint posterior.
## MCMC ran for 0.14 minutes at time 2020-07-15 09:31:50.
##
##             mean     sd    2.5%     50%   97.5% overlap0 f  Rhat n.eff
## pi         0.302  0.031   0.245   0.301   0.362    FALSE 1 1.020   102
## b0        -0.551  0.122  -0.802  -0.548  -0.315    FALSE 1 1.028    78
## b1         0.607  0.088   0.441   0.603   0.791    FALSE 1 1.011   195
## deviance 610.233 33.752 545.124 610.345 675.158    FALSE 1 1.024    90
##
## Successful convergence based on Rhat values (all < 1.1).
## Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## For each parameter, n.eff is a crude measure of effective sample size.
##
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
##
## DIC info: (pD = var(deviance)/2)
## pD = 557.6 and DIC = 1167.806
## DIC is an estimate of expected predictive error (lower is better).
Fitting the model without latent variables (model 2)
Now, we implement the JAGS model without the latent variables (model 2) using the “ones” trick. Notice that \(\delta_i\) can be written as:
\[\delta_i=Poisson(y_i|0)\times(1-\pi)+Poisson(y_i|exp(\beta_0 + \beta_1 x_i))\times \pi\] \[=I(y_i=0)\times(1-\pi)+\frac{exp(\beta_0 + \beta_1 x_i)^{y_i}exp(-exp(\beta_0 + \beta_1 x_i))}{y_i!}\times \pi\]
Some pieces of this expression do not depend on parameters (i.e., \(I(y_i=0)\) and \(y_i!\)). Therefore, I will pre-calculate these quantities (i.e., isy0[i] and yfact[i], respectively) and feed them into JAGS. Here is the code for the JAGS model.
#model without z.R
model{
  for (i in 1:nobs){
    media[i] <- exp(b0+b1*x1[i])
    prob[i] <- isy0[i]*(1-pi)+(media[i]^y[i])*exp(-media[i])*pi/yfact[i] #likelihood
    k[i] ~ dbern(prob[i]) #"ones" trick
  }
  pi ~ dbeta(1,1)
  b0 ~ dnorm(0,1/10)
  b1 ~ dnorm(0,1/10)
}
Here is the code to fit this model.
rm(list=ls(all=T))
library(jagsUI)
set.seed(1)
# MCMC settings
ni <- 1000 #number of iterations
nt <- 1 #interval to thin
nb <- ni/2 #number of iterations to discard as burn-in
nc <- 3 #number of chains
#set parameters to monitor
params=c('pi','b0','b1')
#run model
setwd('U:\\uf\\courses\\bayesian course\\2016\\chapter16 ones trick')
dat=read.csv('fake data.csv',as.is=T)
nobs=nrow(dat)
k=rep(1,nobs)
inits <- function(){
  list(b0=0,b1=0,pi=0.5)
}
#pre-calculate useful stuff
yfact=factorial(dat$y)
isy0=ifelse(dat$y==0,1,0)
#run model
model2=jags(model.file="model without z.R",
            parameters.to.save=params,inits=inits,
            data=list(y=dat$y,nobs=nobs,x1=dat$x1,isy0=isy0,
                      k=k,yfact=yfact),
            n.chains=nc,n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
##
## Processing function input.......
##
## Done.
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1000
## Unobserved stochastic nodes: 3
## Total graph size: 14011
##
## Initializing model
##
## Adaptive phase.....
## Adaptive phase complete
##
##
## Burn-in phase, 500 iterations x 3 chains
##
##
## Sampling from joint posterior, 500 iterations x 3 chains
##
##
## Calculating statistics.......
##
## Done.
## JAGS output for model 'model without z.R', generated by jagsUI.
## Estimates based on 3 chains of 1000 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 500 iterations and thin rate = 1,
## yielding 1500 total samples from the joint posterior.
## MCMC ran for 0.261 minutes at time 2020-07-15 09:31:58.
##
##              mean    sd     2.5%      50%    97.5% overlap0 f  Rhat n.eff
## pi          0.297 0.032    0.237    0.296    0.365    FALSE 1 1.001  1500
## b0         -0.539 0.128   -0.798   -0.534   -0.297    FALSE 1 1.002  1312
## b1          0.607 0.088    0.445    0.604    0.783    FALSE 1 1.004   808
## deviance 1011.842 2.446 1009.061 1011.240 1017.995    FALSE 1 1.006  1258
##
## Successful convergence based on Rhat values (all < 1.1).
## Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## For each parameter, n.eff is a crude measure of effective sample size.
##
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
##
## DIC info: (pD = var(deviance)/2)
## pD = 3 and DIC = 1014.834
## DIC is an estimate of expected predictive error (lower is better).
Comparison of model 1 vs model 2
The first thing to notice is that both models recovered the true parameter values well. This is expected, given that both specifications ultimately represent the same statistical model.
Also notice that model 2 yielded a much larger effective sample size than model 1. In other words, we would likely have to run model 1 for much longer than model 2 to obtain the same effective sample size. As a result, depending on how long each model takes to fit, it might be computationally advantageous to use the specification of model 2.
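To quantify this trade-off, we can compute the number of effective samples per minute of run time directly from the jagsUI summaries shown above (run times of 0.14 and 0.261 minutes, respectively):
#effective sample sizes and run times copied from the output above
neff.model1 <- c(pi=102,b0=78,b1=195);    time.model1 <- 0.14  #minutes
neff.model2 <- c(pi=1500,b0=1312,b1=808); time.model2 <- 0.261 #minutes

#approximate effective samples per minute for each parameter
round(neff.model1/time.model1)  #model 1
round(neff.model2/time.model2)  #model 2
Even though model 2 took almost twice as long to run, it generated substantially more effective samples per minute for every parameter.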
Comments?
Send me an email at