Example of a convergence problem: mixture model

Here we illustrate convergence problems that may arise due to identifiability issues. To do this, we rely on a class of models known as mixture models. In the particular model we will use, the data are assumed to arise from one of two normal distributions, both with variance 1 but with different means. Here is the model:

\[z_i \sim \mathrm{Bernoulli}(\pi)\] \[x_i \mid z_i=0 \sim N(\mu_0,1)\] \[x_i \mid z_i=1 \sim N(\mu_1,1)\]

In this model, \(z_i\) is a latent (unobserved) variable and we only get to observe \(x_i\), the end product of this process. Here is how we generate these data in R:
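(A minimal sketch: the seed and the exact values of \(\pi\), \(\mu_0\), and \(\mu_1\) below are illustrative assumptions, chosen to be consistent with the estimates reported later, roughly \(\pi \approx 0.2\), \(\mu_0 \approx 1\), and \(\mu_1 \approx 10\). The name pi1 avoids masking R's built-in constant pi.)

set.seed(1)                                   # assumed seed, for reproducibility
N   <- 1000
pi1 <- 0.2                                    # P(z[i] = 1); called pi in the JAGS model
mu0 <- 1
mu1 <- 10

z <- rbinom(N, size = 1, prob = pi1)          # latent group membership
x <- rnorm(N, mean = ifelse(z == 1, mu1, mu0), sd = 1)

hist(x, breaks = 50, xlab = "x", main = "")   # displays the two bumps
library(jagsUI)                               # used for model fitting below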


A histogram of these data shows two bumps, because the marginal density of \(x_i\) is \((1-\pi)N(\mu_0,1) + \pi N(\mu_1,1)\), i.e., scaled versions of the two normal densities used to draw the data. Mixture models like this are commonly used when one suspects that there are two or more subpopulations and we want to characterize each subpopulation despite not knowing which observation came from which subpopulation. For example, we could have height data for both men and women. In the absence of gender information associated with each height observation, we would only see these two bumps.

Estimating parameters using JAGS

The parameters that we want to estimate in this model are:

  • \(\pi\): the probability that an observation comes from the distribution with mean \(\mu_1\);
  • \(\mu_0\): the mean of one of the normal distributions;
  • \(\mu_1\): the mean of the other normal distribution.

Our JAGS model is given in the file “mixture_model.R”; its content is shown below:

model{
  for (i in 1:N){
    z[i] ~ dbinom(pi, 1)               # latent group indicator: a binomial with a
                                       # single trial, i.e., Bernoulli(pi)
    mu[i] <- z[i]*mu1 + (1-z[i])*mu0   # mean of the component observation i belongs to
    x[i] ~ dnorm(mu[i], 1)             # dnorm takes a precision; precision 1 = variance 1
  }
  mu0 ~ dnorm(0, 1/100)                # vague priors on the component means
  mu1 ~ dnorm(0, 1/100)
  pi ~ dbeta(1, 1)                     # uniform prior on the mixture proportion
}

To fit this model, we will rely on the following code:
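(A sketch: the settings below are taken from the output that follows, i.e., 5 chains, 100 adaptation iterations, 1,500 iterations per chain, a burn-in of 1,000, and a thinning rate of 10; the object names jags_data and fit are illustrative assumptions.)

jags_data <- list(x = x, N = N)

fit <- jagsUI::jags(data = jags_data,
                    parameters.to.save = c("mu1", "mu0", "pi"),
                    model.file = "mixture_model.R",
                    n.chains = 5, n.adapt = 100,
                    n.iter = 1500, n.burnin = 1000, n.thin = 10)
print(fit)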

## 
## Processing function input....... 
## 
## Done. 
##  
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1000
##    Unobserved stochastic nodes: 1003
##    Total graph size: 6008
## 
## Initializing model
## 
## Adaptive phase..... 
## Adaptive phase complete 
##  
## 
##  Burn-in phase, 1000 iterations x 5 chains 
##  
## 
## Sampling from joint posterior, 500 iterations x 5 chains 
##  
## 
## Calculating statistics....... 
## 
## Done.
## JAGS output for model 'mixture_model.R', generated by jagsUI.
## Estimates based on 5 chains of 1500 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 1000 iterations and thin rate = 10,
## yielding 250 total samples from the joint posterior. 
## MCMC ran for 0.256 minutes at time 2020-02-05 14:30:55.
## 
##              mean    sd     2.5%      50%    97.5% overlap0 f    Rhat n.eff
## mu1         4.580 4.421    0.921    1.014   10.072    FALSE 1 124.033     5
## mu0         6.377 4.421    0.914    9.899   10.098    FALSE 1 106.063     5
## pi          0.557 0.281    0.196    0.770    0.812    FALSE 1  29.980     5
## deviance 2807.222 2.179 2805.225 2806.537 2813.854    FALSE 1   1.004   250
## 
## **WARNING** Rhat values indicate convergence failure. 
## Rhat is the potential scale reduction factor (at convergence, Rhat=1). 
## For each parameter, n.eff is a crude measure of effective sample size. 
## 
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
## 
## DIC info: (pD = var(deviance)/2) 
## pD = 2.4 and DIC = 2809.626 
## DIC is an estimate of expected predictive error (lower is better).

The “Rhat” values look horrible, suggesting that the model has not converged. We could try running the algorithm for a larger number of iterations to see if this helps with convergence, but I believe it would not solve our problem.

To investigate further what might be happening, I show the traceplots for \(\mu_0\) and \(\mu_1\).
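(With jagsUI, these traceplots can be generated with the call below; fit is the assumed name of the fitted object from the previous step.)

jagsUI::traceplot(fit, parameters = c("mu0", "mu1"))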

The main problem with this model is that \(\mu_0\) and \(\mu_1\) are not identifiable: swapping their values while replacing \(\pi\) with \(1-\pi\) yields exactly the same likelihood, a problem often called label switching. As a result, in one chain our algorithm might associate \(\mu_0\) with the left distribution, whereas in another chain JAGS might associate \(\mu_0\) with the right distribution. This is reflected in the traceplots. We ran 5 chains and each chain contributes \(\frac{1500-1000}{10}=50\) samples. The traceplots concatenate these 50 samples from each chain and, for this reason, we see big jumps in parameter values at iterations that are multiples of 50.
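To see the nonidentifiability concretely, we can check numerically that swapping the labels leaves the likelihood unchanged. (This uses the simulated data x from above; mix_loglik is an illustrative helper, not part of the original code.)

mix_loglik <- function(x, p, mu0, mu1) {
  # marginal log-likelihood of the two-component normal mixture
  sum(log((1 - p) * dnorm(x, mu0, 1) + p * dnorm(x, mu1, 1)))
}
mix_loglik(x, p = 0.2, mu0 = 1, mu1 = 10)
mix_loglik(x, p = 0.8, mu0 = 10, mu1 = 1)   # labels swapped: identical value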

Estimating parameters using a reparameterized model

To solve this problem, we have to reformulate our model so that \(\mu_0\) always corresponds to the left distribution. One way of doing this is to reparameterize \(\mu_1\) as \(\mu_0 + \theta\), where \(\theta\) is restricted to be a positive real number. Here is the JAGS model (in file “mixture_model_reparam.R”) under this new parameterization:

model{
  for (i in 1:N){
    z[i] ~ dbinom(pi, 1)               # latent group indicator: Bernoulli(pi)
    mu[i] <- z[i]*mu1 + (1-z[i])*mu0   # mean of the component observation i belongs to
    x[i] ~ dnorm(mu[i], 1)             # precision 1 = variance 1
  }
  mu1 <- mu0 + theta                   # mu1 is forced to lie to the right of mu0
  theta ~ dunif(0, 100)                # theta > 0 breaks the label-switching symmetry
  mu0 ~ dnorm(0, 1/100)
  pi ~ dbeta(1, 1)
}

Let’s see if this helps with convergence:
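(Again a sketch with the same assumed settings and object names as before:)

fit_reparam <- jagsUI::jags(data = jags_data,
                            parameters.to.save = c("mu0", "theta", "pi"),
                            model.file = "mixture_model_reparam.R",
                            n.chains = 5, n.adapt = 100,
                            n.iter = 1500, n.burnin = 1000, n.thin = 10)
print(fit_reparam)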

## 
## Processing function input....... 
## 
## Done. 
##  
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1000
##    Unobserved stochastic nodes: 1003
##    Total graph size: 6009
## 
## Initializing model
## 
## Adaptive phase..... 
## Adaptive phase complete 
##  
## 
##  Burn-in phase, 1000 iterations x 5 chains 
##  
## 
## Sampling from joint posterior, 500 iterations x 5 chains 
##  
## 
## Calculating statistics....... 
## 
## Done.
## JAGS output for model 'mixture_model_reparam.R', generated by jagsUI.
## Estimates based on 5 chains of 1500 iterations,
## adaptation = 100 iterations (sufficient),
## burn-in = 1000 iterations and thin rate = 10,
## yielding 250 total samples from the joint posterior. 
## MCMC ran for 0.341 minutes at time 2020-02-05 14:31:10.
## 
##              mean    sd     2.5%      50%    97.5% overlap0 f  Rhat n.eff
## mu0         0.976 0.035    0.908    0.975    1.046    FALSE 1 0.995   250
## theta       9.017 0.072    8.887    9.015    9.170    FALSE 1 0.999   250
## pi          0.214 0.013    0.189    0.213    0.244    FALSE 1 1.002   250
## deviance 2806.947 1.772 2805.214 2806.427 2812.145    FALSE 1 1.002   250
## 
## Successful convergence based on Rhat values (all < 1.1). 
## Rhat is the potential scale reduction factor (at convergence, Rhat=1). 
## For each parameter, n.eff is a crude measure of effective sample size. 
## 
## overlap0 checks if 0 falls in the parameter's 95% credible interval.
## f is the proportion of the posterior with the same sign as the mean;
## i.e., our confidence that the parameter is positive or negative.
## 
## DIC info: (pD = var(deviance)/2) 
## pD = 1.6 and DIC = 2808.533 
## DIC is an estimate of expected predictive error (lower is better).

Based on the “Rhat” values reported by JAGS, it seems that our algorithm has now converged successfully!


