Sources of uncertainty

Parameter uncertainty across a range of covariate values

Let’s now look at the implications of these different sources of uncertainty over the entire range of covariate values, instead of focusing on x=10. Now the red line shows the mean trend \(\hat{y_i}=\hat{\beta_0}+\hat{\beta_1} \times x_i\) and the orange line shows the 95% credible interval around this mean, after accounting for uncertainty in parameters \(\hat{y_{ij}}=\beta_{0j}+\beta_{1j} \times x_i\).

rm(list = ls())

#get data
setwd('U:\\uf\\courses\\bayesian course\\rmarkdown\\uncertainty\\figs_code_data')
dat=read.csv('river data.csv',as.is=T)

#get samples from the posterior distribution
param=read.csv('posterior distrib.csv',as.is=T) 

#estimate of pollutant concentration without uncertainty
estim=apply(param,2,mean)
rango=range(dat$cropland)
seq1=seq(from=rango[1],to=rango[2],length.out=100)
yhat=estim['b0']+estim['b1']*seq1

#pollutant concentration without uncertainty
plot(dat$cropland,dat$nitrate,ylab='Nitrate concentration',xlab='Crop land')
lines(seq1,yhat,col='red')

#pollutant concentration with parameter uncertainty
yhat1=matrix(NA,100,3)
for (i in 1:100){
  tmp=param[,'b0']+param[,'b1']*seq1[i]
  yhat1[i,]=quantile(tmp,c(0.025,0.5,0.975))
}
lines(seq1,yhat1[,1],col='orange')
lines(seq1,yhat1[,3],col='orange')

Parameter + sampling uncertainty across a range of covariate values

Notice how many of our observations fall out of the orange envelope. This is to be expected because this envelope just captures the uncertainty around the mean trend. To really account for the variability in our dataset, we need the predictive distribution, which takes into account parameter and sampling uncertainty (grey lines).

plot(dat$cropland,dat$nitrate,xlab='Crop land',ylab='Nitrate concentration')

#pollutant concentration without uncertainty
lines(seq1,yhat,col='red',type='l',xlim=c(0,1),ylim=c(0,1.5),xlab='Covariate x')

#pollutant concentration with parameter uncertainty
lines(seq1,yhat1[,1],col='orange')
lines(seq1,yhat1[,3],col='orange')
nsim=nrow(param)

#pollutant concentration with parameter and sampling uncertainty
yhat2=matrix(NA,100,3)
for (i in 1:100){
  tmp=rnorm(nsim,mean=param[,'b0']+param[,'b1']*seq1[i],sd=sqrt(param[,'sig2']))
  yhat2[i,]=quantile(tmp,c(0.025,0.5,0.975))
}
lines(seq1,yhat2[,1],col='grey')
lines(seq1,yhat2[,3],col='grey')

Probability of river being impaired across a range of covariate values

Finally, now we can plot our inference on the probability that the river is impaired for different values of our covariate \(x_i\). The figure below shows striking differences for \(p(\hat{y_i}>0.5)\) when different sources of uncertainty are taken into account.

res=matrix(NA,100,3)

#results with uncertainty
res[,1]=yhat>0.5

for (i in 1:100){
  media=param[,'b0']+param[,'b1']*seq1[i]
  res[i,2]=mean(media>0.5) #results with parameter uncertainty
  tmp=rnorm(nsim,mean=media,sd=sqrt(param[,'sig2']))
  res[i,3]=mean(tmp>0.5) #results with parameter and sampling uncertainty
}

#plot results
par(mfrow=c(1,1),mar=c(4,4,1,1))
plot(NA,NA,xlim=c(0,50),ylim=c(0,1),ylab='Probability of impaired',xlab='Proportion of cropland')
cor1=c('red','orange','grey')
for (i in 1:3){
  lines(seq1,res[,i],col=cor1[i])
}

Notice that predictions without sampling uncertainty would state that, when \(x < 6.8\), then the probability that the river is impaired is negligible. Likewise, when \(x > 6.8\), then it is virtually certain that the river is impaired. On the other hand, conclusions are very different when we take into account sampling uncertainty. For instance, when \(x = 0\), the probability that the river is impaired is 30% whereas when \(x=20\), this probability is 80%.

Problems related to thresholds abound in multiple fields. Can you give me some examples? My examples are:

policy makers often have to decide on who will receive treatment for AIDS and the key parameter here is the number of CD4 cells people have. Normal CD4 counts range between 500 cells/mm3 to 1,000 cells/mm3 and opportunistic infections start to occur when CD4 counts are much lower than this. For instance, the U.S. Department of Health and Human Services recommends treatment to start once counts are < 350. We would be essentially in the same scenario above if we had to predict CD4 counts for each person based on certain covariates.
Animal and plant populations are deemed vulnerable, endangered, and critically endangered partly based on their current and future population sizes (IUCN). Say the threshold here for critically endangered species was having less than 1,000 individuals 10 years from now. Say we develop a model that will predict the probability that the total number of individuals (our response variable y) is below this threshold ten years from now to be able to inform policy making. This is very similar to the situation described above.

Comments?

Send me an email at