Data and model

Here we are going to analyze real data on nitrate concentration (an important water pollutant) in Mississippi river basins as a function of percent row crops (Goolsby et al. (1999) NOAA Coastal Ocean Program Decision Analysis Series # 1). The data are in “river data.csv” and this is what they look like:

setwd('U:\\uf\\courses\\bayesian course\\rmarkdown\\uncertainty\\figs_code_data')
dat=read.csv('river data.csv',as.is=T)
head(dat)
##   basin.id cropland nitrate
## 1        1      2.5   0.647
## 2        2      1.3   1.062
## 3        3     14.3   1.432
## 4        4      0.5   0.579
## 5        5     45.6   3.561
## 6        6     46.6   3.938
plot(nitrate~cropland,data=dat)

To understand the relationship between nitrate concentration \(y_i\) as a function of percent row crops \(x_i\) in basin \(i\), we will use the following model:

\[y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)\]
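This model can be fit in many ways. As a rough sketch (this is not the code that produced the posterior samples used below), here is how posterior draws for \(\{\beta_0,\beta_1,\sigma^2\}\) could be obtained with a simple Gibbs sampler in base R, using simulated data and vague conjugate priors; all numbers here are hypothetical:

```r
set.seed(1)

#simulate data loosely resembling the river example (hypothetical values)
n <- 50
x <- runif(n, 0, 80)
y <- 0.6 + 0.05 * x + rnorm(n, sd = 0.5)
X <- cbind(1, x)

#vague conjugate priors: beta ~ N(0, 100^2 I), sig2 ~ IG(0.1, 0.1)
prior.prec <- diag(2) / 100^2
a.prior <- 0.1; b.prior <- 0.1

n.iter <- 2000
beta <- c(0, 0); sig2 <- 1
out <- matrix(NA, n.iter, 3, dimnames = list(NULL, c('b0', 'b1', 'sig2')))

for (i in 1:n.iter) {
  #full conditional beta | sig2, y ~ N(m, V)
  V <- solve(t(X) %*% X / sig2 + prior.prec)
  m <- V %*% (t(X) %*% y / sig2)
  beta <- m + t(chol(V)) %*% rnorm(2)
  #full conditional sig2 | beta, y ~ IG(a + n/2, b + SSR/2)
  ssr <- sum((y - X %*% beta)^2)
  sig2 <- 1 / rgamma(1, a.prior + n / 2, b.prior + ssr / 2)
  out[i, ] <- c(beta, sig2)
}
param.sim <- out[-(1:500), ] #discard burn-in
colMeans(param.sim)
```

The column names match those of “posterior distrib.csv” (b0, b1, sig2), but the draws themselves are from the simulated data above, not from the river data.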

No uncertainty

We can display the results of this model in multiple ways, each revealing different features of what we have learned from the data. To simplify things, let’s assume that \(x_i=10\). For instance, we can calculate the posterior mean of each regression parameter (let’s denote these means by \(\hat{\beta_0}\) and \(\hat{\beta_1}\)) and use these means to predict pollutant concentration

\[\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}\times 10\]

This approach of course ignores uncertainty in our parameter estimates.

Parameter uncertainty

Alternatively, we can account for parameter uncertainty on the mean by using the posterior distribution of our parameters. We do this by drawing sets of parameters from their joint posterior distribution. Let’s denote the j-th draw by adding the subscript j to our parameters {\(\beta_{0j},\beta_{1j},{\sigma_j}^2\)}. Then, we calculate:

\[\hat{y_{ij}}=\beta_{0j}+\beta_{1j}\times 10\]

and we can plot the distribution of \(\hat{y_{ij}}\) (orange line). It is important to emphasize that the equation above just describes how I will calculate \(\hat{y_{ij}}\). In the past, some students have interpreted this equation incorrectly, assuming that I had introduced new parameters into my statistical model. The posterior distribution is available here: “posterior distrib.csv”.

setwd('U:\\uf\\courses\\bayesian course\\rmarkdown\\uncertainty\\figs_code_data')
param=read.csv('posterior distrib.csv',as.is=T) #this dataset contains samples from the posterior distribution

#estimate of pollutant concentration without uncertainty
estim=apply(param,2,mean) #get the mean of the posterior distribution of each parameter
yhat=estim['b0']+estim['b1']*10 

#estimate of pollutant concentration with parameter uncertainty using joint distribution
yhat1=param[,'b0']+param[,'b1']*10 
plot(density(yhat1),type='l',col='orange',main='Inference for x=10',xlab='yhat')
points(yhat,0,pch=19,col='red') #estimate of pollutant concentration without parameter uncertainty

#estimate of pollutant concentration with parameter uncertainty using marginal distributions, instead of joint distribution
y_bad=param[,'b0']+sample(param[,'b1'],nrow(param))*10 
k=density(y_bad)
lines(k$x,k$y,col='blue')
legend(0.2,2.5,lty=1,col=c('orange','blue'),c('joint','marginal'))

In this figure, the red point depicts the mean \(\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}\times 10\) while the orange line shows the uncertainty around this mean once parameter uncertainty is taken into account (i.e., the distribution of \(\hat{y_{ij}}\)). Notice that we used the joint posterior distribution (orange line). In other words, I calculated:

\[\begin{bmatrix} \hat{y_{i1}} \\[0.3em] ... \\[0.3em] \hat{y_{in}} \\[0.3em] \end{bmatrix}= \begin{bmatrix} \beta_{01} \\[0.3em] ... \\[0.3em] \beta_{0n} \\[0.3em] \end{bmatrix}+ \begin{bmatrix} \beta_{11} \\[0.3em] ... \\[0.3em] \beta_{1n} \\[0.3em] \end{bmatrix} \times 10\]

The ability to get the joint posterior distribution is an important characteristic of Bayesian models. If we had used just the marginal posterior distributions, our results could have been very different. To illustrate this, I can calculate:

\[\begin{bmatrix} \hat{y_{i1}} \\[0.3em] \hat{y_{i2}} \\[0.3em] \hat{y_{i3}} \\[0.3em] ... \\[0.3em] \end{bmatrix}= \begin{bmatrix} \beta_{01} \\[0.3em] \beta_{02} \\[0.3em] \beta_{03} \\[0.3em] ... \\[0.3em] \end{bmatrix}+ \begin{bmatrix} \beta_{14} \\[0.3em] \beta_{19} \\[0.3em] \beta_{11} \\[0.3em] ... \\[0.3em] \end{bmatrix} \times 10\]

Notice how the second subscript denoting the sample number for \(\beta_1\) is all mixed up and does not match that of \(\beta_0\). The key point here is that the results using the marginal distributions (i.e., ignoring the relationship between samples from \(\beta_1\) and samples from \(\beta_0\)) will be not only different but incorrect (blue line).
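To see why the pairing matters, here is a small self-contained illustration with simulated draws (the correlation value is hypothetical, not estimated from the river data). Intercept and slope posteriors are typically negatively correlated, and shuffling one relative to the other inflates the spread of \(\beta_0 + \beta_1 \times 10\):

```r
set.seed(1)
n <- 10000
b1 <- rnorm(n, 0.05, 0.01)
b0 <- 0.6 - 5 * (b1 - 0.05) + rnorm(n, 0, 0.02) #induces negative correlation

joint    <- b0 + b1 * 10        #keeps the pairing (joint distribution)
marginal <- b0 + sample(b1) * 10 #destroys the pairing (marginal distributions)
sd(joint); sd(marginal)          #the marginal version is noticeably wider
```

The shuffled version drops the (negative) covariance term from the variance of the sum, so its spread is too large here; with a positive correlation the error would go the other way.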

Parameter + sampling uncertainty

Another source of uncertainty besides parameter uncertainty is sampling uncertainty. Now we are no longer interested in the mean; rather, we are interested in actual pollutant values. To capture this, we create new datasets by sampling from the predictive distribution. These new datasets are generated by first drawing a set of parameters from the posterior distribution {\(\beta_{0j},\beta_{1j},\sigma_j^2\)} and then drawing a new observation from \(N(\beta_{0j} + \beta_{1j} \times 10, \sigma_j^2)\). In other words, we do this by calculating:

\[\begin{bmatrix} y_{i1}^{new} \\[0.3em] y_{i2}^{new} \\[0.3em] y_{i3}^{new} \\[0.3em] ... \\[0.3em] \end{bmatrix}= \begin{bmatrix} N(\beta_{01}+\beta_{11} \times 10,\sigma_1^2) \\[0.3em] N(\beta_{02}+\beta_{12} \times 10,\sigma_2^2) \\[0.3em] N(\beta_{03}+\beta_{13} \times 10,\sigma_3^2) \\[0.3em] ... \\[0.3em] \end{bmatrix}\]

Here is how we implement this in R:

#estimate of pollutant concentration with parameter uncertainty
plot(density(yhat1),type='l',col='orange',main='Inference for x=10',xlab='yhat',xlim=c(-2,3))

#estimate of pollutant concentration without uncertainty
points(yhat,0,pch=19,col='red') 

#estimate of pollutant concentration with parameter and sampling uncertainty
yhat2=rnorm(nrow(param),mean=param[,'b0']+param[,'b1']*10,sd=sqrt(param[,'sig2'])) 
yhat2a=density(yhat2)
lines(yhat2a$x,yhat2a$y,col='grey')

We now depict with a grey line the overall uncertainty that arises from sampling and parameter uncertainty combined (i.e., the distribution of \(y_{ij}^{new}\)). As you will notice, the level of uncertainty is now much larger.

Notice that what I mean by sampling uncertainty is the uncertainty that arises from the PDF or PMF used as our likelihood. These distributions account for random fluctuations/deviations between our predictions and our actual observations. Some people might interpret sampling uncertainty in a more literal way (i.e., uncertainty that arises from selecting different samples from the population). This is not what I mean by sampling uncertainty.

When we make predictions, which sources of uncertainty should we include?

Notice that if we report only parameter uncertainty, this just describes our level of uncertainty regarding the mean nitrate concentration for rivers with row crops occupying 10% of the basin. In other words, this just tells us what we expect nitrate concentration for rivers with this covariate value to be on average. However, if we want to make predictions for a new river, we have to report the uncertainty around the mean together with sampling uncertainty.
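This distinction is easy to see with intervals. In the sketch below, the posterior draws are simulated with hypothetical values (not the river posterior): the 95% interval for the mean at \(x=10\) is much narrower than the 95% prediction interval for a new observation at \(x=10\):

```r
set.seed(1)
n <- 10000
b0 <- rnorm(n, 0.6, 0.1)   #hypothetical posterior draws for the intercept
b1 <- rnorm(n, 0.05, 0.01) #hypothetical posterior draws for the slope
sig2 <- rep(0.25, n)       #hypothetical residual variance

mu.draws  <- b0 + b1 * 10                               #parameter uncertainty only
new.draws <- rnorm(n, mean = mu.draws, sd = sqrt(sig2)) #+ sampling uncertainty

quantile(mu.draws,  c(0.025, 0.975)) #95% credible interval for the mean
quantile(new.draws, c(0.025, 0.975)) #95% prediction interval (much wider)
```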

By the way, is there anything that already catches your attention about our model having some issues? Notice that our model allows for a substantial probability of rivers having a negative concentration of nitrate! This is an important shortcoming, suggesting that the model could and should be improved to avoid this.

These different levels of uncertainty are particularly important when making predictions. For instance, say this model describes the concentration of a particular pollutant in the river (i.e., our response variable). Say we want to predict this concentration for a new river with covariate value equal to 10. Assume that rivers with concentrations above 0.5 are declared to be impaired and the federal government has to take action to recover these impaired rivers. Should we make predictions using the estimated mean trend or should we account for parameter uncertainty? Alternatively, should we account for sampling uncertainty as well?

We get very different results, depending on the amount of uncertainty that we include in our calculations. For instance, if we:

  • don’t account for any source of uncertainty, then \(p(\hat{y_i}>0.5)=1\)
  • account for parameter uncertainty around the mean, then \(p(\hat{y_{ij}}>0.5)=0.94\)
  • account for sampling and parameter uncertainty, we get \(p(y_{ij}^{new}>0.5)=0.60\).
#what is the probability of river impairment for x=10?
mean(yhat>0.5) #1
## [1] 1
mean(yhat1>0.5) #0.94
## [1] 0.948
mean(yhat2>0.5) #0.60
## [1] 0.6095556



Comments?

Send me an email at