A common assumption when developing our statistical models has been that of conditional independence (i.e., independence after we take into account the parameters, covariates, and latent variables). As discussed in the “mixed models” notes, this is the assumption that allows us to write the likelihood as a product of many univariate distributions.

In practice, however, observations are often not independent. Ignoring this lack of independence often leads to over-optimistically narrow confidence intervals for parameter estimates because we assume that we have more information than we actually have. To see why this is the case, let’s think about the variance around the mean. If all the observations are independent and have the same variance, we have that:

\[Var(\bar{x})=Var(\frac{\sum_{i=1}^n x_i}{n})=\frac{1}{n^2}\sum_{i=1}^n Var(x_i) = \frac{nVar(x_i)}{n^2}=\frac{Var(x_i)}{n}\]

On the other hand, if some of the observations are positively correlated (i.e., \(Cov(x_i,x_j )>0\) for some i and j), then we have that:

\[Var(\bar{x})=Var(\frac{\sum_{i=1}^n x_i}{n})=\frac{1}{n^2}[\sum_{i=1}^n Var(x_i) + \sum_{i \neq j}Cov(x_i,x_j)]= \frac{Var(x_i)}{n}+\frac{\sum_{i \neq j}Cov(x_i,x_j)}{n^2}\]

Because \(\frac{\sum_{i \neq j}Cov(x_i,x_j)}{n^2}>0\), the uncertainty around the mean for dependent samples is larger than the uncertainty around the mean for independent samples. Similar ideas apply to other statistics (e.g., regression parameter estimates). This should be intuitive: sampling 100 different individuals is likely to be much more informative than sampling 10 individuals 10 times through time, despite the fact that we have 100 observations in both cases. Depending on what we are measuring, the 2nd to 10th measurements on a given individual might yield very little extra information beyond the 1st measurement on that individual.
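
To see this numerically, here is a minimal sketch (all numbers are made up for illustration) comparing the variance of the sample mean for 100 independent observations versus 100 observations obtained by measuring 10 individuals 10 times each, where repeated measurements on the same individual are positively correlated through a shared individual-level effect:

set.seed(1)
nsim=1000
mean.indep=rep(NA,nsim)
mean.dep=rep(NA,nsim)
for (s in 1:nsim){
  x.indep=rnorm(100)                        #100 independent individuals, each with variance 1
  indiv=rep(rnorm(10,sd=sqrt(0.5)),each=10) #shared effect for each of 10 individuals
  x.dep=indiv+rnorm(100,sd=sqrt(0.5))       #10 repeated measurements per individual, variance still 1
  mean.indep[s]=mean(x.indep)
  mean.dep[s]=mean(x.dep)
}
var(mean.indep) #close to 1/100 = 0.01
var(mean.dep)   #substantially larger because of the extra covariance terms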

Lack of independence can often arise because of space (samples closer in space might be more highly correlated) and/or time (samples closer in time might be more highly correlated). Therefore, one way to model spatial or temporal correlation is to explicitly parameterize this correlation as a function of distance or time difference. This is often done using a multivariate normal distribution. For example, to model spatial correlation, a commonly adopted parameterization is given by:

\[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} \sim N( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix} , \sigma^2 \begin{pmatrix} 1 & exp(-\rho D_{12}) & exp(-\rho D_{13}) & exp(-\rho D_{14}) \\ exp(-\rho D_{12}) & 1 & exp(-\rho D_{23}) & exp(-\rho D_{24}) \\ exp(-\rho D_{13}) & exp(-\rho D_{23}) & 1 & exp(-\rho D_{34}) \\ exp(-\rho D_{14}) & exp(-\rho D_{24}) & exp(-\rho D_{34}) & 1 \end{pmatrix} ) \]

where \(D_{ij}\) is the distance between observations i and j. Notice that, in this formulation, we are assuming that correlation decays exponentially with distance and that the rate of decay is governed by the parameter \(\rho\), where \(\rho\) is a positive real number. Here is what this function looks like for different values of \(\rho\):

#plot the exponential decay of correlation with distance for different values of rho
dist=seq(from=0,to=100,length.out=1000)
rho=c(0.01,0.1,0.5,1)
plot(NA,NA,xlim=range(dist),ylim=c(0,1),xlab='Distance',ylab='Correlation')
for (i in 1:length(rho)){
  cor1=exp(-rho[i]*dist) #correlation implied by exp(-rho*D)
  lines(dist,cor1,col=i)
}
legend('topright',legend=paste('rho =',rho),col=1:length(rho),lty=1)

This formulation for the correlation matrix is often adopted for two main reasons. First, because both \(\rho\) and \(D_{ij}\) are non-negative numbers, \(exp(-\rho D_{ij})\) will always be between 0 and 1, which is important given that correlations cannot be smaller than -1 or greater than 1. Second, this formulation ensures that the resulting matrix is symmetric and positive-definite, an important requirement for any covariance matrix.
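
As an illustration, here is a minimal sketch (coordinates and parameter values are made up) that builds this covariance matrix from a set of pairwise distances and draws one realization from the corresponding multivariate normal distribution using MASS::mvrnorm:

library(MASS)
set.seed(1)
coord=cbind(runif(4,0,10),runif(4,0,10)) #hypothetical x-y coordinates of 4 sampling locations
D=as.matrix(dist(coord))                 #matrix of pairwise distances D_ij
rho=0.5; sig2=2; mu=rep(0,4)             #made-up parameter values
Sigma=sig2*exp(-rho*D)                   #sigma^2 * exp(-rho*D_ij); diagonal equals sigma^2
round(cov2cor(Sigma),2)                  #implied correlation matrix
y=mvrnorm(n=1,mu=mu,Sigma=Sigma)         #one draw from the multivariate normal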

In relation to temporal correlation, one of the simplest parameterizations is that of an autoregressive model of order 1 (AR-1), given by:

\[ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} \sim N( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix} , \sigma^2 \begin{pmatrix} 1 & \rho ^ {|t_1 - t_2|} & \rho ^ {|t_1 - t_3|} & \rho ^ {|t_1 - t_4|} \\ \rho ^ {|t_1 - t_2|} & 1 & \rho ^ {|t_2 - t_3|} & \rho ^ {|t_2 - t_4|} \\ \rho ^ {|t_1 - t_3|} & \rho ^ {|t_2 - t_3|} & 1 & \rho ^ {|t_3 - t_4|} \\ \rho ^ {|t_1 - t_4|} & \rho ^ {|t_2 - t_4|} & \rho ^ {|t_3 - t_4|} & 1 \end{pmatrix} ) \]

where \(t_i\) is a time variable that indicates when observation i was measured. In this model, \(\rho\) is a real number between 0 and 1 and \(|t_i - t_j|\) is the absolute value of the time difference between observations i and j. Again, notice that \(\rho^{|t_i - t_j|}\) will always be constrained to be between 0 and 1. For example, if samples are collected at regular intervals (e.g., every year), then we would see the following discrete version of the exponential decay pattern:

#plot the discrete decay of correlation rho^|time difference| for different values of rho
time.dif=0:10
rho=c(0.11,0.3,0.6,0.95)
plot(NA,NA,xlim=range(time.dif),ylim=c(0,1),xlab='Absolute time difference',ylab='Correlation')
for (i in 1:length(rho)){
  cor1=rho[i]^time.dif #correlation implied by rho^|t_i - t_j|
  lines(time.dif,cor1,col=i)
  points(time.dif,cor1,col=i)
}
legend('topright',legend=paste('rho =',rho),col=1:length(rho),lty=1,pch=1)
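
Similarly to the spatial case, here is a minimal sketch (measurement times and parameter values are made up) that builds the AR-1 covariance matrix directly from the sampling times, which do not need to be equally spaced, and draws one realization with MASS::mvrnorm:

library(MASS)
set.seed(1)
times=c(1,2,4,7)                     #hypothetical measurement times for 4 observations
rho=0.6; sig2=2; mu=rep(0,4)         #made-up parameter values
time.dif=abs(outer(times,times,'-')) #matrix of |t_i - t_j|
Sigma=sig2*rho^time.dif              #sigma^2 * rho^|t_i - t_j|; diagonal equals sigma^2
round(cov2cor(Sigma),2)              #implied correlation matrix
y=mvrnorm(n=1,mu=mu,Sigma=Sigma)     #one draw from the multivariate normal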

This is called an AR-1 model because it can be shown that, conditional on \(y_{t-1}\), \(y_t\) does not depend on observations further in the past (e.g., \(y_{t-2}\), \(y_{t-3}\), …). More specifically, for observations taken one time unit apart:

\[p(y_t|y_{t-1},y_{t-2},y_{t-3},...)=N(\mu_t + \rho (y_{t-1}-\mu_{t-1}),\sigma^2 (1-\rho^2))\]
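
We can check this equivalence numerically. Here is a minimal sketch (assuming \(\mu_t = 0\) and \(\sigma^2 = 1\), both made-up values) that simulates a long series using only this one-step conditional distribution and verifies that the lag-k correlations are approximately \(\rho^k\), as implied by the covariance matrix above:

set.seed(1)
n=10000; rho=0.6
y=rep(NA,n)
y[1]=rnorm(1) #marginal distribution: N(0,1)
for (t in 2:n){
  y[t]=rnorm(1,mean=rho*y[t-1],sd=sqrt(1-rho^2)) #conditional distribution from the formula above
}
acf(y,lag.max=5,plot=FALSE)$acf[2:6] #empirical lag 1-5 correlations
rho^(1:5)                            #theoretical values rho^k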


