Intro
Linear models encompass a large number of the models that we typically see in stats 101 courses, such as t-tests, ANOVAs, ANCOVAs, and linear regression models. While these tests might seem very different from one another, they are all considered part of the linear model family because they can all be written as:
\[y_i \sim N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ..., \sigma^2)\]
This equation should make it crystal clear that t-tests, ANOVAs, ANCOVAs, and linear regression models share the following assumptions:
- Observations are independent given model parameters and covariates. This can be seen by noticing that \(y_i\) is drawn independently from the other observations;
- Constant variance: this is clear if we notice that \(\sigma^2\) does not depend on any covariate;
- The error is normally distributed: due to the properties of the normal distribution, the equation above can also be written as \(y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + e_i\), where \(e_i \sim N(0,\sigma^2)\);
- The model is linear in the parameters: each intercept or slope parameter is multiplied by 1 or by a (possibly transformed) covariate. For example, \(\beta_0 + \beta_1 x_1 + \beta_2 x_1^2\) is a valid linear model even though we are actually describing a quadratic curve, not a straight line. On the other hand, \(\frac{\beta_0}{\beta_1 + \beta_2 x_1}\) is not a linear model because the parameters do not enter the equation linearly (see the short example right after this list).
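To make this last point concrete, here is a minimal R sketch (with made-up data and coefficient values): the quadratic model can be fit with “lm” because it is linear in the parameters, whereas the ratio model cannot be expressed in “lm”’s formula interface at all.
#simulate data from a quadratic (but still linear-in-the-parameters) model
set.seed(1)
x1=runif(100,min=0,max=5)
y=1+2*x1-0.5*x1^2+rnorm(100,sd=0.3)

#lm() handles this because each parameter multiplies 1 or a covariate
summary(lm(y~x1+I(x1^2)))

#by contrast, beta0/(beta1+beta2*x1) cannot be written this way;
#it would require a nonlinear fitter such as nls()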
Bayesian linear model
Because all of these models (i.e., t-tests, ANOVAs, ANCOVAs, and linear regression) can be described by this single equation, I will rely on a single model specification in JAGS. Here is the model (stored in the file “linear_model.R”):
model{
  #likelihood
  for (i in 1:nobs){
    mu[i]<-inprod(xmat[i,],betas) #mean: beta0*1 + beta1*x1 + beta2*x2 + ...
    y[i]~dnorm(mu[i],prec1)       #JAGS parameterizes the normal with a precision
  }

  #priors
  sd~dunif(0,1000) #vague prior for the residual standard deviation
  prec1<-1/(sd*sd) #convert the standard deviation into a precision
  for (j in 1:nparam){
    betas[j]~dnorm(0,0.1) #vague prior for each regression parameter
  }
}
Notice that I rely on an inner-product operation using the function “inprod”. This is just a simpler way to write \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\) using a tiny bit of linear algebra. You can read more about this in the “Miscellanea” section focused on inner products.
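For those unfamiliar with this operation, here is a small R illustration (with made-up numbers) of what “inprod” computes; the R equivalent is a sum of element-wise products.
#made-up parameter values and covariate values for one observation
betas=c(0.5,2,-1) #beta0, beta1, beta2
xvec=c(1,3,4)     #1 (for the intercept), x1, x2

#written out term by term: beta0*1 + beta1*x1 + beta2*x2
0.5*1 + 2*3 + (-1)*4

#the same result as an inner product
sum(xvec*betas)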
To illustrate how to fit all of these models in a Bayesian framework, I will go over them one at a time. For each model, I will first simulate some data, then fit the model in a frequentist setting, and finally fit it within a Bayesian framework.
2-sample t-test
In a two-sample t-test, we are interested in comparing the means of two groups. Here is how I would simulate data for this test:
rm(list=ls())
set.seed(1)
#simulation settings
nobs.per.grp=50
mu1=1
mu2=2
sd1=0.8
#simulate data
y1=rnorm(nobs.per.grp,mean=mu1,sd=sd1)
y2=rnorm(nobs.per.grp,mean=mu2,sd=sd1)
#look at data
breaks1=seq(from=min(c(y1,y2)),to=max(c(y1,y2)),length.out=10)
par(mfrow=c(2,1),mar=c(3,3,1,1))
hist(y1,breaks=breaks1,main='')
hist(y2,breaks=breaks1,main='')
Frequentist 2-sample t-test
Now let’s fit a standard 2-sample t-test in R:
#fit 2-sample t-test
t.test(x=y1,y=y2,var.equal=T)
##
## Two Sample t-test
##
## data: y1 and y2
## t = -7.0169, df = 98, p-value = 2.965e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3001327 -0.7268724
## sample estimates:
## mean of x mean of y
## 1.080359 2.093861
Because the p-value is very small, we can reject the null hypothesis that the two groups have the same mean.
2-sample t-test as a frequentist linear model
To show how this fits into the linear model structure, notice that I can create a binary covariate \(x_i\) that is equal to 0 if the corresponding observation comes from group 1 and equal to 1 otherwise. As a result, we have that:
- the mean for group 1 is given by \(\beta_0 + \beta_1 \times 0 = \beta_0\); and
- the mean for group 2 is given by \(\beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1\).
Below, I will create this covariate and I will fit this linear model using the “lm” function:
#re-format the data
dat=data.frame(y=c(y1,y2),
               x=c(rep(0,nobs.per.grp),rep(1,nobs.per.grp)))
mod=lm(y~x,data=dat)
summary(mod)
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85212 -0.48641 0.00402 0.46003 1.82743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0804 0.1021 10.578 < 2e-16 ***
## x 1.0135 0.1444 7.017 2.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7222 on 98 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3276
## F-statistic: 49.24 on 1 and 98 DF, p-value: 2.965e-10
Notice that the p-value for the covariate is identical to the one from the 2-sample t-test. Again, because this p-value is very small, we reject the null hypothesis that the two groups have the same mean.
Finally, notice that our estimates of the mean of each group are very close to the true values.
mean1=c(mod$coefficients[1],                     #estimate of mu1
        mod$coefficients[1]+mod$coefficients[2]) #estimate of mu2
mean1
## (Intercept) (Intercept)
## 1.080359 2.093861
2-sample t-test as a Bayesian linear model
Now I will feed the formatted data into our Bayesian linear model:
#create design matrix
xmat=cbind(1,dat$x)
y=dat$y
nobs=nrow(dat)
nparam=ncol(xmat)
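As an optional sanity check, this hand-built design matrix should match (up to column names) the one that R created internally for the “lm” fit, which the “model.matrix” function extracts:
#optional check: compare against R's internal design matrix
head(model.matrix(mod))
head(xmat)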
#fit Bayesian linear model
library(jagsUI)
# MCMC settings
ni <- 10000 #number of iterations
nt <- 5     #interval to thin
nb <- 5000  #number of iterations to discard as burn-in
nc <- 3     #number of chains
#set parameters to monitor
params=c("betas","sd")
#create function that sets the initial values
inits=function(){
  list(betas=rnorm(nparam,0,10), sd=rgamma(1,1,1))
}
#run model
setwd('U:\\uf\\courses\\bayesian course\\group activities\\7A Linear models')
dat1=list(y=y,xmat=xmat,nobs=nobs,nparam=nparam)
model2=jags(model.file="linear_model.R",inits=inits,
            parameters.to.save=params,
            data=dat1,n.chains=nc,
            n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
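Before interpreting the output, it is good practice to confirm that the chains have converged. Here is a minimal sketch using “jagsUI”’s built-in diagnostics (the “parameters” argument below simply restricts which traces are plotted):
#check convergence: Rhat values close to 1 (e.g., <1.1) for all
#parameters suggest the chains have converged
print(model2)

#trace plots for the monitored parameters should look like
#well-mixed 'fuzzy caterpillars'
traceplot(model2,parameters=c('betas','sd'))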
We can retrieve the posterior draws for each group mean and calculate the posterior mean and 95% credible interval for each group with the following calculations:
mu1=model2$sims.list$betas[,1]
mu2=model2$sims.list$betas[,1]+model2$sims.list$betas[,2]
grp1=c(mean(mu1),quantile(mu1,c(0.025,0.975)))
grp2=c(mean(mu2),quantile(mu2,c(0.025,0.975)))
tmp=rbind(grp1,grp2)
colnames(tmp)=c('post.mean','CI2.5','CI97.5')
rownames(tmp)=c('group 1','group 2')
round(tmp,2)
## post.mean CI2.5 CI97.5
## group 1 1.08 0.87 1.28
## group 2 2.09 1.89 2.30
Because the 95% credible intervals for the means of each group do not overlap, we can conclude that these groups have different means.
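Because we have posterior samples for both group means, we can also make a more direct probabilistic statement than the interval-overlap check. A short sketch using the “mu1” and “mu2” draws created above:
#posterior probability that the group 2 mean exceeds the group 1 mean
mean(mu2>mu1)

#posterior mean and 95% credible interval for the difference in means
diff1=mu2-mu1
round(c(mean(diff1),quantile(diff1,c(0.025,0.975))),2)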