Intro
Linear models encompass a large number of the models that we typically see in stats 101 courses, such as t-tests, ANOVAs, ANCOVAs, and linear regression models. While these tests might seem very different from one another, they are all considered part of the linear model family because they can all be written as:
\[y_i \sim N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ..., \sigma^2)\]
This equation should make it crystal clear that t-tests, ANOVAs, ANCOVAs, and linear regression models share the following assumptions:
- Observations are independent given model parameters and covariates. This can be seen by noticing that \(y_i\) is drawn independently from the other observations;
- Constant variance: this is clear if we notice that \(\sigma^2\) does not depend on any covariate;
- The error is normally distributed: due to the properties of the normal distribution, the equation above can also be written as \(y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + e_i\), where \(e_i \sim N(0,\sigma^2)\);
- The model is linear in the parameters: each intercept or slope parameter is multiplied by 1 or by a (possibly transformed) covariate. For example, \(\beta_0 + \beta_1 x_1 + \beta_2 x_1^2\) is a valid linear model even though we are actually describing a quadratic curve, not a straight line. On the other hand, \(\frac{\beta_0}{\beta_1 + \beta_2 x_1}\) is not a linear model because the parameters do not enter the equation linearly (see the short example right after this list).
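To make this last point concrete, here is a minimal R sketch (with made-up data and coefficient values): the quadratic model can be fit with “lm” because it is linear in the parameters, whereas the ratio model cannot be expressed in “lm”’s formula interface at all.
#simulate data from a quadratic (but still linear-in-the-parameters) model
set.seed(1)
x1=runif(100,min=0,max=5)
y=1+2*x1-0.5*x1^2+rnorm(100,sd=0.3)

#lm() handles this because each parameter multiplies 1 or a covariate
summary(lm(y~x1+I(x1^2)))

#by contrast, beta0/(beta1+beta2*x1) cannot be written this way;
#it would require a nonlinear fitter such as nls()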
Bayesian linear model
Because all of these models (i.e., t-tests, ANOVAs, ANCOVAs, and linear regression) can be described by this single equation, I will rely on a single model specification in JAGS. Here is the model (stored in the file “linear_model.R”):
model{
  #likelihood
  for (i in 1:nobs){
    mu[i]<-inprod(xmat[i,],betas) #mean: beta0*1 + beta1*x1 + beta2*x2 + ...
    y[i]~dnorm(mu[i],prec1)       #JAGS parameterizes the normal with a precision
  }

  #priors
  sd~dunif(0,1000) #vague prior for the residual standard deviation
  prec1<-1/(sd*sd) #convert the standard deviation into a precision
  for (j in 1:nparam){
    betas[j]~dnorm(0,0.1) #vague prior for each regression parameter
  }
}
Notice that I rely on an inner-product operation using the function “inprod”. This is just a simpler way to write \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\) using a tiny bit of linear algebra. You can read more about this in the “Miscellanea” section focused on inner products.
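For those unfamiliar with this operation, here is a small R illustration (with made-up numbers) of what “inprod” computes; the R equivalent is a sum of element-wise products.
#made-up parameter values and covariate values for one observation
betas=c(0.5,2,-1) #beta0, beta1, beta2
xvec=c(1,3,4)     #1 (for the intercept), x1, x2

#written out term by term: beta0*1 + beta1*x1 + beta2*x2
0.5*1 + 2*3 + (-1)*4

#the same result as an inner product
sum(xvec*betas)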
To illustrate how to fit all of these models in a Bayesian framework, I will go over them one at a time. For each model, I will first simulate some data, then fit the model in a frequentist setting, and finally fit it within a Bayesian framework.
2-sample t-test
In a two-sample t-test, we are interested in comparing the means of two groups. Here is how I would simulate data for this test:
rm(list=ls())
set.seed(1)
#simulation settings
nobs.per.grp=50
mu1=1
mu2=2
sd1=0.8
#simulate data
y1=rnorm(nobs.per.grp,mean=mu1,sd=sd1)
y2=rnorm(nobs.per.grp,mean=mu2,sd=sd1)
#look at data
breaks1=seq(from=min(c(y1,y2)),to=max(c(y1,y2)),length.out=10)
par(mfrow=c(2,1),mar=c(3,3,1,1))
hist(y1,breaks=breaks1,main='')
hist(y2,breaks=breaks1,main='')
Frequentist 2-sample t-test
Now let’s fit a standard 2-sample t-test in R:
#fit 2-sample t-test
t.test(x=y1,y=y2,var.equal=T)
##
## Two Sample t-test
##
## data: y1 and y2
## t = -7.0169, df = 98, p-value = 2.965e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3001327 -0.7268724
## sample estimates:
## mean of x mean of y
## 1.080359 2.093861
Because the p-value is very small, we can reject the null hypothesis that the two groups have the same mean.
2-sample t-test as a frequentist linear model
To show how this fits into the linear model structure, notice that I can create a binary covariate \(x_i\) that is equal to 0 if the corresponding observation comes from group 1 and equal to 1 otherwise. As a result, we have that:
- the mean for group 1 is given by \(\beta_0 + \beta_1 \times 0 = \beta_0\); and
- the mean for group 2 is given by \(\beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1\).
Below, I will create this covariate and I will fit this linear model using the “lm” function:
#re-format the data
dat=data.frame(y=c(y1,y2),
               x=c(rep(0,nobs.per.grp),rep(1,nobs.per.grp)))
mod=lm(y~x,data=dat)
summary(mod)
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85212 -0.48641 0.00402 0.46003 1.82743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0804 0.1021 10.578 < 2e-16 ***
## x 1.0135 0.1444 7.017 2.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7222 on 98 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3276
## F-statistic: 49.24 on 1 and 98 DF, p-value: 2.965e-10
Notice that the p-value for the covariate is identical to the one from the 2-sample t-test. Again, because this p-value is very small, we reject the null hypothesis that the two groups have the same mean.
Finally, notice that our estimates of the mean of each group are very close to the true values.
mean1=c(mod$coefficients[1],                     #estimate of mu1
        mod$coefficients[1]+mod$coefficients[2]) #estimate of mu2
mean1
## (Intercept) (Intercept)
## 1.080359 2.093861
2-sample t-test as a Bayesian linear model
Now I will feed the formatted data into our Bayesian linear model:
#create design matrix
xmat=cbind(1,dat$x)
y=dat$y
nobs=nrow(dat)
nparam=ncol(xmat)
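As an optional sanity check, this hand-built design matrix should match (up to column names) the one that R created internally for the “lm” fit, which the “model.matrix” function extracts:
#optional check: compare against R's internal design matrix
head(model.matrix(mod))
head(xmat)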
#fit Bayesian linear model
library(jagsUI)
# MCMC settings
ni <- 10000 #number of iterations
nt <- 5     #interval to thin
nb <- 5000  #number of iterations to discard as burn-in
nc <- 3     #number of chains
#set parameters to monitor
params=c("betas","sd")
#create function that sets the initial values
inits=function(){
  list(betas=rnorm(nparam,0,10), sd=rgamma(1,1,1))
}
#run model
setwd('U:\\uf\\courses\\bayesian course\\group activities\\7A Linear models')
dat1=list(y=y,xmat=xmat,nobs=nobs,nparam=nparam)
model2=jags(model.file="linear_model.R",inits=inits,
            parameters.to.save=params,
            data=dat1,n.chains=nc,
            n.burnin=nb,n.iter=ni,n.thin=nt,DIC=TRUE)
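Before interpreting the output, it is good practice to confirm that the chains have converged. Here is a minimal sketch using “jagsUI”’s built-in diagnostics (the “parameters” argument below simply restricts which traces are plotted):
#check convergence: Rhat values close to 1 (e.g., <1.1) for all
#parameters suggest the chains have converged
print(model2)

#trace plots for the monitored parameters should look like
#well-mixed 'fuzzy caterpillars'
traceplot(model2,parameters=c('betas','sd'))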
We can retrieve the posterior draws for each group mean and calculate the posterior mean and 95% credible interval for each group with the following calculations:
mu1=model2$sims.list$betas[,1]
mu2=model2$sims.list$betas[,1]+model2$sims.list$betas[,2]
grp1=c(mean(mu1),quantile(mu1,c(0.025,0.975)))
grp2=c(mean(mu2),quantile(mu2,c(0.025,0.975)))
tmp=rbind(grp1,grp2)
colnames(tmp)=c('post.mean','CI2.5','CI97.5')
rownames(tmp)=c('group 1','group 2')
round(tmp,2)
## post.mean CI2.5 CI97.5
## group 1 1.08 0.87 1.28
## group 2 2.09 1.89 2.30
Because the 95% credible intervals for the means of each group do not overlap, we can conclude that these groups have different means.
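Because we have posterior samples for both group means, we can also make a more direct probabilistic statement than the interval-overlap check. A short sketch using the “mu1” and “mu2” draws created above:
#posterior probability that the group 2 mean exceeds the group 1 mean
mean(mu2>mu1)

#posterior mean and 95% credible interval for the difference in means
diff1=mu2-mu1
round(c(mean(diff1),quantile(diff1,c(0.025,0.975))),2)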