Mixed/multi-level models
A central assumption in many of the statistical models we have dealt with so far has been conditional independence when writing down the likelihood. For instance, in a regression setting, we assume that, given the covariates \(x_{i1},...,x_{ip}\) for \(i=1,...,n\), our response variables \(y_1,...,y_n\) are independent. Assuming a normal distribution, this assumption allows us to write the likelihood as the product of the individual normal densities:
\[p(y_1,...,y_n|X)=\prod_{i=1}^n N(y_i|\beta_0 +\beta_1 x_{i1}+...+\beta_p x_{ip},\sigma^2)\] However, this assumption might not be valid. For instance, we might have taken multiple measurements on individuals. In this case, we would expect the response variables \(y_i\) and \(y_j\) to be more alike if both were measured on the same individual than if they were measured on different individuals. A typical example is measuring blood pressure every hour during a 6-hour period for 10 individuals (these are called repeated measures), in which case measurements are clustered within persons. Another example involves spatial clustering. For instance, it might be the case that measurements on patients from the same hospital are more alike than on patients in different hospitals. We can even have multiple levels of nesting. For instance, patients within the same hospital are more alike, and hospitals within the same county are more alike.
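To make the role of this assumption concrete, here is a minimal sketch in Python (the function names and toy data are illustrative): under conditional independence, the product of normal densities becomes a sum of log densities.

```python
import math

def normal_logpdf(y, mean, sigma2):
    """Log density of N(mean, sigma2) evaluated at y."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (y - mean) ** 2 / sigma2)

def regression_loglik(y, X, beta0, beta, sigma2):
    """Log-likelihood of a normal regression under conditional independence:
    the product over observations becomes a sum of log densities."""
    total = 0.0
    for yi, xi in zip(y, X):
        mean = beta0 + sum(b * x for b, x in zip(beta, xi))
        total += normal_logpdf(yi, mean, sigma2)
    return total

# toy data with two covariates (p = 2, n = 3)
y = [1.0, 2.1, 2.9]
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ll = regression_loglik(y, X, beta0=1.0, beta=[1.0, 0.0], sigma2=1.0)
```

If the observations were correlated, this sum would no longer be the correct log-likelihood, which is exactly the problem discussed next.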
There are several reasons for this correlation. For instance, when talking about blood pressure measurements, it might be the case that gender, age, weight, height, body mass index, and genotype (to name a few) influence our response variable, making measurements within an individual more alike than between individuals. One might then argue that we can control for these differences between individuals with a regression (i.e., just add these variables as covariates). However, this list of individual-level characteristics goes on and on, and it is virtually impossible to be sure that we have accounted for all possible individual-level differences. This is what makes random effects so powerful: we can acknowledge the correlation in our data without having to specify why (i.e., due to which missing covariates) this correlation exists.
Here we will focus on the radon example provided in (Gelman and Hill 2007). “Radon is a carcinogen - a naturally occurring radioactive gas whose decay products are also radioactive - known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States. The distribution of radon levels in US homes varies greatly, with some houses having dangerously high concentrations. In order to identify the areas with high radon exposures, the EPA coordinated radon measurements in a random sample of more than 80,000 houses throughout the country” (Gelman and Hill 2007). According to the CDC, radon exposure is the second leading cause of lung cancer, behind smoking. In this example, radon concentration was measured in multiple houses, either in the basement (\(x_{ik1}=0\)) or at the first-floor level (\(x_{ik1}=1\)). Furthermore, we also have the uranium concentration in the soil for each county (\(x_{k2}\)). Uranium is important because it is the first element in a long decay series that eventually produces radon. In this model, the log-radon concentration for the i-th house in the k-th county is denoted by \(y_{ik}\). Say we have \(n_k\) houses in the k-th county (k=1,...,K). Here are a couple of options (and their drawbacks) for analyzing these data.
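Before turning to those options, the nested structure of the data can be sketched with a small simulation (a hypothetical stand-in, not the actual EPA survey; all parameter values and counts below are made up for illustration):

```python
import random

random.seed(42)

K = 5                                  # number of counties (illustrative)
gamma, tau = 1.5, 0.4                  # mean and sd of county baselines (assumed)
beta1, beta2, sigma = -0.7, 0.8, 0.6   # floor effect, uranium effect, noise sd (assumed)

counties = []
for k in range(K):
    u_k = random.gauss(0.0, 1.0)       # soil uranium for county k (x_{k2})
    b0k = random.gauss(gamma, tau)     # county-specific baseline log-radon
    n_k = random.randint(2, 10)        # unequal numbers of houses per county
    houses = []
    for i in range(n_k):
        floor = random.choice([0, 1])  # basement (0) vs first floor (1), x_{ik1}
        y_ik = random.gauss(b0k + beta1 * floor + beta2 * u_k, sigma)
        houses.append((y_ik, floor))
    counties.append({"uranium": u_k, "houses": houses})
```

Houses sharing the same `b0k` are correlated, even though each `y_ik` is drawn independently given the county effect; this is exactly the within-county clustering described above.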
Single regression
To analyze these data, one option would be to perform a single regression, where we assume that observations are independent. This model is given by:
\[\propto [\prod_{k=1}^K \prod_{i=1}^{n_k}N(y_{ik}|\beta_0 + \beta_1 x_{ik1} + \beta_2 x_{k2},\sigma^2)]\] \[\times N(\beta_0|0,10)N(\beta_1|0,10)N(\beta_2|0,10)Gamma(\frac{1}{\sigma^2}|a,b)\]
The problem with this approach is that we might be failing to account for other important differences between counties, resulting in measurements within counties being correlated. In this case, the independence assumption used to construct the likelihood would not be valid. Can you think of covariates at the county level that we might be missing here?
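For concreteness, the unnormalized log-posterior above can be evaluated directly. A minimal sketch, assuming the 10 in \(N(\cdot|0,10)\) is a variance and a rate parameterization for the Gamma prior (the text leaves \(a\) and \(b\) generic, so the defaults below are placeholders):

```python
import math

def norm_lp(x, m, v):
    """Log density of N(m, v) at x."""
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def gamma_lp(x, a, b):
    """Log density of Gamma(a, b) at x, rate parameterization (an assumption)."""
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x

def log_posterior(y, x1, x2, beta0, beta1, beta2, sigma2, a=2.0, b=1.0):
    """Unnormalized log-posterior of the pooled regression: normal likelihood
    plus N(0,10) priors on the betas and a Gamma(a,b) prior on 1/sigma^2."""
    lp = sum(norm_lp(yi, beta0 + beta1 * x1i + beta2 * x2i, sigma2)
             for yi, x1i, x2i in zip(y, x1, x2))
    for beta in (beta0, beta1, beta2):
        lp += norm_lp(beta, 0.0, 10.0)
    lp += gamma_lp(1.0 / sigma2, a, b)
    return lp

# parameters consistent with the data score higher than parameters far from it
lp_good = log_posterior([0.0], [0.0], [0.0], 0.0, 0.0, 0.0, 1.0)
lp_bad = log_posterior([0.0], [0.0], [0.0], 5.0, 0.0, 0.0, 1.0)
```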
Separate regression for each county
An alternative solution is to perform a separate regression for each county. In this case, we would be adopting a much milder assumption of independence of observations within each county instead of independence within and across counties as before. We might still have issues here but at least the assumption seems to be better justified. In this case, our model is given by:
\[\propto [\prod_{i=1}^{n_k}N(y_{ik}|\beta_{0k} + \beta_{1k} x_{ik1} + \beta_{2k} x_{k2},\sigma_k^2)]\] \[\times N(\beta_{0k}|0,10)N(\beta_{1k}|0,10)N(\beta_{2k}|0,10)Gamma(\frac{1}{\sigma_k^2}|a,b)\]
Notice that we will have one set of parameters \(\beta_{0k},\beta_{1k},\beta_{2k},\sigma_k^2\) for each county k. We have three problems with this approach:
- it will be hard to estimate these parameters for counties with few observations
- it is not clear how to summarize the information from all these regressions, and
- there is no way of estimating the effect of the county-level covariate (i.e., uranium in the soil) if we have one regression for each county, because \(x_{k2}\) is constant within a county.
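The third drawback can be verified directly: within a single county, \(x_{k2}\) takes the same value for every house, so the intercept column and the uranium column of the design matrix are proportional, and \(\beta_{0k}\) and \(\beta_{2k}\) cannot be separated. A small sketch with illustrative values:

```python
x_k2 = 0.8                 # county-level uranium: constant within the county
x1 = [0, 1, 0, 1]          # floor indicator for four houses in this county

def fitted(beta0k, beta1k, beta2k):
    """Fitted means for this county's houses under the per-county regression."""
    return [beta0k + beta1k * x + beta2k * x_k2 for x in x1]

# Two different parameter vectors yield identical fitted means, so the
# likelihood cannot distinguish them:
m1 = fitted(beta0k=1.0, beta1k=-0.5, beta2k=0.0)
m2 = fitted(beta0k=1.0 - 2.0 * x_k2, beta1k=-0.5, beta2k=2.0)
```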
Mixed-effects model
Finally, a third alternative is to implement a mixed-effects model. We could try different types of mixed-effects models for this example. We will start with one of the simplest, called a varying-intercept regression model, given by:
\[\propto \prod_{k=1}^K \prod_{i=1}^{n_k} N(y_{ik}|\beta_{0k} + \beta_1 x_{ik1} + \beta_2 x_{k2} , \sigma^2)\] Notice that the slope parameters \(\beta_1\) and \(\beta_2\) and the variance \(\sigma^2\) do not change from county to county (although this can be allowed) but the intercept \(\beta_{0k}\) does. We then adopt the following priors:
\[\beta_{0k} \sim N(\gamma,\tau^2)\] \[\beta_1,\beta_2 \sim N(0,10) \] \[\frac{1}{\sigma^2} \sim Gamma(a,b)\] Notice that we still have to specify priors for \(\gamma\) and \(\tau^2\). These are called hyper-priors. Let’s assume that:
\[\gamma \sim N(0,10)\] \[\frac{1}{\tau^2} \sim Gamma(a,b)\]
It turns out that this model does not have many of the issues of our earlier models. For instance, we are allowing for the fact that different counties might have different baseline radon concentration by having county-specific intercepts. However, this does not preclude us from determining the impact that a county level covariate (uranium concentration in the soil) has on radon concentration. Furthermore, we can fit this model without having to discard counties with few observations.
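The reason counties with few observations are no longer a problem is partial pooling. By normal-normal conjugacy, the conditional posterior mean of \(\beta_{0k}\) is a precision-weighted average of the county's adjusted data mean and the population mean \(\gamma\). A sketch (the function name and numerical values are illustrative):

```python
def intercept_posterior_mean(resid, sigma2, gamma, tau2):
    """Conditional posterior mean of beta_{0k} in the varying-intercept model.
    `resid` holds y_{ik} - beta1*x_{ik1} - beta2*x_{k2} for the county's houses;
    the result is a precision-weighted average of their mean and gamma."""
    n_k = len(resid)
    data_prec = n_k / sigma2      # precision contributed by the county's data
    prior_prec = 1.0 / tau2       # precision from the N(gamma, tau^2) prior
    ybar = sum(resid) / n_k
    return (data_prec * ybar + prior_prec * gamma) / (data_prec + prior_prec)

# a county with 50 houses stays near its own mean; a county with 1 house
# is shrunk strongly toward gamma
big = intercept_posterior_mean([2.0] * 50, sigma2=1.0, gamma=0.0, tau2=0.25)
small = intercept_posterior_mean([2.0], sigma2=1.0, gamma=0.0, tau2=0.25)
```

This shrinkage toward \(\gamma\) is what stabilizes the estimates for sparsely observed counties.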
Notice that we could also have modeled the slope for \(x_{ik1}\) (instead of just the intercepts) as random effects, allowing this slope to vary from county to county. If we did this, we would be allowing for the difference in radon measurements taken at the basement vs. 1st floor to vary from county to county. In some situations, modeling intercepts and slopes as random effects can be very important to avoid generating over-confident inference on the slope parameter (Schielzeth and Forstmeier 2009).
Why are these models called hierarchical? These models are called Hierarchical Bayesian models because we have the following 3-level hierarchy:
In other words, the data provide information on \(\beta_{01},...,\beta_{0K},\beta_1,\beta_2,\sigma^2\) through the likelihood. We obtain the information to estimate \(\gamma\) and \(\tau^2\) from \(\beta_{01},...,\beta_{0K}\). Finally, the elements shown in grey are set by us and thus are not estimated. Notice that some models might have many more levels than just 3. An additional observation is that this graph is not quite right, in the sense that only data from county 1 influence the intercept \(\beta_{01}\), and so on for the other counties. However, to avoid drawing too many arrows, I will leave it like this.
A different description of the same model
Notice that another way of writing this model would be:
\[y_{ik} \sim N(\beta_{0k}+\beta_1 x_{ik1},\sigma^2) \] \[\beta_{0k} \sim N(\gamma +\beta_2 x_{k2},\tau^2)\]
We have described the model differently, but it is still the same model. The idea here is that we have regressions at two different spatial scales. The first regression is at the house level, and we allow for a ‘county’ effect, represented by \(\beta_{0k}\). The second regression is at the county level, and we model the ‘county’ effects as a function of the uranium in the soil. This is appropriate because this covariate is also at the county level.
To see why these models are equivalent, we can write:
\[\beta_{0k} \sim N(\gamma+\beta_2 x_{k2},\tau^2 )\] as
\[\beta_{0k}=\gamma+\beta_2 x_{k2}+e_k\] where \(e_k\) is the error term for the k-th intercept and we assume that \(e_k \sim N(0,\tau^2)\). Then, we plug this back into the original equation to get:
\[\propto \prod_{k=1}^K \prod_{i=1}^{n_k} N(y_{ik}|\beta_{0k}+\beta_1 x_{ik1},\sigma^2)\] \[\propto \prod_{k=1}^K \prod_{i=1}^{n_k} N(y_{ik}|(\gamma+\beta_2 x_{k2}+e_k)+\beta_1 x_{ik1},\sigma^2) \] \[\propto \prod_{k=1}^K \prod_{i=1}^{n_k} N(y_{ik}|\gamma+e_k + \beta_1 x_{ik1} + \beta_2 x_{k2},\sigma^2) \] where \(\gamma+e_k = \beta_{0k}^*\). I will not delve deeper into this second formulation, but this is the basis for multi-level regression models, in this case one at the house level and the other at the county level. One nice example of these types of models is (Qian et al. 2010).
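As a sanity check on this equivalence, we can simulate \(\beta_{0k}\) under both formulations and confirm they agree (a Monte Carlo sketch with made-up parameter values):

```python
import random

random.seed(0)
gamma, beta2, x_k2, tau = 0.5, 0.8, 1.2, 0.3
N = 200_000

# formulation 1: draw beta_{0k} directly from N(gamma + beta2*x_{k2}, tau^2)
direct = [random.gauss(gamma + beta2 * x_k2, tau) for _ in range(N)]

# formulation 2: beta_{0k} = gamma + beta2*x_{k2} + e_k, with e_k ~ N(0, tau^2)
decomposed = [gamma + beta2 * x_k2 + random.gauss(0.0, tau) for _ in range(N)]

mean1 = sum(direct) / N
mean2 = sum(decomposed) / N
```

Both sample means land near \(\gamma + \beta_2 x_{k2} = 1.46\), and the two samples follow the same distribution, as the algebra above predicts.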
References