The problem

Oftentimes we want to depict the main relationship between our response and our predictor(s), and the standard model for this is a regular Gaussian regression. One problem with the regular Gaussian regression is that it is strongly influenced by outliers and, as a consequence, it might fail to adequately capture the relationship we are trying to represent. For instance, the figure below shows a strong relationship between the response and the predictor variable (grey line), despite the presence of a few outlier observations. While the Gaussian regression works well in the absence of outliers (blue line), it has a hard time in the presence of outliers, failing to find an association between these variables (red line).

rm(list=ls(all=T))

set.seed(1)

#true parameter values
b0=1
b1=0.5
N=60
sig2=30

#generate data
x=runif(N,min=0,max=100)
y=rnorm(N,mean=b0+b1*x,sd=sqrt(sig2))

#regression without outliers
k=lm(y~x)
x1=c(0,100)
pred.no.out=k$coef[1]+k$coef[2]*x1
true=b0+b1*x1

#regression with outliers
x=c(x,1,22,10,35,40)
y=c(y,200,135,150,120,190)
k=lm(y~x)
pred.out=k$coef[1]+k$coef[2]*x1

#plot regression results 
plot(x,y)
lines(x1,true,lwd=3,col='grey')
lines(x1,pred.no.out,col='blue',lty=2,lwd=2)
lines(x1,pred.out,col='red',lty=2,lwd=2)
legend(60,190,lty=c(1,2,2),col=c('grey','blue','red'),c('true','no outliers','with outliers'))

#export fake data
data=data.frame(y=y,x=x)
write.csv(data,'fake data.csv',row.names=F)

One option is to eliminate these obvious outliers from your dataset. However, unless you can find a clear-cut explanation for why these observations are outliers, most scientists would object to this removal because it seems rather arbitrary and subjective. Worse, it might be seen as a way to increase the support for your pet hypothesis.

The model

Can we devise a statistical model that automatically identifies outlier observations and decreases their influence? The answer is yes, and this model is called robust regression. Here is the structure of the model:

\[\tau_i \sim Gamma(\frac{\nu}{2},\frac{\nu}{2})\] \[y_i \sim N(\beta_0+\beta_1 x_i, \frac{\sigma^2}{\tau_i})\]

The key feature of this model is that each observation is allowed to have its own variance parameter, given by \(\frac{\sigma^2}{\tau_i}\).
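
The original post does not show the fitting code at this point, but a minimal sketch of this model in JAGS, fitted through the rjags package, might look like the following. The priors on \(\beta_0\), \(\beta_1\), \(\sigma^2\), and \(\nu\), as well as all variable names, are my own choices rather than something prescribed by the model above.

library(rjags)

#robust regression model: each observation gets its own variance sig2/tau[i]
model_string="
model{
  for (i in 1:N){
    tau[i] ~ dgamma(nu/2, nu/2)
    prec[i] <- tau[i]/sig2          #JAGS parameterizes the normal with a precision
    y[i] ~ dnorm(b0+b1*x[i], prec[i])
  }
  b0 ~ dnorm(0, 0.001)              #vague priors (my own choices)
  b1 ~ dnorm(0, 0.001)
  sig2 ~ dunif(0, 1000)
  nu ~ dunif(0.1, 50)
}"

dat=read.csv('fake data.csv')
m=jags.model(textConnection(model_string),
             data=list(y=dat$y, x=dat$x, N=nrow(dat)), n.chains=3)
update(m, 1000)                     #burn-in
samples=coda.samples(m, c('b0','b1','sig2','nu'), n.iter=5000)
summary(samples)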

Why would this work?

Recall that the likelihood is the product of several densities:

\[\propto N(y_1|\beta_0 + \beta_1 x_1,\frac{\sigma^2}{\tau_1})\times N(y_2 | \beta_0 + \beta_1 x_2,\frac{\sigma^2}{\tau_2})\times...\times N(y_n | \beta_0 + \beta_1 x_n,\frac{\sigma^2}{\tau_n})\]

If the second observation is an outlier, then we would like to somehow remove its influence from this likelihood expression. In other words, we would like the exponential term in

\[N(y_2|\beta_0 +\beta_1 x_2 ,\frac{\sigma^2}{\tau_2})= \frac{1}{\sqrt{2\pi \frac{\sigma^2}{\tau_2}}} exp(-\frac{1}{2\frac{\sigma^2}{\tau_2}}(y_2 - (\beta_0+\beta_1 x_2))^2)\]

to be approximately equal to 1, so that this factor no longer depends on the regression parameters.

We can achieve this by setting \(\tau_2\) to be very small (i.e., very close to zero) because as \(\tau_2 \rightarrow 0\) then

\[\frac{1}{\sqrt{2\pi \frac{\sigma^2}{\tau_2}}}\rightarrow 0\] and

\[exp(-\frac{1}{2\frac{\sigma^2}{\tau_2}}(y_2 - (\beta_0+\beta_1 x_2))^2) \rightarrow exp(0)=1\]

In other words, the contribution of \(y_2\) to the likelihood is approximately constant with respect to \(\beta_0\) and \(\beta_1\), and therefore this particular observation has little to no impact on our estimates of these regression parameters.
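
To make this argument concrete, here is a small numerical check (my own illustration, not from the original post) using one of the outliers added above (x = 40, y = 190). With \(\tau = 1\) the Normal density for this point differs enormously between the true line and a line pulled towards the outlier, so the point strongly influences the fit; with a tiny \(\tau\) the two densities are nearly identical, so the point carries almost no information about \(\beta_0\) and \(\beta_1\).

#density of the outlier (x=40, y=190) under two candidate regression lines
x2=40; y2=190; sig2=30
dens=function(b0,b1,tau) dnorm(y2, mean=b0+b1*x2, sd=sqrt(sig2/tau))
tau2=c(1,0.0001)
cbind(tau=tau2,
      true.line=dens(1,0.5,tau2),   #density under the true line (b0=1, b1=0.5)
      pulled.line=dens(1,5,tau2))   #density under a line pulled towards the outlier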

Why do we use this particular prior for \(\tau_i\)?

The prior \(Gamma(\tau_i|\frac{\nu}{2},\frac{\nu}{2})\) has some interesting features:

  • This prior has a mean of \(\frac{\nu}{2}/\frac{\nu}{2}=1\). As a result, we expect the different \(\tau_i\) to be centered around 1. If all of these parameters were indeed equal to 1, we would be back to our original Gaussian regression model, whose likelihood is given by \(N(y_i|\beta_0+\beta_1 x_i,\sigma^2)\).

  • Notice that we are estimating the prior hyper-parameter \(\nu\). This parameter governs the spread of the \(\tau_i\) because \(Var[\tau_i]=\frac{\nu}{2}/[\frac{\nu}{2}]^2 =1/\frac{\nu}{2}=\frac{2}{\nu}\). Thus, bigger values of \(\nu\) imply that most \(\tau_i\) will be close to 1, while smaller values of \(\nu\) imply greater spread. It turns out that, if we marginalize out these \(\tau_i\) parameters, we end up with a t-distribution for the likelihood. The t-distribution has fatter tails than the normal distribution (the smaller the \(\nu\), the fatter the tails), which is particularly useful to avoid the strong effect of outliers on the mean of the distribution. A quick simulation illustrating this marginal t-distribution is sketched right after this list.
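
Here is the simulation just mentioned (my own illustration, not from the original post): draws generated hierarchically as \(\tau \sim Gamma(\frac{\nu}{2},\frac{\nu}{2})\) and \(y \sim N(0,\frac{\sigma^2}{\tau})\) should have the mean of \(\tau\) near 1, the variance of \(\tau\) near \(\frac{2}{\nu}\), and should match a scaled t-distribution with \(\nu\) degrees of freedom.

set.seed(1)
nu=3; sig2=30; n=100000

#check the moments of tau
tau=rgamma(n, shape=nu/2, rate=nu/2)
mean(tau); var(tau)                  #should be close to 1 and 2/nu

#hierarchical draws vs. scaled t draws
y.hier=rnorm(n, mean=0, sd=sqrt(sig2/tau))
y.t=sqrt(sig2)*rt(n, df=nu)
qqplot(y.t, y.hier, xlab='scaled t draws', ylab='hierarchical draws')
abline(0, 1, col='red')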

Does this model work?

If there were no outliers, both Gaussian and robust regressions would give approximately the same answer.

However, with the outliers, the robust regression would still be fine while the Gaussian regression would be a mess!
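
The comparison figures from the original post are not reproduced here, but assuming the 'samples' object from the JAGS sketch above was obtained from the data with outliers, and that the scatterplot created earlier is still the active graphics device, the robust fit could be overlaid roughly as follows (the colour, line type, and legend position are arbitrary choices).

#overlay the posterior-mean robust regression line on the earlier scatterplot
post=as.matrix(samples)
b0.hat=mean(post[,'b0'])
b1.hat=mean(post[,'b1'])
lines(x1, b0.hat+b1.hat*x1, col='darkgreen', lty=3, lwd=2)
legend(60, 150, lty=3, col='darkgreen', 'robust fit')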

Some students have asked me why not simply estimate the regression parameters using least squares. It turns out that this approach is identical to assuming a Gaussian regression and fitting it by maximum likelihood. As a result, using least squares to estimate the parameters will not avoid the problem of the estimates being overly sensitive to outliers. You can read more about these types of models in section “3.6 Robust curve fitting” of (Denison et al. 2002).
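
A quick check of this equivalence (my own illustration, not from the original post), using the fake data generated above: the least-squares coefficients from lm() and the maximum-likelihood estimates under a Gaussian model should agree up to numerical optimization error.

#least squares vs. Gaussian maximum likelihood on the fake data
dat=read.csv('fake data.csv')
negloglik=function(p) -sum(dnorm(dat$y, mean=p[1]+p[2]*dat$x, sd=exp(p[3]), log=TRUE))
mle=optim(c(mean(dat$y), 0, log(sd(dat$y))), negloglik, method='BFGS')$par
rbind(least.squares=coef(lm(y~x, data=dat)), gaussian.mle=mle[1:2])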



Comments?

Send me an email at

References

Denison, D. G. T., C. C. Holmes, B. K. Mallick, and A. F. M. Smith. 2002. Bayesian Methods for Nonlinear Classification and Regression. New Jersey: Wiley.