What is MLE?

The likelihood function is the basis for Maximum Likelihood Estimation (MLE). The basic idea behind MLE is that we want to pick the set of parameters that makes the observed data most likely. For example, we wake up in the morning and see that the lawn of our house is wet but our neighbor's lawn is not. This is our data D. We might have two hypotheses:

  1. \(H_1\): it rained at night
  2. \(H_2\): we irrigated our lawn and our neighbor did not

Which hypothesis would you pick? I believe that \(p(D|H_2)\) is higher than \(p(D|H_1)\) and therefore MLE mandates that I choose hypothesis 2. While this seems to be a reasonable strategy, notice that if we had prior information that \(H_2\) is impossible (e.g., we don’t have an irrigation system), then choosing \(H_2\) might not seem that reasonable anymore. This again illustrates the difference between making decisions based on:

  1. \(p(D|H_2)\): this is the MLE approach; or
  2. \(p(H_2|D)\): this is the Bayesian approach, which takes into account the prior information that \(p(H_2)=0\).

Oftentimes, hypotheses can be described by parameter values. For instance, H can be different \(\pi\) values in a Bernoulli distribution. Then, we choose the value of \(\pi\) that maximizes the probability of observing our data. For instance, say we are interested in the proportion of graduate students who believe that Edward Snowden (the guy who leaked top secret NSA documents regarding US surveillance of phone and internet communications) did the right thing. Say we observe that 90 people out of 100 raised their hands. Then, we can try different \(\pi\) values (each one is a different hypothesis) to see which one maximizes the probability of this particular outcome. If we assume that all individuals are independent and that everybody has the same probability \(\pi\), then we can use the binomial distribution as our likelihood, and we get the figure below by trying different \(\pi\) values.

#grid of candidate pi values
pi1=seq(from=0.01,to=0.99,length.out=1000)

#binomial likelihood of observing 90 successes out of 100 for each candidate pi
res=dbinom(90,size=100,prob=pi1)

#plot the likelihood as a function of pi
plot(pi1,res,type='l',ylab='Likelihood',xlab=expression(pi))
abline(v=0.9,col='red')

In this figure, it is easy to see the value of \(\pi\) that maximizes the likelihood. Another option to find \(\hat \pi_{MLE}\) is through calculus, by calculating \(\frac{dL}{d\pi}\) and setting the resulting expression to zero. Fortunately, others have already done this and shown that \(\hat \pi_{MLE}=\frac{x}{n}\).
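
As a quick numerical check (a sketch reusing the pi1 and res objects created above), we can confirm that the grid value with the highest likelihood is essentially \(x/n=90/100=0.9\):

#value of pi on the grid with the highest likelihood
pi1[which.max(res)]

#alternatively, maximize the likelihood directly with optimize()
optimize(function(p) dbinom(90,size=100,prob=p),interval=c(0,1),maximum=TRUE)$maximum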

Notice that our assumptions should not be taken lightly. For instance, perhaps students in certain departments tend to have similar points of view (e.g., more conservative) and thus should not be viewed as independent. Furthermore, there might be all sorts of reasons why people do not have the same probability \(\pi\) (e.g., age, gender, political beliefs, etc.).

Example with 2 parameters

The likelihood function is not restricted to a single parameter. Say we collected data on tree diameters in the Peruvian Amazon forest. Then, if each diameter measurement \(d_i\) (\(i=1,...,n\)) is independent of the others, we can write:

\[L(H)\propto p(D|H)=p(d_1,...,d_n|H)=\prod_i p(d_i|H)\]

where D is the collection of all diameters and H is our hypothesis. Say we specify this a little further by assuming that:

\[p(D|H)=\prod_i N(d_i|\mu,\sigma ^2)\]

Now each combination of values for the parameters \(\mu,\sigma ^2\) can be regarded as a different hypothesis, and we want to find the combination of these two parameters that maximizes this function. Here is what the log-likelihood function looks like:

#the tree diameter data
d=c(10,8,13,18,7,8,5)

#calculate the log-likelihood for different combinations of parameter values
mu.seq=seq(from=6,to=15,length.out=100)
var.seq=seq(from=1,to=50,length.out=1000)

res=matrix(NA,100,1000)
for (i in 1:100){
  for (j in 1:1000){
    res[i,j]=sum(dnorm(d,mean=mu.seq[i],sd=sqrt(var.seq[j]),log=T))
  }
}

#plot the results
image(x=mu.seq,y=var.seq,z=res,zlim=c(-21,max(res)),xlab=expression(mu),ylab=expression(sigma^2))
contour(x=mu.seq,y=var.seq,z=res,add=T,levels=seq(from=-21,to=max(res),length.out=10))
abline(h=var(d),lty=3) #sample variance
abline(v=mean(d),lty=3) #sample mean

Although not noticeable in the above figure, it turns out that the MLE for \(\mu\) is the sample mean, but the MLE for \(\sigma^2\) is not the sample variance (it is actually equal to \(\sum_i (d_i - \bar d)^2/n\), which divides by \(n\) rather than \(n-1\)).
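
As a quick check (a sketch reusing the d, mu.seq, var.seq, and res objects defined above), we can compare the grid cell with the highest log-likelihood against these closed-form expressions:

#grid cell with the highest log-likelihood
best=which(res==max(res),arr.ind=TRUE)
mu.seq[best[1]]; var.seq[best[2]]

#closed-form MLEs: sample mean and sum of squared deviations divided by n
mean(d); sum((d-mean(d))^2)/length(d)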

Notice that in this code, instead of explicitly writing

\[N(d_i|\mu , \sigma ^2 )=\frac{1}{\sqrt{2\pi \sigma ^2}} \exp\left(-\frac{1}{2 \sigma^2} (d_i-\mu)^2\right)\]

we rely on the function dnorm() to perform these calculations.
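
As a sanity check (a quick sketch with arbitrary values), we can verify that dnorm() matches the formula above:

#normal density at d=10 with mu=9 and sigma^2=4, written out explicitly
1/sqrt(2*pi*4)*exp(-1/(2*4)*(10-9)^2)

#same calculation using dnorm(), which expects the standard deviation
dnorm(10,mean=9,sd=2)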

Why do we calculate the “sum of the individual log-probabilities” instead of the “product of the individual probabilities”?

We can write these expressions as:

  1. the sum of the individual log-probabilities (i.e., log-likelihood) \[\sum_{i=1}^{n} log[N(d_i|\mu , \sigma ^2)]\]

  2. the product of the individual probabilities (i.e., likelihood) \[\prod_{i=1}^n N(d_i|\mu, \sigma ^2 )\]

The expression in (1) is simply the log of the expression in (2). To see this, recall that

\[log(a\times b \times c) = log(a)+log(b)+log(c)\]

Therefore:

\[log[\prod_{i=1}^n N(d_i|\mu, \sigma ^2 )]=\sum_{i=1}^{n} log[N(d_i|\mu , \sigma ^2)]\]

We prefer to work with the log-likelihood because of numerical problems (underflow and rounding errors) that arise when multiplying many potentially very small numbers. As you can see in the equations above, in the log world products become sums, and rounding errors are much less of an issue.
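
Here is a small illustration of the problem (a sketch with simulated standard normal data): the product of many small densities underflows to zero, whereas the sum of the log densities remains a perfectly usable number.

set.seed(123)
z=rnorm(10000)

#product of many small probabilities underflows to zero
prod(dnorm(z))

#sum of the log-probabilities is a finite, usable number
sum(dnorm(z,log=TRUE))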

Furthermore, because the logarithm is a monotonically increasing function, the pair \(\mu, \sigma ^2\) that maximizes the likelihood is the same pair that maximizes the log-likelihood. In other words, we work with a slightly different problem that nevertheless gives us the same answer.
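
A quick numerical check of this point, using the Snowden example from before (a sketch; pi.grid below is a fresh grid of candidate values):

#grid of candidate pi values
pi.grid=seq(from=0.01,to=0.99,length.out=1000)

#the maximizer of the likelihood and of the log-likelihood coincide
pi.grid[which.max(dbinom(90,size=100,prob=pi.grid))]
pi.grid[which.max(dbinom(90,size=100,prob=pi.grid,log=TRUE))]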

Obs. 1: If we were to find \(\hat{\mu}_{MLE},\hat{\sigma}^2_{MLE}\) through calculus, then we would have to calculate \(dL/d\mu\) and \(dL/d\sigma ^2\) (or \(d\log[L]/d\mu\) and \(d\log[L]/d\sigma ^2\)) and set these derivatives to zero.
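
For completeness, here is a sketch of that calculation (a standard result). Writing \(\ell(\mu,\sigma^2)\) for the log-likelihood and setting its partial derivatives to zero gives:

\[\frac{\partial \ell}{\partial \mu}=\frac{1}{\sigma^2}\sum_{i=1}^n (d_i-\mu)=0 \Rightarrow \hat{\mu}_{MLE}=\bar d\]

\[\frac{\partial \ell}{\partial \sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^n (d_i-\mu)^2=0 \Rightarrow \hat{\sigma}^2_{MLE}=\frac{1}{n}\sum_{i=1}^n (d_i-\bar d)^2\]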

Obs. 2: note that the normal distribution used in the likelihood is an assumption that may or may not be appropriate. Given that diameters cannot be negative, and that diameter distributions in natural forests often have a reverse J shape, other distributions (e.g., the gamma distribution) might have been better suited for modeling these data.

More than 2 parameters

Most of the standard statistical models that we learn in our introductory statistics courses have an underlying likelihood. For instance, what is the likelihood behind a regular regression model?

\[L(\beta_0,...,\beta_p,\sigma^2)\propto p(y_1,...,y_n|\beta_0,...,\beta_p,\sigma^2,x_{11},...,x_{np})=\prod_{i=1}^n N(y_i|\beta_0 +\beta_1 x_{i1} +...+\beta_p x_{ip},\sigma^2)\]

where \(y_1,...,y_n\) are the data, \(\beta_0,...,\beta_p,\sigma^2\) are the parameters to be estimated, and \(x_{11},...,x_{np}\) are covariates.

Trying to visualize this function is complicated because we have more than 2 parameters. However, the overall idea remains the same. The only difference is that our strategy of using a grid of values to identify the MLE does not scale well as the number of parameters increases (e.g., a grid with 100 values per parameter requires \(100^p\) likelihood evaluations for \(p\) parameters). For this reason, people resort to optimization algorithms to identify the set of parameters that maximizes the likelihood.

Here is how we would use an optimization algorithm to identify the MLE for the parameters based on our 2-parameter example above:

#the data
d=c(10,8,13,18,7,8,5)

#create a function that returns the negative log-likelihood
#(optim() minimizes by default, so minimizing -log-likelihood maximizes the likelihood)
neglogl=function(param){
  mu1=param[1]; sd1=param[2];
  -sum(dnorm(d,mean=mu1,sd=sd1,log=T))
}

#find MLE
results=optim(rep(1,2),neglogl)
results$par
## [1] 9.856836 4.049160
#how does MLE compare with sample mean and standard deviation?
mean(d); sd(d)
## [1] 9.857143
## [1] 4.375255
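
The two estimates of \(\mu\) agree, but the second value from optim() differs slightly from sd(d) because the ML estimate corresponds to dividing the sum of squared deviations by \(n\) rather than \(n-1\). A quick sketch of the correction:

#rescale the sample standard deviation to the ML version (divide by n instead of n-1)
sd(d)*sqrt((length(d)-1)/length(d))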

An example with more than 2 parameters

How would we go about fitting a Gaussian linear regression model with only a single covariate using MLE?

The likelihood for this model is given by:

\[L(\beta_0,\beta_1,\sigma^2)\propto p(y_1,...,y_n|x_1,...,x_n,\beta_0,\beta_1,\sigma^2) = \prod_{i=1}^n N(y_i|\beta_0 +\beta_1 x_{i} ,\sigma^2)\]

where \(y_1,...,y_n\) are the response variables, \(\beta_0,\beta_1,\sigma^2\) are the parameters to be estimated, and \(x_{1},...,x_{n}\) are the covariates.

Let’s first simulate some data.

set.seed(1)

#parameters
b0=1
b1=1
sig2=10

#covariates
n=100
x=runif(n,min=-10,max=10)

#response variable
y=rnorm(n,mean=b0+b1*x,sd=sqrt(sig2))

dat=data.frame(x=x,y=y)
plot(y~x,data=dat)

Now we will create the function to be minimized: the negative log-likelihood.

#create a function that returns the negative log-likelihood
#note: the variance is parameterized on the log scale (sig2=exp(param[3])) so that
#the optimizer can search over an unconstrained value while sig2 stays positive
neglogl=function(param){
  b0=param[1]; b1=param[2]; sig2=exp(param[3]);
  -sum(dnorm(dat$y,mean=b0+b1*dat$x,sd=sqrt(sig2),log=T))
}

Finally, here is what we find when we give this function to our optimization algorithm.

#find MLE
results=optim(c(0,0,1),neglogl)
results$par
## [1] 0.9280354 1.0493853 2.1610502
#back-transform the third parameter to recover the variance estimate
exp(results$par[3])
## [1] 8.680249
#compare results to those from lm
lm(y~x,data=dat)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##      0.9268       1.0494

Comparing these results to the original parameter values, we can see that we got pretty close to the values used to generate the data. Similarly, the estimated parameters are very close to what we get using the lm() function.
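
As a side note, we can also compare the variance estimates (a quick sketch reusing the objects above; recall that the ML estimate divides the residual sum of squares by \(n\), whereas lm() uses the residual degrees of freedom, \(n-2\) for this model):

#ML estimate of the residual variance (divides by n)
exp(results$par[3])

#residual variance from lm (divides by n-2)
fit=lm(y~x,data=dat)
sum(residuals(fit)^2)/(nrow(dat)-2)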


