Definition

When specifying a statistical model, regardless of whether it will be fitted within a Bayesian or frequentist framework, typically the first thing that needs to be defined is the likelihood function. The likelihood function is a function of the parameter vector \(\theta\) and is equal to the probability of the data \(y_1,...,y_n\) given \(\theta\):

\[L(\theta) = p(y_1,...,y_n|\theta)\]

Notice that this is a joint distribution. However, as a simplifying assumption, people often assume that observations are conditionally independent given the model parameters, which implies that we can write the likelihood as:

\[L(\theta) = \prod_{i=1}^{n} p(y_i|\theta)\]
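Under this independence assumption, the likelihood is straightforward to compute in R. A minimal sketch, assuming a Normal likelihood (the data and parameter values below are just illustrative):

```r
# Illustrative data and parameter values, assuming a Normal likelihood
y <- c(1.5, 2.5, 2.8, 3, 6)
mu <- 2
sigma2 <- 2

# Likelihood as a product of individual densities
L <- prod(dnorm(y, mean = mu, sd = sqrt(sigma2)))

# Equivalently (and more numerically stable), sum the log-densities
logL <- sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
exp(logL)  # same value as L
```

In practice the log-likelihood is almost always preferred, since multiplying many small densities quickly underflows to zero.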

Visualizing the likelihood

As shown above, the likelihood for a dataset is calculated by multiplying several quantities. For example, say that we have the following observations: \(y_1 = 1.5,y_2=2.5,y_3=2.8,y_4=3,y_5=6\). Assuming a normal distribution with mean 2 and variance 2, the likelihood of each observation is the height of the corresponding vertical bar, and the likelihood for this dataset is the product of these heights.

#draw normal distribution
seq1=seq(from=-1,to=7,length.out=100)
llk=dnorm(seq1,mean=2,sd=sqrt(2))
plot(seq1,llk,type='l',xlab='y',ylab='Density')

#plot likelihood for each observation
y=c(1.5,2.5,2.8,3,6)
llk1=dnorm(y,mean=2,sd=sqrt(2))
for (i in 1:length(y)){
  points(y[i],0,pch=19)
  points(y[i],llk1[i],pch=19)
  lines(rep(y[i],2),c(0,llk1[i]),col='red')
}

#likelihood
prod(llk1)
## [1] 1.91624e-05

We can find the maximum likelihood estimate of the mean of this normal distribution by shifting the distribution left and right and determining which mean value maximizes the product of these heights. For example, if the mean were equal to 3.16, we would obtain the following picture, and the likelihood for the entire dataset would be larger.

#draw normal distribution
seq1=seq(from=-1,to=7,length.out=100)
llk=dnorm(seq1,mean=3.16,sd=sqrt(2))
plot(seq1,llk,type='l',xlab='y',ylab='Density')

#plot likelihood for each observation
y=c(1.5,2.5,2.8,3,6)
llk1=dnorm(y,mean=3.16,sd=sqrt(2))
for (i in 1:length(y)){
  points(y[i],0,pch=19)
  points(y[i],llk1[i],pch=19)
  lines(rep(y[i],2),c(0,llk1[i]),col='red')
}

prod(llk1)
## [1] 0.0001030228
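Rather than shifting the mean by hand, we can let R do the searching for us. A sketch using optimize() on the negative log-likelihood, with the variance held fixed at 2 as above:

```r
# Data from the example above
y <- c(1.5, 2.5, 2.8, 3, 6)

# Negative log-likelihood as a function of the mean (variance fixed at 2)
negllk <- function(mu) -sum(dnorm(y, mean = mu, sd = sqrt(2), log = TRUE))

# Minimize over the plotting range
fit <- optimize(negllk, interval = c(-1, 7))
fit$minimum  # close to mean(y) = 3.16, the analytic MLE for a Normal mean
```

This explains the value 3.16 used above: for a Normal likelihood, the sample mean is the maximum likelihood estimate of \(\mu\).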

Choosing a probability distribution for the likelihood

People typically use probability distributions to describe each \(p(y_i|\theta)\). The choice of which probability distribution to use typically depends on the characteristics of the data. For example:

  • if \(y_i\) is any real number, then the Normal distribution is commonly used \[L(\theta) \propto \prod_{i=1}^{n} N(y_i|\theta)\] where \(\theta=[\mu,\sigma^2]\). Notice that I use the \(\propto\) sign because we are dealing with PDFs (instead of PMFs).

  • if \(y_i\) is a positive real number, then the Gamma distribution can be used \[L(\theta) \propto \prod_{i=1}^{n} Gamma(y_i|\theta)\] where \(\theta=[a_1,a_2]\).

  • if \(y_i\) is a real number between 0 and 1, then the beta distribution is commonly used \[L(\theta) \propto \prod_{i=1}^{n} Beta(y_i|\theta)\] where \(\theta=[a_1,a_2]\).

  • if \(y_i\) is a non-negative integer with a known upper bound \(n\), then the binomial distribution is commonly used \[L(\theta) = \prod_{i=1}^{n} Binomial(y_i|\theta)\] where \(\theta=[n,\pi]\) and \(n\) is typically fixed rather than estimated.

  • if \(y_i\) is a non-negative integer without an upper bound, then the Poisson distribution is commonly used \[L(\theta) = \prod_{i=1}^{n} Poisson(y_i|\theta)\]

  • if \(y_i\) is a binary variable, then the Bernoulli distribution is commonly used \[L(\theta) = \prod_{i=1}^{n} Bernoulli(y_i|\theta)\]

These are just some examples. Clearly, to choose the most appropriate probability distribution for your data, it is important to have a good grasp of the most common Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) and their characteristics.
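In R, each of the distributions above comes with a built-in density (or mass) function whose name starts with d. The argument values below are arbitrary, purely for illustration:

```r
dnorm(0.5, mean = 0, sd = 1)         # Normal: any real number
dgamma(0.5, shape = 2, rate = 1)     # Gamma: positive reals
dbeta(0.5, shape1 = 2, shape2 = 2)   # Beta: numbers in (0, 1)
dbinom(2, size = 5, prob = 0.4)      # Binomial: integers 0, ..., n
dpois(2, lambda = 1.5)               # Poisson: non-negative integers
dbinom(1, size = 1, prob = 0.3)      # Bernoulli: binomial with size = 1
```

These d-functions are what we will use below to evaluate each \(p(y_i|\theta)\) without writing out the mathematical expressions by hand.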

Computing and drawing the likelihood

Because the likelihood is a function of the parameter (or parameters), all of our figures show how this function varies for different values of the parameter (or parameters).

  • Poisson example:

Here is a depiction of what the likelihood function looks like for a dataset comprised of three observations (1, 2 and 3) assuming a Poisson distribution:

dat=c(1,2,3)
lambda=seq(from=0.01,to=10,length.out=1000)
llk=rep(NA,length(lambda))
for (i in 1:length(lambda)) llk[i]=prod(dpois(dat,lambda[i]))

plot(lambda,llk,type='l',ylab='Likelihood',xlab=expression(lambda))

In the code above, instead of explicitly calculating the probabilities using the mathematical expression of the Poisson distribution, given by \(\frac{\lambda^{y_i} e^{-\lambda}}{y_i!}\), we let the function dpois() calculate them automatically.
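We can verify that dpois() matches this formula; the value of \(\lambda\) below is arbitrary, chosen just for the check:

```r
dat <- c(1, 2, 3)
lambda <- 2.5

# Poisson PMF written out by hand
by_hand <- lambda^dat * exp(-lambda) / factorial(dat)

# Matches R's built-in function
all.equal(by_hand, dpois(dat, lambda))  # TRUE
```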

  • Binomial example:

Here is a depiction of what the likelihood function looks like for a dataset comprised of three observations (1, 2 and 3) assuming a Binomial distribution with n=5:

dat=c(1,2,3)
pi1=seq(from=0.01,to=0.99,length.out=1000)
llk=rep(NA,length(pi1))
for (i in 1:length(pi1)) llk[i]=prod(dbinom(dat,size=5,pi1[i]))

plot(pi1,llk,type='l',ylab='Likelihood',xlab=expression(pi))

Again, we do not explicitly write the mathematical expression of the Binomial distribution \(\frac{n!}{y_i!(n-y_i)!}\pi^{y_i} (1-\pi)^{n-y_i}\). Instead, we simply use the function dbinom() to evaluate it for different values of \(\pi\).
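As with the Poisson case, a quick check confirms that dbinom() agrees with the formula; the value of \(\pi\) below is arbitrary:

```r
dat <- c(1, 2, 3)
pi1 <- 0.4

# Binomial PMF written out by hand (n = 5)
by_hand <- choose(5, dat) * pi1^dat * (1 - pi1)^(5 - dat)

all.equal(by_hand, dbinom(dat, size = 5, prob = pi1))  # TRUE
```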

  • Bernoulli example:

Here is a depiction of what the likelihood function looks like for a dataset comprised of three observations (1, 0 and 0) assuming a Bernoulli distribution:

dat=c(1,0,0)
pi1=seq(from=0.01,to=0.99,length.out=1000)
llk=rep(NA,length(pi1))
for (i in 1:length(pi1)) llk[i]=prod(dbinom(dat,size=1,pi1[i]))

plot(pi1,llk,type='l',ylab='Likelihood',xlab=expression(pi))
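For the Bernoulli likelihood, the maximizer has a simple closed form: the sample proportion. A quick check against the grid from the code above:

```r
dat <- c(1, 0, 0)
pi1 <- seq(from = 0.01, to = 0.99, length.out = 1000)
llk <- sapply(pi1, function(p) prod(dbinom(dat, size = 1, p)))

pi1[which.max(llk)]  # close to mean(dat) = 1/3, the sample proportion
```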

  • Normal example:

Here is a depiction of what the likelihood function looks like for a dataset comprised of three observations (-3, 0.5 and 1.3) assuming a Normal distribution:

dat=c(-3,0.5,1.3)
mu=seq(from=-5,to=5,length.out=100)
sig=seq(from=0.01,to=7.5,length.out=100)
combo=expand.grid(mu=mu,sig=sig)
combo$llk=NA
for (i in 1:nrow(combo)) combo$llk[i]=prod(dnorm(dat,mean=combo$mu[i],sd=combo$sig[i]))

library(ggplot2)
v=ggplot(combo,aes(x=mu,y=sig,z=llk))
v+geom_raster(aes(fill=llk)) + geom_contour(colour='white')

Notice that, because the Normal distribution has two parameters (i.e., \(\mu\) and \(\sigma\)), we had to create a surface instead of a line graph.
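The grid also lets us read off approximate maximum likelihood estimates; for a Normal likelihood these have closed forms, namely the sample mean and the square root of the (1/n) sample variance. A sketch reusing the grid above:

```r
dat <- c(-3, 0.5, 1.3)
mu <- seq(from = -5, to = 5, length.out = 100)
sig <- seq(from = 0.01, to = 7.5, length.out = 100)
combo <- expand.grid(mu = mu, sig = sig)
combo$llk <- apply(combo[, c('mu', 'sig')], 1,
                   function(p) prod(dnorm(dat, mean = p['mu'], sd = p['sig'])))

# The grid maximum should sit near the analytic MLEs
best <- combo[which.max(combo$llk), ]
best$mu   # near mean(dat) = -0.4
best$sig  # near sqrt(mean((dat - mean(dat))^2)), about 1.87
```

The agreement is only up to the grid spacing; a finer grid (or an optimizer such as optim()) would get closer to the analytic values.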


