Basic PMFs and PDFs
Although introductory statistics courses often skip probability mass functions (pmf’s) and probability density functions (pdf’s), these are the basic building blocks of statistics, regardless of whether one is a frequentist or a Bayesian. Statisticians use these functions to describe uncertainty and random events and phenomena. In this course, we will use these functions extensively to specify the likelihood, prior, and posterior distributions.
If we had a large number of random variables from a given distribution and we created a histogram, the shape of this histogram would be described by the mathematical equation of the corresponding PDF or PMF. This mathematical equation will have parameters that, despite not being directly observable, govern the shape of the distribution. Statistics is all about estimating these unobservable parameters.
What is the difference between a pmf and a pdf? Pmf’s are used for discrete outcomes/values whereas pdf’s are used for continuous outcomes/values. Here is the pmf of a Poisson distribution with mean equal to 5. The Poisson distribution has a single parameter \(\lambda\), which corresponds to both its mean and its variance, and is given by \(p(x)=\frac{\lambda^x exp(-\lambda)}{x!}\). In this formula, x is the random variable/outcome. Because it is a pmf, the outcomes it models are discrete:
#plot a histogram of random variables that follow a Poisson distribution with lambda=5
n=10000
x=rpois(n,lambda=5)
#summarize these data
tmp=table(x)
tmp=tmp/sum(tmp) #calculate relative frequency
plot(tmp,type='h')
#plot the theoretical Poisson distribution assuming lambda=5
k=0:20
prob=dpois(k,lambda=5)
for (i in 1:length(k)){
lines(rep(k[i],2)+0.1,c(0,prob[i]),col='red',lwd=2) #small offset of 0.1 so the red lines do not hide the black ones
}
#show the theoretical probability of each outcome (table below)
data.frame(outcome=k,probability=round(prob,3))
##    outcome probability
## 1        0       0.007
## 2        1       0.034
## 3        2       0.084
## 4        3       0.140
## 5        4       0.175
## 6        5       0.175
## 7        6       0.146
## 8        7       0.104
## 9        8       0.065
## 10       9       0.036
## 11      10       0.018
## 12      11       0.008
## 13      12       0.003
## 14      13       0.001
## 15      14       0.000
## 16      15       0.000
## 17      16       0.000
## 18      17       0.000
## 19      18       0.000
## 20      19       0.000
## 21      20       0.000
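As a quick check (not part of the original code above), we can verify that R’s built-in dpois() matches the formula \(\frac{\lambda^x exp(-\lambda)}{x!}\); for example, for \(x=3\) and \(\lambda=5\) both give the 0.140 shown in the table:
#verify that dpois() matches the Poisson pmf formula for x=3 and lambda=5
lam=5; x1=3
lam^x1*exp(-lam)/factorial(x1) #0.1403739
dpois(x1,lambda=lam)           #same value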
Notice that the sum of the probabilities of all possible outcomes has to be equal to 1. In the case of the Poisson distribution, we have that \(\sum_{x=0}^{\infty} p(x)=1\) (notice upper limit of infinity). Other pmf’s have a more restricted range of outcomes. For instance, outcomes under a Binomial distribution with parameters p and n can be any integer between 0 and n. Additional examples of pmf’s include Bernoulli (for binary outcomes), multinomial (for categorical outcomes), and negative-binomial (for overdispersed count data).
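We can also check the sum-to-1 property numerically (a small sketch; the Binomial values n=10 and p=0.3 are arbitrary):
#probabilities of a pmf must sum to 1
sum(dpois(0:200,lambda=5))          #effectively 1 (terms beyond x=200 are negligible)
sum(dbinom(0:10,size=10,prob=0.3))  #exactly 1 (outcomes range from 0 to n=10)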
Although few measurements are truly continuous (i.e., there is always a maximum precision for the measurement instrument), pdf’s are useful as an approximation for many types of measurements. In this case, the probability of any particular outcome (say \(x=0.5\)) is zero, and therefore all probability calculations are based on intervals (e.g., \(p(0.4<x<0.6)=0.1\)). For example, a beta distribution is often chosen as a prior for a probability parameter \(\pi\) because it allows \(\pi\) to be any real number between 0 and 1. This distribution is given by \(\frac{x^{(\alpha-1)} (1-x)^{(\beta-1)}}{B(\alpha,\beta)}\), where x is the random variable/outcome and \(\alpha\) and \(\beta\) are parameters. Here is an example of a beta distribution that may or may not be a reasonable description of our prior knowledge of \(\pi\):
#plot a histogram of random variables that follow a beta distribution with parameters a=3 and b=3
n=10000
x=rbeta(n,shape1=3,shape2=3)
hist(x,probability=T)
#plot the theoretical beta distribution assuming that the parameters are a=3 and b=3
seq1=seq(from=0,to=1,length.out=1000)
lines(seq1,dbeta(seq1,shape1=3,shape2=3),col='red')
Notice that I use a continuous curve, very different from the vertical lines I used for the Poisson distribution. This is to emphasize that all values between 0 and 1 are possible (i.e., \(\pi\) is continuous). Here, the equivalent condition that “the sum of the probabilities of all possible outcomes has to be equal to 1” is that the area under the pdf has to integrate to 1 (i.e., \(\int_0^1 p(\pi)d\pi=1\)). Additional examples of pdf’s include the normal, gamma, and Dirichlet distributions.
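To make the interval and integration ideas concrete, here is a small numerical sketch using R’s built-in pbeta() and integrate() (the interval from 0.4 to 0.6 is just an example):
#probability that pi falls between 0.4 and 0.6 under a Beta(3,3) distribution
pbeta(0.6,shape1=3,shape2=3)-pbeta(0.4,shape1=3,shape2=3) #approximately 0.365
#the density integrates to 1 over the interval from 0 to 1
integrate(dbeta,lower=0,upper=1,shape1=3,shape2=3)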
Wait… why do we have values on the y-axis that are greater than 1? Isn’t probability constrained to be between 0 and 1?
As hinted above, it is very important to know the characteristics of these pdf’s and pmf’s in order to choose the appropriate ones when specifying our likelihood and priors. Important questions to ask yourself when choosing a distribution for the likelihood:
- Do I want to model continuous or discrete data?
- Can these data be negative?
- Is there an upper bound to my data?
When choosing a distribution for the priors, there are a number of factors that come into play, including:
- Is the parameter being modeled continuous or discrete?
- Is the parameter bounded?
- Can this distribution adequately represent my prior beliefs?
- Does this distribution “play nicely” with the likelihood? This is important for computational purposes.
If you would like to brush up on these different distributions, I highly recommend Chapters 2-4 in (Wackerly, Mendenhall, and Scheaffer 2008) and Chapter 4 in (Bolker 2008).
Extending basic PDFs and PMFs
These basic PDF’s and PMF’s are typically modified for modeling purposes. For instance, while the normal distribution \(N(y|\mu,\sigma^2)\) is not particularly interesting by itself, we can modify it to capture a linear relationship between a covariate x and a response y through a simple linear regression. In this regression model, instead of assuming a fixed mean parameter \(\mu\), we assume that the mean is a linear function of the covariate x: \(N(y|\beta_0+\beta_1x,\sigma^2)\). We could further extend this model to allow the variance to change with x as well: \(N(y|\beta_0+\beta_1x,exp(\alpha_0 + \alpha_1 x))\). Notice that I used exp() for the variance to ensure that it is always positive, regardless of what the values for \(\alpha_0\) and \(\alpha_1\) turn out to be.
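As a minimal simulation sketch of this last model (the coefficient values below are made up purely for illustration):
#simulate data in which both the mean and the variance of y depend on x
set.seed(1)
n=1000
x=runif(n,min=0,max=10)
b0=1; b1=0.5   #hypothetical mean coefficients
a0=-2; a1=0.3  #hypothetical variance coefficients
y=rnorm(n,mean=b0+b1*x,sd=sqrt(exp(a0+a1*x))) #rnorm() takes the standard deviation, not the variance
plot(x,y) #notice how the spread of y increases with x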
We do not have to restrict ourselves to the normal distribution. For instance, say that our data consist of proportions and that we do not have access to the original counts that were used to calculate these proportions. One potential distribution for these data is the beta distribution, given that proportions are constrained to be real numbers between 0 and 1 (notice that the beta distribution will not work if a proportion turns out to be exactly equal to 0 or 1). We can create a beta regression by assuming that \(y \sim Beta(\frac{b\mu}{1-\mu},b)\). We adopt this formulation because we know that if \(y \sim Beta(a,b)\), then:
\[E[y]=\frac{a}{a+b}=\frac{\frac{b\mu}{1-\mu}}{\frac{b\mu}{1-\mu}+b}=\frac{b\mu}{b\mu+b(1-\mu)}=\mu\]
Furthermore, we know that \(\mu\) has to be between 0 and 1. Therefore, to determine how the mean \(\mu\) is associated with covariate x, we can use the logistic function \(\mu=\frac{exp(\beta_0 + \beta_1 x)}{1+exp(\beta_0 + \beta_1 x)}\) to specify our beta regression model.
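Here is a small simulation sketch of this beta regression (the values for \(\beta_0\), \(\beta_1\), and b are made up for illustration); the red line shows how the mean \(\mu\) changes with x:
#simulate data from a beta regression with a logistic function for the mean
set.seed(1)
n=1000
x=runif(n,min=-2,max=2)
b0=0.5; b1=1; b=2                #hypothetical parameter values
mu=exp(b0+b1*x)/(1+exp(b0+b1*x)) #logistic function keeps mu between 0 and 1
a=b*mu/(1-mu)                    #first shape parameter, chosen so that E[y]=mu
y=rbeta(n,shape1=a,shape2=b)
plot(x,y)
lines(sort(x),mu[order(x)],col='red',lwd=2) #mean of y as a function of x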
References
Bolker, B. M. 2008. Ecological Models and Data in R. Princeton University Press.
Wackerly, D., W. Mendenhall, and R. L. Scheaffer. 2008. Mathematical Statistics with Applications. Cengage Learning.