How do your data arise?

What I mean by this is how do we conceptualize the stochastic processes that generate what we were able to observe? Once you formally describe these stochastic processes using distributions and parameters (i.e., the data generating mechanism), you have essentially created a generative model.

Thinking about the generative model is an important task that we tend to forget when we rely solely on canned software. However, this question pervades all statistics, both Bayesian and Frequentist. Generative models are particularly important in Bayesian statistics for three reasons:

  1. The statistical model we will build to estimate parameters should reflect the way we think data were generated (i.e., the generative model)

  2. We test our algorithms by creating fake data and seeing if we can uncover the parameters that we used

  3. We can use the generative model to create a predictive distribution, which will be critical for us to assess model fit (more on this later)

Note: I use the term “generative model” in a very general way just to get people thinking about how their data came into existence. However, this term is also used in a much more specific way, particularly as a counterpoint to “discriminative models” (e.g., see “Generative model” on Wikipedia).

Malaria prevalence example

We are interested in estimating malaria prevalence using health facility data. The data we will need will be generated in the following classroom activity. Roll a die: you are infected with malaria if you get a 1 or 2. If you are infected, roll the die again; you will have fever if you get any number between 1 and 5. If you are not infected, you might still have fever (perhaps because you have another disease) if you get a 1 when rolling your die. Finally, everybody that has fever seeks help at the health facility (the others do not, even if they are infected, since they think they are fine) and gets tested. If you are infected and symptomatic, the chance that you will be detected to have malaria is 3/6 (you are detected if you roll a die and get a number between 1 and 3). If you are not infected but still symptomatic, the chance that you will be detected to have malaria is 0. Here is a diagram summarizing this process:

Fig. 1. Diagram illustrating the process of generating malaria data from health facilities. I, S, and D stand for infected, symptomatic, and detected, respectively. In parenthesis we have the dice outcomes.

What is the generative model for this? A formal generative model would be:

\[I \sim Bernoulli(1/3)\] \[S|I=1 \sim Bernoulli(5/6)\] \[D|S=1,I=1 \sim Bernoulli(1/2)\] \[S|I=0 \sim Bernoulli(1/6)\] \[D|S=1,I=0 \sim Bernoulli(0)\]

In this activity, we end up with three pieces of information:

  1. overall number of people in the classroom
  2. number of people that went to health facilities and were tested
  3. number of people that went to health facilities, were tested, and got a positive test result

Unfortunately, it is not clear how to use these data to determine malaria prevalence (the proportion of the overall population that is infected with malaria) in the absence of a clear understanding of this process. For example, a straightforward calculation in which we divide the number of positive results by the total number of students in the classroom will lead to substantial underestimation of infection prevalence, given that:

  1. there are many individuals that, despite being infected, do not show up at the health facility because they do not feel sick; and
  2. many of those that do show up and are infected might be missed by an imperfect diagnostic method.

We can simulate this process in R:

#set the probabilities that govern this problem
prev <- 1/3       #p(I=1): infection prevalence
ps1i1 <- 5/6      #p(S=1|I=1)
pd1s1i1 <- 3/6    #p(D=1|S=1,I=1)
ps1i0 <- 1/6      #p(S=1|I=0)

#generate these data based on the probabilities above
n <- 10000
i <- rbinom(n, size=1, prob=prev)                    #infection status
s1i1 <- rbinom(sum(i==1), size=1, prob=ps1i1)        #fever among the infected
d1s1i1 <- rbinom(sum(s1i1==1), size=1, prob=pd1s1i1) #detection among febrile infected
s1i0 <- rbinom(sum(i==0), size=1, prob=ps1i0)        #fever among the uninfected

#here is our naive estimate
sum(d1s1i1)/n #p(D=1)
## [1] 0.1372
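As a check, this naive estimate converges to the marginal detection probability implied by the generative model, not to the prevalence:

```r
#the naive estimate converges to p(D=1) = p(I=1) p(S=1|I=1) p(D=1|S=1,I=1)
(1/3) * (5/6) * (1/2) #= 5/36, far below the true prevalence of 1/3
## [1] 0.1388889
```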

Once we have thought about the process through which our data were generated (the generative model), we then need to devise a statistical model to estimate the parameters that we think govern this process. This is called inverse modeling because, as analysts, we only have the end result of a stochastic process (e.g., in the example above, we just have the number of people detected to be positive, the number that went to the health facility, and the overall population size) and we want to recover the parameters that govern this process. In this case, although we are mainly interested in the infection prevalence \(p(I=1)\), we will have to estimate \(p(S=1|I=1)\), \(p(D=1|S=1,I=1)\), and \(p(S=1|I=0)\) to be able to say something about it.
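To get a feel for what inverse modeling buys us, here is a hypothetical sketch: if the two conditional probabilities on the detection path happened to be known exactly, we could recover prevalence simply by inverting \(p(D=1)=p(I=1)\,p(S=1|I=1)\,p(D=1|S=1,I=1)\) (in practice these probabilities are unknown and must be estimated jointly):

```r
#hypothetical moment-based inversion, assuming the conditionals are known
npos <- 1372; n <- 10000        #counts from the simulation above
ps1i1 <- 5/6; pd1s1i1 <- 1/2    #assumed known here (in practice, estimated)
prev_hat <- (npos/n) / (ps1i1 * pd1s1i1)
prev_hat #close to the true prevalence of 1/3
## [1] 0.32928
```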

Brazil nut example

Say we are talking about Brazil nut production, and that the probability that a tree is an adult is \(\pi=0.5\). Thus, you can flip a coin and, if it lands heads, you are an adult tree. If you are not an adult tree, you produce zero fruits. If you are an adult, then the number of Brazil nuts you produce is drawn from a \(Poisson(\lambda=2)\) distribution. Thus, if you are an adult tree, randomly pick a number from the table of numbers that I have circulated in class. Finally, write down the number of fruits produced.

Notice that, in both processes (adult status and Brazil nut production), we are talking about random events (thus we use probability distributions) because we do not have full knowledge of all the different factors that influence these phenomena.

How can we describe these processes in a formal generative model?

\[A_i \sim Bernoulli(\pi)\] \[F_i|A_i=1 \sim Poisson(\lambda)\]
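These two steps can be simulated directly in R (the seed and sample size below are arbitrary choices for illustration):

```r
#simulate zero-inflated Poisson Brazil nut production
set.seed(1)                               #arbitrary seed for reproducibility
n <- 1000
pi1 <- 0.5; lambda <- 2                   #true parameter values
adult <- rbinom(n, size=1, prob=pi1)      #adult status
fruits <- adult * rpois(n, lambda=lambda) #non-adults produce zero fruits
table(fruits)                             #note the excess of zeros
```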

This conceptual model (assuming \(\pi=0.5\), \(\lambda=2\)) gives rise to the following Brazil nut production data (collected over multiple trees):

In this figure you can see the results of our assumptions regarding how data were generated. In particular, we can see a spike at zero. Why do we have that? This spike arises from trees that either:

  1. were not adults; or
  2. were adults but did not produce any Brazil nut

This is called a zero-inflated Poisson distribution. Zero-inflation is a phenomenon that happens commonly in surveys of natural resources (Martin et al. 2005).
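Combining the two parts of the generative model above, the probability of observing \(k\) fruits on tree \(i\) is a mixture:

\[p(F_i=0) = (1-\pi) + \pi e^{-\lambda}\] \[p(F_i=k) = \pi \frac{e^{-\lambda}\lambda^k}{k!} \text{ for } k \geq 1\]

where the first term of \(p(F_i=0)\) comes from non-adult trees and the second from adult trees that happened to produce nothing.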

Of course, the tricky part as an analyst is that we do not know \(\pi\) (the probability of being an adult) nor \(\lambda\) (the mean Brazil nut production given that the tree is an adult). Furthermore, we do not know what part of the graph below comes from one process versus the other (i.e., we would only see the total height of the bars but not their colors). For this reason, we are going to use statistical models to estimate these parameters given the data depicted in our figure. Again, we are doing inverse modeling since we want to uncover the parameters that gave rise to our data. In other words, we typically just have the final outcome (i.e., the data) and we want to say something about (i.e., draw inference on) the processes that generated it. Although this example is very simple, it is easy to imagine how to make it more interesting, for instance by thinking about how environmental factors influence these parameters.

If we want to estimate \(\pi\) and \(\lambda\), the first step is to devise an algorithm to do so (perhaps using JAGS or your own Gibbs sampler) and test it on some fake data where we know the true \(\pi\) and \(\lambda\). The concept of testing the algorithm on fake data is important in general, although rarely done. Here we emphasize this because we will be developing our customized algorithms and we want to be sure that it is doing something sensible.

The generative models do not have to be this complicated. Most people are fine assuming much simpler generative models. For instance, when we use a linear Gaussian regression model, what is the generative model?

Here it is: \[y_i \sim N(\beta_0+\beta_1 x_{i1}+...+\beta_p x_{ip},\sigma^2 )\] In this case, we want to uncover the parameters \(\beta_0,\ldots,\beta_p,\sigma^2\) that generated our data. However, it is easy to imagine scenarios where this generative model would be too simple and we would like to introduce more biological realism into our modeling.
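As a quick sketch (with made-up parameter values), we can simulate from this regression generative model and check that a standard fit recovers the parameters we used:

```r
#simulate from a linear Gaussian generative model (hypothetical true values)
set.seed(1)
n <- 100
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(n)
y <- rnorm(n, mean=beta0 + beta1*x, sd=sigma)
coef(lm(y ~ x)) #estimates should be close to beta0=1 and beta1=2
```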




References

Martin, TG, B Wintle, J Rhodes, P Kuhnert, S Field, S Low-Choy, A Tyre, and H Possingham. 2005. “Zero Tolerance Ecology: Improving Ecological Inference by Modelling the Source of Zero Observations.” Ecology Letters 8. https://doi.org/10.1111/j.1461-0248.2005.00826.x.