Mark-recapture
In this example, we are interested in estimating the population size of a wildlife species using mark-recapture data. We collect these data by going into the field multiple times, capturing and marking individuals on each visit. As a result, in a given survey we might recapture individuals that had already been marked and/or capture completely new individuals. The key assumptions are:
We need to be able to mark individuals. These data historically arose from capturing animals, uniquely marking them at the first encounter, and then trying to recapture them in subsequent surveys. This explains why we use the term “capture-recapture”, but it is important to note that individuals need not be physically captured (Kery and Schaub 2012). For example, we might be able to identify individuals seen before from camera-trap photographs, based on their unique scars or fur patterns.
Marking individuals does not affect their fate (e.g., marking does not harm them or reduce their survival).
Marked individuals mix randomly with the population at large. Furthermore, marked individuals are neither easier (trap happy) nor harder (trap shy) to capture a second time compared to the other individuals in the population.
Marks do not come off (i.e., individuals cannot lose their marks).
Different surveys need to be close enough in time so that we can assume that the same set of individuals is available throughout these surveys (i.e., individuals have not moved in or out of the study region, individuals have not died, and new individuals have not been born).
It is important to note that mark-recapture methods are also used in other contexts. For example, in epidemiology, this methodology is used to estimate the number of people with particular conditions (e.g., people infected with HIV or addicted to illegal drugs).
Intuition for how it works
To keep things simple, let’s assume that we only conduct two surveys. During our first data collection period (i.e., period 1), we capture and mark \(n_{capture_1}=n_{mark_1}\) individuals following a standardized protocol. These individuals are subsequently released. At a second data collection period (i.e., period 2), using the same standardized protocol, we end up capturing \(n_{capture_2}\), out of which \(n_{mark_2}\) individuals had the marks from period 1.
The basic assumption of mark-recapture models is that the proportion of marked individuals recaptured in the second sample \(\frac{n_{mark_2}}{n_{capture_2}}\) is approximately equal to the proportion of marked individuals in the population as a whole \(\frac{n_{mark_1}}{N}\). This is useful because, if we want to estimate \(N\), we can just solve for \(N\) given that we know all the other numbers:
\[\hat{N}= \frac{n_{mark_1}\times {n_{capture_2}}}{n_{mark_2}}\]
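As a quick sanity check, here is this calculation in R. The counts below are made up purely for illustration (they are not from the text):

```r
#hypothetical counts (purely illustrative)
n.mark1=50     #individuals marked and released in period 1
n.capture2=40  #individuals captured in period 2
n.mark2=10     #of those, individuals already carrying a mark

#estimated population size
N.hat=n.mark1*n.capture2/n.mark2
N.hat
## [1] 200
```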
This is called the Lincoln-Petersen estimator. How do we convert this into a statistical model? One way is to assume the following distributions:
\[n_{mark_1} \sim Binomial(N,\delta)\] \[n_{mark_2} \sim Binomial(n_{capture_2},\delta)\] where \(\delta\) is the detection probability. The parameters that we are trying to estimate here are \(N\) and \(\delta\).
Notice that, if we just had \(n_{mark_1}\), we would be doomed because we would not be able to separately estimate \(N\) and \(\delta\). To see this, notice that different combinations of values of these parameters yield the same mean, given by \(E[n_{mark_1}]=N\times \delta\).
On the other hand, having \(n_{mark_2}\) and \(n_{capture_2}\) enables us to get an estimate of \(\delta\). This estimate arises because \(E[n_{mark_2} ]=n_{capture_2}\times \delta\) and therefore \(\hat {\delta} = \frac{n_{mark_2}}{n_{capture_2}}\). Furthermore, if we have an estimate \(\hat {\delta}\), then we can estimate N as \(\hat {N} = \frac{n_{mark_1}}{\hat {\delta}}\).
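We can check these moment-based estimators with a small simulation. The sketch below repeatedly generates data from the binomial model above and applies \(\hat N = n_{mark_1}/\hat\delta\) (all variable names here are mine, not from the text):

```r
set.seed(1)
N.true=200; delta.true=0.3
nsim=10000
N.hat=rep(NA,nsim)
for (i in 1:nsim){
  #simulate counts from the binomial model
  n.mark1=rbinom(1,size=N.true,prob=delta.true)
  n.capture2=rbinom(1,size=N.true,prob=delta.true)
  n.mark2=rbinom(1,size=n.capture2,prob=delta.true)
  #plug-in estimates
  delta.hat=n.mark2/n.capture2
  N.hat[i]=n.mark1/delta.hat
}
mean(N.hat) #close to the true N=200 (ratio estimators have a slight upward bias)
```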
This example illustrates how to get point estimates \(\hat \delta\) and \(\hat N\). However, we typically also want to estimate how much uncertainty there is in our population size estimate \(\hat N\). Uncertainty in \(\hat N\) arises from sampling uncertainty associated with the binomial distribution \(n_{mark_1} \sim Binomial(N,\delta)\) and uncertainty when estimating \(\delta\).
Fitting this model within a Bayesian framework will enable us to correctly estimate and propagate the uncertainty in these parameters.
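To preview what this looks like, here is a minimal Monte Carlo sketch of the Bayesian calculation for the two-survey case. It assumes a Beta(1,1) prior on \(\delta\) and a flat prior on \(N\) (under which \(N-n_{mark_1}\), given \(\delta\), follows a negative binomial distribution); the counts are made up for illustration:

```r
set.seed(1)
#hypothetical counts (purely illustrative)
n.mark1=50; n.capture2=40; n.mark2=10
nsim=10000

#posterior for delta: Beta(1,1) prior + binomial likelihood
delta.post=rbeta(nsim,1+n.mark2,1+n.capture2-n.mark2)

#posterior for N under a flat prior: N - n.mark1 | delta ~ NegBinomial(n.mark1+1, delta)
N.post=n.mark1+rnbinom(nsim,size=n.mark1+1,prob=delta.post)

#posterior median and 95% credible interval for N
quantile(N.post,c(0.025,0.5,0.975))
```

Notice how the resulting credible interval propagates both the uncertainty in \(\delta\) and the binomial sampling uncertainty in \(n_{mark_1}\).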
Generative model
We will make the model slightly more complicated by assuming that we have multiple surveys, not only two as in the example above. Furthermore, instead of just representing the aggregate number of individuals marked \(n_{mark_t}\) and captured \(n_{capture_t}\) in each survey \(t\), we will be representing the status of each individual.
More specifically, let \(C_{it}\) be a binary variable indicating if animal \(i\) (\(i=1,…,N\)) was captured at survey \(t\) (\(t=1,…,T\)). Each captured individual is marked with a unique identifier immediately after being captured and is subsequently released. The unique identifier allows us to determine if an individual that has just been captured is new or has been previously captured. Assuming a detection probability of \(\delta\), our model is given by:
\[C_{it} \sim Bernoulli(\delta)\]
Obviously, this is a very simple model in the sense that we assume that the capture probability \(\delta\) does not depend on individual characteristics (e.g., capture probability could depend on animal size or color) and does not change with time or with capture history (e.g., individuals do not become trap happy or trap shy). By capture history I mean the series of 0’s and 1’s that describes when a given individual was detected. For example, the capture history [0,0,0,1,0,1] indicates that this individual was first detected, captured, and marked in survey 4, and subsequently recaptured in survey 6.
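For instance, we can recover the surveys of detection directly from a capture history vector in R (the vector below is the example from the text):

```r
#capture history from the example above
ch=c(0,0,0,1,0,1)

which(ch==1)      #surveys in which the individual was detected
## [1] 4 6

min(which(ch==1)) #survey of first capture and marking
## [1] 4
```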
Simulating data
Here is how we generate some simulated data:
#generate some fake data
rm(list=ls(all=T))
set.seed(1)
#true parameter values
delta.true=delta=0.3
N=200 #true population size
#number of surveys
T1=4
#observation status of each individual
C1=matrix(0,N,T1)
for (i in 1:T1){
C1[,i]=rbinom(N,size=1,prob=delta)
}
#each column corresponds to a particular sampling occasion
colnames(C1)=paste0('so',1:T1)
#each row corresponds to a particular individual
rownames(C1)=paste0('id',1:N)
#here is what the data look like
head(C1)
## so1 so2 so3 so4
## id1 0 0 0 1
## id2 0 0 0 1
## id3 0 0 1 0
## id4 1 0 1 1
## id5 0 0 1 1
## id6 1 0 1 1
#In the real dataset, we never observe individuals with 0 captures
#Therefore, we need to remove these individuals from our dataset
cond=apply(C1,1,sum)!=0
data1=C1[cond,]
nrow(data1)
## [1] 153
setwd('U:\\uf\\courses\\bayesian course\\group activities\\5 example pop size')
#setwd('/Volumes/drvalle/uf/courses/bayesian course/group activities/5 example pop size')
write.csv(data1,'simulated data popsize.csv',row.names=F)

Notice that the dataset we will be working with has only 153 rows, even though the true population size is much larger (i.e., \(N=200\)). This is because some individuals are never observed and, as a result, are dropped from the dataset.
Many of the ideas in this document were influenced by an exceptionally well-written document describing the mark-recapture method, authored by Bruce Heyer (DeAnza College) and available on his website.
Comments?
Send me an email at