Intro

Clustering methods, such as the mixture model that we saw in class, are particularly useful for multivariate data because this type of data is hard to visualize and it can be hard to extract information from it using our standard tools (e.g., linear models). Examples of multivariate datasets that are sometimes analyzed using clustering algorithms include:

  • hyper-spectral data (i.e., a matrix containing reflectance, where the rows are the samples/locations and the columns are the spectral bands)

  • genetic data (i.e., a matrix containing presence/absence data, where rows are individual samples/people and columns are genes)

  • text data (i.e., a matrix containing the number of times each word [columns] was used in each document [rows])

  • social media data (i.e., a square matrix containing presence/absence of connections/relationships between users)

In this example, we will focus on a type of dataset that is common in ecology. This dataset consists of a matrix with abundance data, where each row is a different plot in the field and each column is a different species. Here is what a subset of these data looks like:

Plot  Spp 1  Spp 2  Spp 3  Spp 4
1     0      3      0      1
2     2      1      6      10
3     2      0      4      1

The actual datasets are much larger, with hundreds to thousands of plots/rows and tens to hundreds of species/columns. In this context, clustering is often used to group plots that have similar species composition, yielding so-called bioregions or regions of common profile. This way, instead of having to look at hundreds or thousands of rows, we can reduce the dimensionality of these data by identifying and examining just a handful of these bioregions. Then, we can look at how the proportion of each cluster changes through space and time, and as a function of environmental gradients (e.g., altitude, wetness, etc.).
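To make the idea of "species composition" concrete, here is a minimal sketch in Python (standard library only) that stores the three plots from the table above and converts each row of counts into proportions. The function name `relative_composition` is made up for this illustration, and comparing proportions is just one simple way to look at composition; it is not what the mixture model does internally.

```python
# Abundance matrix from the table above: keys are plots, values are counts per species.
abundance = {
    1: [0, 3, 0, 1],
    2: [2, 1, 6, 10],
    3: [2, 0, 4, 1],
}


def relative_composition(counts):
    """Convert raw counts into proportions that sum to 1 (all zeros for an empty plot)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Proportion of each species within each plot.
composition = {plot: relative_composition(c) for plot, c in abundance.items()}

for plot, props in composition.items():
    print(plot, [round(p, 2) for p in props])
```

Plots whose rows of proportions look alike are candidates for the same bioregion, even when their total abundances differ.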

In this particular example, we are interested in trying to understand how biodiversity is associated with altitude. We will be relying on a simplified dataset (only 100 sites and 3 species) to fit a 2-cluster mixture model. The data are in the file “mixture model data.csv”.

Here are the steps that we will follow to accomplish this task:

  1. Come up with a 2-cluster mixture model

Here are some guiding questions that can be helpful:

  • Which distribution should we choose to model these data?
  • How can we adapt the mixture model we saw last week to this activity?
  2. Now that we have defined our model, the next step will be to simulate some data.

  3. We will then code our model in JAGS and run it on the simulated data to make sure that we can indeed recover all the parameters. If this doesn’t work, we will have to go back to the drawing board and change our model.

  4. If everything seems to be working well, we can finally apply our model to the original data. Once the model has been fit, I would like you to graphically explore how these 2 clusters/bioregions are associated with altitude.
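As a rough illustration of the simulation step, here is a minimal sketch in Python (standard library only). It makes several assumptions that are not part of this handout: counts are taken to be Poisson, each site belongs to one of two clusters with probability `pi`, and the values of `pi` and `lambdas` are made up. The helper names `rpois` and `simulate` are also hypothetical. This is only one possible model; you may well settle on a different distribution, and the model itself will still need to be written in JAGS.

```python
import math
import random


def rpois(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate(n_sites=100, pi=0.4,
             lambdas=((1.0, 5.0, 0.5), (6.0, 0.5, 3.0)), seed=1):
    """Simulate cluster labels z and an abundance matrix y (sites x species).

    pi is the (assumed) probability that a site belongs to cluster 1;
    lambdas[k][j] is the (assumed) mean abundance of species j in cluster k.
    """
    rng = random.Random(seed)
    z, y = [], []
    for _ in range(n_sites):
        k = 1 if rng.random() < pi else 0  # latent cluster membership
        z.append(k)
        y.append([rpois(rng, lam) for lam in lambdas[k]])
    return z, y


z, y = simulate()
print(y[:3])        # first three simulated plots
print(sum(z) / len(z))  # simulated proportion of sites in cluster 1
```

Because the true values of `z`, `pi`, and `lambdas` are known here, the fitted model's estimates can be compared against them before moving on to the real data.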



Comments?

Send me an email at
