Wednesday, January 24, 2018

Bioregionalisation part 2: clustering

Already I think I should change the way I was going to do this. It seems more straightforward to keep the two approaches in separate posts. So for today: bioregionalisation using clustering methods.

A small example

As the term clustering suggests the approach is very simple. Let's start by considering a landscape of five cells A-E with five species occurring in them as follows:


Another way of expressing this information is as a matrix where the cells are rows and the species are columns, and presence of a species is indicated with "1" while absence is indicated with "0":


We now simply calculate a distance matrix. There are several possible dissimilarity metrics we can use for this. For this post I will use the S2 dissimilarity, which is defined as
S2 dissimilarity = 1 - ( number of shared species / ( number of shared species + minimum( species unique to first cell , species unique to second cell ) ) )
The resulting S2 dissimilarity matrix for our small dataset is consequently as follows:


Now we use a hierarchical clustering algorithm to produce a dendrogram. I have used R's hclust, and the result is:


We can now recognise clusters as bioregions, and we are done. The main remaining problem with hierarchical clustering is that there is no objective answer for the number of bioregions we should recognise. We could still accept anywhere between one and five, but at least we know that there should not be a region of e.g. only the cells C and E to the exclusion of D.

(This is of course the same problem as in phylogenetic systematics, where we would now know that CE to the exclusion of D is not an acceptable taxon, but it remains a subjective decision whether to recognise CD and E as separate genera or whether to have one genus CDE, for example.)

In our present, case it seems sensible to accept less than five regions but more than one, otherwise we would not have needed the analysis, so let's go with the two clusters AB and CDE:


These regions now show a fairly high level of endemism, as four of the five species are endemic to one region; only the blue species occurs across both.

Some R code

Although the proper software for this kind of work is Biodiverse, this post would get too long if I tried to do everything in one go. What is more, a simple analysis can just as well be run in R, which is what I have done in this case. First build a matrix of cells and the species in them, e.g.
occurs <- as.matrix(rbind(c(1,1,0,0,0), c(1,1,1,0,0), c(0,0,1,1,0), c(0,0,1,1,0), c(0,0,0,1,1)))
rownames(occurs) <- c("A", "B", "C", "D", "E")
colnames(occurs) <- c("red","brown","blue","orange","lilac")
The following loops will then produce a matrix of S2 dissimilarity scores.
mydm <- matrix(0, 5, 5)    # create empty matrix; could make it more flexible for future analyses by handing over square root of length(occurs) for the dimensions
rownames(mydm) <- c("A","B","C","D","E")    # same here, could use row names from occurs
colnames(mydm) <- c("A","B","C","D","E")     # and same here
for (i in 1:5)
{
  for (j in i:5)
  {
    if (i==j)
    {
      mydm[i,j] <- 0
    }
    else
    {
      shareds <- sum(occurs[i,] & occurs[j,])
      uniques_i <- sum(xor(occurs[i,], occurs[i,] & occurs[j,]))
      uniques_j <- sum(xor(occurs[j,], occurs[i,] & occurs[j,]))
      mydm[i,j] <- 1- (shareds / (shareds + min( uniques_i, uniques_j)))
      mydm[j,i] <- mydm[i,j]
    }
  }
}
Now finally do a cluster analysis and plot the resulting dendrogram:
mycl <- hclust(as.dist(mydm), method = "mcquitty")     # WPGMA
plot(mycl)
Done. For large numbers of cells we would want a decent visualisation, ideally as a map, and that is where Biodiverse works better. How to do the analysis in that software will be covered in the next post.

No comments:

Post a Comment