PhyloBotanist: Patrocladistics 1: How does it work? And a contrived example

Tuesday, February 9, 2016

As the approach is often mentioned in pro-paraphyly publications as an objective method of delimiting paraphyletic taxa, I thought I should look into patrocladistics again and examine it in a blog post or three. In the following I will approach patrocladistics from three different angles:

1. What is patrocladistics and how does it work? This is very straightforward.

2. How does the patrocladistic approach perform when ancestors are added?

It is often easy enough to explain how something works in the abstract, but it is perhaps more enlightening to throw different problems at a method and see under what conditions it is more or less useful or may be misled. For example, explaining how BEAST does its phylogenetic inferences does not necessarily by itself tell us how it will perform when faced with, say, 25% missing data. I often criticise the pro-paraphyly movement for what I see as their reliance on the fortuitous absence of intermediate fossils to separate out paraphyletic groups. Conversely, members of that movement have a tendency to criticise cladists for supposedly ignoring ancestors. So in the case of patrocladistics, I wanted to see what happens if the method is provided not only with extant taxa but also with ancestors.

3. What is the rationale behind patrocladistics?

In other words, if somebody who is agnostic about the whole phylogenetic versus 'evolutionary' systematics issue were to ask why they should do a patrocladistic analysis, or what the biological or philosophical justification for such an analysis is, what would the answer be?

This post will cover the first point.

What is patrocladistics?

Patrocladistics is an approach suggested by Stuessy & König in a paper published in the plant systematics journal TAXON in 2008. It is sometimes cited by proponents of the formal recognition of paraphyletic taxa as a way to delimit such taxa in an objective, formalised way.

This is done especially as a rebuttal of the cladist argument that it is impossible to objectively delimit paraphyletic groups given the gradual nature of evolution: justifying the recognition of a new subgroup is one thing, but how do you justify that some ancestor in the past was in one insect order but its immediate descendant in another if you could hardly have distinguished the two species? Or more importantly, why in this case and not in all the others where there were the same relatedness and the same degree of difference? Patrocladistics is presented as a way out of this dilemma.

How does patrocladistics work?

A patrocladistic analysis takes a phylogenetic tree, that is a tree of evolutionary relationships between species or other tree terminals, and (re-)clusters those terminals by their phylogenetic distance on the tree.

So first, take a phylogenetic tree with branch lengths proportional to character changes, i.e. a phylogram. From this construct a distance matrix of each terminal against each terminal, using the number of tree nodes separating each pair of terminals as the distance between them. These distances are called cladistic distances.

Now construct a second distance matrix of each terminal against each terminal, using the number of character changes along the tree branches (branch lengths) between any two terminals as the distance between them. These distances are in this context called patristic distances.

Construct the final distance matrix by adding up cladistic and patristic distances. So two sister species sitting on a branch of length one and a branch of length three would have a distance of five on this final matrix, four for the patristic distance and one for the node separating them. The summed distances are called patrocladistic distances.
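For a single pair of terminals, the arithmetic can be sketched in a few lines of R. This is only a minimal illustration, not code from the paper: the function name patrocladistic_pair is made up here, and it assumes that the path between the two terminals is supplied as a vector of branch lengths. It relies on the fact that a path running along k branches crosses k - 1 nodes.

```r
# Hypothetical helper (not from Stuessy & König): distances for one pair of
# terminals, given the lengths of the branches on the path between them.
patrocladistic_pair <- function(path_lengths) {
  patristic <- sum(path_lengths)         # character changes along the path
  cladistic <- length(path_lengths) - 1  # nodes crossed: one fewer than branches
  c(cladistic = cladistic,
    patristic = patristic,
    patrocladistic = cladistic + patristic)
}

# Two sister species on branches of length one and three
sisters <- patrocladistic_pair(c(1, 3))
print(sisters)
```

For the sister pair this gives a cladistic distance of one, a patristic distance of four and a patrocladistic distance of five, as in the text.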

The matrix of patrocladistic distances is used for a clustering analysis. The paper in which the approach was published is somewhat vague about what clustering method should be used. It mentions UPGMA and single-linkage, expressing a personal preference for the latter because "it more quickly connects groups and also more distinctly reveals dendrogram structure".

The concern with computation speed is somewhat strange given that any available clustering algorithm would have taken only a fraction of a second even for medium-size datasets on a year 2008 desktop computer. In addition, I did not understand what is meant by "more distinctly reveals dendrogram structure", so I consulted that repository of knowledge, Wikipedia, and found the following explanations (accessed 7 Feb 2016):

"It is based on grouping clusters in bottom-up fashion (agglomerative clustering)..."

This means the clusters will be rooted automatically. I can only assume that bottom-up methods like single-linkage and UPGMA were proposed quite consciously to address the problem of how to objectively root the resulting clusters. Strangely, however, the paper does not appear to explicitly discuss the issue at all; searching the paper for "root" didn't bring anything up. A potential user may thus decide to try out a different clustering method and only later notice a very interesting problem. (That being said, 'evolutionary' systematics being a minority position, very few people appear to be using patrocladistics in the first place.)

"...at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.

A drawback of this method is that it tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than to elements of other clusters. This may lead to difficulties in defining classes that could usefully subdivide the data."

I found this rather interesting given the aforementioned problem that the 'evolutionary' approach to classification would have to place in two different phyla or classes an ancestor-descendant pair of species that are so similar to each other that if presented with them in isolation one would be hard pressed to even justify their placement in different subgenera. It seems the clustering approach for patrocladistics was wisely chosen to produce such solutions.

But note how whoever wrote the above section of the Wikipedia article characterises this behaviour as undesirable, even though the topic is not biological classification at all! The context is clustering in the abstract, not systems that should reflect the reality of evolutionary processes, and even there the people dealing with the performance of clustering algorithms consider such situations problematic.
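The chaining behaviour described in the quoted passage is easy to reproduce with a toy data set. The following is only a sketch with invented numbers (eight points on a line), using dist, hclust and cutree from the stats package; it has nothing to do with any real phylogeny.

```r
# Toy data (invented for illustration): six points spaced 1 apart, then a
# gap of 1.5 and two more points 0.5 apart
x <- c(0, 1, 2, 3, 4, 5, 6.5, 7)
d <- dist(x)

# Single-linkage joins clusters by their closest pair of elements
single <- cutree(hclust(d, method = "single"), k = 2)
print(single)

# Largest distance inside the first cluster vs. smallest distance between clusters
dmat <- as.matrix(d)
within_max  <- max(dmat[single == 1, single == 1])
between_min <- min(dmat[single == 1, single == 2])
print(c(within_max = within_max, between_min = between_min))
```

The first six points end up in one long, thin cluster whose two ends are five units apart, although the last of them is only 1.5 units away from the nearest member of the other cluster.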

Anyway, using a clustering algorithm we get clusters of terminals that may not necessarily have formed a clade in the original phylogenetic tree, generally because they will now lack nested members that are very divergent in whatever set of characters underlies the tree. And now these new clusters are used as an argument to recognise them as paraphyletic taxa in 'evolutionary' classifications.

A contrived example

The example case I am using is a contrived one so that nobody can claim any emotional attachment to any particular classification that they learned as a student.

We have five ingroup species: primitiva is the sister to the rest of the ingroup, which consists of two pairs of sister species. One pair, communis and vulgaris, has changed very little relative to their common ancestor with primitiva and thus sits on short branches of length one. The other pair, aberrans and anomalica, is the end product of some rapid evolutionary changes, and they are together at the end of a long branch of length five. The ingroup is separated by another branch of length five from the outgroup, two species imaginatively called outgroupica and outgroupopsis. This is the phylogram:

Phylogenetic systematists (cladists) classify by relatedness and would thus have to place aberrans and anomalica into whatever group primitiva, communis and vulgaris are in, because communis and vulgaris are actually more closely related to aberrans and anomalica than they are to primitiva. Of course it might make sense to recognise the divergence of aberrans and anomalica by giving that subclade a name, but it has to be a subgroup; it cannot be a new group at the same level as that containing the other three ingroup species.

'Evolutionary' systematists do not consistently classify by relatedness but would in this case most likely be impressed by the long branch between aberrans / anomalica and the other species. They would say, "but they look so different!", and thus prefer to place primitiva, communis and vulgaris in one group and aberrans and anomalica in another group at the same level. The point of patrocladistics is to produce a clustering solution that will support such a classification: we want a cluster of aberrans and anomalica outside of the cluster of primitiva, communis and vulgaris.

First, calculate the cladistic distances by counting the nodes (T-crossings) between any two species:

Next, calculate the patristic distances by adding up branch lengths between any two species. Here the red numbers above the branches are helpful:

I hope I got that all right. There is also probably a function in some R package for pairwise phylogenetic distance, but with as few taxa as in my case I didn't bother to search.

Add up the two to produce patrocladistic distances:
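As a rough sketch, the same three matrices can also be scripted in base R. Note that the tree below is partly an assumption: only the two branches of length five and the length-one branches of communis and vulgaris are taken from the description above. Every other branch is arbitrarily set to one, the node joining the two outgroup species is treated as the root (so no separate root node is counted), and the internal node labels O, I, A, B and C are invented. The resulting numbers therefore need not match the hand-calculated ones.

```r
# Assumed tree as a child/parent table; lengths not stated in the text are 1.
# Internal node names O, I, A, B, C are invented labels; O acts as the root.
tree <- data.frame(
  node   = c("outgroupica", "outgroupopsis", "I", "primitiva", "A",
             "B", "communis", "vulgaris", "C", "aberrans", "anomalica"),
  parent = c("O", "O", "O", "I", "I",
             "A", "B", "B", "A", "C", "C"),
  length = c(1, 1, 5, 1, 1,
             1, 1, 1, 5, 1, 1),
  stringsAsFactors = FALSE
)
tips <- setdiff(tree$node, tree$parent)  # terminals are never a parent

# Rows of the table (= branches) on the path from a node up to the root
path_to_root <- function(node) {
  rows <- integer(0)
  while (node %in% tree$node) {
    i <- match(node, tree$node)
    rows <- c(rows, i)
    node <- tree$parent[i]
  }
  rows
}

# Branches between two tips: those on exactly one of the two root paths
pair_path <- function(a, b) {
  pa <- path_to_root(a)
  pb <- path_to_root(b)
  c(setdiff(pa, pb), setdiff(pb, pa))
}

n <- length(tips)
patristic <- matrix(0, n, n, dimnames = list(tips, tips))
cladistic <- patristic
for (a in tips) {
  for (b in tips) {
    if (a != b) {
      rows <- pair_path(a, b)
      patristic[a, b] <- sum(tree$length[rows])  # summed branch lengths
      cladistic[a, b] <- length(rows) - 1        # nodes between the two tips
    }
  }
}
patrocladistic <- cladistic + patristic
print(patrocladistic)
```

Under these assumptions communis and vulgaris are three units apart (patristic two plus one node), while primitiva and aberrans are eleven units apart (patristic eight plus three nodes).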

Now that we have the distance matrix, we fire up R, load the stats package with library(stats) (it is normally attached by default anyway) and import the matrix. For me the following worked: Make a tab-separated text file containing the complete matrix, including the all-zero diagonal and the other half (the above is only one half, for clarity), with taxon names in both the first row and the first column. Import it as a data frame with df <- read.csv("filename", row.names=1, sep="\t", header=TRUE). Cast it into a distance matrix using dm <- as.dist(df).

Clustering can then be done using the hclust function, one of whose methods is single-linkage: cl <- hclust(dm, method = "single", members = NULL). Draw the resulting dendrogram with plot(cl) and you get this:
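Put together, the import and clustering steps might look like the sketch below. The matrix values are hypothetical: they were derived for a version of the example tree in which every branch length not stated in the text is set to one, so they are not necessarily the numbers used for the dendrogram in this post. The temporary file stands in for the hand-made text file, and cutree is added only to show the two top-level clusters as text.

```r
# Hypothetical patrocladistic matrix (assumed values, see the note above)
taxa <- c("outgroupica", "outgroupopsis", "primitiva", "communis",
          "vulgaris", "aberrans", "anomalica")
m <- matrix(c( 0,  3,  9, 13, 13, 17, 17,
               3,  0,  9, 13, 13, 17, 17,
               9,  9,  0,  7,  7, 11, 11,
              13, 13,  7,  0,  3, 11, 11,
              13, 13,  7,  3,  0, 11, 11,
              17, 17, 11, 11, 11,  0,  3,
              17, 17, 11, 11, 11,  3,  0),
            nrow = 7, byrow = TRUE, dimnames = list(taxa, taxa))

# Write the full matrix as a tab-separated file, names in first row and column
infile <- tempfile(fileext = ".txt")
write.table(m, infile, sep = "\t", quote = FALSE, col.names = NA)

# Import, convert to a distance object and cluster with single-linkage
df <- read.csv(infile, row.names = 1, sep = "\t", header = TRUE)
dm <- as.dist(as.matrix(df))
cl <- hclust(dm, method = "single")

# Cut the dendrogram into its two top-level clusters
groups <- cutree(cl, k = 2)
print(groups)
# plot(cl)  # uncomment to draw the dendrogram
```

With these assumed values, aberrans and anomalica form one cluster and all other species, including the outgroup, form the other.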

Voilà, we have the desired result. Not only are aberrans and anomalica outside of the rest of the ingroup, they even ended up on the far side of the outgroup.

Next post: What happens when we include the intermediate species that have existed along the branches? Can a method developed by a school of classification that always criticises cladists for "ignoring ancestors" deal with ancestors? The results were not quite what I had expected.

6 comments:

Unknown, February 10, 2016 at 12:26 AM
I don't understand why Stuessy prefers single-linkage. Everyone using patrocladistics uses average-linkage instead (i.e. UPGMA). Willner (2014) justifies this as follows: "We used average-linkage as a cluster algorithm because it also reflects the internal heterogeneity of a group and not only the size of the gap between groups as in the case of single-linkage." It seems reasonable to me.

Your example is good for understanding the way the algorithm works. However, you should be aware that it is not a statistically satisfying one, because there are only 5 ingroup species and the long branch also has a length of 5. So arguably, the two putative adaptive zones are connected by a bridge as wide as the zones themselves, i.e. there is only one adaptive zone. I guess this is what adding the ancestors will reveal, isn't it? Single-linkage leads to a completely unresolved patrocladogram while average-linkage should lead to the same result as the cladogram.

In your second post, you should try with a tree where one could reasonably think there are indeed two adaptive zones, for example by increasing the number of species in both the basal paraphyletic group and the crown autophyletic one.