Monday, February 10, 2014

More 'Framework': Can we trust molecular phylogenetics?

Continuing with extended comments on the book I am reading, Richard Zander's Framework for a Post-Phylogenetic Systematics.

In my previous post, I examined the claim that parsimony analyses are misled by 'budding' speciation. As indicated then, this does not appear plausible unless we deliberately withhold information from the analysis, in which case the analysis itself would not be at fault.

However, it is not only parsimony analyses of morphological data that Zander mistrusts and considers to be too 'mechanistic'. He also claims that molecular phylogenetics cannot be trusted to infer the correct relationships between species. Although he uses different terms, partly of his own invention, his main argument appears to be that the stochastic nature of lineage sorting will often mislead phylogenetic analyses.

To explain, it is well known that a species can have a considerable number of different alleles for the same gene region, especially if it has a large population size. Even if a species starts with only one allele, over time new alleles arise from the existing ones through mutation. This means that if we sample enough specimens from a widely distributed species we may find a whole bunch of alleles in it that have their own phylogenetic relationships to each other (a gene tree).

Also over time, some of the alleles get lost, not necessarily because they are selected against but perhaps just because a finite population cannot hold an ever-increasing number of alleles, so some of them disappear at random even if they are at no selective disadvantage. This random loss is called genetic drift.
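To make the idea of random loss concrete, here is a minimal sketch of neutral drift in the style of a Wright-Fisher model. It is only an illustration under stated assumptions: the population size, the allele names, and the number of generations are made up, there is no selection and no new mutation, and the function name `wright_fisher` is hypothetical.

```python
import random


def wright_fisher(allele_counts, generations, rng):
    """Simulate neutral drift in a population of constant size.

    Assumptions (illustrative only): no selection, no mutation, and
    every gene copy in the next generation copies a randomly chosen
    gene copy from the current generation.
    """
    # Expand {"green": 10, ...} into a flat list of gene copies.
    population = [a for a, n in allele_counts.items() for _ in range(n)]
    size = len(population)
    for _ in range(generations):
        population = [rng.choice(population) for _ in range(size)]
    # Count the surviving alleles.
    result = {}
    for allele in population:
        result[allele] = result.get(allele, 0) + 1
    return result


rng = random.Random(1)  # fixed seed so the run is repeatable
start = {"green": 10, "blue": 10, "red": 10, "yellow": 10}
end = wright_fisher(start, generations=500, rng=rng)
print(end)
```

No allele has any advantage here, yet alleles can only ever be lost in this sketch, never regained, so the diversity shrinks over time purely by chance.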

There are now two generally recognized problems with inferring species relationships from genetic data. The first is called incomplete lineage sorting. In this case, a species has inherited a lot of ancestral allele diversity, and some of the alleles within it are more closely related to the alleles in a different species than to its other alleles. In other words, the alleles in one species are non-monophyletic relative to the alleles in another species. However, through the process of genetic drift described above, the alleles in one species will ultimately become monophyletic as most of them are lost.

The funny thing is, Richard Zander's perspective here is the exact opposite of that of most phylogeneticists. Nearly everybody considers incomplete lineage sorting a problem that has to be solved by using species tree methods. Zander does not consider it a problem but instead desirable, indeed a prerequisite for understanding relationships. Only if the alleles in one species are non-monophyletic, he argues, can we assume that it is really closely related to (or, in his words, the "ancestor of") those species whose alleles are nested in it.

His distrust of molecular data in the absence of incomplete lineage sorting comes from the second, closely related problem. Assume that a lineage branches off daughter lineages while it carries ancestral allele diversity, and ultimately we only sample one of the alleles, or alternatively all but one die out (Zander invents the term "implicit paraphyly" for this). Depending on which of the alleles end up in the daughter lineages and which alleles we were ultimately able to sample, we may end up with a gene tree that is completely unrepresentative of the true species phylogeny. Behold:


This is my own representation of Zander's example from plate 6.1 on page 56 in the Framework. An ancestral species contained one allele, the green one. It produced a different, new allele depicted in blue. Now the lineage diversified into four species, and all three side lineages got alleles descended from the blue one. Then, the blue allele dies out in species A and we can sample only the green one. This means our gene phylogeny will say (A,(B,(C,D))) although the species phylogeny is (B,(C,(D,A))).
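How different the two trees are can be checked by listing the clades each one implies. The following is a small sketch, not a standard library routine: the trees are written as nested tuples, and the helper `clades` is a hypothetical name introduced only for this illustration.

```python
def clades(tree):
    """Return (leaves of tree, set of all clades with two or more leaves).

    A tree is either a leaf name (a string) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the group of all leaves below this node
    return leaves, found


gene_tree = ("A", ("B", ("C", "D")))
species_tree = ("B", ("C", ("D", "A")))

_, gene_clades = clades(gene_tree)
_, species_clades = clades(species_tree)

shared = gene_clades & species_clades
conflicting = gene_clades - species_clades

print("shared:", [sorted(c) for c in shared])
print("only in gene tree:", sorted(sorted(c) for c in conflicting))
```

Apart from the trivial group containing all four species, the two trees share no clade at all: the gene tree groups C with D and B with C and D, while the species tree groups D with A and C with D and A.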

It is important to understand that in principle this is a real problem. Yes, this can indeed happen, and surely something like it does happen sometimes. But that does not mean that we cannot trust molecular phylogenies at all.

First, as I have already explained elsewhere, this is only a problem if all side lineages survive, and that is clearly an unrealistic assumption. More than 99% of all species that ever existed are extinct, and thus most surviving groups of organisms are connected by relatively long deep branches on the tree of life along which lineage sorting had time to happen. To the degree that there is a problem, it is mostly a problem of very shallow phylogenetics, of trying to figure out the phylogeny of small groups of closely related species, not of trying to figure out whether a medium-sized genus is monophyletic.

Second, most phylogeneticists know that these problems exist. There is a massive amount of literature on the aforementioned species tree methods which have been developed to deal with precisely this situation. For starters, read this recent review by Luay Nakhleh.

Third, most phylogeneticists also know that we can deal with the problem by sampling more specimens per species (to get more different alleles) and more independent genes whose histories can be compared. Many of Zander's arguments about the supposed problems with molecular data assume that only one gene and one specimen per species are used for analysis, and that is simply not what I, for example, do at the lowest taxonomic levels.

Fourth, let us look more closely at the above example. Again, yes, this could in principle happen, but just how likely is this rather extreme case? We can consider the gene trees inside a species tree to be a sampling problem: Every time a side lineage branches off, it samples one or a few of the alleles in the ancestral lineage. Now how likely is it that all three side lineages will grab the blue allele? All else being equal, it should be relatively unlikely.

However, IF they all randomly grabbed the blue allele, then we might start to suspect that they did so because it was much more frequent in the ancestral lineage than the green allele. But then we immediately run into the next question: How likely is it that the frequent blue allele will go extinct but the rare green one survives? The whole scenario suddenly looks like a rather improbable worst case, not like a situation that we have to expect very often.
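The trade-off in the last two paragraphs can be put into rough numbers with a toy calculation. This is a back-of-the-envelope sketch under strong simplifying assumptions that are not from the Framework: both alleles are neutral, the blue allele has the same frequency `p_blue` at each of the three branching events, each side lineage is founded by a single randomly drawn allele copy, and the chance that blue is eventually lost from the main lineage is taken as `1 - p_blue` (for a neutral allele, the probability of eventual fixation equals its current frequency). The function name is hypothetical.

```python
def worst_case_probability(p_blue, n_side_lineages=3):
    """Toy probability that every side lineage draws the blue allele
    AND blue is later lost from the main lineage.

    Assumes neutrality, independent single-copy founding events, and
    a constant blue frequency p_blue (illustrative assumptions only).
    """
    if not 0.0 <= p_blue <= 1.0:
        raise ValueError("p_blue must be between 0 and 1")
    all_sides_blue = p_blue ** n_side_lineages
    blue_lost_in_main = 1.0 - p_blue
    return all_sides_blue * blue_lost_in_main


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p_blue={p:.2f}  probability={worst_case_probability(p):.4f}")
```

Under these assumptions the product p^3 (1 - p) peaks at p = 0.75, where it equals 27/256, or roughly 10.5%. A rare blue allele is unlikely to be drawn three times, and a common one is unlikely to be the one that dies out, so the scenario stays improbable whatever the frequency. A real population would of course not keep the allele frequency fixed between branching events, so the numbers should be read as an illustration of the argument, not as an estimate.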

So in summary, the author of the Framework really does not trust molecular data, and he is free to use other types of data. But his argumentation is based on rather unrealistic assumptions, and to the degree that there are problems, phylogeneticists solved them years ago.
