Thursday, March 23, 2017

Species delimitation using the coalescent model

For two weeks or so now a new paper has been making the rounds, and we discussed it in our journal club today:
Sukumaran J, Knowles LL, 2017. Multispecies coalescent delimits structure, not species. PNAS 114(7): 1607-1612.
The context is species delimitation: given a bunch of individuals, how many species are there, and which individuals belong to which species? There are a number of ways to address these questions, and they partly depend on the available data and technology and partly on the species concept the researcher is using.

Very traditionally, of course, a taxonomist would look at the morphology of the specimens and more or less intuitively try to form clusters of similar specimens separated from each other by gaps in morphological variation. In other words, a qualitative application of the Genotypic Cluster Species Concept. More dubious approaches would involve ideal "types" (in a Platonic sense), "central identities", or rules of thumb on the lines of "one difference means subspecies, two differences means species", none of which seem to have much basis in what we know about genetics or evolutionary biology.

More formally, one can take the same theoretical approach but conduct an explicit, quantitative analysis. Score the morphological data and produce a pair-wise distance matrix, for example with the Gower metric, then do a Principal Coordinates Analysis to visualise potential clusters and the gaps between them, or do hierarchical or non-hierarchical clustering. The same can be done with non-morphological data, such as environmental data from the collecting localities, in that case to show that putative species have different ecological niches.
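
As a minimal sketch of what such an analysis might look like, the following Python snippet computes Gower distances for a handful of specimens and then groups them by single-linkage clustering, cutting wherever the distance exceeds a chosen gap. The specimens, the three traits (a length, a count and a flower colour) and the gap of 0.3 are all made up for illustration, and the clustering step stands in for the ordination that one would normally do in a statistics package.

```python
from itertools import combinations


def trait_ranges(specimens, numeric):
    """Observed range of each numeric trait; None marks a categorical trait."""
    ranges = []
    for j, is_numeric in enumerate(numeric):
        if not is_numeric:
            ranges.append(None)
            continue
        vals = [s[j] for s in specimens if s[j] is not None]
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    return ranges


def gower_distance(a, b, ranges):
    """Gower distance between two specimens with mixed trait types.

    Numeric traits contribute |a - b| / range, categorical traits contribute
    0 if equal and 1 otherwise. Traits with a missing value (None) are skipped.
    """
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue
        if r is None:                      # categorical trait
            total += 0.0 if x == y else 1.0
        elif r > 0:                        # numeric trait, scaled by its range
            total += abs(x - y) / r
        used += 1
    return total / used if used else 0.0


def single_linkage_clusters(specimens, numeric, gap):
    """Join any two specimens closer than `gap`; return one label per specimen."""
    ranges = trait_ranges(specimens, numeric)
    parent = list(range(len(specimens)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(specimens)), 2):
        if gower_distance(specimens[i], specimens[j], ranges) <= gap:
            parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(specimens))]


# Hypothetical specimens: (leaf length in mm, number of bracts, flower colour)
specimens = [
    (12.0, 3, "white"), (12.5, 3, "white"), (11.8, 4, "white"),
    (20.1, 6, "yellow"), (19.7, 6, "yellow"), (20.5, 5, "yellow"),
]
numeric = [True, True, False]

labels = single_linkage_clusters(specimens, numeric, gap=0.3)
print(labels)
```

With these invented data the two groups are separated by a wide gap, so the choice of threshold hardly matters; with real specimens, deciding what counts as a gap is of course the hard part.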

A clustering approach can also, of course, be used for genetic data. In that case one would use some kind of genotyping approach, for example microsatellites, AFLP or genome-wide SNPs, and do hierarchical clustering or use software such as STRUCTURE. Although the latter uses a population genetics model, its results are at a practical level comparable to those of non-hierarchical clustering in that we get an optimal number of clusters and information on which sample belongs to which cluster; we then need to make the additional interpretative step of assuming that the clusters are the species. (Meaning we have solved the grouping problem but need additional arguments to solve the ranking problem.)

But today more and more people have multi-locus sequence data at their disposal. These data are already used for phylogenetics under the coalescent model, in the form of species tree approaches, so it was probably unavoidable that the coalescent model would also be applied to species delimitation. The idea behind the relevant software tools such as the currently very popular BPP (disclosure: I have never used it) is that the information from multiple loci can be used to figure out how many species there are among the samples, under the assumption that samples belonging to the same species should have a history of reticulation but samples belonging to different species should have a history of (permanent) lineage divergence.

That sounds logical, but the aforementioned paper seems to hit this idea below the waterline: as the title suggests, the authors conclude that species delimitation under the coalescent resolves population structure, not species limits.

Frankly, although the method has been extremely popular lately, there has also been a lot of scepticism in the community. After all, its application has produced rather one-sided results, nearly always splitting species into several smaller species. I have heard a talk that amounted to a scathing criticism of the approach, arguing that genetic isolation of a small population for less than 200 years would be enough to make it show up as a separate "species" under the coalescent, surely a ridiculous outcome.

Consequently, the present paper fits my thinking on the issue; I, personally, would rather use clustering approaches to search for gaps in variation. But that being said, the way the authors addressed the issue still seems a bit odd to me and leaves me wondering how far their particular argument will carry.

The thing is, the study does not involve any empirical data; it is entirely based on simulations. The authors used a model under which at first only populations split, and then some of them may turn into separate species after varying lag times. Although there does not appear to be an explicit process in the model, I guess the assumption is that it takes a bit of time to accumulate enough differences that a population cannot reunite with its sister population even if the two come back into contact with each other. They then simulated species lineages under that model, then gene trees within those species lineages, and then sequence matrices for those gene trees. Finally, they analysed the sequence matrices with the coalescent-based species delimitation approach, trying to get the original species back.
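
To make the lag idea concrete, here is a toy sketch in Python. It is emphatically not the authors' simulation: it ignores the tree structure, gene trees and sequences entirely, and the uniform split times, the exponentially distributed lag and all parameter values are my own assumptions for illustration. All it does is count how many populations exist at the present and how many of them have had enough time to become species.

```python
import random


def simulate_lagged_speciation(n_splits, mean_lag, total_time, rng):
    """Toy model: every split creates a new population immediately, but the
    new population only counts as a species once its own random lag is over.

    Returns (number of populations, number of species) at time `total_time`.
    """
    # Assumption: split times are spread uniformly over the simulated period.
    split_times = [rng.uniform(0, total_time) for _ in range(n_splits)]

    n_populations = 1 + n_splits   # the ancestor plus one population per split
    n_species = 1                  # the ancestor is a species from the start
    for t in split_times:
        # Assumption: the lag to full speciation is exponentially distributed.
        lag = rng.expovariate(1 / mean_lag)
        if t + lag <= total_time:
            n_species += 1
    return n_populations, n_species


rng = random.Random(1)
n_populations, n_species = simulate_lagged_speciation(
    n_splits=10, mean_lag=2.0, total_time=5.0, rng=rng
)
print(n_populations, "populations,", n_species, "species")
```

Under any such model the number of species can never exceed the number of populations, so a method that reports every population split will report at least as many units as there are species.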

Surprise, surprise, the coalescent species delimitation approach recovered the population splits, not the species splits. But what has this really shown? As far as I can tell, it has shown that an approach using a model counting all population splits immediately as species splits will not produce the results expected under a model not counting all population splits immediately as species splits.

Maybe I am missing something, but that is exactly what I would have expected before any complex simulations on supercomputers had been conducted. If I simulate bicycle rides under a model that assumes I cycle to work at 20 km/h and then try to fit the results to a model that assumes I cycle to work at 100 km/h, I will also likely find that there is poor fit, right? But that does not tell me anything about how fast I really cycle to work, or in other words, anything about which of the two models is a better fit to reality.
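
The bicycle analogy can be spelled out in a few lines of Python. This is only a sketch of the argument; the 10 km commute, the one minute of random noise and the 200 rides are invented numbers.

```python
import random
import statistics


def simulate_commute_minutes(speed_kmh, distance_km, n_rides, rng, noise_sd=1.0):
    """Simulate ride durations in minutes under a fixed-speed model plus noise."""
    expected = distance_km / speed_kmh * 60
    return [expected + rng.gauss(0, noise_sd) for _ in range(n_rides)]


def mean_squared_error(observed_minutes, speed_kmh, distance_km):
    """Misfit between observed durations and a fixed-speed model's prediction."""
    predicted = distance_km / speed_kmh * 60
    return statistics.fmean((t - predicted) ** 2 for t in observed_minutes)


rng = random.Random(42)
distance_km = 10.0  # hypothetical commute

# Data are generated under the 20 km/h model ...
rides = simulate_commute_minutes(20, distance_km, 200, rng)

# ... and then compared against both models.
mse_generating = mean_squared_error(rides, 20, distance_km)
mse_other = mean_squared_error(rides, 100, distance_km)

print(f"misfit to 20 km/h model:  {mse_generating:.1f}")
print(f"misfit to 100 km/h model: {mse_other:.1f}")
```

The 100 km/h model fits the simulated rides far worse than the 20 km/h model, but only because the rides were generated at 20 km/h in the first place. No real commute was ever measured, so the comparison says nothing about my actual speed.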

Consequently, I have to admit that arguments along the lines of "this real-life population that is clearly not a separate species but has merely been isolated for 200 years comes out as a new species under the coalescent approach" seem more convincing to me.
