Tuesday, June 28, 2016

More adventures in science spammer land

Thus writeth the Open Journal of Plant Science, whose mass eMail is totally not spam, oh no sir:
Please Note: This is not a spam message and has been sent to you because of your eminence in the field.
Alright then.
Greetings for the day!
If you don't want to be taken for a spammer, perhaps try not to begin the eMail like every other science spammer.
Peertechz was launched this Journal to support the Open Access in the way of publishing manuscripts, new technics and methods in science.
I was decided any papers not to submit to journals whose editors not write coherent sentence.
Open Journal of Plant Science published articles are freely available online to the readers for life time.
They don't specify, but presumably they mean the life time of their website, right?
The journal encourages the authors to publish their manuscripts in a large Open Access network: Peertechz and its looking for the manuscripts from selective scientists like you who have enormously contributed to the scientific community.

It would really be grateful to you if you can send us energetic and enthusiastic submission to successfully release the upcoming issue. Send us any type of article to increase the visibility of the Open Journal of Plant Science.
Sadly I don't think that I have ever had a manuscript that I would have called 'energetic', be it as an author or as a reviewer. How do they differ from non-energetic ones?
If you are interested, please respond us on or before 48 hours and send your paper by July 15th, 2016.
I have seen that with quite a few journal spammers before. They have a very close deadline and then a more distant deadline in the very same sentence. So which is it? Why would an author have to respond in two days? I strongly suspect that they would also accept manuscripts on 15 July from people who didn't contact them tomorrow. In fact if we are talking a regularly appearing journal here then it should accept manuscripts on 16 July, so why claim that there are deadlines at all?
We are looking forward to have valuable submission from you soon.
Darauf kannst Du lange warten.

---

I also recently saw an example of what comes out at the other end of the process. One of my publication alerts notified me that I had been cited, and the paper was curious enough that I followed the link. It is
Gayathiri et al. 2016. A review: potential pharmacological uses of natural products from Laminaceae. International Journal of Pharma Research & Review 5: 21-34.
Yes, "Laminaceae". Seven authors have written an alleged review article on the mint family Lamiaceae and have consistently misspelled its name throughout the entire manuscript. Ah no, I lie; they spelled it "laiminace" in the keywords, so scratch consistently.

This alone is just... indescribable. Seriously, how can somebody write a review about something and not know how it is spelled? How can seven authors and presumably at least one editor miss that, even in that journal?

Also, my name is misspelled in the reference list, the species I am cited about is misspelled, and as for the text itself, let's just quote a single sentence from the abstract.
Although, medicinal plants continue provide a new drug leads in drug discovery, and numerous challenges are encountered in procurement and selection of plant materials, screening method and the scale up of active compound, hence this brief review work presents a study of the importance of natural products, especially those derived from higher plants and aims to the highlight the pharmacological significant of Laminaceae family in terms of drug development.
Yay for science!

---

On Jeffrey Beall's blog, which has already covered criteria for (1) what it calls predatory journals and (2) fake impact factors, a discussion has been started about criteria to similarly classify conferences as "predatory". Suggestions by contributor James McCrostie are, in short:
  1. Any kind of deceit, such as hiding that the conference is for-profit or claiming that the organisers are based in a different country than they really are;
  2. No or inadequate peer review, including review only by the conference organisers; later fast acceptance of submitted talks is also mentioned;
  3. High conference fees;
  4. Overly broad scope;
  5. Connections to companies known for running "predatory" journals;
  6. "Regularly accepting papers by undergraduates";
  7. Advertising the conference like a holiday;
  8. Spamming, especially to random people outside of the relevant field of science;
  9. People are allowed to give multiple presentations at the same conference.
(Numbers mine, the original list is more extensive and not numbered.)

I have received quite a few spam eMails advertising in broken but hyperbolic English for obviously crappy conferences, often from fields I have no relation to whatsoever, such as fertiliser research or medicine. This is clearly a problem, although as with the journals I would argue that no competent scientist should ever fall for them. People should know what the relevant conferences in their field are and be able to delete these spam eMails at a glance. I assume they mostly prey on the desperate.

Still, I can see how a list of dodgy journals, for example, is useful to people outside of a field or totally outside of science, to help gauge if somebody's CV is inflated. I am just not so sure that it is quite as easy to develop useful criteria for recognising dodgy conferences as it is for journals.

Points 1, 4, 5, 7, and 8 are clearly valid. Deceit, spamming and being run by a crappy journal company are obvious alarm bells, and overly broad scope makes a conference practically useless for networking and learning. But the others? Not so much.

Maybe it is different in other fields, but I do not think that I have ever participated in a professional conference that peer reviewed the abstract submissions except for whether they fit into the scope of the meeting. In other words, the organisers of a systematic botany conference will look over the abstract, and if it were on theology or law or perhaps obviously pseudoscience they would kick it out, but that's that. The closest I have ever seen is that the committee would demote some talk submissions that were deemed less important to posters, and even that only at the very most prestigious meetings.

High fees? Well that is in the eye of the beholder I guess, but some legitimate conferences can be terribly pricey.

"Regularly accepting papers by undergraduates"? What? Are there seriously fields of research where the quality of work doesn't matter but only one's status and age? Luckily I am not working in one of them.

Similarly, what is the problem with giving two talks at the same meeting? At most conferences in my field there are two or three people who do that. Nobody has ever had an issue with that as long as they don't do it every time, and as long as nobody gives five or something like that.

Perhaps practices are just too different between the various fields of science and scholarship to find easy agreement on this.

Sunday, June 26, 2016

Implementation of substitution models in phylogenetic software

Concluding the little series of posts on nucleotide substitution models, below is a summary of my current understanding of how to set several of the models discussed in the previous post in PAUP and MrBayes. But first a few comments.

For PAUP there are three possible options for the basefreq parameter. My current understanding is that equal sets equal base frequencies as in the JC model (duh), empirical fixes them to the frequencies observed in the data set, and estimate has them estimated across the tree. My understanding is further that while estimate is more rigorous, empirical is often 'close enough' and saves computation time. The point is, in any case below where it says one of the latter two options I believe one could also use the other without changing the model as such. I hope that is accurate.

What I do not quite understand is how one would set models like Tamura-Nei in PAUP. At least one of the sources I consulted when researching for this post (see below) suggests that one can set in PAUP models for example with variable transition rates but equal transversion rates, but the PAUP 4.0b10 manual states that the only options for the number of substitution rate categories are 1 (all equal), 2 (presumably to distinguish Tv and Ts), and 6 (all different). Would one not need nst = 3 or nst = 5 to set the relevant models? Perhaps the trick is to set nst = 6 but fix the substitution matrix? But that would mean one cannot estimate it during the analysis.

For the MrBayes commands note that the applyto parameter needs to be specified with whatever part(s) of the partition the particular model should apply to, as in applyto = (2) for the second part. The MrBayes commands I found my sources appear to be rather short; I assume that default settings do the rest. Note also that the old MrBayes versions that I am familiar with have now been superseded by RevBayes, but I have no experience with the latter program's idiosyncratic scripting language.

In addition to PAUP and MrBayes, I mention whether a model can be selected in three other programs I am familiar with, RAxML, BEAST and PhyML. I have no personal experience with most other Likelihood phylogenetics packages. I tried to check what is available in MEGA, but it wouldn't install for me, and I am not sure if the list of models in the online manual shows only those that are actually available for Likelihood search or whether it includes some that are only available for Distance analysis. The relevant list is linked from a page on Likelihood, but its own text implies it is about Distance. Either way, MEGA appears to have lots of options but I didn't indicate them below.

General Time-Reversible (GTR)

PAUP: lset nst = 6 basefreq = empirical rmatrix = estimate;

MrBayes: lset applyto=() nst=6;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Tamura - Nei (TN93 / TrN)

Available in BEAST 2.1.3 and PhyML 3.0.

Symmetrical (SYM)

PAUP: lset nst = 6 basefreq = equal rmatrix = estimate;

MrBayes: lset applyto=() nst=6; prset applyto=() statefreqpr=fixed(equal);

Hasegawa-Kishino-Yano (HKY / HKY85)

PAUP: lset nst=2 basefreq=estimate variant=hky tratio=estimate;

MrBayes: lset applyto=() nst=2;

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Felsenstein 84 (F84)

PAUP: lset nst=2 basefreq=estimate variant=F84 tratio=estimate;

Available in PhyML 3.0.

Felsenstein 81 (F81 / TN84)

PAUP: lset nst=1 basefreq=empirical;

MrBayes: lset applyto=() nst=1;

Available in PhyML 3.0.

Kimura 2-parameters (K80 / K2P)

PAUP: lset nst=2 basefreq=equal tratio=estimate;

MrBayes: lset applyto=() nst=2; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML and PhyML 3.0.

Jukes Cantor (JC / JC69)

PAUP: lset nst=1 basefreq=equal;

MrBayes: lset applyto=() nst=1; prset applyto=() statefreqpr=fixed(equal);

Available in RAxML, BEAST 2.1.3 and PhyML 3.0.

Invariant sites (+ I)

For PAUP, add the following to the lset command: pinvar = estimate

For MrBayes, add the following to the lset command: rates = propinv

Gamma (+ G)

For PAUP, add the following to the lset command: pinvar = 0 rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = gamma

Invariant sites and Gamma (+ I + G)

For PAUP, add the following to the lset command: pinvar = estimate rates = gamma ncat = 5 shape = estimate (or another number of categories of your choice for ncat).

For MrBayes, add the following to the lset command: rates = invgamma

References

This post is partly based on the following very helpful sources, all accessed c. 20 June 2016.

Faircloth B. Substitution models in mrbayes. https://gist.github.com/brantfaircloth/895282

Lewis PO. PAUP* Lab. http://evolution.gs.washington.edu/sisg/2014/2014_SISG_12_7.pdf

Lewis PO. BIOL848 Phylogenetic Methods Lab 5. http://phylo.bio.ku.edu/slides/lab5ML/lab5ML.html

And a PDF posted on molecularevolution.org.

Friday, June 24, 2016

What number of schemes to test to get what substitution models in jModelTest

Another thing that occurred to me that might be useful to write about in the context of substitution models concerns jModelTest.

The background is here that before setting a model in a phylogenetic analysis, one would usually conduct model-testing. There are some programs that do their own testing, but others don't, and consequently people have written software that examines your dataset and suggests the best model. The classic program was simply called ModelTest. It was developed by David Posada, and it was primarily used through and for PAUP. It actually wrote out the PAUP commands of the best-fitting model, so that one could simply copy and paste them over into the Nexus file, and off we go.

Then people wanted to use the results of the model test for MrBayes. Problem was, MrBayes didn't do all the models that PAUP did, and it was annoying to see a model suggested that one couldn't implement. So Johan Nylander produced a modified version called MrModeltest, and it conveniently wrote out the MrBayes and PAUP commands for the best-fitting model.

These programs have now been superseded by jModelTest. On the plus side, this tool is highly portable and extremely user-friendly thanks to its GUI. Also, the user does not have to have PAUP, but instead jModelTest simply comes packaged with PhyML and hands the data over to that software. Also, it is parallelised. On the other hand, in contrast to its predecessors it does not appear to write out the PAUP or MrBayes code. But well, that is what my next post is going to deal with.

For the moment I want to address a different problem: When starting the Likelihood calculations for various models, the user can select whether 3, 5, 7, 11 or 203 (!) substitution schemes, and thus double the number of models, are going to be tested. (And with or without invariant sites and gamma, if desired.) But the thing is, it is not clear from just looking at the interface which models for example the seven schemes are going to cover. If I select seven, will all the models I could implement in BEAST be included? Or will I just waste a lot of computing time on testing models that BEAST can't do anyway?

So I just ran all the numbers across a very small dataset, and this is what models are being tested in each case:

3 substitution schemes: JC, F81, K80, HKY, SYM, GTR

Three schemes is the highest you need to test if deciding on a model for RAxML or MrBayes. Note that as of updating this post (30 Nov 2016) RAxML can only do JC, K80, HKY, and GTR, while MrBayes can do all six.

5 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR

I am writing them out as named in jModelTest, but note that TPM1 is synonymous with K81 or K3P, and TrN is called TN93 is some software. An "uf" clearly means a variant with unequal base frequencies, and as mentioned in the previous post "ef" is similarly a variant with equal base frequencies.

Five schemes is the highest you need to do if you are model-testing for a BEAST run, because all its models are covered. It also seems to me as if with the exception of F84 all models available in PhyML are covered, and that one doesn't appear to be tested under higher numbers of substitution schemes either. So the same consideration applies, unless jModelTest uses a different name for it (?).

7 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TIM1ef, TIM1, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

11 substitution schemes: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1ef, TIM1, TIM2ef, TIM2, TIM3ef, TIM3, TVMef, TVM, TPM1, TPM1uf, SYM, GTR

At this stage it becomes clear that there are a lot of strange variants of the Three-Parameter and Transitional Models that I overlooked in my previous post. I don't think they are used very often though...

203 substitution schemes: This adds a gazillion models named with a scheme of six letters or six letters plus F. Not sure how many people ever use them, so I will just stop here. I have a strong tendency to find GTR or HKY suggested for my data anyway...

(Updated 30 Nov 2016 to include information for RAxML and MrBayes.)

Thursday, June 23, 2016

Overview over the substitution models

Continuing from the previous post, here is a run-down of the commonly used models, their abbreviations, alternative names and abbreviations, and parameters. They are sorted from most parameter-rich to simplest.

General Time-Reversible (GTR)
Base frequencies variable.
All six substitution rates variable.
Number of free parameters: 8

Transversional (TVM)
Base frequencies variable.
All four transversion rates variable, transitions equal.
Number of free parameters: 7

Transitional (TIM)
Base frequencies variable.
Two different transversion rates, transitions equal.
Number of free parameters: 6

Symmetrical (SYM)
Base frequencies equal.
All six substitution rates variable.
Number of free parameters: 5

Kimura 3-parameters (K81, K3P, TPM1)
Base frequencies variable.
Two different transversion rates, transition rates equal.
Number of free parameters: 5

Tamura-Nei (TN93, TrN)
Base frequencies variable.
Transversion rates equal, transition rates variable.
Number of free parameters: 5

Hasegawa-Kishino-Yano (HKY)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from F84 in how the transversion rate and transition rate parameters are derived.

Felsenstein 84 (F84)
Base frequencies variable.
Transversion rates equal, transition rates equal.
Number of free parameters: 4
Differs from HKY in how the transversion rate and transition rate parameters are derived.

Felsenstein 81 (F81, TN84)
Base frequencies variable.
All substitution rates equal.
Number of free parameters: 3

Kimura 2-parameters (K80, K2P)
Base frequencies equal.
Transversion rates equal, transition rates equal.
Number of free parameters: 1

Jukes-Cantor (JC, JC69)
Base frequencies equal.
All substitution rates equal.
Number of free parameters: 0

And below an overview of the same information in graphical form. The numbers in brackets are the numbers of free parameters that the relevant axis is contributing.


Apparently at least some people fill the empty fields on the left in by using the name of the relevant model to the right and adding "ef" to it. So TVMef would, for example, be a model with all transversion rates variable, transition rates equal, and equal base frequencies. But there is probably a reason why those missing models don't have any well known names; they seem to be rarely used, if ever.

Tuesday, June 21, 2016

Nucleotide substitution models

I recently had reason to consider alternative nucleotide substitution models and got a bit frustrated at the scarcity of clear, easily understood summaries of (1) their differences, (2) their availability in different phylogenetic programs, and (3) how to implement them in the various programs.

This is not to say that the information isn't out there. It is out there, and other people may be very satisfied with the form in which it can be found, e.g. in Wikipedia articles full of mathematical matrices. But because I had a hard time finding exactly what I would have wanted I am going to write a few posts over the next few days summarising it in a way that seems intuitive to me. And perhaps this will also be useful to others.

Some throat clearing will be necessary.

Models

First, what are models? I am certainly not a modeller but was originally trained in parsimony analysis. Consequently I am not the ideal person to introduce this, and I may make a few slips further down the line. But the principle is rather simple: models are mathematical descriptions or abstractions that allow us to describe a process and ideally to make predictions, at least stochastically.

Models may differ vastly in their complexity, in the number of their parameters. Imagine we want a model that allows us to describe how fast I am cycling under different circumstances, assuming that I never change gears. A simple model would perhaps have only one parameter, how much effort I invest. We can now put a value in that describes the energy I am using to pedal and get out a speed value.

We may compare this model against observations of me cycling, and we will find that while there is some fit it is far from ideal. Most likely adding another parameter, the force of headwind I have to work against, will improve the fit of the model to the data. Perhaps adding a third parameter, the incline on which I am cycling at any given moment, will further improve the model, and so on.

Overparameterisation

It is often the case that a model becomes better the more parameters are added. There are, however, most likely diminishing returns, as adding the tenth factor will improve the model less than adding the second. Worse, there is a concept called 'overparameterisation', meaning that a model can be too complicated for its own good. I found a very nice explanation of what that means here.

In case the link is dead when you read this, their example is predicting the seventh point on a graph from the first six points. A very simple model would be a straight line (the usual y = a * x + b), more complex, quadratic models would be monotonous curves, and even more complex, higher polynomial ones would be wavy lines.

If the six points are not exactly on a straight line (perhaps just because there was some noise in the data), it is trivial to produce a very complex model that fits them better than the simple line, because it could get much closer to each of them. But they would work much less well at predicting a seventh point outside of the range of the first six if the underlying pattern was really just a straight line, because they'd tend to shoot off to very high or low values at either end of the range to which they were fitted.

This also related to the reason why I am currently looking into models. It has now happened for the second time this year that jModelTest suggested substitutions models that turned out to be too parameter-rich when I tried to use them in BEAST, prompting me to consider which one I should use instead and why.

Nucleotide substitution models

As their name indicates, nucleotide substitution models are used in phylogenetic analysis of DNA sequence data. They differ in the number of parameters but also have several things in common. Most importantly, all commonly used models assume reversibility, meaning a given substitution is as likely as the the reversal of that substitution.

Parameters of the nucleotide substitution models

Obviously, this cuts the number of possible parameters that can be used to describe substitutions in half, but it still leaves the following:


Each of the six double-headed arrows is a possible parameter describing the likelihood of substitution of one nucleotide in the DNA with another (and the reverse). They can be grouped into classes, most obviously transitions and transversions. Nucleotides fall into two chemically different groups, purines and pyrimidines. A substitution within a group is called a transition (Ts), a substitution between purine and pyrimidine is called a transversion (Tv). Accordingly there are parameter-poorer models that assign the same probability to all transversions and/or to all transitions, or something in-between.

Relevant for model complexity is, by the way, not the number of parameters, but the number of free parameters. Even in a model that has different probabilities assigned to all six substitutions the sixth can be calculated from the other five, meaning that there is always one free parameter less than there are parameters.

In addition to these up to six substitution parameters, models may have the four base frequencies as additional parameters. Again the number of free parameters is one less, i.e. three, because obviously the fourth base frequency must be one minus the other three. Other models simply assume equal base frequencies; in these cases the number of free parameters from that part of the model is zero.

Any of the substitution models described by these sets of parameters can then be modified as '+ I', '+ G', and '+ I + G'. I is the invariant model, involving an estimate of the ratio of invariant sites in the data. G is the gamma model which allows sites to evolve at different rates using a set number of gamma categories and a factor alpha describing the shape of the frequency distribution of sites across the categories. I found this explanation rather helpful.

What this is going to be about

Finally, note that the following posts will not be about how the likelihood function is maximised, how phylogenetic software searches for the best tree given a model, or any theoretical or mathematical background. Honestly, at least to some degree the end user of phylogenetic software would be justified to treat much of this as a black box. The point is rather to summarise information on what parameters the different models have, how they are related, where and how they are implemented, and even such minor but useful to know things as which two or three names confusingly refer to the very same model.

Saturday, June 18, 2016

Taxonomic terminology again

I have often complained about idiotically unusable identification keys. Unclear descriptions of species are a related issue.

Yesterday I read the description of a daisy and encountered the term 'peracute'. Acute means pointy and is usually understood to mean that the apex of an organ shows an angle of less than 90 degrees (as opposed to obtuse for more than 90 degrees). The most frequently encountered problem is somebody writing 'subacute' to indicate... well, that is a good question. Pointy but not very pointy? And then of course we have issues with some taxonomists using related terms for pointy apices that show some kind of curvature, like apiculate or attenuate, in inconsistent ways.

But I have never before come across peracute. My first reaction was to try and translate it. Again, acute means pointy. Per means through, so through-acute? That does not make any sense whatsoever. A more modern meaning is 'for each', as in twenty kilometres per hour. For each acute? Yeah, not very sensible either.

But wait, I have that old glossary I wrote about in March. But no, sadly it doesn't contain that term either. So really, what is the point if a taxonomist uses a word that simply doesn't exist? Who will understand the description under those circumstances? Wouldn't it be nice if we could define four to six clear and universally agreed-on terms for, in this case, apex shapes, and use them, and only them, consistently?

Ah well. At least I had the opportunity to peruse the glossary again. And of course peracute is nothing. This book is the mother lode of bad ideas.

Pampinus - n. Tendril.
Then why not write tendril? Ye gods.

Panduriform - a. Fiddle-shaped.
Then why not... oh, we had that already.

Parastomon - n. An abortive stamen, a staminodium.
You know, we already have a word for that. It is 'staminodium' or 'staminode'.

Phoranthium - n. The receptacle of the capitulum of Compositae.
Ah, I have heard of that structure, only everybody calls it a ... receptacle.

Argh. Argh. Argh. Argh. Headdesk. Are some people actually deliberately trying to make their taxonomic publications pointless, useless and maximally infuriating?

Friday, June 17, 2016

Botany picture #230: Jacobaea vulgaris


Jacobaea vulgaris (Asteraceae), also known as Senecio jacobaea, Germany, 2005. This species is a widely introduced weed, and among the places it has been introduced to is Australia.

I had known for some time that it and several relatives had been officially segregated out into Jacobaea. Having met it first as a student under a Senecio name, not seeing any obvious morphological differences to species that remained under that genus, and, perhaps most importantly, without any pressing need to research the matter, I was unsure what to make of it.

Perhaps this was just some random splitting based on a whim? Or there would have been a splitting and a lumping option to make something monophyletic, and the splitting option that creates two indistinguishable genera was preferred over a lumping option that would create a heterogeneous genus? For all I knew this may well have been one of those scenarios that 'evolutionary' systematists always complain about.

Recently, however, a colleague discussed the matter with me, and that prompted me to properly study the literature. The phylogeny of the Senecioneae was thoroughly resolved in a series of papers published around ten years ago by Pieter Pelser and collaborators. Perhaps the best overview for present purposes is provided by figure 1 of Pelser et al. (2007, Taxon 56: 1077-1104). It is a single phylogeny broken up over ten pages. The part that concerns Jacobaea is figure 1F, and it can conveniently be summarised as follows:


So this is not just paraphyletic. Even 'evolutionary' systematists should be unable to accept a genus that includes Jacobaea and Senecio in the strict sense. But surely then two clades separated by so many other clades, among them illustrious genera like Crassocephalum and Dendrosenecio, must have some diagnostic characters? Sadly, no:
Although Jacobaea is distinguished from Senecio s.str. and other genera in Senecioneae by DNA sequence characters (Pelser & al. 2002), clear morphological synapomorphies for Jacobaea have not been identified to date.
That may be very unsatisfying, but life isn't always fair. There is no rule that says that we will always find diagnostic characters to tell all distantly related groups apart.

Sunday, June 12, 2016

Parsimony in phylogenetics again

Just some short observations:

A few days ago I learned that somebody has found my TNT script for the Templeton test useful and is not only using it but also improving on it. A few days before that I found that my post on using TNT made it into the acknowledgements of a publication. That is really nice to see; my expectation was never that this blog would be home to a lot of real-time discussion, but rather that people can find something useful to them even years after I posted it.

---

I checked that 'parsimonygate' hash tag again, and found a few interesting (or perhaps revealing) tweets. The first comments on one of the graphs from my surprisingly popular post on the popularity of phylogenetic programs over the years with a laconic "TNT is thin green". Now I have no idea what the tweeter meant with that. His profile clarifies that "re-tweet does not equal endorsement", so any comment could be about anything. But in the context of the parsimonygate hash tag, it could be read as an argumentum ad populum, on the lines of: see, hardly anybody uses parsimony these days, those guys are fringe.

That, however, would make little sense regardless of one's position on the silly parsimony versus model controversy. It would be much harder to figure out how often people use methods than how often they cite programs, but it should be obvious that many of the people citing PAUP, PHYLIP or MEGA have also used the parsimony methods implemented in those programs. TNT is just one of the parsimony programs out there, and it unsurprising that it is not the most popular one, seeing how it uses an idiosyncratic data format and scripting language instead of the more widely used Nexus, Newick and Phylip formats.

The other notable tweets are the series of comments that appeared after the publication of a recent study comparing the performance of Bayesian and parsimony methods on discrete morphological characters. (This is of some interest to myself. My preference when faced with DNA sequence data is to use likelihood, but the results of using the Mk model on morphology generally seem nonsensical to me.) Samples:
Bayesian phylogenetic methods outperform parsimony in estimating phylogeny from discrete morphological data (link to paper)
and
Time to abandon all parsimony? (link to paper)
Wow, that paper must have really shown parsimony to have trouble! Let's look at the paper then:
Only minor differences are seen in the accuracy of phylogenetic topology reconstruction between the Bayesian implementation of the Mk-model and parsimony methods. Our findings both support and contradict elements of the results of Wright & Hillis [5] in that we can corroborate their observation, that the Mk-model outperforms equal-weights parsimony in accuracy, but the Mk-model achieves this at the expense of precision.
Oh.

Again, I cannot stress enough that I am a methods pragmatist who regularly uses both parsimony and model-based approaches. I also appreciate that there are indeed phylogeneticists who are irrationally opposed to model-based methods. But are these not examples of rather, shall we say?, selective perception of what turned out to be a trade-off situation with minor difference either way?

Wednesday, June 8, 2016

Botany picture(s) #229: Stylidium from Western Australia



The Australian genus Stylidium is remarkable for its moving floral column. When an insect visits the flower, this column is released and hits the visitor like a little hammer. The mechanism has given the genus the common name triggerplants. Above is the largest species of the genus, Stylidium schoenoides, commonly called cow kicks, photographed in Western Australia in 2012.


And at the other end of the spectrum, there are plants like these, also photographed on the same field trip. I do not know what species it is nor if its column is still functional, but it is another Stylidium. Extreme miniaturisation and very short life-cycles, often accompanied by self-pollination, are a common adaptation strategy to highly irregular rainfall in the Australian arid zone.

Saturday, June 4, 2016

Journal citation styles

Among the minor frustrations of publishing scientific papers is formatting, in particular of reference lists. Every journal has its own format, with only sister journals published by the same society generally using the same. Here just a sample of the same paper as formatted for the reference lists of four journals I have published in:

Breitwieser I, Ford KA, Smissen RD (2010) A test of reproductive isolation among three sympatric putative species of Craspedia (Asteraceae: Gnaphalieae) at Mt Arthur in New Zealand. New Zealand Journal of Botany 48, 75–81.

Breitwieser, I., Ford, K.A., Smissen, R.D. 2010. A test of reproductive isolation among three sympatric putative species of Craspedia (Asteraceae: Gnaphalieae) at Mt Arthur in New Zealand. New Zealand Journal of Botany. 48, 75–81.

Breitwieser I., Ford K.A., & Smissen R.D. (2010) A test of reproductive isolation among three sympatric putative species of Craspedia (Asteraceae: Gnaphalieae) at Mt Arthur in New Zealand. New Zealand Journal of Botany, 48, 75–81. 

Breitwieser, I., Ford, K.A. & Smissen, R.D. (2010) A test of reproductive isolation among three sympatric putative species of Craspedia (Asteraceae: Gnaphalieae) at Mt Arthur in New Zealand. New Zealand Journal of Botany 48: 75–81.

The problem is, of course, that it would be extremely fiddly to reformat every journal name into italics or every volume number into bold every time that needs to be done. Wouldn't it be nice if publishers agreed on a standard, at least across the same field of science?

This is where reference managers come in. Perhaps the best known currently is the commercial software EndNote, and I could get it at work. But because I want something that I can also use on my own computer at home I prefer to use the open source software Zotero, which works through the browser. It has got very user friendly and flexible, and as far as I can tell there are only two downsides left.

The first is that when I need an in-line reference I only have the options of inserting "(Smith 2010)" or "(2010)", the latter by ticking the 'suppress author' option. It would be neat if it could also offer the option of "Smith (2010)", as EndNote apparently does.

The second is quite simply that the number of available citation styles - see above - is still a bit limited. For example, the Zotero Style Repository does not currently appear to have:
  • Botanical Journal of the Linnean Society
  • Systematic Botany
  • Taxon
These are not exactly obscure under any measure, they are three of the most important plant systematics journals globally. Admittedly the Bot. J. has the same one as the Biological Journal published by the same society, but Taxon for example has a very unique style.

There appear to be two ways of building one's own style. As they have to be coded in the CSL format, one can learn that format and 'program' a style. Obviously that is a bit of a barrier; it amounts to learning what is pretty much a super-specialised, extremely cryptic scripting language that the average scientist will not need for anything else, ever. I at least do not really consider learning all the commands an efficient use of my time.

Luckily, a Visual CSL Editor has been made available (apparently by Mendeley and Columbia University?), and yesterday I experimented with it a bit. The idea is to start by searching the style database of the site for an existing style closely matching the desired one, changing it as desired, and exporting it to disk as a CSL file. It can then simply be dragged into an open browser window to install it in Zotero (if Zotero is what you are using).

This is much easier than scripting a style but also has its hurdles to overcome. The thing looks like this:


In the upper right field we have the way the example citations would be formatted under the style, so that is clear.

Below that is a field showing all the current settings for the part of a reference that we are currently working on. This is where it gets a bit more difficult. In the above screenshot, we can guess that delimiter is what separates author names, and that prefix and suffix are elements that would precede or follow any author name. In other cases it is harder to figure out what something means.

But the main problem is navigation. Looking at the field on the left we see a bewildering tree of fields that may or may not need to be edited. What is more, many of them have the same name, for example "names". Here we realise, for example, that our guess from the previous paragraph was wrong. Although we are in a field called "names" (directly under another field called "names", nested two levels under yet another field called "names"!), the delimiters, prefixes and suffixes are actually for the collective of all authors of a paper. Why do they have to have a delimiter between each other when they are actually delimited by year of publication, name of paper, journal name, volume number, page numbers, and line break? God and the people who came up with the CSL format will know.

In other words, there are so many more fields that I can ignore that I find it difficult to find the ones I do need, and they have the same names and are often directly next to each other. On the other hand, I was unable to figure out how to tell the style that it should use "& al." instead of "et al.", as required by the journal Taxon. Maybe that is impossible, or maybe I just can't find the option?

Of course, this is available for free and much, much better than scripting. I am extremely grateful to have this option. But still, it is clear why few people would find themselves able and the time to write their own styles...