Tuesday, July 28, 2015

What to avoid when making an identification key, continued

Currently I am reading a very long taxonomic paper. I am not making much progress due to various unrelated issues, but even less than a fifth of the way in I have already noticed several cardinal sins of identification keys. So here is another list of how not to do it:

Key symmetry

All couplets except one in a large and crucial key of the paper are structured as "either this or everything else". This means that, apart from that single couplet, the key is as long as it can possibly be for its number of solutions, making it much more tedious and difficult to use than necessary. The ideal key should be shorter by making it more symmetric, with most couplets dividing the remaining solutions into two more or less equal halves.

In case this is unclear, imagine you have eight species to key out. If you always choose couplets so that they divide the remaining solutions equally, the user will have to go through log2 of 8 = 3 couplets to identify their specimen. If you always choose couplets so that they divide one versus the rest, the user will have to go through an average of (1+2+3+4+5+6+7+7)/8 = 4.375 couplets to identify their specimen, assuming that all species are equally likely to need identification. (Edited to correct the math. Ye gods, I must have been half asleep when I wrote this post.) The people who have the most deeply nested species at hand will need to go through every single couplet in the key!
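For anybody who wants to check the arithmetic or try other numbers of species, here is a rough Python sketch. The function names are made up for this illustration, and it assumes, as above, that all species are equally likely to need identification and that a symmetric key splits the remaining species as evenly as possible at every couplet.

```python
def comb_key_depths(n):
    """Couplets needed per species in a 'one versus the rest' key."""
    # Species i keys out at couplet i; the last two species share the final couplet.
    return [min(i, n - 1) for i in range(1, n + 1)]


def balanced_key_depths(n):
    """Couplets needed per species when every couplet halves what remains."""
    if n <= 1:
        return [0] * n  # a single remaining species needs no further couplet
    half = n // 2
    deeper = balanced_key_depths(half) + balanced_key_depths(n - half)
    return [d + 1 for d in deeper]


def average(depths):
    return sum(depths) / len(depths)


n = 8
print(average(balanced_key_depths(n)))  # 3.0
print(average(comb_key_depths(n)))      # 4.375
print(max(comb_key_depths(n)))          # 7, i.e. every couplet in the key
```

The gap grows quickly with the size of the key: in the one-versus-the-rest layout the worst case is always n - 1 couplets, whereas in the symmetric layout it grows only with the logarithm of n.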

"Characters not in this combination"

Worse, several of the couplets in these overly long strings of questions consist of one lead that gives an extremely specific character combination, and the alternative is "not as above" or "not with the above character combination". This means that the end user will have to check an annoying number of characters, some of them very obscure, to ensure that yes, the plant they have in front of them differs in that one character even if several others agree. Again, tedious and unnecessarily difficult to use.

Bad contrasts

Another issue is that there are questions that provide really poor contrasts. What you want to see is "leaves hairy" versus "leaves glabrous". What I see in this paper is the equivalent of "leaves mostly hairy" versus "leaves mostly glabrous". This is accompanied by a second character going "character absent" versus "character absent or present", and that's it. Ye gods.

Lots of exceptions

Finally, there are several couplets along the lines of "leaves hairy" versus "leaves hairy except in species X", which means that species X should just have been moved to the other half of the key.

This is presumably because of a misguided desire to structure the key by the underlying systematics. I see that quite often in taxonomists who have worked on a group for a very long time and know it very well. There appears to be an unconscious desire to lead the user through the classification, but what the user really needs are not the characters and divisions that make for a natural classification but the characters that are easy to see and the divisions that make the key as short as possible.

If that means that the same genus has to be keyed out three times, so be it. If that means that totally unrelated genera come out next to each other, so be it. This is not a classification, it is an identification key.

The frustrating thing is, the paper I am reading was not written by a beginner, quite the opposite...
