Sunday, March 1, 2015

Parsimony analysis in TNT using the command line version

I guess I can just as well make it a habit to blog some advice whenever I have dealt with a recalcitrant piece of software. Today: Tree analysis using New Technology (TNT).

As I have mentioned before, there are four main ways of inferring phylogenetic trees of evolutionary relationships:
  • Distance/clustering analysis. This is not really a phylogenetic analysis in the strict sense but merely clusters terminals by their similarity, but on the plus side clustering is always extremely fast. There are several programs that can do it, including good old PAUP and MEGA.
  • Likelihood analysis. Simplifying a bit one could say it searches for the tree with the best log likelihood score given a model of sequence evolution and the data. Again there are several programs available to do this kind of analysis, including PAUP, MEGA and PHYLIP. Calculating likelihood values across large phylogenetic trees is computationally intensive, and thus they can take quite some time for larger datasets. This is why somebody wrote the software RAxML, which is designed to do complex likelihood searches with seemingly ridiculous speed by cutting a few corners.
  • Bayesian phylogenetics. This approach estimates the posterior probability of phylogenetic relationships with a Marcov Chain Monte Carlo (MCMC) method. Standard software packages for this are MrBayes and BEAST. If you want a quick answer, you are out of luck though, because MCMC always takes time.
  • Parsimony analysis. The logic here is to find the tree with the lowest number of character changes along the branches, under the assumption that, all else being equal, the simplest explanation is the best. It is often considered less sophisticated than the previous two approaches but it comes with less assumptions; I like it that I know where the computer has its hands, so to say. Once more PAUP, MEGA and PHYLIP implement parsimony searches but they are fairly slow for larger datasets.
This is where TNT comes in. Published in 2003 and made free-ware through a subsidy of the Willi Hennig Society in 2007, TNT could be called the RAxML of parsimony analysis. It can take a fairly large dataset and finish the tree search before PAUP has got its shoes on. What is more, in addition to the already fast standard search it implements the innovative search strategies that gave it its New Technology name part, such as the Parsimony Ratchet. When you use these you will know what speed means!

Sadly, the program has a few downsides. First, its input and output formats are rather idiosyncratic. Second, it has a GUI only on the Windows version but not on Mac or Linux, so that you will have to use command line and scripting on the latter two systems. Third, the documentation is unsystematic and unhelpful, making it very hard to figure out how to effectively use the command line and scripting. Actually, that is not quite true; documentation on scripting per se seems to be okay, it is rather the simple standard analyses that aren't explained anywhere.

This is why I am writing this post. I have just done a simple analysis, and I want to spare others the same investment in time and frustration, and I want to be able to look up my own post in the future, especially should some time pass before I use TNT again.

So, how do we find things out? A vaguely manual-like HTML file is distributed with the software, and you will also find something called TNT Scripts - General Documentation on its website. But again, while both of these tell you how to write loops and if then clauses in complex scripts, they do not provide a systematic overview over the commands you need to run simple TNT analyses on Linux. Wouldn't it be nice if we could just have something like the PAUP command reference manual and text-search it for the explanation of the command that changes the maximum number of trees to be retained?

The next best thing to do is to fire up TNT and to look at the help function, so let us start there. All the following assumes Linux, by the way, but except for starting the program it should be the same on Mac. You turn the program on by entering ./tnt in its folder; if it doesn't work you may have to sudo it. The prompt should change to tnt*>. You can now simply enter TNT commands, once you know them that is.

But before we consider these commands, a general remark is in order: sometimes you will enter what you think is a perfectly sensible command only to be greeted by a new prompt named after the command in question, seemingly asking you for more parameters. In that case, you will most likely simply have to enter a semicolon (;), because that is the character that signifies the end of a command. Often, however, TNT seems happy without the semicolon on the end, no idea what makes the difference.

Anyway, type help ; to get a list of all the commands. Now that you know what they are called, you can get information on each of them individually by typing help commandname. This plainly unsatisfying approach is partly how I figured out what follows; in other cases, some googling was of assistance.

Okay, lets think about an analysis. First prepare a data matrix, and to be safe it should probably have the TNT format:
xread
10 5
Species1   ACGTACGTAC
Species2   ACGTACGTAC
Species3   ACGTCCGGAC
Species4   ATGTCCGGAC
Species5   ACGTACGGAC
;
As you can see, you can just use a PHYLIP file as you would use for RAxML, swap number of characters and number of terminals around, and add xread and the final semicolon. (Still, wouldn't it be nice if all phylogenetic programs used the same matrix format?)

Many people seem to use TNT for parsimony analyses of small morphological datasets, which seems a bit like using a tactical nuke to kill a fly. My scenario is generally that of a DNA sequence dataset with dozens to hundreds of terminals and thousands of characters. Before we can import the data file, we will therefore have to make a few preparations. First, TNT is set to a ridiculously low memory usage and will thus generally throw an error if you attempt to import a large matrix. Enter mxram 200 to set memory usage to 200 MB or whatever is realistic and necessary. Second, tell the program to expect DNA data by entering nstates DNA, then nstates NOGAPS to tell it to treat gaps as missing data as opposed to a fifth state.

Now read the data file with procedure filename. Only from now on are we allowed to set the number of equally parsimonious trees to be retained during search. The default is 100, but that is way too low. In fact in my experience this value has a much stronger influence on whether you will find the best trees than the number of search replicates, because a higher number of trees retained gives the program more to swap on. Rectify the situation with hold 1000 or whatever number you prefer.

From here I am mostly adapting the information helpfully provided a few years ago by another blogger, Matthew Vavrek (thanks!). Start logging the analysis output into a text file with log filename. To do a simple, preliminary search, type mult; if you want more replicates, use mult=replic 10 for example. The resulting trees that are now in memory will be alright but still not the best, so they will be used in a more thorough search with bbreak=tbr. This latter search takes time and will, unless there is really only a very low number of equally parsimonious trees, probably fill up the number of trees to be retained. Because you want the strict consensus of all those trees you have, you enter nelsen *.

At this stage, we have got some real results and want to save them. Unfortunately, TNT is set to save taxa only as numbers, making it impossible to interpret the trees it saves. So first we set taxname=, which cryptic command tells it to save taxon names.

To obtain trees in the usual Newick format used by nearly all phylogenetic software on the planet, we have to export the results into Nexus format using export * filename; here the asterisk is crucial to get the actual trees, otherwise you will only export the data matrix.

The resulting Nexus file can be opened with FigTree, for example, but it will not understand the branch lengths. For that I find it necessary to open the file in PAUP and export the trees once more from there, but there is probably an easier way. (Update 6 May 2015: I have yet to find an easier way. Even Mesquite doesn't seem to get it. I really really wonder why TNT wasn't just programmed to export branch lengths.)

Update 25 Jan 2017: To export the trees with branch lengths, use export > filename; thanks to Matt Buys for the information.

Finally, branch support. There are scripts out there to do Bremer support (Decay Indices), but for DNA sequence data bootstrapping or jackknifing is more common. The command here is simply resample replications 200 for two hundred bootstrap replicates. Sadly, I do not know how to export a tree with bootstrap values or if that is even possible, but the results will be written into the log file you set up earlier. (Update 25 Jan 2017:) Export the bootstrap results with export - filename.nex; or export < filename.nex; - thanks again to Matt Buys. These two parameters tell the program to add tree tags, which I assume the bootstrap values are automatically considered to be (?).

Leave the program by typing quit. To script all the above commands, write them one after the other into a text file (with space and semicolons at the ends of each line!) and run it from TNT with procedure runfilename.
mxram 200 ;
nstates DNA ;
nstates NOGAPS ;
procedure yourdatamatrix.tnt ;
log yourlogfilename.txt ;
hold 1000 ;
mult=replic 10;
bbreak=tbr ;
nelsen * ;
taxname= ;
export > yourfilename.nex ;
resample replications 200 ;
export - yourfilename_bs200.nex ;
quit
Cheers!

(Updated 6 May 2015 to add the taxname= command and 25 Jan 2017 to include information on branch length export.)

19 comments:

  1. To get the bootstrap values out use ttags =;....ttags -/;

    ReplyDelete
    Replies
    1. Thanks! I will work it into the above post when I have a bit more time.

      Delete
    2. Hm. I see that this command makes TNT output the bootstrap values into the log file, but I assume that it is still impossible to output them into a Nexus (Newick) tree file?

      Delete
    3. If you still need an answer to this question, to save bootstrap values (and/or other tree tags) you need to stop storing, not clear, tree tags before you write them to a file:

      ttags =;
      <tag generating commands>
      ttags );
      tsave *tags.tre;
      save * <tree number>;
      tsave /;

      Hope that helps!

      Delete
    4. Thanks, will try that out next time!

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Hi Alexander,

    Thanks for your helpful post on TNT! I've been playing around with it myself and have managed to get it to work (I think) as I would like (on Ubuntu). I'll share a few things I noticed.

    For branch lengths, I found that you have to set them as a ttag before exporting the tree (like the bootstrap values). Like you, I am finding the output format to be strangely idiosyncratic. It says it is in Nexus format, but the R package ape doesn't recognize it as a Nexus tree (perhaps a limitation of ape?). The only way I could get it to work so far is to manually alter the output file to delete all the Nexus commands (left with only the parenthetical tree) and add ":" before each branch length. This altered file can be opened in FigTree. Note: this works when I am dealing with a single consensus tree with branch lengths. When I tried something similar with the bootstrap tree (with branch lengths AND bootstrap support), I couldn't get it to work. But, like you said, the support values are available in the log file.

    It was frustrating that the tnt prompt doesn't allow scrolling up (e.g. ">help bbreak" will only display the end of the long documentation for bbreak, at least for me). I got around this by starting tnt, setting a log file (">log logfile.log"), then typing all the help calls I was interested in. These get output to the log file and you can scroll through that.

    Here are the commands I put in a file to do a similar run to the one you suggested:

    mxram 2000;
    nstates NOGAPS;
    nstates dna;
    hold 1000 ;
    log tnt_run_log ;
    taxname = ;
    collapse tbr ;
    mult=replic 10 ;
    bbreak = tbr ;
    nelsen * ;
    tchoose { strict } ;
    ttags = ;
    blength *;
    ttags ;
    export - consensus_tree.tre ;
    ttags = ;
    collapse tbr ;
    resample replications 100 ;
    keep 1 ;
    ttags ;
    export - boots.tre
    proc/;

    I wonder if there is a way to erase the branch lengths ttags that get carried over to the bootstrap tree, making the output easier to edit. Perhaps it is fairly straightforward, but I haven't taken the time to read through all the documentation.

    I hope this is helpful, and if anyone else has some pointers or comments on this approach, please comment!

    Cheers,

    Ben

    ReplyDelete
    Replies
    1. That is great, thanks for this useful addition! Just two weeks ago I tried playing around with ttags, but for some reason I wiped all my trees from memory before I exported the branch lengths. I think I know what I did wrong but didn't pursue it because in that instance I really only needed the topology. Will try out your script the next time.

      Delete
  4. I think I figured out how to clear those ttags (based on earlier posts which I did not read clearly enough). So, here goes:

    hold 1000 ;
    log tnt_run_log ;
    taxname = ;
    mult=replic 100 tbr;
    bbreak = tbr ;
    nelsen * ;
    tchoose { strict } ;
    ttags = ;
    blength *;
    ttags );
    export - consensus_tree.tre ;
    ttags -;

    ttags = ;
    collapse tbr ;
    resample replications 100 ;
    keep 1 ;
    ttags );
    export - boots.tre
    proc/;

    ReplyDelete
  5. Hi Alex,

    A quick update on my TNT runs. I have adjusted for a (I think) more thorough analysis (takes longer) and I have figured out how to get the output I want. As an aside, with my nextgen data, it seems it only ever finds one most parsimonious tree. I'm not sure if this is because of how I'm running the program or because the signal is fairly strong.

    Here are my TNT commands:

    log tnt_run_log ;
    mxram 2000;
    nstates NOGAPS;
    nstates dna;
    hold 10000 ;
    taxname = ;

    xmult=hits 10 noupdate nocss replic 10 ratchet 10 fuse 1 drift 5 hold 100 noautoconst keepall;
    bbreak = tbr ;
    nelsen * ;
    tchoose { strict } ;
    ttags = ;
    blength *;
    ttags );
    ttags ;
    export - consensus_tree.tre ;
    ttags -;

    ttags = ;
    resample replications 100 [ mult=replic 10 tbr hold 1000 ];
    keep 1 ;
    ttags );
    ttags ;
    export - boots.tre;
    proc/;


    This produces two outputs: a consensus tree with tags of branch lengths, and a bootstrap tree with tags of support. The bootstrap tree is ok, as all you need to do is manually delete everything in the file except the parenthetical tree with the bootstrap values, and then it opens in FigTree (in which you can assign the values and display on branches).
    The problem is the consensus tree, which we want to have tags interpreted as branch lengths. I wrote a simple Python script to quickly extract the tree from the file. Here are the commands for anyone interested:

    import sys # for accessing command line arguments
    import re # for regular expressions

    tree_newick = "" # create an empty string variable to hold the tree
    tree_line = "" # create an empty string to store the line in the file that has the tree

    with open(sys.argv[1], 'r') as tnt_output:
    for line_num, line in enumerate(tnt_output, start=1):
    if line.startswith("tree tagged_tree"):
    tree_line = line_num + 1
    if line_num == tree_line:
    tree_newick = re.sub(" (\d+) ", ":\g<1> ", line)
    tree_newick = re.sub(" (\d+),", ":\g<1>, ", tree_newick)

    output = open(sys.argv[1].replace(".tre", "") + "_newick.tre", 'w')
    output.write(tree_newick)
    output.close()

    Cheers,

    Ben

    ReplyDelete
    Replies
    1. Again, thanks! I think it is increasingly likely to get only a single tree the more data one has.

      Delete
    2. Whoops, I missed a line in my Python script, and I don't know how to indent in this posting (indentation is essential for Python). Here's a better version of part of the above code with (tab) for indentation levels:

      with open(sys.argv[1], 'r') as tnt_output:
      (tab)for line_num, line in enumerate(tnt_output, start=1):
      (tab)(tab)if line.startswith("tree tagged_tree"):
      (tab)(tab)(tab)tree_line = line_num + 1
      (tab)(tab)if line_num == tree_line:
      (tab)(tab)(tab)tree_newick = re.sub(" (\d+) ", ":\g<1> ", line)
      (tab)(tab)(tab)tree_newick = re.sub(" (\d+),", ":\g<1>, ", tree_newick)
      (tab)(tab)(tab)tree_newick = re.sub(" (\d+)\)", ":\g<1>)", tree_newick)

      Delete
    3. For a tree exported using:

      export > consensus_tree.tre ;

      and using Ben's Python script, I wrote a Bash script:

      sed -i 's/[0-9]*\//\:/g' consensus_tree.tre
      sed -i 's/\:\:/\:/g' consensus_tree.tre
      sed -i 's/\:\///g' consensus_tree.tre

      Delete
    4. Howdy

      I've been in touch with Pablo Goloboff, first author of TNT for advice regarding tree formatting. In your list of commands above,

      replace

      export - consensus_tree.tre ;

      with

      export > consensus_tree.tre ;

      This adds branch lengths in a readable format for FigTree. Also, the latest non-beta version of TNT now only provides a single parenthical tree.

      For the bootstrap tree, both

      export - boots.tre;

      and

      export < boots.tre;

      do the same thing(note: not >).

      Delete
    5. That's great, thanks. Don't have time now to put that into the original post, hope I remember to do so later.

      Delete
  6. I am trying to do bootstrap with two different scripts. The first one has 10 characters states and its doing fine, the second one have 32 states, and the bootstrap is not working. The only difference between the two scripts is the number of states. Do you have any idea about what is going on? There is a way to increase the memory just for boostrap? Thanks!

    ReplyDelete
    Replies
    1. Sorry, with so little to go on I would not dare guess. Is there a reason to assume that it is the memory, and if so, why not simply increase it at the beginning of the analysis? The default setting is really very low.

      Delete
  7. Hi,

    I used NONA to calculate Bremer support. The program already finished calculating the trees, following the next prompt;

    > out final_TE;
    > hold 32000;
    > suboptimal 50;
    > mult*1000 max*;
    > bsupport;

    My question is, now that NONA calculated the Bremer of the trees, which prompt should I use to view the tree, in NONOA with the branch supports?

    Thank you!

    ReplyDelete
    Replies
    1. Sorry, I have not yet calculated Bremer support when using TNT, so I cannot help with that at the moment. Might look into Bremer support sometime and do a general post on it, but at the moment I have no time...

      Delete