SumTrees: Phylogenetic Tree Summarization and Annotation

Introduction

SumTrees is a program by Jeet Sukumaran and Mark T. Holder to summarize non-parameteric bootstrap or Bayesian posterior probability support for splits or clades on phylogenetic trees.

The basis of the support assessment is typically given by a set of non-parametric bootstrap replicate tree samples produced by programs such as GARLI or RAxML, or by a set of MCMC tree samples produced by programs such as Mr. Bayes or BEAST. The proportion of trees out of the samples in which a particular split is found is taken to be the degree of support for that split as indicated by the samples. The samples that are the basis of the support can be distributed across multiple files, and a burn-in option allows for an initial number of trees in each file to be excluded from the analysis if they are not considered to be drawn from the true support distribution.

The support for the splits will be mapped onto one or more target trees either in terms of node labels or branch lengths. The target trees can be supplied by yourself, or, if no target trees are given, then a a summary tree can be constructed. This summary tree can be a Maximum Clade Credibility Topology (i.e., MCCT, a topology that maximizes the product of the the clade posterior probabilities), the majority-rule clade consensus tree, or some other type. If a majority-rule consensus tree is selected, you have the option of specifying the minimum posterior probability or proportional frequency threshold for a clade to be included on the consensus tree.

By default SumTrees will provide summaries of edge lengths (i.e., mean, median, standard deviation, range, 95% HPD, 5% and 95% quantiles, etc.) as special node comments. These can be visualized in FigTree by, for example, checking “Node Labels”, then selecting one of “length_mean”, “length_median”, “length_sd”, “length_hpd95”, etc. If the trees are ultrametric and the “--summarize-node-ages” flag is used, or edge lengths are set so that node ages on the output trees correspond to be mean or median of the node ages of the input trees (“--edges=mean-age” or “--edges=median-age”), then node ages will be summarized as well. In all cases, the flag “--suppress-annotations” will suppress calculation and output of these summaries.

If you are processing multiple source files and you have multiple cores available on your machine, you can specify the “-M” flag to use all the cores or, e.g., “-m 4” to use 4 cores. Using multiple cores will, of course, speed up processing of your files.

Where to Find the Package

SumTrees is distributed and installed as part of the DendroPy phylogenetic computing library, which has its homepage here:

with the main download page here:

How to Install the Package

DendroPy is fully easy-installable and can be installed using pip:

$ sudo pip install -U dendropy

If you do not have pip installed, you should definitely install it ! Note: the “sudo” command should only be used if installing system-wide on a machine on which you have administrative privileges. Otherwise, you would use the “--user” flag for a local user install:

$ pip install --user -U dendropy

These, and other ways of obtaining and installing DendroPy (e.g., by downloading the DendroPy source code archive, or by cloning the DendroPy Git repository), are discussed in detail in the “Downloading and Installing DendroPy” section.

Checking the Installation

If the installation was successful, you should be able to type “sumtrees.py” in the terminal window and see something like the following (with possibly a different date or version number):

/==========================================================================\
|                                 SumTrees                                 |
|                     Phylogenetic Tree Summarization                      |
|                       Version 4.0.0 (Jan 31 2015)                        |
|                   By Jeet Sukumaran and Mark T. Holder                   |
|    Using: DendroPy 4.0.0.dev (DendroPy4-feea2b0, 2015-06-02 20:13:49)    |
+--------------------------------------------------------------------------+
|                                 Citation                                 |
|                                 ~~~~~~~~                                 |
| If any stage of your work or analyses relies on code or programs from    |
| this library, either directly or indirectly (e.g., through usage of your |
| own or third-party programs, pipelines, or toolkits which use, rely on,  |
| incorporate, or are otherwise primarily derivative of code/programs in   |
| this library), please cite:                                              |
|                                                                          |
|   Sukumaran, J and MT Holder. 2010. DendroPy: a Python library for       |
|     phylogenetic computing. Bioinformatics 26: 1569-1571.                |
|                                                                          |
|   Sukumaran, J and MT Holder. SumTrees: Phylogenetic Tree Summarization. |
|     4.0.0 (Jan 31 2015). Available at                                    |
|     https://github.com/jeetsukumaran/DendroPy.                           |
|                                                                          |
| Note that, in the interests of scientific reproducibility, you should    |
| describe in the text of your publications not only the specific          |
| version of the SumTrees program, but also the DendroPy library used in   |
| your analysis. For your information, you are running DendroPy            |
| 4.0.0.dev (DendroPy4-feea2b0, 2015-06-02 20:13:49).                      |
\==========================================================================/

usage: sumtrees.py [-i FORMAT] [-b BURNIN] [--force-rooted] [--force-unrooted]
                [-v] [--weighted-trees] [--preserve-underscores]
                [--taxon-name-file FILEPATH] [-t FILE] [-s SUMMARY-TYPE]
                [-f #.##] [--allow-unknown-target-tree-taxa]
                [--root-target-at-outgroup TAXON-LABEL]
                [--root-target-at-midpoint] [--set-outgroup TAXON-LABEL]
                [-e STRATEGY]
                [--force-minimum-edge-length FORCE_MINIMUM_EDGE_LENGTH]
                [--collapse-negative-edges] [--summarize-node-ages]
                [-l {support,keep,clear}] [--suppress-annotations] [-p]
                [-d #] [-o FILEPATH] [-F {nexus,newick,phylip,nexml}]
                [-x PREFIX] [--no-taxa-block]
                [--no-analysis-metainformation] [-c ADDITIONAL_COMMENTS]
                [-r] [-M] [-m NUM-PROCESSES] [-g LOG-FREQUENCY] [-q]
                [--ignore-missing-support] [-h] [--citation]
                [--usage-examples] [--describe]
                [TREE-FILEPATH [TREE-FILEPATH ...]]

Type 'sumtrees.py --help' for details on usage.
Type 'sumtrees.py --usage-examples' for examples of usage.

You can now delete the original downloaded archive and unpacked directory if you want.

How to Use the Program

SumTrees is typically invoked by providing it a list of one or more tree files to be summarized:

$ sumtrees.py [OPTIONS] <TREEFILE> [<TREEFILE> [<TREEFILE> ...]]]

Common options include specification of a target topology onto which to map the support (“-t” or “--target”), a summary tree to use (e.g., “-s mcct” or “--summary-target=mcct” an output file (“-o” or “--output”), and a burn-in value (“-b” or “--burnin”).

Full help on program usage and options is given by using the “--help” option:

$ sumtrees.py --help

Quick Recipes

Summarization of Posterior Probabilities of Clades with a Consensus Tree

Summarize a set of tree files using a 95% rule consensus tree, with support for clades indicated as proportions (posterior probabilities) using branch labels, and branch lengths being the mean across all trees, dropping the first 200 trees in each file as a burn-in, and saving the result to “result.tre”:

$ sumtrees.py --min-clade-freq=0.95 --burnin=200 --output-tree-filepath=result.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -f0.95 -b200 -o result.tre treefile1.tre treefile2.tre treefile3.tre

Summarization of Posterior Probabilities of Clades with a Maximum Clade Credibility Tree (MCCT)

Summarize a set of tree files using a tree in the input set that maximizes the product of clade support, with support for clades indicated as proportions (posterior probabilities) using branch labels, and branch lengths the mean across all trees, dropping the first 200 trees in each file as a burn-in, and saving the result to “result.tre”:

$ sumtrees.py --summary-target=mcct --burnin=200 --support-as-labels --output-tree-filepath=result.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -s mcct -b200 -l -o result.tre treefile1.tre treefile2.tre treefile3.tre

Non-parametric Bootstrap Support of a Model Tree

Calculate support for nodes on a specific tree, “best.tre” as given by a set of tree files, with support reported as percentages rounded to integers, and saving the result to “result.tre”:

$ sumtrees.py --decimals=0 --percentages --output-tree-filepath=result.tre --target=best.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -d0 -p -o result.tre -t best.tre treefile1.tre treefile2.tre treefile3.tre

Set Node Ages of Target or Summary Tree(s) to Mean/Median Node Age of Input Trees

Summarize a set of ultrametric tree files using a 95% majority-rule consensus tree, with support for clades indicated as proportions (posterior probabilities) using branch labels, and branch lengths adjusted so the ages of internal nodes are the mean across all trees, dropping the first 200 trees in each file as a burn-in:

$ sumtrees.py --min-clade-freq=0.95 --burnin=200 --edges=mean-age --output-tree-filepath=result.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -f0.95 -b200 -o result.tre -l -e mean-age treefile1.tre treefile2.tre treefile3.tre

To use the median age instead:

$ sumtrees.py --min-clade-freq=0.95 --burnin=200 --edges=median-age --output-tree-filepath=result.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -f0.95 -b200 -o result.tre -e median-age treefile1.tre treefile2.tre treefile3.tre

Running in Parallel Mode

Running in parallel mode will analyze each input source in its own independent process, with multiple processes running in parallel. Multiprocessing analysis is invoked by adding the “-m” or “--multiprocessing” flag to the SumTrees command, and passing in the maximum number of processes to run in parallel. For example, if your machine has two cores, and you want to run the previous analyses using both of them, you would specify that SumTrees run in parallel mode with two processes by adding “-m2” or “--multiprocessing=2” to the SumTrees command invocation:

$ sumtrees.py --multiprocessing=2 --decimals=0 --percentages --output-tree-filepath=result.tre --target=best.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -m2 -d0 -p -o result.tre -t best.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py --multiprocessing=2 --min-clade-freq=0.95 --burnin=200 --support-as-labels --output-tree-filepath=result.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -m2 -f0.95 -b200 -l -o result.tre treefile1.tre treefile2.tre treefile3.tre

You can specify as many processes as you want, up to the total number of tree support files passed as input sources. If you want to use all the available cores on your machine, you can use the “-M” or “--maximum-multiprocessing” flag:

$ sumtrees.py --maximum-multiprocessing --decimals=0 --percentages --output-tree-filepath=result.tre --target=best.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -M -d0 -p -o result.tre -t best.tre treefile1.tre treefile2.tre treefile3.tre

If you specify fewer processes than input sources, then the files will be cycled through the processes.

Primers and Examples

At its most basic, you will need to supply SumTrees with the path to one or more tree files in Newick or NEXUS format that you want to summarize:

$ sumtrees.py phylo.tre

As no target tree was provided and no summary tree type was specified, SumTrees will, by default construct ad 50% majority-rule clade consensus tree of all the trees found in the file “phylo.tre” as the summarization target. The internal node labels of the resulting consensus tree will, by default, indicate the proportion of trees in “phylo.tre” in which that clade was found, while the branch lengths of the resulting consensus tree being set to the mean of the branch lengths of that clade across all the trees in “phylo.tre”.

If you have split searches across multiple runs (across, for example, multiple computers, so as to speed up the search time), such that you have multiple tree files (“phylo.run1.tre”, “phylo.run2.tre”, “phylo.run3.tre”, …), you can instruct SumTrees to consider all these files together when summarizing the support by simply listing them one after another separated by spaces:

$ sumtrees.py phylo.run1.tre phylo.run2.tre phylo.run3.tre

As before, the above command will construct a 50% majority-rule consensus tree with clade supported indicated by internal node labels and branch lengths being the mean across all trees, but this time it will use all the trees found across all the files listed: “phylo.run1.tre”, “phylo.run2.tre”, and “phylo.run3.tre”.

You will notice that the final resulting tree is displayed to the terminal and not saved anywhere. It will probably be more useful if we can save it to a file for visualization for further analysis. This can be done in one of two ways, either by redirecting the screen output to a file, using the standard (at least on UNIX and UNIX-like systems) redirection operator, >:

$ sumtrees.py phylo.tre > phylo.consensus.sumtrees

or by using the or “--output-tree-filepath” option:

$ sumtrees.py --output-tree-filepath=phylo.consensus.sumtrees phylo.tre

If the files are in different directories, or you are not in the same directory as the files, you should use the full directory path specification:

$ sumtrees.py --output-tree-filepath=/Users/myself/MyProjects/phylo1/final/phylo.consensus.sumtrees /Users/myself/MyProjects/phylo1/phylo.tre

More extended options specify things like: where to save the output (by default it goes to the screen), the topology or tree to which to map the support (user-supplied or consensus tree), the output format (NEXUS by default, but it can also be Newick), whether support is indicated in terms of proportions or percentages etc. All of these options are specified on the command line when invoking the program, with multiple options separated by spaces. Many of the options have two alternate forms, a long form (a word or phrase preceded by two dashes, e.g., “--option”) and a short form (a single letter preceded by a single dash, “-o”). The long form of the options needs an equals sign before setting the paramater (e.g., “--option=1”), while the short one does not (e.g., “-o1” or “-o 1”). Most of the options have default values that will be used if not explicitly set when the program is invoked. The order that the options are given does not matter, i.e., “sumtrees.py --option1=something --option2=something” is the same as “sumtrees.py --option2=something --option1=something”. As mentioned above, full details on these options, their long and short forms, as well as their default values will be given by invoking the program with the “--help” or “-h” option: “sumtrees.py --help”.

Specifying and Customization of the Summarization Target

SumTrees maps support values calculated from the input set of trees onto a target topology. This target topology can be a summary topology constructed from the input set of trees based on a strategy specified by the user (using the the “--summary-target” or “-s” flag to specify, for example, a majority-rule consensus tree or a maximum credbility tree) or a topology provided by the user (using the “--target-tree-filepath” or “-t” option to provide, e.g., a maximum-likehood estimate or some other topology sourced by other means).

Specifying a Summarization Topology Type

You can specify the type of summary topology onto which the support values are mapped using the “--summary-target” or “-s” option. This option takes one of three values as an argument:

   
“consensus” The majority-rule consensus tree (default)
“mcct” The Maximum Credibility Tree: the topology from the input set that maximizes the product of the support of the clades
“msct” The Maximum Sum of Credibilities Tree: the topology from the input set that maximizes the sum of support of the clades.

Majority-Rule Consensus Topology

For example, say you have completed a 1000-replicate non-parametric analysis of your dataset using a program such as GARLI or RAxML. You want to construct a 70% majority-rule consensus tree of the replicates, with support indicated as percentages on the node labels. If the bootstrap replicates are in the file “phylo-boots.tre”, you would then enter the following command:

$ sumtrees.py --summary-target=consensus --min-clade-freq=0.7 --percentages --decimals=0 phylo-boots.tre

Or, using the short option syntax:

$ sumtrees.py -s consensus -f0.7 -p -d0 phylo-boots.tre

Here, the “--min-clade-freq=0.7” or “-f0.7” option lowers the minimum threshold for clade inclusion to 70%. If you want a 95% majority-rule consensus tree instead, you would use “--min-clade-freq=0.95” or “-f0.95”. The default threshold if none is specified is 0.5 (50%). The “--percentages” or “-p” option instructs SumTrees to report the support in terms of percentages, while the “--decimals=0” or “-d 0” option instructs SumTrees not to bother reporting any decimals. Note that even if you instruct SumTrees to report the support in terms of percentages, the minimum clade inclusion threshold is still given in terms of proportions.

Note

As noted, if no target topology is specified (either using the “--summary-target”/”-s” or “--target-tree-filepath”/”-t” options), then SumTrees by default will construct and use a majority-rule consensus topology as a target, and hence the explicit specification of this as a target is not needed. Thus, the following will produce exactly the same results as above:

$ sumtrees.py --min-clade-freq=0.7 --percentages --decimals=0 phylo-boots.tre
$ sumtrees.py -f0.7 -p -d0 phylo-boots.tre

Again, if we want to actually save the results to the file, we should use the “--output-tree-filepath” option:

$ sumtrees.py --output-tree-filepath=phylo-mle-support.sumtrees --min-clade-freq=0.7 --percentages --decimals=0 phylo-boots.tre
$ sumtrees.py -o phylo-mle-support.sumtrees -f0.7 --p --d0 phylo-boots.tre

Maximum Clade Credibility Topology

The Maximum Clade Credibility Topology, or MCCT, is the topology that maximizes the product of the split support. You can use this as the target topology by specifying “--summary-target=mcct” or “-s mcct”:

$ sumtrees.py --summary-target=mcct phylo-boots.tre
$ sumtrees.py -s mcct phylo-boots.tre

As might be expected, in can be combined with other options. For example, to discard the first 200 trees from each of the input sources and write the results to a file, “results.tre:

$ sumtrees.py --summary-target=mcct --burnin=200 --output-tree-filepath=results.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -s mcct -b 200 -o results.tre treefile1.tre treefile2.tre treefile3.tre

Specifying a Custom Topology or Set of Topologies

Say you also have a maximum likelihood estimate of the phylogeny, and want to annotate the nodes of the maximum likelihood tree with the proportion of trees out of the bootstrap replicates in which the node is found. Then, assuming your maximum likelihood tree is in the file, “phylo-mle.tre”, and the bootstrap tree file is “phylo-boots.tre”, you would use the “--target-tree-filepath” options, as in the following command:

$ sumtrees.py --target-tree-filepath=phylo-mle.tre phylo-boots.tre

Here, “--target-tree-filepath” specifies the target topology onto which the support will be mapped, while the remaining (unprefixed) argument specifies the tree file that is the source of the support. An equivalent form of the same command, using the short option syntax is:

$ sumtrees.py -t phylo-mle.tre phylo-boots.tre

If you want the support expressed in percentages instead of proportions, and the final tree saved to a file, you would enter:

$ sumtrees.py --output phylo-mle-support.sumtrees --target-tree-filepath phylo-mle.tre --proportions --decimals=0 phylo-boots.tre
$ sumtrees.py -o phylo-mle-support.sumtrees -t phylo-mle.tre -p -d0 phylo-boots.tre

Collapsing Clades/Edges/Splits Below a Minimum Support Frequency Threshold

The “-f” or “--min-clade-freq” flag sets the threshold below which clades will be collapsed. This applies to all summary targets: user-specified target topologies as well consensus trees, MCCT trees, etc.:

$ sumtrees.py --min-clade-freq=0.75 --percentages --decimals=0 phylo-boots.tre
$ sumtrees.py -f0.75 -p -d0 phylo-boots.tre
$ sumtrees.py --min-clade-freq=0.95 --summary-target=mcct phylo-boots.tre
$ sumtrees.py -f0.95 -s mcct phylo-boots.tre
$ sumtrees.py --min-clade-freq=0.75 --decimals=0 --percentages --target=best.tre treefile1.tre treefile2.tre treefile3.tre
$ sumtrees.py -f0.75 -d0 -p -t best.tre treefile1.tre treefile2.tre treefile3.tre

Note that the minimum support frequency threshold is specified as proportion scaled from 0 to 1, even if the percentages are requested as the support units.

Summarizing Edge Lengths

If a target topology has been specified using the “--target-tree-filepath” or the “-t” option, then by default SumTrees retains the edge lengths of the target topologies. Otherwise, if the input trees are ultrametric and the “--summarize-node-ages” option is given, then by default SumTrees will adjust the edge lengths of the target topology so that the ages of the internal nodes are the mean of the ages of the corresponding nodes in the input set of trees. Otherwise, if no target trees are specified and the “--summarize-node-ages” is not given, the edge lengths of the target topology will be set to the mean lengths of the corresponding edges of the input set. In all cases, these defaults can be overridden by using the “--set-edges” or “-e” option, which takes one of the following values:

  • mean-length: sets the edge lengths of the target/consensus tree(s) to the mean of the lengths of the corresponding edges of the input trees.
  • median-length: sets the edge lengths of the target/consensus tree(s) to the median of the lengths of the corresponding edges of the input trees.
  • median-age: adjusts the edge lengths of the target/consensus tree(s) such that the node ages correspond to the median age of corresponding nodes of the input trees [requires rooted ultrametric trees].
  • mean-age: adjusts the edge lengths of the target/consensus tree(s) such that the node ages correspond to the mean age of corresponding nodes of the input trees [requires rooted ultrametric trees].
  • support: edge lengths will be set to the support value for the split represented by the edge.
  • keep: do not change the existing edge lengths of the target topology.
  • clear: all edge lengths will be removed

So, for example, to construct a consensus tree of a post-burnin set of ultrametric trees, with the node ages set to the mean instead of the median node age:

$ sumtrees.py --set-edges=mean-age --summarize-node-ages --burnin=200 beast1.trees beast2.trees beast3.trees
$ sumtrees.py --e mean-age --summarize-node-ages --b 200 beast1.trees beast2.trees beast3.trees

Or to set the edges of a user-specifed tree to the median edge length of the input trees:

$ sumtrees.py --set-edges=median-length --target=mle.tre boots1.tre boots2.tre
$ sumtrees.py --e median-length -t mle.tre boots1.tre boots2.tre

Summarizing Rooted and Ultrametric Trees

Forcing Trees to be Treated as Rooted

SumTrees treats all trees as unrooted unless specified otherwise. You can force SumTrees to treat all trees as rooted by passing it the “--force-rooted” flag:

$ sumtrees.py --force-rooted phylo.trees

Summarizing Node Ages

If the trees are rooted and ultrametric, the “--summarize-node-ages” flag will result in SumTrees summarizing node age information as well:

$ sumtrees.py --summarize-node-ages phylo.trees

This will calculate and report statistics such as the mean, standard deviation, ranges, high posterior density intervals, etc. of the node ages.

Setting the Node Ages of the Summary Trees

If the tree is rooted and ultrametric, instead of summarizing edge lengths to be the mean of the edge lengths across the input set, you probably want to set the edge lengths of the summary tree so the node ages correspond to the mean or median of the node ages of the input set:

$ sumtrees.py --set-edges=mean-age --summarize-node-ages beast1.trees beast2.trees beast3.trees
$ sumtrees.py --e mean-age --summarize-node-ages beast1.trees beast2.trees beast3.trees

Rooting the Target Topology

The following options allow for re-rooting of the target topology or topologies:

--root-target-at-outgroup TAXON-LABEL
                      Root target tree(s) using specified taxon as outgroup.
--root-target-at-midpoint
                      Root target tree(s) at midpoint.
--set-outgroup TAXON-LABEL
                      Rotate the target trees such the specified taxon is in
                      the outgroup position, but do not explicitly change
                      the target tree rooting.

For example:

$ sumtrees.py --root-target-at-outgroup Python_regius --target=mle.tre boots1.tre boots2.tre
$ sumtrees.py --root-target-at-midpoint -s mcct trees1.tre trees2.tre
$ sumtrees.py --set-outgroup Python_regius -s mcct trees1.tre trees2.tre

Assigning the Ages of Non-Contemporaneous Tips (“Tip-Dating”)

In some studies, such as those including fossil taxa or serially-sampled lineages (e.g., viral or bacterial studies), the tips of the tree corresponding to taxa are non-contemporaneous. In these studies, the tips corresponding to taxa or samples in the past are usually assigned dates. Commonly referred to by some folks by the unfortunate and somewhat ambiguous phrase “tip-dating”, here we refer to this process of assigning ages of non-contemporaneous tips as somewhat less sloppy and more explicit, “assigning ages of non-contemporaneous tips”.

To do this using SumTrees, you would use the option --tip-ages to specify the path to a file containing the tip age data. The format of this file is specified by the --tip-ages-format option, which can be “tsv” (tab-separated values), “csv” (comma-separated values), or “json” (JSON). The default format is “tsv”, which specifies a tab delimited file. For both the tab-separated and comma-separated formats, the data should consist of two columns, with the first column specifying the taxon label and the second column specifying the age of the corresponding tip. The JSON-format file should consist of a single dictionary with keys being taxon labels and values being the age of the corresponding tips.

Taxon labels must match EXACTLY with the taxon labels in the input tree sources. This needs to account for things like the conversion of unprotected underscores to spaces as mandated by the Newick/NEXUS standard unless overridden by the --preserve-underscores option. Any taxon not listed in the tip ages file will get assigned an age of 0.0 by default.

So, for example, given a tree file ‘data/x1.tre’ with trees like:

[&R] ((d:2,(a:3,(b:1,c:2):1):3):2,(g:4,(e:1,f:1):2):4);
[&R] ((d:2,(a:3,(b:1,c:2):1):3):2,(g:4,(e:1,f:1):2):4);
.
.
.

and a tip age data file, ‘data/x1.ages.tsv’, consisting of tab-separated values like:

d<TAB>4
b<TAB>1
e<TAB>1
f<TAB>1

then the following is the invocation to summarize the trees and their node ages:

$ sumtrees.py \
 --tip-ages data/x1.ages.tsv
 --summarize-node-ages \
 data/x1-1.tre
 data/x1-2.tre

If the tip age data file is given in a comma-separated file, ‘data/x1.ages.csv’, like the following:

d,4
b,1
e,1
f,1

then the invocation is:

$ sumtrees.py \
 --tip-ages data/x1.ages.csv
 --tip-ages-format=csv \
 --summarize-node-ages \
 data/x1-1.tre
 data/x1-2.tre

Note that if specifying the tip ages you have to explicitly specify --summarize-node-ages or some other option that results in node ages being analyzed (e.g., --set-edges=mean-age).

Parallelizing SumTrees

Basics

If you have multiple input (support) files, you can greatly increase the performance of SumTrees by running it in parallel mode. In parallel mode, each input source will be handled in a separate process, resulting in a speed-up linearly proportional to the total number of processes running in parallel. At its most basic, running in parallel mode involves nothing more than adding the “-m” or “--multiprocessing option to the SumTrees invocation, and passing in the number of parallel processes to run. So, for example, if you have four tree files that you want to summarize, and you want to run these using two processes in parallel:

$ sumtrees.py --multiprocessing=2 -o result.tre phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -m 2 -o result.tre phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

Or, to run in four processes simultaneously:

$ sumtrees.py --multiprocessing=4 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -m 4 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

If you want to use all the available cores on the current machine, use the “-M” or “--maximum-multiprocessing” flag:

$ sumtrees.py --maximum-multiprocessing phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -M phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

Other options as described above, can, of course be added as needed:

$ sumtrees.py --multiprocessing=4 --burnin=200 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -m 4 -b 200 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

$ sumtrees.py --maximum-multiprocessing --burnin=200 --min-clade-freq=0.75 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -M -b 200 -f0.75 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

$ sumtrees.py --maximum-multiprocessing --burnin=200 --min-clade-freq=0.75 --output-tree-filepath=con.tre phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
$ sumtrees.py -M -b 200 -f0.75 -o con.tre phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre

Parallelization Strategy: Deciding on the Number of Processes

In parallel mode, SumTrees parallelizes by input files, which means that the maximum number of processes that can be run is limited to the number of input files. If you have four support tree files as input sources, as in the above example (“phylo.run1.tre”, “phylo.run2.tre”,”phylo.run3.tre”, and “phylo.run4.tre”), then you can run from 2 up to 4 processes in parallel. If you have 40 tree source or support files, then you can run from 2 up to 40 processes in parallel.

Note that there is a difference between the number of processes that SumTrees runs, and the number of processors or cores available on your machine or in your given hardware context. When running SumTrees in parallel mode, you can specify any number of parallel processes, up to the number of support files, even if the number of processes greatly exceeds the number processors available. For example, you might have an octo-core machine available, which means that you have 8 processors available. If your analysis has 40 independent support files, you can invoke SumTrees with 40 parallel processes:

$ sumtrees.py --multiprocessing=40 -o result.tre t1.tre ... t40.tre

SumTrees will indeed launch 40 parallel processes to process the files, and it will seem like 40 processes are being executed in parallel. In reality, however, the operating system is actually cycling the processes through the available processors in rapid succession, such that, on a nanosecond time-scale, only 8 processes are actually executing simultaneously: one on each of the available cores. While your run should complete without problems when oversubscribing the hardware in this way, there is going to be some degree of performance hit due to the overhead involved in managing the cycling of the processes through the processors. On some operating systems and hardware contexts, depending on the magnitude of oversubscription, this can be considerable. Thus it probably is a good idea to match the number of processes to the number of processors available.

If you do not mind using all the available cores on your machine, you can use the “--maximum-multiprocessing” or “-M” flag to request this, instead of using a specific number.

Another issue to consider is an even distribution of workload. Assuming that each of your input support files have the same number of trees, then it makes sense to specify a number of processes that is a factor of the number of input files. So, for example, if you have 8 input files to be summarized, you will get the best performance out of SumTrees by specifying 2, 4, or 8 processes, with the actual number given by the maximum number of processors available or that you want to dedicate to this task.

Running Parallel-Mode SumTrees in a Parallel Environment on a High-Performance Computing (HPC) Cluster

Unfortunately, the diversity and idiosyncracies in various HPC configurations and set-ups are so great that it is very difficult to provide general recipes on how to run parallel-mode SumTrees in a parallel environment on an HPC cluster. However, in all cases, all you really need to do is to set up an appropriate parallel execution environment on the cluster, requesting a specific number of processors from the cluster scheduler software, and then tell SumTrees to run the same number of processes.

For example, if your cluster uses the Sun Grid Engine as its scheduler, then you might use a job script such as the following:

# /bin/sh
#$ -cwd
#$ -V
#$ -pe mpi 4
sumtrees.py -m4 -o result.con.tre -burnin 100 mcmc1.tre mcmc2.tre mcmc3.tre mcmc4.tre

The “#$ -pe mpi 4” line tells the SGE scheduler to allocate four processors to the job in the “mpi” parallel environment, while the “-m4” part of the SumTrees command tells SumTrees to run 4 processes in parallel.

If you are using PBS/Torque as a scheduler, the equivalent job script might be:

# /bin/sh
#PBS -l nodes=2:ncpus=2
cd $PBS_O_WORKDIR
sumtrees.py -m4 -o result.con.tre -burnin 100 mcmc1.tre mcmc2.tre mcmc3.tre mcmc4.tre

Here, the “#PBS -l nodes=2:ncpus=2” requests two processors on two nodes, while the “-m4” part of the SumTrees command, as before, tells SumTrees to run 4 processes in parallel.

The particular job scripts you use will almost certainly be different, varying with the cluster job/load management software, scheduler and computing resources. Apart from the parallel environment name, number of processors and/or machine configuration, you might also need to provide a queue name, a maximum run time limit, and a soft or hard memory limit. None of it should make any difference to how SumTrees is actually invoked: you would still just use the “-m” or “--multiprocessing” flags to specify that SumTrees runs in parallel mode with a particular number of parallel processes). You just need to check with your cluster administrators or documentation to make sure your job script or execution context provides sufficient processors to match the number of processes that you want run (as well as other resources, e.g. wall time limit and memory limit, that SumTrees needs to finish its job).

Improving Performance

  • Run SumTrees in parallel mode (see “Running in Parallel Mode” or “Parallelizing SumTrees”):

    $ sumtrees.py -m 4 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
    
  • Reduce the tree-processing logging frequency. By default, SumTrees will report back every 500th tree in a tree file, just to let you know where it is and to give you a sense of how long it will take. You can use the “-g” or “--log-frequency” flag to control this behavior. If you have very large files, you may be content to have SumTrees report back every 5000th or even every 10000th tree instead:

    $ sumtrees.py --log-frequency=10000 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
    $ sumtrees.py -g10000 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
    

    If you are content to let SumTrees run without reporting its progress within each file (SumTrees will still report back whenever it begins or ends working on a file), then you can switch off tree processing logging altogether by specifying a logging frequency of 0:

    $ sumtrees.py --log-frequency=0 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
    $ sumtrees.py -g0 phylo.run1.tre phylo.run2.tre phylo.run3.tre phylo.run4.tre
    

Troubleshooting

Prerequisites

DendroPy is a Python library. It requires and presupposes not only the existence of a Python installation on your system, but also that this Python installation is available on the system path.

The biggest problem faced by most users is not so much not having Python installed, but not having the correct version of Python installed. You can check which version of Python you have running by typing:

$ python -V

SumTrees, and the DendroPy library that it is part of, works out-of-the-box with any 2.x Python version 2.4 or greater.

SumTrees will not work with versions of Python prior to 2.4, such as Python 2.3. It can probably be made to work pretty easily, and if you have strong enough motiviation to use Python 2.3, it might be worth the effort for you. It is not for me.

SumTrees (and DendroPy) is currently not compatible with Python 3.

My Computer Does Not Know What a Python Is

If you get a message like:

python: command not found

it is either because Python is not installed on your system, or is not found on the system path.

SumTrees is a Python script, and, as such, you will need to have a Python interpreter installed on your system.

Otherwise, you must download and install Python 2.6 from: http://www.python.org/download/releases/2.6/. For your convenience, the clicking on the following links should lead you directly to the appropriate pre-compiled download:

For other platforms, the usual “./configure”, “make”, and “sudo make install” dance should get you up and running with the following:

Microsoft Windows users should also refer to the “Python Windows FAQ” (http://www.python.org/doc/faq/windows.html) after installing Python, and pay particular attention to the “How do I run a Python program under Windows?” section, as it will help them greatly in getting Python up and running on the system path.

Manual Installation

The DendroPy library is actually quite straightforward to install manually, especially if you have any familiarity with Python and how Python files are organized. There are a couple of different things you could do:

  • Add the current location of the “dendropy” subdirectory to your Python path environmental variable, “$PYTHONPATH”, and place the file “programs/sumtrees.py” on your system path.
  • Copy (or symlink) the “dendropy” directory to the “site-packages” directory of your Python installation, and place the file “programs/sumtrees.py” on your system path.

Repository Access

The DendroPy public-access Git repository can be cloned from:

Bugs, Suggestions, Comments, etc.

If you encounter any problems, errors, crashes etc. while using this program, please let me (Jeet Sukumaran) know at jeetsukumaran@gmail.com. If you include the term “sumtrees” anywhere on the subject line (e.g. “Problem such-and-such with bootscore), it would help greatly with getting through the spam filter. Please include all the datafiles involved, as well the complete command used (with all the options and parameters) and the complete error message returned (simply cutting-and-pasting the terminal text should work fine). Please feel free to contact me if you have any other questions, suggestions or comments as well.

How to Cite this Program

If you use this program in your analysis, please cite it as:

Sukumaran, J. and Mark T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569-1571.

In the text of your paper, if you want to look like you know what you are doing, you should probably also mention explicitly that you specifically used the SumTrees program of the DendroPy package, as well as the particular version numbers of SumTrees and DendroPy that you used.

Acknowledgments

Portions of DendroPy were developed under CIPRES, a multi-site collaboration funded by the NSF Information Technology Research (ITR) program grant entitled “BUILDING THE TREE OF LIFE: A National Resource for Phyloinformatics and Computational Phylogenetics” and NSF ATOL grant 0732920 to Mark Holder.