dendropy.calculate.phylogeneticdistance: Phylogenetic Distance Calculations and Operations

The PhylogeneticDistanceMatrix Class

class dendropy.calculate.phylogeneticdistance.PhylogeneticDistanceMatrix(is_store_path_edges=False)

Calculates and maintains patristic distance information of taxa on a tree.

as_data_table(is_weighted_edge_distances=True)

Returns this as a table.

assemblage_membership_definitions_from_csv(src, default_data_type=<type 'float'>, **csv_reader_kwargs)

Convenience method to return list of community sets from a delimited file that lists taxon (labels) in columns and community presence/absences or abundances in rows.

compile_from_tree(tree)

Calculates the distances. Note that the path length (in number of steps) between taxa that span the root will be off by one if the tree is unrooted.

distance(taxon1, taxon2, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)

Returns distance between taxon1 and taxon2.

distances(is_weighted_edge_distances=True, is_normalize_by_tree_size=False)

Returns list of patristic distances.

distinct_taxon_pair_iter(filter_fn=None)

Iterates over all distinct pairs of taxa in matrix.

classmethod from_csv(src, taxon_namespace=None, is_allow_new_taxa=None, is_first_row_column_names=True, is_first_column_row_names=True, default_data_type=<type 'float'>, label_transform_fn=None, **csv_reader_kwargs)

Instantiates a new PhylogeneticDistanceMatrix instance with data from an external source.

Parameters:
  • src (file or file-like) – Source of data. This is a token delimited-file (e.g., a comma-delimited or tab-delimited file) providing a table which lists taxon labels in both rows and columns. The cells of the table are numeric (typically real) values that indicate the distance between the taxa of the current row and column. Note that only the upper right section of the table is considered. The diagonals values are typically zeroes and, in either case, ignored along with the lower diagonal. Despite being ignored by the PhylogeneticDistanceMatrix object, the values are parsed by the underlying reader and thus have to be valid numerical values.
  • taxon_namespace (TaxonNamespace instance) – The taxon namespace with which to manage taxa. If this has not already been pre-populated with the taxon names, then is_allow_new_taxa should be set to True.
  • is_allow_new_taxa (bool) – If False: we do not expect to encounter any new taxa in the data file, and it is an error if we do. If True: we do expect to encounter new taxa in the data file. The default value of this depends on the value passed to taxon_namespace. If taxon_namespace is None or an empty TaxonNamespace instance, then unless explicitly set to False, is_allow_new_taxa will default to True: allowing of creation of new taxa corresponding to labels found in the data source. On the other hand, if taxon_namespace is not None and its value is a TaxonNamespace instance with at least one taxon, unless explicitly set to True, is_allow_new_taxa will default to False, and it will be an error if taxon labels are found in the data source that do not correspond (exactly) to Taxon objects defined in the taxon namespace. This is to err on the side of caution, to avoid (or rather, highlight) problems due to incorrect or mismatching labels between the data source and the current taxon namespace.
  • is_first_row_column_names (bool) – By default True: assumes that first row lists the taxon names. Set to False if there is no header row.
  • is_first_column_row_names (bool) – By default True: assumes that first column lists the taxon names. Set to False if there is now row name column.
  • label_transform_fn (function object) – If not None, this should be a function object that takes a string as an argument and returns another string. This function will be applied to row and column labels before they are matched to taxon labels in the TaxonNamespace instance given by taxon_namespace.
  • **csv_reader_kwargs (keyword arguments) – This arguments will be passed to the underlying CSV reader. The most important one is probably ‘delimiter’.
Returns:

pdm (A |PhylogeneticDistanceMatrix| instance)

Examples

import dendropy
pdm1 = dendropy.PhylogeneticDistanceMatrix.from_csv(
        src=open("data.csv"),
        delimiter=",")
pdm2 = dendropy.PhylogeneticDistanceMatrix.from_csv(
        src=open("data.tsv"),
        delimiter=" ")
classmethod from_tree(tree, *args, **kwargs)

Creates and returns a PhylogeneticDistanceMatrix based on the given tree.

Note that this creates a “snapshot” of the current state of the tree. Subsequent changes to the tree will not be reflected in PhylogeneticDistanceMatrix instances previously created.

Also note that syntactically you may prefer to use:

pdm = tree.phylogenetic_distance_matrix()

instead of:

pdm = PhylogeneticDistanceMatrix.from_tree(tree)
Parameters:tree (a Tree instance) – The Tree from which to get the phylogenetic distances.
Returns:pdm (A |PhylogeneticDistanceMatrix| instance)

Examples

import dendropy
tree = dendropy.Tree.get(path="tree.nex",
        schema="nexus")
pdm1 = dendropy.PhylogeneticDistanceMatrix.from_tree(tree)

# following is equivalent to above and probably preferred:
pdm2 = tree.phylogenetic_distance_matrix()
mean_nearest_taxon_distance(filter_fn=None, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)

Calculates the phylogenetic ecology statistic “MNTD”[1,2] for the tree (only considering taxa for which filter_fn returns True when applied if filter_fn is specified).

The mean nearest taxon distance (mntd) is given by:

\[mntd = \frac{ \sum_{i}^{n} min(\delta_{i,j}) }{n},\]

where \(i \neq j\), \(\delta_{i,j}\) is the phylogenetic distance between species \(i\) and \(j\), and \(n\) is the number of species in the sample.

Parameters:
  • filter_fn (function object or None) – If None, then all leaves will be considered. Otherwise should be a function object that takes a Taxon instance as an argument and returns True if it is to be included in the calculation or False otherwise. In trees sampled from multiple communites, filter_fn can be used to restrict the calculation to only one community based on some criteria.
  • is_weighted_edge_distances (bool) – If True then the edge-weighted distance, i.e., considering edge lengths, is returned. Otherwise the the path steps or the number of edges rather then the sum of is_weighted_edge_distances edges, connecting two taxa is considered.
  • is_normalize_by_tree_size (bool) – If True then the results are normalized by the total tree length or steps/edges (depending on whether edge-weighted or unweighted distances are used, respectively). Otherwise, raw distances are used.
Returns:

mntd (float) – The Mean Nearest Taxon Distance (MNTD) statistic for the daata.

Examples

import dendropy
tree = dendropy.Tree.get(path="data.nex",
        schema="nexus")
pdm = dendropy.PhylogeneticDistanceMatrix(tree)

# consider all tips
mntd = pdm.mean_nearest_taxon_distance()

# only tips within the same community, based on the node annotation
# "community"
mntds_by_community = {}
for community_label in ("1", "2", "3",):
    filter_fn = lambda x: x.annotations["community"] == community_label
    mntd = pdm.mean_pairwise_distance(filter_fn=filter_fn)
    mntds_by_community[community_label] = mntd

References

[1] Webb, C.O. 2000. Exploring the phylogenetic structure of ecological communities: An example for rainforest trees. The American Naturalist 156: 145-155.

[2] Swenson, N.G. Functional and Phylogenetic Ecology in R.

mean_pairwise_distance(filter_fn=None, is_weighted_edge_distances=True, is_normalize_by_tree_size=False)

Calculates the phylogenetic ecology statistic “MPD”[1,2] for the tree (only considering taxa for which filter_fn returns True when applied if filter_fn is specified).

The mean pairwise distance (mpd) is given by:

\[mpd = \frac{ \sum_{i}^{n} \sum_{j}^{n} \delta_{i,j} }{n \choose 2},\]

where \(i \neq j\), \(\delta_{i,j}\) is the phylogenetic distance between species \(i\) and \(j\), and \(n\) is the number of species in the sample.

Parameters:
  • filter_fn (function object or None) – If None, then all leaves will be considered. Otherwise should be a function object that takes a Taxon instance as an argument and returns True if it is to be included in the calculation or False otherwise. In trees sampled from multiple communites, filter_fn can be used to restrict the calculation to only one community based on some criteria.
  • is_weighted_edge_distances (bool) – If True then the edge-weighted distance, i.e., considering edge lengths, is returned. Otherwise the the path steps or the number of edges rather then the sum of is_weighted_edge_distances edges, connecting two taxa is considered.
  • is_normalize_by_tree_size (bool) – If True then the results are normalized by the total tree length or steps/edges (depending on whether edge-weighted or unweighted distances are used, respectively). Otherwise, raw distances are used.
Returns:

mpd (float) – The Mean Pairwise Distance (MPD) statistic for the daata.

Examples

import dendropy
tree = dendropy.Tree.get(path="data.nex",
        schema="nexus")
pdm = dendropy.PhylogeneticDistanceMatrix(tree)

# consider all tips
mpd1 = pdm.mean_pairwise_distance()

# only tips within the same community, based on the node annotation
# "community"
mpds_by_community = {}
for community_label in ("1", "2", "3",):
    filter_fn = lambda x: x.annotations["community"] == community_label
    mpd = pdm.mean_pairwise_distance(filter_fn=filter_fn)
    mpds_by_community[community_label] = mpd

References

[1] Webb, C.O. 2000. Exploring the phylogenetic structure of ecological communities: An example for rainforest trees. The American Naturalist 156: 145-155.

[2] Swenson, N.G. Functional and Phylogenetic Ecology in R.

mrca(taxon1, taxon2)

Returns MRCA of two taxon objects.

nj_tree(is_weighted_edge_distances=True, tree_factory=None)

Returns an Neighbor-Joining (NJ) tree based on the distances in the matrix.

Calculates and returns a tree under the Neighbor-Joining algorithm of Saitou and Nei (1987) for the data in the matrix.

Parameters:is_weighted_edge_distances (bool) – If True then edge lengths will be considered for distances. Otherwise, just the number of edges.
Returns:t (|Tree|) – A Tree instance corresponding to the Neighbor-Joining (NJ) tree for this data.

Examples

import dendropy

# Read data from a CSV file into a PhylogeneticDistanceMatrix
# object
with open("distance_matrix.csv") as src:
    pdm = dendropy.PhylogeneticDistanceMatrix.from_csv(
            src,
            is_first_row_column_names=True,
            is_first_column_row_names=True,
            is_allow_new_taxa=True,
            delimiter=",",
            )

# Calculate the tree
nj_tree = pdm.nj_tree()

# Print it
print(nj_tree.as_string("nexus"))

References

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4: 406-425.

path_edge_count(taxon1, taxon2, is_normalize_by_tree_size=False)

Returns the number of edges between two taxon objects.

path_edges(taxon1, taxon2)

Returns the edges between two taxon objects.

patristic_distance(taxon1, taxon2, is_normalize_by_tree_size=False)

Returns patristic distance between two taxon objects.

shuffle_taxa(is_shuffle_phylogenetic_distances=True, is_shuffle_phylogenetic_path_steps=True, is_shuffle_mrca=True, rng=None)

Randomly shuffles taxa in-situ.

standardized_effect_size_mean_nearest_taxon_distance(assemblage_memberships, num_randomization_replicates=1000, is_weighted_edge_distances=True, is_normalize_by_tree_size=False, is_skip_single_taxon_assemblages=False, null_model_type='taxa.label', rng=None)

Returns the standardized effect size value for the MNTD statistic under a null model under various community compositions.

The S.E.S. is given by:

\[SES(statistic) = \frac{observed - mean(model_{null})}{sd(model_{null})}\]

This removes any bias associated with the decrease in variance in the MPD statistic value as species richness increases to the point where communities become saturated. Equivalent to -1 times the Nearest Taxon Index when using phylogenetic distances.

In contrast to the function calculating the non-standardized effect size version of this statistic, which uses filter function to specify the subset of taxa to be considerd, here a collection of (multiple) sets of taxa constituting a community is specified. This to allow calculation of the null model statistic across all community sets for each randomization replicate.

Parameters:
  • assemblage_memberships (iterable of iterable of Taxon objects) – A collection of collections, e.g. a list of sets, with the elements of each set being Taxon instances. Each set specifies the composition of a community. The standardized effect size of this statistic will be calculated for each community as specified by a set of Taxon instances.
  • num_randomization_replicates (int) – Number of randomization replicates.
  • is_weighted_edge_distances (bool) – If True then edge lengths will be considered for distances. Otherwise, just the number of edges.
Returns:

r (list of results) – A list of results, with each result corresponding to a community set given in assemblage_memberships. Each result consists of a named tuple with the following elements:

  • obs : the observed value of the statistic
  • null_model_mean : the mean value of the statistic under the null
    model
  • null_model_sd : the standard deviation of the statistic under
    the null model
  • z : the standardized effect value of the statistic
    (= SES as defined in [1] above)
  • p : the p-value of the observed value of the
    statistic under the null model.

Examples

import dendropy
tree = dendropy.Tree.get_from_path(
        src="data/community.tree.newick",
        schema="newick",
        rooting="force-rooted")
pdm = dendropy.PhylogeneticDistanceMatrix.from_tree(tree)
assemblage_memberships = pdm.assemblage_membership_definitions_from_csv("data/comm1.csv")
results = pdm.standardized_effect_size_mean_nearest_taxon_distance(assemblage_memberships=assemblage_memberships)
print(results)
standardized_effect_size_mean_pairwise_distance(assemblage_memberships, num_randomization_replicates=1000, is_weighted_edge_distances=True, is_normalize_by_tree_size=False, is_skip_single_taxon_assemblages=False, null_model_type='taxa.label', rng=None)

Returns the standardized effect size value for the MPD statistic under a null model under various community compositions.

The S.E.S. is given by:

\[SES(statistic) = \frac{observed - mean(model_{null})}{sd(model_{null})}\]

This removes any bias associated with the decrease in variance in the MPD statistic value as species richness increases to the point where communities become saturated. Equivalent to -1 times the Nearest Relative Index (NRI) when using phylogenetic distances.

In contrast to the function calculating the non-standardized effect size version of this statistic, which uses filter function to specify the subset of taxa to be considerd, here a collection of (multiple) sets of taxa constituting a community is specified. This to allow calculation of the null model statistic across all community sets for each randomization replicate.

Parameters:
  • assemblage_memberships (iterable of iterable of Taxon objects) – A collection of collections, e.g. a list of sets, with the elements of each set being Taxon instances. Each set specifies the composition of a community. The standardized effect size of this statistic will be calculated for each community as specified by a set of Taxon instances.
  • num_randomization_replicates (int) – Number of randomization replicates.
  • is_weighted_edge_distances (bool) – If True then edge lengths will be considered for distances. Otherwise, just the number of edges.
Returns:

r (list of results) – A list of results, with each result corresponding to a community set given in assemblage_memberships. Each result consists of a named tuple with the following elements:

  • obs : the observed value of the statistic
  • null_model_mean : the mean value of the statistic under the null
    model
  • null_model_sd : the standard deviation of the statistic under
    the null model
  • z : the standardized effect value of the statistic
    (= SES as defined in [1] above)
  • p : the p-value of the observed value of the

Examples

import dendropy
tree = dendropy.Tree.get_from_path(
        src="data/community.tree.newick",
        schema="newick",
        rooting="force-rooted")
pdm = tree.phylogenetic_distance_matrix()
assemblage_membership_definitions = pdm.assemblage_membership_definitions_from_csv("data/comm1.csv")
results = pdm.standardized_effect_size_mean_pairwise_distance(assemblage_memberships=assemblage_membership_definitions.values())
print(results)
sum_of_distances(is_weighted_edge_distances=True, is_normalize_by_tree_size=False)

Returns sum of patristic distances on tree.

taxon_iter(filter_fn=None)

Iterates over taxa in matrix. Note that this could be a subset of the taxa in the associated taxon namespace.

upgma_tree(is_weighted_edge_distances=True, tree_factory=None)

Returns an Unweighted Pair Group Method with Arithmetic Mean (UPGMA) tree based on the distances in the matrix.

Parameters:is_weighted_edge_distances (bool) – If True then edge lengths will be considered for distances. Otherwise, just the number of edges.
Returns:t (|Tree|) – A Tree instance corresponding to the UPGMA tree for this data.

Examples

import dendropy

# Read data from a CSV file into a PhylogeneticDistanceMatrix
# object
with open("distance_matrix.csv") as src:
    pdm = dendropy.PhylogeneticDistanceMatrix.from_csv(
            src,
            is_first_row_column_names=True,
            is_first_column_row_names=True,
            is_allow_new_taxa=True,
            delimiter=",",
            )

# Calculate the tree
upgma_tree = pdm.upgma_tree()

# Print it
print(upgma_tree.as_string("nexus"))