dendropy.model.coalescent: The Coalescent

Functions, classes, and methods for working with Kingman’s n-coalescent framework.

dendropy.model.coalescent.coalesce_nodes(nodes, pop_size=None, period=None, rng=None, use_expected_tmrca=False)

Returns a list of nodes that have not yet coalesced once period is exhausted.

This function will a draw a coalescence time, t, from an exponential distribution with a rate of choose(k, 2), where k is the number of nodes. If period is given and if this time is less than period, or if period is not given, then two nodes are selected at random from nodes, and coalesced: a new node is created, and the two nodes are added as child_nodes to this node with an edge length such the the total length from tip to the ancestral node is equal to the depth of the deepest child + t. The two nodes are removed from the list of nodes, and the new node is added to it. t is then deducted from period, and the process repeats.

The function ends and returns the list of nodes once period is exhausted or if any draw of t exceeds period, if period is given or when there is only one node left.

As each coalescent event occurs, all nodes have their edges extended to the point of the coalescent event. In the case of constrained coalescence, all uncoalesced nodes have their edges extended to the end of the period (coalesced nodes have the edges fixed by the coalescent event in their ancestor). Thus multiple calls to this method with the same set of nodes will gradually ‘grow’ the edges, until all the the nodes coalesce. The edge lengths of the nodes passed to this method thus should not be modified or reset until the process is complete.

Parameters:
  • nodes (iterable[Node]) – An interable of Node objects representing a sample of neutral genes (some, all, or none of these nodes may have descendent nodes).
  • pop_size (integer) – The effective haploid population size; i.e., number of genes in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals.
  • period (numeric) – The time that the genes have to coalesce. If pop_size is 1 or 0 or None, then time is in haploid population units; i.e. where 1 unit of time equals 2N generations for a diploid population of size N, or N generations for a haploid population of size N. Otherwise time is in generations.
  • rng (Random object) – The random number generator instance to use. If not specified, the default RNG will be used.
  • use_expected_tmrca (bool) – If True, then instead of random times, the expected times will be used.
Returns:

nodes (iterable[|Node|]) – A list of nodes once period is exhausted or if any draw of t exceeds period, if period is given or when there is only one node left.

dendropy.model.coalescent.constrained_kingman_tree(pop_tree, gene_tree_list=None, rng=None, gene_node_label_fn=None, num_genes_attr='num_genes', pop_size_attr='pop_size', decorate_original_tree=False)

Given a population tree, pop_tree this will return a pair of trees: a gene tree simulated on this population tree based on Kingman’s n-coalescent, and population tree with the additional attribute ‘gene_nodes’ on each node, which is a list of uncoalesced nodes from the gene tree associated with the given node from the population tree.

pop_tree should be a DendroPy Tree object or an object of a class derived from this with the following attribute num_genes – the number of gene samples from each population in the present. Each edge on the tree should also have the attribute

pop_size_attr is the attribute name of the edges of pop_tree that specify the population size. By default it is pop_size. The should specify the effective haploid population size; i.e., number of gene in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals.

If pop_size is 1 or 0 or None, then the edge lengths of pop_tree is taken to be in haploid population units; i.e. where 1 unit equals 2N generations for a diploid population of size N, or N generations for a haploid population of size N. Otherwise the edge lengths of pop_tree is taken to be in generations.

If gene_tree_list is given, then the gene tree is added to the tree block, and the tree block’s taxa block will be used to manage the gene tree’s taxa.

gene_node_label_fn is a function that takes two arguments (a string and an integer, respectively, where the string is the containing species taxon label and the integer is the gene index) and returns a label for the corresponding the gene node.

if decorate_original_tree is True, then the list of uncoalesced nodes at each node of the population tree is added to the original (input) population tree instead of a copy.

Note that this function does very much the same thing as contained_coalescent(), but provides a very different API.

dendropy.model.coalescent.contained_coalescent_tree(containing_tree, gene_to_containing_taxon_map, edge_pop_size_attr='pop_size', default_pop_size=1, rng=None)

Returns a gene tree simulated under the coalescent contained within a population or species tree.

containing_tree
The population or species tree. If edge_pop_size_map is not None, and population sizes given are non-trivial (i.e., >1), then edge lengths on this tree are in units of generations. Otherwise edge lengths are in population units; i.e. 2N generations for diploid populations of size N, or N generations for diploid populations of size N.
gene_to_containing_taxon_map
A TaxonNamespaceMapping object mapping Taxon objects in the containing_tree TaxonNamespace to corresponding Taxon objects in the resulting gene tree.
edge_pop_size_attr
Name of attribute of edges that specify population size. By default this is “pop_size”. If this attribute does not exist, default_pop_size will be used. The value for this attribute should be the haploid population size or the number of genes; i.e. 2N for a diploid population of N individuals, or N for a haploid population of N individuals. This value determines how branch length units are interpreted in the input tree, containing_tree. If a biologically-meaningful value, then branch lengths on the containing_tree are properly read as generations. If not (e.g. 1 or 0), then they are in population units, i.e. where 1 unit of time equals 2N generations for a diploid population of size N, or N generations for a haploid population of size N. Otherwise time is in generations. If this argument is None, then population sizes default to default_pop_size.
default_pop_size
Population size to use if edge_pop_size_attr is None or if an edge does not have the attribute. Defaults to 1.

The returned gene tree will have the following extra attributes:

pop_node_genes
A dictionary with nodes of containing_tree as keys and a list of gene tree nodes that are uncoalesced as values.

Note that this function does very much the same thing as constrained_kingman(), but provides a very different API.

dendropy.model.coalescent.discrete_time_to_coalescence(n_genes, pop_size=None, n_to_coalesce=2, rng=None)

A random draw from the “Kingman distribution” (discrete time version): Time to go from n_genes genes to n_genes-1 genes in a discrete-time Wright-Fisher population of pop_size genes; i.e. waiting time until n-genes lineages coalesce in a population of pop_size genes.

Parameters:
  • n_genes (integer) – The number of genes in the sample.
  • pop_size (integer) – The effective haploid population size; i.e., number of genes in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals.
  • n_to_coalesce (integer) – The waiting time that will be returned will be the waiting time for this number of genes in the sample to coalesce.
  • rng (Random object) – The random number generator instance.
Returns:

k (integer) – A randomly-generated waiting time (in discrete generations) for n_to_coalesce genes to coalesce out of a sample of n_genes in a population of pop_size genes.

dendropy.model.coalescent.expected_tmrca(n_genes, pop_size=None, n_to_coalesce=2)

Expected (mean) value for the Time to the Most Recent Common Ancestor of n_to_coalesce genes in a sample of n_genes drawn from a population of pop_size genes.

Parameters:
  • n_genes (integer) – The number of genes in the sample.
  • pop_size (integer) – The effective haploid population size; i.e., number of genes in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals.
  • n_to_coalesce (integer) – The waiting time that will be returned will be the waiting time for this number of genes in the sample to coalesce.
  • rng (Random object) – The random number generator instance.
Returns:

k (float) – The expected waiting time (in continuous time) for n_to_coalesce genes to coalesce out of a sample of n_genes in a population of pop_size genes.

dendropy.model.coalescent.extract_coalescent_frames(tree, ultrametricity_precision=1e-05)

Returns a list of tuples describing the coalescent frames on the tree. That is, each element in the list is tuple pair consisting of where: the first element of the pair is the number of separate lineages remaining on the tree at coalescence event, and the second element of the pair is the time between this coalescence event and the earlier (more recent) one.

Parameters:
  • tree (Tree) – A tree instance.
  • ultrametricity_precision (float) – When calculating the node ages, an error will be raised if the tree is not ultrametric. This error may be due to floating-point or numerical imprecision. You can set the precision of the ultrametricity validation by setting the ultrametricity_precision parameter. E.g., use ultrametricity_precision=0.01 for a more relaxed precision, down to 2 decimal places. Use ultrametricity_precision=False to disable checking of ultrametricity.
Returns:

x (dict) – Returns dictionary, with key = number of alleles, and values = waiting time for coalescent for the given tree

dendropy.model.coalescent.log_probability_of_coalescent_frames(coalescent_frames, haploid_pop_size)

Under the classical neutral coalescent citep{Kingman1982, Kingman1982b}, the waiting times between coalescent events in a sample of $k$ alleles segregating in a population of (haploid) size $N_e$ is distributed exponentially with a rate parameter of :math`frac{{k choose 2}}{N_e}`:

.. math::

    \Pr(T) =  \frac{{k \choose 2}}{N_e} \e{-  \frac{{k \choose 2}}{N_e} T},

where $T$ is the length of (chronological) time in which there are $k$ alleles in the sample (i.e., for $k$ alleles to coalesce into $k-1$ alleles).

dendropy.model.coalescent.log_probability_of_coalescent_tree(tree, haploid_pop_size, ultrametricity_precision=1e-05)

Wraps up extraction of coalescent frames and reporting of probability.

dendropy.model.coalescent.mean_kingman_tree(taxon_namespace, pop_size=1, rng=None)

Returns a tree with coalescent intervals given by the expected times under Kingman’s neutral coalescent.

dendropy.model.coalescent.node_waiting_time_pairs(tree, ultrametricity_precision=1e-05)

Returns a list of tuples of (nodes, coalescent interval time) on the tree. That is, each element in the list is tuple pair consisting of where: the first element of the pair is an internal node representing a coalescent event on the tree, and the second element of the pair is the time between this coalescence event and the earlier (more recent) one.

Parameters:
  • tree (Tree) – A tree instance.
  • ultrametricity_precision (float) – When calculating the node ages, an error will be raised if the tree is not ultrametric. This error may be due to floating-point or numerical imprecision. You can set the precision of the ultrametricity validation by setting the ultrametricity_precision parameter. E.g., use ultrametricity_precision=0.01 for a more relaxed precision, down to 2 decimal places. Use ultrametricity_precision=False to disable checking of ultrametricity.
Returns:

x (list of tuples (node, coalescent interval)) – Returns list of tuples of (node, coalescent interval [= time between last coalescent event and current node age])

dendropy.model.coalescent.pure_kingman_tree(taxon_namespace, pop_size=1, rng=None)

Generates a tree under the unconstrained Kingman’s coalescent process.

Parameters:
  • taxon_namespace (TaxonNamespace instance) – A pre-populated TaxonNamespace where the contained Taxon instances represent the genes or individuals sampled from the population.
  • pop_size (numeric) – The size of the population from the which the coalescent process is sampled.
Returns:

t (|Tree|) – A tree sampled from the Kingman’s neutral coalescent.

dendropy.model.coalescent.pure_kingman_tree_shape(num_leaves, pop_size=1, rng=None)

Like dendropy.model.pure_kingman_tree, but does not assign taxa to tips.

Parameters:
  • num_leaves (int) – Number of individuals/genes sampled.
  • pop_size (numeric) – The size of the population from the which the coalescent process is sampled.
Returns:

t (|Tree|) – A tree sampled from the Kingman’s neutral coalescent.

dendropy.model.coalescent.time_to_coalescence(n_genes, pop_size=None, n_to_coalesce=2, rng=None)

A random draw from the “Kingman distribution” (discrete time version): Time to go from n_genes genes to n_genes-1 genes in a continuous-time Wright-Fisher population of pop_size genes; i.e. waiting time until n-genes lineages coalesce in a population of pop_size genes.

Given the number of gene lineages in a sample, n_genes, and a population size, pop_size, this function returns a random number from an exponential distribution with rate $choose(pop_size, 2)$. pop_size is the effective haploid population size; i.e., number of gene in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals. If pop_size is 1 or 0 or None, then time is in haploid population units; i.e. where 1 unit of time equals 2N generations for a diploid population of size N, or N generations for a haploid population of size N. Otherwise time is in generations.

The coalescence time, or the waiting time for the coalescence, of two gene lineages evolving in a population with haploid size $N$ is an exponentially-distributed random variable with rate of $N$ an expectation of $frac{1}{N}$). The waiting time for coalescence of any two gene lineages in a sample of $n$ gene lineages evolving in a population with haploid size $N$ is an exponentially-distributed random variable with rate of $choose{N, 2}$ and an expectation of $frac{1}{choose{N, 2}}$.

Parameters:
  • n_genes (integer) – The number of genes in the sample.
  • pop_size (integer) – The effective haploid population size; i.e., number of genes in the population: 2 * N in a diploid population of N individuals, or N in a haploid population of N individuals.
  • n_to_coalesce (integer) – The waiting time that will be returned will be the waiting time for this number of genes in the sample to coalesce.
  • rng (Random object) – The random number generator instance to use.
Returns:

k (float) – A randomly-generated waiting time (in continuous time) for n_to_coalesce genes to coalesce out of a sample of n_genes in a population of pop_size genes.