Reading and Writing Phylogenetic Data

Creating New Objects From an External Data Source

The Tree, TreeList, CharacterMatrix-derived (e.g., DnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix, etc.), and DataSet classes all support a “get” factory class-method that instantiates an object of the given class from a data source. This method takes, at a minimum, two keyword arguments that specify the source of the data and the schema (or format) of the data.

The source must be specified using exactly one of the following:

  • a path to a file (specified using the keyword argument “path”)
  • a file or a file-like object opened for reading (specified using the keyword argument "file")
  • a string value giving the data directly (specified using the keyword argument "data")
  • or a URL (specified using the keyword argument "url")

The schema is specified using the keyword argument "schema", and takes a string value that identifies the format of the data. This “schema specification string” can be one of: “fasta”, “newick”, “nexus”, “nexml”, or “phylip”. Not all formats are supported for reading, and not all formats make sense for particular objects (for example, it would not make sense to try to instantiate a Tree or TreeList object from a FASTA-formatted data source).

For example:

import dendropy

tree1 = dendropy.Tree.get(path="mle.tre", schema="newick")
tree2 = dendropy.Tree.get(file=open("mle.nex", "r"), schema="nexus")
tree3 = dendropy.Tree.get(data="((A,B),(C,D));", schema="newick")
tree4 = dendropy.Tree.get(url="http://api.opentreeoflife.org/v2/study/pg_1144/tree/tree2324.nex", schema="nexus")

tree_list1 = dendropy.TreeList.get(path="pythonidae.mcmc.nex", schema="nexus")
tree_list2 = dendropy.TreeList.get(file=open("pythonidae.mcmc.nex", "r"), schema="nexus")
tree_list3 = dendropy.TreeList.get(data="(A,(B,C));((A,B),C);", schema="newick")

dna1 = dendropy.DnaCharacterMatrix.get(file=open("pythonidae.fasta"), schema="fasta")
dna2 = dendropy.DnaCharacterMatrix.get(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus", schema="nexus")
aa1 = dendropy.ProteinCharacterMatrix.get(file=open("pythonidae.dat"), schema="phylip")
std1 = dendropy.StandardCharacterMatrix.get(path="python_morph.nex", schema="nexus")
std2 = dendropy.StandardCharacterMatrix.get(data=">t1\n01011\n\n>t2\n11100", schema="fasta")

dataset1 = dendropy.DataSet.get(path="pythonidae.chars_and_trees.nex", schema="nexus")
dataset2 = dendropy.DataSet.get(url="http://purl.org/phylo/treebase/phylows/study/TB2:S1925?format=nexml", schema="nexml")

The “get” method takes a number of other optional keyword arguments that provide control over how the data is interpreted and processed. Some are general to all classes (e.g., the “label” or “taxon_namespace” arguments), while others are specific to a given class (e.g., the “exclude_trees” argument when instantiating data into a DataSet object, or the “tree_offset” argument when instantiating data into a Tree or TreeList object). These are all covered in detail in the documentation of the respective methods for each class.

Other optional keyword arguments are specific to the schema or format (e.g., the “preserve_underscores” argument when reading Newick or NEXUS data). These are covered in detail in the DendroPy Schema Guide.

Note

The Tree, TreeList, CharacterMatrix-derived, and DataSet classes also support a “get_from_*()” family of factory class-methods that can be seen as specializations of the “get” method for various types of sources (in fact, the “get” method is actually a dispatcher that calls on one of these methods below for implementation of the functionality):

get_from_stream(src, schema, **kwargs)
Takes a file or file-like object opened for reading the data source as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(file=src, schema=schema, ...)”.
get_from_path(src, schema, **kwargs)
Takes a string specifying the path to the data source file as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(path=src, schema=schema, ...)”.
get_from_string(src, schema, **kwargs)
Takes a string containing the source data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(data=src, schema=schema, ...)”.
get_from_url(src, schema, **kwargs)
Takes a string containing the URL of the data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “get(url=src, schema=schema, ...)”.

As with the “get” method, the additional keyword arguments are specific to the given class or schema type.
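The dispatch described in this note can be pictured with a small, self-contained sketch. It is not DendroPy's actual implementation: the names SOURCE_KEYWORDS and resolve_source are hypothetical and exist only to illustrate the rule that exactly one source keyword selects the matching “get_from_*()” method.

```python
# Hypothetical illustration of source-keyword dispatch; NOT DendroPy's own code.
# Maps each source keyword to the "get_from_*" method named in the text above.
SOURCE_KEYWORDS = {
    "file": "get_from_stream",
    "path": "get_from_path",
    "data": "get_from_string",
    "url": "get_from_url",
}


def resolve_source(**kwargs):
    """Return (method_name, source, remaining_kwargs) for exactly one source keyword."""
    given = [key for key in SOURCE_KEYWORDS if key in kwargs]
    if len(given) != 1:
        raise TypeError(
            "exactly one of 'file', 'path', 'data' or 'url' is required; got: %r"
            % (given,)
        )
    key = given[0]
    # Everything except the source keyword is passed on unchanged.
    remaining = {k: v for k, v in kwargs.items() if k != key}
    return SOURCE_KEYWORDS[key], kwargs[key], remaining


method, src, rest = resolve_source(path="mle.tre", schema="newick")
print(method, src, rest)  # get_from_path mle.tre {'schema': 'newick'}
```

Passing two source keywords, or none, raises a TypeError in this sketch; how DendroPy itself reports that condition is not specified here.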

Adding Data to Existing Objects from an External Data Source

In addition to the “get” class factory method, the collection classes (TreeList, TreeArray and DataSet) each support a “read” instance method that adds data from external sources to an existing object (as opposed to creating and returning a new object based on an external data source). This “read” instance method has a signature that parallels the “get” factory method described above, requiring:

  • A specification of a source using one and exactly one of the following keyword arguments: “path”, “file”, “data”, “url”.
  • A specification of the schema or format of the data.
  • Optional keyword arguments to customize/control the parsing and interpretation of the data.

As with the “get” method, the “read” method takes a number of other optional keyword arguments that provide control over how the data is interpreted and processed. These are covered in more detail in the documentation of the respective methods for each class, and the schema-specific keyword arguments are covered in detail in the DendroPy Schema Guide.

For example, the following accumulates post-burn-in trees from several different files into a single TreeList object:

>>> import dendropy
>>> post_trees = dendropy.TreeList()
>>> post_trees.read(
...         file=open("pythonidae.nex.run1.t", "r"),
...         schema="nexus",
...         tree_offset=200)
>>> print(len(post_trees))
800
>>> post_trees.read(
...         path="pythonidae.nex.run2.t",
...         schema="nexus",
...         tree_offset=200)
>>> print(len(post_trees))
1600
>>> s = open("pythonidae.nex.run3.t", "r").read()
>>> post_trees.read(
...         data=s,
...         schema="nexus",
...         tree_offset=200)
>>> print(len(post_trees))
2400

while the following accumulates data from a variety of sources into a single DataSet object under the same TaxonNamespace to ensure that they all reference the same set of Taxon objects:

>>> import dendropy
>>> ds = dendropy.DataSet()
>>> tns = ds.new_taxon_namespace()
>>> ds.attach_taxon_namespace(tns)
>>> ds.read(url="http://api.opentreeoflife.org/v2/study/pg_1144/tree/tree2324.nex",
...     schema="nexus")
>>> ds.read(file=open("pythonidae.fasta"), schema="fasta")
>>> ds.read(url="http://purl.org/phylo/treebase/phylows/matrix/TB2:M2610?format=nexus",
...     schema="nexus")
>>> ds.read(file=open("pythonidae.dat"), schema="phylip")
>>> ds.read(path="python_morph.nex", schema="nexus")
>>> ds.read(data=">t1\n01011\n\n>t2\n11100", schema="fasta")
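The counts printed in the TreeList example above (800, 1600, 2400) follow from simple arithmetic, sketched below without DendroPy. The sketch rests on two assumptions that the text does not state: each run file holds 1000 trees, and “tree_offset” is the 0-based index of the first tree kept, so that the first 200 trees of each file are skipped. The helper trees_kept and the placeholder tree labels are hypothetical.

```python
# Pure-Python model of the burn-in arithmetic; no DendroPy calls are made.
TREES_PER_FILE = 1000  # assumption: the source text does not give the file sizes
TREE_OFFSET = 200      # same value as tree_offset in the example above


def trees_kept(trees, tree_offset):
    """Model tree_offset as a slice: skip the first `tree_offset` trees."""
    return trees[tree_offset:]


post_trees = []
totals = []
for run in ("run1", "run2", "run3"):
    # Placeholder labels stand in for the trees sampled in one run file.
    sampled = ["%s.tree%d" % (run, i) for i in range(TREES_PER_FILE)]
    post_trees.extend(trees_kept(sampled, TREE_OFFSET))
    totals.append(len(post_trees))

print(totals)         # [800, 1600, 2400]
print(post_trees[0])  # run1.tree200
```

If the run files held a different number of trees, the running totals would change accordingly.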

Note

DendroPy 3.xx supported “read_from_*()” methods on Tree and CharacterMatrix-derived classes. These are no longer supported in DendroPy 4 and above. Instead of trying to re-populate an existing Tree or CharacterMatrix-derived object by using “read_from_*()”:

x = dendropy.Tree()
x.read_from_path("tree1.nex", "nexus")
# ...
x.read_from_path("tree2.nex", "nexus")

simply rebind the new object returned by “get”:

x = dendropy.Tree.get(path="tree1.nex", schema="nexus")
# ...
x = dendropy.Tree.get(path="tree2.nex", schema="nexus")

Note

The TreeList, TreeArray, and DataSet classes also support a “read_from_*()” family of instance methods that can be seen as specializations of the “read” method for various types of sources (in fact, the “read” method is actually a dispatcher that calls on one of these methods below for implementation of the functionality):

read_from_stream(src, schema, **kwargs)
Takes a file or file-like object opened for reading the data source as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(file=src, schema=schema, ...)”.
read_from_path(src, schema, **kwargs)
Takes a string specifying the path to the data source file as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(path=src, schema=schema, ...)”.
read_from_string(src, schema, **kwargs)
Takes a string containing the source data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(data=src, schema=schema, ...)”.
read_from_url(src, schema, **kwargs)
Takes a string containing the URL of the data as the first argument, and a schema specification string as the second. Optional schema-specific keyword arguments can be used to control the parsing and other options. This is equivalent to calling “read(url=src, schema=schema, ...)”.

As with the “read” method, the additional keyword arguments are specific to the given class or schema type.

Writing Out Phylogenetic Data

The Tree, TreeList, CharacterMatrix-derived (e.g., DnaCharacterMatrix, ProteinCharacterMatrix, StandardCharacterMatrix, etc.), and DataSet classes all support a “write” instance method for serialization of data to an external destination. This method takes two mandatory keyword arguments:

  • Exactly one of the following to specify the destination:
      - a path to a file (specified using the keyword argument “path”)
      - a file or a file-like object opened for writing (specified using the keyword argument "file")
  • A “schema specification string” given by the keyword argument “schema”, to identify the schema or format for the output.

Alternatively, Tree, TreeList, CharacterMatrix-derived, or DataSet objects may be represented as a string by calling the “as_string()” method, which requires one mandatory argument, “schema”, giving the “schema specification string” that identifies the format of the output.

In either case, the “schema specification string” can be one of: “fasta”, “newick”, “nexus”, “nexml”, or “phylip”.

For example:

tree.write(path="output.tre", schema="newick")
with open("output.xml", "w") as dest:
    tree_list.write(file=dest, schema="nexml")
print(dna_character_matrix.as_string(schema="fasta"))

As with the “get” and “read” methods, further keyword arguments can be specified to control behavior. These are covered in detail in the “DendroPy Schemas: Phylogenetic and Evolutionary Biology Data Formats” section.

Note

The Tree, TreeList, CharacterMatrix-derived, and DataSet classes also support a “write_to_*()” family of instance methods that can be seen as specializations of the “write” method for various types of destinations:

write_to_stream(dest, schema, **kwargs)
Takes a file or file-like object opened for writing the data as the first argument, and a string specifying the schema as the second.
write_to_path(dest, schema, **kwargs)
Takes a string specifying the path to the file as the first argument, and a string specifying the schema as the second.
as_string(schema, **kwargs)
Takes a string specifying the schema as the first argument, and returns a string containing the formatted-representation of the data.
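As a minimal sketch of how these output routes relate, the following writes the same tree to an in-memory file-like object with “write(file=...)” and to a string with “as_string()”, then parses the string back. It assumes DendroPy 4 is installed; the guarded import and the helper serialize_both_ways are illustrative additions rather than part of DendroPy, and no particular output text is claimed.

```python
import io

try:
    import dendropy  # third-party package, not part of the standard library
except ImportError:
    dendropy = None  # lets the sketch load where DendroPy is not installed


def serialize_both_ways(newick):
    """Parse a Newick string, then serialize via write(file=...) and as_string()."""
    tree = dendropy.Tree.get(data=newick, schema="newick")
    buffer = io.StringIO()  # any file-like object opened for writing will do
    tree.write(file=buffer, schema="newick")
    return buffer.getvalue(), tree.as_string(schema="newick")


if dendropy is not None:
    written, as_text = serialize_both_ways("((A,B),(C,D));")
    # Both routes are expected to produce the same Newick text.
    print(written == as_text)
    # Reading the string back should recover the same taxon labels.
    reparsed = dendropy.Tree.get(data=as_text, schema="newick")
    print(sorted(reparsed.taxon_namespace.labels()))
```

Using io.StringIO here avoids touching the filesystem; a file opened with open(..., "w") can be passed to “write(file=...)” in the same way.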