Population Genetic Summary Statistics

The dendropy.calculate.popgenstat module provides functions that calculate some common population genetic summary statistics.

For example, given a DnaCharacterMatrix as an argument, the num_segregating_sites function returns the raw number of segregating sites, average_number_of_pairwise_differences returns the average number of pairwise differences, and nucleotide_diversity returns the nucleotide diversity.

More complex statistics are provided by the PopulationPairSummaryStatistics class. Objects of this class are instantatiated with two lists of DnaCharacterDataSequence objects as arguments, each representing a sample of DNA sequences drawn from two distinct but related populations. Once instantiated, the following attributes of the PopulationPairSummaryStatistics object are available:

average_number_of_pairwise_differences
The average number of pairwise differences between every sequence across both populations.
average_number_of_pairwise_differences_between
The average number of pairwise differences between every sequence between both populations.
average_number_of_pairwise_differences_within
The average number of pairwise differences between every sequence within each population.
average_number_of_pairwise_differences_net
The net number of pairwise differences.
num_segregating_sites
The number of segregating sites.
wattersons_theta
Watterson’s theta.
wakeleys_psi
Wakeley’s psi.
tajimas_d
Tajima’s D.

The following example calculates the suite of population genetic summary statistics for sequences drawn from two populations of sticklebacks. The original data consists of 23 sequences, with individuals from Eastern Pacific populations identified by their taxon labels beginning with “EPAC” and individuals from Western Pacific populations identified by their taxon labels beginning with “WPAC”. The taxon labels thus are used as the basis for sorting the sequences into the required lists of DnaCharacterDataSequence objects, p1 and p2.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import dendropy
from dendropy.calculate import popgenstat

seqs = dendropy.DnaCharacterMatrix.get(
        path="orti1994.nex",
        schema="nexus")
p1 = []
p2 = []
for idx, t in enumerate(seqs.taxon_namespace):
    if t.label.startswith('EPAC'):
        p1.append(seqs[t])
    else:
        p2.append(seqs[t])
pp = popgenstat.PopulationPairSummaryStatistics(p1, p2)

print('Average number of pairwise differences (total): %s' \
    % pp.average_number_of_pairwise_differences)
print('Average number of pairwise differences (between populations): %s' \
    % pp.average_number_of_pairwise_differences_between)
print('Average number of pairwise differences (within populations): %s' \
    % pp.average_number_of_pairwise_differences_within)
print('Average number of pairwise differences (net): %s' \
    % pp.average_number_of_pairwise_differences_net)
print('Number of segregating sites: %s' \
    % pp.num_segregating_sites)
print("Watterson's theta: %s" \
    % pp.wattersons_theta)
print("Wakeley's Psi: %s" \
    % pp.wakeleys_psi)
print("Tajima's D: %s" \
    % pp.tajimas_d)

Lines 6-12 build up the two lists of DnaCharacterDataSequence objects by sorting the original sequences into their source populations based on the taxon label (with operational taxonomic units with labels beginning with “EPAC” coming from the Eastern Pacific, and assigned to the list p1, while those that begin with “WPAC” coming from the Western Pacific, and assigned to the list p2). These lists are then passed as the instantiation arguments to the PopulationPairSummaryStatistics constructor in line 14. The calculations are performed immediately, and the results reported in the following lines.