SuchTree

In this article, we'll look at some basic uses of SuchTree's SuchTree class :

Loading tree data from files
Inspecting taxa names
Calculating patrsitic distances between pairs of taxa
Calculating lots of patristic distances
Loading tree data from URLs
Sampling patristic distances from very large trees
Parallel processing with SuchTree

As a bonus, we'll also cover plotting trees with toytree, which is especiall useful in notebook enviornemnts. To run this notebook, you will need to have the following python packages installed :

SuchTree
pandas
cython
scipy
numpy
matplotlib
seaborn
fastcluster
dendropy
toytree
jupyter

If you want to run this locally, I recommend following the Jupyter Lab documentation.

The Internet is full of opinions about how to set up your python environment. You should find one that works for you, but this guide is as good as any to get you started. I start off by supressing some warnings that come out of scipy's hierarchal clustering function. Feel free to leave them turned on if you enjoy pendantic notifications. I'm going to assume that you are running this notebook from its locations in a local copy of the SuchTree repository for any local file paths.

%config InlineBackend.figure_format='retina'

from SuchTree import SuchTree, SuchLinkedTrees
from numpy import zeros, array
import seaborn
import pandas
import toytree
import random
import warnings

from scipy.cluster.hierarchy import ClusterWarning

warnings.simplefilter( 'ignore', UserWarning )
warnings.simplefilter( 'ignore', FutureWarning )

Let's have a look at some example data. Here is a tree of cichlid fishes from my dissertation :

tree = toytree.tree( '../../data/bigtrees/host.tree' )

canvas, axes, mark = tree.draw( tree_style='d', tip_labels_align=True )

Loading tree data into SuchTree is pretty simple -- just give it a path to a valid Newick file.

T = SuchTree( '../../data/bigtrees/host.tree' )

The SuchTree object has a dictionary called leaves that maps leaf names onto their node ids. We'll make extensive use of this as we put the utility through its paces.

T.leaves

{'Tropheus_moorii': 0,
 'Lobochilotes_labiatus': 2,
 'Tanganicodus_irsacae': 4,
 'Cyprichromis_coloratus': 6,
 'Haplotaxodon_microlepis': 8,
 'Perissodus_microlepis': 10,
 'Plecodus_straeleni': 12,
 'Xenotilapia_flavipinnis': 14,
 'Triglachromis_otostigma': 16,
 'Reganochromis_calliurus': 18,
 'Trematochromis_benthicola': 20,
 'Lepidiolamprologus_profundicola': 22,
 'Neolamprologus_buescheri': 24,
 'Chalinochromis_brichardi': 26}

Calculating distances

SuchTree has two ways to calculate distances; one pair a time, or in batches. Batches are more efficient because it does each calculation without the interpreter's overhead.

Here's how to measure distances one at a time :

a = random.choice( list( T.leafs.values() ) )
b = random.choice( list( T.leafs.values() ) )

d = T.distance( a, b )

print( 'taxon 1  : %d' % a )
print( 'taxon 2  : %d' % b )
print( 'distance : %f' % d )

taxon 1  : 12
taxon 2  : 26
distance : 0.388425

The distance() function will accept either node ids (which are integers), or taxon names (which are strings).

a = random.choice( list( T.leafs.keys() ) )
b = random.choice( list( T.leafs.keys() ) )

d = T.distance( a, b )

print( 'taxon 1  : %s' % a )
print( 'taxon 2  : %s' % b )
print( 'distance : %f' % d )

taxon 1  : Reganochromis_calliurus
taxon 2  : Haplotaxodon_microlepis
distance : 0.270743

You can loop over all of the distances one at a time to construct a distance matrix...

D1 = zeros( ( len(T.leaves),len(T.leaves) ) )
for i,a in enumerate(  T.leafs.values() ) :
    for j,b in enumerate( T.leafs.values() ) :
        D1[i,j] = T.distance( a, b )

Let's look at the distance matrix using a nice seaborn clustermap plot.

It's worth noting that seaborn is using scipy's cluster.hierarchy functions to build those dendrograms from the distance matrix. They aren't going to have exactly the same topology as the input tree, which was build with RAxML.

df = pandas.DataFrame( D1, index=[ i.replace('_',' ') for i in T.leaves.keys() ] )
seaborn.clustermap( df, xticklabels=False, cmap='viridis', figsize=(6,4) )

<seaborn.matrix.ClusterGrid at 0x70655aabb4d0>

png

To calculate the distances in a batch, we can use the distances() function, which takes an \(n \times 2\) array of node ids (make sure your array is representing them as integers).

D2_list = []
for i,a in enumerate(T.leaves.values()) :
    for j,b in enumerate( T.leaves.values() ) :
        D2_list.append( ( a, b ) )
D2_array = array( D2_list )

print( D2_array.shape )
print( D2_array[:5] )

(196, 2)
[[0 0]
 [0 2]
 [0 4]
 [0 6]
 [0 8]]

D2 = T.distances( D2_array )

D2 = D2.reshape( ( len(T.leafs), len(T.leafs) ) )

We should get the same distance matrix and clustermap as before.

df = pandas.DataFrame( D2, index=[ i.replace('_',' ') for i in T.leaves.keys() ] )
seaborn.clustermap( df, xticklabels=False, cmap='viridis', figsize=(6,4) )

<seaborn.matrix.ClusterGrid at 0x70655a702db0>

png

If you want to use taxon names instead, distances_by_name() accepts an \(n \times 2\) list of tuples of taxon names, and looks up the node ids for you.

Loading data from URLs

SuchTree can also import data from the internets. Here is the distance matrix for the penguins, from the Global Phylogeny of Birds.

T3 = SuchTree( 'https://data.vertlife.org/birdtree/PatchClade/Stage2/set10/Spheniscidae.tre' )

D3_list = []
for i,a in enumerate(T3.leafs.values()) :
    for j,b in enumerate( T3.leafs.values() ) :
        D3_list.append( ( a, b ) )
D3_array = array( D3_list )
D3 = T3.distances( D3_array )
D3 = D3.reshape( ( len(T3.leafs), len(T3.leafs) ) )

df = pandas.DataFrame( D3, index=[ i.replace('_',' ') for i in T3.leafs.keys() ] )
seaborn.clustermap( df, xticklabels=False, cmap='viridis', figsize=(6,4) )

<seaborn.matrix.ClusterGrid at 0x70654b07d370>

png

Comparing the topologies of two large trees

So far, we haven't done anything you couldn't do with other phylogenetics packages. SuchTree really shines when you have to do a lot of distance calculations on very large trees.

Here, we use SuchTree to compare the topology of a two trees containing the taxa but constructed with different methods (FastTree and neighbor joining). One million random pairs are sampled from each tree, and the distances compared.

T1 = SuchTree( 'https://raw.githubusercontent.com/ryneches/SuchTree/refs/heads/main/data/bigtrees/ml.tree' )
T2 = SuchTree( 'https://raw.githubusercontent.com/ryneches/SuchTree/refs/heads/main/data/bigtrees/nj.tree' )

print( 'nodes : %d, leafs : %d' % ( T1.length, len(T1.leafs) ) )
print( 'nodes : %d, leafs : %d' % ( T2.length, len(T2.leafs) ) )

nodes : 108653, leafs : 54327
nodes : 108653, leafs : 54327

N = 1000000

v = list( T1.leafs.keys() )

pairs = []
for i in range(N) :
    pairs.append( ( random.choice( v ), random.choice( v ) ) )

%time D1, D2 = T1.distances_by_name( pairs ), T2.distances_by_name( pairs )

CPU times: user 34.4 s, sys: 142 ms, total: 34.6 s
Wall time: 34.4 s

The distances() function, which uses node ids rather than taxa names, would be a little bit faster. However, because the trees have different topologies, the taxa have different node ids in each tree. SuchTree's distances_by_name() function untangles the leaf name to leaf node id mappings for you.

df = pandas.DataFrame( { 'ML' : D1, 'neighbor joining' : D2 } )

g = seaborn.jointplot( x='ML', y='neighbor joining', data=df,
                       alpha=0.3, linewidth=0, s=1,
                       marginal_kws={ 'bins': 64, 'linewidth' : 0 } )

g.ax_joint.set_xticks( [ 0, 0.5, 1.0, 1.5, 2.0 ] )

[<matplotlib.axis.XTick at 0x70654bdfad50>,
 <matplotlib.axis.XTick at 0x70654bc6aae0>,
 <matplotlib.axis.XTick at 0x70654bc6b530>,
 <matplotlib.axis.XTick at 0x70654bc6b3b0>,
 <matplotlib.axis.XTick at 0x70654bc7e2a0>]

png

from scipy.stats import spearmanr, kendalltau, pearsonr

print( 'Spearman\'s rs : %0.3f' % spearmanr(  D1, D2 )[0] )
print( 'Kendall\'s tau : %0.3f' % kendalltau( D1, D2 )[0] )
print( 'Pearson\'s r   : %0.3f' % pearsonr(   D1, D2 )[0] )

Spearman's rs : 0.961
Kendall's tau : 0.824
Pearson's r   : 0.969

Parallel processing with SuchTree

Another advantage of SuchTree's support for performing batches of distance calculations is that these calculations can run outside of Python's global interpreter lock. This makes it possible to parallelize with Python's Threads or multiprocessing libraries. On most Unix-like operating systems, multiprocessing.Pool uses a copy-on-write scheme for data shared by jobs. Internally, SuchTree stores tree topology data as an immutable block of memory, so no copying will take place.

For better or worse, Python gives users a lot of different strategies for parallel processing. SuchTree is designed to work seamlessly with all of them.

Approach	Parallelism	Memory	Overhead	Best for
Threading	No (GIL)	Shared	Low	I/O bound
ProcessPoolExecutor	Yes	Copied	High (pickle)	Small data
Pool with fork	Yes	Shared (CoW)	Minimal	Immutable data
Pool with spawn	Yes	Copied	High (pickle)	Windows

SuchTree intentionally does not allow the user to alter trees once they are created, and so distance calculations are always thread safe. This makes it possible to use only one instance of a tree for all threads (or processes or subinterpreters or whatever parallel processing scheme you choose). This gives you best chance of keeping the data within the CPU's L3 cache for maximum performance.

First, let's create a Cython function that calls SuchTree outside of the GIL.

%load_ext Cython

%%cython
import cython
from libc.math cimport sqrt

def correlation(double[:] x, double[:] y) :
    if x.shape[0] != y.shape[0] :
        raise ValueError( 'Arrays must have the same length' )
    if x.shape[0] < 2 :
        raise ValueError( 'Arrays must have at least 2 elements' )
    return _correlation( x, y )

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double _correlation( double[:] x, double[:] y ) nogil :
    cdef int n = x.shape[0]
    cdef int i
    cdef double r = 0.0
    cdef double xbar = 0.0
    cdef double ybar = 0.0
    cdef double sx = 0.0
    cdef double sy = 0.0
    cdef double dx, dy

    # Compute means
    for i in range(n):
        xbar += x[i]
        ybar += y[i]
    xbar /= n
    ybar /= n

    # Compute standard deviations and correlation in one pass
    for i in range(n):
        dx = x[i] - xbar
        dy = y[i] - ybar
        sx += dx * dx
        sy += dy * dy
        r += dx * dy

    sx = sqrt(sx)
    sy = sqrt(sy)

    # Handle zero variance case
    if sx == 0.0 or sy == 0.0:
        return 0.0

    return r / (sx * sy)

Next, we'll load use Pool.map() from Python's multiprocessing library to run our processes in parallel.

from multiprocessing import Pool

n = 4   # number of processes
m = 12  # number of work units
v = list( T1.leaves.keys() )


def worker_task( args ) :
    block_idx, v, T1_leafs, T2_leafs = args
    random.seed( block_idx )  # Different seed per block

    pairs = []
    for _ in range( 100000 ) :
        pairs.append( ( T1_leafs[ random.choice(v) ],
                        T2_leafs[ random.choice(v) ] ) )

    task = array( pairs, dtype=int )
    D1 = T1.distances(task)
    D2 = T2.distances(task)
    return correlation( D1, D2 )

print( 'building work blocks...' )
work_items = [ (i, v, dict(T1.leaves), dict(T2.leaves) ) 
               for i in range(m) ]

print( 'processing...' )
with Pool( processes=n ) as pool :
    results = pool.map( worker_task, work_items )

print( '\nResults :' )
for r in results :
    print(r)

building work blocks...
processing...

Results :
0.7835680969259644
0.7848744370724512
0.7844003849954652
0.7850059312425519
0.7828163187067951
0.7839538411961768
0.783997570675356
0.7853124277733835
0.7865262108378938
0.7851786514073134
0.7847794533814942
0.7842134015490633

Adapting algorithms to parallel processing is a complex topic, and this is only a toy example. Hopefully it is useful, but don't take this as the final word. SuchTree is designed to sidestep many of the problems commonly encountered in parallel computing, but that doesn't mean you won't discover problems anyway!