rapidtrees 0.2.3

Fast pairwise tree distance calculations (Robinson-Foulds, Weighted RF, Kuhner-Felsenstein) for phylogenetic trees
Documentation

🌲⚡ rapidtrees


🗺️ Overview

rapidtrees computes pairwise tree distances from BEAST/NEXUS .trees files and writes a labeled distance matrix. Three metrics are supported:

Metric Flag Output Description
Robinson–Foulds --metric rf integer Symmetric difference of bipartitions
Weighted RF --metric weighted float Branch-length-weighted bipartition difference
Kuhner–Felsenstein --metric kf float Euclidean distance on branch lengths

✨ Why rapidtrees?

  • 🦀 Rust core — zero-overhead bitset operations with a cache-friendly memory layout
  • 🔀 Parallel by default — powered by rayon, automatically scales across all cores
  • 🐍 Python bindings — drop into any Python/NumPy workflow via PyO3
  • 📦 No Rust toolchain required — pre-built wheels on PyPI for Linux, macOS, and Windows
  • 🗜️ Gzip output — stream directly to .tsv.gz without a separate compression step

🚀 Performance

Benchmarked on a ZIKA dataset (283 taxa · 4 000 trees · ~8 M comparisons):

Metric Total time Throughput
Robinson-Foulds ~3.5 s ~2.3 M comparisons/sec
Weighted RF ~3.5 s ~2.3 M comparisons/sec
Kuhner-Felsenstein ~3.5 s ~2.3 M comparisons/sec

🔧 Installing

🐍 Python (PyPI) — recommended

Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain needed.

pip install rapidtrees

🦀 CLI (crates.io)

Install the standalone command-line binary. Requires the Rust toolchain.

cargo install rapidtrees

🛠️ From source

git clone https://github.com/Joon-Klaps/rapidtrees.git
cd rapidtrees
cargo build --release   # CLI binary → target/release/rapidtrees
pip install -e .        # Python bindings (requires Rust toolchain + maturin)

💻 Usage

rapidtrees \
  --input <path/to/file.trees> \
  --output <path/to/output.tsv[.gz]> \
  [--burnin-trees <N>] \
  [--burnin-states <STATE>] \
  [--use-real-taxa] \
  [--metric rf|weighted|kf] \
  [-q|--quiet]
Flag Description
-i, --input <INPUT> Path to BEAST .trees (NEXUS) file
-o, --output <OUTPUT> Output path. Use .gz suffix for gzip compression; - for stdout
-t, --burnin-trees <N> Drop the first N trees (default: 0)
-s, --burnin-states <STATE> Keep only trees with STATE > STATE (default: 0)
--use-real-taxa Map numeric taxon IDs via the TRANSLATE block
--metric <rf|weighted|kf> Distance metric (default: rf)
-q, --quiet Suppress progress messages (errors still go to stderr)

The output is a square TSV matrix where both the header row and first column contain tree names formatted as <file_basename>_tree_STATE<state>. Use -o - to write to stdout for easy piping.

💡 Examples

rapidtrees \
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv.gz \
  --metric rf

# Reading in beast 0.003s
# Read in 162 taxons for 21 trees
# Creating tree bit snapshots 0.002s
# Determining distances using RF for 210 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
rapidtrees \
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv \
  -t 2

# Reading in beast 0.003s
# Read in 162 taxons for 19 trees
# Creating tree bit snapshots 0.001s
# Determining distances using RF for 171 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s

🐍 Python API

rapidtrees ships Python bindings for seamless integration into Python workflows.

import rapidtrees as rtd

# Robinson-Foulds distances
tree_names, rf_matrix = rtd.pairwise_rf(
    paths=["file1.trees", "file2.trees"],
    burnin_trees=10,    # skip first 10 trees per file
    burnin_states=0,    # skip trees with STATE < 0
    use_real_taxa=True  # use TRANSLATE block when merging multiple files
)

# Weighted RF (considers branch lengths)
tree_names, wrf_matrix = rtd.pairwise_weighted_rf(
    paths=["file1.trees"],
    burnin_trees=10
)

# Kuhner-Felsenstein distances
tree_names, kf_matrix = rtd.pairwise_kf(
    paths=["file1.trees"],
    burnin_trees=10
)

print(f"Computed distances for {len(tree_names)} trees")
print(f"RF distance between tree 0 and 1: {rf_matrix[0][1]}")

⏱️ Benchmarks

Benchmarks were run on a MacBook Pro M1. Trees are parsed once and bitset snapshots are reused across all pairwise comparisons. Parallelism is provided by rayon — no manual thread management needed.

Taxa (N) Trees (T) Combinations Est. Memory Actual Memory Wall Time CPU Time
10 100 10.0K 51.56 KB 64.00 KB 391.71 µs 1.77 ms
10 1000 1.0M 515.62 KB 448.00 KB 1.16 ms 9.52 ms
10 10000 100.0M 5.04 MB 5.02 MB 81.86 ms 757.66 ms
10 100000 10.0B 50.35 MB 52.42 MB 8.17 s (est) 1.27 min (est)
100 100 10.0K 481.25 KB 464.00 KB 230.54 µs 1.45 ms
100 1000 1.0M 4.70 MB 1.17 MB 42.00 ms 147.22 ms
100 10000 100.0M 47.00 MB 38.42 MB 1.78 s 14.75 s
100 100000 10.0B 469.97 MB 451.47 MB 2.77 min (est) 25.29 min (est)
500 100 10.0K 4.59 MB 80.00 KB 1.61 ms 14.18 ms
500 1000 1.0M 45.90 MB 25.95 MB 165.58 ms 1.46 s
500 10000 100.0M 458.98 MB 399.44 MB 19.75 s 2.87 min
500 100000 10.0B 4.48 GB 4.51 GB 31.75 min (est) 294.98 min (est)
1000 100 10.0K 15.27 MB 5.30 MB 4.69 ms 44.25 ms
1000 1000 1.0M 152.71 MB 126.77 MB 660.03 ms 5.30 s
1000 10000 100.0M 1.49 GB 1.44 GB 1.13 min 9.72 min
1000 100000 10.0B 14.91 GB 9.41 GB 111.42 min (est) 983.70 min (est)
2000 100 10.0K 54.94 MB 34.56 MB 23.11 ms 182.28 ms
2000 1000 1.0M 549.44 MB 480.02 MB 2.89 s 19.48 s
2000 10000 100.0M 5.37 GB 5.10 GB 3.92 min 34.76 min
2000 100000 10.0B 53.66 GB Skipped (>30GB) - -
5000 100 10.0K 316.63 MB 299.31 MB 270.13 ms 2.09 s
5000 1000 1.0M 3.09 GB 3.06 GB 27.61 s 4.05 min
5000 10000 100.0M 30.92 GB Skipped (>30GB) - -
5000 100000 10.0B 309.21 GB Skipped (>30GB) - -

Note: Weighted RF and KF produce floating-point matrices; RF produces integer matrices.


🔍 Troubleshooting

  • No trees parsed? Verify the input is a valid NEXUS .trees file and adjust --burnin-* settings.
  • Piping to other tools? Use -q to suppress timing messages on stdout.
  • Gzipped output not working? Ensure the output filename ends with .gz.

⚖️ License

rapidtrees is provided under the MIT License.