rapidtrees CLI
Compute pairwise tree distances (Robinson–Foulds, weighted RF, Kuhner–Felsenstein) from BEAST/NEXUS .trees files and write a labeled distance matrix.
Performance ZIKA dataset (283 taxa, 4000 trees, ~8M comparisons)
- Robinson-Foulds (RF): ~3.5s total → ~2.3M comparisons/sec
- Weighted RF: ~3.5s total → ~2.3M comparisons/sec
- Kuhner-Felsenstein (KF): ~3.5s total → ~2.3M comparisons/sec
Install / Build
Requirements: Rust toolchain (stable). Then build the binary:
[!NOTE] Install the Rust toolchain from https://rustup.rs/ if you don't have it yet.
|
Clone the repository:
Build the project:
The executable will be at target/release/rapidtrees.
Usage
# if rapidtrees is not in your PATH, use the full path, e.g. ./target/release/rapidtrees
Flags and options:
-i, --input <INPUT>: Path to BEAST.trees(NEXUS) file.-o, --output <OUTPUT>: Output path for the TSV distance matrix. If the path ends with.gzit will be gzip-compressed. Use-to write to stdout (uncompressed).-t, --burnin-trees <N>: Drop the first N trees (default: 0).-s, --burnin-states <STATE>: Keep only trees withSTATE_ > STATE(default: 0).--use-real-taxa: Map numeric taxon IDs to labels using the TRANSLATE block if present.--metric <rf|weighted|kf>: Choose the distance metric (default:rf), weighted referring to weighted RF.-q, --quiet: Suppress progress messages on stdout. Errors still go to stderr.
The output is a square TSV matrix where both the header row and the first column are tree names, constructed as <file_basename>_tree_STATE<state>. When writing to stdout (-o -), the matrix is printed to stdout, allowing easy piping.
Examples
- Compute RF matrix and write to gzipped file:
# Reading in beast 0.003s
# Read in 162 taxons for 21 trees
# Creating tree bit snapshots 0.002s
# Determining distances using RF for 210 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
- Apply burn-in by tree count and state:
# Reading in beast 0.003s
# Read in 162 taxons for 19 trees
# Creating tree bit snapshots 0.001s
# Determining distances using RF for 171 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
Performance notes
- Trees are parsed once. Bitset snapshots are built once and reused for pairwise comparisons. Parallelism is provided by
rayon. - Weighted RF and KF produce floating-point matrices; RF produces integer matrices.
Benchmarks
We provide a benchmark suite to evaluate both memory usage and runtime performance for varying dataset sizes.
This will output a table showing estimated memory usage, wall-clock time, and CPU time for pairwise distance calculations.
Various output (on a MacBook Pro M1):
| Taxa (N) | Trees (T) | Combinations | Est. Memory | Actual Memory | Wall Time | CPU Time |
|---|---|---|---|---|---|---|
| 10 | 100 | 10.0K | 51.56 KB | 64.00 KB | 391.71 µs | 1.77 ms |
| 10 | 1000 | 1.0M | 515.62 KB | 448.00 KB | 1.16 ms | 9.52 ms |
| 10 | 10000 | 100.0M | 5.04 MB | 5.02 MB | 81.86 ms | 757.66 ms |
| 10 | 100000 | 10.0B | 50.35 MB | 52.42 MB | 8.17 s (est) | 1.27 min (est) |
| 100 | 100 | 10.0K | 481.25 KB | 464.00 KB | 230.54 µs | 1.45 ms |
| 100 | 1000 | 1.0M | 4.70 MB | 1.17 MB | 42.00 ms | 147.22 ms |
| 100 | 10000 | 100.0M | 47.00 MB | 38.42 MB | 1.78 s | 14.75 s |
| 100 | 100000 | 10.0B | 469.97 MB | 451.47 MB | 2.77 min (est) | 25.29 min (est) |
| 500 | 100 | 10.0K | 4.59 MB | 80.00 KB | 1.61 ms | 14.18 ms |
| 500 | 1000 | 1.0M | 45.90 MB | 25.95 MB | 165.58 ms | 1.46 s |
| 500 | 10000 | 100.0M | 458.98 MB | 399.44 MB | 19.75 s | 2.87 min |
| 500 | 100000 | 10.0B | 4.48 GB | 4.51 GB | 31.75 min (est) | 294.98 min (est) |
| 1000 | 100 | 10.0K | 15.27 MB | 5.30 MB | 4.69 ms | 44.25 ms |
| 1000 | 1000 | 1.0M | 152.71 MB | 126.77 MB | 660.03 ms | 5.30 s |
| 1000 | 10000 | 100.0M | 1.49 GB | 1.44 GB | 1.13 min | 9.72 min |
| 1000 | 100000 | 10.0B | 14.91 GB | 9.41 GB | 111.42 min (est) | 983.70 min (est) |
| 2000 | 100 | 10.0K | 54.94 MB | 34.56 MB | 23.11 ms | 182.28 ms |
| 2000 | 1000 | 1.0M | 549.44 MB | 480.02 MB | 2.89 s | 19.48 s |
| 2000 | 10000 | 100.0M | 5.37 GB | 5.10 GB | 3.92 min | 34.76 min |
| 2000 | 100000 | 10.0B | 53.66 GB | Skipped (>30GB) | - | - |
| 5000 | 100 | 10.0K | 316.63 MB | 299.31 MB | 270.13 ms | 2.09 s |
| 5000 | 1000 | 1.0M | 3.09 GB | 3.06 GB | 27.61 s | 4.05 min |
| 5000 | 10000 | 100.0M | 30.92 GB | Skipped (>30GB) | - | - |
| 5000 | 100000 | 10.0B | 309.21 GB | Skipped (>30GB) | - | - |
Troubleshooting
- If no trees are parsed, verify the input is a valid NEXUS
.treesfile and adjust--burnin-*settings. - Use
-qwhen writing to stdout and piping to other tools to suppress timing messages. - For gzipped output, ensure the output filename ends with
.gz.
Python API
The package also provides Python bindings for easy integration into Python workflows.
Installation
From source (requires Rust):
# Using pip (recommended)
# Or using maturin directly
Quick Start
# Compute Robinson-Foulds distances
, =
# Compute Weighted RF distances (considers branch lengths)
, =
# Compute Kuhner-Felsenstein distances
, =
# Output is a list of tree names and a 2D distance matrix