rapidtrees 0.2.2

Fast pairwise tree distance calculations (Robinson-Foulds, Weighted RF, Kuhner-Felsenstein) for phylogenetic trees
Documentation
# rapidtrees CLI

Compute pairwise tree distances (Robinson–Foulds, weighted RF, Kuhner–Felsenstein) from BEAST/NEXUS `.trees` files and write a labeled distance matrix.

## Performance ZIKA dataset (283 taxa, 4000 trees, ~8M comparisons)

- **Robinson-Foulds (RF)**: ~3.5s total → **~2.3M comparisons/sec**
- **Weighted RF**: ~3.5s total → **~2.3M comparisons/sec**
- **Kuhner-Felsenstein (KF)**: ~3.5s total → **~2.3M comparisons/sec**

## Install / Build

Requirements: Rust toolchain (stable). Then build the binary:

> [!NOTE]
> Install the Rust toolchain from https://rustup.rs/ if you don't have it yet.
>
> ```bash
> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
> ```

Clone the repository:

```bash
git clone https://github.com/Joon-Klaps/rapidtrees.git
```

Build the project:

```bash
cd rapidtrees
cargo build --release
```

The executable will be at `target/release/rapidtrees`.

## Usage

```bash
# if rapidtrees is not in your PATH, use the full path, e.g. ./target/release/rapidtrees
rapidtrees \
  --input <path/to/file.trees> \
  --output <path/to/output.tsv[.gz]> \
  [--burnin-trees <N>] \
  [--burnin-states <STATE>] \
  [--use-real-taxa] \
  [--metric rf|weighted|kf] \
  [-q|--quiet]
```

Flags and options:

- `-i, --input <INPUT>`: Path to BEAST `.trees` (NEXUS) file.
- `-o, --output <OUTPUT>`: Output path for the TSV distance matrix. If the path ends with `.gz` it will be gzip-compressed. Use `-` to write to stdout (uncompressed).
- `-t, --burnin-trees <N>`: Drop the first N trees (default: 0).
- `-s, --burnin-states <STATE>`: Keep only trees with `STATE_ > STATE` (default: 0).
- `--use-real-taxa`: Map numeric taxon IDs to labels using the TRANSLATE block if present.
- `--metric <rf|weighted|kf>`: Choose the distance metric (default: `rf`), weighted referring to weighted RF.
- `-q, --quiet`: Suppress progress messages on stdout. Errors still go to stderr.

The output is a square TSV matrix where both the header row and the first column are tree names, constructed as `<file_basename>_tree_STATE<state>`. When writing to stdout (`-o -`), the matrix is printed to stdout, allowing easy piping.

## Examples

- Compute RF matrix and write to gzipped file:

```bash
rapidtrees \
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv.gz \
  --metric rf

# Reading in beast 0.003s
# Read in 162 taxons for 21 trees
# Creating tree bit snapshots 0.002s
# Determining distances using RF for 210 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
```

- Apply burn-in by tree count and state:

```bash
rapidtrees \                                                                                                      2 ↵
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv \
  -t 2 \

# Reading in beast 0.003s
# Read in 162 taxons for 19 trees
# Creating tree bit snapshots 0.001s
# Determining distances using RF for 171 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
```

## Performance notes

- Trees are parsed once. Bitset snapshots are built once and reused for pairwise comparisons. Parallelism is provided by `rayon`.
- Weighted RF and KF produce floating-point matrices; RF produces integer matrices.

## Benchmarks

We provide a benchmark suite to evaluate both memory usage and runtime performance for varying dataset sizes.

This will output a table showing estimated memory usage, wall-clock time, and CPU time for pairwise distance calculations.

Various output (on a MacBook Pro M1):

| Taxa (N) | Trees (T) | Combinations | Est. Memory | Actual Memory   | Wall Time        | CPU Time         |
| -------- | --------- | ------------ | ----------- | --------------- | ---------------- | ---------------- |
| 10       | 100       | 10.0K        | 51.56 KB    | 64.00 KB        | 391.71 µs        | 1.77 ms          |
| 10       | 1000      | 1.0M         | 515.62 KB   | 448.00 KB       | 1.16 ms          | 9.52 ms          |
| 10       | 10000     | 100.0M       | 5.04 MB     | 5.02 MB         | 81.86 ms         | 757.66 ms        |
| 10       | 100000    | 10.0B        | 50.35 MB    | 52.42 MB        | 8.17 s (est)     | 1.27 min (est)   |
| 100      | 100       | 10.0K        | 481.25 KB   | 464.00 KB       | 230.54 µs        | 1.45 ms          |
| 100      | 1000      | 1.0M         | 4.70 MB     | 1.17 MB         | 42.00 ms         | 147.22 ms        |
| 100      | 10000     | 100.0M       | 47.00 MB    | 38.42 MB        | 1.78 s           | 14.75 s          |
| 100      | 100000    | 10.0B        | 469.97 MB   | 451.47 MB       | 2.77 min (est)   | 25.29 min (est)  |
| 500      | 100       | 10.0K        | 4.59 MB     | 80.00 KB        | 1.61 ms          | 14.18 ms         |
| 500      | 1000      | 1.0M         | 45.90 MB    | 25.95 MB        | 165.58 ms        | 1.46 s           |
| 500      | 10000     | 100.0M       | 458.98 MB   | 399.44 MB       | 19.75 s          | 2.87 min         |
| 500      | 100000    | 10.0B        | 4.48 GB     | 4.51 GB         | 31.75 min (est)  | 294.98 min (est) |
| 1000     | 100       | 10.0K        | 15.27 MB    | 5.30 MB         | 4.69 ms          | 44.25 ms         |
| 1000     | 1000      | 1.0M         | 152.71 MB   | 126.77 MB       | 660.03 ms        | 5.30 s           |
| 1000     | 10000     | 100.0M       | 1.49 GB     | 1.44 GB         | 1.13 min         | 9.72 min         |
| 1000     | 100000    | 10.0B        | 14.91 GB    | 9.41 GB         | 111.42 min (est) | 983.70 min (est) |
| 2000     | 100       | 10.0K        | 54.94 MB    | 34.56 MB        | 23.11 ms         | 182.28 ms        |
| 2000     | 1000      | 1.0M         | 549.44 MB   | 480.02 MB       | 2.89 s           | 19.48 s          |
| 2000     | 10000     | 100.0M       | 5.37 GB     | 5.10 GB         | 3.92 min         | 34.76 min        |
| 2000     | 100000    | 10.0B        | 53.66 GB    | Skipped (>30GB) | -                | -                |
| 5000     | 100       | 10.0K        | 316.63 MB   | 299.31 MB       | 270.13 ms        | 2.09 s           |
| 5000     | 1000      | 1.0M         | 3.09 GB     | 3.06 GB         | 27.61 s          | 4.05 min         |
| 5000     | 10000     | 100.0M       | 30.92 GB    | Skipped (>30GB) | -                | -                |
| 5000     | 100000    | 10.0B        | 309.21 GB   | Skipped (>30GB) | -                | -                |


## Troubleshooting

- If no trees are parsed, verify the input is a valid NEXUS `.trees` file and adjust `--burnin-*` settings.
- Use `-q` when writing to stdout and piping to other tools to suppress timing messages.
- For gzipped output, ensure the output filename ends with `.gz`.

---

## Python API

The package also provides Python bindings for easy integration into Python workflows.

### Installation

From source (requires Rust):

```bash
# Using pip (recommended)
pip install -e .

# Or using maturin directly
pip install maturin
maturin develop --release
```

### Quick Start

```python
import rapidtrees as rtd

# Compute Robinson-Foulds distances
tree_names, rf_matrix = rtd.pairwise_rf(
    paths=["file1.trees", "file2.trees"],
    burnin_trees=10,  # Skip first 10 trees from each file
    burnin_states=0,   # Skip trees with STATE < 0
    use_real_taxa=True # Use TRANSLATE block if available, set to true if multiple files are provided
)

# Compute Weighted RF distances (considers branch lengths)
tree_names, wrf_matrix = rtd.pairwise_weighted_rf(
    paths=["file1.trees"],
    burnin_trees=10
)

# Compute Kuhner-Felsenstein distances
tree_names, kf_matrix = rtd.pairwise_kf(
    paths=["file1.trees"],
    burnin_trees=10
)

# Output is a list of tree names and a 2D distance matrix
print(f"Computed distances for {len(tree_names)} trees")
print(f"RF distance between tree 0 and 1: {rf_matrix[0][1]}")
```