nj 0.0.3

Neighbor Joining, fast phylogenetic tree construction. Library and CLI.
Documentation

Neighbor-Joining tree builder and triangular distance matrix utilities.

This module provides a compact representation for pairwise distances (TriMat), a simple tree node structure used to represent Neighbor-Joining (NJ) trees (NJNode), a robust, borrow-safe implementation of the Neighbor-Joining algorithm (neighbor_joining), and a helper to build a TriMat from a multiple sequence alignment (tri_from_msa).

Types

  • TriMat

    • Compact triangular (upper-triangular stored as a flat Vec) distance matrix with associated node names.
    • Access via get(i, j) and set(i, j, value). Diagonal distances are always treated as 0. The internal index mapping follows a strict triangular packing so only i != j is stored.
    • Construct with TriMat::with_names(names); dim() returns the number of taxa.
  • NJNode

    • Binary tree node representing either a leaf (no children) or an internal node with exactly two children.
    • Stores child branch lengths on the parent as left_len and right_len.
    • to_newick(hide_internal) emits a Newick string with branch lengths formatted to 3 decimal places. When hide_internal is true the internal node labels are omitted.

Functions

  • neighbor_joining(dist: TriMat) -> Result<NJNode, String>

    • Performs the Neighbor-Joining algorithm on the provided triangular distance matrix and returns the inferred tree as an NJNode.
    • The implementation is iterator-oriented and avoids borrowing issues by moving and reorganizing node ownership via a HashMap of indices -> nodes.
    • Uses a BitVec to track active taxa and maintains a row_sums vector for efficient Q-matrix computation.
    • Branch lengths computed for joins are clamped to >= 0.0 to avoid negative lengths caused by rounding or non-ultrametric inputs.
    • Returns an Err if the input matrix is empty or if an internal consistency check fails (e.g., no pair to join or unexpected number of remaining active nodes).
    • Complexity: O(n2) memory for the distance representation and O(n3) worst-case time if implemented naively; this implementation reduces repeated work via row sums but still iterates over active pairs while choosing minimal Q entries.
  • tri_from_msa(msa: &[String], names: Option<&[String]>) -> TriMat

    • Builds a TriMat from a multiple sequence alignment (MSA).
    • Pairwise distance = (number of differing positions) / (number of compared positions), ignoring any column where either sequence has a gap ('-'). If two sequences share no compared positions the distance is set to 0.0.
    • If names is omitted, default names are generated as Seq0, Seq1, ....

Usage notes and invariants

  • TriMat stores only i != j entries; get(i,i) is defined to be 0.0.
  • Indices used with TriMat must be valid (0..dim()). set silently ignores attempts to set diagonal entries.
  • neighbor_joining consumes the distance matrix (it is taken by value) and returns a tree whose leaves preserve the original node name strings.
  • The NJ implementation expects a symmetric distance matrix; asymmetries will lead to undefined/incorrect results.
  • Branch lengths are clamped to non-negative values to keep the tree valid in downstream tools; if you require different handling of negative lengths adjust the implementation accordingly.

Examples

Construct a TriMat from an MSA and produce a Newick tree:

let msa = vec!["ACG".into(), "ATG".into(), "A-G".into()];
let tm = tri_from_msa(&msa, None);
let tree = neighbor_joining(tm).expect("NJ failed");
let newick = tree.to_newick(true);

Errors & Panics

  • Functions in this module return Result where appropriate to indicate recoverable errors (e.g., empty input). Internal indexing errors return descriptive strings rather than panicking.
  • The code assumes reasonably small n such that triangular storage in a Vec is practical; very large numbers of taxa may cause memory pressure.

Testing

The accompanying unit tests exercise TriMat indexing, distance-building behavior with gaps, two- and three-taxon NJ behavior, and Newick output formatting. They also validate that branch lengths are non-negative after clamping.