Neighbor-Joining tree builder and triangular distance matrix utilities.
This module provides a compact representation for pairwise distances (TriMat), a simple tree node structure used to represent Neighbor-Joining (NJ) trees (NJNode), a robust, borrow-safe implementation of the Neighbor-Joining algorithm (neighbor_joining), and a helper to build a TriMat from a multiple sequence alignment (tri_from_msa).
Types
-
TriMat
- Compact triangular (upper-triangular stored as a flat Vec) distance matrix with associated node names.
- Access via
get(i, j)andset(i, j, value). Diagonal distances are always treated as 0. The internal index mapping follows a strict triangular packing so only i != j is stored. - Construct with
TriMat::with_names(names);dim()returns the number of taxa.
-
NJNode
- Binary tree node representing either a leaf (no children) or an internal node with exactly two children.
- Stores child branch lengths on the parent as
left_lenandright_len. to_newick(hide_internal)emits a Newick string with branch lengths formatted to 3 decimal places. Whenhide_internalis true the internal node labels are omitted.
Functions
-
neighbor_joining(dist: TriMat) -> Result<NJNode, String>
- Performs the Neighbor-Joining algorithm on the provided triangular
distance matrix and returns the inferred tree as an
NJNode. - The implementation is iterator-oriented and avoids borrowing issues by moving and reorganizing node ownership via a HashMap of indices -> nodes.
- Uses a BitVec to track active taxa and maintains a
row_sumsvector for efficient Q-matrix computation. - Branch lengths computed for joins are clamped to >= 0.0 to avoid negative lengths caused by rounding or non-ultrametric inputs.
- Returns an
Errif the input matrix is empty or if an internal consistency check fails (e.g., no pair to join or unexpected number of remaining active nodes). - Complexity: O(n2) memory for the distance representation and O(n3) worst-case time if implemented naively; this implementation reduces repeated work via row sums but still iterates over active pairs while choosing minimal Q entries.
- Performs the Neighbor-Joining algorithm on the provided triangular
distance matrix and returns the inferred tree as an
-
tri_from_msa(msa: &[String], names: Option<&[String]>) -> TriMat
- Builds a
TriMatfrom a multiple sequence alignment (MSA). - Pairwise distance = (number of differing positions) / (number of
compared positions), ignoring any column where either sequence has a
gap (
'-'). If two sequences share no compared positions the distance is set to 0.0. - If
namesis omitted, default names are generated asSeq0,Seq1, ....
- Builds a
Usage notes and invariants
- TriMat stores only i != j entries;
get(i,i)is defined to be 0.0. - Indices used with TriMat must be valid (0..dim()).
setsilently ignores attempts to set diagonal entries. - neighbor_joining consumes the distance matrix (it is taken by value) and returns a tree whose leaves preserve the original node name strings.
- The NJ implementation expects a symmetric distance matrix; asymmetries will lead to undefined/incorrect results.
- Branch lengths are clamped to non-negative values to keep the tree valid in downstream tools; if you require different handling of negative lengths adjust the implementation accordingly.
Examples
Construct a TriMat from an MSA and produce a Newick tree:
let msa = vec!["ACG".into(), "ATG".into(), "A-G".into()];
let tm = tri_from_msa(&msa, None);
let tree = neighbor_joining(tm).expect("NJ failed");
let newick = tree.to_newick(true);
Errors & Panics
- Functions in this module return
Resultwhere appropriate to indicate recoverable errors (e.g., empty input). Internal indexing errors return descriptive strings rather than panicking. - The code assumes reasonably small
nsuch that triangular storage in a Vec is practical; very large numbers of taxa may cause memory pressure.
Testing
The accompanying unit tests exercise TriMat indexing, distance-building behavior with gaps, two- and three-taxon NJ behavior, and Newick output formatting. They also validate that branch lengths are non-negative after clamping.