# Ancestral Sequence Reconstruction (ASR) — porting treetime into phylo-rs
## Context
phylo-rs is a trait-based phylogenetics library with tree topology, branch lengths,
traversals, and distance metrics — but **no sequence/alignment/substitution-model
support at all**. The goal is to start incorporating methods from
[treetime](https://github.com/neherlab/treetime), beginning with its signature
capability: **GTR maximum-likelihood ancestral sequence reconstruction** (inferring
sequences at internal nodes given an alignment on the leaves + a tree).
Locked decisions (from clarification):
- **Both** marginal ML (Felsenstein up-pass + marginal down-pass posteriors) **and**
joint ML (Viterbi-style) in this first cut.
- **Generic `Alphabet` trait**, shipping **Nucleotide (4-state) + AminoAcid (20-state)**.
- Add **`nalgebra` (core only, no LAPACK/BLAS)** for eigendecomposition / `P(t)=exp(Qt)`.
FASTA via **hand-rolled parser, zero-dep**. **WASM-first is a hard constraint** —
everything must compile to `wasm32-unknown-unknown`.
- Reconstructed sequences returned in a **result struct keyed by `NodeID`**; the
`Node`/arena is **not** mutated.
## Why this design fits the codebase
- Each existing algorithm in `src/tree/distances.rs` (`RobinsonFoulds`,
`CopheneticDistance`/`PathFunction`) is a **trait with where-bounds + default-method
body**, blanket-impl'd for `SimpleRootedTree` in `src/tree.rs`, re-exported via the
prelude. ASR mirrors this exactly.
- `RobinsonFoulds::rf(&self,...)` proves the idiom: a `&self` method over a `Sync` tree
returning an owned result. ASR is read-only → owned `Reconstruction` keyed by NodeID.
- `postord_ids(root)` (leaves→root) = the Felsenstein pruning pass; `preord_ids(root)`
(root→leaves) = the marginal/traceback down-pass. Branch length `t` = `get_edge_weight(p,c)`.
- The `Node` arena offers only one scalar `zeta: Option<Z: Float>` per node — cannot hold
per-site state vectors / posterior profiles. So results live in an external struct, not
the arena. Widening `Node` generics would break every impl + serde + `from_newick`.
## Module layout (new)
Create `src/tree/asr/` (keeps "many small files"; trait-in-module / blanket-impl-in-`tree.rs`):
```
src/tree/asr.rs # module root: declares MarginalAsr + JointAsr traits; re-exports submodules
src/tree/asr/alphabet.rs # Alphabet trait + Nucleotide + AminoAcid + leaf profiles
src/tree/asr/gtr.rs # GtrModel<A>: Q from (pi,W); symmetric eigendecomp; transition(t)
src/tree/asr/alignment.rs # Alignment + hand-rolled FASTA + column/pattern compression
src/tree/asr/profile.rs # LikelihoodVec + scaling / log-offset numerical core
src/tree/asr/reconstruction.rs # Reconstruction<A> keyed by NodeID
```
Modify:
- `src/tree.rs` — add `pub mod asr;`; add 2 blanket impls (copy `RobinsonFoulds` bound set).
- `src/lib.rs` — prelude `pub use crate::tree::asr::*;` + crate-doc mention.
- `Cargo.toml` — add `nalgebra` (core-only) + `proptest` dev-dep.
- `src/error.rs` — add `AsrError` variants (reuse existing `thiserror`).
## 1. `Alphabet` trait (`alphabet.rs`)
treetime semantics: fixed ordered canonical states; a profile row (length `N_STATES`) per
accepted char. Canonical → one-hot; ambiguity codes → mass over implied subset; gap `-`
and `N`/`X`/`?` → all-ones (uninformative).
```rust
pub trait Alphabet: Copy + Clone + Sized + Send + Sync {
const N_STATES: usize; // 4 DNA / 20 protein
const CANONICAL: &'static [u8]; // b"ACGT"
const GAP: u8; // b'-'
fn profile(c: u8) -> Option<Vec<f64>>;
fn index_of(c: u8) -> Option<usize>;
fn char_of(i: usize) -> u8;
}
```
- Associated `const N_STATES` (not const-generic) avoids generics infection. Internals use
`f64` regardless of tree `W=f32`; cast `f64::from(weight)` only at the `P(t)` boundary.
- `Nucleotide`: `b"ACGT"`, IUPAC `R Y S W K M B D H V N`, gap, `U→T`; ambiguity rows uniform
over implied bases.
- `AminoAcid`: `b"ACDEFGHIKLMNPQRSTVWY"`, `B→D|N`, `Z→E|Q`, `J→I|L`, `X`/`-`/`U`/`O`/`*` → all-ones.
## 2. GTR model (`gtr.rs`)
treetime parameterization: `Q_ij = W_ij * pi_j` (i≠j), `Q_ii = -Σ_{j≠i} Q_ij` (rows sum 0);
rows = from-state, cols = to-state. Normalize so `mu = -Σ pi_i Q_ii = 1` (branch lengths in
expected subs/site); default normalized.
**Symmetric eigendecomposition (the pure-Rust nalgebra trick):** Q is pi-reversible, so
`S = diag(√pi)·Q·diag(1/√pi)` is symmetric →
```
S = V·Λ·Vᵀ (nalgebra::SymmetricEigen — real eigenvalues, orthonormal V, wasm-safe)
P(t) = diag(1/√pi)·V·diag(exp(Λ·t))·Vᵀ·diag(√pi)
```
nalgebra-core has a stable symmetric eigensolver but no general complex one — hence the
symmetrization (also better numerics). Requires `pi_i > 0`.
```rust
pub struct GtrModel<A: Alphabet> { /* pi, w, q, eigvals, v, sqrt_pi, inv_sqrt_pi, PhantomData<A> */ }
impl<A: Alphabet> GtrModel<A> {
pub fn new(pi: Vec<f64>, w: DMatrix<f64>, normalize: bool) -> Result<Self, AsrError>;
pub fn jukes_cantor() -> Self; // uniform pi, W = ones
pub fn transition(&self, t: f64) -> DMatrix<f64>; // P(t); t==0 → I
pub fn equilibrium(&self) -> &DVector<f64>;
}
```
`new` validates: `pi.len()==N_STATES`, `pi` sums≈1 & all >0, `W` symmetric. Missing branch
length → `AsrError::MissingBranchLength`. JTT/WAG/LG/HKY = follow-up.
## 3. Alignment + FASTA (`alignment.rs`)
```rust
pub struct Alignment { pub seqs: HashMap<String, Vec<u8>>, pub width: usize }
impl Alignment {
pub fn from_fasta_bytes(data: &[u8]) -> Result<Self, AsrError>; // wasm-safe, bytes only
pub fn from_fasta_file(p: &std::path::Path) -> std::io::Result<Self>; // std::fs wrapper
}
```
Mirror `io.rs`'s byte-parse vs file-read split (so wasm callers use `from_fasta_bytes` + JS
fetch). Parser: `>` header → id = up to first whitespace; concat/upcase sequence lines; strip
`\r`; reject empty/duplicate-id/ragged. Map names → leaves via
`RootedMetaTree::get_taxa_node_id(&name)` (`Meta=String` for `PhyloTree`); validate every
`get_leaf_ids()` leaf has a sequence.
**Column/pattern compression (treetime optimization):** sites are independent; compress
identical alignment columns into unique patterns with multiplicity, run the up/down pass once
per pattern, weight log-lik by multiplicity, re-expand to full width.
```rust
pub struct CompressedColumns {
patterns: Vec<Vec<u8>>, site_to_pattern: Vec<usize>,
multiplicity: Vec<usize>, leaf_order: Vec<TreeNodeID>,
}
```
## 4. Reconstruction traits (`asr.rs`) — mirror `distances.rs`
Shared bounds: `Self: RootedTree + RootedMetaTree + RootedWeightedTree + DFS + PreOrder`,
`Node: RootedMetaNode + RootedWeightedNode`. `&self`, owned result.
```rust
pub struct Reconstruction<A: Alphabet> {
pub sequences: HashMap<TreeNodeID, Vec<usize>>, // per-node state indices
pub posteriors: Option<HashMap<TreeNodeID, Vec<Vec<f64>>>>,// marginal only
pub log_likelihood: f64,
}
impl<A: Alphabet> Reconstruction<A> { pub fn sequence_string(&self, n: TreeNodeID) -> Option<String>; }
```
**Marginal** — `fn marginal_asr<A: Alphabet>(&self, model, aln, want_posteriors) -> Result<Reconstruction<A>, AsrError>`:
- Compress columns. Per pattern:
- UP (`postord_ids(root)`): leaf `L_v = profile(byte)`; internal
`L_v[i] = Π_c ( Σ_j P_c(i,j)·L_c[j] )`, `P_c = transition(weight(v→c))`. **Scale**:
`factor=max_i L_v[i]; L_v/=factor; log_scale_v += ln(factor)+Σ_c log_scale_c`. Cache by NodeID.
Root site log-lik `= ln(Σ_i pi_i·L_root[i]) + log_scale_root`; `total += mult·site_loglik`.
- DOWN (`preord_ids(root)`): root posterior `∝ pi_i·L_root[i]`; child `c` of parent `p`:
`msg[j]=Σ_i post_{p\c}[i]·P_c(i,j)`, `post_c[j] ∝ L_c[j]·msg[j]`, normalize. Use
**exclude-c recompute** of parent product (safer than treetime's divide-out). `argmax` →
`sequences[v][site]`; store profile if requested. Re-expand via `site_to_pattern`.
**Joint** — `fn joint_asr<A: Alphabet>(&self, model, aln) -> Result<Reconstruction<A>, AsrError>`, all log-space:
- UP (`postord_ids`): `C_v[i]` = best log-prob of subtree given v=i. Leaf = `log(profile_i)`
(−∞ for disallowed). Internal `m_c[i]=max_j(log P_c(i,j)+C_c[j])`, `ptr_{v,c}[i]=argmax_j`,
`C_v[i]=Σ_c m_c[i]`. Root state `=argmax_i(log pi_i + C_root[i])`.
- DOWN (`preord_ids` traceback): `state_c = ptr_{parent,c}[state_parent]`. No posteriors.
Blanket impls in `src/tree.rs`:
```rust
impl<T,W,Z> MarginalAsr for SimpleRootedTree<T,W,Z> where T: NodeTaxa, W: EdgeWeight, Z: NodeWeight {}
impl<T,W,Z> JointAsr for SimpleRootedTree<T,W,Z> where T: NodeTaxa, W: EdgeWeight, Z: NodeWeight {}
```
Parallel: per `CopheneticDistance`, add `#[cfg(feature="parallel")]` variants `par_iter()` over
patterns (embarrassingly parallel). Default serial.
## 5. WASM constraints
- `nalgebra = { version = "0.33", default-features = false, features = ["std"] }` — pure Rust,
no BLAS/LAPACK. Do **not** enable `nalgebra-lapack` or nalgebra's `rand`.
- `getrandom` `js` feature already present (`Cargo.toml:38`); ASR uses no RNG.
- Keep `rayon` `parallel`-gated; serial path must have zero `std::thread`/`rayon` outside the cfg.
- Verify: `cargo build --target wasm32-unknown-unknown --no-default-features --features simple_rooted_tree`.
## 6. Numerical
- Marginal: linear-space profiles + per-node max-scaling + running `log_scale` accumulator
(treetime approach); total = ln(root mass) + Σ log-scales. Keeps fast nalgebra matvec.
- Joint: pure log-space (`log P(t)` + additive max) — natural Viterbi, no scaling.
- Clamp `P(t)` round-off negatives to 0; `f64` internally, cast at `t` boundary only.
- `profile.rs` isolates `scale_and_logoffset()` for unit testing.
## Verification
- `cargo test` (+ `cargo test --features parallel`): unit tests per file —
- alphabet: profiles for `A/G/R/N/-`, AA `B/Z/X`, round-trips.
- gtr: `P(0)=I`, row sums=1, `P(t→∞)→pi`, Q rows sum 0, detailed balance
`pi_i Q_ij = pi_j Q_ji`, symmetric-eigen `P(t)` vs brute-force expm for small t.
- alignment: ragged/dup/CRLF/multiline parse, compression re-expansion correctness.
t- `cargo build --target wasm32-unknown-unknown --no-default-features --features simple_rooted_tree`.
- `cargo fmt` + `cargo clippy --all-targets --all-features -D warnings`.
- Optional divan bench for `marginal_asr` in `benches/main.rs` (tracks compression win).
## Implementation order (this PR)
1. `Cargo.toml`: nalgebra core-only + proptest dev-dep. ✅
2. `src/error.rs`: `AsrError`. ✅
3. `alphabet.rs` → `profile.rs` → `gtr.rs` (+JC) → `alignment.rs` (+compression) → `reconstruction.rs`. ✅
4. `asr.rs`: `MarginalAsr` + `JointAsr` traits. ✅
5. `src/tree.rs`: `pub mod asr;` + 2 blanket impls. ✅
6. `src/lib.rs`: prelude re-export + doc. ✅
7. Logic: Implementation of `marginal_asr` and `joint_asr` bodies. ⏳
8. Tests + fixture; wasm build check; fmt + clippy. ⏳
## Follow-ups (not this PR)
Empirical models (JTT/WAG/LG/HKY); ML branch-length & GTR-parameter optimization; gamma rate
heterogeneity / `+I`; marginal sampling; no_std/alloc-only nalgebra path; treetime parity snapshots.