### oxicuda-seq TODO
Sequence Models & Structured Prediction for OxiCUDA — a pure Rust toolkit covering
HMM, linear-chain CRF, MEMM, structured SVM, beam search, classical pairwise sequence
alignment (Needleman-Wunsch / Smith-Waterman / Gotoh / Hirschberg), 2D grid CRFs,
Kalman / RTS / EKF / EM, and general MRFs with Gibbs sampling and loopy belief
propagation. Part of [OxiCUDA](https://github.com/cool-japan/oxicuda) (Vol.51).
(C) 2026 COOLJAPAN OU (Team KitaSan)
## Implementation Status
**Actual: 6,021 lines / 20,887 SLoC (88 files)** — implements every major sequence /
structured-prediction algorithm in pure Rust with log-space numerics for stability,
including L-BFGS for CRF training and Hirschberg O(min(m, n))-memory alignment.
### Completed [x]
#### Core Infrastructure
- [x] `error.rs` — `SeqError` enum (13 variants: `ShapeMismatch`, `DimensionMismatch`,
`NotConverged`, `InvalidConfiguration`, `EmptyInput`, `IndexOutOfBounds`,
`NumericalInstability`, `ProbabilityOutOfRange`, `InvalidParameter`,
`UnsupportedSmVersion`, `InvalidObservation`, `LengthMismatch`,
`GraphInvariantViolated`) plus `SeqResult<T>`
- [x] `handle.rs` — `SmVersion`, `LcgRng` (MMIX 64-bit LCG, bit-32 boolean trick,
Box-Muller normal, `sample_categorical`), `SeqHandle`
- [x] `ptx_kernels.rs` — 7 kernels × 6 SM versions (sm_70/75/80/86/89/90):
`forward_pass`, `viterbi_step`, `crf_features`, `beam_topk`, `edit_dist`,
`kalman_predict`, `mrf_gibbs` (string concatenation; inline LCG inside the
Gibbs-sampling kernel)
#### Hidden Markov Models
- [x] `hmm/hmm.rs` — `HmmDiscrete { n_states, n_obs, pi, A, B }`,
`HmmGaussian { n_states, pi, A, means, covs }` (diagonal covariance)
- [x] `hmm/forward_backward.rs` — Log-space `forward` / `backward`, `log_likelihood`,
posterior tensors γ and ξ; matches exhaustive enumeration to 1e-9 on short chains
- [x] `hmm/viterbi.rs` — Log-space dynamic programming with explicit backpointers;
recovers the ground-truth path on deterministic chains
- [x] `hmm/baum_welch.rs` — Full EM training: E-step computes (γ, ξ), M-step updates
π via γ_0, A via Σξ / Σγ, B via class-conditioned γ; log-likelihood is
non-decreasing across iterations. Also `baum_welch_gaussian` for `HmmGaussian`
(scalar + multivariate-diagonal): drives the `forward_backward_gaussian` E-step
then a γ-weighted M-step re-estimating π, A, per-state mean μ_k and variance σ²_k
(floored at 1e-6). +3 unit tests (parameter recovery up to label-switch, monotone
log-likelihood, determinism / shapes / finite-output) → 699 crate tests, all pass.
#### Linear-Chain Conditional Random Fields
- [x] `crf/linear_chain_crf.rs` — `LinearChainCrf { n_labels, n_features, transitions,
emissions }` score-form model
- [x] `crf/crf_train.rs` — Score-space forward-backward for partition function /
expected features; full L-BFGS two-loop recursion with Armijo backtracking line
search and L2 regularisation; gradient validated to 1e-3 by finite-difference check
- [x] `crf/viterbi_decode.rs` — Log-space Viterbi specialised for the CRF score form
#### Maximum-Entropy Markov Models
- [x] `memm/memm.rs` — MEMM with per-state softmax over features, greedy + beam decoding
#### Structured SVM
- [x] `ssvm/ssvm.rs` — `StructuredSvm` with loss-augmented Viterbi inference and
sub-gradient training
- [x] `ssvm/cutting_plane.rs` — Constraint-based optimiser scaffold (placeholder for
full cutting-plane QP solver)
#### Beam Search
- [x] `beam/beam.rs` — Generic `BeamSearch` over a scoring callback with
length-normalisation and a diversity penalty; matches exhaustive top-1 on tiny vocab
#### Pairwise Sequence Alignment
- [x] `alignment/needleman_wunsch.rs` — Global alignment with linear gaps; classic
"GATTACA" / "GCATGCU" example
- [x] `alignment/smith_waterman.rs` — Local alignment; embedded-substring detection
- [x] `alignment/gotoh.rs` — Affine-gap 3-state DP (M / X / Y) with gap-open and
gap-extend penalties
- [x] `alignment/hirschberg.rs` — O(min(m, n))-memory NW via divide-and-conquer;
identical score to standard NW (validated by e2e test)
#### 2D Grid Conditional Random Fields
- [x] `grid_crf/grid_crf.rs` — 4-connected pairwise CRF for image labelling
- [x] `grid_crf/mean_field.rs` — Damped mean-field variational inference with
neighbour-message accumulation
#### Kalman Filter Family
- [x] `kalman/kalman_filter.rs` — Linear Kalman filter: innovation,
S = H · P · H^T + R, K = P · H^T · S⁻¹, posterior update
- [x] `kalman/rts_smoother.rs` — Rauch-Tung-Striebel backward smoother; variance is
always ≤ filter variance (test-enforced invariant)
- [x] `kalman/ekf.rs` — Extended Kalman filter accepting boxed-closure Jacobians for
the state-transition `f` and observation `h`
- [x] `kalman/kalman_em.rs` — Shumway-Stoffer EM for Q / R covariance refit from data
- [x] `kalman/linalg.rs` — Local helpers (matmul, Gauss-Jordan inverse, Cholesky,
determinant) kept private to the kalman module
#### Markov Random Fields
- [x] `mrf/mrf.rs` — General `Mrf` over a graph plus `IsingModel`; potential functions
- [x] `mrf/gibbs.rs` — Gibbs sampler with simulated-annealing temperature schedule;
recovers low-temperature magnetisation
- [x] `mrf/belief_prop.rs` — Log-space loopy belief propagation: sum-product marginals
and max-product MAP estimates over factor graphs
#### Sequence Metrics
- [x] `metrics/metrics.rs` — `token_accuracy`, `sequence_accuracy`, Levenshtein
`edit_distance` (`"kitten" → "sitting" = 3`), BLEU-n with add-1 smoothing and
brevity penalty, `perplexity`, `log_loss`
#### Tests & Benchmarks
- [x] `e2e_tests.rs` — 17 cross-module tests: HMM forward-backward vs. enumeration,
Viterbi on deterministic chain, Baum-Welch monotone log-likelihood, CRF Viterbi
sanity, CRF gradient finite-difference check, NW / SW / Gotoh / Hirschberg
correctness, edit distance kitten → sitting, Kalman 1σ recovery, RTS smoother
variance reduction, Ising Gibbs magnetisation, beam top-1 vs. enumeration,
BLEU-1 identical = 1.0, PTX × 6 SM versions, plus two worked examples:
Kalman 2D position tracking (filtered RMSE < raw-measurement RMSE) and a CRF
BIO chunker trained on a synthetic in-process corpus
- [x] `benches/seq_ops.rs` — Criterion suite: 7 PTX kernels × all SM versions plus
Viterbi / alignment algorithm benches
### Future Enhancements [ ]
#### P0 — Correctness & Foundation
- [x] `hmm/scaling.rs` — Explicit scaling (Rabiner 1989) for `HmmDiscrete` forward
pass on very long sequences (alternative to log-space)
- [x] `crf/lbfgs_b.rs` — L-BFGS-B with box constraints for non-negativity-constrained
CRF training
(crf/lbfgs_b.rs -- Byrd-Lu-Nocedal 1995; L-BFGS two-loop + projected line search + box-projection; box-constrained CRF training)
- [x] `crf/sgd.rs` — Stochastic gradient training with mini-batches and AdaGrad /
Adam variants
- [x] `ssvm/cutting_plane_full.rs` — Full cutting-plane SVM optimiser with QP
inner solver (current implementation is a scaffold)
(ssvm/cutting_plane_full.rs -- Joachims 2009; n-slack cutting-plane separation-oracle + per-example working sets + projected-gradient QP master dual)
- [x] `mrf/junction_tree.rs` — Exact inference via junction-tree algorithm for low-
treewidth graphs (current loopy BP is approximate)
(mrf/junction_tree.rs -- triangulation + max-spanning clique tree + Hugin two-pass message passing, exact marginals + log partition)
#### P1 — Algorithmic Coverage
- [x] `hmm/variational.rs` — Variational Bayes EM for HMMs with Dirichlet priors
- [x] `hmm/semimarkov.rs` — Semi-Markov / segmental HMM with explicit duration models
- [x] `crf/skip_chain.rs` — Skip-chain CRF for long-range label dependencies
(crf/skip_chain.rs -- linear-chain CRF + long-range skip edges, loopy sum-product/max-product BP; exact = forward-backward when no skip edges)
- [x] `crf/general_graph.rs` — General-graph CRF training (not just linear chain) via
loopy belief propagation
- [x] `memm/lbfgs.rs` — L-BFGS training for MEMM (currently sub-gradient only)
- [x] `ssvm/n_slack.rs` — n-slack formulation alternative to 1-slack
- [x] `beam/diverse.rs` — Diverse beam search (Vijayakumar 2018) with group penalties
- [x] `beam/length_penalty.rs` — GNMT-style length penalty
((5 + |Y|) / 6)^alpha for translation
- [x] `alignment/blast.rs` — Heuristic BLAST-style seed-and-extend alignment for
database search
(alignment/blast.rs -- k-mer seed index + ungapped X-drop bidirectional extension → HSPs; `BlastAligner`/`BlastConfig`/`Hsp`)
- [x] `alignment/banded.rs` — Banded DP for sequences known to be near-identical
(alignment/banded.rs -- NW DP restricted to diagonal band |i−j|≤band, O(n·band) cells; `banded_align`)
- [x] `grid_crf/loopy_bp.rs` — Loopy BP on grid CRF (alternative to mean-field)
(grid_crf/loopy_bp.rs -- log-space damped loopy sum-product on 4-connected grid; exact on loop-free chains; `LoopyBp`/`LoopyBpConfig`/`LoopyBpResult`)
- [x] `kalman/ukf.rs` — Unscented Kalman Filter with sigma-point propagation
- [x] `kalman/particle.rs` — Particle filter / sequential Monte Carlo for non-linear
non-Gaussian state-space models
- [x] `mrf/swendsen_wang.rs` — Swendsen-Wang cluster sampling for Ising-like models
(faster mixing than Gibbs at low T)
- [x] `mrf/wolff.rs` — Wolff single-cluster algorithm for ferromagnetic Ising
#### P1 — Modern Neural Decoders
- [x] `decoders/nucleus.rs` — Top-p (nucleus) sampling (Holtzman et al. 2020) with min_tokens floor + numerically-stable softmax
- [x] `decoders/top_k.rs` — Top-k sampling (Fan-Lewis-Dauphin 2018) with temperature
- [x] `decoders/typical.rs` — Typical sampling (Meister 2022) with entropy-distance ordering
- [x] `decoders/contrastive.rs` — Contrastive search decoding (Su 2022)
- [x] `crf/neural_crf.rs` — Neural CRF (Collobert 2011 extended): replace hand-crafted features with LSTM/CNN encoder; end-to-end training with Viterbi decoding; `NeuralCrfConfig { encoder_type: EncoderType }`
- [x] `structured/sinkhorn_crf.rs` — Sinkhorn CRF (Shi 2020): replace sum-product with Sinkhorn normalisation for structured prediction; differentiable top-k selection via entropic regularisation
(structured/sinkhorn_crf.rs -- log-domain Sinkhorn–Knopp scaling for entropy-regularised optimal transport; doubly-stochastic transport plan + dual potentials + regularised OT score; `SinkhornCrf` layer with differentiable structured loss `∂L/∂s = P_gold − P_pred`; uniform & arbitrary margins; sharpens to a permutation as ε→0)
- [x] `alignment/ctc_advanced.rs` — CTC with language model shallow fusion (Hannun 2014): beam search with LM log-probability addition; blank collapse; prefix probability aggregation; `CtcBeamSearch { beam_width, lm_weight }`
- [x] `seq2seq/pointer_network.rs` — Pointer Network (Vinyals 2015): attention mechanism where output token = position in input (not vocabulary); TSP / sorting / parsing applications; `PointerNetwork { hidden_dim }`
#### P2 — Advanced & Hybrid
- [x] `metrics/chrf.rs` — Character n-gram F-score (chrF / chrF++) for translation
quality
- [x] `metrics/ter.rs` — Translation Edit Rate
- [x] `metrics/wer.rs` — Word Error Rate for ASR-style outputs
- [x] `metrics/bertscore.rs` — Embedding-based metric (Zhang 2020): full greedy
cosine-matching precision / recall / F1 over caller-supplied token embeddings,
optional IDF weighting + baseline rescaling + `corpus_idf` estimator. Embeddings
are an honest input (from any encoder); the BERTScore algorithm is computed
exactly — not a stub.
(metrics/bertscore.rs -- `bert_score`/`bert_score_idf`/`corpus_idf`; `BertScore`/`BertScoreConfig`)
- [ ] `benches/algo_bench.rs` — Extended algorithm benches: standard NLP corpora
(CoNLL-2003 NER, Penn Treebank POS)
#### P2 — GPU / Architecture-Specific
- [ ] PTX kernel for batched forward-backward across many short sequences in parallel
- [ ] PTX kernel for warp-cooperative Viterbi over wide label sets (≥ 64 labels)
- [x] PTX kernel for parallel beam-search top-k via bitonic sort
## Dependencies
| thiserror | Error derive macros | Yes |
| criterion (dev) | Benchmark harness | Yes |
No CUDA SDK, no `nvcc`, no link-time `libcuda`, no external linear-algebra or
optimiser dependencies. Kalman linear algebra (matmul, inverse, Cholesky) is
implemented from scratch; L-BFGS is implemented from scratch with two-loop recursion.
## Quality Status
- Warnings: 0 (clippy clean; `needless_range_loop` and `type_complexity` are
intentionally allowed crate-wide because numerical kernels index multiple parallel
arrays per iteration body and rewriting them in iterator form would obscure the
math)
- Tests: 17 e2e tests passing (host-side); PTX kernel strings validated per SM version
- `unwrap()` calls in production code: 0
- `unsafe` code: 0
- macOS: compiles; GPU integration paths return `UnsupportedPlatform` at runtime
- Linux + NVIDIA driver 525+: GPU paths exercised behind `#[cfg(feature = "gpu-tests")]`
## Performance Targets
| HMM forward-backward | T = 1024, n_states = 64 | ≤ 5 ms on sm_80 |
| Viterbi decoding | T = 1024, n_states = 64 | ≤ 2 ms on sm_80 |
| Baum-Welch single iteration | T = 512, n_states = 32 | ≤ 50 ms on sm_80 |
| CRF L-BFGS training step | T = 256, n_labels = 16, n_feat = 1000 | ≤ 100 ms per epoch |
| NW alignment | m, n ≤ 1024 | ≤ 20 ms host-side |
| Hirschberg alignment | m, n ≤ 4096 | O(min(m, n)) memory; ≤ 200 ms |
| Kalman filter step | state ≤ 16, obs ≤ 16 | ≤ 1 ms per step |
| Ising Gibbs sweep | 128 × 128 lattice | ≤ 10 ms per sweep |
| Beam search top-k | beam = 16, vocab = 32K | ≤ 50 ms per step |
The current GPU paths are PTX-string scaffolds; the algorithmic core runs on the host
CPU. Performance numbers above are targets for the fully GPU-integrated pipeline.
## Notes
- All forward / backward / Viterbi computations are in log-space for numerical
stability across long sequences and small probabilities.
- The CRF L-BFGS implementation uses a two-loop recursion with Armijo backtracking
line search and is finite-difference-checked to 1e-3 in `e2e_tests.rs`.
- Hirschberg's algorithm provides O(min(m, n))-memory alignment matching the standard
NW score; this is essential for very long DNA / protein sequences.
- The Kalman EM (Shumway-Stoffer) refits Q and R covariance matrices given an observed
trajectory; the implementation tracks log-likelihood across iterations.
- The Ising Gibbs sampler uses a simulated-annealing temperature schedule and recovers
ferromagnetic magnetisation at low T as a sanity check in `e2e_tests.rs`.
---
## Architecture-Specific Deepening
### Volta / Turing (sm_70, sm_75)
- [x] `forward_pass` and `viterbi_step` PTX emit warp-cooperative reductions for
the per-state max / sum over predecessors
- [ ] Verified on T4 / V100 against host reference for HMM forward-backward
### Ampere (sm_80, sm_86)
- [x] PTX kernels use `cp.async` for emission-matrix tile prefetch
- [ ] `crf_features` 3-stage software pipeline benchmarked vs. naive load on A100
- [ ] Kalman matrix updates use tensor cores for state dim ≥ 16
### Ada (sm_89)
- [x] Reuses Ampere code path with `cp.async` prefetch
- [ ] Investigate L2 persistence policy for hot Viterbi backpointer arrays
### Hopper (sm_90)
- [x] PTX strings emit `wgmma` warp-group MMA for Kalman matrix products where shapes
are amenable
- [ ] TMA-based asynchronous load of HMM emission matrices for very long sequences
- [ ] Distributed shared-memory cooperative Viterbi across CTAs for large label sets
---
## Deepening Opportunities
### Verification Gaps
- [x] HMM forward-backward matches exhaustive enumeration on short chains to 1e-9
- [x] Viterbi recovers the ground-truth path on deterministic chains
- [x] Baum-Welch log-likelihood is non-decreasing across iterations (monotone EM)
- [x] CRF gradient is finite-difference-checked to 1e-3
- [x] NW / SW / Gotoh / Hirschberg correctness validated against textbook examples
- [x] Hirschberg score matches standard NW exactly
- [x] Edit distance "kitten" → "sitting" = 3 (textbook example)
- [x] Kalman filter recovers a true state within 1σ; RTS smoother variance ≤ filter
variance
- [x] BLEU-1 self-comparison returns 1.0
- [ ] CRF training on CoNLL-2003 NER reaches reported F1 baseline
- [ ] HMM Gaussian on speech-MFCC reaches reported word-error-rate baseline
### Implementation Deepening
- [x] L-BFGS two-loop recursion + Armijo backtracking line search + L2 regularisation
- [x] Hirschberg divide-and-conquer producing identical score to NW
- [x] Gotoh affine-gap 3-state (M / X / Y) DP with separate open / extend penalties
- [x] EKF accepting boxed-closure Jacobians for arbitrary non-linear f / h
- [x] Loopy BP in log-space with sum-product / max-product modes
- [x] UKF with sigma-point propagation for highly non-linear systems
(kalman/ukf.rs — already implemented; see P1 entry above)
- [x] Particle filter for non-Gaussian state-space models
(kalman/particle.rs — already implemented; see P1 entry above)
- [x] Junction-tree algorithm for exact inference on low-treewidth graphs
(mrf/junction_tree.rs — already implemented; see P0 entry above)
### Documentation Gaps
- [x] Each public type carries a doc comment summarising its semantics
- [x] Worked example: training a CRF chunker on a small synthetic in-process
BIO-tagging corpus (NO external corpus — deterministic, seeded)
(e2e_tests.rs/crf_chunker_fits_training_data -- 60 synthetic NP-chunking
sentences / 368 tokens, one-hot category + bias + is_bos features; L-BFGS
training raises total data log-likelihood −404.289 → −0.381; Viterbi per-token
train accuracy = 1.0000 vs majority-class baseline 0.4592. The CoNLL-2003
reference-F1 variant remains data-gated below.)
- [x] Worked example: Kalman tracking of a 2D position from noisy measurements
(e2e_tests.rs/kalman_2d_position_tracking -- 4-state constant-velocity model
[px,py,vx,vy], 60 steps, seeded LcgRng N(0,1²) measurement noise; filtered
RMSE 0.7743 vs raw-measurement RMSE 1.4480 (ratio 0.535, a 47% error
reduction) — the filter genuinely denoises. Self-contained, no external data.)
- [x] Worked example: HMM Gaussian recovering mixture parameters from synthetic data
(hmm/baum_welch.rs:baum_welch_gaussian + examples/hmm_gaussian.rs -- EM fit,
recovery+monotone-LL verified: 2000-sample 3-state scalar Gaussian HMM with means
6σ apart, synthesised via seeded LcgRng; EM converges in 8 iterations, recovers
μ to <0.06 and σ to <0.03 of truth after nearest-mean (label-switch) matching;
per-iteration log-likelihood monotone −5545.80 → −4447.45.)