flowmatch 0.1.6

# flowmatch roadmap: RFM, discrete FM, and ecosystem boundaries

This document is a design/architecture note for evolving `flowmatch` beyond the current
“semidiscrete conditional FM + linear path + Euler sampling” baseline.

## What `flowmatch` is today (and what it is not)

Current implementation (in code):

- **Conditional FM (CFM)** with a **linear path** \(x_t = (1-t)x_0 + t y_j\) and target \(u_t = y_j - x_0\).
- Conditioning variable is **discrete**: \(z := j\), where \(j\) is chosen by a semidiscrete OT-shaped assignment
  implemented in `wass::semidiscrete` (potentials + hard assignment).
- Vector field parameterization is intentionally minimal (`linear::LinearCondField`) and training is simple SGD.
- Sampling integrates an ODE with explicit Euler or Heun (RK2), selected via `ode::OdeMethod`.
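As a minimal sketch of the linear path and its regression target (plain slices and hypothetical function names, not the crate's actual `ndarray`-based API):

```rust
/// Point on the linear path x_t = (1 - t) * x0 + t * y.
pub fn linear_path(x0: &[f64], y: &[f64], t: f64) -> Vec<f64> {
    assert_eq!(x0.len(), y.len());
    x0.iter().zip(y).map(|(a, b)| (1.0 - t) * a + t * b).collect()
}

/// Conditional target velocity u_t = y - x0 (constant in t for the linear path).
pub fn linear_target(x0: &[f64], y: &[f64]) -> Vec<f64> {
    assert_eq!(x0.len(), y.len());
    x0.iter().zip(y).map(|(a, b)| b - a).collect()
}

/// Squared-error regression loss between a predicted velocity and the target.
pub fn velocity_loss(pred: &[f64], target: &[f64]) -> f64 {
    assert_eq!(pred.len(), target.len());
    pred.iter().zip(target).map(|(p, u)| (p - u).powi(2)).sum()
}
```

Because the path is linear in \(t\), the target does not depend on \(t\); a finite difference of `linear_path` between any two times recovers `linear_target`.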

Non-goals (today):

- No SE(3)-equivariant geometry, no manifold-aware flows, no learned CTMC (discrete-state, continuous-time) generators.
- No “full reproduction” of any paper; the crate is a small reusable primitive.

## Paper taxonomy → module boundaries (what we should mean by names)

The survey (arXiv:2507.17731) uses a taxonomy that is useful for naming modules:

- **General FM / Conditional FM (CFM)**: regress \(u_\theta(x,t)\) to a tractable conditional target \(u^\*(x,t,z)\).
- **Rectified FM (RFM)**: choose/improve the coupling \(\pi(x_0, x_1)\) to “straighten” flows (reduce path curvature).
- **Non-Euclidean FM**: replace Euclidean paths with geodesics / manifold tangent vector fields.
- **Discrete FM**:
  - **CTMC-based**: learn a continuous-time Markov generator over discrete states.
  - **Simplex-based**: learn a continuous flow on the probability simplex (Dirichlet / Fisher-Rao / Gumbel-softmax).

In our crate, those should correspond to *minimal building blocks*, not full application models.

### What is now implemented (scaffolding + baselines)

- **RFM coupling + training baseline**:
  - coupling primitive: `rfm::minibatch_ot_greedy_pairing` (Sinkhorn plan → greedy matching)
  - training: `sd_fm::train_rfm_minibatch_ot_linear` (minibatch OT pairing → straight-line FM regression)
- **Discrete FM scaffolding**:
  - CTMC: `discrete_ctmc` (generator validation + Euler evolution of probabilities)
  - Simplex: `simplex` (simplex validation + Dirichlet sampling)
- **Non-Euclidean scaffolding**: `non_euclidean` (geodesic interpolant trait + Euclidean baseline)
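To make the “Sinkhorn plan → greedy matching” idea concrete, here is a self-contained sketch. It is not the code behind `rfm::minibatch_ot_greedy_pairing` (whose signature is not reproduced here); the function names, the uniform marginals, and the plain (non-log-domain) Sinkhorn iterations are all assumptions made for illustration.

```rust
/// Squared Euclidean cost matrix between two minibatches.
pub fn sq_dist_cost(x0: &[Vec<f64>], x1: &[Vec<f64>]) -> Vec<Vec<f64>> {
    x0.iter()
        .map(|a| {
            x1.iter()
                .map(|b| a.iter().zip(b).map(|(p, q)| (p - q).powi(2)).sum())
                .collect()
        })
        .collect()
}

/// Entropic OT plan with uniform marginals via Sinkhorn iterations.
/// Plain (non-log-domain) updates: a very large cost/eps ratio can underflow.
pub fn sinkhorn_plan(cost: &[Vec<f64>], eps: f64, iters: usize) -> Vec<Vec<f64>> {
    let n = cost.len();
    let m = cost[0].len();
    let k: Vec<Vec<f64>> = cost
        .iter()
        .map(|row| row.iter().map(|c| (-c / eps).exp()).collect())
        .collect();
    let (a, b) = (1.0 / n as f64, 1.0 / m as f64);
    let mut u = vec![1.0; n];
    let mut v = vec![1.0; m];
    for _ in 0..iters {
        for i in 0..n {
            let kv: f64 = (0..m).map(|j| k[i][j] * v[j]).sum();
            u[i] = a / kv;
        }
        for j in 0..m {
            let ktu: f64 = (0..n).map(|i| k[i][j] * u[i]).sum();
            v[j] = b / ktu;
        }
    }
    (0..n)
        .map(|i| (0..m).map(|j| u[i] * k[i][j] * v[j]).collect())
        .collect()
}

/// Greedy one-to-one rounding of a square plan: repeatedly take the largest
/// remaining entry whose row and column are both unmatched.
/// Returns `m` with `m[i] = j`. This is a heuristic, not an exact assignment.
pub fn greedy_matching(plan: &[Vec<f64>]) -> Vec<usize> {
    let n = plan.len();
    assert!(plan.iter().all(|row| row.len() == n), "plan must be square");
    let mut entries: Vec<(usize, usize)> =
        (0..n).flat_map(|i| (0..n).map(move |j| (i, j))).collect();
    // Sort by plan mass, largest first.
    entries.sort_by(|&(i1, j1), &(i2, j2)| plan[i2][j2].total_cmp(&plan[i1][j1]));
    let mut row_to_col = vec![usize::MAX; n];
    let mut col_used = vec![false; n];
    for (i, j) in entries {
        if row_to_col[i] == usize::MAX && !col_used[j] {
            row_to_col[i] = j;
            col_used[j] = true;
        }
    }
    row_to_col
}
```

The greedy rounding is only as good as the plan is concentrated: with a large `eps` the plan is diffuse and the matching can be far from the OT assignment.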

## RFM in our setting (what changes, what stays)

### What stays the same

- The FM loss stays a **velocity regression** on samples \((t, x_t)\).
- We still need a callable vector field \(v_\theta(x,t;\text{cond})\) and an ODE sampler.

### What changes (the key design choice)

RFM is mostly about the **coupling** (how we pair a base sample with a target sample).

In our current SD-FM baseline, the “pairing” is:

- draw \(x_0 \sim \mathcal{N}(0,I)\)
- pick discrete \(j\) by a semidiscrete assignment (OT-flavored)
- set \(x_1 := y_j\)

This is already “rectification-ish” in spirit: the semidiscrete assignment is a coupling between a continuous base
and a discrete support. The *next* step (to deserve the name `rfm`) is to make the coupling choice explicit and swappable.

### Proposed API surface (design)

Introduce a coupling abstraction:

- `CouplingSampler`: given a minibatch of base samples and a minibatch of target samples, return a coupling artifact:
  - simplest: a permutation / matching (one-to-one)
  - more general: a sparse transport plan / weights

Then define “rectified training” as: sample \((x_0, x_1) \sim \pi\) and train on the straight-line path between them.
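One possible shape for this abstraction, sketched with slice-of-`Vec` minibatches. Every name here (`Coupling`, `IndependentCoupling`, `GreedyHeuristicCoupling`, `mean_pair_cost`) is hypothetical; only `CouplingSampler` is taken from the proposal above.

```rust
/// Coupling artifact returned for one pair of minibatches.
#[derive(Debug, Clone, PartialEq)]
pub enum Coupling {
    /// One-to-one matching: base sample `i` is paired with target `matching[i]`.
    Matching(Vec<usize>),
    /// Sparse transport plan as `(base index, target index, weight)` triples.
    SparsePlan(Vec<(usize, usize, f64)>),
}

pub trait CouplingSampler {
    fn couple(&self, base: &[Vec<f64>], target: &[Vec<f64>]) -> Coupling;
}

fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Pairs sample `i` with target `i` (the batches are treated as independent draws).
pub struct IndependentCoupling;

impl CouplingSampler for IndependentCoupling {
    fn couple(&self, base: &[Vec<f64>], target: &[Vec<f64>]) -> Coupling {
        assert_eq!(base.len(), target.len());
        Coupling::Matching((0..base.len()).collect())
    }
}

/// Nearest unused target for each base sample, in order.
/// A heuristic, not an OT solution, hence the name.
pub struct GreedyHeuristicCoupling;

impl CouplingSampler for GreedyHeuristicCoupling {
    fn couple(&self, base: &[Vec<f64>], target: &[Vec<f64>]) -> Coupling {
        assert_eq!(base.len(), target.len());
        let mut used = vec![false; target.len()];
        let mut matching = Vec::with_capacity(base.len());
        for x0 in base {
            let mut best: Option<(usize, f64)> = None;
            for (j, x1) in target.iter().enumerate() {
                if used[j] {
                    continue;
                }
                let d = sq_dist(x0, x1);
                if best.map_or(true, |(_, bd)| d < bd) {
                    best = Some((j, d));
                }
            }
            let (j, _) = best.expect("batch sizes match, so an unused target exists");
            used[j] = true;
            matching.push(j);
        }
        Coupling::Matching(matching)
    }
}

/// Mean squared length of the straight-line segments induced by a matching.
pub fn mean_pair_cost(matching: &[usize], base: &[Vec<f64>], target: &[Vec<f64>]) -> f64 {
    let total: f64 = matching
        .iter()
        .enumerate()
        .map(|(i, &j)| sq_dist(&base[i], &target[j]))
        .sum();
    total / matching.len() as f64
}
```

Training code would take a `&dyn CouplingSampler` (or a generic parameter) and regress on the straight line between each coupled pair, so swapping the coupling does not touch the loss or the sampler.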

Where to get \(\pi\) in our ecosystem:

- Prefer using `wass` for approximate OT couplings (entropic OT / Sinkhorn-style), since that’s already our coupling primitive.

What not to do:

- Don’t silently “pretend” a coupling is OT if it isn’t; call it `*_heuristic_*` if it’s a heuristic.

## Discrete FM: two different things (don’t conflate)

### A) CTMC discrete FM (discrete states; rates)

This is for token/graph-like discrete states where the dynamics are a CTMC:

- You model a time-dependent generator \(Q_\theta(t)\) (rates between states).
- Training typically matches probability flows / pathwise objectives, not Euclidean velocities.

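**Recommendation**: keep learned CTMC discrete FM *out of* the current `flowmatch` crate unless we commit to a clean discrete core; the existing `discrete_ctmc` scaffolding covers only generator validation and Euler evolution of probabilities.
It’s a different math object than “vector field on \(\mathbb{R}^d\)”.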

If/when added, it should live in `flowmatch::discrete_ctmc` (or a separate crate) with:

- explicit state representation
- explicit normalization / validity constraints
- deterministic RNG seeds for any sampling
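A sketch of what “explicit validity constraints” could look like for a generator over a small finite state space. The dense `Vec<Vec<f64>>` representation, the error type, and the function names are assumptions for illustration, not the contents of `discrete_ctmc`.

```rust
#[derive(Debug, PartialEq)]
pub enum GeneratorError {
    NotSquare,
    NegativeRate { row: usize, col: usize },
    RowSumNonZero { row: usize },
}

/// Checks that `q` is a valid CTMC generator: square, nonnegative off-diagonal
/// rates, and every row summing to zero (within `tol`). NaN entries are rejected.
pub fn validate_generator(q: &[Vec<f64>], tol: f64) -> Result<(), GeneratorError> {
    let n = q.len();
    for (i, row) in q.iter().enumerate() {
        if row.len() != n {
            return Err(GeneratorError::NotSquare);
        }
        for (j, &rate) in row.iter().enumerate() {
            // Written as a negated comparison so that NaN also fails.
            if i != j && !(rate >= 0.0) {
                return Err(GeneratorError::NegativeRate { row: i, col: j });
            }
        }
        let sum: f64 = row.iter().sum();
        if !(sum.abs() <= tol) {
            return Err(GeneratorError::RowSumNonZero { row: i });
        }
    }
    Ok(())
}

/// One explicit Euler step of the forward equation dp/dt = p Q,
/// with `p` a row vector of state probabilities.
pub fn euler_step(p: &[f64], q: &[Vec<f64>], dt: f64) -> Vec<f64> {
    let n = p.len();
    (0..n)
        .map(|j| p[j] + dt * (0..n).map(|i| p[i] * q[i][j]).sum::<f64>())
        .collect()
}
```

Because each row of a valid generator sums to zero, the Euler step preserves total probability up to rounding; nonnegativity additionally requires `dt` to be small relative to the largest exit rate.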

### B) Simplex-based discrete FM (continuous flow on the simplex)

Here you represent categorical data via a point in the simplex and learn a continuous flow there
(Dirichlet FM, Fisher-FM, Gumbel-softmax interpolants).

This is closer to our current continuous-time machinery, but it requires:

- simplex constraints (nonnegativity, sum-to-1)
- stable parameterizations (e.g. logits + softmax with temperature)

**Recommendation**: treat this as a separate module family `flowmatch::simplex_*`
with small, carefully documented invariants (no silent renormalization).
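The two requirements above can be sketched as a strict validator (it reports violations instead of renormalizing) plus a temperature softmax. The names and the error type are hypothetical and not taken from the existing `simplex` module.

```rust
#[derive(Debug, PartialEq)]
pub enum SimplexError {
    Empty,
    Negative { index: usize },
    BadSum { sum: f64 },
}

/// Validates that `p` lies on the probability simplex.
/// Never renormalizes: a violation is returned as an error.
pub fn validate_simplex(p: &[f64], tol: f64) -> Result<(), SimplexError> {
    if p.is_empty() {
        return Err(SimplexError::Empty);
    }
    for (index, &v) in p.iter().enumerate() {
        // Negated comparison so that NaN is rejected as well.
        if !(v >= 0.0) {
            return Err(SimplexError::Negative { index });
        }
    }
    let sum: f64 = p.iter().sum();
    if !((sum - 1.0).abs() <= tol) {
        return Err(SimplexError::BadSum { sum });
    }
    Ok(())
}

/// Softmax of `logits / tau`, with the maximum subtracted for numerical stability.
/// Smaller `tau` gives a sharper distribution.
pub fn softmax_with_temperature(logits: &[f64], tau: f64) -> Vec<f64> {
    assert!(tau > 0.0, "temperature must be positive");
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|l| ((l - max) / tau).exp()).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|e| e / z).collect()
}
```

Keeping validation separate from the parameterization means a caller that passes a slightly off-simplex point gets an error rather than a quietly altered input.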

## Ecosystem boundary: backend-agnostic “core” vs ndarray implementation

Right now `flowmatch` exports `ndarray::Array*` types in public structs (`TrainedSdFm`), which means the crate is
not truly backend-agnostic at the public API boundary.

If we want to follow the workspace-wide “backend-agnostic by default” rule, the direction is:

- `flowmatch-core`:
  - traits and scalar math (slice-based inputs/outputs)
  - concepts: interpolants, objectives, coupling interfaces, ODE stepping traits
  - **no `ndarray` in public types**
- `flowmatch-ndarray` (or keep name `flowmatch` and add feature-gated backend modules):
  - the current implementation using `ndarray`
  - test/demo harnesses
  - optional adapters

Current status (as of the latest edits in this workspace): we now have a **feature-gated** Burn-backed
foothold in `flowmatch::burn_euclidean` (compile-tested with `cargo test -p flowmatch --features burn`),
without changing the default ndarray-only API surface.

The `flowmatch-core` / backend split is the clean way to make FM utilities reusable by other crates without importing `ndarray`.
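For illustration, a slice-based trait of the kind `flowmatch-core` could expose. The trait name, its methods, and the `AffineField` example are hypothetical; nothing here reflects the current `linear::LinearCondField` parameterization.

```rust
/// Backend-agnostic vector field: inputs and outputs are plain slices,
/// so no array crate appears in the public signature.
pub trait VectorField {
    /// Dimension of the state space.
    fn dim(&self) -> usize;
    /// Writes v(x, t) into `out`; `x` and `out` must both have length `dim()`.
    fn eval(&self, x: &[f64], t: f64, out: &mut [f64]);
}

/// Time-independent affine field v(x) = W x + b, with `weights` stored row-major.
pub struct AffineField {
    pub dim: usize,
    pub weights: Vec<f64>,
    pub bias: Vec<f64>,
}

impl VectorField for AffineField {
    fn dim(&self) -> usize {
        self.dim
    }

    fn eval(&self, x: &[f64], _t: f64, out: &mut [f64]) {
        assert_eq!(x.len(), self.dim);
        assert_eq!(out.len(), self.dim);
        for i in 0..self.dim {
            let row = &self.weights[i * self.dim..(i + 1) * self.dim];
            out[i] = self.bias[i] + row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>();
        }
    }
}

/// Convenience wrapper that allocates the output; works with trait objects too.
pub fn eval_to_vec(field: &dyn VectorField, x: &[f64], t: f64) -> Vec<f64> {
    let mut out = vec![0.0; field.dim()];
    field.eval(x, t, &mut out);
    out
}
```

A backend crate would implement the same trait on its own storage and hand out slices at the boundary; for contiguous `ndarray` arrays that is what `as_slice()` provides.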

## Suggested next increments (small, testable)

### 1) Make “coupling” explicit in SD-FM training

Refactor `train_sd_fm_semidiscrete_linear` so “choose j” is an injected strategy (still defaulting to the current method).
Add a tiny test that swapping the coupling strategy changes assignment frequencies in the expected direction.
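A sketch of the injected strategy and the frequency test. It assumes the standard semidiscrete hard assignment \(j = \arg\min_j \|x_0 - y_j\|^2 - g_j\) with potentials \(g\); whether `wass::semidiscrete` uses exactly this form is not confirmed here, and all names are hypothetical.

```rust
/// Strategy for choosing the target index `j` for a base sample.
pub trait AssignStrategy {
    fn choose(&self, x0: &[f64], targets: &[Vec<f64>]) -> usize;
}

/// Hard assignment with potentials: argmin_j |x0 - y_j|^2 - g_j.
/// Ties go to the lowest index.
pub struct PotentialAssign {
    pub potentials: Vec<f64>,
}

impl AssignStrategy for PotentialAssign {
    fn choose(&self, x0: &[f64], targets: &[Vec<f64>]) -> usize {
        assert_eq!(self.potentials.len(), targets.len());
        let mut best = (0usize, f64::INFINITY);
        for (j, y) in targets.iter().enumerate() {
            let d: f64 = x0.iter().zip(y).map(|(a, b)| (a - b).powi(2)).sum();
            let score = d - self.potentials[j];
            if score < best.1 {
                best = (j, score);
            }
        }
        best.0
    }
}

/// Counts how often each target index is chosen over a batch of base samples.
pub fn assignment_counts<S: AssignStrategy>(
    strategy: &S,
    base: &[Vec<f64>],
    targets: &[Vec<f64>],
) -> Vec<usize> {
    let mut counts = vec![0usize; targets.len()];
    for x0 in base {
        counts[strategy.choose(x0, targets)] += 1;
    }
    counts
}
```

Raising the potential of target `j` enlarges its cell, so its assignment count should not decrease; on a deterministic grid of base points this gives a test that needs no RNG.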

### 2) Add a second ODE integrator

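Add Heun/RK2 alongside Euler for sampling, behind a tiny trait (now available via `ode::OdeMethod`). This is low-risk and, because Heun is second-order accurate, reduces integration error at a fixed step count.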
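The difference between the two integrators can be sketched as follows. `StepMethod` and `integrate` are hypothetical stand-ins and do not mirror the variants or signature of `ode::OdeMethod`; the field is passed as a closure for brevity.

```rust
#[derive(Clone, Copy, Debug)]
pub enum StepMethod {
    Euler,
    Heun,
}

/// Integrates dx/dt = f(x, t) from t = 0 to t = 1 with `steps` uniform steps.
pub fn integrate<F>(f: F, x0: &[f64], steps: usize, method: StepMethod) -> Vec<f64>
where
    F: Fn(&[f64], f64) -> Vec<f64>,
{
    let dt = 1.0 / steps as f64;
    let mut x = x0.to_vec();
    for k in 0..steps {
        let t = k as f64 * dt;
        let k1 = f(&x, t);
        match method {
            StepMethod::Euler => {
                for i in 0..x.len() {
                    x[i] += dt * k1[i];
                }
            }
            StepMethod::Heun => {
                // Euler predictor, then average the slopes at both ends of the step.
                let pred: Vec<f64> = x.iter().zip(&k1).map(|(xi, ki)| xi + dt * ki).collect();
                let k2 = f(&pred, t + dt);
                for i in 0..x.len() {
                    x[i] += 0.5 * dt * (k1[i] + k2[i]);
                }
            }
        }
    }
    x
}
```

On a test problem with a known solution, such as \(dx/dt = -x\), Heun's error at 10 steps is well below Euler's, at the cost of two field evaluations per step instead of one.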

### 3) Add an RFM-style minibatch coupling utility (optional)

If `wass` exposes a small OT coupling for two minibatches, add:

- `rfm::minibatch_ot_coupling(...) -> matching`
- an e2e test that RFM coupling reduces “path curvature proxy” (e.g. fewer solver steps needed for same MSE).

### 4) Discrete FM prototype (only if we commit)

Pick exactly one:

- `simplex_dirichlet` (continuous-time, simplex constraints), or
- `discrete_ctmc` (true CTMC generator)

and build one minimal invariant test (normalization / probability preservation / determinism).

## References (for naming + scope)

- Lipman et al., *Flow Matching for Generative Modeling* (arXiv:2210.02747)
- Lipman et al., *Flow Matching Guide and Code* (arXiv:2412.06264)
- Li et al., *Flow Matching Meets Biology and Life Science: A Survey* (arXiv:2507.17731)
  - curated list: `https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology`