constraint-decoding-trie 0.1.0

# [Generative Retrieval] Transition Matrix Trie for Constraint Decoding

Implementation of STATIC from this [paper](https://arxiv.org/abs/2602.22647v1): **Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators**

This is implemented using parallelisation on CPU, developing the GPU shader will be nice but quite time-consuming.

## Motivation

A Rust library for **fast, production-friendly constrained decoding** over a large, fixed candidate set (e.g., Semantic IDs / item IDs for generative retrieval). It flattens a prefix trie into a CSR transition matrix and uses a vectorized kernel (VNTK) plus a bit-packed dense mask for shallow layers to keep constraint enforcement accelerator-friendly.

This design is inspired by STATIC (“Sparse Transition Matrix‑Accelerated Trie Index for Constrained Decoding”), which recasts trie traversal as vectorized sparse-matrix operations and reports very large speedups and low per-step overhead in an industrial deployment.


## Why this matters commercially

Constrained decoding is what turns “LLM can generate an ID” into “LLM can generate only valid, in-policy IDs”:

- **Hard business logic**: enforce inventory/freshness/eligibility policies at generation-time (no post-filtering surprises).
- **Lower latency overhead**: move from pointer-chasing trie walks to contiguous CSR reads and vectorized scattering.
- **Scale**: supports large constraint sets, where binary-search style methods can become an I/O bottleneck; STATIC frames this as `O(1)` I/O w.r.t. constraint-set size for the kernel.
- **Deterministic outputs**: the index is static and reproducible, which helps auditing and rollout safety.

## What’s included

Project layout:

```bash
src
├── trie.rs          # Prefix trie construction
├── transition.rs    # CSR transition matrix builder
├── dense_mask.rs    # Bit-packed dense mask for shallow layers
├── vntk.rs          # Vectorized Node Transition Kernel
├── decoder.rs       # Constrained decoding step (Algorithm 1)
└── types.rs         # TransitionMatrix, DecodingState, etc.
```

## Quick start

Key structures:

- `DenseMask`: bit-packed validity for the first `d` layers (typically `d=2`).
- `TransitionMatrix`: CSR matrix with interleaved `(token_id, next_node_id)` entries.
- `StaticIndex`: `dense + sparse` combined index.
- `ConstrainedDecoder`: constrained beam step (mask → select → gather).

### Build an index

```rust
use constraint_decoding_trie::transition::build_static_index;

let constraints = vec![
  vec![1u32, 2, 1],
  vec![3u32, 1, 2],
  vec![3u32, 1, 3],
];

let index = build_static_index(
  &constraints,
  /*vocab_size=*/ 4,
  /*sid_length=*/ 3,
  /*dense_depth=*/ 2,
);
```

## Benchmarks (Criterion)

Benchmarks live under `benches/` and use Criterion. Run:

```bash
cargo bench
```

## Notes on correctness

- Children in each CSR row are stored sorted by token ID (deterministic index layout).
- Dense masking uses safe bit-range scans; boundary conditions are covered by tests.
- Sparse masking uses VNTK to produce per-beam masks and next-node slots.

## Roadmap ideas

- Parquet persistence for `StaticIndex`
- SIMD/bitset masks for faster CPU-side gating
- GPU/TPU-friendly kernels (layout already designed for coalesced reads)

## References

STATIC paper: “Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators” (2026), [paper](https://arxiv.org/abs/2602.22647v1).