Module spec_decode

Speculative decoding scheduling pattern (plan #34).

The scheduling structure is borrowed from MAX’s serving scheduler (one_shot_scheduler.py, decode/prefill split). The algorithm is the classic one from Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”: a small draft model proposes n tokens; the larger target model verifies all n in one forward pass; tokens are accepted up to the first rejection, and at that position one “corrected” token is sampled from the residual distribution. In the original algorithm, if all n drafts are accepted, the extra token is instead sampled from the target’s distribution for the next position.
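The per-token rule behind "accepted up to the first rejection" can be sketched as follows. This is a minimal illustration of the rule from the Leviathan et al. paper, not this module's code: the function names `accept_probability` and `residual_distribution` are hypothetical, and distributions are assumed to be plain `f64` slices over a shared vocabulary.

```rust
/// Probability of accepting a drafted token, where `p` is the target's
/// probability for that token and `q` is the draft's probability for it.
/// (Hypothetical helper; the name is an assumption.)
fn accept_probability(p: f64, q: f64) -> f64 {
    if q <= 0.0 {
        // The draft should never sample a token it gave zero mass;
        // treat this as "always accept" to avoid dividing by zero.
        return 1.0;
    }
    (p / q).min(1.0)
}

/// Residual distribution norm(max(0, p - q)), used to sample the corrected
/// token after a rejection. Returns `None` when no token has more target
/// mass than draft mass (for normalized inputs, when p == q).
fn residual_distribution(p: &[f64], q: &[f64]) -> Option<Vec<f64>> {
    assert_eq!(p.len(), q.len(), "distributions must share a vocabulary");
    // Keep only the mass the draft under-represented.
    let clipped: Vec<f64> = p
        .iter()
        .zip(q)
        .map(|(pi, qi)| (pi - qi).max(0.0))
        .collect();
    let total: f64 = clipped.iter().sum();
    if total <= 0.0 {
        return None;
    }
    Some(clipped.into_iter().map(|x| x / total).collect())
}
```

Sampling the replacement from the residual rather than from the target directly is what keeps the output distribution identical to plain target sampling.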

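Expected speedup on decode-heavy workloads: 2-3×, depending on how often the target accepts the draft’s tokens and on the draft model’s cost relative to the target.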
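For a rough sense of where a figure like this comes from, the analysis in the Leviathan et al. paper can be sketched in a few lines. The sketch assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification; `cost_ratio` (draft step time divided by target step time) and both function names are hypothetical and not part of this module.

```rust
/// Expected number of tokens emitted per round with lookahead `n`, assuming
/// each drafted token is accepted independently with probability `alpha`.
/// Computes (1 - alpha^(n+1)) / (1 - alpha).
fn expected_tokens_per_round(alpha: f64, n: u32) -> f64 {
    if (1.0 - alpha).abs() < f64::EPSILON {
        // Every draft is accepted, plus the one extra token from the target.
        return f64::from(n) + 1.0;
    }
    (1.0 - alpha.powi(n as i32 + 1)) / (1.0 - alpha)
}

/// Expected wall-clock speedup over plain target decoding. One round costs
/// `n` draft steps plus one target pass, measured in target-step units.
fn expected_speedup(alpha: f64, n: u32, cost_ratio: f64) -> f64 {
    expected_tokens_per_round(alpha, n) / (f64::from(n) * cost_ratio + 1.0)
}
```

Under these assumptions, `expected_speedup(0.8, 4, 0.05)` evaluates to roughly 2.8, which falls inside the quoted range; lower acceptance rates or a more expensive draft model reduce it.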

Layout:

Speculator — trait an autoregressive model implements. Two methods: propose (draft) and verify (target).
DraftProposal / VerifyResult / AcceptDecision — wire-format data shapes.
speculative_accept — pure function that runs the acceptance algorithm. Testable without a real model.
SpecDecoder — orchestrator that calls a draft + target and returns the next batch of accepted tokens.

Structs

AcceptDecision: Outcome of one speculative-decoding round.
DraftProposal: One round of draft proposals.
SpecDecoder: Top-level orchestrator. Holds a draft + target speculator and the lookahead window n. step() runs one full round and returns the tokens to append to the running context.
VerifyResult: Target model’s verification of the draft’s proposals.
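To make the role of step() concrete, here is a simplified sketch of one orchestration round. It deliberately swaps the probabilistic acceptance test for greedy token matching (accept a drafted token only if it equals the target's own greedy choice), so it illustrates the control flow rather than the real algorithm. The trait `GreedySpeculator`, the `Scripted` test model, the `EOS` placeholder, and all field names are assumptions made for this example.

```rust
/// Placeholder token returned when a scripted model runs out of tokens.
const EOS: u32 = 0;

/// Hypothetical, simplified interface; not the module's `Speculator` trait.
trait GreedySpeculator {
    /// Draft role: propose up to `n` tokens continuing `context`.
    fn propose(&self, context: &[u32], n: usize) -> Vec<u32>;
    /// Target role: the model's own greedy token at each of the
    /// `draft.len() + 1` positions, as if computed in one forward pass.
    fn verify(&self, context: &[u32], draft: &[u32]) -> Vec<u32>;
}

struct SpecDecoder<D, T> {
    draft: D,
    target: T,
    n: usize, // lookahead window
}

impl<D: GreedySpeculator, T: GreedySpeculator> SpecDecoder<D, T> {
    /// Runs one round and returns the tokens to append to `context`.
    fn step(&self, context: &[u32]) -> Vec<u32> {
        let proposed = self.draft.propose(context, self.n);
        let wanted = self.target.verify(context, &proposed);
        let mut out = Vec::with_capacity(proposed.len() + 1);
        for (i, &tok) in proposed.iter().enumerate() {
            if tok == wanted[i] {
                out.push(tok); // draft agrees with target: accept
            } else {
                out.push(wanted[i]); // first mismatch: emit target's token, stop
                return out;
            }
        }
        // Every draft accepted: the target pass also yields one extra token.
        out.push(wanted[proposed.len()]);
        out
    }
}

/// Test model that replays a fixed token script, indexed by position.
struct Scripted(Vec<u32>);

impl GreedySpeculator for Scripted {
    fn propose(&self, context: &[u32], n: usize) -> Vec<u32> {
        self.0.iter().skip(context.len()).take(n).copied().collect()
    }
    fn verify(&self, context: &[u32], draft: &[u32]) -> Vec<u32> {
        (0..=draft.len())
            .map(|i| self.0.get(context.len() + i).copied().unwrap_or(EOS))
            .collect()
    }
}
```

Each round appends at least one token, so the loop always makes progress even when the draft is wrong at the first position.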

Traits

Speculator: Streaming speculator interface — one method to draft, one to verify. Real implementations bind to a CompiledGraph per model; testable implementations can return canned probability tables.
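A test implementation that returns canned probability tables might look like the sketch below. The real signatures are not shown on this page, so everything here is an assumption: the field names on `DraftProposal` and `VerifyResult`, the `&mut self` receivers, the `usize` token type, and the `CannedSpeculator` type itself.

```rust
/// Assumed shape: drafted tokens plus the draft distribution at each position.
struct DraftProposal {
    tokens: Vec<usize>,
    probs: Vec<Vec<f64>>,
}

/// Assumed shape: target distribution at each drafted position, plus one
/// extra row for the position after the last draft.
struct VerifyResult {
    probs: Vec<Vec<f64>>,
}

/// Assumed two-method interface: one method to draft, one to verify.
trait Speculator {
    fn propose(&mut self, context: &[usize], n: usize) -> DraftProposal;
    fn verify(&mut self, context: &[usize], proposal: &DraftProposal) -> VerifyResult;
}

/// Test double that ignores the context and returns the same probability
/// row at every position.
struct CannedSpeculator {
    row: Vec<f64>,
}

/// Index of the largest entry (0 for an empty row).
fn argmax(row: &[f64]) -> usize {
    row.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

impl Speculator for CannedSpeculator {
    fn propose(&mut self, _context: &[usize], n: usize) -> DraftProposal {
        // Deterministic stand-in for sampling: always pick the most likely token.
        let best = argmax(&self.row);
        DraftProposal {
            tokens: vec![best; n],
            probs: vec![self.row.clone(); n],
        }
    }

    fn verify(&mut self, _context: &[usize], proposal: &DraftProposal) -> VerifyResult {
        VerifyResult {
            probs: vec![self.row.clone(); proposal.tokens.len() + 1],
        }
    }
}
```

Because the canned rows are fixed, a test can pair two such speculators with different rows and know in advance which positions should be accepted.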

Functions

speculative_accept: Pure speculative-acceptance algorithm. Given the draft’s proposal and the target’s verification, runs the per-position accept/reject test and returns the final decision. No model state, no I/O — easy to unit-test against hand-built distributions.
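A minimal sketch of such a function, under assumed data shapes, is given below. To keep it free of model state and I/O, the random draws are passed in as arguments (`accept_u` for the per-position tests, `final_u` for the last sampled token); that parameterization, the struct fields, and the `sample` helper are all assumptions for illustration and may differ from the real signature.

```rust
struct DraftProposal {
    tokens: Vec<usize>,   // drafted token ids
    probs: Vec<Vec<f64>>, // draft distribution at each drafted position
}

struct VerifyResult {
    probs: Vec<Vec<f64>>, // target distributions, n + 1 rows
}

#[derive(Debug, PartialEq)]
struct AcceptDecision {
    accepted: usize,    // number of drafted tokens kept
    tokens: Vec<usize>, // kept tokens plus one corrected or extra token
}

/// Inverse-CDF sampling from unnormalized `weights` using a uniform draw
/// `u` in [0, 1). Assumes `weights` is non-empty with a positive sum.
fn sample(weights: &[f64], u: f64) -> usize {
    let total: f64 = weights.iter().sum();
    let mut acc = 0.0;
    for (i, w) in weights.iter().enumerate() {
        acc += w / total;
        if u < acc {
            return i;
        }
    }
    weights.len() - 1 // guard against floating-point rounding
}

fn speculative_accept(
    draft: &DraftProposal,
    target: &VerifyResult,
    accept_u: &[f64],
    final_u: f64,
) -> AcceptDecision {
    let n = draft.tokens.len();
    assert_eq!(draft.probs.len(), n);
    assert_eq!(target.probs.len(), n + 1);
    assert!(accept_u.len() >= n);

    let mut tokens = Vec::with_capacity(n + 1);
    for i in 0..n {
        let tok = draft.tokens[i];
        let (p, q) = (target.probs[i][tok], draft.probs[i][tok]);
        // Accept with probability min(1, p / q), written without a division.
        if accept_u[i] * q < p {
            tokens.push(tok);
            continue;
        }
        // Rejected: sample the corrected token from max(0, p - q).
        let residual: Vec<f64> = target.probs[i]
            .iter()
            .zip(&draft.probs[i])
            .map(|(pt, qd)| (pt - qd).max(0.0))
            .collect();
        tokens.push(sample(&residual, final_u));
        return AcceptDecision { accepted: i, tokens };
    }
    // All drafts accepted: take one extra token from the target's last row.
    tokens.push(sample(&target.probs[n], final_u));
    AcceptDecision { accepted: n, tokens }
}
```

Passing the uniform draws explicitly makes every outcome reproducible in a unit test: a hand-built pair of distributions plus fixed draws fully determines the decision.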