Speculative decoding scheduling pattern (plan #34).
Borrowed from MAX’s serving scheduler structure
(one_shot_scheduler.py, decode/prefill split). This is the classic
algorithm from Leviathan et al., “Fast Inference from Transformers via
Speculative Decoding”: a small draft model proposes
n tokens; the larger target model verifies all n in one
forward pass; tokens are accepted up to the first rejection,
then one extra “corrected” token is sampled from the residual
distribution.
Expected speedup on decode-heavy workloads: 2-3×.
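As a rough sanity check on the 2-3× figure, the expected number of tokens per round can be estimated under the simplifying assumption used in the Leviathan et al. analysis: each draft token is accepted independently with probability alpha. The sketch below is illustrative only. The function names and the cost ratio c (draft forward-pass cost divided by target forward-pass cost) are assumptions and not part of this module.

```rust
/// Expected tokens emitted per round with lookahead `n`, assuming each draft
/// token is accepted independently with probability `alpha` (an idealisation).
fn expected_tokens_per_round(alpha: f64, n: u32) -> f64 {
    if (1.0 - alpha).abs() < 1e-12 {
        // Every draft token is accepted, plus the one extra token.
        return (n + 1) as f64;
    }
    (1.0 - alpha.powi(n as i32 + 1)) / (1.0 - alpha)
}

/// Expected wall-clock speedup over plain autoregressive decoding.
/// `c` is a hypothetical cost ratio: draft pass cost / target pass cost.
fn expected_speedup(alpha: f64, n: u32, c: f64) -> f64 {
    // One round costs n draft passes plus one target pass.
    expected_tokens_per_round(alpha, n) / (n as f64 * c + 1.0)
}

fn main() {
    let speedup = expected_speedup(0.8, 4, 0.05);
    println!("estimated speedup: {speedup:.2}x");
}
```

With an acceptance rate of 0.8, a lookahead of 4, and a draft model costing 5% of the target, this estimate lands inside the quoted range. Real acceptance rates vary by position and workload, so treat the number as a ballpark.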
Layout:
- Speculator: trait an autoregressive model implements. Two methods: propose(draft) and verify(target).
- DraftProposal / VerifyResult / AcceptDecision: wire-format data shapes.
- speculative_accept: pure function that runs the acceptance algorithm. Testable without a real model.
- SpecDecoder: orchestrator that calls a draft + target and returns the next batch of accepted tokens.
Structs
- AcceptDecision: Outcome of one speculative-decoding round.
- DraftProposal: One round of draft proposals.
- SpecDecoder: Top-level orchestrator. Holds a draft + target speculator and the lookahead window n. step() runs one full round and returns the tokens to append to the running context.
- VerifyResult: Target model’s verification of the draft’s proposals.
Traits
- Speculator: Streaming speculator interface with one method to draft and one to verify. Real implementations bind to a CompiledGraph per model; testable implementations can return canned probability tables.
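To illustrate what a "canned probability table" implementation might look like, here is a minimal sketch. The method signatures, the struct fields, and the Canned type are all hypothetical: this page does not show the real definitions, so the actual trait may differ. The sketch also assumes the target returns one extra distribution row beyond the drafted positions.

```rust
// Hypothetical data shapes; the crate's real field names may differ.
struct DraftProposal {
    tokens: Vec<usize>,    // proposed token ids, one per position
    probs: Vec<Vec<f64>>,  // draft distribution over the vocab at each position
}

struct VerifyResult {
    probs: Vec<Vec<f64>>,  // target distributions; assumed n + 1 rows
}

// Hypothetical signatures for the two-method interface.
trait Speculator {
    fn propose(&mut self, context: &[usize], n: usize) -> DraftProposal;
    fn verify(&mut self, context: &[usize], draft: &DraftProposal) -> VerifyResult;
}

/// Test double: ignores the context and replays fixed data.
struct Canned {
    script: Vec<usize>, // tokens to replay when drafting
    dist: Vec<f64>,     // distribution reported at every position
}

impl Speculator for Canned {
    fn propose(&mut self, _context: &[usize], n: usize) -> DraftProposal {
        let tokens: Vec<usize> = self.script.iter().copied().take(n).collect();
        let probs = vec![self.dist.clone(); tokens.len()];
        DraftProposal { tokens, probs }
    }

    fn verify(&mut self, _context: &[usize], draft: &DraftProposal) -> VerifyResult {
        // One row per drafted position plus one extra row (an assumption).
        VerifyResult { probs: vec![self.dist.clone(); draft.tokens.len() + 1] }
    }
}

fn main() {
    let mut draft = Canned { script: vec![1, 0, 2], dist: vec![0.1, 0.7, 0.2] };
    let mut target = Canned { script: vec![], dist: vec![0.2, 0.5, 0.3] };

    let proposal = draft.propose(&[], 2);
    let result = target.verify(&[], &proposal);
    println!(
        "drafted {:?} with {} draft rows; target returned {} rows",
        proposal.tokens,
        proposal.probs.len(),
        result.probs.len()
    );
}
```

Because the double never touches a compiled graph, tests built on it run without model weights or an accelerator.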
Functions
- speculative_accept: Pure speculative-acceptance algorithm. Given the draft’s proposal and the target’s verification, runs the per-position accept/reject test and returns the final decision. No model state, no I/O, so it is easy to unit-test against hand-built distributions.