
Module spec_decode


Speculative decoding scheduling pattern (plan #34).

The scheduling structure is borrowed from MAX’s serving scheduler (one_shot_scheduler.py, decode/prefill split). The algorithm is the classic one from Leviathan et al., “Fast Inference from Transformers via Speculative Decoding”: a small draft model proposes n tokens; the larger target model verifies all n in one forward pass; tokens are accepted up to the first rejection, after which one extra “corrected” token is sampled from the residual distribution.
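For reference, the per-position test and the residual distribution described above can be written as follows. The symbols are taken from the paper rather than from this module: q_i is the draft distribution at position i, p_i is the target distribution at the same position, and x_i is the drafted token.

```latex
\Pr[\text{accept } x_i] = \min\!\left(1, \frac{p_i(x_i)}{q_i(x_i)}\right),
\qquad
p'_i(x) = \frac{\max\bigl(0,\; p_i(x) - q_i(x)\bigr)}
               {\sum_{x'} \max\bigl(0,\; p_i(x') - q_i(x')\bigr)}
```

The corrected token at the first rejected position is drawn from p'_i, which is what keeps the output distributed exactly as if the target model had sampled on its own.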

Expected speedup on decode-heavy workloads: 2-3×.

Layout:

  • Speculator — trait an autoregressive model implements. Two methods: propose (draft) and verify (target).
  • DraftProposal / VerifyResult / AcceptDecision — wire-format data shapes.
  • speculative_accept — pure function that runs the acceptance algorithm. Testable without a real model.
  • SpecDecoder — orchestrator that calls a draft + target and returns the next batch of accepted tokens.
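As an illustration of how these pieces might fit together, here is a minimal sketch of the trait and two of the data shapes, plus a canned test double. All field names, method signatures, the `TokenId` alias, and `CannedSpeculator` are assumptions made for this example; the module's actual definitions may differ.

```rust
/// Hypothetical token id type; the real module may use a different one.
pub type TokenId = u32;

/// Assumed shape: the drafted tokens plus the draft distribution per position.
pub struct DraftProposal {
    pub tokens: Vec<TokenId>,
    pub probs: Vec<Vec<f32>>, // probs[i][v] = draft probability of token v at position i
}

/// Assumed shape: the target distribution at each drafted position.
pub struct VerifyResult {
    pub probs: Vec<Vec<f32>>, // probs[i][v] = target probability of token v at position i
}

/// Assumed signatures for the two methods named in the layout.
pub trait Speculator {
    fn propose(&mut self, context: &[TokenId], n: usize) -> DraftProposal;
    fn verify(&mut self, context: &[TokenId], proposal: &DraftProposal) -> VerifyResult;
}

/// Hypothetical test double: answers every call from one fixed probability table.
pub struct CannedSpeculator {
    pub table: Vec<f32>,
}

impl Speculator for CannedSpeculator {
    fn propose(&mut self, _context: &[TokenId], n: usize) -> DraftProposal {
        // Greedy draft: always pick the most probable entry of the table.
        let best = self
            .table
            .iter()
            .enumerate()
            .fold(0usize, |best, (i, &p)| if p > self.table[best] { i } else { best });
        DraftProposal {
            tokens: vec![best as TokenId; n],
            probs: vec![self.table.clone(); n],
        }
    }

    fn verify(&mut self, _context: &[TokenId], proposal: &DraftProposal) -> VerifyResult {
        // One distribution per drafted position, all taken from the same table.
        VerifyResult {
            probs: vec![self.table.clone(); proposal.tokens.len()],
        }
    }
}
```

A double like this lets the orchestration logic be exercised without loading a model, which is the design goal the layout attributes to the trait.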

Structs

AcceptDecision
Outcome of one speculative-decoding round.
DraftProposal
One round of draft proposals.
SpecDecoder
Top-level orchestrator. Holds a draft + target speculator and the lookahead window n. step() runs one full round and returns the tokens to append to the running context.
VerifyResult
Target model’s verification of the draft’s proposals.

Traits

Speculator
Streaming speculator interface — one method to draft, one to verify. Real implementations bind to a CompiledGraph per model; testable implementations can return canned probability tables.

Functions

speculative_accept
Pure speculative-acceptance algorithm. Given the draft’s proposal and the target’s verification, runs the per-position accept/reject test and returns the final decision. No model state, no I/O — easy to unit-test against hand-built distributions.
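A minimal sketch of what such a pure acceptance function could look like, assuming the standard test from the Leviathan et al. paper. The signature, the `AcceptDecision` fields, and the `uniforms` parameter (pre-drawn random numbers in [0, 1), passed in so the function stays deterministic) are hypothetical and chosen for this example only.

```rust
/// Hypothetical result shape; the module's real `AcceptDecision` may differ.
#[derive(Debug, PartialEq)]
pub struct AcceptDecision {
    /// Number of leading draft tokens that were accepted.
    pub accepted: usize,
    /// Normalized max(0, p - q) at the first rejected position, or `None`
    /// if every draft token was accepted.
    pub residual: Option<Vec<f32>>,
}

/// Assumed signature. `tokens[i]` indexes into `draft_probs[i]` and
/// `target_probs[i]`; `uniforms[i]` is a pre-drawn sample from [0, 1).
pub fn speculative_accept(
    tokens: &[usize],
    draft_probs: &[Vec<f32>],
    target_probs: &[Vec<f32>],
    uniforms: &[f32],
) -> AcceptDecision {
    for (i, &tok) in tokens.iter().enumerate() {
        let q = draft_probs[i][tok];
        let p = target_probs[i][tok];
        // Accept with probability min(1, p / q). Written as u * q < p so that
        // q == 0 does not divide by zero.
        if uniforms[i] * q < p {
            continue;
        }
        // First rejection: build the residual distribution for the corrected token.
        let mut residual: Vec<f32> = target_probs[i]
            .iter()
            .zip(&draft_probs[i])
            .map(|(&p, &q)| (p - q).max(0.0))
            .collect();
        let total: f32 = residual.iter().sum();
        if total > 0.0 {
            for r in residual.iter_mut() {
                *r /= total;
            }
        }
        return AcceptDecision { accepted: i, residual: Some(residual) };
    }
    AcceptDecision { accepted: tokens.len(), residual: None }
}
```

In this sketch the caller samples the corrected token from `residual`; keeping both the random draws and the sampling outside the function is one way to make it testable against hand-built distributions.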