Module engine

Speculative decoding engine.

SpeculativeDecoder composes a DraftModel and a TargetModel and runs the Leviathan / Chen speculative generation loop:

loop until max_tokens reached:
  1. draft.propose(prefix, k)       → k candidate tokens + distributions
  2. target.verify(prefix, draft)   → k+1 target distributions
  3. for i in 0..k:
       if accept(p_draft_i, p_target_i, rng):
           append draft[i]
       else:
           append resample_from_adjusted_target(p_target_i, p_draft_i, rng)
           break
  4. if all k accepted:
       append sample_from_logprobs(p_target_k, rng)  (bonus, from the last of the k+1 target distributions)
  5. update metrics; continue.
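
One round of the loop above can be sketched in a few lines of Rust. This is a minimal illustration and not the crate's implementation: the function names `accept`, `adjusted_target`, `sample`, and `speculative_round` are hypothetical, distributions are plain probability vectors rather than log-probabilities, and the RNG is replaced by a caller-supplied closure that returns uniform draws in [0, 1).

```rust
/// Accept a drafted token with probability min(1, p_target / p_draft).
/// `u` is a uniform draw in [0, 1) supplied by the caller.
fn accept(p_draft: f64, p_target: f64, u: f64) -> bool {
    // Equivalent to u < p_target / p_draft, without dividing.
    u * p_draft < p_target
}

/// Residual distribution used after a rejection: normalize max(0, p_target - p_draft).
fn adjusted_target(p_target: &[f64], p_draft: &[f64]) -> Vec<f64> {
    let residual: Vec<f64> = p_target
        .iter()
        .zip(p_draft)
        .map(|(&t, &d)| (t - d).max(0.0))
        .collect();
    let total: f64 = residual.iter().sum();
    if total > 0.0 {
        residual.iter().map(|r| r / total).collect()
    } else {
        // The two distributions coincide; fall back to the target itself.
        p_target.to_vec()
    }
}

/// Inverse-CDF sampling from a probability vector using a uniform draw `u`.
fn sample(dist: &[f64], u: f64) -> usize {
    let mut cumulative = 0.0_f64;
    for (token, &p) in dist.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return token;
        }
    }
    dist.len() - 1 // guard against floating-point rounding
}

/// One speculative round: returns between 1 and k+1 tokens.
/// `target_dists` holds k+1 distributions; the last one is used for the bonus token.
fn speculative_round(
    draft_tokens: &[usize],
    draft_dists: &[Vec<f64>],
    target_dists: &[Vec<f64>],
    mut uniform: impl FnMut() -> f64,
) -> Vec<usize> {
    let k = draft_tokens.len();
    assert_eq!(draft_dists.len(), k);
    assert_eq!(target_dists.len(), k + 1);

    let mut appended = Vec::with_capacity(k + 1);
    for i in 0..k {
        let token = draft_tokens[i];
        if accept(draft_dists[i][token], target_dists[i][token], uniform()) {
            appended.push(token);
        } else {
            // First rejection: resample from the residual and end the round.
            let adjusted = adjusted_target(&target_dists[i], &draft_dists[i]);
            appended.push(sample(&adjusted, uniform()));
            return appended;
        }
    }
    // All k drafts accepted: draw the bonus token from the extra target distribution.
    appended.push(sample(&target_dists[k], uniform()));
    appended
}
```

Passing the uniform draws in as a closure keeps the sketch free of external crates and makes each round deterministic for a given sequence of draws.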

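The number of tokens appended per round is therefore in 1..=k+1: at least one, because a rejection still appends a resampled token, and at most k+1, when every draft token is accepted and the bonus token is added. Crucially, the marginal distribution of each appended token is identical to the target distribution conditioned on the current prefix, p_target(· | prefix). The empirical chi-square test in tests.rs validates this property against 10 000 samples.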
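
The marginal-distribution claim can also be checked exactly for a single accept-or-resample step, without sampling. The sketch below is an illustration only and is not the test in tests.rs: `emitted_marginal` is a hypothetical helper, and `p` and `q` are assumed to be plain probability vectors for the target and draft models. A token x is emitted either by being drafted and accepted, with probability min(q(x), p(x)), or by being resampled after a rejection, with probability equal to the rejection mass times the normalized residual max(0, p(x) - q(x)).

```rust
/// Exact probability that one accept-or-resample step emits each token,
/// given target distribution `p` and draft distribution `q`.
fn emitted_marginal(p: &[f64], q: &[f64]) -> Vec<f64> {
    // P(draft x and accept it) = q(x) * min(1, p(x)/q(x)) = min(q(x), p(x)).
    let accept_mass: f64 = p.iter().zip(q).map(|(&pt, &qd)| pt.min(qd)).sum();
    let reject_mass = 1.0 - accept_mass;

    // Unnormalized residual max(0, p - q), from which rejected steps resample.
    let residual: Vec<f64> = p
        .iter()
        .zip(q)
        .map(|(&pt, &qd)| (pt - qd).max(0.0))
        .collect();
    let residual_total: f64 = residual.iter().sum();

    p.iter()
        .zip(q)
        .zip(&residual)
        .map(|((&pt, &qd), &r)| {
            let via_accept = pt.min(qd);
            let via_resample = if residual_total > 0.0 {
                reject_mass * r / residual_total
            } else {
                0.0 // p == q: rejection never happens
            };
            via_accept + via_resample
        })
        .collect()
}
```

Because min(q, p) + max(0, p - q) = p for every token, the returned vector should match `p` up to floating-point rounding. A sampling-based chi-square test checks the same property end to end, including the RNG and sampling code paths.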

Structs

SpeculativeDecoder
Speculative decoder composing a draft model and a target model.
SpeculativeDecoderConfig
Configuration for the speculative decoder.