1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
//! Model-level **speculative decoding** for language-model generation.
//!
//! This module implements the rejection-sampling acceleration scheme of
//! Leviathan et al. (2023) ["Fast Inference from Transformers via Speculative
//! Decoding"](https://arxiv.org/abs/2211.17192) and Chen et al. (2023)
//! ["Accelerating Large Language Model Decoding with Speculative
//! Sampling"](https://arxiv.org/abs/2302.01318). In their formulation a
//! small *draft model* proposes the next `k` tokens cheaply; a larger
//! *target model* verifies them in a single parallel forward pass. A
//! per-token Bernoulli acceptance test decides which draft tokens are kept —
//! and the first rejected position is re-sampled from an *adjusted* target
//! distribution so that the algorithm's marginal output is provably
//! indistinguishable from sampling directly from the target.
//!
//! ## Relation to `tensorlogic-infer::speculative`
//!
//! That module implements **execution-level** speculation over tensor
//! pipelines (branch prediction, prefetching, rollback). This one targets
//! **model-level** generation: draft/target LLM pairs. The two are
//! deliberately separate — they don't share abstractions because the
//! respective correctness contracts differ.
//!
//! ## Pseudo-code (Leviathan et al. §3)
//!
//! ```text
//! Input: prefix x_1 .. x_t, max_tokens T, draft model p, target model q
//! Output: extension y_1 .. y_L with L ≤ T
//!
//! while |y| < T:
//! # 1. Draft proposes k tokens and returns full per-step distributions.
//! (tokens, p_dists) = p.propose(prefix+y, k)
//!
//! # 2. Target scores k+1 positions in one forward pass.
//! q_dists = q.verify(prefix+y, tokens) // length k+1
//!
//! # 3. Rejection sweep.
//! for i in 0..k:
//! r = U[0,1)
//! if r < min(1, q_dists[i][tokens[i]] / p_dists[i][tokens[i]]):
//! append tokens[i] to y # ACCEPT
//! else:
//! resample from max(0, q_dists[i] - p_dists[i]) / Z
//! append that token to y # REJECT & resample
//! break
//! else:
//! # all k accepted → sample bonus token from q_dists[k]
//! append sample(q_dists[k]) to y
//! ```
//!
//! Both the per-token acceptance rule and the adjusted resampling step are
//! exported as pure functions from [`acceptance`] so they can be unit-tested
//! in isolation.
//!
//! ## Public surface
//!
//! * [`DraftModel`], [`TargetModel`] — the two trait layers.
//! * [`DraftProposal`], [`TargetScores`] — shape-typed outputs.
//! * [`SpeculativeDecoder`], [`SpeculativeDecoderConfig`] — the engine.
//! * [`SpeculativeMetrics`] — throughput / accept-rate / speedup counters.
//! * [`MockDraftModel`], [`MockTargetModel`] — test fixtures.
//! * [`SpeculativeDecodingError`], [`SpeculativeDecodingResult`] — errors.
pub use ;
pub use ;
pub use ;
pub use SpeculativeMetrics;
pub use ;
pub use SpecRng;
pub use ;