//! Speculative decoding for accelerated autoregressive generation.
//!
//! Speculative decoding uses a fast **draft model** to propose candidate tokens,
//! which are then verified in parallel by a slower but more accurate **target model**.
//! Through rejection sampling, the output distribution is mathematically equivalent
//! to sampling from the target model alone, while reducing the number of expensive
//! target-model evaluations.
//!
//! ## Overview
//!
//! The decoding loop works as follows:
//!
//! 1. The draft model generates `k` candidate tokens autoregressively.
//! 2. The target model evaluates all `k` positions in a single forward pass.
//! 3. Rejection sampling accepts a prefix of the draft tokens and, on rejection,
//!    resamples from the adjusted distribution `max(0, p_target - p_draft)`.
//! 4. The process repeats until the desired output length is reached.
//!
//! ## Key Types
//!
//! - [`SpeculativeConfig`] — configuration for the decoding loop.
//! - [`DraftModel`] — trait for fast draft models.
//! - [`TargetModel`] — trait for the authoritative target model.
//! - [`SpeculativeDecoder`] — orchestrator that runs the full loop.
//! - [`SpeculativeVerifier`] — rejection-sampling verifier.
//!
//! ## Example
//!
//! ```rust
//! use scirs2_neural::speculative::{
//!     SpeculativeConfig, SpeculativeDecoder, UniformDraftModel,
//! };
//!
//! // For a real use case, implement TargetModel for your LLM.
//! // Here we just demonstrate the configuration:
//! let config = SpeculativeConfig {
//!     draft_length: 4,
//!     temperature: 0.8,
//!     top_k: 50,
//!     max_tokens: 256,
//!     adaptive_draft: true,
//! };
//! assert_eq!(config.draft_length, 4);
//! ```
// Re-export key types. (The source paths were missing here; the submodule
// names below are assumed from the types documented above.)
pub use config::SpeculativeConfig;
pub use decoder::{SpeculativeDecoder, SpeculativeVerifier};
pub use models::{DraftModel, TargetModel, UniformDraftModel};
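// The rejection-sampling rule in step 3 of the overview can be sketched as
// below. This is an illustrative, self-contained sketch, not the crate's
// actual verifier API: the function name, signature, and the use of dense
// probability slices are all assumptions made for clarity.

/// Accept a draft-proposed token with probability `min(1, p_target/p_draft)`.
/// On rejection, return the normalized residual `max(0, p_target - p_draft)`
/// from which the replacement token should be sampled.
fn accept_or_resample(
    p_target: &[f64], // target-model distribution over the vocabulary
    p_draft: &[f64],  // draft-model distribution over the vocabulary
    proposed: usize,  // token index proposed by the draft model
    u: f64,           // uniform random sample in [0, 1)
) -> Result<usize, Vec<f64>> {
    // `p_draft[proposed]` is nonzero in practice: the draft model sampled it.
    if u < (p_target[proposed] / p_draft[proposed]).min(1.0) {
        return Ok(proposed);
    }
    // Rejected: build the residual distribution max(0, p_t - p_d) and normalize.
    let residual: Vec<f64> = p_target
        .iter()
        .zip(p_draft)
        .map(|(t, d)| (t - d).max(0.0))
        .collect();
    let z: f64 = residual.iter().sum();
    Err(residual.iter().map(|r| r / z).collect())
}

// Returning `Result` keeps the two outcomes explicit: `Ok` carries the accepted
// token, `Err` carries the adjusted distribution for the caller to sample from.
// This accept/resample scheme is exactly what makes the combined output
// distribution equal to sampling from the target model alone.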