pub struct SpeculativeDecoder { /* private fields */ }
Async speculative decoder.
Wraps a draft engine (generating spec_k candidates per step) and a target
engine (verifying the candidates in a single batched forward pass). The two
engines run with overlap via tokio.
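The draft/verify cycle described above can be sketched with toy closures in place of real engines. This is a minimal illustration of one greedy speculation step, not this crate's implementation: `speculate_step`, the `Token` alias, and the closure-based engines are all hypothetical, and the real target verifies candidates in one batched forward pass rather than the loop shown here.

```rust
/// Toy token type (assumption); a real engine would use its tokenizer's ID type.
type Token = u32;

/// One greedy speculation step (hypothetical helper, not the crate's API).
/// `draft` proposes `spec_k` tokens autoregressively; `target` returns the
/// token it would emit for a given prefix. Returns the tokens to append.
fn speculate_step(
    context: &[Token],
    spec_k: usize,
    draft: impl Fn(&[Token]) -> Token,
    target: impl Fn(&[Token]) -> Token,
) -> Vec<Token> {
    // Draft phase: propose spec_k candidates, feeding each back as context.
    let mut scratch = context.to_vec();
    let mut candidates = Vec::with_capacity(spec_k);
    for _ in 0..spec_k {
        let t = draft(&scratch);
        candidates.push(t);
        scratch.push(t);
    }

    // Verify phase: accept candidates while they match the target's choice.
    // (A real target would score all positions in a single batched pass.)
    let mut accepted = Vec::new();
    let mut prefix = context.to_vec();
    for &cand in &candidates {
        let expected = target(&prefix);
        if expected == cand {
            accepted.push(cand);
            prefix.push(cand);
        } else {
            // First mismatch: keep the target's token and discard the rest.
            accepted.push(expected);
            return accepted;
        }
    }
    // Every candidate matched: the target contributes one extra token.
    accepted.push(target(&prefix));
    accepted
}
```

Under this scheme each step yields between 1 and `spec_k + 1` tokens, which is where the speedup comes from when the draft engine agrees with the target often.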
§Limitations
- Both engines must use the same tokenizer and vocabulary.
- The draft engine must be strictly smaller/faster than the target.
- The target engine must implement Rewindable (KV-cache based). For SSM targets, use SpeculativeDecoder::new_n1, which forces N=1 mode.
§Example
let mut decoder = SpeculativeDecoder::new(
    draft_engine,
    target_engine,
    AsyncSpecConfig::default(),
);
let text = decoder.generate("hello", |tok| print!("{tok}")).await?;
Implementations§
impl SpeculativeDecoder
pub fn new(
    draft: InferenceEngine,
    target: InferenceEngine,
    config: AsyncSpecConfig,
) -> Self
Construct a new async speculative decoder.
Both engines must be loaded (i.e. is_loaded() is true) before
generate is called.
pub fn new_n1(
    draft: InferenceEngine,
    target: InferenceEngine,
    config: AsyncSpecConfig,
) -> Self
Construct a decoder that always uses N=1 mode (for SSM targets).
pub fn reset_stats(&mut self)
Reset statistics counters.
pub fn cancellation_token(&self) -> CancellationToken
Return the cancellation token used for external cancellation. Per the signature, the token is returned by value rather than as a reference.
pub async fn generate<F>(
    &mut self,
    prompt: &str,
    on_token: F,
) -> RuntimeResult<String>
Run async speculative generation for prompt, calling on_token for
each decoded token.
Returns the full generated text and updates self.stats.
§SSM fallback
If the target engine’s rewind() returns RewindError::NotSupported
on the first call, the decoder automatically falls back to N=1 mode for
the rest of the generation. SpecStats::n1_fallbacks is incremented.
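The fallback rule above can be sketched as a small state transition. Everything in this block is a hypothetical stand-in written for illustration: the `Rewindable` trait signature, the `Mode` enum, the `rewind_or_fallback` helper, and the toy targets are assumptions, and only the names `RewindError::NotSupported` and `SpecStats::n1_fallbacks` come from the text.

```rust
/// Stand-in for the error named in the docs (the real type may have more variants).
#[derive(Debug, PartialEq)]
enum RewindError {
    NotSupported,
}

/// Hypothetical shape of the `Rewindable` trait; the real signature may differ.
trait Rewindable {
    fn rewind(&mut self, n_tokens: usize) -> Result<(), RewindError>;
}

/// Stand-in for the stats struct, reduced to the one counter discussed here.
#[derive(Debug, Default)]
struct SpecStats {
    n1_fallbacks: u64,
}

/// Hypothetical decoding mode: full speculation, or one token per step.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Speculative,
    N1,
}

/// Try to drop `rejected` tokens from the target's state. If the target
/// cannot rewind, switch to N=1 mode and record the fallback once.
fn rewind_or_fallback<T: Rewindable>(
    target: &mut T,
    rejected: usize,
    mode: Mode,
    stats: &mut SpecStats,
) -> Mode {
    if mode == Mode::N1 {
        return Mode::N1; // already fell back; stay in N=1 for the rest of the run
    }
    match target.rewind(rejected) {
        Ok(()) => Mode::Speculative,
        Err(RewindError::NotSupported) => {
            stats.n1_fallbacks += 1;
            Mode::N1
        }
    }
}

/// Toy KV-cache target: rewinding just shortens the cached length.
struct KvTarget {
    len: usize,
}

impl Rewindable for KvTarget {
    fn rewind(&mut self, n_tokens: usize) -> Result<(), RewindError> {
        self.len = self.len.saturating_sub(n_tokens);
        Ok(())
    }
}

/// Toy SSM target: recurrent state cannot be rewound.
struct SsmTarget;

impl Rewindable for SsmTarget {
    fn rewind(&mut self, _n_tokens: usize) -> Result<(), RewindError> {
        Err(RewindError::NotSupported)
    }
}
```

Once the mode is `N1` the helper never calls `rewind()` again, so the counter is bumped at most once per generation in this sketch.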
§Cancellation
The generation loop checks self.cancel after each speculation step.
Callers can cancel by calling cancel() on the token returned by cancellation_token() from another task.
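The check-after-each-step behaviour can be sketched with the standard library alone. `CancelFlag` below is a deliberately simplified stand-in built on `AtomicBool`, not the crate's `CancellationToken`, and `run_steps` is a hypothetical loop in which the `step` closure plays the role of one draft-and-verify round.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Simplified stand-in for a cancellation token (assumption for illustration).
/// Clones share the same underlying flag.
#[derive(Clone, Default)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Run up to `max_steps` steps, checking the flag after each one.
/// `step` receives the step index and returns the tokens it produced.
fn run_steps(
    cancel: &CancelFlag,
    max_steps: usize,
    mut step: impl FnMut(usize) -> Vec<u32>,
) -> Vec<u32> {
    let mut out = Vec::new();
    for i in 0..max_steps {
        out.extend(step(i));
        // The check comes after the step, so a step already in flight
        // finishes and its tokens are kept.
        if cancel.is_cancelled() {
            break;
        }
    }
    out
}
```

Because the flag is only polled between steps, cancellation latency in this sketch is bounded by the duration of a single step.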
§Errors
Returns RuntimeError::ModelNotLoaded if either engine is not loaded.
Returns RuntimeError::Cancelled if the cancellation token is
triggered before the first token is produced.
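A caller might branch on the two documented errors roughly as follows. The `RuntimeError` enum here is a reduced stand-in containing only the two variants named above (the real type likely has more, in which case a wildcard arm would be needed), and `Action` and `on_generate_result` are hypothetical names introduced for this sketch.

```rust
/// Reduced stand-in for the crate's error type: only the documented variants.
#[derive(Debug, PartialEq)]
enum RuntimeError {
    ModelNotLoaded,
    Cancelled,
}

/// Hypothetical caller-side decision after a `generate` call.
#[derive(Debug, PartialEq)]
enum Action {
    /// Generation succeeded; use the returned text.
    Use(String),
    /// One of the engines was not loaded; load both, then call again.
    LoadEnginesAndRetry,
    /// Cancelled before any token was produced; nothing to show.
    Abort,
}

/// Map the result of `generate` onto a caller action.
fn on_generate_result(result: Result<String, RuntimeError>) -> Action {
    match result {
        Ok(text) => Action::Use(text),
        Err(RuntimeError::ModelNotLoaded) => Action::LoadEnginesAndRetry,
        Err(RuntimeError::Cancelled) => Action::Abort,
    }
}
```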
Auto Trait Implementations§
impl !Freeze for SpeculativeDecoder
impl !RefUnwindSafe for SpeculativeDecoder
impl Send for SpeculativeDecoder
impl Sync for SpeculativeDecoder
impl Unpin for SpeculativeDecoder
impl UnsafeUnpin for SpeculativeDecoder
impl !UnwindSafe for SpeculativeDecoder
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.