pub trait DecodeBackend: Send + Sync {
    // Required methods
    fn decode_step(
        &mut self,
        token_id: u32,
        position: usize,
        cache_key: &str,
    ) -> Result<TensorRef>;
    fn init_kv_cache(
        &mut self,
        cache_key: &str,
        kv_data: Vec<(TensorRef, TensorRef)>,
        prefill_len: usize,
    ) -> Result<()>;
    fn has_kv_cache(&self, cache_key: &str) -> bool;
    fn release_kv_cache(&mut self, cache_key: &str);
    fn name(&self) -> &str;
}
Decode-phase execution backend.
Implements the actual computation for single-token decode steps. Different backends optimize for different hardware:
CudaDecodeBackend: cuBLAS + custom CUDA kernels, pre-allocated buffers
MetalDecodeBackend: Metal compute shaders
CandleDecodeBackend: candle tensor ops (CPU/fallback)
The backend is initialized with model weights and manages its own internal state (KV cache, buffers, cuBLAS handles, etc.).
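As an illustration of what an implementor has to provide, here is a minimal sketch of a toy backend. It is not one of the backends listed above. `TensorRef` and `Result` are stand-in aliases (assumptions, since their real definitions are not shown on this page), and `ToyBackend` with its fake logits is purely hypothetical.

```rust
use std::collections::HashMap;

// Stand-in types (assumptions): the real crate defines its own TensorRef and Result.
pub type TensorRef = Vec<f32>;
pub type Result<T> = std::result::Result<T, String>;

pub trait DecodeBackend: Send + Sync {
    fn decode_step(&mut self, token_id: u32, position: usize, cache_key: &str)
        -> Result<TensorRef>;
    fn init_kv_cache(
        &mut self,
        cache_key: &str,
        kv_data: Vec<(TensorRef, TensorRef)>,
        prefill_len: usize,
    ) -> Result<()>;
    fn has_kv_cache(&self, cache_key: &str) -> bool;
    fn release_kv_cache(&mut self, cache_key: &str);
    fn name(&self) -> &str;
}

/// Hypothetical toy backend: per sequence it stores only the cached token count.
pub struct ToyBackend {
    vocab_size: usize,
    cached_tokens: HashMap<String, usize>,
}

impl ToyBackend {
    pub fn new(vocab_size: usize) -> Self {
        assert!(vocab_size > 0, "vocab_size must be non-zero");
        Self { vocab_size, cached_tokens: HashMap::new() }
    }
}

impl DecodeBackend for ToyBackend {
    fn decode_step(&mut self, token_id: u32, _position: usize, cache_key: &str)
        -> Result<TensorRef>
    {
        let cached = self
            .cached_tokens
            .get_mut(cache_key)
            .ok_or_else(|| format!("no KV cache for '{cache_key}'"))?;
        *cached += 1; // a real backend would append this token's K/V here
        // Fake logits (flattened [1, 1, vocab_size]): peak at the next token id.
        let mut logits = vec![0.0; self.vocab_size];
        logits[(token_id as usize + 1) % self.vocab_size] = 1.0;
        Ok(logits)
    }

    fn init_kv_cache(
        &mut self,
        cache_key: &str,
        _kv_data: Vec<(TensorRef, TensorRef)>, // ignored by this toy
        prefill_len: usize,
    ) -> Result<()> {
        self.cached_tokens.insert(cache_key.to_string(), prefill_len);
        Ok(())
    }

    fn has_kv_cache(&self, cache_key: &str) -> bool {
        self.cached_tokens.contains_key(cache_key)
    }

    fn release_kv_cache(&mut self, cache_key: &str) {
        self.cached_tokens.remove(cache_key);
    }

    fn name(&self) -> &str {
        "toy"
    }
}
```

Keying all per-sequence state by `cache_key` mirrors the trait's shape: every method except `name` takes the key, so a single backend instance can serve several sequences.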
Required Methods
fn decode_step(&mut self, token_id: u32, position: usize, cache_key: &str) -> Result<TensorRef>
Execute a single decode step: one token in, logits out.
token_id: the input token
position: sequence position (for RoPE)
cache_key: identifies the sequence’s KV cache
Returns the logits as a TensorRef of shape [1, 1, vocab_size].
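A caller would typically invoke decode_step in a loop, feeding each sampled token back in. The sketch below shows a greedy version of such a loop. It makes several assumptions: `TensorRef` is replaced by a flat `Vec<f32>` of logits, `Result` by a `String`-error alias, the `step` closure stands in for `backend.decode_step(tok, pos, cache_key)`, and positions are assumed to continue from `prefill_len`. The helper names `argmax` and `greedy_decode` are hypothetical.

```rust
// Stand-in types (assumptions): flattened [1, 1, vocab_size] logits, String errors.
type TensorRef = Vec<f32>;
type Result<T> = std::result::Result<T, String>;

/// Index of the largest logit (greedy sampling); None if the slice is empty.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}

/// Greedy decode loop. `step` stands in for
/// `|tok, pos| backend.decode_step(tok, pos, cache_key)`.
fn greedy_decode(
    mut step: impl FnMut(u32, usize) -> Result<TensorRef>,
    first_token: u32,
    prefill_len: usize,
    max_new_tokens: usize,
    eos_token: u32,
) -> Result<Vec<u32>> {
    let mut generated = Vec::new();
    let mut token = first_token;
    for i in 0..max_new_tokens {
        // Assumption: the first decoded token sits right after the prefill.
        let logits = step(token, prefill_len + i)?;
        token = argmax(&logits).ok_or("empty logits")?;
        if token == eos_token {
            break;
        }
        generated.push(token);
    }
    Ok(generated)
}
```

Taking a closure rather than the backend itself keeps the sketch independent of the concrete tensor type; in real code the loop would hold a `&mut dyn DecodeBackend` and read the logits out of the returned TensorRef.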
fn init_kv_cache(&mut self, cache_key: &str, kv_data: Vec<(TensorRef, TensorRef)>, prefill_len: usize) -> Result<()>
Initialize KV cache for a new sequence from prefill data.
Called after prefill (which runs through the model’s forward pass) to hand off the KV cache to the decode backend.
kv_data: per-layer (K, V) tensor pairs from the prefill pass.
prefill_len: number of tokens in the prefill.
fn has_kv_cache(&self, cache_key: &str) -> bool
Check whether a KV cache exists for a sequence.
fn release_kv_cache(&mut self, cache_key: &str)
Release KV cache for a completed sequence.
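Taken together, the cache methods suggest a lifecycle of init, decode, release. One way a caller might make sure the release always happens is sketched below. The wrapper `with_kv_cache` and the `StubBackend` used to exercise it are hypothetical, `TensorRef` and `Result` are stand-in aliases, and rejecting an already-live key is this sketch's own policy, not documented behaviour.

```rust
use std::collections::HashSet;

// Stand-in types (assumptions): the real crate defines its own TensorRef and Result.
pub type TensorRef = Vec<f32>;
pub type Result<T> = std::result::Result<T, String>;

pub trait DecodeBackend: Send + Sync {
    fn decode_step(&mut self, token_id: u32, position: usize, cache_key: &str)
        -> Result<TensorRef>;
    fn init_kv_cache(
        &mut self,
        cache_key: &str,
        kv_data: Vec<(TensorRef, TensorRef)>,
        prefill_len: usize,
    ) -> Result<()>;
    fn has_kv_cache(&self, cache_key: &str) -> bool;
    fn release_kv_cache(&mut self, cache_key: &str);
    fn name(&self) -> &str;
}

/// Hypothetical helper: runs `body` with a live KV cache and releases the
/// cache afterwards, whether `body` succeeded or failed.
pub fn with_kv_cache<B: DecodeBackend + ?Sized, T>(
    backend: &mut B,
    cache_key: &str,
    kv_data: Vec<(TensorRef, TensorRef)>,
    prefill_len: usize,
    body: impl FnOnce(&mut B) -> Result<T>,
) -> Result<T> {
    // Sketch policy (assumption): refuse to overwrite a live sequence.
    if backend.has_kv_cache(cache_key) {
        return Err(format!("cache key '{cache_key}' is already in use"));
    }
    backend.init_kv_cache(cache_key, kv_data, prefill_len)?;
    let result = body(&mut *backend);
    backend.release_kv_cache(cache_key);
    result
}

/// Hypothetical stub that only tracks which cache keys are live.
#[derive(Default)]
pub struct StubBackend {
    live: HashSet<String>,
}

impl DecodeBackend for StubBackend {
    fn decode_step(&mut self, _token_id: u32, _position: usize, cache_key: &str)
        -> Result<TensorRef>
    {
        if self.live.contains(cache_key) {
            Ok(vec![0.0])
        } else {
            Err(format!("no KV cache for '{cache_key}'"))
        }
    }
    fn init_kv_cache(
        &mut self,
        cache_key: &str,
        _kv_data: Vec<(TensorRef, TensorRef)>,
        _prefill_len: usize,
    ) -> Result<()> {
        self.live.insert(cache_key.to_string());
        Ok(())
    }
    fn has_kv_cache(&self, cache_key: &str) -> bool {
        self.live.contains(cache_key)
    }
    fn release_kv_cache(&mut self, cache_key: &str) {
        self.live.remove(cache_key);
    }
    fn name(&self) -> &str {
        "stub"
    }
}
```

Because release_kv_cache returns nothing, the wrapper can call it unconditionally and still propagate the result of the decode closure unchanged.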