pub struct LlmExecutor { /* private fields */ }Implementations§
Source§impl LlmExecutor
impl LlmExecutor
pub fn new(model: Box<dyn DecoderOnlyLLM>, info: ModelInfo) -> Self
Sourcepub fn truncate_kv_for_cache_id(&self, cache_id: &str, new_len: usize)
pub fn truncate_kv_for_cache_id(&self, cache_id: &str, new_len: usize)
Roll the KV cache for cache_id back to new_len positions.
Used by speculative decoding on partial rejection. The caller must
supply a GenericKvCacheHandle whose seq_len is also updated.
Trait Implementations§
Source§impl ModelExecutor for LlmExecutor
impl ModelExecutor for LlmExecutor
Source§fn batch_prefill<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [PrefillInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<PrefillOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn batch_prefill<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [PrefillInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<PrefillOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Batched prefill: combine all prompts into ONE model.unified_forward
call so launch / kernel-overhead is amortized across the cohort.
Falls back to the trait default (serial per-item) when the model
returns Err(unsupported) from unified_forward — e.g. Qwen3MoeModel
today, until Phase 2 adds its native unified path.
Source§fn batch_decode<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [DecodeInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn batch_decode<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [DecodeInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Override default fallback to acquire the model lock ONCE for the whole batch, avoiding N round-trips through parking_lot. Does not yet do true attention batching (each cache has its own kv_len), but removes mutex churn that was serialising concurrent requests at async level.
Source§fn unified_decode<'life0, 'life1, 'async_trait>(
&'life0 self,
batch: &'life1 UnifiedBatch,
) -> Pin<Box<dyn Future<Output = Result<Vec<Option<Vec<f32>>>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn unified_decode<'life0, 'life1, 'async_trait>(
&'life0 self,
batch: &'life1 UnifiedBatch,
) -> Pin<Box<dyn Future<Output = Result<Vec<Option<Vec<f32>>>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Unified mixed-batch dispatch (chunked-prefill API).
This impl is a behavior-preserving FALLBACK over the existing
trait methods on DecoderOnlyLLM: prefill items go through
model.prefill(seq_id, &q_tokens) (one at a time, mirroring the
engine’s current sequential prefill loop), decode items
(q_len == 1 && is_final_chunk) are grouped into a single
model.decode_batch(...) call. Net behavior is identical to the
engine’s pre-Phase-13 path; this just changes WHO orchestrates
the prefill/decode split (caller → unified_decode) so the engine
can converge on a single call.
The real performance unlock comes in Step 5 when models override this with a true unified-forward (one [M_total, hidden] forward
- varlen attention) — at that point the kernel-level mix replaces the host-side serial dispatch here.
Source§fn kv_capacity(&self) -> Option<usize>
fn kv_capacity(&self) -> Option<usize>
Source§fn prefill<'life0, 'life1, 'async_trait>(
&'life0 self,
input: &'life1 PrefillInput,
) -> Pin<Box<dyn Future<Output = Result<PrefillOutput>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn prefill<'life0, 'life1, 'async_trait>(
&'life0 self,
input: &'life1 PrefillInput,
) -> Pin<Box<dyn Future<Output = Result<PrefillOutput>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Source§fn truncate_kv<'life0, 'life1, 'async_trait>(
&'life0 self,
kv_cache: &'life1 Arc<dyn KvCacheHandle>,
new_len: usize,
) -> Pin<Box<dyn Future<Output = Result<()>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn truncate_kv<'life0, 'life1, 'async_trait>(
&'life0 self,
kv_cache: &'life1 Arc<dyn KvCacheHandle>,
new_len: usize,
) -> Pin<Box<dyn Future<Output = Result<()>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
new_len.
Used by speculative decoding on partial rejection so the next
iteration sees a KV prefix that matches the accepted token stream.
Default: Ok(()) — executors that don’t cache per-sequence state
(stub, mock) are inherently tolerant; real LLM executors override.Source§fn forward_verify<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [DecodeInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn forward_verify<'life0, 'life1, 'async_trait>(
&'life0 self,
inputs: &'life1 [DecodeInput],
) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
N+1 tokens,
producing one logits row per position. Used by speculative
decoding’s target path so we don’t pay N+1 sequential forwards. Read moreSource§fn decode<'life0, 'life1, 'async_trait>(
&'life0 self,
input: &'life1 DecodeInput,
) -> Pin<Box<dyn Future<Output = Result<DecodeOutput>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
fn decode<'life0, 'life1, 'async_trait>(
&'life0 self,
input: &'life1 DecodeInput,
) -> Pin<Box<dyn Future<Output = Result<DecodeOutput>> + Send + 'async_trait>>where
Self: 'async_trait,
'life0: 'async_trait,
'life1: 'async_trait,
Source§fn release_cache(&self, cache_id: &str)
fn release_cache(&self, cache_id: &str)
Source§fn capabilities(&self) -> ExecutorCapabilities
fn capabilities(&self) -> ExecutorCapabilities
Source§fn status(&self) -> ExecutorStatus
fn status(&self) -> ExecutorStatus
Source§fn cache_metrics_snapshot(&self) -> Option<Value>
fn cache_metrics_snapshot(&self) -> Option<Value>
Source§fn lora_metrics_snapshot(&self) -> Option<Value>
fn lora_metrics_snapshot(&self) -> Option<Value>
Source§fn forward<'life0, 'life1, 'async_trait>(
&'life0 self,
_input: &'life1 Arc<dyn TensorLike>,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn TensorLike>, FerrumError>> + Send + 'async_trait>>where
'life0: 'async_trait,
'life1: 'async_trait,
Self: 'async_trait,
fn forward<'life0, 'life1, 'async_trait>(
&'life0 self,
_input: &'life1 Arc<dyn TensorLike>,
) -> Pin<Box<dyn Future<Output = Result<Arc<dyn TensorLike>, FerrumError>> + Send + 'async_trait>>where
'life0: 'async_trait,
'life1: 'async_trait,
Self: 'async_trait,
Auto Trait Implementations§
impl !Freeze for LlmExecutor
impl !RefUnwindSafe for LlmExecutor
impl !UnwindSafe for LlmExecutor
impl Send for LlmExecutor
impl Sync for LlmExecutor
impl Unpin for LlmExecutor
impl UnsafeUnpin for LlmExecutor
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
impl<T> ErasedDestructor for Twhere
T: 'static,
Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read more