Struct LlmExecutor

Source

pub struct LlmExecutor { /* private fields */ }

Implementations§

Source §

impl LlmExecutor

Source

pub fn new(model: Box<dyn DecoderOnlyLLM>, info: ModelInfo) -> Self

Source

pub fn truncate_kv_for_cache_id(&self, cache_id: &str, new_len: usize)

Roll the KV cache for cache_id back to new_len positions. Used by speculative decoding on partial rejection. The caller must supply a GenericKvCacheHandle whose seq_len is also updated.

Trait Implementations§

Source §

impl ModelExecutor for LlmExecutor

Source §

fn batch_prefill<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [PrefillInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<PrefillOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Batched prefill: combine all prompts into ONE model.unified_forward call so launch / kernel-overhead is amortized across the cohort.

Falls back to the trait default (serial per-item) when the model returns Err(unsupported) from unified_forward — e.g. Qwen3MoeModel today, until Phase 2 adds its native unified path.

Source §

fn batch_decode<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Override default fallback to acquire the model lock ONCE for the whole batch, avoiding N round-trips through parking_lot. Does not yet do true attention batching (each cache has its own kv_len), but removes mutex churn that was serialising concurrent requests at async level.

Source §

fn unified_decode<'life0, 'life1, 'async_trait>( &'life0 self, batch: &'life1 UnifiedBatch, ) -> Pin<Box<dyn Future<Output = Result<Vec<Option<Vec<f32>>>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Unified mixed-batch dispatch (chunked-prefill API).

This impl is a behavior-preserving FALLBACK over the existing trait methods on DecoderOnlyLLM: prefill items go through model.prefill(seq_id, &q_tokens) (one at a time, mirroring the engine’s current sequential prefill loop), decode items (q_len == 1 && is_final_chunk) are grouped into a single model.decode_batch(...) call. Net behavior is identical to the engine’s pre-Phase-13 path; this just changes WHO orchestrates the prefill/decode split (caller → unified_decode) so the engine can converge on a single call.

The real performance unlock comes in Step 5 when models override this with a true unified-forward (one [M_total, hidden] forward

varlen attention) — at that point the kernel-level mix replaces the host-side serial dispatch here.

Source §

fn info(&self) -> &ModelInfo

Get model information and metadata

Source §

fn kv_capacity(&self) -> Option<usize>

Per-request KV capacity in tokens when the executor owns a smaller runtime cache window than the model’s declared context length.

Source §

fn prefill<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 PrefillInput, ) -> Pin<Box<dyn Future<Output = Result<PrefillOutput>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Execute prefill phase (process initial prompt)

Source §

fn truncate_kv<'life0, 'life1, 'async_trait>( &'life0 self, kv_cache: &'life1 Arc<dyn KvCacheHandle>, new_len: usize, ) -> Pin<Box<dyn Future<Output = Result<()>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Roll the KV cache for this executor’s sequence back to new_len. Used by speculative decoding on partial rejection so the next iteration sees a KV prefix that matches the accepted token stream. Default: Ok(()) — executors that don’t cache per-sequence state (stub, mock) are inherently tolerant; real LLM executors override.

Source §

fn forward_verify<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Multi-position decode-verify: one forward over N+1 tokens, producing one logits row per position. Used by speculative decoding’s target path so we don’t pay N+1 sequential forwards. Read more

Source §

fn decode<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 DecodeInput, ) -> Pin<Box<dyn Future<Output = Result<DecodeOutput>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

Execute decode phase (generate next token)

Source §

fn release_cache(&self, cache_id: &str)

Release KV cache and state for a completed sequence. Read more

Source §

fn capabilities(&self) -> ExecutorCapabilities

Get executor capabilities

Source §

fn status(&self) -> ExecutorStatus

Get current executor status

Source §

fn cache_metrics_snapshot(&self) -> Option<Value>

Optional model/executor cache metrics. Read more

Source §

fn lora_metrics_snapshot(&self) -> Option<Value>

Optional LoRA runtime metrics.

Source §

fn forward<'life0, 'life1, 'async_trait>( &'life0 self, _input: &'life1 Arc<dyn TensorLike>, ) -> Pin<Box<dyn Future<Output = Result<Arc<dyn TensorLike>, FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, 'life1: 'async_trait, Self: 'async_trait,

Optional: full forward pass (for non-autoregressive use cases)

Source §

fn warmup<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, Self: 'async_trait,

Warm up executor (load model, allocate memory, etc.)

Source §

fn shutdown<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, Self: 'async_trait,

Shutdown executor gracefully

Auto Trait Implementations§

§

impl UnsafeUnpin for LlmExecutor

Blanket Implementations§

Source §

impl<T> Any for T
where T: 'static + ?Sized,

Source §

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more

Source §

impl<T> Borrow<T> for T
where T: ?Sized,

Source §

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more

Source §

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source §

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more

Source §

impl<T> ErasedDestructor for T
where T: 'static,

Source §

impl<T> From<T> for T

Source §

fn from(t: T) -> T

Returns the argument unchanged.

Source §

impl<T> Instrument for T

Source §

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more

Source §

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more

Source §

impl<T, U> Into for T
where U: From<T>,

Source §

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source §

impl<T> IntoEither for T

Source §

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more

Source §

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more

Source §

impl<F, T> IntoSample<T> for F
where T: FromSample<F>,

Source §

fn into_sample(self) -> T

Source §

impl<T> Pointable for T

Source §

const ALIGN: usize

The alignment of pointer.

Source §

type Init = T

The type for initializers.

Source §

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more

Source §

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more

Source §

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more

Source §

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more

Source §

impl<T> PolicyExt for T
where T: ?Sized,

Source §

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more

Source §

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more

Source §

impl<T> Same for T

Source §

type Output = T

Should always be Self

Source §

impl<T, U> TryFrom for T
where U: Into<T>,

Source §

type Error = Infallible

The type returned in the event of a conversion error.

Source §

fn try_from(value: U) -> Result<T, <T as TryFrom>::Error>

Performs the conversion.

Source §

impl<T, U> TryInto for T
where U: TryFrom<T>,

Source §

type Error = >::Error

The type returned in the event of a conversion error.

Source §

fn try_into(self) -> Result<U, >::Error>

Performs the conversion.

Source §

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source §

fn vzip(self) -> V

Source §

impl<T> WithSubscriber for T

Source §

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more

Source §

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more

Struct LlmExecutor Copy item path

Implementations§

impl LlmExecutor

pub fn new(model: Box<dyn DecoderOnlyLLM>, info: ModelInfo) -> Self

pub fn truncate_kv_for_cache_id(&self, cache_id: &str, new_len: usize)

Trait Implementations§

impl ModelExecutor for LlmExecutor

fn batch_prefill<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [PrefillInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<PrefillOutput>>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn batch_decode<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn unified_decode<'life0, 'life1, 'async_trait>( &'life0 self, batch: &'life1 UnifiedBatch, ) -> Pin<Box<dyn Future<Output = Result<Vec<Option<Vec<f32>>>>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn info(&self) -> &ModelInfo

fn kv_capacity(&self) -> Option<usize>

fn prefill<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 PrefillInput, ) -> Pin<Box<dyn Future<Output = Result<PrefillOutput>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn truncate_kv<'life0, 'life1, 'async_trait>( &'life0 self, kv_cache: &'life1 Arc<dyn KvCacheHandle>, new_len: usize, ) -> Pin<Box<dyn Future<Output = Result<()>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn forward_verify<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn decode<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 DecodeInput, ) -> Pin<Box<dyn Future<Output = Result<DecodeOutput>> + Send + 'async_trait>>where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn release_cache(&self, cache_id: &str)

fn capabilities(&self) -> ExecutorCapabilities

fn status(&self) -> ExecutorStatus

fn cache_metrics_snapshot(&self) -> Option<Value>

fn lora_metrics_snapshot(&self) -> Option<Value>

fn forward<'life0, 'life1, 'async_trait>( &'life0 self, _input: &'life1 Arc<dyn TensorLike>, ) -> Pin<Box<dyn Future<Output = Result<Arc<dyn TensorLike>, FerrumError>> + Send + 'async_trait>>where 'life0: 'async_trait, 'life1: 'async_trait, Self: 'async_trait,

fn warmup<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>where 'life0: 'async_trait, Self: 'async_trait,

fn shutdown<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>where 'life0: 'async_trait, Self: 'async_trait,

Auto Trait Implementations§

impl !Freeze for LlmExecutor

impl !RefUnwindSafe for LlmExecutor

impl !UnwindSafe for LlmExecutor

impl Send for LlmExecutor

impl Sync for LlmExecutor

impl Unpin for LlmExecutor

impl UnsafeUnpin for LlmExecutor

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> ErasedDestructor for Twhere T: 'static,

impl<T> From<T> for T

fn from(t: T) -> T

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

fn in_current_span(self) -> Instrumented<Self>

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>where F: FnOnce(&Self) -> bool,

impl<F, T> IntoSample<T> for Fwhere T: FromSample<F>,

fn into_sample(self) -> T

impl<T> Pointable for T

const ALIGN: usize

type Init = T

unsafe fn init(init: <T as Pointable>::Init) -> usize

unsafe fn deref<'a>(ptr: usize) -> &'a T

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

unsafe fn drop(ptr: usize)

impl<T> PolicyExt for Twhere T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>where T: Sized + Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P>where T: Sized + Policy<B, E>, P: Policy<B, E>,

impl<T> Same for T

type Output = T

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<V, T> VZip<V> for Twhere V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>where S: Into<Dispatch>,

fn with_current_subscriber(self) -> WithDispatch<Self>

Struct LlmExecutor

fn batch_prefill<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [PrefillInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<PrefillOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn batch_decode<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn unified_decode<'life0, 'life1, 'async_trait>( &'life0 self, batch: &'life1 UnifiedBatch, ) -> Pin<Box<dyn Future<Output = Result<Vec<Option<Vec<f32>>>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn prefill<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 PrefillInput, ) -> Pin<Box<dyn Future<Output = Result<PrefillOutput>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn truncate_kv<'life0, 'life1, 'async_trait>( &'life0 self, kv_cache: &'life1 Arc<dyn KvCacheHandle>, new_len: usize, ) -> Pin<Box<dyn Future<Output = Result<()>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn forward_verify<'life0, 'life1, 'async_trait>( &'life0 self, inputs: &'life1 [DecodeInput], ) -> Pin<Box<dyn Future<Output = Result<Vec<DecodeOutput>>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn decode<'life0, 'life1, 'async_trait>( &'life0 self, input: &'life1 DecodeInput, ) -> Pin<Box<dyn Future<Output = Result<DecodeOutput>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait, 'life1: 'async_trait,

fn forward<'life0, 'life1, 'async_trait>( &'life0 self, _input: &'life1 Arc<dyn TensorLike>, ) -> Pin<Box<dyn Future<Output = Result<Arc<dyn TensorLike>, FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, 'life1: 'async_trait, Self: 'async_trait,

fn warmup<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, Self: 'async_trait,

fn shutdown<'life0, 'async_trait>( &'life0 mut self, ) -> Pin<Box<dyn Future<Output = Result<(), FerrumError>> + Send + 'async_trait>>
where 'life0: 'async_trait, Self: 'async_trait,

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T> ErasedDestructor for T
where T: 'static,

impl<T, U> Into<U> for T
where U: From<T>,

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<F, T> IntoSample<T> for F
where T: FromSample<F>,

impl<T> PolicyExt for T
where T: ?Sized,

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Sized + Policy<B, E>, P: Policy<B, E>,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,