Struct InferenceEngine 

pub struct InferenceEngine { /* private fields */ }

The main inference engine.

Manages model loading, forward pass execution, and token generation. The full pipeline: load GGUF → parse metadata → build architecture → generate.

Implementations§

impl InferenceEngine

pub fn beam_generate( &mut self, prompt_tokens: &[u32], config: &BeamSearchConfig, eos_token_id: u32, ) -> RuntimeResult<Vec<BeamHypothesis>>

Generate using beam search decoding.

Wraps the engine in an EngineBeamAdapter and calls beam_generate.

Returns a list of BeamHypothesis values sorted by normalised score in descending order. Each hypothesis's tokens field includes the original prompt tokens.

§Errors

Returns RuntimeError::ModelNotLoaded if no model has been loaded.
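
The ordering of the returned hypotheses can be illustrated with a small, self-contained sketch. It does not use the crate's types: the Hypothesis struct, its fields, and the length-normalisation formula (log-probability divided by length raised to a penalty exponent) are assumptions made for illustration, and the real BeamHypothesis and BeamSearchConfig may differ.

```rust
/// Hypothetical stand-in for `BeamHypothesis`; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Hypothesis {
    /// Prompt tokens followed by generated tokens.
    tokens: Vec<u32>,
    /// Sum of token log-probabilities.
    log_prob: f32,
}

/// Length-normalised score: log_prob / len^length_penalty (assumed formula).
fn normalised_score(h: &Hypothesis, length_penalty: f32) -> f32 {
    let len = h.tokens.len().max(1) as f32;
    h.log_prob / len.powf(length_penalty)
}

/// Sort hypotheses by normalised score, best first.
fn sort_hypotheses(hyps: &mut [Hypothesis], length_penalty: f32) {
    hyps.sort_by(|a, b| {
        normalised_score(b, length_penalty).total_cmp(&normalised_score(a, length_penalty))
    });
}
```

With a penalty of 1.0 this ranks by average log-probability per token, so a longer sequence is not penalised merely for having more terms in its sum.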

impl InferenceEngine

pub fn new(config: EngineConfig) -> Self

Create a new inference engine with the given configuration.

pub fn layer_pager(&self) -> Option<&Arc<LayerPager>>

Return a reference to the active layer pager, if offloading is enabled.

This is the inspection / integration hook that arch-layer code (or higher-level callers) can use to acquire tensors on demand. When the pager is None, the engine is running in the default fully-in-RAM mode.

pub fn set_layer_pager(&mut self, pager: Arc<LayerPager>)

Attach a pre-built LayerPager to this engine.

This is the integration point for callers that construct their own pager (e.g. from a custom PagerSource) and want to inject it rather than relying on the engine to build one automatically from the GGUF file.

pub fn load_model_from_bytes( &mut self, model_bytes: &[u8], tokenizer_json: &str, ) -> RuntimeResult<()>

Load the model from an in-memory GGUF byte buffer.

This is the preferred entry point for environments that cannot access the filesystem, such as wasm32-unknown-unknown. The tokenizer must be provided separately as a JSON string because GGUF metadata rarely contains the full HuggingFace tokenizer.json.

The loading pipeline is identical to load_model except:

  • The GGUF data comes from the supplied model_bytes slice (copied into owned storage inside GgufModel::from_bytes).
  • The tokenizer is loaded from tokenizer_json rather than a file path.

Any context_size override from EngineConfig is still applied.

pub fn load_model(&mut self) -> RuntimeResult<()>

Load the model from the configured path.

This performs the full loading pipeline:

  1. Parse GGUF file (header, metadata, tensor info)
  2. Extract model configuration from metadata
  3. Build the architecture-specific forward pass
  4. Initialize KV cache
  5. Load tokenizer

pub fn generate( &mut self, prompt: &str, max_tokens: usize, callback: impl FnMut(&str), ) -> RuntimeResult<String>

Generate tokens from a prompt.

Runs the full generation pipeline:

  1. Tokenize the prompt
  2. Prefill: process all prompt tokens through the model
  3. Decode: autoregressive generation until EOS or max_tokens

The callback is invoked with each decoded token’s text as it’s generated.
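
The prefill and decode stages can be sketched with a toy model. This is an illustration only, not the engine's code: the ToyModel trait, the Successor model, greedy argmax sampling, and the decode closure are all hypothetical, and tokenization is skipped by passing token IDs directly.

```rust
/// Hypothetical minimal model interface; not the crate's forward-pass trait.
trait ToyModel {
    /// Feed one token, advance the cache, return logits over the vocabulary.
    fn forward(&mut self, token: u32) -> Vec<f32>;
}

/// Toy model over a 4-token vocabulary that always predicts `token + 1`.
struct Successor;

impl ToyModel for Successor {
    fn forward(&mut self, token: u32) -> Vec<f32> {
        let mut logits = vec![0.0; 4];
        logits[((token + 1) % 4) as usize] = 1.0;
        logits
    }
}

/// Index of the largest logit (greedy sampling).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate_greedy<M: ToyModel>(
    model: &mut M,
    prompt_tokens: &[u32],
    max_tokens: usize,
    eos: u32,
    decode: impl Fn(u32) -> String,
    mut callback: impl FnMut(&str),
) -> String {
    let mut out = String::new();
    // Prefill: run every prompt token and keep only the last logits.
    let mut logits = Vec::new();
    for &t in prompt_tokens {
        logits = model.forward(t);
    }
    // Decode: pick a token, emit its text, feed it back in.
    for _ in 0..max_tokens {
        if logits.is_empty() {
            break; // empty prompt: nothing to sample from
        }
        let next = argmax(&logits);
        if next == eos {
            break;
        }
        let piece = decode(next);
        callback(&piece);
        out.push_str(&piece);
        logits = model.forward(next);
    }
    out
}
```

The callback sees each piece before it is appended to the returned string, which is what makes streaming output possible while the full text is still being assembled.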

pub fn generate_with_config( &mut self, prompt: &str, max_tokens: usize, sampler_config: SamplerConfig, callback: impl FnMut(&str), ) -> RuntimeResult<String>

Generate tokens using an explicit sampler config instead of the engine default.

This is the preferred entry point for per-request sampler customization (e.g., grammar-constrained sampling from the API server).

pub fn vocab_bytes(&self) -> Option<Vec<(u32, Vec<u8>)>>

Build the vocabulary byte table, used for grammar-constrained sampling.

Returns None if no tokenizer is loaded.
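
One way a table of this shape could feed constrained sampling is sketched below. The mask_logits function and the byte-level predicate are hypothetical; the crate's grammar sampler is not shown in this page and may work differently.

```rust
/// Set the logit of every token whose bytes fail `allowed` to negative
/// infinity, so that it can never be sampled.
/// `vocab` has the same shape as the table returned by `vocab_bytes`.
fn mask_logits(
    logits: &mut [f32],
    vocab: &[(u32, Vec<u8>)],
    allowed: impl Fn(&[u8]) -> bool,
) {
    for (id, bytes) in vocab {
        // Ignore IDs that fall outside the logits vector.
        if let Some(logit) = logits.get_mut(*id as usize) {
            if !allowed(bytes.as_slice()) {
                *logit = f32::NEG_INFINITY;
            }
        }
    }
}
```

For example, passing a predicate that accepts only ASCII digits leaves numeric tokens untouched and rules out everything else.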

pub fn apply_lora_adapters(&mut self, lora: &LoadedLora) -> RuntimeResult<()>

Apply a loaded LoRA adapter to the model’s linear layers.

Delegates to the architecture-specific ForwardPass::apply_lora implementation, which walks the model’s layers and attaches LoraAdapter instances to each matching QuantLinear field.

§Errors

Returns RuntimeError::ModelNotLoaded if no model has been loaded.

pub fn push_lora(&mut self, lora: Arc<LoadedLora>, scale: f32)

Push a LoRA adapter onto the stack with a per-entry scale multiplier.

The adapter is applied additively during inference:

output += scale · (alpha/rank) · B @ A @ input
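
The update rule can be written out on plain slices. This is a minimal sketch assuming row-major dense f32 matrices, with A of shape rank × in_dim and B of shape out_dim × rank; the function names are hypothetical and the crate's QuantLinear layers operate on quantized weights instead.

```rust
/// Multiply a row-major `rows x cols` matrix by a vector of length `cols`.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| m[r * cols + c] * x[c]).sum::<f32>())
        .collect()
}

/// Add the LoRA delta to `output` in place:
/// output += scale * (alpha / rank) * B @ (A @ input)
fn add_lora_delta(
    output: &mut [f32],
    input: &[f32],
    a: &[f32],
    b: &[f32],
    rank: usize,
    alpha: f32,
    scale: f32,
) {
    let down = matvec(a, rank, input.len(), input); // A @ input
    let up = matvec(b, output.len(), rank, &down); // B @ (A @ input)
    let factor = scale * (alpha / rank as f32);
    for (o, u) in output.iter_mut().zip(up) {
        *o += factor * u;
    }
}
```

Computing A @ input first keeps the intermediate vector at length rank, which is the reason low-rank adapters are cheap to apply at inference time.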

pub fn pop_lora(&mut self) -> Option<(Arc<LoadedLora>, f32)>

Remove the last adapter pushed onto the stack.

Returns None if the stack is empty.

pub fn clear_loras(&mut self)

Remove all LoRA adapters from the stack.

pub fn lora_stack(&self) -> &LoraStack

Inspect the current LoRA stack.

pub fn apply_lora_stack(&mut self) -> RuntimeResult<()>

Apply the stacked LoRA adapters to the loaded model’s linear layers.

This is a hot-swap operation: it can be called at any time without reloading the model. If the stack is empty this is a no-op.

Returns RuntimeError::ModelNotLoaded if no model has been loaded.

pub fn unapply_all_loras(&mut self)

Remove all LoRA adapters from the loaded model’s linear layers.

Clears the lora_stack and calls unapply_all_loras on the forward pass so every QuantLinear.lora field is set back to None.

This is the necessary counterpart to apply_lora_stack for per-request LoRA hot-swap: push adapters, apply, generate, then unapply.

Does nothing when no model is loaded.
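
The push, apply, generate, unapply sequence can be sketched with a toy engine that only records which adapters are active. Adapter, ToyEngine, and serve_request are hypothetical stand-ins; the method names mirror the ones documented here, but their bodies are invented for illustration.

```rust
use std::sync::Arc;

/// Hypothetical adapter handle; stands in for `LoadedLora`.
struct Adapter {
    name: String,
}

/// Toy engine tracking only which adapters are stacked and applied.
#[derive(Default)]
struct ToyEngine {
    stack: Vec<(Arc<Adapter>, f32)>,
    applied: Vec<String>,
}

impl ToyEngine {
    fn push_lora(&mut self, lora: Arc<Adapter>, scale: f32) {
        self.stack.push((lora, scale));
    }

    fn apply_lora_stack(&mut self) {
        self.applied = self.stack.iter().map(|(l, _)| l.name.clone()).collect();
    }

    fn unapply_all_loras(&mut self) {
        self.stack.clear();
        self.applied.clear();
    }
}

/// Serve one request with its own adapters, then restore the base model.
fn serve_request(engine: &mut ToyEngine, adapters: &[(Arc<Adapter>, f32)]) -> Vec<String> {
    for (lora, scale) in adapters {
        engine.push_lora(Arc::clone(lora), *scale);
    }
    engine.apply_lora_stack();
    let active = engine.applied.clone(); // generation would run here
    engine.unapply_all_loras();
    active
}
```

Because adapters are shared through Arc, the same loaded adapter can be pushed for many requests without copying its weights.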

pub fn prime_with_prefix( &mut self, cached: &CachedKvState, restore_to: usize, suffix_tokens: &[u32], ) -> RuntimeResult<Vec<f32>>

Restore the KV cache from a cached prefix snapshot and run prefill for the suffix tokens that follow the cached prefix.

This is the prefix-KV-cache fast path: instead of re-prefilling the entire prompt from scratch, the engine restores the KV state for the longest matching cached prefix and only runs the forward pass for the remaining suffix tokens.

Restriction: the cached prefix must start at position 0 (i.e. it was stored from the beginning of a sequence). This matches how PrefixKvCache::store snapshots KV state.

§Errors

Returns RuntimeError::ModelNotLoaded if no model is loaded.
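
The prefix and suffix split that a caller performs before using this fast path can be sketched as follows. The helper names are hypothetical and the cache is modelled as a plain list of token sequences; how PrefixKvCache actually looks up entries is not described on this page.

```rust
/// Length of the longest common prefix of two token sequences.
fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Pick the longest cached sequence that `prompt` starts with.
/// Returns the index of that entry and the suffix that still needs prefill.
fn split_at_cached_prefix<'a>(
    prompt: &'a [u32],
    cached: &[Vec<u32>],
) -> Option<(usize, &'a [u32])> {
    cached
        .iter()
        .enumerate()
        // An entry is usable only if the prompt begins with all of it.
        .filter(|(_, c)| !c.is_empty() && common_prefix_len(prompt, c.as_slice()) == c.len())
        .max_by_key(|(_, c)| c.len())
        .map(|(i, c)| (i, &prompt[c.len()..]))
}
```

In terms of the signature above, the matched entry's length would play the role of restore_to and the returned slice the role of suffix_tokens.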

pub fn generate_with_logits( &mut self, prompt_tokens: &[u32], initial_logits: Vec<f32>, max_tokens: usize, sampler_config: SamplerConfig, callback: impl FnMut(&str), ) -> RuntimeResult<String>

Run the autoregressive decode loop starting from pre-computed logits.

Unlike Self::generate_with_config, this does not run prefill — the caller must have already primed the KV cache (via Self::prime_with_prefix or a full prefill) and obtained the initial logits from that step.

§Errors

Returns RuntimeError::ModelNotLoaded if no model is loaded.

pub fn is_loaded(&self) -> bool

Returns whether a model is currently loaded.

pub fn config(&self) -> &EngineConfig

Returns the engine configuration.

pub fn model_config(&self) -> Option<&ModelConfig>

Returns the model configuration, if loaded.

pub fn store_kv_in_prefix_cache( &mut self, tokens: &[u32], prefix_cache: &mut PrefixKvCache, )

Store the current KV cache state into a PrefixKvCache under tokens.

This is the public integration point for server-side prefix caching: after a successful generation pass the worker calls this to persist the KV state so future requests sharing the same prefix can skip prefill.

If no model is loaded (KV cache absent) the call is a silent no-op.

pub fn reset(&mut self)

Reset the KV cache (for starting a new conversation).

pub fn tokenize(&self, text: &str) -> RuntimeResult<Vec<u32>>

Tokenize text and return token IDs.

Requires that a model (and thus a tokenizer) has been loaded.

pub fn prefill(&mut self, tokens: &[u32]) -> RuntimeResult<()>

Prefill the KV cache with the given token sequence without returning logits.

Processes all tokens in order, updating the KV cache at each position. The last token’s logits are discarded; callers typically follow up with forward_one to begin autoregressive generation.

pub fn forward_prefill( &mut self, tokens: &[u32], pos_start: usize, ) -> RuntimeResult<Vec<f32>>

Run a batched prefill forward pass for the given chunk of tokens.

This is the per-chunk entry point for the chunked-prefill scheduler fairness path (A3). It differs from prefill in two ways:

  1. It accepts a multi-token slice and dispatches a single batched forward call, matching the generate path’s chunked prefill logic.
  2. It returns the logits of the last token in the chunk so that the caller can immediately begin decode sampling if pos_end (pos_start plus the chunk length) equals the full prompt length.

pos_start is the KV-cache position at which this chunk begins. It must equal the current kv_cache.seq_len() on entry; the parameter is provided explicitly so that callers (e.g. the scheduler) can assert the invariant in debug builds.

§Errors

Returns RuntimeError::ModelNotLoaded if no model is loaded, or any arch-level error from the forward pass.
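
The position invariant and the chunk loop a scheduler might run can be sketched with a toy cache. ToyKv, the String error type, and the placeholder logits are assumptions; the local forward_prefill function only imitates the documented contract and is not the engine's implementation.

```rust
/// Toy KV cache that only tracks how many positions are filled.
struct ToyKv {
    seq_len: usize,
}

/// Hypothetical stand-in for `forward_prefill`: checks the position
/// invariant, advances the cache, and returns placeholder logits.
fn forward_prefill(kv: &mut ToyKv, tokens: &[u32], pos_start: usize) -> Result<Vec<f32>, String> {
    if pos_start != kv.seq_len {
        return Err(format!("pos_start {} != seq_len {}", pos_start, kv.seq_len));
    }
    kv.seq_len += tokens.len();
    // A real model would return a vocabulary-sized vector here.
    Ok(vec![tokens.last().copied().unwrap_or(0) as f32])
}

/// Feed `prompt` in chunks of at most `chunk_size` tokens and return the
/// logits produced by the final chunk.
fn chunked_prefill(kv: &mut ToyKv, prompt: &[u32], chunk_size: usize) -> Result<Vec<f32>, String> {
    let mut logits = Vec::new();
    let mut pos_start = kv.seq_len;
    for chunk in prompt.chunks(chunk_size.max(1)) {
        logits = forward_prefill(kv, chunk, pos_start)?;
        pos_start += chunk.len(); // this chunk's end is the next chunk's start
    }
    Ok(logits)
}
```

Between chunks a scheduler is free to run decode steps for other requests, which is the fairness benefit of splitting a long prompt this way.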

pub fn forward_decode( &mut self, token: u32, pos: usize, ) -> RuntimeResult<Vec<f32>>

Run a single autoregressive decode step for token and return logits.

This is the per-step entry point for the chunked-prefill scheduler fairness path (A3). It is semantically equivalent to forward_one but named differently to make the prefill/decode distinction explicit in call sites inside the engine and scheduler integration layer.

pos is the current sequence position (= kv_cache.seq_len()). It is accepted as a parameter so that callers can assert the invariant.

§Errors

Returns RuntimeError::ModelNotLoaded if no model is loaded.

pub fn forward_one(&mut self, token: u32) -> RuntimeResult<Vec<f32>>

Run a single forward pass for token and return raw logits.

The KV cache is updated (one position advanced).

pub fn is_eos(&self, token: u32) -> bool

Returns true if token is the EOS token for this model.

pub fn decode_token(&self, token: u32) -> RuntimeResult<String>

Decode a single token ID to its string representation.

pub fn metrics(&self) -> Arc<EngineMetrics>

Returns a shared reference to the engine’s live metrics counters.

pub fn metrics_snapshot(&self) -> MetricsSnapshot

Returns a point-in-time MetricsSnapshot of the engine’s counters.

pub fn kv_snapshot(&self) -> Option<KvCacheSnapshot>

Capture a KvCacheSnapshot from the current KV cache state.

Returns None if no model (and thus no KV cache) is loaded.

pub fn kv_restore(&mut self, snapshot: &KvCacheSnapshot) -> RuntimeResult<()>

Restore the KV cache state from a previously captured KvCacheSnapshot.

Returns RuntimeError::ModelNotLoaded if no model is loaded.

pub fn truncate(&mut self, n: usize) -> RuntimeResult<()>

Truncate the KV cache to n tokens.

After this call the engine behaves as if only n tokens have been processed. This is the low-level primitive used by speculative decoding on divergence rollback.

§Errors

Returns RuntimeError::ModelNotLoaded if no model is loaded.
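
For the speculative-decoding use case, the value of n is typically derived from how many draft tokens the target model accepted. The sketch below shows that arithmetic under a simple exact-match acceptance rule; the function names are hypothetical, and the crate's speculative decoder may use a different acceptance criterion.

```rust
/// Number of leading draft tokens that the target model agrees with.
fn accepted_len(draft: &[u32], target: &[u32]) -> usize {
    draft.iter().zip(target).take_while(|(d, t)| d == t).count()
}

/// KV length to truncate to after verification.
/// `base_len` is the cache length before the draft tokens were appended.
fn rollback_len(base_len: usize, draft: &[u32], target: &[u32]) -> usize {
    base_len + accepted_len(draft, target)
}
```

A caller would pass the result to truncate so that the cache holds only the accepted tokens before decoding continues from the point of divergence.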

pub fn kv_cache_seq_len(&self) -> usize

Return the current KV cache sequence length.

Returns 0 if no model is loaded.

pub fn hidden_size(&self) -> Option<usize>

Returns the model’s hidden state dimension, if a model is loaded.

pub fn embed(&mut self, text: &str) -> RuntimeResult<Vec<f32>>

Compute a semantic embedding vector for the given text using PoolingMode::Last.

This is a convenience wrapper around Self::embed_with. Runs tokenization → full transformer layers → final RMSNorm, then L2-normalises the resulting hidden_size-dimensional vector. The KV cache is reset before the pass so that embeddings for different inputs are independent of each other.

Returns RuntimeError::ModelNotLoaded if no model has been loaded.

pub fn embed_with( &mut self, text: &str, mode: PoolingMode, ) -> RuntimeResult<Vec<f32>>

Compute a semantic embedding vector for the given text using the specified pooling strategy.

Runs tokenization → full transformer layers → final RMSNorm → pooling, then L2-normalises the resulting hidden_size-dimensional vector. The KV cache is reset before the pass so that embeddings for different inputs are independent of each other.

§Pooling modes

Returns RuntimeError::ModelNotLoaded if no model has been loaded.
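
The pooling and L2-normalisation steps can be sketched on plain vectors. The Pooling enum here is hypothetical: only PoolingMode::Last is named on this page, and the Mean variant is an assumption added to show how a second strategy would differ.

```rust
/// Hypothetical pooling choices; the crate's `PoolingMode` may offer others.
enum Pooling {
    Last,
    Mean,
}

/// Pool per-token hidden states (each of length `hidden_size`) into one vector.
/// Returns `None` when there are no tokens.
fn pool(hidden: &[Vec<f32>], mode: Pooling) -> Option<Vec<f32>> {
    let last = hidden.last()?;
    match mode {
        Pooling::Last => Some(last.clone()),
        Pooling::Mean => {
            let mut acc = vec![0.0f32; last.len()];
            for row in hidden {
                for (a, v) in acc.iter_mut().zip(row) {
                    *a += *v;
                }
            }
            let n = hidden.len() as f32;
            acc.iter_mut().for_each(|a| *a /= n);
            Some(acc)
        }
    }
}

/// Scale `v` to unit Euclidean length; a zero vector is left unchanged.
fn l2_normalise(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```

After normalisation the dot product of two embeddings equals their cosine similarity, which is why the final step matters for retrieval-style comparisons.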

pub fn embed_batch(&mut self, texts: &[String]) -> RuntimeResult<Vec<Vec<f32>>>

Extract embedding vectors for multiple input texts using PoolingMode::Last.

This is a convenience wrapper around Self::embed_batch_with. Each text is processed independently with a fresh KV cache.

pub fn embed_batch_with( &mut self, texts: &[&str], mode: PoolingMode, ) -> RuntimeResult<Vec<Vec<f32>>>

Extract embedding vectors for multiple input texts using the specified pooling strategy.

Each text is processed independently with a fresh KV cache. The output order matches the input order.

Returns RuntimeError::ModelNotLoaded if no model has been loaded.

impl InferenceEngine

pub fn snapshot(&self) -> RuntimeResult<Vec<u8>>

Capture the full engine state as a portable byte blob.

The returned bytes can be stored on disk, sent over the network, or embedded in a database. Pass them to InferenceEngine::resume to resume inference from the same position.

§Limitations
  • Grammar state: only the grammar source string is stored. On resume the grammar state is reset to its initial state — any partial progress through a grammar constraint is lost.
  • Sampler state: the engine creates a new Sampler for each generate() call. The snapshot captures the config values rather than live RNG state from an in-flight generation.

Returns RuntimeError::ModelNotLoaded if no model has been loaded.

pub fn resume(bytes: &[u8], model_path: &Path) -> RuntimeResult<Self>

Resume an inference session from a previously captured snapshot.

  1. Deserializes the snapshot bytes.
  2. Validates the model fingerprint against model_path on disk.
  3. Loads the model from model_path.
  4. Restores the KV cache state.
  5. Restores the sampler config.
  6. If a grammar source was saved, re-parses it (grammar state is reset to initial).
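
The fingerprint check in step 2 can be illustrated with a toy snapshot. Everything here is hypothetical: this page does not say how the fingerprint is computed, so the sketch uses the byte length plus a std hash purely for illustration, and ToySnapshot stands in for whatever the serialized blob actually contains.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: byte length plus a hash of the model bytes.
/// `DefaultHasher` output is not guaranteed stable across Rust releases,
/// so a persistent format would need a fixed hash function instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Fingerprint {
    len: u64,
    hash: u64,
}

fn fingerprint(model_bytes: &[u8]) -> Fingerprint {
    let mut hasher = DefaultHasher::new();
    model_bytes.hash(&mut hasher);
    Fingerprint {
        len: model_bytes.len() as u64,
        hash: hasher.finish(),
    }
}

/// Hypothetical snapshot fields relevant to validation and restore.
struct ToySnapshot {
    model: Fingerprint,
    kv_len: usize,
}

/// Refuse to resume when the snapshot was taken against a different model.
/// On success, returns the KV length that would be restored.
fn validate(snapshot: &ToySnapshot, model_bytes: &[u8]) -> Result<usize, String> {
    if snapshot.model != fingerprint(model_bytes) {
        return Err("model fingerprint mismatch".to_string());
    }
    Ok(snapshot.kv_len)
}
```

Validating before loading avoids restoring a KV cache whose contents were computed by different weights, which would silently produce wrong continuations.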
§Errors

Trait Implementations§

impl Rewindable for InferenceEngine

Rewindable implementation for InferenceEngine.

Delegates to the engine’s internal KV cache truncate method. If the engine has no loaded model (and thus no KV cache) the method returns RuntimeError::ModelNotLoaded wrapped in RewindError::Runtime.

fn rewind(&mut self, n: usize) -> Result<(), RewindError>

Truncate the model state so that the next token generated is at position n (0-indexed).

fn current_length(&self) -> usize

Return the current sequence length (= number of tokens in the KV cache or SSM state).
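
The contract of the two methods can be sketched with a local trait over a plain token buffer. ToyRewindable, ToyRewindError, and TokenBuffer are hypothetical; in particular, rejecting a rewind past the current length is an assumption, since this page does not say how the real implementation treats that case.

```rust
/// Hypothetical error type; the crate's `RewindError` has other variants.
#[derive(Debug, PartialEq)]
enum ToyRewindError {
    OutOfRange { requested: usize, current: usize },
}

/// Local trait mirroring the two method signatures documented above.
trait ToyRewindable {
    fn rewind(&mut self, n: usize) -> Result<(), ToyRewindError>;
    fn current_length(&self) -> usize;
}

/// Stand-in for model state: one entry per processed token.
struct TokenBuffer {
    tokens: Vec<u32>,
}

impl ToyRewindable for TokenBuffer {
    fn rewind(&mut self, n: usize) -> Result<(), ToyRewindError> {
        let current = self.tokens.len();
        if n > current {
            // Assumed behaviour: cannot rewind forward.
            return Err(ToyRewindError::OutOfRange { requested: n, current });
        }
        self.tokens.truncate(n);
        Ok(())
    }

    fn current_length(&self) -> usize {
        self.tokens.len()
    }
}
```

After a successful rewind(n), current_length returns n, so the next token processed lands at position n.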
