pub struct InferenceEngine { /* private fields */ }
The main inference engine.
Manages model loading, forward pass execution, and token generation. The full pipeline: load GGUF → parse metadata → build architecture → generate.
Implementations§
impl InferenceEngine
pub fn beam_generate(
    &mut self,
    prompt_tokens: &[u32],
    config: &BeamSearchConfig,
    eos_token_id: u32,
) -> RuntimeResult<Vec<BeamHypothesis>>
Generate using beam search decoding.
Wraps the engine in an EngineBeamAdapter and calls beam_generate.
Returns a list of BeamHypothesis sorted by normalised score
descending. The hypotheses include the original prompt tokens in
tokens.
§Errors
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
impl InferenceEngine
pub fn new(config: EngineConfig) -> Self
Create a new inference engine with the given configuration.
pub fn layer_pager(&self) -> Option<&Arc<LayerPager>>
Return a reference to the active layer pager, if offloading is enabled.
This is the inspection / integration hook that arch-layer code (or
higher-level callers) can use to acquire tensors on demand. When
the pager is None, the engine is running in the default fully-in-RAM
mode.
pub fn set_layer_pager(&mut self, pager: Arc<LayerPager>)
Attach a pre-built LayerPager to this engine.
This is the integration point for callers that construct their own
pager (e.g. from a custom PagerSource)
and want to inject it rather than relying on the engine to build one
automatically from the GGUF file.
pub fn load_model_from_bytes(
    &mut self,
    model_bytes: &[u8],
    tokenizer_json: &str,
) -> RuntimeResult<()>
Load the model from an in-memory GGUF byte buffer.
This is the preferred entry point for environments that cannot access
the filesystem, such as wasm32-unknown-unknown. The tokenizer must be
provided separately as a JSON string because GGUF metadata rarely
contains the full HuggingFace tokenizer.json.
The loading pipeline is identical to load_model except:
- The GGUF data comes from the supplied model_bytes slice (copied into owned storage inside GgufModel::from_bytes).
- The tokenizer is loaded from tokenizer_json rather than a file path.
Any context_size override from EngineConfig is still applied.
pub fn load_model(&mut self) -> RuntimeResult<()>
Load the model from the configured path.
This performs the full loading pipeline:
- Parse GGUF file (header, metadata, tensor info)
- Extract model configuration from metadata
- Build the architecture-specific forward pass
- Initialize KV cache
- Load tokenizer
pub fn generate(
    &mut self,
    prompt: &str,
    max_tokens: usize,
    callback: impl FnMut(&str),
) -> RuntimeResult<String>
Generate tokens from a prompt.
Runs the full generation pipeline:
- Tokenize the prompt
- Prefill: process all prompt tokens through the model
- Decode: autoregressive generation until EOS or max_tokens
The callback is invoked with each decoded token’s text as it’s generated.
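The prefill-then-decode flow described above can be sketched without the engine itself. Everything in this block is a hypothetical stand-in: `toy_logits` replaces the real forward pass, `argmax` replaces the configured sampler, and the callback receives token IDs rather than decoded text. It only illustrates the loop shape (stop at EOS or after `max_tokens`), not the engine's actual implementation.

```rust
/// Hypothetical stand-in for a model forward pass: returns logits that always
/// favour `(last_token + 1) % vocab`. A real engine would consult its KV cache.
/// Assumes `vocab > 0`.
fn toy_logits(context: &[u32], vocab: usize) -> Vec<f32> {
    let mut logits = vec![0.0_f32; vocab];
    let next = context.last().map_or(0, |&t| (t as usize + 1) % vocab);
    logits[next] = 1.0;
    logits
}

/// Greedy sampling: index of the largest logit.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Prefill on the prompt, then decode until EOS or `max_tokens` is reached.
/// `callback` is invoked once per generated token, in order.
fn generate_sketch(
    prompt: &[u32],
    max_tokens: usize,
    eos: u32,
    vocab: usize,
    mut callback: impl FnMut(u32),
) -> Vec<u32> {
    // Prefill: the context vector plays the role of the KV cache here.
    let mut context = prompt.to_vec();
    let mut logits = toy_logits(&context, vocab);

    let mut generated = Vec::new();
    for _ in 0..max_tokens {
        let token = argmax(&logits);
        if token == eos {
            break;
        }
        callback(token);
        generated.push(token);
        // Decode step: feed the new token back in to get the next logits.
        context.push(token);
        logits = toy_logits(&context, vocab);
    }
    generated
}
```

With a prompt of `[1]`, a vocabulary of 5 and EOS token 4, the toy model emits 2 and 3 and then stops when it reaches the EOS token.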
pub fn generate_with_config(
    &mut self,
    prompt: &str,
    max_tokens: usize,
    sampler_config: SamplerConfig,
    callback: impl FnMut(&str),
) -> RuntimeResult<String>
Generate tokens using an explicit sampler config instead of the engine default.
This is the preferred entry point for per-request sampler customization (e.g., grammar-constrained sampling from the API server).
pub fn vocab_bytes(&self) -> Option<Vec<(u32, Vec<u8>)>>
Build the vocabulary byte table, used for grammar-constrained sampling.
Returns None if no tokenizer is loaded.
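As a rough illustration of why a `(token id, bytes)` table is useful for constrained sampling, the sketch below masks out every token whose bytes fall outside an allowed set. This is a deliberate simplification and an assumption on our part: a real grammar-constrained sampler tracks parser state rather than a fixed byte predicate, and `mask_logits` is a hypothetical helper, not part of this crate.

```rust
/// Set the logit of every token that contains a disallowed byte (or no bytes
/// at all) to negative infinity, so it can never be sampled.
/// `vocab` has the same shape as the table returned by `vocab_bytes`.
fn mask_logits(
    logits: &mut [f32],
    vocab: &[(u32, Vec<u8>)],
    allowed: impl Fn(u8) -> bool,
) {
    for (id, bytes) in vocab {
        let ok = !bytes.is_empty() && bytes.iter().all(|&b| allowed(b));
        if !ok {
            // Ignore IDs that fall outside the logits slice.
            if let Some(logit) = logits.get_mut(*id as usize) {
                *logit = f32::NEG_INFINITY;
            }
        }
    }
}
```

For example, passing `|b: u8| b.is_ascii_digit()` as the predicate leaves only all-digit tokens sampleable.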
pub fn apply_lora_adapters(&mut self, lora: &LoadedLora) -> RuntimeResult<()>
Apply a loaded LoRA adapter to the model’s linear layers.
Delegates to the architecture-specific ForwardPass::apply_lora
implementation, which walks the model’s layers and attaches
LoraAdapter instances to each
matching QuantLinear field.
§Errors
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
pub fn push_lora(&mut self, lora: Arc<LoadedLora>, scale: f32)
Push a LoRA adapter onto the stack with a per-entry scale multiplier.
The adapter is applied additively during inference:
output += scale · (alpha/rank) · B @ A @ input
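The additive update above can be written out for a single input vector. This is a minimal sketch assuming dense row-major `f32` matrices, with `A` of shape `rank × in_dim` and `B` of shape `out_dim × rank`; the helper names `matvec` and `add_lora_delta` are hypothetical, and the engine's own storage (for example quantized weights) may differ.

```rust
/// Multiply a row-major `rows x cols` matrix by a vector of length `cols`.
fn matvec(m: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(m.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| (0..cols).map(|c| m[r * cols + c] * x[c]).sum::<f32>())
        .collect()
}

/// Apply `output += scale * (alpha / rank) * B @ A @ input` in place.
/// `a` is `rank x input.len()`, `b` is `output.len() x rank`.
fn add_lora_delta(
    output: &mut [f32],
    input: &[f32],
    a: &[f32],
    b: &[f32],
    rank: usize,
    alpha: f32,
    scale: f32,
) {
    let down = matvec(a, rank, input.len(), input); // A @ input, length `rank`
    let up = matvec(b, output.len(), rank, &down); // B @ (A @ input)
    let factor = scale * (alpha / rank as f32);
    for (o, u) in output.iter_mut().zip(up) {
        *o += factor * u;
    }
}
```

Computing `A @ input` first keeps the intermediate vector at length `rank`, which is the usual reason for evaluating the product right to left.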
pub fn pop_lora(&mut self) -> Option<(Arc<LoadedLora>, f32)>
Remove the last adapter pushed onto the stack.
Returns None if the stack is empty.
pub fn clear_loras(&mut self)
Remove all LoRA adapters from the stack.
pub fn lora_stack(&self) -> &LoraStack
Inspect the current LoRA stack.
pub fn apply_lora_stack(&mut self) -> RuntimeResult<()>
Apply the stacked LoRA adapters to the loaded model’s linear layers.
This is a hot-swap operation: it can be called at any time without reloading the model. If the stack is empty this is a no-op.
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
pub fn unapply_all_loras(&mut self)
Remove all LoRA adapters from the loaded model’s linear layers.
Clears the lora_stack and calls unapply_all_loras on the forward
pass so every QuantLinear.lora field is set back to None.
This is the necessary counterpart to apply_lora_stack for per-request
LoRA hot-swap: push adapters, apply, generate, then unapply.
Does nothing when no model is loaded.
pub fn prime_with_prefix(
    &mut self,
    cached: &CachedKvState,
    restore_to: usize,
    suffix_tokens: &[u32],
) -> RuntimeResult<Vec<f32>>
Restore the KV cache from a cached prefix snapshot and run prefill for the suffix tokens that follow the cached prefix.
This is the prefix-KV-cache fast path: instead of re-prefilling the entire prompt from scratch, the engine restores the KV state for the longest matching cached prefix and only runs the forward pass for the remaining suffix tokens.
Restriction: the cached prefix must start at position 0 (i.e. it was
stored from the beginning of a sequence). This matches how
PrefixKvCache::store snapshots KV state.
§Errors
Returns RuntimeError::ModelNotLoaded if no model is loaded.
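How a caller might choose `restore_to` and `suffix_tokens` is not specified here. One plausible approach, shown purely as an assumption, is to pick the cached token sequence that shares the longest prefix with the new prompt. `common_prefix_len` and `best_cached_prefix` are hypothetical helpers operating on plain token vectors, not on `CachedKvState`.

```rust
/// Number of leading tokens that `a` and `b` have in common.
fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Among the cached token sequences (each assumed to start at position 0),
/// find the one sharing the longest non-empty prefix with `prompt`.
/// Returns `(index into cached, number of tokens to restore)`.
fn best_cached_prefix(cached: &[Vec<u32>], prompt: &[u32]) -> Option<(usize, usize)> {
    cached
        .iter()
        .enumerate()
        .map(|(i, c)| (i, common_prefix_len(c, prompt)))
        .filter(|&(_, n)| n > 0)
        .max_by_key(|&(_, n)| n)
}
```

Given a result `(i, restore_to)`, the suffix that still needs a forward pass is `&prompt[restore_to..]`.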
pub fn generate_with_logits(
    &mut self,
    prompt_tokens: &[u32],
    initial_logits: Vec<f32>,
    max_tokens: usize,
    sampler_config: SamplerConfig,
    callback: impl FnMut(&str),
) -> RuntimeResult<String>
Run the autoregressive decode loop starting from pre-computed logits.
Unlike Self::generate_with_config, this does not run prefill — the
caller must have already primed the KV cache (via Self::prime_with_prefix
or a full prefill) and obtained the initial logits from that step.
§Errors
Returns RuntimeError::ModelNotLoaded if no model is loaded.
pub fn config(&self) -> &EngineConfig
Returns the engine configuration.
pub fn model_config(&self) -> Option<&ModelConfig>
Returns the model configuration, if loaded.
pub fn store_kv_in_prefix_cache(
    &mut self,
    tokens: &[u32],
    prefix_cache: &mut PrefixKvCache,
)
Store the current KV cache state into a PrefixKvCache under tokens.
This is the public integration point for server-side prefix caching: after a successful generation pass the worker calls this to persist the KV state so future requests sharing the same prefix can skip prefill.
If no model is loaded (KV cache absent) the call is a silent no-op.
pub fn tokenize(&self, text: &str) -> RuntimeResult<Vec<u32>>
Tokenize text and return token IDs.
Requires that a model (and thus a tokenizer) has been loaded.
pub fn prefill(&mut self, tokens: &[u32]) -> RuntimeResult<()>
Prefill the KV cache with the given token sequence without returning logits.
Processes all tokens in order, updating the KV cache at each position.
The last token’s logits are discarded; callers typically follow up with
forward_one to begin autoregressive generation.
pub fn forward_prefill(
    &mut self,
    tokens: &[u32],
    pos_start: usize,
) -> RuntimeResult<Vec<f32>>
Run a batched prefill forward pass for the given chunk of tokens.
This is the per-chunk entry point for the chunked-prefill scheduler
fairness path (A3). It differs from prefill in two ways:
- It accepts a multi-token slice and dispatches a single batched forward call, matching the generate path's chunked prefill logic.
- It returns the logits of the last token in the chunk so that the caller can immediately begin decode sampling if pos_end equals the full prompt length.
pos_start is the KV-cache position at which this chunk begins. It
must equal the current kv_cache.seq_len() on entry; the parameter is
provided explicitly so that callers (e.g. the scheduler) can assert the
invariant in debug builds.
§Errors
Returns RuntimeError::ModelNotLoaded if no model is loaded, or
any arch-level error from the forward pass.
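The `pos_start` invariant is easiest to see in a driver loop. The sketch below uses a hypothetical `ToyEngine` that only tracks a sequence length and returns a one-element placeholder for the logits; it mirrors the documented contract (`pos_start` must equal the current sequence length) but is not the engine's implementation, and it uses `String` errors in place of `RuntimeError`.

```rust
/// Hypothetical stand-in that tracks only the KV-cache sequence length.
struct ToyEngine {
    seq_len: usize,
}

impl ToyEngine {
    /// Reject the chunk unless `pos_start` matches the current sequence length.
    /// Returns a fake one-element "logits" vector derived from the last token.
    fn forward_prefill(&mut self, tokens: &[u32], pos_start: usize) -> Result<Vec<f32>, String> {
        if pos_start != self.seq_len {
            return Err(format!("pos_start {} != seq_len {}", pos_start, self.seq_len));
        }
        let last = *tokens.last().ok_or("empty chunk")?;
        self.seq_len += tokens.len();
        Ok(vec![last as f32])
    }
}

/// Feed `prompt` in chunks of at most `chunk` tokens and return the logits
/// produced by the final chunk (empty if the prompt is empty).
fn chunked_prefill(engine: &mut ToyEngine, prompt: &[u32], chunk: usize) -> Result<Vec<f32>, String> {
    assert!(chunk > 0, "chunk size must be non-zero");
    let mut pos = 0;
    let mut logits = Vec::new();
    for piece in prompt.chunks(chunk) {
        logits = engine.forward_prefill(piece, pos)?;
        pos += piece.len();
    }
    Ok(logits)
}
```

A scheduler could interleave decode steps for other requests between the iterations of this loop, which is presumably the fairness benefit of chunking.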
pub fn forward_decode(
    &mut self,
    token: u32,
    pos: usize,
) -> RuntimeResult<Vec<f32>>
Run a single autoregressive decode step for token and return logits.
This is the per-step entry point for the chunked-prefill scheduler
fairness path (A3). It is semantically equivalent to forward_one
but named differently to make the prefill/decode distinction explicit
in call sites inside the engine and scheduler integration layer.
pos is the current sequence position (= kv_cache.seq_len()). It
is accepted as a parameter so that callers can assert the invariant.
§Errors
Returns RuntimeError::ModelNotLoaded if no model is loaded.
pub fn forward_one(&mut self, token: u32) -> RuntimeResult<Vec<f32>>
Run a single forward pass for token and return raw logits.
The KV cache is updated (one position advanced).
pub fn is_eos(&self, token: u32) -> bool
Returns true if token is the EOS token for this model.
pub fn decode_token(&self, token: u32) -> RuntimeResult<String>
Decode a single token ID to its string representation.
pub fn metrics(&self) -> Arc<EngineMetrics>
Returns a shared reference to the engine’s live metrics counters.
pub fn metrics_snapshot(&self) -> MetricsSnapshot
Returns a point-in-time MetricsSnapshot of the engine’s counters.
pub fn kv_snapshot(&self) -> Option<KvCacheSnapshot>
Capture a KvCacheSnapshot from the current KV cache state.
Returns None if no model (and thus no KV cache) is loaded.
pub fn kv_restore(&mut self, snapshot: &KvCacheSnapshot) -> RuntimeResult<()>
Restore the KV cache state from a previously captured KvCacheSnapshot.
Returns RuntimeError::ModelNotLoaded if no model is loaded.
pub fn truncate(&mut self, n: usize) -> RuntimeResult<()>
Truncate the KV cache to n tokens.
After this call the engine behaves as if only n tokens have been
processed. This is the low-level primitive used by speculative
decoding on divergence rollback.
§Errors
Returns RuntimeError::ModelNotLoaded if no model is loaded.
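To illustrate the rollback use case, here is a minimal sketch of how speculative decoding might use a truncate primitive. `ToyKv` is a hypothetical cache that stores token IDs instead of key/value tensors, and `accept_draft` assumes the simplest acceptance rule (keep draft tokens until the first disagreement with the verifier); real schemes may accept or reject tokens probabilistically.

```rust
/// Hypothetical KV cache that records only which tokens it has processed.
struct ToyKv {
    tokens: Vec<u32>,
}

impl ToyKv {
    fn seq_len(&self) -> usize {
        self.tokens.len()
    }

    /// Behave as if only the first `n` tokens had been processed.
    fn truncate(&mut self, n: usize) {
        self.tokens.truncate(n);
    }
}

/// Speculatively append `draft`, compare it with the tokens the target model
/// actually chose (`verified`), and roll the cache back to the point of
/// divergence. Returns the number of accepted draft tokens.
fn accept_draft(kv: &mut ToyKv, draft: &[u32], verified: &[u32]) -> usize {
    let base = kv.seq_len();
    kv.tokens.extend_from_slice(draft);
    let accepted = draft
        .iter()
        .zip(verified)
        .take_while(|(d, v)| d == v)
        .count();
    // Drop the cache entries for every draft token after the divergence.
    kv.truncate(base + accepted);
    accepted
}
```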
pub fn kv_cache_seq_len(&self) -> usize
Return the current KV cache sequence length.
Returns 0 if no model is loaded.
Returns the model’s hidden state dimension, if a model is loaded.
pub fn embed(&mut self, text: &str) -> RuntimeResult<Vec<f32>>
Compute a semantic embedding vector for the given text using PoolingMode::Last.
This is a convenience wrapper around Self::embed_with. Runs tokenization →
full transformer layers → final RMSNorm, then L2-normalises the resulting
hidden_size-dimensional vector. The KV cache is reset before the pass
so that embeddings for different inputs are independent of each other.
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
pub fn embed_with(
    &mut self,
    text: &str,
    mode: PoolingMode,
) -> RuntimeResult<Vec<f32>>
Compute a semantic embedding vector for the given text using the specified pooling strategy.
Runs tokenization → full transformer layers → final RMSNorm → pooling,
then L2-normalises the resulting hidden_size-dimensional vector.
The KV cache is reset before the pass so that embeddings for different
inputs are independent of each other.
§Pooling modes
- PoolingMode::Last: last token hidden state (causal / decoder models).
- PoolingMode::Mean: mean across all token positions.
- PoolingMode::Max: elementwise max across all token positions.
- PoolingMode::Cls: first token hidden state (BERT / encoder models).
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
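The four pooling strategies and the final L2 normalisation can be sketched over a plain matrix of per-token hidden states. The `Pooling` enum and `pool_and_normalise` function below are hypothetical stand-ins written for illustration; they assume one `Vec<f32>` per token position, all of the same length, and do not reflect how the engine stores hidden states internally.

```rust
/// Hypothetical mirror of the four pooling modes described above.
#[derive(Clone, Copy)]
enum Pooling {
    Last,
    Mean,
    Max,
    Cls,
}

/// Pool per-token hidden states (one row per token) into a single vector,
/// then L2-normalise it. Returns `None` when there are no tokens.
fn pool_and_normalise(hidden: &[Vec<f32>], mode: Pooling) -> Option<Vec<f32>> {
    let first = hidden.first()?;
    let last = hidden.last()?;
    let mut out = match mode {
        Pooling::Last => last.clone(),
        Pooling::Cls => first.clone(),
        Pooling::Mean => {
            let mut acc = vec![0.0_f32; first.len()];
            for row in hidden {
                for (a, v) in acc.iter_mut().zip(row) {
                    *a += *v;
                }
            }
            let n = hidden.len() as f32;
            acc.iter_mut().for_each(|a| *a /= n);
            acc
        }
        Pooling::Max => {
            let mut acc = first.clone();
            for row in &hidden[1..] {
                for (a, v) in acc.iter_mut().zip(row) {
                    *a = a.max(*v);
                }
            }
            acc
        }
    };
    // L2 normalisation; an all-zero vector is left unchanged.
    let norm = out.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        out.iter_mut().for_each(|v| *v /= norm);
    }
    Some(out)
}
```

Because every mode ends with the same normalisation step, the resulting vectors can be compared with a plain dot product regardless of which pooling strategy produced them.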
pub fn embed_batch(&mut self, texts: &[String]) -> RuntimeResult<Vec<Vec<f32>>>
Extract embedding vectors for multiple input texts using PoolingMode::Last.
This is a convenience wrapper around Self::embed_batch_with.
Each text is processed independently with a fresh KV cache.
pub fn embed_batch_with(
    &mut self,
    texts: &[&str],
    mode: PoolingMode,
) -> RuntimeResult<Vec<Vec<f32>>>
Extract embedding vectors for multiple input texts using the specified pooling strategy.
Each text is processed independently with a fresh KV cache. The output order matches the input order.
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
impl InferenceEngine
pub fn snapshot(&self) -> RuntimeResult<Vec<u8>>
Capture the full engine state as a portable byte blob.
The returned bytes can be stored on disk, sent over the network, or
embedded in a database. Pass them to InferenceEngine::resume to
resume inference from the same position.
§Limitations
- Grammar state: only the grammar source string is stored. On resume the grammar state is reset to its initial state — any partial progress through a grammar constraint is lost.
- Sampler state: the engine creates a new Sampler for each generate() call. The snapshot captures the config values rather than live RNG state from an in-flight generation.
Returns RuntimeError::ModelNotLoaded if no model has been loaded.
pub fn resume(bytes: &[u8], model_path: &Path) -> RuntimeResult<Self>
Resume an inference session from a previously captured snapshot.
- Deserializes the snapshot bytes.
- Validates the model fingerprint against model_path on disk.
- Loads the model from model_path.
- Restores the KV cache state.
- Restores the sampler config.
- If a grammar source was saved, re-parses it (grammar state is reset to initial).
§Errors
- RuntimeError::SnapshotIncompatible: bytes are not a valid snapshot.
- RuntimeError::ModelFingerprintMismatch: model file differs from snapshot.
- Any error from loading the model.
Trait Implementations§
impl Rewindable for InferenceEngine
Rewindable implementation for InferenceEngine.
Delegates to the engine’s internal KV cache truncate
method. If the engine has no loaded model (and thus no KV cache) the
method returns RuntimeError::ModelNotLoaded wrapped in RewindError::Runtime.
Auto Trait Implementations§
impl !Freeze for InferenceEngine
impl !RefUnwindSafe for InferenceEngine
impl Send for InferenceEngine
impl Sync for InferenceEngine
impl Unpin for InferenceEngine
impl UnsafeUnpin for InferenceEngine
impl !UnwindSafe for InferenceEngine
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.