pub struct InferenceEngine<'a> { /* private fields */ }
Top-level inference engine.
Implementations
impl<'a> InferenceEngine<'a>
pub fn new(
    config: Qwen3Config,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Create a new inference engine from a configuration (no weights; intended for testing).
pub fn from_model(
    model: BonsaiModel<'a>,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Wrap an already-constructed BonsaiModel in an inference engine.
Lets tests (and future custom-model paths) build a model with non-trivial weights and then attach the standard sampler/kernel machinery without going through the GGUF loader.
pub fn from_model_with_kernel(
    model: BonsaiModel<'a>,
    kernel: KernelDispatcher,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Wrap an already-constructed BonsaiModel using a caller-supplied kernel dispatcher.
Use this when you need to pin the engine to a specific kernel tier (e.g. a CPU-only KernelTier::Reference for tests that exercise the CPU KV-cache path on a host that would otherwise auto-detect a GPU).
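Example
A minimal sketch of pinning a test engine to the reference tier. KernelDispatcher::from_tier and build_test_model are assumed helpers for illustration, not confirmed API; only from_model_with_kernel itself is documented here:

// Sketch: force the CPU reference tier regardless of host hardware.
// `build_test_model` and `KernelDispatcher::from_tier` are assumptions.
let model = build_test_model();
let kernel = KernelDispatcher::from_tier(KernelTier::Reference);
let mut engine = InferenceEngine::from_model_with_kernel(
    model,
    kernel,
    SamplingParams::default(), // assumed Default impl
    42,                        // seed
);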
pub fn from_gguf(
    gguf: &'a GgufFile<'a>,
    sampling_params: SamplingParams,
    seed: u64,
    max_seq_len: usize,
) -> RuntimeResult<Self>
Create a new inference engine from a loaded GGUF file.
pub fn set_metrics(&mut self, metrics: Arc<InferenceMetrics>)
Attach shared metrics to this engine for recording inference telemetry.
pub fn set_rate_aggregator(&mut self, aggregator: Arc<RequestRateAggregator>)
Attach a workload-level RequestRateAggregator to this engine.
Once attached, every call to InferenceEngine::generate_tracked (or InferenceEngine::generate_with_request_id) will push its per-request RequestRateSnapshot into the aggregator on completion.
The aggregator is reference-counted, so the same instance can be shared with the Prometheus metrics layer or the admin endpoints.
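Example
A hedged sketch of sharing one aggregator between the engine and other consumers; RequestRateAggregator::default() is an assumed constructor:

use std::sync::Arc;

let aggregator = Arc::new(RequestRateAggregator::default()); // assumed Default impl
engine.set_rate_aggregator(Arc::clone(&aggregator));
// Hand the same Arc to the Prometheus metrics layer or admin endpoints;
// completed tracked requests now push their snapshots into it.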
pub fn rate_aggregator(&self) -> Option<&Arc<RequestRateAggregator>>
Read-only access to the attached rate aggregator, if any.
pub fn model(&self) -> &BonsaiModel<'a>
Get a reference to the model.
pub fn model_mut(&mut self) -> &mut BonsaiModel<'a>
Get a mutable reference to the model.
Used by the prefix-cache integration to inject restored KV blocks before running the abbreviated prefill.
pub fn kernel(&self) -> &KernelDispatcher
Get a reference to the kernel dispatcher.
pub fn prefill_from_pos(
    &mut self,
    prompt_tokens: &[u32],
    pos_start: usize,
) -> RuntimeResult<Vec<f32>>
Run prefill at a given KV-cache offset.
Unlike InferenceEngine::generate, this does not reset the model’s KV cache before execution: callers (e.g. the prefix-cache engine) are expected to have prepared the cache state explicitly.
Increments the prefill_token_count counter by prompt_tokens.len() on success.
pub fn decode_step(&mut self, token: u32, pos: usize) -> RuntimeResult<Vec<f32>>
Forward one token at the given absolute position.
pub fn sample(&mut self, logits: &[f32]) -> RuntimeResult<u32>
Sample one token from logits using the engine’s current sampler.
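Example
Together, prefill_from_pos, decode_step, and sample form the manual decode loop that generate automates. A hedged sketch, assuming the KV cache has been prepared (e.g. freshly reset or restored from a prefix cache) and that eos_token is supplied by the caller:

// Manual prefill + decode loop against a prepared cache.
let mut logits = engine.prefill_from_pos(&prompt_tokens, 0)?;
let mut pos = prompt_tokens.len();
let mut generated = Vec::new();
for _ in 0..max_tokens {
    let token = engine.sample(&logits)?;
    if token == eos_token { // eos_token: caller-provided, not part of this API
        break;
    }
    generated.push(token);
    logits = engine.decode_step(token, pos)?;
    pos += 1;
}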
pub fn prefill_token_count(&self) -> u64
Cumulative number of tokens that have been processed by InferenceEngine::prefill_from_pos over this engine’s lifetime.
pub fn stats(&self) -> &Arc<EngineStats>
Get a shared reference to the engine statistics.
pub fn active_sessions(&self) -> usize
Number of currently active sessions (tracked via stats).
pub fn session_count(&self) -> u64
Total number of completed requests (tracked via stats).
pub fn batch_generate(
    &mut self,
    prompts: &[Vec<u32>],
    max_tokens: usize,
) -> Vec<RuntimeResult<BatchResult>>
Process a batch of prompts, delegating to batch_engine::batch_generate.
Resets the engine state between each prompt. Returns one result per prompt.
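Example
A hedged sketch; the prompt token IDs are placeholders that would normally come from a tokenizer:

let prompts: Vec<Vec<u32>> = vec![vec![1, 2, 3], vec![4, 5]]; // placeholder IDs
for result in engine.batch_generate(&prompts, 64) {
    match result {
        Ok(batch) => { /* inspect this prompt's BatchResult */ }
        Err(e) => eprintln!("prompt failed: {e}"),
    }
}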
pub fn generate(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt.
Runs prefill (processing the entire prompt), then decodes token by token until max_tokens or EOS is reached.
Returns the generated token IDs (not including the prompt).
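Example
A minimal usage sketch; the token IDs are placeholders that would normally come from a tokenizer:

let prompt: Vec<u32> = vec![1, 2, 3]; // placeholder token IDs
let completion = engine.generate(&prompt, 128)?;
// `completion` contains only the newly generated IDs, not the prompt.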
pub fn generate_tracked(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tracker: &mut RequestRateTracker,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt while populating a RequestRateTracker.
Behaves identically to InferenceEngine::generate but additionally:
- records record_admission() immediately on entry,
- records record_first_token() for the first sampled token,
- records record_token() for every subsequent sampled token,
- on success, pushes the resulting RequestRateSnapshot into the engine’s attached RequestRateAggregator (if any).
The tracker is borrowed mutably so callers can inspect intermediate state via RequestRateTracker::snapshot after the call returns.
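Example
A hedged sketch; RequestRateTracker::new() is an assumed constructor (only RequestRateTracker::snapshot is referenced above):

let mut tracker = RequestRateTracker::new(); // assumed constructor
let tokens = engine.generate_tracked(&prompt, 128, &mut tracker)?;
let snapshot = tracker.snapshot(); // inspect per-request admission/token timings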
pub fn generate_with_request_id(
    &mut self,
    request_id: RequestId,
    prompt_tokens: &[u32],
    max_tokens: usize,
) -> RuntimeResult<(Vec<u32>, RequestRateTracker)>
Generate tokens from a prompt, with a RequestId tagging the surrounding tracing span and an internally managed RequestRateTracker.
Returns both the generated tokens and the final tracker so callers can extract per-request metrics (e.g. queue-wait, p95 inter-token latency) for client-side observability.
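Example
A hedged sketch; RequestId::new() is a hypothetical stand-in for however request IDs are actually constructed:

let request_id = RequestId::new(); // hypothetical constructor
let (tokens, tracker) = engine.generate_with_request_id(request_id, &prompt, 128)?;
let snapshot = tracker.snapshot(); // e.g. queue-wait, inter-token latency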
pub fn generate_with_seed(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    seed: u64,
    params: &SamplingParams,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt using a specific seed for this run.
Temporarily overrides the sampler seed for deterministic multi-completion generation (n > 1). The sampler state is replaced for the duration of this call and then restored.
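Example
A hedged sketch of deterministic multi-completion generation; deriving per-completion seeds from a base seed is one possible scheme, and SamplingParams::default() is assumed:

let params = SamplingParams::default(); // assumed Default impl
let base_seed = 42u64;
// Rerunning with the same base seed reproduces the same three completions.
let completions: Vec<_> = (0..3u64)
    .map(|i| engine.generate_with_seed(&prompt, 128, base_seed + i, &params))
    .collect();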
pub fn generate_streaming(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tx: &UnboundedSender<u32>,
) -> RuntimeResult<usize>
Generate tokens one at a time, sending each through the channel. Returns the total count of generated tokens.
Not available on WASM targets (tokio channels not supported on wasm32-unknown-unknown).
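Example
A hedged sketch with a tokio unbounded channel. The call itself blocks, so a real server would drain the receiver concurrently; here the tokens are drained after the fact for simplicity:

let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u32>();
let count = engine.generate_streaming(&prompt, 128, &tx)?;
drop(tx); // close the channel so the drain loop below terminates
while let Some(token) = rx.blocking_recv() {
    // forward `token` to the client
}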
pub fn generate_streaming_sync(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tx: &Sender<u32>,
) -> RuntimeResult<usize>
Streaming generation using a synchronous std::sync::mpsc::Sender.
Each generated token is sent through the channel immediately, allowing the consumer to print tokens as they arrive without requiring a tokio runtime.
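Example
A hedged sketch pairing the engine with a consumer thread, so tokens print as they arrive:

use std::sync::mpsc;
use std::thread;

let (tx, rx) = mpsc::channel::<u32>();
let printer = thread::spawn(move || {
    for token in rx {
        println!("{token}"); // decode with a tokenizer in practice
    }
});
let count = engine.generate_streaming_sync(&prompt, 128, &tx)?;
drop(tx); // close the channel so the printer thread exits
printer.join().unwrap();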
impl InferenceEngine<'static>
pub fn from_gguf_path(
    path: impl AsRef<Path>,
    sampling_params: SamplingParams,
    seed: u64,
    max_seq_len: usize,
) -> RuntimeResult<Self>
Load an InferenceEngine directly from a path to a GGUF file.
This is a convenience wrapper intended for server/CLI entry points that need an owned, 'static engine. It memory-maps the file, parses the GGUF container, and leaks both allocations so that the borrowed GgufFile<'a> lifetime can be promoted to 'static.
The leaked memory is intentional: the engine is expected to live for the process lifetime. Do not call this in hot paths.
Errors
Returns RuntimeError::FileNotFound if path does not exist. Other IO / parse / model-init errors propagate through RuntimeError.
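Example
A hedged startup sketch; the model path is a placeholder and SamplingParams::default() is assumed:

// Memory-maps and intentionally leaks the file; call once at startup.
let mut engine = InferenceEngine::from_gguf_path(
    "models/qwen3.gguf",       // placeholder path
    SamplingParams::default(), // assumed Default impl
    42,                        // seed
    4096,                      // max_seq_len
)?;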
Auto Trait Implementations
impl<'a> Freeze for InferenceEngine<'a>
impl<'a> RefUnwindSafe for InferenceEngine<'a>
impl<'a> Send for InferenceEngine<'a>
impl<'a> Sync for InferenceEngine<'a>
impl<'a> Unpin for InferenceEngine<'a>
impl<'a> UnsafeUnpin for InferenceEngine<'a>
impl<'a> UnwindSafe for InferenceEngine<'a>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.