Struct InferenceEngine

pub struct InferenceEngine<'a> { /* private fields */ }

Top-level inference engine.

Implementations§

impl<'a> InferenceEngine<'a>

pub fn new(config: Qwen3Config, sampling_params: SamplingParams, seed: u64) -> Self

Create a new inference engine from a configuration (no weights — for testing).

pub fn from_model(model: BonsaiModel<'a>, sampling_params: SamplingParams, seed: u64) -> Self

Wrap an already-constructed BonsaiModel in an inference engine.

Lets tests (and future custom-model paths) build a model with non-trivial weights and then attach the standard sampler/kernel machinery without going through the GGUF loader.

pub fn from_model_with_kernel(model: BonsaiModel<'a>, kernel: KernelDispatcher, sampling_params: SamplingParams, seed: u64) -> Self

Wrap an already-constructed BonsaiModel using a caller-supplied kernel dispatcher.

Use this when you need to pin the engine to a specific kernel tier (e.g. a CPU-only KernelTier::Reference for tests that exercise the CPU KV-cache path on a host that would otherwise auto-detect a GPU).

pub fn from_gguf(gguf: &'a GgufFile<'a>, sampling_params: SamplingParams, seed: u64, max_seq_len: usize) -> RuntimeResult<Self>

Create a new inference engine from a loaded GGUF file.

pub fn set_metrics(&mut self, metrics: Arc<InferenceMetrics>)

Attach shared metrics to this engine for recording inference telemetry.

pub fn set_rate_aggregator(&mut self, aggregator: Arc<RequestRateAggregator>)

Attach a workload-level RequestRateAggregator to this engine.

Once attached, every call to InferenceEngine::generate_tracked (or InferenceEngine::generate_with_request_id) will push its per-request RequestRateSnapshot into the aggregator on completion. The aggregator is reference-counted, so the same instance can be shared with the Prometheus metrics layer or the admin endpoints.
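The sharing pattern described above can be sketched with a stand-in aggregator; `Aggregator`, `push_snapshot`, and `totals` are illustrative names, since `RequestRateAggregator`'s actual API is not shown on this page.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Stand-in for RequestRateAggregator: an atomically-updated sink that
// several holders of the same Arc can push into concurrently.
#[derive(Default)]
struct Aggregator {
    requests: AtomicU64,
    tokens: AtomicU64,
}

impl Aggregator {
    // What the "engine" would do when a tracked request completes.
    fn push_snapshot(&self, generated_tokens: u64) {
        self.requests.fetch_add(1, Ordering::Relaxed);
        self.tokens.fetch_add(generated_tokens, Ordering::Relaxed);
    }

    // What the metrics layer or an admin endpoint would read.
    fn totals(&self) -> (u64, u64) {
        (
            self.requests.load(Ordering::Relaxed),
            self.tokens.load(Ordering::Relaxed),
        )
    }
}

fn main() {
    let agg = Arc::new(Aggregator::default());
    // The same Arc is cloned into the engine and into the metrics layer.
    let engine_side = Arc::clone(&agg);
    let metrics_side = Arc::clone(&agg);
    engine_side.push_snapshot(42);
    println!("{:?}", metrics_side.totals());
}
```

Because the counters are atomic, neither side needs a lock to record or read, which is why a single reference-counted instance can serve the engine, Prometheus, and the admin endpoints at once.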

pub fn rate_aggregator(&self) -> Option<&Arc<RequestRateAggregator>>

Read-only access to the attached rate aggregator, if any.

pub fn model(&self) -> &BonsaiModel<'a>

Get a reference to the model.

pub fn model_mut(&mut self) -> &mut BonsaiModel<'a>

Get a mutable reference to the model.

Used by the prefix-cache integration to inject restored KV blocks before running the abbreviated prefill.

pub fn kernel(&self) -> &KernelDispatcher

Get a reference to the kernel dispatcher.

pub fn prefill_from_pos(&mut self, prompt_tokens: &[u32], pos_start: usize) -> RuntimeResult<Vec<f32>>

Run prefill at a given KV-cache offset.

Unlike InferenceEngine::generate, this does not reset the model’s KV cache before execution: callers (e.g. the prefix-cache engine) are expected to have prepared the cache state explicitly.

Increments the prefill_token_count counter by prompt_tokens.len() on success.

pub fn decode_step(&mut self, token: u32, pos: usize) -> RuntimeResult<Vec<f32>>

Forward one token at the given absolute position.

pub fn sample(&mut self, logits: &[f32]) -> RuntimeResult<u32>

Sample one token from logits using the engine’s current sampler.

pub fn prefill_token_count(&self) -> u64

Cumulative number of tokens that have been processed by InferenceEngine::prefill_from_pos over this engine’s lifetime.

pub fn reset(&mut self)

Reset the model state for a new conversation.

pub fn stats(&self) -> &Arc<EngineStats>

Get a shared reference to the engine statistics.

pub fn active_sessions(&self) -> usize

Number of currently active sessions (tracked via stats).

pub fn session_count(&self) -> u64

Total number of completed requests (tracked via stats).

pub fn batch_generate(&mut self, prompts: &[Vec<u32>], max_tokens: usize) -> Vec<RuntimeResult<BatchResult>>

Process a batch of prompts, delegating to batch_engine::batch_generate.

Resets the engine state between each prompt. Returns one result per prompt.

pub fn generate(&mut self, prompt_tokens: &[u32], max_tokens: usize) -> RuntimeResult<Vec<u32>>

Generate tokens from a prompt.

Runs prefill (process the entire prompt), then decodes token by token until max_tokens or EOS is reached. Returns the generated token IDs (not including the prompt).
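The prefill-then-decode control flow described above can be sketched with stand-ins: `toy_logits` replaces the real model forward pass (a real prefill runs every prompt position; here only the last token's logits matter), and greedy argmax replaces the engine's sampler.

```rust
// Toy sketch of generate(): prefill once, then decode token-by-token
// until max_tokens or EOS. Nothing here is the real BonsaiModel.
const EOS: u32 = 2;
const VOCAB: u32 = 8;

// Stand-in forward pass: deterministic logits derived from one token.
fn toy_logits(token: u32) -> Vec<f32> {
    (0..VOCAB).map(|i| (i ^ token) as f32).collect()
}

// Greedy stand-in for the sampler: index of the largest logit.
fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i as u32)
        .unwrap()
}

// Returns only the generated tokens, never the prompt; when EOS is
// sampled it is the final element.
fn generate(prompt: &[u32], max_tokens: usize) -> Vec<u32> {
    // "Prefill": obtain logits for the position after the prompt.
    let mut logits = toy_logits(*prompt.last().expect("non-empty prompt"));
    let mut out = Vec::new();
    // Decode loop: sample, append, stop on EOS or max_tokens.
    while out.len() < max_tokens {
        let next = argmax(&logits);
        out.push(next);
        if next == EOS {
            break;
        }
        logits = toy_logits(next);
    }
    out
}

fn main() {
    println!("{:?}", generate(&[5], 8)); // stops early on EOS
    println!("{:?}", generate(&[1], 3)); // stops at max_tokens
}
```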

pub fn generate_tracked(&mut self, prompt_tokens: &[u32], max_tokens: usize, tracker: &mut RequestRateTracker) -> RuntimeResult<Vec<u32>>

Generate tokens from a prompt while populating a RequestRateTracker.

Behaves identically to InferenceEngine::generate but additionally:

  • records record_admission() immediately on entry,
  • records record_first_token() for the first sampled token,
  • records record_token() for every subsequent sampled token,
  • on success, pushes the resulting RequestRateSnapshot into the engine’s attached RequestRateAggregator (if any).

The tracker is borrowed mutably so callers can inspect intermediate state via RequestRateTracker::snapshot after the call returns.
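The recording order listed above can be sketched with a stand-in tracker; the real `RequestRateTracker` also captures timestamps so it can derive latency percentiles, which this illustration omits.

```rust
// Stand-in RequestRateTracker: records the same event sequence that
// generate_tracked performs around its decode loop.
#[derive(Default, Debug)]
struct Tracker {
    admitted: bool,
    first_token_seen: bool,
    tokens: u64,
}

impl Tracker {
    fn record_admission(&mut self) {
        self.admitted = true;
    }
    fn record_first_token(&mut self) {
        self.first_token_seen = true;
        self.tokens = 1;
    }
    fn record_token(&mut self) {
        self.tokens += 1;
    }
}

// Mirrors generate_tracked's bookkeeping for a run of sampled tokens.
fn tracked_generate(sampled: &[u32], tracker: &mut Tracker) {
    tracker.record_admission(); // immediately on entry
    for (i, _tok) in sampled.iter().enumerate() {
        if i == 0 {
            tracker.record_first_token(); // first sampled token
        } else {
            tracker.record_token(); // every subsequent token
        }
    }
}

fn main() {
    let mut t = Tracker::default();
    tracked_generate(&[17, 4, 99], &mut t);
    println!("{t:?}");
}
```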

pub fn generate_with_request_id(&mut self, request_id: RequestId, prompt_tokens: &[u32], max_tokens: usize) -> RuntimeResult<(Vec<u32>, RequestRateTracker)>

Generate tokens from a prompt with a RequestId tagging the surrounding tracing span and an internally-managed RequestRateTracker.

Returns both the generated tokens and the final tracker so callers can extract per-request metrics (e.g. queue-wait, p95 inter-token latency) for client-side observability.

pub fn generate_with_seed(&mut self, prompt_tokens: &[u32], max_tokens: usize, seed: u64, params: &SamplingParams) -> RuntimeResult<Vec<u32>>

Generate tokens from a prompt using a specific seed for this run.

Temporarily overrides the sampler seed for deterministic multi-completion generation (n > 1). The sampler state is replaced for the duration of this call and then restored.
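Why a per-run seed makes multi-completion generation reproducible can be shown with a tiny xorshift64 sampler stand-in (the engine's real sampler is not shown on this page): the same seed and parameters always yield the same token sequence.

```rust
// Stand-in sampler: xorshift64 seeded explicitly. generate_with_seed
// swaps in a fresh sampler like this for one run, then restores the
// engine's long-lived sampler state.
struct Sampler {
    state: u64,
}

impl Sampler {
    fn new(seed: u64) -> Self {
        // xorshift64 must never start at zero.
        Sampler { state: seed.max(1) }
    }
    fn next_token(&mut self, vocab: u64) -> u32 {
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        (self.state % vocab) as u32
    }
}

// One "completion" of n tokens under an explicit seed.
fn sample_run(seed: u64, n: usize) -> Vec<u32> {
    let mut sampler = Sampler::new(seed);
    (0..n).map(|_| sampler.next_token(32_000)).collect()
}

fn main() {
    let a = sample_run(7, 5);
    let b = sample_run(7, 5);
    assert_eq!(a, b); // same seed, identical completion
    println!("{a:?}");
}
```

For n > 1 completions of one prompt, calling this with seeds derived from a base seed (e.g. `base + i`) gives distinct yet individually reproducible completions.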

pub fn generate_streaming(&mut self, prompt_tokens: &[u32], max_tokens: usize, tx: &UnboundedSender<u32>) -> RuntimeResult<usize>

Generate tokens one at a time, sending each through the channel. Returns the total count of generated tokens.

Not available on WASM targets (tokio channels not supported on wasm32-unknown-unknown).

pub fn generate_streaming_sync(&mut self, prompt_tokens: &[u32], max_tokens: usize, tx: &Sender<u32>) -> RuntimeResult<usize>

Streaming generation using a synchronous std::sync::mpsc::Sender.

Each generated token is sent through the channel immediately, allowing the consumer to print tokens as they arrive without requiring a tokio runtime.
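The producer/consumer shape this enables can be sketched with `std::sync::mpsc` alone; `stream_tokens` is a stand-in for the engine's generation loop, not the real method.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for generate_streaming_sync: send each "token" through the
// channel as soon as it is produced, return the total count.
fn stream_tokens(tokens: Vec<u32>, tx: mpsc::Sender<u32>) -> usize {
    let mut sent = 0;
    for t in tokens {
        if tx.send(t).is_err() {
            break; // consumer hung up; stop generating
        }
        sent += 1;
    }
    sent
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || stream_tokens(vec![11, 22, 33], tx));
    // Consumer drains tokens as they arrive, with no async runtime;
    // iteration ends once the sender (moved into the thread) is dropped.
    let received: Vec<u32> = rx.iter().collect();
    assert_eq!(producer.join().unwrap(), received.len());
    println!("{received:?}");
}
```

Checking `send`'s result matters: a synchronous sender errors once the receiver is dropped, which is how the generating side learns the client went away mid-stream.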

impl InferenceEngine<'static>

pub fn from_gguf_path(path: impl AsRef<Path>, sampling_params: SamplingParams, seed: u64, max_seq_len: usize) -> RuntimeResult<Self>

Load an InferenceEngine directly from a path to a GGUF file.

This is a convenience wrapper intended for server/CLI entry points that need an owned, 'static engine. It memory-maps the file, parses the GGUF container, and leaks both allocations so that the borrowed GgufFile<'a> lifetime can be promoted to 'static.

The leaked memory is intentional: the engine is expected to live for the process lifetime. Do not call this in hot paths.

§Errors

Returns RuntimeError::FileNotFound if path does not exist. Other IO / parse / model-init errors propagate through RuntimeError.
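The lifetime-promotion trick described above can be sketched in miniature: leaking a heap allocation yields a `&'static` borrow, so a type borrowing from it can itself be `'static`. `Container` and `load_static` are illustrative names, not the real `GgufFile` loader.

```rust
// A borrowing type standing in for GgufFile<'a>.
struct Container<'a> {
    bytes: &'a [u8],
}

// Promote an owned buffer to 'static by leaking it, as from_gguf_path
// does with the mmap and the parsed container.
fn load_static(data: Vec<u8>) -> Container<'static> {
    // Box::leak never frees: acceptable for a process-lifetime
    // singleton, a leak-per-call bug anywhere hot.
    let leaked: &'static [u8] = Box::leak(data.into_boxed_slice());
    Container { bytes: leaked }
}

fn main() {
    // 0x47 0x47 0x55 0x46 is the ASCII "GGUF" magic.
    let c: Container<'static> = load_static(vec![0x47, 0x47, 0x55, 0x46]);
    println!("{:?}", c.bytes);
}
```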

Auto Trait Implementations§

impl<'a> Freeze for InferenceEngine<'a>

impl<'a> RefUnwindSafe for InferenceEngine<'a>

impl<'a> Send for InferenceEngine<'a>

impl<'a> Sync for InferenceEngine<'a>

impl<'a> Unpin for InferenceEngine<'a>

impl<'a> UnsafeUnpin for InferenceEngine<'a>

impl<'a> UnwindSafe for InferenceEngine<'a>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self). That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.

impl<A, B, T> HttpServerConnExec<A, B> for T
where B: Body,