pub struct InferenceEngine<'a> { /* private fields */ }
Top-level inference engine.
Implementations
impl<'a> InferenceEngine<'a>
pub fn new(
    config: Qwen3Config,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Create a new inference engine from a configuration (no weights; intended for testing).
pub fn from_model(
    model: BonsaiModel<'a>,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Wrap an already-constructed BonsaiModel in an inference engine.
Lets tests (and future custom-model paths) build a model with non-trivial weights and then attach the standard sampler/kernel machinery without going through the GGUF loader.
pub fn from_model_with_kernel(
    model: BonsaiModel<'a>,
    kernel: KernelDispatcher,
    sampling_params: SamplingParams,
    seed: u64,
) -> Self
Wrap an already-constructed BonsaiModel using a caller-supplied kernel dispatcher.
Use this when you need to pin the engine to a specific kernel tier (e.g. a CPU-only KernelTier::Reference for tests that exercise the CPU KV-cache path on a host that would otherwise auto-detect a GPU).
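Example
A minimal sketch of pinning a test engine to the reference tier. KernelDispatcher::from_tier and build_test_model are assumed helpers for illustration, not confirmed API; only from_model_with_kernel itself is documented here:

// Sketch: force the CPU reference tier regardless of host hardware.
// `build_test_model` and `KernelDispatcher::from_tier` are assumptions.
let model = build_test_model();
let kernel = KernelDispatcher::from_tier(KernelTier::Reference);
let mut engine = InferenceEngine::from_model_with_kernel(
    model,
    kernel,
    SamplingParams::default(), // assumed Default impl
    42,                        // seed
);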
pub fn from_gguf(
    gguf: &'a GgufFile<'a>,
    sampling_params: SamplingParams,
    seed: u64,
    max_seq_len: usize,
) -> RuntimeResult<Self>
Create a new inference engine from a loaded GGUF file.
pub fn set_metrics(&mut self, metrics: Arc<InferenceMetrics>)
Attach shared metrics to this engine for recording inference telemetry.
pub fn set_rate_aggregator(&mut self, aggregator: Arc<RequestRateAggregator>)
Attach a workload-level RequestRateAggregator to this engine.
Once attached, every call to InferenceEngine::generate_tracked (or InferenceEngine::generate_with_request_id) will push its per-request RequestRateSnapshot into the aggregator on completion.
The aggregator is reference-counted, so the same instance can be shared with the Prometheus metrics layer or the admin endpoints.
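Example
A hedged sketch of sharing one aggregator between the engine and other consumers; RequestRateAggregator::default() is an assumed constructor:

use std::sync::Arc;

let aggregator = Arc::new(RequestRateAggregator::default()); // assumed Default impl
engine.set_rate_aggregator(Arc::clone(&aggregator));
// Hand the same Arc to the Prometheus metrics layer or admin endpoints;
// completed tracked requests now push their snapshots into it.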
pub fn rate_aggregator(&self) -> Option<&Arc<RequestRateAggregator>>
Read-only access to the attached rate aggregator, if any.
pub fn model(&self) -> &BonsaiModel<'a>
Get a reference to the model.
pub fn model_mut(&mut self) -> &mut BonsaiModel<'a>
Get a mutable reference to the model.
Used by the prefix-cache integration to inject restored KV blocks before running the abbreviated prefill.
pub fn kernel(&self) -> &KernelDispatcher
Get a reference to the kernel dispatcher.
pub fn prefill_from_pos(
    &mut self,
    prompt_tokens: &[u32],
    pos_start: usize,
) -> RuntimeResult<Vec<f32>>
Run prefill at a given KV-cache offset.
Unlike InferenceEngine::generate, this does not reset the model’s KV cache before execution: callers (e.g. the prefix-cache engine) are expected to have prepared the cache state explicitly.
Increments the prefill_token_count counter by prompt_tokens.len() on success.
pub fn decode_step(&mut self, token: u32, pos: usize) -> RuntimeResult<Vec<f32>>
Forward one token at the given absolute position.
pub fn sample(&mut self, logits: &[f32]) -> RuntimeResult<u32>
Sample one token from logits using the engine’s current sampler.
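Example
Together, prefill_from_pos, decode_step, and sample form the manual decode loop that generate automates. A hedged sketch, assuming the KV cache has been prepared (e.g. freshly reset or restored from a prefix cache) and that eos_token is supplied by the caller:

// Manual prefill + decode loop against a prepared cache.
let mut logits = engine.prefill_from_pos(&prompt_tokens, 0)?;
let mut pos = prompt_tokens.len();
let mut generated = Vec::new();
for _ in 0..max_tokens {
    let token = engine.sample(&logits)?;
    if token == eos_token { // eos_token: caller-provided, not part of this API
        break;
    }
    generated.push(token);
    logits = engine.decode_step(token, pos)?;
    pos += 1;
}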
pub fn prefill_token_count(&self) -> u64
Cumulative number of tokens that have been processed by InferenceEngine::prefill_from_pos over this engine’s lifetime.
pub fn stats(&self) -> &Arc<EngineStats>
Get a shared reference to the engine statistics.
pub fn active_sessions(&self) -> usize
Number of currently active sessions (tracked via stats).
pub fn session_count(&self) -> u64
Total number of completed requests (tracked via stats).
pub fn batch_generate(
    &mut self,
    prompts: &[Vec<u32>],
    max_tokens: usize,
) -> Vec<RuntimeResult<BatchResult>>
Process a batch of prompts, delegating to batch_engine::batch_generate.
Resets the engine state between each prompt. Returns one result per prompt.
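Example
A hedged sketch; the prompt token IDs are placeholders that would normally come from a tokenizer:

let prompts: Vec<Vec<u32>> = vec![vec![1, 2, 3], vec![4, 5]]; // placeholder IDs
for result in engine.batch_generate(&prompts, 64) {
    match result {
        Ok(batch) => { /* inspect this prompt's BatchResult */ }
        Err(e) => eprintln!("prompt failed: {e}"),
    }
}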
pub fn generate(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt.
Runs prefill (processing the entire prompt), then decodes token by token until max_tokens or EOS is reached.
Returns the generated token IDs (not including the prompt).
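Example
A minimal usage sketch; the token IDs are placeholders that would normally come from a tokenizer:

let prompt: Vec<u32> = vec![1, 2, 3]; // placeholder token IDs
let completion = engine.generate(&prompt, 128)?;
// `completion` contains only the newly generated IDs, not the prompt.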
pub fn generate_tracked(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tracker: &mut RequestRateTracker,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt while populating a RequestRateTracker.
Behaves identically to InferenceEngine::generate but additionally:
- records record_admission() immediately on entry,
- records record_first_token() for the first sampled token,
- records record_token() for every subsequent sampled token,
- on success, pushes the resulting RequestRateSnapshot into the engine’s attached RequestRateAggregator (if any).
The tracker is borrowed mutably so callers can inspect intermediate state via RequestRateTracker::snapshot after the call returns.
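Example
A hedged sketch; RequestRateTracker::new() is an assumed constructor (only RequestRateTracker::snapshot is referenced above):

let mut tracker = RequestRateTracker::new(); // assumed constructor
let tokens = engine.generate_tracked(&prompt, 128, &mut tracker)?;
let snapshot = tracker.snapshot(); // inspect per-request admission/token timings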
pub fn generate_with_request_id(
    &mut self,
    request_id: RequestId,
    prompt_tokens: &[u32],
    max_tokens: usize,
) -> RuntimeResult<(Vec<u32>, RequestRateTracker)>
Generate tokens from a prompt, with a RequestId tagging the surrounding tracing span and an internally managed RequestRateTracker.
Returns both the generated tokens and the final tracker so callers can extract per-request metrics (e.g. queue-wait, p95 inter-token latency) for client-side observability.
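Example
A hedged sketch; RequestId::new() is a hypothetical stand-in for however request IDs are actually constructed:

let request_id = RequestId::new(); // hypothetical constructor
let (tokens, tracker) = engine.generate_with_request_id(request_id, &prompt, 128)?;
let snapshot = tracker.snapshot(); // e.g. queue-wait, inter-token latency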
pub fn generate_with_seed(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    seed: u64,
    params: &SamplingParams,
) -> RuntimeResult<Vec<u32>>
Generate tokens from a prompt using a specific seed for this run.
Temporarily overrides the sampler seed for deterministic multi-completion generation (n > 1). The sampler state is replaced for the duration of this call and then restored.
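Example
A hedged sketch of deterministic multi-completion generation; deriving per-completion seeds from a base seed is one possible scheme, and SamplingParams::default() is assumed:

let params = SamplingParams::default(); // assumed Default impl
let base_seed = 42u64;
// Rerunning with the same base seed reproduces the same three completions.
let completions: Vec<_> = (0..3u64)
    .map(|i| engine.generate_with_seed(&prompt, 128, base_seed + i, &params))
    .collect();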
pub fn generate_streaming(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tx: &UnboundedSender<u32>,
) -> RuntimeResult<usize>
Generate tokens one at a time, sending each through the channel. Returns the total count of generated tokens.
Not available on WASM targets (tokio channels not supported on wasm32-unknown-unknown).
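Example
A hedged sketch with a tokio unbounded channel. The call itself blocks, so a real server would drain the receiver concurrently; here the tokens are drained after the fact for simplicity:

let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u32>();
let count = engine.generate_streaming(&prompt, 128, &tx)?;
drop(tx); // close the channel so the drain loop below terminates
while let Some(token) = rx.blocking_recv() {
    // forward `token` to the client
}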
pub fn generate_streaming_sync(
    &mut self,
    prompt_tokens: &[u32],
    max_tokens: usize,
    tx: &Sender<u32>,
) -> RuntimeResult<usize>
Streaming generation using a synchronous std::sync::mpsc::Sender.
Each generated token is sent through the channel immediately, allowing the consumer to print tokens as they arrive without requiring a tokio runtime.
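Example
A hedged sketch pairing the engine with a consumer thread, so tokens print as they arrive:

use std::sync::mpsc;
use std::thread;

let (tx, rx) = mpsc::channel::<u32>();
let printer = thread::spawn(move || {
    for token in rx {
        println!("{token}"); // decode with a tokenizer in practice
    }
});
let count = engine.generate_streaming_sync(&prompt, 128, &tx)?;
drop(tx); // close the channel so the printer thread exits
printer.join().unwrap();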
impl InferenceEngine<'static>
pub fn from_gguf_path(
    path: impl AsRef<Path>,
    sampling_params: SamplingParams,
    seed: u64,
    max_seq_len: usize,
) -> RuntimeResult<Self>
Load an InferenceEngine directly from a path to a GGUF file.
This is a convenience wrapper intended for server/CLI entry points that need an owned, 'static engine. It memory-maps the file, parses the GGUF container, and leaks both allocations so that the borrowed GgufFile<'a> lifetime can be promoted to 'static.
The leaked memory is intentional: the engine is expected to live for the process lifetime. Do not call this in hot paths.
Errors
Returns RuntimeError::FileNotFound if path does not exist. Other IO / parse / model-init errors propagate through RuntimeError.
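Example
A hedged startup sketch; the model path is a placeholder and SamplingParams::default() is assumed:

// Memory-maps and intentionally leaks the file; call once at startup.
let mut engine = InferenceEngine::from_gguf_path(
    "models/qwen3.gguf",       // placeholder path
    SamplingParams::default(), // assumed Default impl
    42,                        // seed
    4096,                      // max_seq_len
)?;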
Auto Trait Implementations
impl<'a> Freeze for InferenceEngine<'a>
impl<'a> RefUnwindSafe for InferenceEngine<'a>
impl<'a> Send for InferenceEngine<'a>
impl<'a> Sync for InferenceEngine<'a>
impl<'a> Unpin for InferenceEngine<'a>
impl<'a> UnsafeUnpin for InferenceEngine<'a>
impl<'a> UnwindSafe for InferenceEngine<'a>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.