Skip to main content

InferenceEngine

Struct InferenceEngine 

Source
pub struct InferenceEngine {
    pub config: InferenceConfig,
    pub unified_registry: UnifiedRegistry,
    pub adaptive_router: AdaptiveRouter,
    pub outcome_tracker: Arc<RwLock<OutcomeTracker>>,
    pub registry: ModelRegistry,
    pub router: ModelRouter,
    /* private fields */
}
Expand description

The main inference engine. Thread-safe, lazily loads models.

Now includes the unified registry, adaptive router, and outcome tracker for schema-driven model selection with learned performance profiles.

Fields§

§config: InferenceConfig§unified_registry: UnifiedRegistry

Unified model registry (local + remote).

§adaptive_router: AdaptiveRouter

Adaptive router with three-phase selection.

§outcome_tracker: Arc<RwLock<OutcomeTracker>>

Outcome tracker for learning from results.

§registry: ModelRegistry§router: ModelRouter

Implementations§

Source§

impl InferenceEngine

Source

pub fn new(config: InferenceConfig) -> Self

Source

pub async fn init_key_pool(&self)

Initialize key pool: register keys from all remote models and load persisted stats. Call this after construction (requires async).

Source

pub async fn warm_up<S: AsRef<str>>( &self, _schema_ids: &[S], ) -> Vec<Result<(), InferenceError>>

No-op on non-macOS — MLX doesn’t run here.

Source

pub async fn route_adaptive(&self, prompt: &str) -> AdaptiveRoutingDecision

Route a prompt using the adaptive router (new). Returns full decision context.

Source

pub fn route(&self, prompt: &str) -> RoutingDecision

Route a prompt to the best model without executing (legacy compat).

Source

pub fn estimated_tokens( &self, req: &GenerateRequest, model_id: Option<&str>, ) -> (usize, usize, bool)

Estimate token count for a request against a specific model’s context window. Returns (estimated_input_tokens, context_window_tokens, fits).

Source

pub async fn generate_tracked( &self, req: GenerateRequest, ) -> Result<InferenceResult, InferenceError>

Generate text from a prompt with outcome tracking. Returns InferenceResult with trace_id for reporting outcomes.

Source

pub async fn generate_tracked_stream( &self, req: GenerateRequest, ) -> Result<Receiver<StreamEvent>, InferenceError>

Stream a response with real-time token output.

Returns a channel receiver yielding StreamEvent variants:

  • TextDelta(String) — partial text token
  • ToolCallStart { name, index, id } — tool call begins
  • ToolCallDelta { index, arguments_delta } — partial tool arguments

Use StreamAccumulator to collect events into a final result.

§Local streaming behavior

Local models (both MLX and Candle backends) emit true incremental TextDelta events as each token is generated. This enables:

  • Token-by-token UI updates
  • Overlapping speech synthesis with generation (for voice apps)
  • Early cancellation when a stop sequence is detected

The channel capacity is 64 events, providing buffering for burst tokens without blocking the generation loop.

§Example: voice app integration
let mut rx = engine.generate_tracked_stream(req).await?;
let mut text_buf = String::new();
while let Some(event) = rx.recv().await {
    match event {
        StreamEvent::TextDelta(delta) => {
            text_buf.push_str(&delta);
            // Feed text_buf to TTS when a sentence boundary is reached
        }
        StreamEvent::Done { text, .. } => break,
        _ => {}
    }
}
Source

pub async fn route_context_snapshot( &self, prompt: &str, workload: RoutingWorkload, has_tools: bool, has_vision: bool, ) -> AdaptiveRoutingDecision

Route a prompt using the adaptive router without executing inference.

Source

pub async fn generate( &self, req: GenerateRequest, ) -> Result<String, InferenceError>

Generate text from a prompt (legacy API, no outcome tracking). When req.model is None, uses intelligent routing based on prompt complexity.

Source

pub async fn tokenize( &self, model: &str, text: &str, ) -> Result<Vec<u32>, InferenceError>

Encode text via the named model’s tokenizer. Returns raw token IDs without any chat-template wrapping or BOS-prepending — pair with Self::detokenize for the round-trip property detokenize(model, tokenize(model, s)) == s for any UTF-8 s.

Only local models have a tokenizer the runtime can call directly (Candle/GGUF on Linux/Windows, MLX on Apple Silicon). For remote models the call returns InferenceError::UnsupportedMode — provider tokenizer endpoints vary too widely to be portable here, and bundling tiktoken-style tables would lock the registry to a fixed set of providers.

Source

pub async fn detokenize( &self, model: &str, tokens: &[u32], ) -> Result<String, InferenceError>

Inverse of Self::tokenize: decode token IDs back to text.

Source

pub async fn embed( &self, req: EmbedRequest, ) -> Result<Vec<Vec<f32>>, InferenceError>

Generate embeddings for text using the dedicated embedding model. On Apple Silicon, uses the native MLX backend; on other platforms, uses Candle.

Source

pub async fn rerank( &self, req: RerankRequest, ) -> Result<RerankResult, InferenceError>

Rerank candidate documents against a query using a cross-encoder reranker model (Qwen3-Reranker family). Returns documents sorted by descending relevance.

§Scoring

Qwen3-Reranker is a Qwen3 base LM fine-tuned so that the first assistant token is "yes" or "no" given the templated <Instruct>/<Query>/<Document> user turn. We run a short greedy decode (≤ 3 tokens, so a leading space, BOS artifact, or the occasional newline don’t break us) and score yes → 1.0, no → 0.0, anything else → 0.5 with a warning.

This is a binary score — the soft probability softmax(logit_yes, logit_no) would give finer ordering but requires per-token logit access on [backend::MlxBackend], which isn’t exposed publicly yet. Tracked as a follow-up; binary scores still produce a correct partial ordering, just with coarser tiebreaks within the {yes} or {no} groups.

§Prompt template

We emit the upstream Qwen3-Reranker chat template verbatim: a dedicated system prompt fixing the yes/no answer space, then the user turn with <Instruct>/<Query>/<Document>, then the assistant prefix with a closed empty <think> block to suppress thinking (reranker is not a reasoner — it’s a classifier). Deviating from this template produces sharply degraded yes/no distributions.

Source

pub async fn ground( &self, req: GroundRequest, ) -> Result<GroundResult, InferenceError>

Dedicated endpoint for structured visual grounding.

Runs a VL generate call under the hood and parses Qwen2.5-VL’s inline <|object_ref_*|>...<|box_*|>(x1,y1),(x2,y2) spans into typed BoundingBoxes. Distinct from the generic InferenceEngine::generate + InferenceResult.bounding_boxes path so callers can express “I want boxes” as a first-class intent — which also lets the router prefer models that declare the Grounding capability.

Source

pub async fn classify( &self, req: ClassifyRequest, ) -> Result<Vec<ClassifyResult>, InferenceError>

Classify text against candidate labels. When req.model is None, routes to the smallest available model.

Source

pub async fn transcribe( &self, req: TranscribeRequest, ) -> Result<TranscribeResult, InferenceError>

Transcribe an audio file using the best available STT model.

Source

pub async fn synthesize( &self, req: SynthesizeRequest, ) -> Result<SynthesizeResult, InferenceError>

Synthesize speech using the best available TTS model.

Source

pub async fn generate_image( &self, req: GenerateImageRequest, ) -> Result<GenerateImageResult, InferenceError>

Generate an image using the best available local MLX image model.

Source

pub async fn generate_image_batch( &self, req: GenerateImageRequest, ) -> Result<Vec<GenerateImageResult>, InferenceError>

Generate one or more variants in a single call.

Returns req.variant_count results (defaulting to 1). The current MLX Flux backend doesn’t support native batching, so this loops over generate_image with the seed advanced per variant for visual diversity. A future hosted backend (gpt-image-2, Replicate) can short-circuit this with one network call producing N coherent images.

Per-variant errors abort the batch — there’s no partial- success semantics today. Callers needing more lenient behaviour should call generate_image directly in their own loop.

Closes #110.

Source

pub async fn generate_video( &self, req: GenerateVideoRequest, ) -> Result<GenerateVideoResult, InferenceError>

Generate a video using the best available local MLX video model.

Source

pub fn list_models_unified(&self) -> Vec<ModelInfo>

List all known models and their status (new registry).

Source

pub fn available_model_upgrades(&self) -> Vec<ModelUpgrade>

Report installed models that have curated newer replacements.

Source

pub async fn check_upgrade_nudge( &self, inference_active: bool, ) -> (NudgeDecision, NudgeState)

The proactive-upgrade decision for right now: which curated upgrades to auto-apply (under Auto policy) and the single nudge to surface, with throttling and dismissals applied. The daemon calls this on its periodic check and broadcasts decision.nudge over WebSocket. Returns the loaded NudgeState too so the caller can stamp last_nudge_secs after sending.

Source

pub fn dismiss_upgrade_nudge( &self, dismiss_key: &str, ) -> Result<(), InferenceError>

Record that the user dismissed a nudge (by its dismiss_key), so it is never surfaced again. Persists to ~/.car/nudge-state.json.

Source

pub async fn detect_upgrades(&self) -> Vec<UpgradeFinding>

Detect upgrades combining curated rules with upstream Hub discovery, honoring update preferences (channel/policy) and the TTL cache. Upstream probing only happens on the Latest channel and is offline-safe.

Source

pub fn list_schemas(&self) -> Vec<ModelSchema>

List all known models and their download status (legacy). List all model schemas from the unified registry (full metadata).

Source

pub fn list_models(&self) -> Vec<ModelInfo>

Source

pub async fn pull_model(&self, name: &str) -> Result<PathBuf, InferenceError>

Download a model if not already present.

Source

pub async fn pull_model_with_progress( &self, name: &str, sink: &ProgressSink, ) -> Result<PathBuf, InferenceError>

Download a model if not already present, reporting progress to sink and enforcing the acquisition lifecycle (per-model lock, disk preflight, lifecycle events). The CLI and daemon use this to show live progress.

Source

pub fn update_prefs(&self) -> UpdatePreferences

Current update preferences. A team-shared project .car/update-prefs.json (found by walking up from cwd) overrides the user ~/.car/update-prefs.json; defaults if neither exists. Loaded on demand — read at onboarding/ upgrade-check frequency, not on the inference hot path.

Source

pub fn set_update_prefs( &self, prefs: &UpdatePreferences, ) -> Result<(), InferenceError>

Persist update preferences to ~/.car/update-prefs.json.

Source

pub fn remove_model(&self, name: &str) -> Result<(), InferenceError>

Remove a downloaded model.

Source

pub fn register_model(&mut self, schema: ModelSchema)

Register a model in the unified registry.

Source

pub async fn discover_vllm_mlx_models(&mut self) -> usize

Discover generic MLX models from a running vLLM-MLX server and register them. Returns the number of discovered models added or refreshed in the registry.

Source

pub fn outcome_tracker(&self) -> Arc<RwLock<OutcomeTracker>>

Get outcome tracker for external use (e.g., memgine integration).

Source

pub async fn save_outcomes(&self) -> Result<(), Error>

Persist outcome profiles to disk for cross-session learning (#13).

Source

pub async fn save_key_pool_stats(&self) -> Result<(), Error>

Persist key pool stats to disk.

Source

pub async fn key_pool_stats(&self) -> HashMap<String, Vec<KeyStats>>

Get key pool stats for all endpoints.

Source

pub async fn export_profiles(&self) -> Vec<ModelProfile>

Export model performance profiles for persistence.

Source

pub async fn import_profiles(&self, profiles: Vec<ModelProfile>)

Import model performance profiles (from persistence).

Source

pub async fn prepare_speech_runtime(&self) -> Result<PathBuf, InferenceError>

Ensure the managed local speech runtime exists and return its root directory. On Apple Silicon, speech uses native MLX backends and no Python runtime is needed.

Source

pub fn set_speech_policy(&mut self, policy: SpeechPolicy)

Override speech routing preferences for the current engine instance.

Source

pub fn set_routing_config(&mut self, config: RoutingConfig)

Source

pub async fn install_curated_speech( &mut self, ) -> Result<Vec<SpeechInstallReport>, InferenceError>

Download the curated local speech model set into the shared Hugging Face cache.

Source

pub fn speech_health(&self) -> SpeechHealthReport

Report speech runtime, model cache, and remote-provider health.

Source

pub async fn model_health(&self) -> ModelHealthReport

Report the current model catalog, configured defaults, capability coverage, and speech runtime/provider health in one place.

Source

pub async fn smoke_test_speech( &self, local: bool, remote: bool, ) -> Result<SpeechSmokeReport, InferenceError>

Run a real speech smoke test through the configured local and/or remote paths.

Trait Implementations§

Source§

impl InferenceHandle for InferenceEngine

Source§

fn generate<'life0, 'async_trait>( &'life0 self, req: GenerateRequest, ) -> Pin<Box<dyn Future<Output = Result<String, InferenceError>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait,

Run a generation request to completion. Same contract as InferenceEngine::generate: caller passes a GenerateRequest (which may carry an explicit model, a routing hint, tools, or a thinking budget), receives the final text or an InferenceError.
Source§

fn embed<'life0, 'async_trait>( &'life0 self, req: EmbedRequest, ) -> Pin<Box<dyn Future<Output = Result<Vec<Vec<f32>>, InferenceError>> + Send + 'async_trait>>
where Self: 'async_trait, 'life0: 'async_trait,

Encode one or more texts as embedding vectors. Same contract as InferenceEngine::embed: returns one Vec<f32> per input text in the same order.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> ErasedDestructor for T
where T: 'static,