car-inference 0.14.0

Local model inference for CAR — Candle backend with Qwen3 models
//! `InferenceHandle` — minimal trait surface for embedders that hold
//! a "thing that can do inference" without committing to the concrete
//! `InferenceEngine`.
//!
//! Motivation: pre-v0.8, `MemgineEngine` held `Arc<InferenceEngine>`
//! directly. Every binary that wanted memgine — `car-cli`'s
//! `cmd_distill` / `cmd_reason` / `cmd_dream`, the bench harness,
//! the eval bridge — therefore had to instantiate a full in-process
//! engine even when a daemon was running and serving the same
//! capabilities over WebSocket. That was a v0.7 holdover, and it was
//! costly: cold-start latency, model weights loaded twice in memory,
//! and a CLI tracker that couldn't see any of the daemon's
//! accumulated outcome data.
//!
//! This trait lets memgine accept any handle that satisfies the
//! narrow surface it actually uses (`generate` + `embed`). The
//! concrete `InferenceEngine` implements it for the in-process
//! path; downstream binaries can provide their own daemon-proxy
//! implementation that dispatches each call over the daemon's
//! existing `infer` / `embed` JSON-RPC methods, so no second engine
//! is needed.
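//!
//! As a sketch only (not a shipped implementation), a daemon-proxy
//! handle could look roughly like the following. `DaemonClient` and
//! its `call` method are hypothetical stand-ins for whatever RPC
//! client the binary already holds, and the `InferenceHandle` import
//! path is assumed:
//!
//! ```rust,ignore
//! use car_inference::tasks::{EmbedRequest, GenerateRequest};
//! use car_inference::{InferenceError, InferenceHandle}; // re-export path assumed
//!
//! /// Hypothetical proxy that forwards inference calls to a running daemon.
//! struct DaemonProxy {
//!     client: DaemonClient, // assumed JSON-RPC-over-WebSocket client
//! }
//!
//! #[async_trait::async_trait]
//! impl InferenceHandle for DaemonProxy {
//!     async fn generate(&self, req: GenerateRequest) -> Result<String, InferenceError> {
//!         // Dispatch over the daemon's existing `infer` method instead of
//!         // instantiating a second in-process engine. Assumes the request
//!         // types serialize for the wire and the client maps RPC failures
//!         // into `InferenceError`.
//!         self.client.call("infer", &req).await
//!     }
//!
//!     async fn embed(&self, req: EmbedRequest) -> Result<Vec<Vec<f32>>, InferenceError> {
//!         // Same pattern for embeddings via the daemon's `embed` method.
//!         self.client.call("embed", &req).await
//!     }
//! }
//! ```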
//!
//! Tracked in Parslee-ai/car#188.

use crate::tasks::{EmbedRequest, GenerateRequest};
use crate::InferenceError;

/// Inference operations memgine (and other embedders) need.
///
/// Implementations must be `Send + Sync` because memgine holds the
/// handle in an `Arc` and reaches it from `&self` methods that may
/// run inside `tokio::spawn` tasks during consolidation passes.
///
/// **Why these two methods.** The CLI / memgine call sites only
/// reach `generate` (for skill distillation, reasoning, dream
/// consolidation) and `embed` (for semantic similarity in
/// retrieval + the speculative summary pre-compute). Other engine
/// methods — classification, routing, tokenization, image / video
/// generation — are reached either through the daemon directly
/// or via the concrete engine path. Adding them to the trait
/// would broaden the surface a daemon-proxy implementation must
/// cover without benefiting memgine.
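///
/// # Example
///
/// A sketch of the holding pattern described above; the `Memgine`
/// struct and `spawn_consolidation` method here are illustrative,
/// not the real memgine types, and the `InferenceHandle` import
/// path is assumed:
///
/// ```rust,ignore
/// use std::sync::Arc;
///
/// use car_inference::tasks::GenerateRequest;
/// use car_inference::InferenceHandle; // re-export path assumed
///
/// struct Memgine {
///     // `Send + Sync` on the trait is what lets this handle cross
///     // into spawned tasks.
///     inference: Arc<dyn InferenceHandle>,
/// }
///
/// impl Memgine {
///     fn spawn_consolidation(&self, req: GenerateRequest) {
///         let handle = Arc::clone(&self.inference);
///         tokio::spawn(async move {
///             // Works the same whether `handle` wraps the in-process
///             // engine or a daemon proxy.
///             let _text = handle.generate(req).await;
///         });
///     }
/// }
/// ```
///
/// The in-process path satisfies the same bound: an
/// `Arc<InferenceEngine>` coerces to `Arc<dyn InferenceHandle>` at
/// the construction site.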
#[async_trait::async_trait]
pub trait InferenceHandle: Send + Sync {
    /// Run a generation request to completion. Same contract as
    /// `InferenceEngine::generate`: the caller passes a `GenerateRequest`
    /// (which may carry an explicit `model`, a routing hint, tools, or a
    /// thinking budget) and receives the final text or an
    /// `InferenceError`.
    async fn generate(&self, req: GenerateRequest) -> Result<String, InferenceError>;

    /// Encode one or more texts as embedding vectors. Same contract
    /// as `InferenceEngine::embed`: returns one `Vec<f32>` per input
    /// text in the same order.
    async fn embed(&self, req: EmbedRequest) -> Result<Vec<Vec<f32>>, InferenceError>;
}

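// In-process path: forward each trait method to `InferenceEngine`'s
// inherent method of the same name. The fully-qualified calls make
// it explicit that this does not recurse back into the trait.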
#[async_trait::async_trait]
impl InferenceHandle for crate::InferenceEngine {
    async fn generate(&self, req: GenerateRequest) -> Result<String, InferenceError> {
        crate::InferenceEngine::generate(self, req).await
    }

    async fn embed(&self, req: EmbedRequest) -> Result<Vec<Vec<f32>>, InferenceError> {
        crate::InferenceEngine::embed(self, req).await
    }
}