
car-inference

Local and remote model inference for the Common Agent Runtime.

What it does

Provides on-device inference using Candle (Metal on macOS, CUDA on Linux) with Qwen3 models downloaded from HuggingFace on first use. Also supports remote APIs (OpenAI, Anthropic, Google) via the same typed ModelSchema interface. The AdaptiveRouter selects the best model using a filter-score-explore strategy, learning from outcomes over time via OutcomeTracker.
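
The routing loop can be pictured in three steps: filter out models that cannot serve the request, score the survivors with the quality estimates OutcomeTracker has learned, and occasionally explore a non-top candidate so those estimates stay fresh. The sketch below is purely illustrative; the struct and function names are hypothetical and do not mirror the crate's actual API.

// Illustrative filter-score-explore loop; hypothetical types, not the crate's API.
struct Candidate {
    id: &'static str,
    tags: &'static [&'static str],
    score: f32, // learned quality estimate, updated from past outcomes
}

fn pick<'a>(models: &'a [Candidate], required_tag: &str, explore_rate: f32) -> Option<&'a Candidate> {
    // Filter: drop anything that cannot serve the request at all.
    let mut viable: Vec<&Candidate> =
        models.iter().filter(|m| m.tags.contains(&required_tag)).collect();
    if viable.is_empty() {
        return None;
    }
    // Score: rank the survivors by their learned estimate, best first.
    viable.sort_by(|a, b| b.score.total_cmp(&a.score));
    // Explore: sometimes pick the runner-up so outcome data keeps flowing.
    if viable.len() > 1 && pseudo_random_unit() < explore_rate {
        return Some(viable[1]);
    }
    Some(viable[0])
}

// Cheap stand-in for a proper RNG, good enough for a sketch.
fn pseudo_random_unit() -> f32 {
    use std::time::{SystemTime, UNIX_EPOCH};
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
    (nanos % 1_000) as f32 / 1_000.0
}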

Usage

use car_inference::{InferenceEngine, InferenceConfig, GenerateRequest, GenerateParams};

// Inside an async context; `?` propagates the crate's error type.
let engine = InferenceEngine::new(InferenceConfig::default());
let result = engine.generate(GenerateRequest {
    prompt: "Explain quicksort".into(),
    params: GenerateParams::default(),
    ..Default::default()
}).await?;

Apple FoundationModels backend (macOS 26+)

On Apple Silicon Macs running macOS 26 or later with Apple Intelligence provisioned, CAR can route inference to the system LLM through Apple's FoundationModels framework: no model weights to download and no API key to manage. The OS owns everything.

The integration is a small Swift shim (car-inference/swift/CarFoundationModels.swift) compiled by build.rs and linked into the crate. The framework is weak-linked so the produced binary still loads on pre-26 macOS; a runtime availability check (is_available(), cached for 5 seconds) gates calls.
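
The gate itself is just a timestamped cache around the probe. The sketch below shows that shape under stated assumptions; probe_foundation_models() is a placeholder standing in for the FFI call into the Swift shim, not the crate's real symbol.

use std::sync::Mutex;
use std::time::{Duration, Instant};

// Placeholder for the FFI call into the Swift shim; the real probe asks the OS.
fn probe_foundation_models() -> bool {
    cfg!(target_os = "macos")
}

static AVAILABILITY: Mutex<Option<(Instant, bool)>> = Mutex::new(None);
const TTL: Duration = Duration::from_secs(5);

// Returns the cached answer if it is younger than 5 seconds, else re-probes.
fn is_available() -> bool {
    let mut cache = AVAILABILITY.lock().unwrap();
    if let Some((checked_at, value)) = *cache {
        if checked_at.elapsed() < TTL {
            return value;
        }
    }
    let value = probe_foundation_models();
    *cache = Some((Instant::now(), value));
    value
}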

// Routing happens automatically — request the apple/foundation:default
// model id, or let the AdaptiveRouter pick it for you.
let result = engine.generate(GenerateRequest {
    prompt: "summarize this in one sentence: ...".into(),
    model_id: Some("apple/foundation:default".into()),
    ..Default::default()
}).await?;

The catalog entry is tagged ["builtin", "local", "low_latency", "private"]. The adaptive router applies SYSTEM_LLM_BONUS (0.12) to models tagged both low_latency and private, so system-owned models compete fairly with MLX 4B for short fast-turn tasks (autocomplete, summarize, classify) without claiming heavy reasoning workloads they can't serve.
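
In scoring terms the bonus is a small additive nudge that only fires when both tags are present. The snippet below restates that rule for illustration; it is not the router's actual scoring code.

const SYSTEM_LLM_BONUS: f32 = 0.12;

// Restatement of the bonus rule for illustration only, not the router's real code.
fn score_with_system_bonus(base_score: f32, tags: &[&str]) -> f32 {
    if tags.contains(&"low_latency") && tags.contains(&"private") {
        base_score + SYSTEM_LLM_BONUS
    } else {
        base_score
    }
}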

What's wired in v1:

  • Single-shot text generation via generate()
  • Token-by-token streaming via stream(), with prefix-diffing on Apple's cumulative snapshots (see the sketch after this list); StreamEvent::Done.text carries the full assembled output, matching the Candle/MLX shape.
  • Graceful fallthrough on tools / vision / audio / video — returns InferenceError::UnsupportedMode so the router picks the next candidate instead of silently dropping capabilities.
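
Apple's streaming API delivers cumulative snapshots (each event carries the full text so far), so the bridge forwards only the new suffix of each snapshot as a delta. A minimal sketch of that prefix diff, with illustrative names rather than the crate's streaming types:

// Turn cumulative snapshots ("He", "Hello", ...) into per-event deltas.
fn prefix_diff(previous: &str, snapshot: &str) -> String {
    match snapshot.strip_prefix(previous) {
        Some(delta) => delta.to_string(),
        // Snapshot diverged from what was already emitted; resend it whole.
        None => snapshot.to_string(),
    }
}

fn main() {
    let snapshots = ["He", "Hello", "Hello, wor", "Hello, world"];
    let mut latest = String::new();
    for s in snapshots {
        let delta = prefix_diff(&latest, s);
        print!("{delta}");        // stream the delta as it arrives
        latest = s.to_string();   // remember the latest cumulative snapshot
    }
    println!();                   // `latest` now holds the full assembled text
}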

What isn't:

  • Tool calling. The Tool protocol takes Arguments: Generable (a statically typed, macro-derived Swift protocol); bridging dynamic JSON schemas requires either DynamicGenerationSchema (macOS 26-only) or a single-dispatch shim. Design notes are in the module docs of backend/foundation_models.rs.

Build requirements:

  • Full Xcode (not just Command Line Tools) on the build host — needed for xcrun swiftc and the FoundationModels SDK.
  • macOS 15+ deployment floor (overridable via MACOSX_DEPLOYMENT_TARGET).

Linux and Intel-Mac builds skip the Swift compile entirely; the ModelSource::AppleFoundationModels schema variant still serializes (so registries can describe the model on any platform), but dispatch errors out with UnsupportedMode before reaching the bridge.

Crate features

  • metal -- Apple Silicon GPU acceleration
  • cuda -- NVIDIA GPU acceleration
  • ast -- AST-aware code generation via car-ast

Part of CAR -- see the main repo for full documentation.