# car-inference

Local and remote model inference for the [Common Agent Runtime](https://github.com/Parslee-ai/car).

## What it does

Provides on-device inference using Candle (Metal on macOS, CUDA on Linux) with Qwen3 models
downloaded from HuggingFace on first use. Also supports remote APIs (OpenAI, Anthropic, Google)
via the same typed `ModelSchema` interface. The `AdaptiveRouter` selects the best model using
a filter-score-explore strategy, learning from outcomes over time via `OutcomeTracker`.

## Usage

```rust
use car_inference::{InferenceEngine, InferenceConfig, GenerateRequest, GenerateParams};

let engine = InferenceEngine::new(InferenceConfig::default());
let result = engine.generate(GenerateRequest {
    prompt: "Explain quicksort".into(),
    params: GenerateParams::default(),
    ..Default::default()
}).await?;
```
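
Remote providers go through the same typed request; continuing from the
snippet above, only the model id changes. The id string here is illustrative,
not a guaranteed catalog entry:

```rust
let result = engine.generate(GenerateRequest {
    prompt: "Explain quicksort".into(),
    params: GenerateParams::default(),
    // Illustrative id; check the model catalog for the identifiers your
    // registry exposes for the OpenAI / Anthropic / Google backends.
    model_id: Some("openai/gpt-4o-mini".into()),
    ..Default::default()
}).await?;
```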

## Apple FoundationModels backend (macOS 26+)

On Apple Silicon Macs running macOS 26 or later with Apple Intelligence
provisioned, CAR can route inference to the **system LLM** through
Apple's FoundationModels framework: no weights to download, no API key,
no model files on disk. The OS owns everything.

The integration is a small Swift shim
(`car-inference/swift/CarFoundationModels.swift`) compiled by
`build.rs` and linked into the crate. The framework is **weak-linked**
so the produced binary still loads on pre-26 macOS; a runtime
availability check (`is_available()`, cached for 5 seconds) gates
calls.
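
The gate itself is small enough to sketch. Everything below is illustrative
(the hypothetical `shim_model_available()` stands in for the call into the
Swift shim); the point is the short-lived cache in front of the OS query:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Placeholder for the FFI call into the Swift shim that asks
// FoundationModels whether the system model is provisioned.
fn shim_model_available() -> bool {
    false
}

// Cached availability: re-query the OS at most once every 5 seconds.
static AVAILABILITY: Mutex<Option<(Instant, bool)>> = Mutex::new(None);

fn is_available() -> bool {
    let mut cached = AVAILABILITY.lock().unwrap();
    if let Some((checked_at, available)) = *cached {
        if checked_at.elapsed() < Duration::from_secs(5) {
            return available;
        }
    }
    let available = shim_model_available();
    *cached = Some((Instant::now(), available));
    available
}
```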

```rust
// Routing happens automatically — request the apple/foundation:default
// model id, or let the AdaptiveRouter pick it for you.
let result = engine.generate(GenerateRequest {
    prompt: "summarize this in one sentence: ...".into(),
    model_id: Some("apple/foundation:default".into()),
    ..Default::default()
}).await?;
```

The catalog entry is tagged `["builtin", "local", "low_latency", "private"]`.
The adaptive router applies a `SYSTEM_LLM_BONUS` (0.12) to models tagged
both `low_latency` and `private`, so system-owned models compete fairly
with MLX 4B for short fast-turn tasks (autocomplete, summarize, classify)
without claiming heavy reasoning workloads they can't serve.
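
A rough sketch of that rule follows. The only value taken from the crate is
`SYSTEM_LLM_BONUS` (0.12); the candidate type and the rest of the
filter-score-explore pipeline are stand-ins:

```rust
const SYSTEM_LLM_BONUS: f32 = 0.12;

// Stand-in for a catalog entry as the router might see it.
struct Candidate {
    tags: Vec<String>,
    base_score: f32,
}

fn score(candidate: &Candidate, short_fast_turn_task: bool) -> f32 {
    let has = |t: &str| candidate.tags.iter().any(|tag| tag.as_str() == t);
    let mut score = candidate.base_score;
    // The bonus only applies to models that are both low-latency and private,
    // and only for short fast-turn tasks.
    if short_fast_turn_task && has("low_latency") && has("private") {
        score += SYSTEM_LLM_BONUS;
    }
    score
}
```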

**What's wired in v1:**
- Single-shot text generation via `generate()`
- Token-by-token streaming via `stream()` with prefix-diffing on
  Apple's cumulative snapshots (sketched after this list);
  `StreamEvent::Done.text` carries the full assembled output, matching
  the Candle/MLX shape.
- Graceful fallthrough on tools / vision / audio / video — returns
  `InferenceError::UnsupportedMode` so the router picks the next
  candidate instead of silently dropping capabilities.
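
Prefix-diffing here means each streamed delta is the suffix of the latest
cumulative snapshot beyond what was already emitted. A minimal sketch of that
step, independent of the crate's actual streaming types:

```rust
// Snapshots "Hel", "Hello", "Hello, wo" stream as deltas "Hel", "lo", ", wo";
// the Done event still carries the full assembled text.
fn next_delta<'a>(previous: &str, snapshot: &'a str) -> &'a str {
    match snapshot.strip_prefix(previous) {
        Some(delta) => delta,
        // The snapshot no longer extends what was already emitted (e.g. the
        // model revised earlier text); fall back to the whole snapshot.
        None => snapshot,
    }
}
```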

**What isn't:**
- Tool calling. The `Tool` protocol takes `Arguments: Generable` (a
  Swift-static, macro-derived protocol); bridging dynamic JSON
  schemas requires either `DynamicGenerationSchema` (macOS 26-only)
  or a single-dispatch shim. Design notes are in the module docs of
  `backend/foundation_models.rs`.

**Build requirements:**
- Full Xcode (not just Command Line Tools) on the build host, needed
  for `xcrun swiftc` and the FoundationModels SDK (see the sketch after
  this list).
- macOS 15+ deployment floor (overridable via `MACOSX_DEPLOYMENT_TARGET`).
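
For orientation, a sketch of what that build step can look like. This is not
the crate's actual `build.rs` (Swift runtime linkage and error handling are
elided); it only shows the `xcrun swiftc` invocation and weak-link mechanics
described above:

```rust
use std::{env, process::Command};

fn main() {
    // Linux and Intel-Mac targets skip the Swift compile entirely.
    if env::var("CARGO_CFG_TARGET_OS").as_deref() != Ok("macos")
        || env::var("CARGO_CFG_TARGET_ARCH").as_deref() != Ok("aarch64")
    {
        return;
    }
    println!("cargo:rerun-if-changed=swift/CarFoundationModels.swift");

    let out_dir = env::var("OUT_DIR").unwrap();
    // macOS 15 deployment floor, overridable via MACOSX_DEPLOYMENT_TARGET.
    let floor = env::var("MACOSX_DEPLOYMENT_TARGET").unwrap_or_else(|_| "15.0".into());

    // Needs full Xcode: Command Line Tools alone lack the FoundationModels SDK.
    let status = Command::new("xcrun")
        .arg("swiftc")
        .arg("-emit-library")
        .arg("-static")
        .arg("-target")
        .arg(format!("arm64-apple-macosx{floor}"))
        .arg("swift/CarFoundationModels.swift")
        .arg("-o")
        .arg(format!("{out_dir}/libCarFoundationModels.a"))
        .status()
        .expect("failed to run xcrun swiftc");
    assert!(status.success(), "Swift shim failed to compile");

    println!("cargo:rustc-link-search=native={out_dir}");
    println!("cargo:rustc-link-lib=static=CarFoundationModels");
    // Weak-link so missing FoundationModels symbols resolve lazily at runtime.
    println!("cargo:rustc-link-arg=-Wl,-weak_framework,FoundationModels");
}
```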

Linux and Intel-Mac builds skip the Swift compile entirely; the
`ModelSource::AppleFoundationModels` schema variant still serializes
(so registries can describe the model on any platform), but dispatch
errors out with `UnsupportedMode` before reaching the bridge.
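
The gating itself is ordinary `cfg` dispatch. An illustrative sketch (the
function name, signature, and import path are assumptions; only
`InferenceError::UnsupportedMode` comes from the crate):

```rust
use car_inference::InferenceError; // import path assumed

#[cfg(all(target_os = "macos", target_arch = "aarch64"))]
fn run_foundation_models(prompt: &str) -> Result<String, InferenceError> {
    // Apple Silicon macOS builds forward the call into the Swift bridge here.
    unimplemented!("bridge call for {prompt:?}")
}

#[cfg(not(all(target_os = "macos", target_arch = "aarch64")))]
fn run_foundation_models(_prompt: &str) -> Result<String, InferenceError> {
    // No bridge compiled in: the router sees the error and tries the next model.
    Err(InferenceError::UnsupportedMode)
}
```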

## Crate features

- `metal` -- Apple Silicon GPU acceleration
- `cuda` -- NVIDIA GPU acceleration
- `ast` -- AST-aware code generation via `car-ast`
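
For example, to enable the Metal backend from a downstream `Cargo.toml`:

```toml
[dependencies]
car-inference = { version = "0.14", features = ["metal"] }
```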

Part of [CAR](https://github.com/Parslee-ai/car) -- see the main repo for full documentation.