# ferrum-infer-rs

[![Crates.io](https://img.shields.io/crates/v/ferrum-cli.svg)](https://crates.io/crates/ferrum-cli)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/sizzlecar/ferrum-infer-rs/blob/main/LICENSE)

Production-grade LLM inference in Rust: one CLI, one server, Apple Silicon + CUDA backends.

[Chinese README](README_zh.md)

## Why look at Ferrum for 10 seconds?

- **One Rust binary** for `ferrum run` and OpenAI-compatible `ferrum serve`; no Python service in the runtime path.
- **Apple Silicon and NVIDIA CUDA** from the same project, with Metal and CUDA release binaries.
- **CUDA evidence today:** RTX 4090 + Qwen3-30B-A3B GPTQ-Int4 reaches `0.83x-0.89x` vLLM `0.20.2` throughput in same-pod `n_repeats=5` testing.
- **Metal evidence today:** Qwen3 / LLaMA 8B and Qwen3-30B-A3B pass correctness, multi-turn, and concurrency release-candidate gates on Apple Silicon.

## What it is

ferrum-infer-rs is a Rust-native inference engine for transformer LLMs:
single binary, no Python, OpenAI-compatible HTTP API, seconds to start.

Designed for single-GPU servers, edge devices, and Apple Silicon —
where Docker image size, cold start time, and Python toolchain friction matter.

## Performance highlight: NVIDIA GPUs (CUDA)

ferrum ships a custom CUDA runner with PagedAttention, continuous batching, INT4 Marlin MoE, CUDA Graphs, and an opt-in FlashAttention-2 prefill path.

**RTX 4090 · Qwen3-30B-A3B GPTQ-Int4 · random 256/128 · output throughput (tok/s)** — same pod, `n_repeats=5`, vLLM `0.20.2`; full raw logs and JSON are in [`docs/bench/cuda-rtx4090-2026-05-30-m3-80pct-confirmed/`](docs/bench/cuda-rtx4090-2026-05-30-m3-80pct-confirmed/).

| c | ferrum FA2 path | vLLM 0.20.2 | ratio |
|---:|---:|---:|---:|
| 1 | 160.4 +/- 0.2 | 183.9 +/- 0.2 | 0.872x |
| 4 | 446.3 +/- 7.0 | 512.5 +/- 2.8 | 0.871x |
| 16 | 1185.1 +/- 12.3 | 1331.9 +/- 5.7 | 0.890x |
| 32 | 1641.9 +/- 4.8 | 1972.9 +/- 18.6 | 0.832x |

Release-candidate CUDA smoke for commit `e511077` is saved in [`docs/bench/dev-loop-product-api-goal-progress-20260601/cuda-quick-regress-e511077-c32-20260601/`](docs/bench/dev-loop-product-api-goal-progress-20260601/cuda-quick-regress-e511077-c32-20260601/): Paris, multi-turn, and three-round chat gates passed; c=32 completed 32/32 requests with 0 errors.
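
As a quick cross-check, the ratio column in the RTX 4090 table above is ferrum throughput divided by vLLM throughput at the same concurrency. The following is a minimal Rust sketch that recomputes it from the mean values in the table. The `ROWS` constant and the `ratio` helper are illustrative names, not part of ferrum.

```rust
/// (concurrency, ferrum tok/s, vLLM tok/s), copied from the RTX 4090 table.
const ROWS: [(u32, f64, f64); 4] = [
    (1, 160.4, 183.9),
    (4, 446.3, 512.5),
    (16, 1185.1, 1331.9),
    (32, 1641.9, 1972.9),
];

/// Relative throughput: ferrum output tok/s divided by vLLM output tok/s.
fn ratio(ferrum: f64, vllm: f64) -> f64 {
    ferrum / vllm
}

fn main() {
    for &(c, ferrum, vllm) in ROWS.iter() {
        // Three decimals, matching the precision used in the table.
        println!("c={:>2}  ratio={:.3}x", c, ratio(ferrum, vllm));
    }
}
```

The recomputed values stay within the `0.83x-0.89x` range quoted at the top of this README.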

Older single-model decode checks are also kept for auditability:

**Qwen3-4B on RTX PRO 6000 (Blackwell)**

| Mode | Decode (tok/s) | VRAM |
|---|---:|---:|
| FP16 (eager) | 70.3 | ~8 GB |
| FP16 + CUDA Graphs | 82.9 (+18%) | ~8 GB |
| INT4 (GPTQ + Marlin) | **130.4 (+85%)** | **~2.5 GB (-69%)** |
| 4 concurrent (INT4) | 124.2 | ~2.5 GB |

**TinyLlama-1.1B**

| Backend | Decode (tok/s) |
|---|---:|
| Candle | 126 |
| ferrum CUDA | **256.5 (+103%)** |

## Performance highlight: Apple Silicon at concurrency

The hard case for laptop inference is concurrent serving. ferrum holds its own at single-request decode and, on the dense 8B models, pulls ahead as concurrency goes up. Same machine, same `Q4_K_M` GGUFs, same OpenAI-compatible HTTP load; see the audit-quality report at [`docs/bench/macos-2026-05-02/`](docs/bench/macos-2026-05-02/) (env, scripts, raw JSON, logs).

**M1 Max 32 GB · Q4_K_M · output throughput (tok/s)** — current release-candidate regression numbers for ferrum are saved in [`docs/bench/dev-loop-product-api-goal-progress-20260601/metal-readme-regression-20260601-release-candidate-rerun3/`](docs/bench/dev-loop-product-api-goal-progress-20260601/metal-readme-regression-20260601-release-candidate-rerun3/). Baseline engine numbers are from the audit-quality [macOS bench report](docs/bench/macos-2026-05-02/README.md).

| Model | c | ferrum | llama.cpp (b8960) | mistralrs (0.8.1) |
|---|---:|---:|---:|---:|
| LLaMA-3.1-8B | 1 | **31.7** | 28.7 | 30.2 |
| LLaMA-3.1-8B | 8 | **51.7** | 42.3 | 14.6 |
| LLaMA-3.1-8B | 16 | **89.4** | 67.2 | 23.3 |
| Qwen3-8B | 16 | **86.0** | 68.6 | 23.5 |
| Qwen3-30B-A3B (MoE) | 16 | 72.5¹ | 83.4 | panic² |

> ¹ ferrum MoE at c ≥ 8 requires `FERRUM_MOE_BATCHED=1 FERRUM_MOE_BATCHED_DECODE=1` (currently opt-in). Without it, MoE throughput at c = 16 falls to 48 tok/s.
>
> ² mistralrs 0.8.1 panics with a `PoisonError` on Qwen3-30B-A3B-Q4_K_M (`add_request.rs:466`); this is not a ferrum issue.

> The Qwen3-30B-A3B (MoE) row is still important because Apple Silicon Rust support for this model class was effectively missing two months ago. The current release candidate runs it correctly with concurrent serving and multi-turn gates, but the latest c = 16 throughput is below llama.cpp and is reported as such here.

The full 36-cell grid (c = 1, 4, 8, 16 across all three engines and three models, including TPOT / TTFT distributions) is in the [bench report](docs/bench/macos-2026-05-02/README.md).

## Comparison

|  | ferrum | vLLM | llama.cpp | mistralrs |
|---|---|---|---|---|
| Language | Rust | Python+CUDA | C++ | Rust |
| Single binary | ✓ | ✗ (Docker) | ✓ | ✓ |
| Apple Silicon | ✓ (incl. MoE) | ✗ | ✓ | partial (no MoE) |
| CUDA | ✓ (custom) | ✓ (best) | ✓ | ✓ |
| Concurrent serving | ✓ | ✓ (best) | ✓ | ✓ |
| Continuous batching | ✓ | ✓ | partial | ✓ |
| INT4 quantization | ✓ Marlin / Triton | GPTQ / AWQ | GGUF only | varies |
| OpenAI-compatible API | ✓ | ✓ | ✓ | ✓ |
| Embeddable as a library | ✓ | ✗ | ✓ | ✓ |

## Quick Start

### Homebrew (macOS Apple Silicon, Linux x86_64)

```bash
brew tap sizzlecar/ferrum
brew install ferrum        # macOS Metal / Linux CPU
brew install ferrum-cuda   # Linux x86_64 CUDA sm89 build
ferrum --version
```

### Prebuilt binaries (raw tarball)

```bash
# Linux x86_64
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-linux-x86_64.tar.gz | tar xz
./ferrum --help

# Linux x86_64 CUDA, sm89 build
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-linux-x86_64-cuda-sm89.tar.gz | tar xz
LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-} ./ferrum --help

# macOS Apple Silicon (Metal)
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-macos-aarch64.tar.gz | tar xz
./ferrum --help
```

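The plain Linux x86_64 tarball is the CPU build. The Linux x86_64 CUDA tarball is built for `sm89` (compute capability 8.9, the Ada Lovelace generation that includes the RTX 4090) and requires a compatible NVIDIA driver, CUDA runtime, and NCCL runtime on the target host. The macOS aarch64 tarball is the Metal build.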

### From source

```bash
# crates.io
cargo install ferrum-cli

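# or build from a git checkout (binary is written to target/release/ferrum)
git clone https://github.com/sizzlecar/ferrum-infer-rs.git
cd ferrum-infer-rs
cargo build --release -p ferrum-cli --bin ferrum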
```

### Run

```bash
# Set HF token for gated models (e.g. Llama 3.x)
export HF_TOKEN=hf_your_token_here

# Chat directly
ferrum run qwen3:4b

# Or serve via OpenAI-compatible API
ferrum serve --model qwen3:4b --port 8000
```

API call:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:4b","messages":[{"role":"user","content":"Hello"}]}'
```

The OpenAI-shaped endpoint contract, explicit rejections, tool-field status,
usage accounting, and structured-output limits are documented in
[docs/openai-api-compatibility.md](docs/openai-api-compatibility.md).
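
For callers that prefer Rust over curl, here is a minimal sketch that sends the same request using only the standard library. It assumes `ferrum serve` is listening on `localhost:8000` as in the example above. The helper names (`json_escape`, `chat_body`, `post_request`, `send`) are hypothetical and not part of ferrum. A real client would normally use an HTTP library and a JSON parser rather than printing the raw response.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the same JSON body as the curl example: one user message.
fn chat_body(model: &str, user_message: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        json_escape(model),
        json_escape(user_message)
    )
}

/// Format a plain HTTP/1.1 POST request carrying a JSON body.
fn post_request(host: &str, port: u16, path: &str, body: &str) -> String {
    format!(
        "POST {} HTTP/1.1\r\nHost: {}:{}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        path,
        host,
        port,
        body.len(),
        body
    )
}

/// Send the request and return the raw response (status line, headers, body).
fn send(host: &str, port: u16, request: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, port))?;
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

fn main() {
    let body = chat_body("qwen3:4b", "Hello");
    let request = post_request("localhost", 8000, "/v1/chat/completions", &body);
    match send("localhost", 8000, &request) {
        Ok(response) => println!("{}", response),
        // Typically means `ferrum serve` is not running on this port.
        Err(e) => eprintln!("request failed: {}", e),
    }
}
```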

## Supported Models

| Architecture | Apple Silicon | CUDA | INT4 (GPTQ) | Tensor Parallel |
|---|:---:|:---:|:---:|:---:|
| LLaMA (3.x, TinyLlama, Vicuna, Mistral) | ✓ | ✓ | ✓ | ✓ |
| Qwen3 dense (0.6B – 8B) | ✓ | ✓ | ✓ | ✓ |
| Qwen3-MoE (30B-A3B) | ✓ | ✓ | ✓ | — |
| Qwen2 / Qwen2.5 | ✓ | ✓ | ✓ | — |
| BERT (embeddings) | ✓ | — | — | — |
| Whisper ASR (tiny → large-v3-turbo) | ✓ | — | — | — |
| Qwen3-TTS (0.6B / 1.7B) | ✓ | — | — | — |
| CLIP / Chinese-CLIP / SigLIP (text + image) | ✓ | — | — | — |

Use any HuggingFace model ID:

```bash
ferrum run Qwen/Qwen3-4B
ferrum run meta-llama/Llama-3.2-3B-Instruct
ferrum run JunHowie/Qwen3-4B-GPTQ-Int4    # INT4 auto-detected
```

### Multi-modal

```bash
# Speech-to-text (WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
ferrum transcribe whisper-turbo recording.m4a -l zh
ferrum serve whisper-turbo

# Text-to-speech (basic synthesis; optional reference-audio cloning)
ferrum tts qwen3-tts "Hello, welcome to Ferrum TTS" -o output.wav
ferrum serve qwen3-tts

# Embeddings (text + image)
ferrum embed OFA-Sys/chinese-clip-vit-base-patch16 --text "sunset at the beach"
ferrum embed google/siglip-base-patch16-224 --image photo.jpg
```

## Build options

```bash
# CPU only (default)
cargo install ferrum-cli

# Metal acceleration (macOS)
cargo install ferrum-cli --features metal

# CUDA acceleration from source (NVIDIA, requires CUDA toolkit + nvcc)
cargo install ferrum-cli --features cuda,vllm-moe-marlin,vllm-paged-attn-v2,fa2-source
```

## Architecture

```
crates/
├── ferrum-types          # Shared types
├── ferrum-interfaces     # Trait contracts (Backend<B>, ModelExecutor, ...)
├── ferrum-runtime        # Backend registry
├── ferrum-engine         # Continuous-batch engine, Metal shader pipeline
├── ferrum-models         # Model architectures (LlamaFamilyModel<B>, MoE, ...)
├── ferrum-kernels        # Custom CUDA + Metal kernels, decode runner
├── ferrum-attention      # Fused-transformer prototype (Metal/CPU)
├── ferrum-quantization   # GPTQ loader, Marlin, native safetensors
├── ferrum-tokenizer      # Tokenization
├── ferrum-sampler        # Top-k/p, temperature, repetition penalty, JSON-mode
├── ferrum-scheduler      # Continuous batching, paged-KV scheduling
├── ferrum-kv             # Paged KV cache (CUDA + Metal pools)
├── ferrum-server         # HTTP API
├── ferrum-cli            # Binary entry point
└── ferrum-testkit        # Test infrastructure
```

Architecture v2 (Model-as-Code) means the model layer is explicit Rust code, generic over a `Backend<B>` trait, rather than a config-driven runner. Adding a backend means implementing the trait, not editing model code. See [docs/architecture-v2.md](docs/architecture-v2.md).
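
To illustrate the idea, here is a deliberately tiny sketch of a model written once against a backend trait and then run on two backends. Everything in it is hypothetical: the trait shape, `CpuBackend`, `F64Backend`, and `ResidualBlock` are toy stand-ins, not ferrum's actual `Backend<B>` trait or model types, which are described in the architecture document.

```rust
/// Toy stand-in for a backend trait: a tensor type plus the ops a model needs.
trait Backend {
    type Tensor;
    fn from_slice(&self, data: &[f32]) -> Self::Tensor;
    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn to_vec(&self, t: &Self::Tensor) -> Vec<f32>;
}

/// First backend: tensors are plain `Vec<f32>`.
struct CpuBackend;

impl Backend for CpuBackend {
    type Tensor = Vec<f32>;
    fn from_slice(&self, data: &[f32]) -> Vec<f32> {
        data.to_vec()
    }
    fn add(&self, a: &Vec<f32>, b: &Vec<f32>) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
    fn to_vec(&self, t: &Vec<f32>) -> Vec<f32> {
        t.clone()
    }
}

/// Second backend with a different tensor representation (f64 storage).
/// Adding it required no change to the model code below.
struct F64Backend;

impl Backend for F64Backend {
    type Tensor = Vec<f64>;
    fn from_slice(&self, data: &[f32]) -> Vec<f64> {
        data.iter().map(|&x| f64::from(x)).collect()
    }
    fn add(&self, a: &Vec<f64>, b: &Vec<f64>) -> Vec<f64> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
    fn to_vec(&self, t: &Vec<f64>) -> Vec<f32> {
        t.iter().map(|&x| x as f32).collect()
    }
}

/// "Model as code": the forward pass is ordinary Rust, generic over the backend.
struct ResidualBlock<B: Backend> {
    backend: B,
    bias: B::Tensor,
}

impl<B: Backend> ResidualBlock<B> {
    fn new(backend: B, bias: &[f32]) -> Self {
        let bias = backend.from_slice(bias);
        Self { backend, bias }
    }

    /// Computes x + bias using only operations from the trait.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let x = self.backend.from_slice(x);
        let y = self.backend.add(&x, &self.bias);
        self.backend.to_vec(&y)
    }
}

fn main() {
    let on_cpu = ResidualBlock::new(CpuBackend, &[1.0, 2.0]);
    let on_f64 = ResidualBlock::new(F64Backend, &[1.0, 2.0]);
    println!("{:?}", on_cpu.forward(&[0.5, 0.5]));
    println!("{:?}", on_f64.forward(&[0.5, 0.5]));
}
```

The point of the design is that `ResidualBlock` never names a concrete backend, so the compiler monomorphizes the same model code for each one.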

## Status

What works today:
- CLI chat, OpenAI-compatible HTTP server with streaming
- Continuous batching, PagedAttention (CUDA + Metal pools), prefix caching, preemption
- Custom CUDA decode runner (Qwen3, LLaMA): 2× over Candle baseline
- Apple Silicon MoE inference (Qwen3-30B-A3B): passes the concurrent-serving and multi-turn gates; c = 16 throughput is currently below llama.cpp (72.5 vs 83.4 tok/s)
- INT4 GPTQ with Marlin fused kernel (Blackwell + Ampere); also Triton w4a16
- Tensor parallelism (multi-GPU NCCL, persistent per-rank threads)
- Speculative decoding (`--spec-draft <MODEL>`, DeepMind-style accept/reject sampling)
- Structured output (`json_object` best-effort plus strict `json_schema` validation for the supported schema subset)
- Whisper ASR (Metal-accelerated forward pass) + Qwen3-TTS
- Top-k / top-p / temperature / repetition penalty

Known regressions / in-progress:
- Apple Silicon dense models: c = 4 underperforms c = 1 on small models, because the paged-batched path is still below its crossover point at that batch size. Per-token mode remains the default for c ≤ 4 until the small-m path catches up.
- FP8 (Hopper / Blackwell) is still in progress. The INT4 path currently reaches 24% of peak DRAM bandwidth, so there is headroom to recover in INT4 before the lack of FP8 becomes the limiting factor.

## Roadmap

See [docs/ROADMAP.md](docs/ROADMAP.md) for the full picture.

Near-term:
- v0.1: CUDA + Apple Silicon production release with concurrent serving benchmarks
- v0.2: Broader release matrix and long-context serving benchmarks
- v0.3: Long-context tuning (32k+), more architectures (Phi, DeepSeek, Gemma)

## License

MIT