ferrum-infer-rs
Production-grade LLM inference in Rust: one CLI, one server, Apple Silicon + CUDA backends.
Why look at Ferrum for 10 seconds?
- One Rust binary for
ferrum runand OpenAI-compatibleferrum serve; no Python service in the runtime path. - Apple Silicon and NVIDIA CUDA from the same project, with Metal and CUDA release binaries.
- CUDA evidence today: RTX 4090 + Qwen3-30B-A3B GPTQ-Int4 reaches
0.83x-0.89xvLLM0.20.2throughput in same-podn_repeats=5testing. - Metal evidence today: Qwen3 / LLaMA 8B and Qwen3-30B-A3B pass correctness, multi-turn, and concurrency release-candidate gates on Apple Silicon.
What it is
ferrum-infer-rs is a Rust-native inference engine for transformer LLMs: single binary, no Python, OpenAI-compatible HTTP API, seconds to start.
Designed for single-GPU servers, edge devices, and Apple Silicon — where Docker image size, cold start time, and Python toolchain friction matter.
Performance highlight: NVIDIA GPUs (CUDA)
ferrum ships a custom CUDA runner with PagedAttention, continuous batching, INT4 Marlin MoE, CUDA Graphs, and an opt-in FlashAttention-2 prefill path.
RTX 4090 · Qwen3-30B-A3B GPTQ-Int4 · random 256/128 · output throughput (tok/s) — same pod, n_repeats=5, vLLM 0.20.2; full raw logs and JSON are in docs/bench/cuda-rtx4090-2026-05-30-m3-80pct-confirmed/.
| c | ferrum FA2 path | vLLM 0.20.2 | ratio |
|---|---|---|---|
| 1 | 160.4 +/- 0.2 | 183.9 +/- 0.2 | 0.872x |
| 4 | 446.3 +/- 7.0 | 512.5 +/- 2.8 | 0.871x |
| 16 | 1185.1 +/- 12.3 | 1331.9 +/- 5.7 | 0.890x |
| 32 | 1641.9 +/- 4.8 | 1972.9 +/- 18.6 | 0.832x |
Release-candidate CUDA smoke for commit e511077 is saved in docs/bench/dev-loop-product-api-goal-progress-20260601/cuda-quick-regress-e511077-c32-20260601/: Paris, multi-turn, and three-round chat gates passed; c=32 completed 32/32 requests with 0 errors.
Older single-model decode checks are also kept for auditability:
Qwen3-4B on RTX PRO 6000 (Blackwell)
| Mode | Decode (tok/s) | VRAM |
|---|---|---|
| FP16 (eager) | 70.3 | ~8 GB |
| FP16 + CUDA Graphs | 82.9 (+18%) | ~8 GB |
| INT4 (GPTQ + Marlin) | 130.4 (+85%) | ~2.5 GB (-69%) |
| 4 concurrent (INT4) | 124.2 | ~2.5 GB |
TinyLlama-1.1B
| Backend | Decode (tok/s) |
|---|---|
| Candle | 126 |
| ferrum CUDA | 256.5 (+103%) |
Performance highlight: Apple Silicon at concurrency
The hard case for laptop inference is concurrent serving. ferrum holds its own at single-request decode and pulls ahead as concurrency goes up. Same machine, same Q4_K_M GGUFs, same OpenAI-compatible HTTP load — see the audit-quality report at docs/bench/macos-2026-05-02/ (env, scripts, raw JSON, logs).
M1 Max 32 GB · Q4_K_M · output throughput (tok/s) — current release-candidate regression numbers for ferrum are saved in docs/bench/dev-loop-product-api-goal-progress-20260601/metal-readme-regression-20260601-release-candidate-rerun3/. Baseline engine numbers are from the audit-quality macOS bench report.
| Model | c | ferrum | llama.cpp (b8960) | mistralrs (0.8.1) |
|---|---|---|---|---|
| LLaMA-3.1-8B | 1 | 31.7 | 28.7 | 30.2 |
| LLaMA-3.1-8B | 8 | 51.7 | 42.3 | 14.6 |
| LLaMA-3.1-8B | 16 | 89.4 | 67.2 | 23.3 |
| Qwen3-8B | 16 | 86.0 | 68.6 | 23.5 |
| Qwen3-30B-A3B (MoE) | 16 | 72.5¹ | 83.4 | panic² |
¹ ferrum MoE c ≥ 8 requires
FERRUM_MOE_BATCHED=1 FERRUM_MOE_BATCHED_DECODE=1(currently opt-in). Without it, MoE c = 16 falls to 48 tok/s. ² mistralrs 0.8.1 PoisonError-panics on Qwen3-30B-A3B-Q4_K_M (add_request.rs:466) — not a ferrum issue.
The Qwen3-30B-A3B (MoE) row is still important because Apple Silicon Rust support for this model class was effectively missing two months ago. The current release candidate runs it correctly with concurrent serving and multi-turn gates, but the latest c = 16 throughput is below llama.cpp and is reported as such here.
The full 36-cell grid (c = 1, 4, 8, 16 across all three engines and three models, including TPOT / TTFT distributions) is in the bench report.
Comparison
| ferrum | vLLM | llama.cpp | mistralrs | |
|---|---|---|---|---|
| Language | Rust | Python+CUDA | C++ | Rust |
| Single binary | ✓ | ✗ (Docker) | ✓ | ✓ |
| Apple Silicon | ✓ (incl. MoE) | ✗ | ✓ | partial (no MoE) |
| CUDA | ✓ (custom) | ✓ (best) | ✓ | ✓ |
| Concurrent serving | ✓ | ✓ (best) | ✓ | ✓ |
| Continuous batching | ✓ | ✓ | partial | ✓ |
| INT4 quantization | ✓ Marlin / Triton | GPTQ / AWQ | GGUF only | varies |
| OpenAI-compatible API | ✓ | ✓ | ✓ | ✓ |
| Embeddable as a library | ✓ | ✗ | ✓ | ✓ |
Quick Start
Homebrew (macOS Apple Silicon, Linux x86_64)
Prebuilt binaries (raw tarball)
# Linux x86_64
|
# Linux x86_64 CUDA, sm89 build
|
LD_LIBRARY_PATH=/usr/local/cuda/lib64:
# macOS Apple Silicon (Metal)
|
Linux x86_64 is the CPU build. Linux x86_64 CUDA is built for sm89 and requires a compatible NVIDIA driver, CUDA runtime, and NCCL runtime on the target host. macOS aarch64 is the Metal build.
From source
# crates.io
# or git
Run
# Set HF token for gated models (e.g. Llama 3.x)
# Chat directly
# Or serve via OpenAI-compatible API
API call:
The OpenAI-shaped endpoint contract, explicit rejections, tool-field status, usage accounting, and structured-output limits are documented in docs/openai-api-compatibility.md.
Supported Models
| Architecture | Apple Silicon | CUDA | INT4 (GPTQ) | Tensor Parallel |
|---|---|---|---|---|
| LLaMA (3.x, TinyLlama, Vicuna, Mistral) | ✓ | ✓ | ✓ | ✓ |
| Qwen3 dense (0.6B – 8B) | ✓ | ✓ | ✓ | ✓ |
| Qwen3-MoE (30B-A3B) | ✓ | ✓ | ✓ | — |
| Qwen2 / Qwen2.5 | ✓ | ✓ | ✓ | — |
| BERT (embeddings) | ✓ | — | — | — |
| Whisper ASR (tiny → large-v3-turbo) | ✓ | — | — | — |
| Qwen3-TTS (0.6B / 1.7B) | ✓ | — | — | — |
| CLIP / Chinese-CLIP / SigLIP (text + image) | ✓ | — | — | — |
Use any HuggingFace model ID:
Multi-modal
# Speech-to-text (WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
# Text-to-speech (basic synthesis; optional reference-audio cloning)
# Embeddings (text + image)
Build options
# CPU only (default)
# Metal acceleration (macOS)
# CUDA acceleration from source (NVIDIA, requires CUDA toolkit + nvcc)
Architecture
crates/
├── ferrum-types # Shared types
├── ferrum-interfaces # Trait contracts (Backend<B>, ModelExecutor, ...)
├── ferrum-runtime # Backend registry
├── ferrum-engine # Continuous-batch engine, Metal shader pipeline
├── ferrum-models # Model architectures (LlamaFamilyModel<B>, MoE, ...)
├── ferrum-kernels # Custom CUDA + Metal kernels, decode runner
├── ferrum-attention # Fused-transformer prototype (Metal/CPU)
├── ferrum-quantization # GPTQ loader, Marlin, native safetensors
├── ferrum-tokenizer # Tokenization
├── ferrum-sampler # Top-k/p, temperature, repetition penalty, JSON-mode
├── ferrum-scheduler # Continuous batching, paged-KV scheduling
├── ferrum-kv # Paged KV cache (CUDA + Metal pools)
├── ferrum-server # HTTP API
├── ferrum-cli # Binary entry point
└── ferrum-testkit # Test infrastructure
Architecture v2 (Model-as-Code) means the model layer is an explicit Rust generic over a Backend<B> trait, not a config-driven runner. Adding a backend = implementing the trait, not editing models. See docs/architecture-v2.md.
Status
What works today:
- CLI chat, OpenAI-compatible HTTP server with streaming
- Continuous batching, PagedAttention (CUDA + Metal pools), prefix caching, preemption
- Custom CUDA decode runner (Qwen3, LLaMA): 2× over Candle baseline
- Apple Silicon MoE inference (Qwen3-30B-A3B) — matches llama.cpp at c=16
- INT4 GPTQ with Marlin fused kernel (Blackwell + Ampere); also Triton w4a16
- Tensor parallelism (multi-GPU NCCL, persistent per-rank threads)
- Speculative decoding (
--spec-draft <MODEL>DeepMind accept/reject) - Structured output (
json_objectbest-effort plus strictjson_schemavalidation for the supported schema subset) - Whisper ASR (Metal-accelerated forward pass) + Qwen3-TTS
- Top-k / top-p / temperature / repetition penalty
Known regressions / in-progress:
- Apple Silicon dense at c = 4 underperforms c = 1 on small models (paged-batched is below crossover). Per-token mode remains the default for c ≤ 4 until the small-m path catches up.
- FP8 (Hopper / Blackwell) — INT4 path is at 24% peak DRAM bandwidth, so there's headroom before FP8 becomes the bottleneck.
Roadmap
See docs/ROADMAP.md for the full picture.
Near-term:
- v0.1: CUDA + Apple Silicon production release with concurrent serving benchmarks
- v0.2: Broader release matrix and long-context serving benchmarks
- v0.3: Long-context tuning (32k+), more architectures (Phi, DeepSeek, Gemma)
License
MIT