ferrum-infer-rs
Production-grade LLM inference in Rust. Single binary, OpenAI-compatible, runs on Apple Silicon and CUDA.
What it is
ferrum-infer-rs is a Rust-native inference engine for transformer LLMs: single binary, no Python, OpenAI-compatible HTTP API, seconds to start.
Designed for single-GPU servers, edge devices, and Apple Silicon — where Docker image size, cold start time, and Python toolchain friction matter.
Performance highlight: Apple Silicon at concurrency
The hard case for laptop inference is concurrent serving. ferrum holds its own at single-request decode and pulls ahead as concurrency goes up. Same machine, same Q4_K_M GGUFs, same OpenAI-compatible HTTP load — see the audit-quality report at docs/bench/macos-2026-05-02/ (env, scripts, raw JSON, logs).
M1 Max 32 GB · Q4_K_M · output throughput (tok/s) — best of multiple runs, see bench report § Methodology for variance and re-run protocol.
| Model | c | ferrum | llama.cpp (b8960) | mistralrs (0.8.1) |
|---|---|---|---|---|
| LLaMA-3.1-8B | 1 | 29.1 | 28.7 | 30.2 |
| LLaMA-3.1-8B | 8 | 51.3 | 42.3 | 14.6 |
| LLaMA-3.1-8B | 16 | 96.7 | 67.2 | 23.3 |
| Qwen3-8B | 16 | 93.2 | 68.6 | 23.5 |
| Qwen3-30B-A3B (MoE) | 16 | 79.2¹ | 83.4 | panic² |
¹ ferrum MoE c ≥ 8 requires
FERRUM_MOE_BATCHED=1 FERRUM_MOE_BATCHED_DECODE=1(currently opt-in). Without it, MoE c = 16 falls to 48 tok/s. ² mistralrs 0.8.1 PoisonError-panics on Qwen3-30B-A3B-Q4_K_M (add_request.rs:466) — not a ferrum issue.
The Qwen3-30B-A3B (MoE) row is the headline. That's the model where Apple Silicon Rust support was effectively missing two months ago. ferrum closed it from a 51 → 80 tok/s gap to llama.cpp in a single PR (#81) by mirroring the Phase-4 paged-KV scaffolding into
Qwen3MoeModel. On the dense 8B models, ferrum is +36–44% over llama.cpp at c = 16.
The full 36-cell grid (c = 1, 4, 8, 16 across all three engines and three models, including TPOT / TTFT distributions) is in the bench report.
Performance highlight: NVIDIA GPUs (CUDA)
ferrum maintains a custom CUDA decode runner with INT4 Marlin support. Numbers from RTX PRO 6000 (Blackwell):
Qwen3-4B
| Mode | Decode (tok/s) | VRAM |
|---|---|---|
| FP16 (eager) | 70.3 | ~8 GB |
| FP16 + CUDA Graphs | 82.9 (+18%) | ~8 GB |
| INT4 (GPTQ + Marlin) | 130.4 (+85%) | ~2.5 GB (-69%) |
| 4 concurrent (INT4) | 124.2 | ~2.5 GB |
TinyLlama-1.1B
| Backend | Decode (tok/s) |
|---|---|
| Candle | 126 |
| ferrum CUDA | 256.5 (+103%) |
vLLM-style scheduling features included: PagedAttention, continuous batching, FlashAttention-2 prefill, batched decode, custom fused kernels, piecewise CUDA Graphs, NCCL tensor parallel.
Comparison
| ferrum | vLLM | llama.cpp | mistralrs | |
|---|---|---|---|---|
| Language | Rust | Python+CUDA | C++ | Rust |
| Single binary | ✓ | ✗ (Docker) | ✓ | ✓ |
| Apple Silicon | ✓ (incl. MoE) | ✗ | ✓ | partial (no MoE) |
| CUDA | ✓ (custom) | ✓ (best) | ✓ | ✓ |
| Concurrent serving | ✓ | ✓ (best) | ✓ | ✓ |
| Continuous batching | ✓ | ✓ | partial | ✓ |
| INT4 quantization | ✓ Marlin / Triton | GPTQ / AWQ | GGUF only | varies |
| OpenAI-compatible API | ✓ | ✓ | ✓ | ✓ |
| Embeddable as a library | ✓ | ✗ | ✓ | ✓ |
Quick Start
Homebrew (macOS Apple Silicon, Linux x86_64)
Prebuilt binaries (raw tarball)
# Linux x86_64
|
# macOS Apple Silicon (Metal)
|
Linux x86_64 is the CPU build. macOS aarch64 is the Metal build (the same backend that beats llama.cpp at c=16 on the Group A bench). For CUDA, build from source.
From source
# crates.io
# or git
Run
# Set HF token for gated models (e.g. Llama 3.x)
# Chat directly
# Or serve via OpenAI-compatible API
API call:
Supported Models
| Architecture | Apple Silicon | CUDA | INT4 (GPTQ) | Tensor Parallel |
|---|---|---|---|---|
| LLaMA (3.x, TinyLlama, Vicuna, Mistral) | ✓ | ✓ | ✓ | ✓ |
| Qwen3 dense (0.6B – 8B) | ✓ | ✓ | ✓ | ✓ |
| Qwen3-MoE (30B-A3B) | ✓ | — | — | — |
| Qwen2 / Qwen2.5 | ✓ | ✓ | ✓ | — |
| BERT (embeddings) | ✓ | — | — | — |
| Whisper ASR (tiny → large-v3-turbo) | ✓ | — | — | — |
| Qwen3-TTS (0.6B / 1.7B, voice clone) | ✓ | — | — | — |
| CLIP / Chinese-CLIP / SigLIP (text + image) | ✓ | — | — | — |
Use any HuggingFace model ID:
Multi-modal
# Speech-to-text (WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
# Text-to-speech (incl. voice clone via ICL prompting)
# Embeddings (text + image)
Build options
# CPU only (default)
# Metal acceleration (macOS)
# CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
Architecture
crates/
├── ferrum-types # Shared types
├── ferrum-interfaces # Trait contracts (Backend<B>, ModelExecutor, ...)
├── ferrum-runtime # Backend registry
├── ferrum-engine # Continuous-batch engine, Metal shader pipeline
├── ferrum-models # Model architectures (LlamaFamilyModel<B>, MoE, ...)
├── ferrum-kernels # Custom CUDA + Metal kernels, decode runner
├── ferrum-attention # Fused-transformer prototype (Metal/CPU)
├── ferrum-quantization # GPTQ loader, Marlin, native safetensors
├── ferrum-tokenizer # Tokenization
├── ferrum-sampler # Top-k/p, temperature, repetition penalty, JSON-mode
├── ferrum-scheduler # Continuous batching, paged-KV scheduling
├── ferrum-kv # Paged KV cache (CUDA + Metal pools)
├── ferrum-server # HTTP API
├── ferrum-cli # Binary entry point
└── ferrum-testkit # Test infrastructure
Architecture v2 (Model-as-Code) means the model layer is an explicit Rust generic over a Backend<B> trait, not a config-driven runner. Adding a backend = implementing the trait, not editing models. See docs/architecture-v2.md.
Status
What works today:
- CLI chat, OpenAI-compatible HTTP server with streaming
- Continuous batching, PagedAttention (CUDA + Metal pools), prefix caching, preemption
- Custom CUDA decode runner (Qwen3, LLaMA): 2× over Candle baseline
- Apple Silicon MoE inference (Qwen3-30B-A3B) — matches llama.cpp at c=16
- INT4 GPTQ with Marlin fused kernel (Blackwell + Ampere); also Triton w4a16
- Tensor parallelism (multi-GPU NCCL, persistent per-rank threads)
- Speculative decoding (
--spec-draft <MODEL>DeepMind accept/reject) - Structured output (
response_format: json_object+json_schemawith DFA-guided masking) - Whisper ASR (Metal-accelerated forward pass) + Qwen3-TTS (voice clone, streaming)
- Top-k / top-p / temperature / repetition penalty
Known regressions / in-progress:
- Apple Silicon dense at c = 4 underperforms c = 1 on small models (paged-batched is below crossover). Per-token mode remains the default for c ≤ 4 until the small-m path catches up.
- FP8 (Hopper / Blackwell) — INT4 path is at 24% peak DRAM bandwidth, so there's headroom before FP8 becomes the bottleneck.
Roadmap
See docs/ROADMAP.md for the full picture.
Near-term:
- v0.1: Apple Silicon Group A production release with concurrent serving benchmarks (this PR)
- v0.2: CUDA serving benchmark vs vLLM on commodity hardware (RTX 4090)
- v0.3: Long-context tuning (32k+), more architectures (Phi, DeepSeek, Gemma)
License
MIT