ferrum-interfaces 0.7.2

Core trait contracts for the Ferrum LLM inference engine
Documentation

ferrum-infer-rs

Crates.io License: MIT

Production-grade LLM inference in Rust. Single binary, OpenAI-compatible, runs on Apple Silicon and CUDA.

中文说明

What it is

ferrum-infer-rs is a Rust-native inference engine for transformer LLMs: single binary, no Python, OpenAI-compatible HTTP API, seconds to start.

Designed for single-GPU servers, edge devices, and Apple Silicon — where Docker image size, cold start time, and Python toolchain friction matter.

Performance highlight: Apple Silicon at concurrency

The hard case for laptop inference is concurrent serving. ferrum holds its own at single-request decode and pulls ahead as concurrency goes up. Same machine, same Q4_K_M GGUFs, same OpenAI-compatible HTTP load — see the audit-quality report at docs/bench/macos-2026-05-02/ (env, scripts, raw JSON, logs).

M1 Max 32 GB · Q4_K_M · output throughput (tok/s) — best of multiple runs, see bench report § Methodology for variance and re-run protocol.

Model c ferrum llama.cpp (b8960) mistralrs (0.8.1)
LLaMA-3.1-8B 1 29.1 28.7 30.2
LLaMA-3.1-8B 8 51.3 42.3 14.6
LLaMA-3.1-8B 16 96.7 67.2 23.3
Qwen3-8B 16 93.2 68.6 23.5
Qwen3-30B-A3B (MoE) 16 79.2¹ 83.4 panic²

¹ ferrum MoE c ≥ 8 requires FERRUM_MOE_BATCHED=1 FERRUM_MOE_BATCHED_DECODE=1 (currently opt-in). Without it, MoE c = 16 falls to 48 tok/s. ² mistralrs 0.8.1 PoisonError-panics on Qwen3-30B-A3B-Q4_K_M (add_request.rs:466) — not a ferrum issue.

The Qwen3-30B-A3B (MoE) row is the headline. That's the model where Apple Silicon Rust support was effectively missing two months ago. ferrum closed it from a 51 → 80 tok/s gap to llama.cpp in a single PR (#81) by mirroring the Phase-4 paged-KV scaffolding into Qwen3MoeModel. On the dense 8B models, ferrum is +36–44% over llama.cpp at c = 16.

The full 36-cell grid (c = 1, 4, 8, 16 across all three engines and three models, including TPOT / TTFT distributions) is in the bench report.

Performance highlight: NVIDIA GPUs (CUDA)

ferrum maintains a custom CUDA decode runner with INT4 Marlin support. Numbers from RTX PRO 6000 (Blackwell):

Qwen3-4B

Mode Decode (tok/s) VRAM
FP16 (eager) 70.3 ~8 GB
FP16 + CUDA Graphs 82.9 (+18%) ~8 GB
INT4 (GPTQ + Marlin) 130.4 (+85%) ~2.5 GB (-69%)
4 concurrent (INT4) 124.2 ~2.5 GB

TinyLlama-1.1B

Backend Decode (tok/s)
Candle 126
ferrum CUDA 256.5 (+103%)

vLLM-style scheduling features included: PagedAttention, continuous batching, FlashAttention-2 prefill, batched decode, custom fused kernels, piecewise CUDA Graphs, NCCL tensor parallel.

Comparison

ferrum vLLM llama.cpp mistralrs
Language Rust Python+CUDA C++ Rust
Single binary ✗ (Docker)
Apple Silicon ✓ (incl. MoE) partial (no MoE)
CUDA ✓ (custom) ✓ (best)
Concurrent serving ✓ (best)
Continuous batching partial
INT4 quantization ✓ Marlin / Triton GPTQ / AWQ GGUF only varies
OpenAI-compatible API
Embeddable as a library

Quick Start

Homebrew (macOS Apple Silicon, Linux x86_64)

brew tap sizzlecar/ferrum
brew install ferrum
ferrum --version

Prebuilt binaries (raw tarball)

# Linux x86_64
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-linux-x86_64.tar.gz | tar xz
./ferrum --help

# macOS Apple Silicon (Metal)
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-macos-aarch64.tar.gz | tar xz
./ferrum --help

Linux x86_64 is the CPU build. macOS aarch64 is the Metal build (the same backend that beats llama.cpp at c=16 on the Group A bench). For CUDA, build from source.

From source

# crates.io
cargo install ferrum-cli

# or git
cargo build --release -p ferrum-cli --bin ferrum

Run

# Set HF token for gated models (e.g. Llama 3.x)
export HF_TOKEN=hf_your_token_here

# Chat directly
ferrum run qwen3:4b

# Or serve via OpenAI-compatible API
ferrum serve --model qwen3:4b --port 8000

API call:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:4b","messages":[{"role":"user","content":"Hello"}]}'

Supported Models

Architecture Apple Silicon CUDA INT4 (GPTQ) Tensor Parallel
LLaMA (3.x, TinyLlama, Vicuna, Mistral)
Qwen3 dense (0.6B – 8B)
Qwen3-MoE (30B-A3B)
Qwen2 / Qwen2.5
BERT (embeddings)
Whisper ASR (tiny → large-v3-turbo)
Qwen3-TTS (0.6B / 1.7B, voice clone)
CLIP / Chinese-CLIP / SigLIP (text + image)

Use any HuggingFace model ID:

ferrum run Qwen/Qwen3-4B
ferrum run meta-llama/Llama-3.2-3B-Instruct
ferrum run JunHowie/Qwen3-4B-GPTQ-Int4    # INT4 auto-detected

Multi-modal

# Speech-to-text (WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
ferrum transcribe whisper-turbo recording.m4a -l zh
ferrum serve whisper-turbo

# Text-to-speech (incl. voice clone via ICL prompting)
ferrum tts qwen3-tts "Hello, welcome to Ferrum TTS" -o output.wav
ferrum tts qwen3-tts "你好" --ref-audio ref.wav --ref-text "参考文本" -o clone.wav
ferrum serve qwen3-tts

# Embeddings (text + image)
ferrum embed OFA-Sys/chinese-clip-vit-base-patch16 --text "sunset at the beach"
ferrum embed google/siglip-base-patch16-224 --image photo.jpg

Build options

# CPU only (default)
cargo install ferrum-cli

# Metal acceleration (macOS)
cargo install ferrum-cli --features metal

# CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
cargo install ferrum-cli --features cuda

Architecture

crates/
├── ferrum-types          # Shared types
├── ferrum-interfaces     # Trait contracts (Backend<B>, ModelExecutor, ...)
├── ferrum-runtime        # Backend registry
├── ferrum-engine         # Continuous-batch engine, Metal shader pipeline
├── ferrum-models         # Model architectures (LlamaFamilyModel<B>, MoE, ...)
├── ferrum-kernels        # Custom CUDA + Metal kernels, decode runner
├── ferrum-attention      # Fused-transformer prototype (Metal/CPU)
├── ferrum-quantization   # GPTQ loader, Marlin, native safetensors
├── ferrum-tokenizer      # Tokenization
├── ferrum-sampler        # Top-k/p, temperature, repetition penalty, JSON-mode
├── ferrum-scheduler      # Continuous batching, paged-KV scheduling
├── ferrum-kv             # Paged KV cache (CUDA + Metal pools)
├── ferrum-server         # HTTP API
├── ferrum-cli            # Binary entry point
└── ferrum-testkit        # Test infrastructure

Architecture v2 (Model-as-Code) means the model layer is an explicit Rust generic over a Backend<B> trait, not a config-driven runner. Adding a backend = implementing the trait, not editing models. See docs/architecture-v2.md.

Status

What works today:

  • CLI chat, OpenAI-compatible HTTP server with streaming
  • Continuous batching, PagedAttention (CUDA + Metal pools), prefix caching, preemption
  • Custom CUDA decode runner (Qwen3, LLaMA): 2× over Candle baseline
  • Apple Silicon MoE inference (Qwen3-30B-A3B) — matches llama.cpp at c=16
  • INT4 GPTQ with Marlin fused kernel (Blackwell + Ampere); also Triton w4a16
  • Tensor parallelism (multi-GPU NCCL, persistent per-rank threads)
  • Speculative decoding (--spec-draft <MODEL> DeepMind accept/reject)
  • Structured output (response_format: json_object + json_schema with DFA-guided masking)
  • Whisper ASR (Metal-accelerated forward pass) + Qwen3-TTS (voice clone, streaming)
  • Top-k / top-p / temperature / repetition penalty

Known regressions / in-progress:

  • Apple Silicon dense at c = 4 underperforms c = 1 on small models (paged-batched is below crossover). Per-token mode remains the default for c ≤ 4 until the small-m path catches up.
  • FP8 (Hopper / Blackwell) — INT4 path is at 24% peak DRAM bandwidth, so there's headroom before FP8 becomes the bottleneck.

Roadmap

See docs/ROADMAP.md for the full picture.

Near-term:

  • v0.1: Apple Silicon Group A production release with concurrent serving benchmarks (this PR)
  • v0.2: CUDA serving benchmark vs vLLM on commodity hardware (RTX 4090)
  • v0.3: Long-context tuning (32k+), more architectures (Phi, DeepSeek, Gemma)

License

MIT