ferrum-types 0.7.4

Shared type definitions for the Ferrum LLM inference engine

ferrum-infer-rs

Production-grade LLM inference in Rust: one CLI, one server, Apple Silicon + CUDA backends.

Chinese documentation

Why look at Ferrum for 10 seconds?

  • One Rust binary for ferrum run and OpenAI-compatible ferrum serve; no Python service in the runtime path.
  • Apple Silicon and NVIDIA CUDA from the same project, with Metal and CUDA release binaries.
  • CUDA evidence today: RTX 4090 + Qwen3-30B-A3B GPTQ-Int4 reaches 0.83x-0.89x vLLM 0.20.2 throughput in same-pod n_repeats=5 testing.
  • Metal evidence today: Qwen3 / LLaMA 8B and Qwen3-30B-A3B pass correctness, multi-turn, and concurrency release-candidate gates on Apple Silicon.

What it is

ferrum-infer-rs is a Rust-native inference engine for transformer LLMs: single binary, no Python, OpenAI-compatible HTTP API, seconds to start.

Designed for single-GPU servers, edge devices, and Apple Silicon — where Docker image size, cold start time, and Python toolchain friction matter.

Performance highlight: NVIDIA GPUs (CUDA)

ferrum ships a custom CUDA runner with PagedAttention, continuous batching, INT4 Marlin MoE, CUDA Graphs, and an opt-in FlashAttention-2 prefill path.

RTX 4090 · Qwen3-30B-A3B GPTQ-Int4 · random 256/128 · output throughput (tok/s) — same pod, n_repeats=5, vLLM 0.20.2; full raw logs and JSON are in docs/bench/cuda-rtx4090-2026-05-30-m3-80pct-confirmed/.

c     ferrum FA2 path     vLLM 0.20.2         ratio
1     160.4 +/- 0.2       183.9 +/- 0.2       0.872x
4     446.3 +/- 7.0       512.5 +/- 2.8       0.871x
16    1185.1 +/- 12.3     1331.9 +/- 5.7      0.890x
32    1641.9 +/- 4.8      1972.9 +/- 18.6     0.832x
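
The ratio column is ferrum's mean output throughput divided by vLLM's at the same concurrency. As a quick check, the sketch below recomputes it from the means in the table. The function and constant names are illustrative and are not part of ferrum.

```rust
/// Ratio of ferrum's mean throughput to vLLM's, formatted like the table ("0.872x").
/// Illustrative helper; not part of the ferrum codebase.
fn throughput_ratio(ferrum_tok_s: f64, vllm_tok_s: f64) -> String {
    format!("{:.3}x", ferrum_tok_s / vllm_tok_s)
}

/// (concurrency, ferrum mean tok/s, vLLM mean tok/s), copied from the table.
const ROWS: [(u32, f64, f64); 4] = [
    (1, 160.4, 183.9),
    (4, 446.3, 512.5),
    (16, 1185.1, 1331.9),
    (32, 1641.9, 1972.9),
];

/// Recompute the whole ratio column in table order.
fn ratio_column() -> Vec<String> {
    ROWS.iter()
        .map(|&(_, ferrum, vllm)| throughput_ratio(ferrum, vllm))
        .collect()
}
```

The recomputed values span 0.832x to 0.890x, which is the 0.83x-0.89x range quoted in the summary bullets.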

Release-candidate CUDA smoke for commit e511077 is saved in docs/bench/dev-loop-product-api-goal-progress-20260601/cuda-quick-regress-e511077-c32-20260601/: Paris, multi-turn, and three-round chat gates passed; c=32 completed 32/32 requests with 0 errors.

Older single-model decode checks are also kept for auditability:

Qwen3-4B on RTX PRO 6000 (Blackwell)

Mode                     Decode (tok/s)     VRAM
FP16 (eager)             70.3               ~8 GB
FP16 + CUDA Graphs       82.9 (+18%)        ~8 GB
INT4 (GPTQ + Marlin)     130.4 (+85%)       ~2.5 GB (-69%)
4 concurrent (INT4)      124.2              ~2.5 GB
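
The percentages in this table are relative to the FP16 (eager) row. The sketch below reproduces them and adds a rough weight-memory estimate. It assumes Qwen3-4B has about 4.0e9 parameters, which is an approximation and not a figure from this README, and the helper names are hypothetical.

```rust
/// Percentage change from `base` to `new`, rounded to the nearest integer.
fn pct_change(base: f64, new: f64) -> i64 {
    ((new / base - 1.0) * 100.0).round() as i64
}

/// Approximate weight footprint in GB (10^9 bytes) for `params` parameters
/// stored at `bits` bits each. Ignores quantization scales, unquantized
/// layers, KV cache, and activations.
fn weight_gb(params: f64, bits: f64) -> f64 {
    params * bits / 8.0 / 1e9
}
```

Under that assumption, `weight_gb(4.0e9, 16.0)` gives 8 GB and `weight_gb(4.0e9, 4.0)` gives 2 GB. The gap between 2 GB and the measured ~2.5 GB is plausibly quantization metadata and layers kept at higher precision, but this README does not break it down.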

TinyLlama-1.1B

Backend         Decode (tok/s)
Candle          126
ferrum CUDA     256.5 (+103%)

Performance highlight: Apple Silicon at concurrency

The hard case for laptop inference is concurrent serving. ferrum holds its own at single-request decode and pulls ahead as concurrency goes up. Same machine, same Q4_K_M GGUFs, same OpenAI-compatible HTTP load — see the audit-quality report at docs/bench/macos-2026-05-02/ (env, scripts, raw JSON, logs).

M1 Max 32 GB · Q4_K_M · output throughput (tok/s) — current release-candidate regression numbers for ferrum are saved in docs/bench/dev-loop-product-api-goal-progress-20260601/metal-readme-regression-20260601-release-candidate-rerun3/. Baseline engine numbers are from the audit-quality macOS bench report.

Model                   c      ferrum     llama.cpp (b8960)     mistralrs (0.8.1)
LLaMA-3.1-8B            1      31.7       28.7                  30.2
LLaMA-3.1-8B            8      51.7       42.3                  14.6
LLaMA-3.1-8B            16     89.4       67.2                  23.3
Qwen3-8B                16     86.0       68.6                  23.5
Qwen3-30B-A3B (MoE)     16     72.5¹      83.4                  panic²

¹ ferrum MoE at c ≥ 8 requires FERRUM_MOE_BATCHED=1 FERRUM_MOE_BATCHED_DECODE=1 (currently opt-in). Without these, MoE throughput at c = 16 falls to 48 tok/s.
² mistralrs 0.8.1 panics with a PoisonError on Qwen3-30B-A3B-Q4_K_M (add_request.rs:466); this is not a ferrum issue.

The Qwen3-30B-A3B (MoE) row is still important because Apple Silicon Rust support for this model class was effectively missing two months ago. The current release candidate runs it correctly with concurrent serving and multi-turn gates, but the latest c = 16 throughput is below llama.cpp and is reported as such here.

The full 36-cell grid (c = 1, 4, 8, 16 across all three engines and three models, including TPOT / TTFT distributions) is in the bench report.

Comparison

ferrum vLLM llama.cpp mistralrs
Language Rust Python+CUDA C++ Rust
Single binary ✗ (Docker)
Apple Silicon ✓ (incl. MoE) partial (no MoE)
CUDA ✓ (custom) ✓ (best)
Concurrent serving ✓ (best)
Continuous batching partial
INT4 quantization ✓ Marlin / Triton GPTQ / AWQ GGUF only varies
OpenAI-compatible API
Embeddable as a library

Quick Start

Homebrew (macOS Apple Silicon, Linux x86_64)

brew tap sizzlecar/ferrum
brew install ferrum        # macOS Metal / Linux CPU
brew install ferrum-cuda   # Linux x86_64 CUDA sm89 build
ferrum --version

Prebuilt binaries (raw tarball)

# Linux x86_64
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-linux-x86_64.tar.gz | tar xz
./ferrum --help

# Linux x86_64 CUDA, sm89 build
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-linux-x86_64-cuda-sm89.tar.gz | tar xz
LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-} ./ferrum --help

# macOS Apple Silicon (Metal)
curl -L https://github.com/sizzlecar/ferrum-infer-rs/releases/latest/download/ferrum-macos-aarch64.tar.gz | tar xz
./ferrum --help

Linux x86_64 is the CPU build. Linux x86_64 CUDA is built for sm89 and requires a compatible NVIDIA driver, CUDA runtime, and NCCL runtime on the target host. macOS aarch64 is the Metal build.

From source

# crates.io
cargo install ferrum-cli

# or build from a git checkout
git clone https://github.com/sizzlecar/ferrum-infer-rs
cd ferrum-infer-rs
cargo build --release -p ferrum-cli --bin ferrum

Run

# Set HF token for gated models (e.g. Llama 3.x)
export HF_TOKEN=hf_your_token_here

# Chat directly
ferrum run qwen3:4b

# Or serve via OpenAI-compatible API
ferrum serve --model qwen3:4b --port 8000

API call:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:4b","messages":[{"role":"user","content":"Hello"}]}'

The OpenAI-shaped endpoint contract, explicit rejections, tool-field status, usage accounting, and structured-output limits are documented in docs/openai-api-compatibility.md.
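
The same request can be sent from Rust using only the standard library. This is a minimal sketch, not ferrum's client API: it assumes a server started with ferrum serve --model qwen3:4b --port 8000 on localhost and a non-streaming response, and every function name here is hypothetical. A real client would normally use an HTTP library and a JSON crate.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Minimal JSON string escaping: quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the same JSON body as the curl example: a single user message.
fn chat_body(model: &str, user_content: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        json_escape(model),
        json_escape(user_content)
    )
}

/// Build a raw HTTP/1.1 POST for /v1/chat/completions.
fn chat_request(host: &str, port: u16, body: &str) -> String {
    format!(
        "POST /v1/chat/completions HTTP/1.1\r\n\
         Host: {host}:{port}\r\n\
         Content-Type: application/json\r\n\
         Content-Length: {len}\r\n\
         Connection: close\r\n\
         \r\n\
         {body}",
        len = body.len() // byte length, as Content-Length requires
    )
}

/// Send the request and return the raw response (status line, headers, body).
fn send_chat(host: &str, port: u16, model: &str, prompt: &str) -> std::io::Result<String> {
    let body = chat_body(model, prompt);
    let mut stream = TcpStream::connect((host, port))?;
    stream.write_all(chat_request(host, port, &body).as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?; // server closes the connection when done
    Ok(response)
}
```

Calling send_chat("localhost", 8000, "qwen3:4b", "Hello") returns the unparsed HTTP response; extracting the assistant message from the JSON body is left to a JSON parser.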

Supported Models

Architecture Apple Silicon CUDA INT4 (GPTQ) Tensor Parallel
LLaMA (3.x, TinyLlama, Vicuna, Mistral)
Qwen3 dense (0.6B – 8B)
Qwen3-MoE (30B-A3B)
Qwen2 / Qwen2.5
BERT (embeddings)
Whisper ASR (tiny → large-v3-turbo)
Qwen3-TTS (0.6B / 1.7B)
CLIP / Chinese-CLIP / SigLIP (text + image)

Use any HuggingFace model ID:

ferrum run Qwen/Qwen3-4B
ferrum run meta-llama/Llama-3.2-3B-Instruct
ferrum run JunHowie/Qwen3-4B-GPTQ-Int4    # INT4 auto-detected

Multi-modal

# Speech-to-text (WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
ferrum transcribe whisper-turbo recording.m4a -l zh
ferrum serve whisper-turbo

# Text-to-speech (basic synthesis; optional reference-audio cloning)
ferrum tts qwen3-tts "Hello, welcome to Ferrum TTS" -o output.wav
ferrum serve qwen3-tts

# Embeddings (text + image)
ferrum embed OFA-Sys/chinese-clip-vit-base-patch16 --text "sunset at the beach"
ferrum embed google/siglip-base-patch16-224 --image photo.jpg

Build options

# CPU only (default)
cargo install ferrum-cli

# Metal acceleration (macOS)
cargo install ferrum-cli --features metal

# CUDA acceleration from source (NVIDIA, requires CUDA toolkit + nvcc)
cargo install ferrum-cli --features cuda,vllm-moe-marlin,vllm-paged-attn-v2,fa2-source

Architecture

crates/
├── ferrum-types          # Shared types
├── ferrum-interfaces     # Trait contracts (Backend<B>, ModelExecutor, ...)
├── ferrum-runtime        # Backend registry
├── ferrum-engine         # Continuous-batch engine, Metal shader pipeline
├── ferrum-models         # Model architectures (LlamaFamilyModel<B>, MoE, ...)
├── ferrum-kernels        # Custom CUDA + Metal kernels, decode runner
├── ferrum-attention      # Fused-transformer prototype (Metal/CPU)
├── ferrum-quantization   # GPTQ loader, Marlin, native safetensors
├── ferrum-tokenizer      # Tokenization
├── ferrum-sampler        # Top-k/p, temperature, repetition penalty, JSON-mode
├── ferrum-scheduler      # Continuous batching, paged-KV scheduling
├── ferrum-kv             # Paged KV cache (CUDA + Metal pools)
├── ferrum-server         # HTTP API
├── ferrum-cli            # Binary entry point
└── ferrum-testkit        # Test infrastructure

Architecture v2 (Model-as-Code) means the model layer is written as explicit Rust code, generic over a Backend<B> trait, rather than driven by a config-based runner. Adding a backend means implementing the trait; the model code does not change. See docs/architecture-v2.md.
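
To illustrate the idea of model code that is generic over a backend, here is a heavily simplified sketch. The trait, its methods, and the Linear and Cpu types are hypothetical and much smaller than anything in ferrum-interfaces; the real contracts are described in docs/architecture-v2.md.

```rust
/// Hypothetical, simplified backend contract (not ferrum's actual trait).
trait Backend {
    /// Device-side buffer type: a Vec on CPU, a device allocation on a GPU.
    type Buffer;
    fn upload(data: &[f32]) -> Self::Buffer;
    fn download(buf: &Self::Buffer) -> Vec<f32>;
    /// y = W x, with W stored row-major as rows x cols.
    fn matvec(w: &Self::Buffer, x: &Self::Buffer, rows: usize, cols: usize) -> Self::Buffer;
}

/// Reference CPU backend.
struct Cpu;

impl Backend for Cpu {
    type Buffer = Vec<f32>;
    fn upload(data: &[f32]) -> Vec<f32> {
        data.to_vec()
    }
    fn download(buf: &Vec<f32>) -> Vec<f32> {
        buf.clone()
    }
    fn matvec(w: &Vec<f32>, x: &Vec<f32>, rows: usize, cols: usize) -> Vec<f32> {
        (0..rows)
            .map(|r| (0..cols).map(|c| w[r * cols + c] * x[c]).sum::<f32>())
            .collect()
    }
}

/// A model layer written once, generic over the backend.
struct Linear<B: Backend> {
    weight: B::Buffer,
    rows: usize,
    cols: usize,
}

impl<B: Backend> Linear<B> {
    fn new(weight: &[f32], rows: usize, cols: usize) -> Self {
        assert_eq!(weight.len(), rows * cols);
        Linear { weight: B::upload(weight), rows, cols }
    }

    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let x_dev = B::upload(x);
        B::download(&B::matvec(&self.weight, &x_dev, self.rows, self.cols))
    }
}
```

In this sketch, supporting another device would mean adding one more impl Backend block with its own Buffer type, while Linear stays untouched.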

Status

What works today:

  • CLI chat, OpenAI-compatible HTTP server with streaming
  • Continuous batching, PagedAttention (CUDA + Metal pools), prefix caching, preemption
  • Custom CUDA decode runner (Qwen3, LLaMA): 2× over Candle baseline
  • Apple Silicon MoE inference (Qwen3-30B-A3B): passes concurrent-serving and multi-turn gates; c = 16 throughput is currently below llama.cpp (72.5 vs 83.4 tok/s)
  • INT4 GPTQ with Marlin fused kernel (Blackwell + Ampere); also Triton w4a16
  • Tensor parallelism (multi-GPU NCCL, persistent per-rank threads)
  • Speculative decoding (--spec-draft <MODEL>, DeepMind-style accept/reject sampling)
  • Structured output (json_object best-effort plus strict json_schema validation for the supported schema subset)
  • Whisper ASR (Metal-accelerated forward pass) + Qwen3-TTS
  • Top-k / top-p / temperature / repetition penalty
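
The accept/reject step named in the speculative decoding item follows a well-known rule from the speculative sampling literature: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The sketch below shows that rule in isolation. It is not taken from ferrum's source, and the function names are hypothetical.

```rust
/// Accept a drafted token with probability min(1, p/q).
/// `p_target` and `q_draft` are the probabilities the target and draft models
/// assign to the drafted token; `u` is a uniform random sample in [0, 1).
fn accept_draft(p_target: f32, q_draft: f32, u: f32) -> bool {
    // q_draft > 0 because the token was sampled from the draft distribution.
    u < (p_target / q_draft).min(1.0)
}

/// On rejection, the replacement token is drawn from normalize(max(0, p - q)),
/// where `p` and `q` are the full target and draft distributions.
fn residual_distribution(p: &[f32], q: &[f32]) -> Vec<f32> {
    assert_eq!(p.len(), q.len());
    let residual: Vec<f32> = p
        .iter()
        .zip(q)
        .map(|(&pi, &qi)| (pi - qi).max(0.0))
        .collect();
    let total: f32 = residual.iter().sum();
    if total > 0.0 {
        residual.iter().map(|r| r / total).collect()
    } else {
        // p equals q everywhere, so a rejection cannot occur; fall back to p.
        p.to_vec()
    }
}
```

The caller supplies the random sample u, which keeps both functions deterministic and easy to test. Combined, the two steps leave the output distribution equal to the target model's.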

Known regressions / in-progress:

  • Apple Silicon dense models at c = 4 underperform c = 1 on small models, because the paged-batched path is still below its crossover point at that batch size. Per-token mode remains the default for c ≤ 4 until the small-m path catches up.
  • FP8 (Hopper / Blackwell) is still in progress. The INT4 path currently reaches only 24% of peak DRAM bandwidth, so it has headroom left to tune before FP8 becomes the limiting factor.

Roadmap

See docs/ROADMAP.md for the full picture.

Near-term:

  • v0.1: CUDA + Apple Silicon production release with concurrent serving benchmarks
  • v0.2: Broader release matrix and long-context serving benchmarks
  • v0.3: Long-context tuning (32k+), more architectures (Phi, DeepSeek, Gemma)

License

MIT