# Ferrum Infer
A Rust-native LLM inference engine. Load models from Hugging Face, chat locally, or serve them via an OpenAI-compatible API. Single binary: no Python, no runtime dependencies.
## Install
# From crates.io
# Or build from source
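The install commands for the block above were lost; a minimal sketch, assuming the crate is published as `ferrum` (verify the actual crate name):

```shell
# From crates.io (crate name assumed)
cargo install ferrum

# Or build from source, from a checkout of the repository
cargo build --release
```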
## Quick Start
For gated models (e.g. Llama 3.2), set your Hugging Face token first:
# Download a model
# Chat
# Or start an API server
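A sketch of the steps above; the model IDs are illustrative, and the `HF_TOKEN` variable name is an assumption:

```shell
# For gated models, set your Hugging Face token first
export HF_TOKEN=hf_...

# Download a model
ferrum pull Qwen/Qwen3-0.6B

# Chat
ferrum run Qwen/Qwen3-0.6B

# Or start an API server
ferrum serve --model Qwen/Qwen3-0.6B
```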
## Supported Architectures
Any Hugging Face model using a supported architecture works out of the box:
### Text Generation
| Architecture | CUDA Decode | INT4 (GPTQ) | Tensor Parallel | Example Models |
|---|---|---|---|---|
| LLaMA | Yes | Yes | Yes | Llama-3.x, TinyLlama, Vicuna, Alpaca, ... |
| Qwen3 | Yes | Yes | Yes | Qwen3-0.6B ~ 4B |
| Qwen2 | — | — | — | Qwen2.5-Instruct-0.5B ~ 7B |
### Speech-to-Text (Whisper ASR)
| Architecture | Metal | CUDA | Example Models |
|---|---|---|---|
| Whisper | Yes | — | whisper-tiny, whisper-base, whisper-small, whisper-medium, whisper-large-v3, whisper-turbo (recommended) |
### Text-to-Speech (Qwen3-TTS)
| Architecture | Metal | CPU | Voice Clone | Example Models |
|---|---|---|---|---|
| Qwen3-TTS | Yes | Yes | Yes (ICL) | Qwen3-TTS-12Hz-0.6B-Base |
### Embeddings (text + image)
| Architecture | Modality | Embedding Dim | Example Models |
|---|---|---|---|
| CLIP | Text + Image | 512/768 | openai/clip-vit-base-patch32 |
| Chinese-CLIP | Text + Image | 512 | OFA-Sys/chinese-clip-vit-base-patch16 |
| SigLIP | Text + Image | 768 | google/siglip-base-patch16-224 |
| BERT | Text | 768 | google-bert/bert-base-chinese |
# Text generation
# Speech-to-Text (supports WAV/M4A/MP3/FLAC — auto ffmpeg conversion)
# Text-to-Speech
# Voice clone (ICL mode — clone any voice from 5s reference audio)
# Whisper API server (OpenAI-compatible)
# Embeddings (text + image)
# Embedding API server
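The example invocations for the comments above were lost; a hedged sketch built from the commands in the Commands table (model IDs, file names, and prompt text are illustrative):

```shell
# Text generation
ferrum run meta-llama/Llama-3.2-1B-Instruct

# Speech-to-Text (supports WAV/M4A/MP3/FLAC, auto ffmpeg conversion)
ferrum transcribe whisper-large-v3-turbo meeting.m4a

# Text-to-Speech
ferrum tts Qwen3-TTS-12Hz-0.6B-Base "Hello from Ferrum"

# Voice clone (ICL mode, clone any voice from 5s reference audio)
ferrum tts Qwen3-TTS-12Hz-0.6B-Base "Hello" --ref-audio reference.wav

# Whisper API server (OpenAI-compatible)
ferrum serve --model whisper-large-v3-turbo

# Embeddings (text + image)
ferrum embed openai/clip-vit-base-patch32

# Embedding API server
ferrum serve --model openai/clip-vit-base-patch32
```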
## Commands
| Command | Description |
|---|---|
| `ferrum run <model>` | Interactive chat |
| `ferrum serve --model <model>` | OpenAI-compatible HTTP server |
| `ferrum stop` | Stop running server |
| `ferrum pull <model>` | Download model from Hugging Face |
| `ferrum list` | Show cached models |
| `ferrum bench <model>` | Performance benchmark |
| `ferrum transcribe <model> <audio>` | Speech-to-text (Whisper; supports WAV/M4A/MP3) |
| `ferrum tts <model> <text>` | Text-to-speech (Qwen3-TTS; voice clone with `--ref-audio`) |
| `ferrum embed <model>` | Generate embeddings (BERT/CLIP/SigLIP, text + image) |
## API Endpoints
# Chat completions (OpenAI-compatible)
# Audio transcription (OpenAI-compatible, multipart form)
# Embeddings
# List models
# Health check
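The request examples above were lost; hedged `curl` sketches, assuming the server listens on `localhost:8080` (the port and the `/health` path are assumptions; the `/v1` routes follow the OpenAI API shape):

```shell
# Chat completions (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Audio transcription (OpenAI-compatible, multipart form)
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@audio.wav -F model=default

# Embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "input": "hello world"}'

# List models
curl http://localhost:8080/v1/models

# Health check (path assumed)
curl http://localhost:8080/health
```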
## Performance
Benchmarked on RTX PRO 6000 (Blackwell):
### Qwen3-4B
| Mode | FP16 | INT4 (GPTQ + Marlin) |
|---|---|---|
| Single request decode | 88.1 tok/s | 130.4 tok/s (+48%) |
| 4 concurrent (batch) | 109.4 tok/s | 124.2 tok/s |
| VRAM | ~8 GB | ~2.5 GB (-69%) |
### TinyLlama-1.1B (LLaMA architecture)
| Mode | Candle | CUDA Runner |
|---|---|---|
| Decode | 126 tok/s | 256.5 tok/s (+103%) |
### Tensor Parallelism (multi-GPU)
| Config | Qwen3-4B FP16 |
|---|---|
| 1× GPU | 82.3 tok/s (TPOT 12.1ms) |
| 2× GPU TP | 26.1 tok/s (TPOT 38.4ms) |
TP decode uses persistent per-rank threads with NCCL all-reduce. The current bottleneck is PCIe interconnect latency (~0.44 ms per call × 72 NCCL calls per step, roughly 32 ms of communication alone). TP is most beneficial for models that don't fit on a single GPU, or with an NVLink interconnect.
### Whisper ASR (Apple Silicon Metal)
| Model | 5-min audio | Realtime factor |
|---|---|---|
| whisper-large-v3-turbo | ~72s | 4.2x realtime |
| whisper-tiny | ~20s | 15x realtime |
Custom Whisper forward pass with a rustfft-based STFT. Full decode pipeline: timestamp-based sequential decoding, temperature fallback, and compression-ratio checks. Mel-spectrogram precision matches the Python `whisper` reference exactly.
### Qwen3-TTS (Apple Silicon Metal)
| Text | Audio duration | Generation time | RTF |
|---|---|---|---|
| 29 chars Chinese | 4.6s | 11.3s | 2.8x realtime |
| Voice clone (ICL, 5s ref) | 5.3s | 13.1s | 2.5x realtime |
All-Metal fused transformer pipeline: custom GEMM (64×32 simdgroup tiles), fused residual+norm, flash attention with layer_scale. Full Mimi-based vocoder with 8-layer pre-transformer. Zero-copy on Apple Silicon unified memory.
## Key Optimizations
- Custom CUDA decode runner: bypasses candle for the decode hot path (Qwen3 + LLaMA)
- INT4 quantization: GPTQ models auto-detected, Marlin fused INT4×FP16 kernel
- Tensor parallelism: persistent per-rank threads, barrier sync, NCCL all-reduce (Megatron-LM pattern)
- Batched attention kernel: single launch for all batch items (SM utilization 17%→67%)
- Batched RoPE: per-item positions in single kernel launch
- Custom CUDA kernels: fused RmsNorm, SiLU×mul, RoPE, decode attention (all on single stream)
- Flash Decoding: split-K for long-context decode (auto at KV > 256)
- Batch decode: batched cuBLAS GEMM + batched attention for concurrent requests
- Metal TTS pipeline: all-Metal fused transformer for talker (28 layers) + SubTalker (5 layers) + vocoder (8 layers), cached GPU buffers, fused residual+norm kernel, layer_scale support
- TTS voice clone: ICL prompting with speaker encoder (ECAPA-TDNN) + speech tokenizer (Mimi RVQ)
- Paged KV attention: GPU block pool with block-table indirection
- Double-buffered residual: cross-layer norm fusion (108 fewer kernel launches)
## Current Status
What works:
- CLI chat, HTTP serving with streaming, benchmarking
- Qwen3, Qwen2/2.5, LLaMA 3.x, TinyLlama architectures
- Custom CUDA decode runner for Qwen3 and LLaMA (2x speedup)
- Metal GPU acceleration (macOS), CUDA (NVIDIA), CPU
- INT4 GPTQ quantization with Marlin fused kernel (Blackwell compatible)
- FlashAttention-2 prefill + custom CUDA decode runner
- Paged KV cache with block reclamation
- Continuous batching with batch decode
- Tensor parallelism (multi-GPU NCCL, auto-detects GPU count)
- CLIP/Chinese-CLIP/SigLIP embeddings (text + image, `/v1/embeddings` API)
- Whisper ASR (speech-to-text, Metal accelerated, `/v1/audio/transcriptions` API)
- Multi-format audio support (WAV/M4A/MP3/FLAC via ffmpeg)
- Top-k/top-p/temperature/repetition-penalty sampling
## Roadmap
- Speculative decoding — draft model verification
- More model architectures — Mistral, Phi, DeepSeek, etc.
- Qwen2 CUDA runner — same pattern as LLaMA
See `docs/ROADMAP.md` for full details.
## Build Options
# CPU only (default)
# With Metal acceleration (macOS)
# With CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
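A sketch of the three build options above; the crate name and the `metal`/`cuda` feature names are assumptions (check `Cargo.toml` for the real feature flags):

```shell
# CPU only (default)
cargo install ferrum

# With Metal acceleration (macOS)
cargo install ferrum --features metal

# With CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
cargo install ferrum --features cuda
```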
Or build from source:
Prerequisites: Rust stable toolchain.
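A hedged sketch of a source build; the repository URL is a placeholder, and the feature names are assumptions:

```shell
git clone <repo-url> ferrum
cd ferrum
cargo build --release                    # CPU only
cargo build --release --features metal   # macOS Metal
cargo build --release --features cuda    # NVIDIA CUDA
```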
## Project Structure
crates/
├── ferrum-types # Shared type definitions
├── ferrum-interfaces # Core trait contracts (ComputeBackend, KernelOps, ModelExecutor)
├── ferrum-runtime # Backend implementations (Candle, CPU)
├── ferrum-engine # Metal kernels, model orchestration
├── ferrum-models # Model architectures (LLaMA, Qwen2, Qwen3, BERT, Whisper)
├── ferrum-cuda-kernels # Custom CUDA kernels + decode runner
├── ferrum-tokenizer # Tokenization
├── ferrum-sampler # Sampling strategies
├── ferrum-scheduler # Request scheduling
├── ferrum-kv # KV cache management
├── ferrum-server # HTTP API server
├── ferrum-cli # CLI binary
└── ferrum-testkit # Testing utilities
## License
MIT