ferrum-types 0.6.0

Shared type definitions for the Ferrum LLM inference engine

Ferrum Infer

A Rust-native LLM inference engine. Load models from Hugging Face, chat locally, or serve them via an OpenAI-compatible API. Single binary, no Python, no runtime dependencies.

Chinese documentation (中文说明)

Install

# From crates.io
cargo install ferrum-cli

# Or build from source
cargo build --release -p ferrum-cli --bin ferrum

Quick Start

For gated models (e.g. Llama 3.2), set your Hugging Face token first:

export HF_TOKEN=hf_your_token_here

# Download a model
ferrum pull qwen3:0.6b

# Chat
ferrum run qwen3:0.6b

# Or start an API server
ferrum serve --model qwen3:0.6b --port 8000

Supported Architectures

Any Hugging Face model using a supported architecture works out of the box:

Text Generation

Architecture | CUDA Decode | INT4 (GPTQ) | Tensor Parallel | Example Models
LLaMA        | Yes         | Yes         | Yes             | Llama-3.x, TinyLlama, Vicuna, Alpaca, ...
Qwen3        | Yes         | Yes         | Yes             | Qwen3-0.6B ~ 4B
Qwen2        | -           | -           | -               | Qwen2.5-Instruct-0.5B ~ 7B

Speech-to-Text (Whisper ASR)

Architecture | Metal | CUDA | Example Models
Whisper      | Yes   | -    | whisper-tiny, whisper-base, whisper-small, whisper-medium, whisper-large-v3, whisper-turbo (recommended)

Text-to-Speech (Qwen3-TTS)

Architecture | Metal | CPU | Voice Clone | Example Models
Qwen3-TTS    | Yes   | Yes | Yes (ICL)   | Qwen3-TTS-12Hz-0.6B-Base

Embeddings (text + image)

Architecture | Modality     | Embedding Dim | Example Models
CLIP         | Text + Image | 512/768       | openai/clip-vit-base-patch32
Chinese-CLIP | Text + Image | 512           | OFA-Sys/chinese-clip-vit-base-patch16
SigLIP       | Text + Image | 768           | google/siglip-base-patch16-224
BERT         | Text         | 768           | google-bert/bert-base-chinese

# Text generation
ferrum run Qwen/Qwen3-4B
ferrum run llama3.2:3b

# Speech-to-Text (supports WAV/M4A/MP3/FLAC — auto ffmpeg conversion)
ferrum transcribe whisper-turbo recording.m4a -l zh
ferrum transcribe whisper-turbo meeting.wav -l en

# Text-to-Speech
ferrum tts qwen3-tts "Hello, welcome to Ferrum TTS" -o output.wav
ferrum tts qwen3-tts "你好欢迎使用语音合成系统" -o output.wav

# Voice clone (ICL mode — clone any voice from 5s reference audio)
ferrum tts qwen3-tts "你好" --ref-audio ref.wav --ref-text "参考文本" -o clone.wav

# Whisper API server (OpenAI-compatible)
ferrum serve --model whisper-turbo
curl localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "language=zh"

# Embeddings (text + image)
ferrum embed OFA-Sys/chinese-clip-vit-base-patch16 --text "sunset at the beach"
ferrum embed google/siglip-base-patch16-224 --image photo.jpg

# Embedding API server
ferrum serve --model OFA-Sys/chinese-clip-vit-base-patch16
curl localhost:8000/v1/embeddings -H "Content-Type: application/json" \
  -d '{"model":"clip","input":"hello"}'
curl localhost:8000/v1/embeddings -H "Content-Type: application/json" \
  -d '{"model":"clip","input":{"image":"/path/to/photo.jpg"}}'

Commands

Command                           | Description
ferrum run <model>                | Interactive chat
ferrum serve --model <model>      | OpenAI-compatible HTTP server
ferrum stop                       | Stop running server
ferrum pull <model>               | Download model from Hugging Face
ferrum list                       | Show cached models
ferrum bench <model>              | Performance benchmark
ferrum transcribe <model> <audio> | Speech-to-text (Whisper, supports WAV/M4A/MP3/FLAC)
ferrum tts <model> <text>         | Text-to-speech (Qwen3-TTS, voice clone with --ref-audio)
ferrum embed <model>              | Generate embeddings (BERT/CLIP/SigLIP, text + image)

API Endpoints

# Chat completions (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:0.6b","messages":[{"role":"user","content":"Hello"}]}'

# Audio transcription (OpenAI-compatible, multipart form)
curl http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "language=zh"

# Embeddings
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"clip","input":"hello"}'

# List models
curl http://localhost:8000/v1/models

# Health check
curl http://localhost:8000/health
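
The chat completions call above can also be issued from a script. The following is a minimal Python sketch using only the standard library. It assumes a Ferrum server is already listening on localhost:8000; the helper names build_chat_request and chat are illustrative and not part of Ferrum.

```python
import json
import urllib.request

# Assumption: a Ferrum server is listening here (for example, started with `ferrum serve`).
BASE_URL = "http://localhost:8000"


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request equivalent to the chat completions curl example."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> dict:
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)
```

If the response follows the OpenAI schema, which the README implies but does not spell out, the reply text would be found at chat(...)["choices"][0]["message"]["content"].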

Performance

Benchmarked on RTX PRO 6000 (Blackwell):

Qwen3-4B

Mode                  | FP16        | INT4 (GPTQ + Marlin)
Single request decode | 88.1 tok/s  | 130.4 tok/s (+48%)
4 concurrent (batch)  | 109.4 tok/s | 124.2 tok/s
VRAM                  | ~8 GB       | ~2.5 GB (-69%)
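
The VRAM row is roughly what weight storage alone would predict. The sketch below is a back-of-envelope check, not a Ferrum measurement: the 4e9 parameter count is an assumed round number, and the estimate ignores the KV cache, activations, and GPTQ scales and zero-points. Those omissions, plus any layers kept in FP16, would plausibly account for the gap between the 2.0 GB estimate and the reported ~2.5 GB.

```python
# Back-of-envelope weight memory. Every input is an assumption, not a measurement.
PARAMS = 4e9        # assumed parameter count for a "4B" model
BYTES_FP16 = 2.0    # 16 bits per weight
BYTES_INT4 = 0.5    # 4 bits per weight, ignoring GPTQ scales and zero-points


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in GB (10^9 bytes), excluding KV cache and activations."""
    return params * bytes_per_param / 1e9


fp16_gb = weight_gb(PARAMS, BYTES_FP16)   # 8.0
int4_gb = weight_gb(PARAMS, BYTES_INT4)   # 2.0

# Relative figures recomputed from the table's own numbers.
decode_speedup = 130.4 / 88.1 - 1.0       # about 0.48, matching "+48%"
vram_saving = 1.0 - 2.5 / 8.0             # 0.6875, which rounds to the quoted 69%
```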

TinyLlama-1.1B (Llama architecture)

Mode   | Candle    | CUDA Runner
Decode | 126 tok/s | 256.5 tok/s (+103%)

Tensor Parallelism (multi-GPU)

Config    | Qwen3-4B FP16
1× GPU    | 82.3 tok/s (TPOT 12.1ms)
2× GPU TP | 26.1 tok/s (TPOT 38.4ms)

TP decode uses persistent per-rank threads with NCCL all-reduce. On this PCIe setup, 2-GPU TP is slower than a single GPU for Qwen3-4B: the current bottleneck is interconnect latency (~0.44 ms × 72 NCCL calls per step). TP is most beneficial for models that don't fit on a single GPU, or on systems with an NVLink interconnect.
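
The latency figures quoted above can be checked with a few lines of arithmetic. This sketch uses only numbers from this section; the variable names are illustrative. It suggests all-reduce latency accounts for most of the 2-GPU time per output token. The 72 calls per step would be consistent with two all-reduces per transformer layer in a 36-layer model, though the README does not state that breakdown.

```python
# Figures quoted in this section for 2-GPU tensor parallelism over PCIe.
NCCL_LATENCY_MS = 0.44     # latency per NCCL all-reduce call
NCCL_CALLS_PER_STEP = 72   # all-reduce calls per decode step
TPOT_TP2_MS = 38.4         # time per output token, 2x GPU TP
TPOT_1GPU_MS = 12.1        # time per output token, 1x GPU

comm_ms = NCCL_LATENCY_MS * NCCL_CALLS_PER_STEP   # 31.68 ms of all-reduce latency per step
comm_share = comm_ms / TPOT_TP2_MS                # about 0.825 of the TP step
other_ms = TPOT_TP2_MS - comm_ms                  # about 6.72 ms left for everything else


def tokens_per_second(tpot_ms: float) -> float:
    """Convert time per output token (ms) into decode throughput."""
    return 1000.0 / tpot_ms
```

Under these numbers, removing the communication latency entirely would leave about 6.7 ms per token, which is why a lower-latency interconnect such as NVLink is expected to change the picture.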

Whisper ASR (Apple Silicon Metal)

Model                  | 5-min audio | Realtime factor
whisper-large-v3-turbo | ~72s        | 4.2x realtime
whisper-tiny           | ~20s        | 15x realtime

Custom Whisper forward pass with rustfft STFT. Full decode pipeline: timestamp-based sequential decode, temperature fallback, compression ratio check. Mel precision matches Python whisper exactly.
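
Temperature fallback and the compression ratio check work together: a segment is decoded at a low temperature first, and retried at higher temperatures if the output looks like a repetition loop. The Python sketch below follows the heuristic used by the reference Python whisper implementation (zlib compression ratio with a 2.4 threshold). Ferrum's Rust code and thresholds are not shown in this README, so treat the constants and the decode callback as assumptions; the reference implementation's average log-probability check is omitted for brevity.

```python
import zlib
from typing import Callable, Sequence, Tuple

# Defaults from the reference Python whisper implementation; Ferrum's values are not stated.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def decode_with_fallback(
    decode: Callable[[float], str],
    temperatures: Sequence[float] = TEMPERATURES,
    threshold: float = COMPRESSION_RATIO_THRESHOLD,
) -> Tuple[str, float]:
    """Retry at increasing temperatures until the output passes the ratio check.

    `decode` is a hypothetical callback that decodes one audio segment
    at the given sampling temperature and returns its text.
    """
    text = ""
    temperature = temperatures[0]
    for temperature in temperatures:
        text = decode(temperature)
        if compression_ratio(text) <= threshold:
            break  # accept the first attempt that does not look degenerate
    return text, temperature
```

If every temperature fails the check, the sketch returns the last attempt rather than raising, which mirrors how the reference implementation keeps the final result.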

Qwen3-TTS (Apple Silicon Metal)

Text                      | Audio length | Synthesis time | RTF
29 chars Chinese          | 4.6s         | 11.3s          | 2.8x realtime
Voice clone (ICL, 5s ref) | 5.3s         | 13.1s          | 2.5x realtime

All-Metal fused transformer pipeline: custom GEMM (64×32 simdgroup tiles), fused residual+norm, flash attention with layer_scale. Full Mimi-based vocoder with 8-layer pre-transformer. Zero-copy on Apple Silicon unified memory.

Key Optimizations

  • Custom CUDA decode runner: bypasses candle for the decode hot path (Qwen3 + LLaMA)
  • INT4 quantization: GPTQ models auto-detected, Marlin fused INT4×FP16 kernel
  • Tensor parallelism: persistent per-rank threads, barrier sync, NCCL all-reduce (Megatron-LM pattern)
  • Batched attention kernel: single launch for all batch items (SM utilization 17%→67%)
  • Batched RoPE: per-item positions in single kernel launch
  • Custom CUDA kernels: fused RmsNorm, SiLU×mul, RoPE, decode attention (all on single stream)
  • Flash Decoding: split-K for long-context decode (auto at KV > 256)
  • Batch decode: batched cuBLAS GEMM + batched attention for concurrent requests
  • Metal TTS pipeline: all-Metal fused transformer for talker (28 layers) + SubTalker (5 layers) + vocoder (8 layers), cached GPU buffers, fused residual+norm kernel, layer_scale support
  • TTS voice clone: ICL prompting with speaker encoder (ECAPA-TDNN) + speech tokenizer (Mimi RVQ)
  • Paged KV attention: GPU block pool with block-table indirection
  • Double-buffered residual: cross-layer norm fusion (-108 kernel launches)
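
The split-K idea behind Flash Decoding can be illustrated without any GPU code. The sketch below is plain Python for a single query vector, not Ferrum's CUDA kernel: each KV chunk is attended to independently, and the partial results are merged by rescaling to a shared maximum score. The chunk size is a hypothetical parameter; the README only states that split-K is enabled automatically when the KV length exceeds 256.

```python
import math
from typing import List, Tuple

Vector = List[float]
Partial = Tuple[float, float, Vector]  # (max score, sum of exp, unnormalized output)


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Partial:
    """Attend to one KV chunk and keep the statistics needed for a later merge."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), out


def merge(parts: List[Partial]) -> Vector:
    """Combine per-chunk partials by rescaling each one to the global max score."""
    m_all = max(m for m, _, _ in parts)
    total = 0.0
    acc = [0.0] * len(parts[0][2])
    for m, l, out in parts:
        factor = math.exp(m - m_all)
        total += factor * l
        acc = [a + factor * o for a, o in zip(acc, out)]
    return [a / total for a in acc]


def split_k_decode(q: Vector, keys: List[Vector], values: List[Vector], chunk: int = 64) -> Vector:
    """Softmax attention for one query, computed chunk by chunk over the KV sequence."""
    parts = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    return merge(parts)
```

Because each chunk's partial result is independent, a GPU implementation can assign chunks to different thread blocks and run the merge as a small final reduction, which is what makes the approach attractive for long contexts.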

Current Status

What works:

  • CLI chat, HTTP serving with streaming, benchmarking
  • Qwen3, Qwen2/2.5, LLaMA 3.x, TinyLlama architectures
  • Custom CUDA decode runner for Qwen3 and LLaMA (2x speedup)
  • Metal GPU acceleration (macOS), CUDA (NVIDIA), CPU
  • INT4 GPTQ quantization with Marlin fused kernel (Blackwell compatible)
  • FlashAttention-2 prefill + custom CUDA decode runner
  • Paged KV cache with block reclamation
  • Continuous batching with batch decode
  • Tensor parallelism (multi-GPU NCCL, auto-detects GPU count)
  • CLIP/Chinese-CLIP/SigLIP embeddings (text + image, /v1/embeddings API)
  • Whisper ASR (speech-to-text, Metal accelerated, /v1/audio/transcriptions API)
  • Multi-format audio support (WAV/M4A/MP3/FLAC via ffmpeg)
  • Top-k/top-p/temperature/repetition-penalty sampling
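
For readers unfamiliar with paged KV caching, the following Python sketch shows the bookkeeping behind block-table indirection and block reclamation. It is a simplified model, not Ferrum's implementation: the BlockPool class, its method names, and the block size of 16 tokens are all hypothetical, and the real cache stores GPU tensors rather than indices.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per physical block; hypothetical value


class BlockPool:
    """Fixed pool of physical KV blocks shared by all sequences."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[int, List[int]] = {}   # sequence id -> block table
        self.lengths: Dict[int, int] = {}        # sequence id -> tokens stored

    def append_token(self, seq_id: int) -> Tuple[int, int]:
        """Reserve a slot for one more token; return (physical block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def lookup(self, seq_id: int, position: int) -> Tuple[int, int]:
        """Translate a logical token position through the block table."""
        if position >= self.lengths[seq_id]:
            raise IndexError("position beyond stored tokens")
        return self.tables[seq_id][position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool (block reclamation)."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because sequences only hold whole blocks and return them on completion, memory freed by one request can be reused immediately by another, which is what lets continuous batching keep the GPU busy.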

Roadmap

  • Speculative decoding — draft model verification
  • More model architectures — Mistral, Phi, DeepSeek, etc.
  • Qwen2 CUDA runner — same pattern as LLaMA

See docs/ROADMAP.md for full details.

Build Options

# CPU only (default)
cargo install ferrum-cli

# With Metal acceleration (macOS)
cargo install ferrum-cli --features metal

# With CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
cargo install ferrum-cli --features cuda

Or build from source:

cargo build --release -p ferrum-cli                    # CPU
cargo build --release -p ferrum-cli --features metal   # Metal (macOS)
cargo build --release -p ferrum-cli --features cuda    # CUDA (NVIDIA); multi-GPU is auto-detected when available

Prerequisites: Rust stable toolchain.

Project Structure

crates/
├── ferrum-types          # Shared type definitions
├── ferrum-interfaces     # Core trait contracts (ComputeBackend, KernelOps, ModelExecutor)
├── ferrum-runtime        # Backend implementations (Candle, CPU)
├── ferrum-engine         # Metal kernels, model orchestration
├── ferrum-models         # Model architectures (LLaMA, Qwen2, Qwen3, BERT, Whisper)
├── ferrum-cuda-kernels   # Custom CUDA kernels + decode runner
├── ferrum-tokenizer      # Tokenization
├── ferrum-sampler        # Sampling strategies
├── ferrum-scheduler      # Request scheduling
├── ferrum-kv             # KV cache management
├── ferrum-server         # HTTP API server
├── ferrum-cli            # CLI binary
└── ferrum-testkit        # Testing utilities

License

MIT