# Ferrum Infer

A Rust-native LLM inference engine. Load models from Hugging Face, chat locally, or serve an OpenAI-compatible API. Single binary, no Python, no runtime dependencies.
## Install

Install from crates.io with Cargo, or build from source.
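A minimal sketch; the published crate name is assumed to match the `ferrum-cli` binary crate listed under Project Structure:

```bash
# From crates.io (crate name assumed)
cargo install ferrum-cli

# Or build from source, inside a checkout of this repository
cargo build --release
```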
## Quick Start
For gated models (e.g. Llama 3.2), set your Hugging Face token first:
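The exact environment variable read at download time is an assumption; `HF_TOKEN` is the usual Hugging Face convention:

```bash
export HF_TOKEN=<your-hugging-face-token>
```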
Then pull a model, chat with it locally, or start an API server:
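For example, using the `llama3.2:3b` alias from the table below and the commands listed in the Commands section (a sketch of the typical flow):

```bash
# Download a model
ferrum pull llama3.2:3b

# Chat
ferrum run llama3.2:3b

# Or start an API server
ferrum serve --model llama3.2:3b
```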
## Supported Models

| Alias | Model | Architecture | CUDA Runner |
|---|---|---|---|
| `qwen3:0.6b` / `1.7b` / `4b` | Qwen3 | Qwen3 | Yes |
| `qwen2.5:0.5b` / `1.5b` / `3b` / `7b` | Qwen2.5-Instruct | Qwen2 | — |
| `llama3.2:1b` / `3b` | Llama-3.2-Instruct | LLaMA | Yes |
| `tinyllama` | TinyLlama-1.1B-Chat | LLaMA | Yes |
GPTQ INT4 quantized models are auto-detected and use the Marlin fused kernel. Any Hugging Face model ID with a supported architecture also works directly:
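For example (the GPTQ repo ID below is a placeholder, not a tested checkpoint):

```bash
# Run a model directly by Hugging Face ID (supported architecture required)
ferrum run Qwen/Qwen2.5-1.5B-Instruct

# A GPTQ INT4 checkpoint is detected automatically (placeholder repo ID)
ferrum run <org>/<model>-GPTQ-Int4
```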
## Commands

| Command | Description |
|---|---|
| `ferrum run <model>` | Interactive chat |
| `ferrum serve --model <model>` | OpenAI-compatible HTTP server |
| `ferrum stop` | Stop running server |
| `ferrum pull <model>` | Download model from Hugging Face |
| `ferrum list` | Show cached models |
| `ferrum bench <model>` | Performance benchmark |
| `ferrum embed <model>` | Generate BERT embeddings |
## API Endpoints

The server exposes OpenAI-compatible chat completions, a model-listing endpoint, and a health check.
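A sketch assuming the server binds to `localhost:8000`, exposes a `/health` route, and accepts the same model aliases as the CLI (none of these are specified here); the `/v1/*` paths follow the OpenAI API the server mirrors:

```bash
# Chat completions (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:4b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://localhost:8000/v1/models

# Health check (route name assumed)
curl http://localhost:8000/health
```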
## Performance
Benchmarked on RTX PRO 6000 (Blackwell):
### Qwen3-4B
| Mode | FP16 | INT4 (GPTQ + Marlin) |
|---|---|---|
| Single request decode | 88.1 tok/s | 130.4 tok/s (+48%) |
| 4 concurrent (batch) | 109.4 tok/s | 124.2 tok/s |
| VRAM | ~8 GB | ~2.5 GB (-69%) |
### TinyLlama-1.1B (LLaMA architecture)
| Mode | Candle | CUDA Runner |
|---|---|---|
| Decode | 126 tok/s | 256.5 tok/s (+103%) |
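The figures above should be reproducible with the built-in benchmark command; exact flags, sequence lengths, and batch settings are not recorded here, so treat this as a sketch:

```bash
ferrum bench qwen3:4b
ferrum bench tinyllama
```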
## Key Optimizations
- Custom CUDA decode runner: bypasses candle for the decode hot path (Qwen3 + LLaMA)
- INT4 quantization: GPTQ models auto-detected, Marlin fused INT4×FP16 kernel
- Batched attention kernel: single launch for all batch items (SM utilization 17%→67%)
- Batched RoPE: per-item positions in single kernel launch
- Custom CUDA kernels: fused RmsNorm, SiLU×mul, RoPE, decode attention (all on single stream)
- Flash Decoding: split-K for long-context decode (auto at KV > 256)
- Batch decode: batched cuBLAS GEMM + batched attention for concurrent requests
- Paged KV attention: GPU block pool with block-table indirection (see the sketch after this list)
- Double-buffered residual: cross-layer norm fusion (-108 kernel launches)
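A minimal Rust sketch of the block-table indirection idea behind the paged KV cache; the types, constants, and method names are illustrative, not the actual `ferrum-kv` API:

```rust
// Illustrative only, not the ferrum-kv API: logical token positions map to
// fixed-size physical blocks drawn from a shared pool, so a sequence's KV
// cache can grow without contiguous reallocation, and its blocks are
// reclaimed when the sequence finishes.
const BLOCK_SIZE: usize = 16; // tokens per KV block (illustrative value)

struct BlockPool {
    free: Vec<u32>, // indices of free physical blocks
}

impl BlockPool {
    fn new(num_blocks: u32) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }
    fn release(&mut self, blocks: &[u32]) {
        self.free.extend_from_slice(blocks); // block reclamation
    }
}

struct Sequence {
    block_table: Vec<u32>, // logical block index -> physical block index
    len: usize,            // tokens written so far
}

impl Sequence {
    /// Resolve a logical token position to (physical block, offset in block).
    fn slot(&self, pos: usize) -> (u32, usize) {
        (self.block_table[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
    /// Reserve the slot for the next token, growing the block table on demand.
    fn append(&mut self, pool: &mut BlockPool) -> Option<(u32, usize)> {
        if self.len % BLOCK_SIZE == 0 {
            self.block_table.push(pool.alloc()?);
        }
        let slot = self.slot(self.len);
        self.len += 1;
        Some(slot)
    }
}

fn main() {
    let mut pool = BlockPool::new(64);
    let mut seq = Sequence { block_table: Vec::new(), len: 0 };
    for _ in 0..40 {
        seq.append(&mut pool).expect("pool exhausted");
    }
    // 40 tokens at 16 tokens/block -> 3 physical blocks in use
    println!("blocks in use: {}", seq.block_table.len());
    pool.release(&seq.block_table); // reclaim when the sequence completes
}
```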
## Current Status
What works:
- CLI chat, HTTP serving with streaming, benchmarking
- Qwen3, Qwen2/2.5, LLaMA 3.x, TinyLlama architectures
- Custom CUDA decode runner for Qwen3 and LLaMA (2x speedup)
- Metal GPU acceleration (macOS), CUDA (NVIDIA), CPU
- INT4 GPTQ quantization with Marlin fused kernel (Blackwell compatible)
- FlashAttention-2 prefill + custom CUDA decode runner
- Paged KV cache with block reclamation
- Continuous batching with batch decode
- Top-k/top-p/temperature/repetition-penalty sampling
## Roadmap
- Tensor parallelism — multi-GPU via NCCL
- Speculative decoding — draft model verification
- More model architectures — Mistral, Phi, DeepSeek, etc.
- Qwen2 CUDA runner — same pattern as LLaMA
See docs/ROADMAP.md for full details.
## Build Options

Three build variants are available: CPU only (the default), Metal acceleration (macOS), and CUDA acceleration (NVIDIA, requires the CUDA toolkit and nvcc).
Prerequisites: Rust stable toolchain.
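The feature names below (`metal`, `cuda`) are assumptions; check the workspace Cargo.toml for the exact flags. The same flags apply to `cargo install`:

```bash
# CPU only (default)
cargo build --release

# With Metal acceleration (macOS)
cargo build --release --features metal

# With CUDA acceleration (NVIDIA, requires CUDA toolkit + nvcc)
cargo build --release --features cuda
```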
## Project Structure
crates/
├── ferrum-types # Shared type definitions
├── ferrum-interfaces # Core trait contracts (ComputeBackend, KernelOps, ModelExecutor)
├── ferrum-runtime # Backend implementations (Candle, CPU)
├── ferrum-engine # Metal kernels, model orchestration
├── ferrum-models # Model architectures (LLaMA, Qwen2, Qwen3, BERT)
├── ferrum-cuda-kernels # Custom CUDA kernels + decode runner
├── ferrum-tokenizer # Tokenization
├── ferrum-sampler # Sampling strategies
├── ferrum-scheduler # Request scheduling
├── ferrum-kv # KV cache management
├── ferrum-server # HTTP API server
├── ferrum-cli # CLI binary
└── ferrum-testkit # Testing utilities
## License
MIT