ForgeLLM
Compile your LLMs, don't interpret them.
ForgeLLM is a Rust-native ahead-of-time (AOT) ML compiler for language models (1M-7B parameters). It compiles GGUF models into optimized, self-contained binaries with native Metal GPU acceleration — no runtime interpreter, no Python dependencies, no dynamic dispatch.
Faster than llama.cpp on Apple Silicon.
Documentation | Crates.io | forgellm.dev | Blog: How we beat llama.cpp
Performance
Benchmarks on Apple M5 Pro, 8-bit quantization, 64-token generation.
Generation Speed (tok/s)
| Model | ForgeLLM Metal | MLX (8-bit) | llama.cpp (Q8_0) | vs MLX | vs llama.cpp |
|---|---|---|---|---|---|
| SmolLM2-135M | 496 tok/s | 414 tok/s | 481 tok/s | 1.20x | 1.03x |
| SmolLM2-360M | 289 tok/s | 264 tok/s | 267 tok/s | 1.09x | 1.08x |
| Llama-3.2-1B | 178 tok/s | 111 tok/s | 130 tok/s | 1.60x | 1.37x |
| Llama-3.2-3B | 70.4 tok/s | 42.2 tok/s | 67.8 tok/s | 1.67x | 1.04x |
Prefill Speed (tok/s, long prompt)
| Model | ForgeLLM Metal | MLX (8-bit) | llama.cpp (Q8_0) |
|---|---|---|---|
| SmolLM2-135M (~130 tok) | 3,173 | 1,507 | 2,812 |
| SmolLM2-135M (~1250 tok) | 9,335 | — | — |
| Llama-3.2-1B (~325 tok) | 475 | 2,718 | 556 |
Deploy Size
| Model | Binary | Weights | Total |
|---|---|---|---|
| SmolLM2-135M | 3.7 MB | 244 MB | 248 MB |
| Llama-3.2-1B | 3.7 MB | 2.2 GB | 2.2 GB |
| Llama-3.2-3B | 3.7 MB | 4.6 GB | 4.9 GB |
Binary size is constant across all models. Compare: llama.cpp ~15 MB, MLX ~500 MB Python runtime.
We beat MLX and llama.cpp on generation at every model size tested, and on prefill for the small SmolLM2-135M model. Prefill at 1B parameters is the exception: on Llama-3.2-1B, MLX leads by a wide margin (2,718 vs 475 tok/s) and llama.cpp is also ahead (556 tok/s). MLX's lead comes from Apple Accelerate BLAS; closing that gap requires hardware matrix multiply instructions (simdgroup_multiply_accumulate).
See benchmarks/HISTORY.md and blog/beating-llama-cpp.md for details.
Quick Start
Metal GPU (Apple Silicon)
# Build from source
# Compile model to Metal binary
# Build and run
API Server
# Start OpenAI-compatible server
# Query it
CPU (cross-platform)
# Compile for CPU with NEON SIMD + Rayon parallelism
Why ForgeLLM is faster
Most LLM inference engines, including llama.cpp, vLLM, and MLX, load model weights at runtime and execute a generic inference loop. This is like shipping a Python interpreter when you could ship a compiled binary.
ForgeLLM compiles models into hardware-specific code:
| | llama.cpp (interpreter) | ForgeLLM (compiler) |
|---|---|---|
| Dispatch | Runtime graph build + plan + execute | Direct function calls, zero overhead |
| Dimensions | Dynamic (runtime checks) | Baked in at compile time |
| GPU commands | Multiple command encoders per layer | Single encoder for entire forward pass |
| Projections | Separate Q, K, V matmuls | Fused QKV in one dispatch |
| Memory | Runtime allocation | Static, pre-allocated buffers |
| Quantization | Dequant at load time | Native Q8_0/Q4_0 GPU kernels |
| Output | Shared library + runtime | Self-contained binary, deploy with scp |
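To make the "Dimensions" and "Memory" rows concrete, here is a minimal Rust sketch of the general idea, not ForgeLLM's generated code. The function names `matvec_static` and `matvec_dynamic` are hypothetical. The first bakes the shape into the type with const generics and writes into a caller-owned buffer; the second receives shapes as runtime values, validates them on every call, and allocates its output.

```rust
/// Matrix-vector product with dimensions fixed at compile time.
/// ROWS and COLS are const generics, so loop bounds and buffer sizes are
/// known to the compiler and no shape checks run at call time.
/// (Hypothetical helper for illustration only.)
pub fn matvec_static<const ROWS: usize, const COLS: usize>(
    w: &[[f32; COLS]; ROWS],
    x: &[f32; COLS],
    out: &mut [f32; ROWS], // pre-allocated by the caller, reused across calls
) {
    for (row, o) in w.iter().zip(out.iter_mut()) {
        *o = row.iter().zip(x.iter()).map(|(a, b)| a * b).sum::<f32>();
    }
}

/// Interpreter-style counterpart: shapes arrive as runtime values,
/// are validated on every call, and the output is heap-allocated.
pub fn matvec_dynamic(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols, "weight shape mismatch");
    assert_eq!(x.len(), cols, "input length mismatch");
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}
```

Both functions compute the same result. With the static version, passing a vector of the wrong length is a compile error rather than a runtime panic.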
Compilation Targets
| Target | Command | Features |
|---|---|---|
| Metal GPU | `--target metal` | Native MSL shaders, simdgroup reductions, Q8_0/Q4_0 kernels, API server |
| CPU | `--target cpu` | NEON sdot inline asm, Rayon parallelism, Apple AMX via Accelerate |
| WASM | `--target wasm` | SIMD128, wasm-bindgen exports, browser-ready |
| wgpu/WGSL | `--target gpu` | Cross-platform GPU via WebGPU |
Supported Models
| Architecture | Models | Status |
|---|---|---|
| LlamaForCausalLM | SmolLM2 (135M, 360M, 1.7B), Llama 3.2 (1B, 3B), TinyLlama | Verified |
| Qwen2ForCausalLM | Qwen2.5 (0.5B-7B) | Verified |
| MistralForCausalLM | Mistral 7B (sliding-window attention) | Supported |
| Phi3ForCausalLM | Phi-3 Mini | Supported |
| GemmaForCausalLM | Gemma 2B, 7B | Supported |
| StableLMForCausalLM | StableLM 1.6B, 3B | Supported |
Supports GGUF quantization formats: F32, F16, BF16, Q8_0, Q4_0, Q4_1, Q2_K through Q8_K. Also supports SafeTensors and LoRA adapter merging at compile time.
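As a reference point for the Q8_0 format listed above, the following is a simplified CPU sketch of a block-quantized dot product. It assumes the standard GGUF Q8_0 layout of 32 signed 8-bit values sharing one scale; GGUF stores that scale as f16, while this sketch uses f32 to stay within the standard library. The names `Q8Block`, `quantize_block`, and `dot_q8` are hypothetical and are not ForgeLLM APIs.

```rust
/// Number of weights per Q8_0 block (32 in GGUF).
pub const BLOCK_LEN: usize = 32;

/// One block: 32 signed 8-bit quants sharing a single scale.
/// Assumption: f32 scale for simplicity; GGUF uses f16 here.
#[derive(Clone, Copy, Debug)]
pub struct Q8Block {
    pub scale: f32,
    pub quants: [i8; BLOCK_LEN],
}

/// Quantize 32 floats into one block using scale = max|x| / 127.
pub fn quantize_block(x: &[f32; BLOCK_LEN]) -> Q8Block {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK_LEN];
    for (q, v) in quants.iter_mut().zip(x.iter()) {
        *q = (*v * inv).round() as i8;
    }
    Q8Block { scale, quants }
}

/// Dot product of a quantized weight row with f32 activations.
/// Each weight is dequantized inside the loop (quant * scale), so the
/// full-precision row is never materialized in memory.
pub fn dot_q8(row: &[Q8Block], x: &[f32]) -> f32 {
    assert_eq!(x.len(), row.len() * BLOCK_LEN, "activation length mismatch");
    row.iter()
        .zip(x.chunks_exact(BLOCK_LEN))
        .map(|(blk, xs)| {
            let acc: f32 = blk
                .quants
                .iter()
                .zip(xs)
                .map(|(&q, &v)| q as f32 * v)
                .sum();
            acc * blk.scale
        })
        .sum::<f32>()
}
```

In the GGUF layout a Q8_0 block occupies 34 bytes for 32 weights (a 2-byte scale plus 32 quants), compared with 64 bytes for the same weights in F16.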
Metal GPU Features
The Metal backend generates optimized Apple Silicon compute shaders:
- Simdgroup cooperative matmul: 32-lane SIMD reductions with shared memory vector caching
- Native Q8_0/Q4_0 kernels: Dequantize on-the-fly during matmul; for Q8_0 this roughly halves weight memory traffic relative to F16
- Fused projections: QKV and gate+up concatenated into single matmul dispatches
- Single compute encoder: Entire forward pass in one encoder, with no encoder transitions
- Double-buffered prefill: GPU overlaps with CPU encoding
- `fast::math`: Hardware-accelerated rsqrt/exp in normalization and attention
- OpenAI-compatible API: `--serve` mode with SSE streaming
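The fused-projection idea from the list above can be sketched on the CPU in a few lines of Rust. This is an illustration under stated assumptions, not the Metal kernel: weights are row-major f32, and `matvec`, `fuse_qkv`, and `qkv_fused` are hypothetical names. Stacking the Q, K, and V weight matrices vertically lets one matrix-vector product yield all three outputs, which are then split by offset.

```rust
/// Row-major matrix-vector product: `w` holds `rows` rows of `x.len()` columns.
pub fn matvec(w: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols, "weight shape mismatch");
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Stack Wq, Wk, Wv vertically so a single matvec produces [q | k | v].
/// In an ahead-of-time setting this concatenation can be done once, offline.
pub fn fuse_qkv(wq: &[f32], wk: &[f32], wv: &[f32]) -> Vec<f32> {
    [wq, wk, wv].concat()
}

/// One product instead of three: run the fused matvec, then split the output.
/// `q_dim` is the Q output size; `kv_dim` is the size of K and of V.
pub fn qkv_fused(
    w_qkv: &[f32],
    q_dim: usize,
    kv_dim: usize,
    x: &[f32],
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let out = matvec(w_qkv, q_dim + 2 * kv_dim, x);
    let (q, rest) = out.split_at(q_dim);
    let (k, v) = rest.split_at(kv_dim);
    (q.to_vec(), k.to_vec(), v.to_vec())
}
```

The result matches three separate products exactly, because each output row is still computed from the same weights and input. The saving is in dispatch count and in reading the input vector once.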
CLI Commands
# AOT compile to Metal GPU binary
# AOT compile to CPU binary
# Export weights for compiled binary
# Run interpreter (no compilation)
# Interactive chat
# Start API server (interpreter mode)
# Benchmark
# Inspect model
# ONNX export
# Speculative decoding
Architecture
GGUF/SafeTensors → Frontend → IR Graph → Optimizer → Codegen → Binary
                   parse      build      fuse        emit      compile
8 crates: forgellm-frontend, forgellm-optimizer, forgellm-codegen-cpu, forgellm-codegen-wasm, forgellm-codegen-gpu, forgellm-codegen-metal, forgellm-runtime, forgellm-cli
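The optimizer and codegen stages of this pipeline can be illustrated with a toy example. Everything below is hypothetical: the `Op` enum, `fuse_projections`, and `emit` are invented for this sketch and do not reflect the actual types in the ForgeLLM crates. It shows only the general shape of a fusion pass that merges adjacent projections of the same input, followed by a code emitter that prints one line per IR node.

```rust
/// Toy IR node. Names are illustrative, not taken from the ForgeLLM crates.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    /// Projection of `input` by the named weight.
    MatMul { input: String, weight: String },
    /// Several projections of the same input merged into one dispatch.
    FusedMatMul { input: String, weights: Vec<String> },
    /// Any op the fusion pass leaves alone.
    Other(String),
}

/// Optimizer stage: merge runs of adjacent MatMuls that share an input.
pub fn fuse_projections(graph: Vec<Op>) -> Vec<Op> {
    let mut out: Vec<Op> = Vec::new();
    for op in graph {
        match (out.pop(), op) {
            // Extend an existing fused node with another projection.
            (Some(Op::FusedMatMul { input, mut weights }), Op::MatMul { input: next, weight })
                if input == next =>
            {
                weights.push(weight);
                out.push(Op::FusedMatMul { input, weights });
            }
            // Two adjacent projections of the same input start a fused node.
            (Some(Op::MatMul { input, weight: first }), Op::MatMul { input: next, weight })
                if input == next =>
            {
                out.push(Op::FusedMatMul { input, weights: vec![first, weight] });
            }
            // Anything else passes through unchanged.
            (prev, op) => {
                out.extend(prev);
                out.push(op);
            }
        }
    }
    out
}

/// Codegen stage: emit one line of pseudo-source per IR node.
pub fn emit(graph: &[Op]) -> String {
    graph
        .iter()
        .map(|op| match op {
            Op::MatMul { input, weight } => format!("matmul({input}, {weight});\n"),
            Op::FusedMatMul { input, weights } => {
                format!("matmul_fused({input}, [{}]);\n", weights.join(", "))
            }
            Op::Other(name) => format!("{name}();\n"),
        })
        .collect()
}
```

Given a normalization op followed by three projections of `x` and one projection of `attn`, the pass collapses the three into a single fused node and leaves the rest untouched.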
Contributing
License
MIT