ForgeLLM

Compile your LLMs, don't interpret them.

ForgeLLM is a Rust-native ahead-of-time (AOT) ML compiler for language models (1M-7B parameters). It compiles GGUF models into optimized, self-contained binaries with native Metal GPU acceleration — no runtime interpreter, no Python dependencies, no dynamic dispatch.

Faster than llama.cpp on Apple Silicon.

Documentation | Crates.io | forgellm.dev | Blog: How we beat llama.cpp

Performance

Benchmarks on Apple M5 Pro, 8-bit quantization, 64-token generation.

Generation Speed (tok/s)

Model	ForgeLLM Metal	MLX (8-bit)	llama.cpp (Q8_0)	vs MLX	vs llama.cpp
SmolLM2-135M	496 tok/s	414 tok/s	481 tok/s	1.20x	1.03x
SmolLM2-360M	289 tok/s	264 tok/s	267 tok/s	1.09x	1.08x
Llama-3.2-1B	178 tok/s	111 tok/s	130 tok/s	1.60x	1.37x
Llama-3.2-3B	70.4 tok/s	42.2 tok/s	67.8 tok/s	1.67x	1.04x

Prefill Speed (tok/s, long prompt)

Model	ForgeLLM Metal	MLX (8-bit)	llama.cpp (Q8_0)
SmolLM2-135M (~130 tok)	3,173	1,507	2,812
SmolLM2-135M (~1250 tok)	9,335	—	—
Llama-3.2-1B (~325 tok)	475	2,718	556

Deploy Size

Model	Binary	Weights	Total
SmolLM2-135M	3.7 MB	244 MB	248 MB
Llama-3.2-1B	3.7 MB	2.2 GB	2.2 GB
Llama-3.2-3B	3.7 MB	4.6 GB	4.9 GB

Binary size is constant across all models. Compare: llama.cpp ~15 MB, MLX ~500 MB Python runtime.

We beat MLX and llama.cpp on generation across all model sizes, and on prefill for small-to-medium models. For very large models (1B+), MLX's Apple Accelerate BLAS leads on prefill — closing that gap requires hardware matrix multiply instructions (simdgroup_multiply_accumulate).

See benchmarks/HISTORY.md and blog/beating-llama-cpp.md for details.

Quick Start

Metal GPU (Apple Silicon)

# Build from source
git clone https://github.com/sauravpanda/forge-llm.git
cd forge-llm && cargo build --release

# Compile model to Metal binary
forge compile --model model.gguf --output ./my-model --target metal
forge export-weights --model model.gguf --output ./my-model/weights.bin
cp tokenizer.json ./my-model/

# Build and run
cd my-model && cargo build --release
./target/release/my-model weights.bin tokenizer.json "The meaning of life is"

API Server

# Start OpenAI-compatible server
./target/release/my-model weights.bin tokenizer.json --serve --port 8080

# Query it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "stream": true}'

CPU (cross-platform)

# Compile for CPU with NEON SIMD + Rayon parallelism
forge compile --model model.gguf --output ./my-model --target cpu --run

Why ForgeLLM is faster

Every existing LLM inference engine — llama.cpp, vLLM, MLX — loads model weights at runtime and executes a generic inference loop. This is like shipping a Python interpreter when you could ship a compiled binary.

ForgeLLM compiles models into hardware-specific code:

	llama.cpp (interpreter)	ForgeLLM (compiler)
Dispatch	Runtime graph build + plan + execute	Direct function calls, zero overhead
Dimensions	Dynamic (runtime checks)	Baked in at compile time
GPU commands	Multiple command encoders per layer	Single encoder for entire forward pass
Projections	Separate Q, K, V matmuls	Fused QKV in one dispatch
Memory	Runtime allocation	Static, pre-allocated buffers
Quantization	Dequant at load time	Native Q8_0/Q4_0 GPU kernels
Output	Shared library + runtime	Self-contained binary, deploy with `scp`

Compilation Targets

Target	Command	Features
Metal GPU	`--target metal`	Native MSL shaders, simdgroup reductions, Q8_0/Q4_0 kernels, API server
CPU	`--target cpu`	NEON sdot inline asm, Rayon parallelism, Apple AMX via Accelerate
WASM	`--target wasm`	SIMD128, wasm-bindgen exports, browser-ready
wgpu/WGSL	`--target gpu`	Cross-platform GPU via WebGPU

Supported Models

Architecture	Models	Status
LlamaForCausalLM	SmolLM2 (135M, 360M, 1.7B), Llama 3.2 (1B, 3B), TinyLlama	Verified
Qwen2ForCausalLM	Qwen2.5 (0.5B-7B)	Verified
MistralForCausalLM	Mistral 7B (sliding-window attention)	Supported
Phi3ForCausalLM	Phi-3 Mini	Supported
GemmaForCausalLM	Gemma 2B, 7B	Supported
StableLMForCausalLM	StableLM 1.6B, 3B	Supported

Supports GGUF quantization formats: F32, F16, BF16, Q8_0, Q4_0, Q4_1, Q2_K through Q8_K. Also supports SafeTensors and LoRA adapter merging at compile time.

Metal GPU Features

The Metal backend generates optimized Apple Silicon compute shaders:

Simdgroup cooperative matmul — 32-lane SIMD reductions with shared memory vector caching
Native Q8_0/Q4_0 kernels — Dequantize on-the-fly during matmul, halving memory bandwidth
Fused projections — QKV and gate+up concatenated into single matmul dispatches
Single compute encoder — Entire forward pass in one encoder, zero transitions
Double-buffered prefill — GPU overlaps with CPU encoding
fast:: math — Hardware-accelerated rsqrt/exp in normalization and attention
OpenAI-compatible API — --serve mode with SSE streaming

CLI Commands

# AOT compile to Metal GPU binary
forge compile --model model.gguf --output ./out --target metal

# AOT compile to CPU binary
forge compile --model model.gguf --output ./out --target cpu --run

# Export weights for compiled binary
forge export-weights --model model.gguf --output ./out/weights.bin

# Run interpreter (no compilation)
forge run --model model.gguf --tokenizer tokenizer.json --prompt "Hello"

# Interactive chat
forge chat --model model.gguf --tokenizer tokenizer.json

# Start API server (interpreter mode)
forge serve --model model.gguf --tokenizer tokenizer.json --port 8080

# Benchmark
forge bench --model model.gguf --tokenizer tokenizer.json --num-tokens 128

# Inspect model
forge info model.gguf

# ONNX export
forge export-onnx --model model.gguf --output model.onnx

# Speculative decoding
forge speculative --draft small.gguf --target-model large.gguf --output ./spec

Architecture

GGUF/SafeTensors → Frontend → IR Graph → Optimizer → Codegen → Binary
                     parse      build      fuse       emit     compile

8 crates: forgellm-frontend, forgellm-optimizer, forgellm-codegen-cpu, forgellm-codegen-wasm, forgellm-codegen-gpu, forgellm-codegen-metal, forgellm-runtime, forgellm-cli

Contributing

cargo test --workspace --exclude forgellm-python  # 258+ tests
cargo clippy --workspace -- -D warnings
cargo fmt --all -- --check

License

MIT

forgellm-frontend 0.5.2