Table of Contents
- What is realizar?
- Installation
- Usage
- Features
- Benchmarks
- Quality
- Sovereign AI Stack
- Documentation
- Contributing
- License
What is realizar?
realizar is a pure Rust LLM inference engine. It loads models in APR v2, GGUF, and SafeTensors formats, runs transformer inference with quantized kernels (Q4_K through Q8_0), and serves predictions over an OpenAI-compatible REST API.
Key design decisions:
- Row-major mandate -- All tensors are row-major internally. GGUF column-major data is transposed at import by aprender. This matches PyTorch/SafeTensors layout and simplifies kernel implementations.
- Pure Rust CUDA -- GPU acceleration via trueno-gpu generates PTX directly from Rust. No nvcc, no LLVM, no C++ dependencies.
- Cost-based dispatch -- Backend selection (GPU/SIMD/scalar) uses a 5x PCIe cost model to avoid GPU overhead on small workloads.
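The cost-based dispatch idea can be sketched as follows. This is a minimal illustration of the concept, not realizar's actual API; the function name, thresholds, and enum are assumptions, with only the 5x PCIe cost factor taken from the text above.

```rust
// Illustrative cost-based backend selection: PCIe transfer is modeled as
// roughly 5x the cost of local memory access, so the GPU is chosen only
// when the arithmetic work amortizes the transfer.
#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Simd,
    Scalar,
}

fn select_backend(flops: u64, bytes_moved: u64) -> Backend {
    const PCIE_COST_FACTOR: u64 = 5; // relative cost of PCIe vs. local memory
    let transfer_cost = bytes_moved * PCIE_COST_FACTOR;
    if flops > transfer_cost * 100 {
        Backend::Gpu // compute-bound: GPU launch + transfer overhead amortized
    } else if flops > 1_000 {
        Backend::Simd // medium workloads: vectorized CPU path
    } else {
        Backend::Scalar // tiny workloads: avoid all dispatch overhead
    }
}
```

The exact thresholds would be tuned per hardware; the point is that backend choice is a function of work size, not a global switch.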
Installation
CLI

```shell
# Install the CLI from crates.io (assumes the crate ships a binary target)
cargo install realizar
```

Library

Add to your Cargo.toml:

```toml
[dependencies]
realizar = "0.8"
```
Usage
Serving
```shell
# Start demo server (exact flags are illustrative; see `realizar --help`)
realizar serve --demo

# Health check (host/port are assumptions)
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```
OpenAI-Compatible API
```shell
# Chat completions (request shape follows the OpenAI API; host/port are assumptions)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming: same endpoint, with "stream": true added to the request body
```
Library API
```rust
// NOTE: the import path and call arguments were stripped from the original
// README; the placeholders below mark the elisions.
// use realizar::...;

let template = auto_detect_template(/* model metadata */);
let messages = vec![/* ChatMessage values */];
let formatted = template.format_conversation(&messages)?;
```
Tracing
Use the X-Trace-Level header for inference debugging:
```shell
# Header values below are inferred from the level names in the original docs.

# Brick-level: token-by-token timing
-H "X-Trace-Level: brick"

# Layer-level: per-layer timing breakdown
-H "X-Trace-Level: layer"
```
Features
Model Formats
| Format | Description |
|---|---|
| APR v2 | Native format with LZ4/ZSTD compression, zero-copy loading, Int4/Int8 quantization |
| GGUF | llama.cpp-compatible quantized models |
| SafeTensors | HuggingFace full-precision format |
GPU Kernels
| Kernel | Purpose |
|---|---|
| GemmKernel | Matrix multiplication (naive, tiled, tensor core) |
| AttentionKernel | FlashAttention-style tiled attention |
| SoftmaxKernel | Numerically stable with warp shuffle |
| LayerNormKernel | Fused layer normalization |
| QuantizeKernel | Q4_K dequantization fused with matmul |
| Q5KKernel | Q5_K dequantization |
| Q6KKernel | Q6_K dequantization |
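As a reference for what the SoftmaxKernel computes, here is a scalar sketch of numerically stable softmax; the GPU kernel performs the same max/sum reductions with warp shuffles. This is a plain CPU illustration, not the kernel code.

```rust
// Numerically stable softmax: subtracting the row max before
// exponentiating prevents overflow for large logits without
// changing the result (the max cancels in the normalization).
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```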
Quantization
Q4_0, Q8_0, Q4_K, Q5_K, Q6_K -- SIMD-accelerated on AVX2, AVX-512, and NEON. GPU dequantization fused with matrix operations to avoid memory round-trips.
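To make the block-quantization scheme concrete, here is an illustrative Q8_0-style round trip: 32 values per block, one scale plus 32 signed 8-bit codes. This mirrors the GGUF Q8_0 layout in spirit only; the real format stores the scale as f16 and the kernels use SIMD.

```rust
// Quantize one block of 32 f32 values to (scale, i8 codes),
// where scale = max_abs / 127 so the largest value maps to +/-127.
fn quantize_q8_0(block: &[f32; 32]) -> (f32, [i8; 32]) {
    let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let mut q = [0i8; 32];
    for (i, &x) in block.iter().enumerate() {
        q[i] = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

// Dequantize: each code scales back to an f32 approximation.
fn dequantize_q8_0(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &v) in q.iter().enumerate() {
        out[i] = v as f32 * scale;
    }
    out
}
```

The round-trip error per value is bounded by half a quantization step, which is why fusing dequantization into the matmul (instead of materializing f32 tensors) costs accuracy-neutral bandwidth savings.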
KV Cache
Autoregressive decoding with persistent key-value cache. Supports grouped-query attention (GQA) for models like Qwen2.5 and Llama 3.
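The GQA head mapping can be sketched in a few lines. This is an illustration of the standard grouping rule, not realizar's internal code: several query heads share one key/value head, shrinking the KV cache by the grouping factor.

```rust
// Map a query head to its shared KV head under grouped-query attention.
// With 12 query heads and 2 KV heads, heads 0..6 read KV head 0 and
// heads 6..12 read KV head 1 (a 6:1 grouping).
fn kv_head_for(query_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_q_heads % n_kv_heads == 0, "query heads must group evenly");
    query_head / (n_q_heads / n_kv_heads)
}
```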
Chat Templates
Automatic template detection from model metadata:
| Format | Models | System Prompt |
|---|---|---|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
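As an example of what one of these templates produces, here is a minimal sketch of ChatML formatting (the convention used by Qwen2, Yi, and OpenHermes). The tag layout is the standard ChatML convention; the function itself is illustrative and not realizar's template API.

```rust
// Format (role, content) pairs in ChatML and append a generation
// prompt so the model continues as the assistant.
fn format_chatml(messages: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", role, content));
    }
    out.push_str("<|im_start|>assistant\n"); // generation prompt
    out
}
```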
Feature Flags
| Flag | Description |
|---|---|
| default | server + cli + gpu |
| cuda | NVIDIA CUDA support (pure Rust PTX, no nvcc) |
| minimal | Core inference only |
| bench-http | External server benchmarking |
Benchmarks
LLM Inference (GPU)
| Model | Size | Format | Backend | Throughput |
|---|---|---|---|---|
| Qwen2.5-Coder Q4_K_M | 1.5B | APR | RTX 4090 (CUDA) | 240 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | llama.cpp CUDA | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | Ollama CUDA | 228 tok/s |
realizar achieves 8--21% faster inference than llama.cpp/Ollama via pure Rust CUDA PTX generation.
Classical ML (APR Format)
| Model | Parameters | Latency | Throughput |
|---|---|---|---|
| Iris | 131 | 103ns | 9.6M inferences/sec |
| MNIST | 103K | 73us | 13.6K inferences/sec |
| Large NN | 1M | 410us | 2.4K inferences/sec |
Methodology follows Hoefler & Belli SC'15 (CV-based stopping, warmup iterations discarded).
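The CV-based stopping rule can be sketched as follows; the threshold and minimum sample count here are illustrative, not the values the benchmark harness actually uses.

```rust
// Coefficient of variation: relative spread of the timing samples.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}

// Stop collecting samples once timings have stabilized: enough
// samples gathered and their relative spread is below a threshold.
fn should_stop(samples: &[f64], threshold: f64) -> bool {
    samples.len() >= 10 && coefficient_of_variation(samples) < threshold
}
```

Warmup iterations are discarded before this check so that cold caches and JIT-like startup effects do not inflate the measured spread.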
Quality
- 15,000+ tests across unit, integration, and property-based suites
- 95%+ line coverage via cargo-llvm-cov
- Zero clippy warnings with -D warnings
- Mutation testing via cargo-mutants
- Provable contracts -- 1,725 bindings with AllImplemented policy
Sovereign AI Stack
realizar is the inference layer of the PAIML Sovereign AI Stack:
| Layer | Crate | Purpose |
|---|---|---|
| Compute | trueno | SIMD/GPU primitives (AVX2/AVX-512/NEON, wgpu) |
| ML | aprender | ML algorithms, APR v2 format |
| Training | entrenar | Autograd, LoRA/QLoRA, quantization |
| Inference | realizar | LLM inference, GPU kernels, model serving |
| Speech | whisper-apr | Pure Rust Whisper ASR |
| Distribution | repartir | Distributed compute (CPU/GPU/Remote) |
| Registry | pacha | Model registry with Ed25519 signatures |
| Orchestration | batuta | Stack coordination and CLI |
Documentation
- API docs: docs.rs/realizar
- Repository: github.com/paiml/realizar
- Cookbook: github.com/paiml/apr-cookbook
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines, or open an issue to discuss your idea first.
License
MIT -- Pragmatic AI Labs