Realizar is a high-performance machine learning inference engine for serving transformer models in production. Built entirely from scratch in Rust with zero external ML dependencies, it delivers 9.6x faster inference than PyTorch for CPU-only deployments while maintaining 94% test coverage and full GGUF/SafeTensors compatibility.
Table of Contents
- Quick Start
- Features
- Benchmark: 9.6x Faster Than PyTorch
- AWS Lambda: 53,000x Faster Cold Start
- Installation
- Usage
- Feature Flags
- Philosophy
- Target API
- Architecture
- Examples
- Reproducible Benchmarks
- Roadmap
- Development
- Documentation
- Contributing
- License
Quick Start
# Install from crates.io
cargo install realizar

# Start the demo server
realizar serve --demo

# Test inference
curl -X POST http://127.0.0.1:3000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, world"}'
For production use with custom models, see the Installation and Usage sections.
Features
Realizar is a production-ready ML inference engine built entirely from scratch in pure Rust:
- Blazing Fast: 9.6x faster than PyTorch for CPU-only inference with 0.52 µs latency
- Zero ML Dependencies: no external ML libraries; 100% custom implementation built on Trueno
- Multiple Model Formats: native APR, GGUF, and SafeTensors parsers built from scratch
- Advanced Quantization: Q4_0, Q8_0, Q4_K, Q5_K, Q6_K support for a reduced memory footprint
- Production Ready: REST API server, streaming responses, model caching, Prometheus metrics
- Serverless Optimized: 53,000x faster cold starts for AWS Lambda deployments
- Swappable Backends: modular HTTP server design (axum default; hyper/actix-web ready)
- Extreme Testing: 260+ tests with 94.61% coverage, plus property-based and mutation testing
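To make the quantization bullet concrete, here is a minimal sketch of Q4_0-style block dequantization (32 weights per block sharing one scale). This is illustrative only, not Realizar's actual code; it is simplified to an f32 scale, whereas the GGUF format stores the scale as f16.

```rust
// Q4_0-style block: 32 weights stored as 4-bit values with a shared scale.
// Low nibbles hold values 0..16, high nibbles hold values 16..32,
// each nibble centered at zero by subtracting 8.
fn dequantize_q4_0(scale: f32, packed: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8; // low nibble, centered at 0
        let hi = (byte >> 4) as i32 - 8;   // high nibble
        out[i] = lo as f32 * scale;
        out[i + 16] = hi as f32 * scale;
    }
    out
}

fn main() {
    // 0x98: low nibble 8 (-> 0.0), high nibble 9 (-> 1 * scale)
    let packed = [0x98u8; 16];
    let weights = dequantize_q4_0(0.5, &packed);
    println!("{} {}", weights[0], weights[16]);
}
```

Storing one scale per 32 weights is what gives Q4_0 its ~4.5 bits/weight footprint versus 32 bits for f32.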
Benchmark: 9.6x Faster Than PyTorch
For CPU-only, single-request inference (AWS Lambda, edge, embedded):
| Metric | Aprender (Rust) | PyTorch (Python) | Winner |
|---|---|---|---|
| Inference Latency (p50) | 0.52 µs | 5.00 µs | 9.6x faster |
| Throughput | 1,898,614/sec | 195,754/sec | 9.7x higher |
| Cold Start | ~5 ms | ~500 ms+ | 100x faster |
| Package Size | ~5 MB | ~500 MB+ | 100x smaller |
| Lambda Memory | 128 MB | 512 MB+ | 4x less |
Statistical Validation: p < 0.001, Cohen's d = 5.19 (large effect), 10,000 iterations
# Run Aprender benchmark
# Run PyTorch benchmark
# Generate comparison report
See BENCHMARK_RESULTS.md for full methodology.
Why 9.6x Faster?
PyTorch (5.00 µs):
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Python  │ Bridge  │ Checks  │ COMPUTE │ Alloc   │ Return  │
│ interp  │ FFI     │ dispatch│ (real)  │ tensor  │ to Py   │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
                               ↑ Only 10% is actual work

Aprender (0.52 µs):
┌───────────┬────┐
│  COMPUTE  │ret │  ← 77% is actual work
└───────────┴────┘
Bottom line: For .apr models on Lambda/edge, Aprender eliminates Python entirely: faster, smaller, cheaper.
AWS Lambda: 53,000x Faster Cold Start
For serverless deployment, the .apr format dominates PyTorch:
| Metric | .apr (Rust) | PyTorch | Improvement |
|---|---|---|---|
| Cold Start | 15 µs | 800 ms | 53,000x faster |
| Inference | 0.6 µs | 5.0 µs | 8.5x faster |
| Binary Size | 3.2 KB | >100 MB | 30,000x smaller |
| Lambda Memory | 128MB | 512MB+ | 4x less |
100% Reproducible Lambda Deployment
The model file is checked into git for byte-for-byte reproducibility:
# Model is already in the repo
# Build Lambda binary (uses checked-in model)
# Package for AWS
# Run locally
See the Lambda MNIST Benchmark chapter for full details.
Copy-paste for LinkedIn:
We benchmarked Rust vs Python for ML inference. The results: 9.6x faster.
For CPU-only, single-request inference (AWS Lambda, edge devices):
- Latency: 0.52 µs (Rust) vs 5.0 µs (Python) → 9.6x faster
- Cold start: 5 ms vs 500 ms+ → 100x faster
- Package: 5 MB vs 500 MB → 100x smaller
- Lambda RAM: 128 MB vs 512 MB → 4x less
Why? Python's interpreter + FFI bridge overhead dominates small operations. 90% of PyTorch inference time is overhead, only 10% is actual compute.
Statistically validated: p < 0.001, Cohen's d = 5.19, 10,000 iterations, 100-point QA checklist.
Full methodology + reproducible benchmark: github.com/paiml/realizar
#MachineLearning #Rust #Python #AWS #Lambda #Performance #MLOps
Installation
# From crates.io
cargo install realizar

# From source
git clone https://github.com/paiml/realizar
cd realizar
cargo install --path .
Usage
# Build the binary
cargo build --release

# Start the inference server (demo mode)
realizar serve --demo

# Test the API
curl http://127.0.0.1:3000/health

# View help
realizar --help
Feature Flags
Realizar supports modular compilation through feature flags:
[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }
Available Features:
- default = ["server", "cli", "gpu"] - full functionality
- minimal = [] - core inference engine only (no server, no CLI)
- server - REST API server (requires axum, tokio)
- cli - command-line interface (requires clap)
- gpu - GPU acceleration via Trueno
- full - alias for all features
Examples:
# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal

# Server without CLI
cargo build --no-default-features --features server

# Everything enabled
cargo build --features full
Philosophy
Total Control, Zero Compromise
Build everything ourselves except HTTP infrastructure:
- ✅ Transformer architecture - our code, Trueno-backed
- ✅ Quantization - Q4_0, Q8_0, Q4_K from scratch
- ✅ Model parsing - GGUF, SafeTensors native readers
- ✅ Token encoding - BPE, SentencePiece in pure Rust
- ✅ Inference engine - every optimization under our control
- 🔧 HTTP server - axum (swappable via trait)
Target API
use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("model.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve("0.0.0.0:3000")?;
# CLI
realizar serve --demo

# REST API
curl -X POST http://127.0.0.1:3000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello, world"}'

# Metrics (Prometheus format)
curl http://127.0.0.1:3000/metrics
Architecture
┌──────────────────────────────────────┐
│  HTTP Server (Swappable)             │
│  - axum (default, trait-based)       │
│  - hyper (future)                    │
│  - actix-web (future)                │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│  Inference Engine (FROM SCRATCH)     │
│  - Transformer (our code)            │
│  - Attention (Trueno-backed)         │
│  - Quantization (our algorithms)     │
│  - KV cache (our management)         │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│  Model Loader (FROM SCRATCH)         │
│  - GGUF parser (pure Rust)           │
│  - Safetensors reader (pure Rust)    │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│  Trueno (Compute Primitives)         │
│  - Matrix ops (SIMD/GPU)             │
│  - Vector ops (AVX2/NEON)            │
└──────────────────────────────────────┘
Dependencies (Minimal)
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"
# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
What We Build from Scratch
1. Model Formats (Pure Rust Parsers)
- APR - Aprender native format (PRIMARY, sovereign stack)
- GGUF - Ollama/llama.cpp format
- Safetensors - HuggingFace format
- No external dependencies, complete control
2. Transformer Architecture
3. Attention Mechanism
4. Quantization
5. Token Encoding
6. KV Cache
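To give a flavor of these from-scratch components, here is a minimal KV-cache sketch: key/value projections for each decoded token are stored once, so each new step only computes attention against cached entries instead of re-running every projection. Names and layout are assumptions for illustration, not Realizar's actual API.

```rust
// Append-only per-position storage of key/value vectors for one layer.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one K vector per generated position
    values: Vec<Vec<f32>>, // one V vector per generated position
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Store the K/V projections computed for the newest token.
    fn push(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    // Simulate three decode steps; each step reuses everything cached so far.
    for step in 0..3 {
        cache.push(vec![step as f32; 4], vec![step as f32; 4]);
    }
    println!("cached positions: {}", cache.len());
}
```

A production cache would additionally preallocate to a maximum sequence length and handle eviction, which is what the cache benchmarks below measure.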
Swappable HTTP Server
// HTTP server trait (axum is default, can swap)
// Default: axum
// Future: hyper, actix-web, custom

// Usage
let server = Server::new(model)
    .with_backend(AxumServer)  // or HyperServer
    .serve()?;
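The swap mechanism above can be sketched with a trait object; the names here (HttpBackend, AxumBackend) are illustrative assumptions, not Realizar's real trait.

```rust
// Minimal sketch of a trait-based, swappable HTTP backend.
trait HttpBackend {
    fn serve(&self, addr: &str) -> Result<String, String>;
}

struct AxumBackend;

impl HttpBackend for AxumBackend {
    fn serve(&self, addr: &str) -> Result<String, String> {
        // A real implementation would bind the socket and run axum here.
        Ok(format!("axum listening on {addr}"))
    }
}

struct Server {
    backend: Box<dyn HttpBackend>,
}

impl Server {
    fn new() -> Self {
        // axum is the default backend
        Server { backend: Box::new(AxumBackend) }
    }

    // Swap in any other backend that implements the trait.
    fn with_backend(mut self, backend: Box<dyn HttpBackend>) -> Self {
        self.backend = backend;
        self
    }

    fn serve(&self, addr: &str) -> Result<String, String> {
        self.backend.serve(addr)
    }
}

fn main() {
    let server = Server::new().with_backend(Box::new(AxumBackend));
    println!("{}", server.serve("127.0.0.1:3000").unwrap());
}
```

Because the engine only depends on the trait, a hyper or actix-web backend is one new impl block, with no changes to inference code.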
Examples
Realizar includes seven examples demonstrating all major features:
1. End-to-End Inference (inference.rs)
Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).
2. HTTP API Server (api_server.rs)
Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health
3. Tokenization (tokenization.rs)
Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.
4. SafeTensors Loading (safetensors_loading.rs)
Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.
5. Model Caching (model_cache.rs)
Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.
6. GGUF Format Loading (gguf_loading.rs)
Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.
7. APR Format Loading (apr_loading.rs)
Load and use Aprender's native .apr format - the PRIMARY inference format for the sovereign AI stack. Demonstrates format specification, model types, and inference.
See examples/README.md for detailed documentation.
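The sampling strategies named in the inference example (greedy, top-k, top-p) can be sketched over a raw logits vector. This is a simplified illustration, not Realizar's actual sampler: the top-k sketch selects candidates deterministically, where a real sampler would draw from the renormalized distribution.

```rust
// Greedy decoding: pick the index of the largest logit.
fn greedy(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Top-k: keep only the k highest-logit token indices as candidates.
fn top_k_candidates(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    idx
}

fn main() {
    let logits = [0.1, 2.5, 0.3, 1.7];
    println!("greedy -> {}", greedy(&logits));
    println!("top-2  -> {:?}", top_k_candidates(&logits, 2));
}
```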
For SLM evaluation with Pareto frontier analysis:
See single-shot-eval - SLM Pareto Frontier Evaluation Framework.
Reproducible Benchmarks
Realizar provides scientifically rigorous, reproducible benchmarks following MLPerf™ Inference methodology. All benchmarks use Criterion.rs for statistical analysis with 95% confidence intervals.
Quick Start
# Run all Realizar benchmarks
cargo bench

# Run comparative benchmarks (Realizar vs PyTorch)
cargo bench --bench comparative

# CLI benchmark commands
Benchmark Suites
| Suite | Command | Description |
|---|---|---|
| tensor_ops | cargo bench --bench tensor_ops | Tensor creation, shape access, indexing |
| inference | cargo bench --bench inference | End-to-end token generation |
| cache | cargo bench --bench cache | KV cache hit/miss, eviction |
| tokenizer | cargo bench --bench tokenizer | BPE/SentencePiece encode/decode |
| quantize | cargo bench --bench quantize | Q4_0/Q8_0 dequantization |
| comparative | cargo bench --bench comparative | MNIST, CIFAR-10, Iris vs PyTorch |
Reproducing Results
Prerequisites:
# Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Python environment (uv)
curl -LsSf https://astral.sh/uv/install.sh | sh

# PyTorch dependencies
uv pip install torch
Hardware Requirements:
- CPU: x86_64 with AVX2 or ARM64 with NEON
- RAM: 8GB minimum
- Recommended: Disable CPU frequency scaling for stable measurements
# Linux: Set performance governor
sudo cpupower frequency-set -g performance
Step-by-Step Reproduction:
# 1. Clone and build
git clone https://github.com/paiml/realizar
cd realizar
cargo build --release

# 2. Run Realizar benchmarks
cargo bench

# 3. Run PyTorch baseline (requires uv)
# 4. Generate comparison report
# 5. View HTML reports
xdg-open target/criterion/report/index.html
Datasets
Benchmarks use canonical ML datasets via Alimentar for PyTorch parity:
| Dataset | Dimensions | Classes | Features |
|---|---|---|---|
| MNIST | 28×28×1 | 10 | 784 |
| CIFAR-10 | 32×32×3 | 10 | 3,072 |
| Fashion-MNIST | 28×28×1 | 10 | 784 |
| Iris | Tabular | 3 | 4 |
Comparative Framework Testing
We benchmark against PyTorch under equivalent conditions:
| Setting | Value |
|---|---|
| Threads | 1 (single-threaded) |
| Batch sizes | 1, 8, 32 |
| Device | CPU only |
| Warm-up | 50 iterations |
| Measurement | 1000 iterations |
Run comparative benchmarks:
# Full comparison (Makefile)
# Manual execution
Performance Results
Realizar (v0.2.1) - Intel Core i7, Linux 6.8:
| Benchmark | Batch | Latency (p50) | Throughput |
|---|---|---|---|
| MNIST inference | 1 | 780 ns | 1.28M samples/s |
| MNIST inference | 32 | 23.8 µs | 1.34M samples/s |
| CIFAR-10 inference | 1 | 1.58 µs | 633K samples/s |
| CIFAR-10 inference | 32 | 49.8 µs | 642K samples/s |
| Iris inference | 32 | 210 ns | 152M samples/s |
| Tensor creation (10) | - | 18 ns | - |
| Tensor creation (10K) | - | 643 ns | - |
| Cache hit | - | 39 ns | - |
Statistical Methodology
- Warm-up phase: Stabilize CPU caches and branch predictors
- Sample collection: 100 samples per benchmark (Criterion default)
- Confidence intervals: 95% CI reported as [lower, mean, upper]
- Regression detection: Automatic comparison against baseline
- Effect size: Cohen's d for practical significance
tensor_creation/10 time: [17.887 ns 17.966 ns 18.043 ns]
^ ^ ^
lower mean upper
bound estimate bound
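The effect size used above, Cohen's d, is computed from the two latency samples via the pooled standard deviation: d = (mean_a - mean_b) / s_pooled. A sketch with illustrative numbers (not the benchmark's raw data):

```rust
// Sample mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Unbiased sample variance (divides by n - 1).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

// Cohen's d with pooled standard deviation.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled = (((na - 1.0) * variance(a) + (nb - 1.0) * variance(b))
        / (na + nb - 2.0))
        .sqrt();
    (mean(a) - mean(b)) / pooled
}

fn main() {
    // Hypothetical latency samples in microseconds.
    let pytorch = [5.0, 5.1, 4.9, 5.0];
    let rust = [0.52, 0.53, 0.51, 0.52];
    println!("d = {:.1}", cohens_d(&pytorch, &rust));
}
```

By the usual convention, d > 0.8 is a "large" effect, which is why the reported d = 5.19 indicates a difference far beyond measurement noise.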
Visualization
# Terminal visualization
# Output includes:
# - Sparklines (trend visualization)
# - ASCII histograms (distribution shape)
# - Statistical summary (mean, std_dev, p50/p95/p99)
# - Multi-benchmark comparison tables
References
- MLPerf™ Inference Benchmark Suite. MLCommons. https://mlcommons.org/benchmarks/inference/
- Criterion.rs: Statistics-driven Microbenchmarking. https://bheisler.github.io/criterion.rs/book/
- Box, G. E. P., Hunter, J. S., & Hunter, W. G. (2005). Statistics for Experimenters. Wiley.
- Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics. CACM, 29(3), 218-221.
Roadmap
Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE
Build from scratch:
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: realizar serve --demo (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage
Success criteria:
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)
Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE
- ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
- ✅ Flash Attention (memory-efficient block-wise computation)
- ✅ Batch inference
- ✅ Streaming responses (SSE)
- ✅ Model caching/warming
- ✅ Benchmarks vs llama.cpp
Phase 3: Advanced Models (Weeks 17-24)
- ✅ Multi-query attention (MQA)
- ✅ Grouped-query attention (GQA)
- ✅ RoPE position embeddings
- ✅ ALiBi position embeddings
- Vision models (LLaVA, Qwen-VL)
Phase 4: Production (Weeks 25-32) ✅ COMPLETE
- ✅ Multi-model serving (ModelRegistry with concurrent access)
- ✅ Request batching (batch tokenize & generate endpoints)
- ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
- ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
- ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)
Development
# Build
cargo build --release

# Test
cargo test

# Quality gates
make quality-gates

# Run (when implemented)
cargo run --release -- serve --demo
Documentation
Comprehensive documentation is available as an mdBook:
# Build and view the book
mdbook serve book --open

# Build only
mdbook build book

# Live reload (for writing docs)
mdbook serve book

# Open in browser
xdg-open book/book/index.html
The book covers:
- Core Architecture - Design philosophy, Trueno integration, feature flags
- Model Formats - GGUF and Safetensors parsing from scratch
- Quantization - Q4_0, Q8_0, and K-quant algorithms
- Transformer Architecture - Attention, RoPE, FFN, KV cache implementation
- Tokenization - BPE and SentencePiece without external libraries
- REST API & CLI - Production HTTP server and command-line interface
- GPU Acceleration - Trueno SIMD/GPU dispatch
- EXTREME TDD - Property-based testing, mutation testing methodology
- Development Phases - Phase 1-4 roadmap and implementation details
Note: Book structure is validated in make quality-gates to ensure documentation stays in sync with code.
Learning Resources
We're building everything from scratch. Key papers:
- [11] TensorFlow - Model serving architecture
- [12] PyTorch - Imperative ML framework design
- [13] NumPy - N-dimensional array design
- [18] BLAS - Linear algebra API design
- [19] Strassen - Fast matrix multiplication
- [20] Kahan - Numerical stability
Full spec: docs/specifications/pure-rust-ml-library-research-spec.md
Security
- Pure Rust - Memory safe by design
- Zero unsafe in public API
- Minimal deps - axum + tokio only for HTTP
- cargo audit pre-commit
- cargo-deny license checks
Contributing
- Fork repo
- EXTREME TDD (tests first)
- make quality-gates passes
- All commits on master
License
MIT License - see LICENSE
Acknowledgments
- Trueno - SIMD/GPU compute primitives (our ecosystem)
- Aprender - ML algorithms (Phase 2+)
- Renacer - Profiling
- single-shot-eval - SLM Pareto Frontier Evaluation
- paiml-mcp-agent-toolkit - Quality gates
- bashrs - Script enforcement
Developed by Pragmatic AI Labs
Built from SCRATCH with EXTREME TDD