# Realizar ⚡
Pure Rust Model Serving - Built from Scratch
Realizar - Production ML inference engine built 100% from scratch in pure Rust.
## 📊 Benchmark: 9.6x Faster Than PyTorch
For CPU-only, single-request inference (AWS Lambda, edge, embedded):
| Metric | Aprender (Rust) | PyTorch (Python) | Winner |
|---|---|---|---|
| Inference Latency (p50) | 0.52 µs | 5.00 µs | 9.6x faster |
| Throughput | 1,898,614/sec | 195,754/sec | 9.7x higher |
| Cold Start | ~5 ms | ~500 ms+ | 100x faster |
| Package Size | ~5 MB | ~500 MB+ | 100x smaller |
| Lambda Memory | 128 MB | 512 MB+ | 4x less |
Statistical Validation: p < 0.001, Cohen's d = 5.19 (large effect), 10,000 iterations
```bash
# Run Aprender benchmark
# Run PyTorch benchmark
# Generate comparison report
```
See BENCHMARK_RESULTS.md for full methodology.
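The effect size reported above (Cohen's d) is computed against the pooled standard deviation of the two latency samples. A minimal sketch of that computation, for readers reproducing the statistics themselves (illustrative code, not the benchmark harness's actual implementation):

```rust
// Sample mean.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Unbiased sample variance (n - 1 denominator).
fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64
}

// Cohen's d with pooled standard deviation: |mean_a - mean_b| / s_pooled.
fn cohens_d(a: &[f64], b: &[f64]) -> f64 {
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let pooled =
        (((na - 1.0) * variance(a) + (nb - 1.0) * variance(b)) / (na + nb - 2.0)).sqrt();
    (mean(a) - mean(b)).abs() / pooled
}
```

With latency samples clustered around 5.00 µs and 0.52 µs and tiny spreads, the pooled deviation is small relative to the mean gap, which is why d comes out far above the conventional 0.8 "large effect" threshold.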
### Why 9.6x Faster?
PyTorch (5.00 µs):

```
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Python  │ Bridge  │ Checks  │ COMPUTE │ Alloc   │ Return  │
│ interp  │ FFI     │ dispatch│ (real)  │ tensor  │ to Py   │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
                                  ↑ Only 10% is actual work
```

Aprender (0.52 µs):

```
┌─────────┬────┐
│ COMPUTE │ ret│  ← 77% is actual work
└─────────┴────┘
```
Bottom line: For .apr models on Lambda/edge, Aprender eliminates Python entirely: faster, smaller, cheaper.
### AWS Lambda: 53,000x Faster Cold Start
For serverless deployment, the .apr format dominates PyTorch:
| Metric | .apr (Rust) | PyTorch | Improvement |
|---|---|---|---|
| Cold Start | 15 µs | 800 ms | 53,000x faster |
| Inference | 0.6 µs | 5.0 µs | 8.3x faster |
| Binary Size | 3.2 KB | >100 MB | 30,000x smaller |
| Lambda Memory | 128 MB | 512 MB+ | 4x less |
### 100% Reproducible Lambda Deployment
The model file is checked into git for byte-for-byte reproducibility:
```bash
# Model is already in the repo
# Build Lambda binary (uses checked-in model)
# Package for AWS
# Run locally
```
See the Lambda MNIST Benchmark chapter for full details.
Copy-paste for LinkedIn:
We benchmarked Rust vs Python for ML inference. The results: 9.6x faster.
For CPU-only, single-request inference (AWS Lambda, edge devices):
- Latency: 0.52 µs (Rust) vs 5.0 µs (Python) → 9.6x faster
- Cold start: 5 ms vs 500 ms+ → 100x faster
- Package: 5 MB vs 500 MB → 100x smaller
- Lambda RAM: 128 MB vs 512 MB → 4x less
Why? Python's interpreter and FFI bridge overhead dominate small operations: 90% of PyTorch inference time is overhead; only 10% is actual compute.
Statistically validated: p < 0.001, Cohen's d = 5.19, 10,000 iterations, 100-point QA checklist.
Full methodology + reproducible benchmark: github.com/paiml/realizar
#MachineLearning #Rust #Python #AWS #Lambda #Performance #MLOps
## 🚀 Quick Start
```bash
# Build the binary
# Start the inference server (demo mode)
# Test the API
# View help
```
## ⚙️ Feature Flags
Realizar supports modular compilation through feature flags:
```toml
[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }
```
Available Features:
- `default = ["server", "cli", "gpu"]` - Full functionality
- `minimal = []` - Core inference engine only (no server, no CLI)
- `server` - REST API server (requires axum, tokio)
- `cli` - Command-line interface (requires clap)
- `gpu` - GPU acceleration via Trueno
- `full` - Alias for all features
Examples:
```bash
# Core inference library only (minimal dependencies)
# Server without CLI
# Everything enabled
```
## 🎯 Philosophy
Total Control, Zero Compromise
Build everything ourselves except HTTP infrastructure:
- ✅ Transformer architecture - Our code, Trueno-backed
- ✅ Quantization - Q4_0, Q8_0, Q4_K from scratch
- ✅ Model parsing - GGUF, safetensors native readers
- ✅ Token encoding - BPE, SentencePiece in pure Rust
- ✅ Inference engine - Every optimization under our control
- 🔧 HTTP server - axum (swappable via trait)
## 📋 Target API
```rust
use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("model.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve()?;
```
```bash
# CLI
# REST API
# Metrics (Prometheus format)
```
## 🏗️ Architecture
```
┌──────────────────────────────────────┐
│  HTTP Server (Swappable)             │
│  - axum (default, trait-based)       │
│  - hyper (future)                    │
│  - actix-web (future)                │
└─────────────┬────────────────────────┘
              │
┌─────────────┴────────────────────────┐
│  Inference Engine (FROM SCRATCH)     │
│  - Transformer (our code)            │
│  - Attention (Trueno-backed)         │
│  - Quantization (our algorithms)     │
│  - KV cache (our management)         │
└─────────────┬────────────────────────┘
              │
┌─────────────┴────────────────────────┐
│  Model Loader (FROM SCRATCH)         │
│  - GGUF parser (pure Rust)           │
│  - Safetensors reader (pure Rust)    │
└─────────────┬────────────────────────┘
              │
┌─────────────┴────────────────────────┐
│  Trueno (Compute Primitives)         │
│  - Matrix ops (SIMD/GPU)             │
│  - Vector ops (AVX2/NEON)            │
└──────────────────────────────────────┘
```
## 📦 Dependencies (Minimal)
```toml
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"
```
# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
## 🔧 What We Build from Scratch
1. Model Formats (Pure Rust Parsers)
- GGUF - Ollama/llama.cpp format
- Safetensors - HuggingFace format
- No external dependencies, complete control
2. Transformer Architecture
3. Attention Mechanism
4. Quantization
5. Token Encoding
6. KV Cache
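To illustrate the block-quantization idea behind Q8_0, here is a minimal quantize/dequantize pair assuming the GGML-style layout of 32 values per block with one per-block scale (a sketch only; the in-tree implementation and its exact block struct may differ):

```rust
const QK8_0: usize = 32; // values per block, matching the GGML-style layout

struct BlockQ80 {
    scale: f32,        // per-block scale = max(|x|) / 127
    quants: [i8; QK8_0],
}

fn quantize_q8_0(xs: &[f32; QK8_0]) -> BlockQ80 {
    let amax = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut quants = [0i8; QK8_0];
    for (q, x) in quants.iter_mut().zip(xs) {
        *q = (x / scale).round() as i8; // symmetric quantization, no zero point
    }
    BlockQ80 { scale, quants }
}

fn dequantize_q8_0(b: &BlockQ80) -> [f32; QK8_0] {
    let mut out = [0.0f32; QK8_0];
    for (o, q) in out.iter_mut().zip(&b.quants) {
        *o = *q as f32 * b.scale;
    }
    out
}
```

The round trip bounds the per-value error by half the scale, which is the core trade-off all the Q-formats tune.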
## 🔌 Swappable HTTP Server
```rust
// HTTP server trait (axum is default, can swap)
// Default: axum
// Future: hyper, actix-web, custom

// Usage
let server = Server::new(model)
    .with_backend(AxumServer::default()) // or HyperServer
    .serve()?;
```
## 💡 Examples
Realizar includes 6 comprehensive examples demonstrating all major features:
1. End-to-End Inference (inference.rs)
Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).
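The greedy and top-k strategies mentioned above can be sketched as plain functions over a logit vector (function names here are illustrative, not the example's actual API; top-p follows the same filter-then-sample pattern over cumulative probability):

```rust
// Greedy sampling: pick the index of the highest logit.
fn greedy(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Top-k filtering: keep the k highest logits, mask the rest to -inf
// so they get zero probability after softmax.
fn top_k(logits: &[f32], k: usize) -> Vec<f32> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    let keep: std::collections::HashSet<usize> = idx.into_iter().take(k).collect();
    logits
        .iter()
        .enumerate()
        .map(|(i, &l)| if keep.contains(&i) { l } else { f32::NEG_INFINITY })
        .collect()
}
```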
2. HTTP API Server (api_server.rs)
Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.
```bash
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health
```
3. Tokenization (tokenization.rs)
Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.
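The BPE merge loop at the heart of that comparison can be sketched minimally, assuming an ordered merge table where a lower index means higher merge priority (the real tokenizer's data structures and pre-tokenization are not shown):

```rust
// Split a word into characters, then repeatedly fuse the adjacent pair
// with the earliest (highest-priority) entry in the merge table.
fn bpe_encode(word: &str, merges: &[(&str, &str)]) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..parts.len().saturating_sub(1) {
            if let Some(rank) = merges
                .iter()
                .position(|(a, b)| *a == parts[i] && *b == parts[i + 1])
            {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => break, // no applicable merge left
        }
    }
    parts
}
```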
4. SafeTensors Loading (safetensors_loading.rs)
Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.
5. Model Caching (model_cache.rs)
Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.
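The LRU eviction behavior that example demonstrates can be sketched with a toy cache keyed by a config string; `LruCache` below is a hypothetical stand-in for illustration, not Realizar's actual `ModelCache` API:

```rust
use std::collections::VecDeque;

// Toy LRU: front of the deque = most recently used.
struct LruCache {
    capacity: usize,
    entries: VecDeque<(String, Vec<u8>)>,
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: VecDeque::new() }
    }

    // A hit moves the entry to the front (refreshes recency).
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        let pos = self.entries.iter().position(|(k, _)| k == key)?;
        let entry = self.entries.remove(pos).unwrap();
        let value = entry.1.clone();
        self.entries.push_front(entry);
        Some(value)
    }

    // At capacity, the least recently used entry (back) is evicted.
    fn put(&mut self, key: String, value: Vec<u8>) {
        if let Some(pos) = self.entries.iter().position(|(k, _)| *k == key) {
            self.entries.remove(pos);
        }
        if self.entries.len() == self.capacity {
            self.entries.pop_back();
        }
        self.entries.push_front((key, value));
    }
}
```

A production cache would use a hash map plus intrusive list for O(1) operations; the linear scan here just keeps the eviction logic visible.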
6. GGUF Format Loading (gguf_loading.rs)
Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.
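For orientation, the fixed-size GGUF preamble (magic `GGUF`, then version, tensor count, and metadata KV count, all little-endian per the GGUF spec) can be parsed as in this sketch (not Realizar's actual parser):

```rust
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    // 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata KVs) = 24 bytes.
    if bytes.len() < 24 {
        return Err("truncated header".into());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic".into());
    }
    Ok(GgufHeader {
        version: u32::from_le_bytes(bytes[4..8].try_into().unwrap()),
        tensor_count: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        metadata_kv_count: u64::from_le_bytes(bytes[16..24].try_into().unwrap()),
    })
}
```

The metadata KV section and tensor infos follow the preamble; parsing those requires walking variable-length, typed entries.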
See examples/README.md for detailed documentation.
## ⚡ Reproducible Benchmarks
Realizar provides scientifically rigorous, reproducible benchmarks following MLPerf™ Inference methodology. All benchmarks use Criterion.rs for statistical analysis with 95% confidence intervals.
### Quick Start
```bash
# Run all Realizar benchmarks
# Run comparative benchmarks (Realizar vs PyTorch)
# CLI benchmark commands
```
### Benchmark Suites
| Suite | Command | Description |
|---|---|---|
| `tensor_ops` | `cargo bench --bench tensor_ops` | Tensor creation, shape access, indexing |
| `inference` | `cargo bench --bench inference` | End-to-end token generation |
| `cache` | `cargo bench --bench cache` | KV cache hit/miss, eviction |
| `tokenizer` | `cargo bench --bench tokenizer` | BPE/SentencePiece encode/decode |
| `quantize` | `cargo bench --bench quantize` | Q4_0/Q8_0 dequantization |
| `comparative` | `cargo bench --bench comparative` | MNIST, CIFAR-10, Iris vs PyTorch |
### Reproducing Results
Prerequisites:
```bash
# Rust toolchain
# Python environment (uv)
# PyTorch dependencies
```
Hardware Requirements:
- CPU: x86_64 with AVX2 or ARM64 with NEON
- RAM: 8GB minimum
- Recommended: Disable CPU frequency scaling for stable measurements
```bash
# Linux: Set performance governor
```
Step-by-Step Reproduction:
```bash
# 1. Clone and build
# 2. Run Realizar benchmarks
# 3. Run PyTorch baseline (requires uv)
# 4. Generate comparison report
# 5. View HTML reports
```
### Datasets
Benchmarks use canonical ML datasets via Alimentar for PyTorch parity:
| Dataset | Dimensions | Classes | Features |
|---|---|---|---|
| MNIST | 28×28×1 | 10 | 784 |
| CIFAR-10 | 32×32×3 | 10 | 3,072 |
| Fashion-MNIST | 28×28×1 | 10 | 784 |
| Iris | Tabular | 3 | 4 |
### Comparative Framework Testing
We benchmark against PyTorch under equivalent conditions:
| Setting | Value |
|---|---|
| Threads | 1 (single-threaded) |
| Batch sizes | 1, 8, 32 |
| Device | CPU only |
| Warm-up | 50 iterations |
| Measurement | 1000 iterations |
Run comparative benchmarks:
```bash
# Full comparison (Makefile)
# Manual execution
```
### Performance Results
Realizar (v0.2.1) - Intel Core i7, Linux 6.8:
| Benchmark | Batch | Latency (p50) | Throughput |
|---|---|---|---|
| MNIST inference | 1 | 780 ns | 1.28M samples/s |
| MNIST inference | 32 | 23.8 µs | 1.34M samples/s |
| CIFAR-10 inference | 1 | 1.58 µs | 633K samples/s |
| CIFAR-10 inference | 32 | 49.8 µs | 642K samples/s |
| Iris inference | 32 | 210 ns | 152M samples/s |
| Tensor creation (10) | - | 18 ns | - |
| Tensor creation (10K) | - | 643 ns | - |
| Cache hit | - | 39 ns | - |
### Statistical Methodology
- Warm-up phase: Stabilize CPU caches and branch predictors
- Sample collection: 100 samples per benchmark (Criterion default)
- Confidence intervals: 95% CI reported as [lower, mean, upper]
- Regression detection: Automatic comparison against baseline
- Effect size: Cohen's d for practical significance
```
tensor_creation/10  time: [17.887 ns 17.966 ns 18.043 ns]
                           ^         ^         ^
                           lower     mean      upper
                           bound     estimate  bound
```
### Visualization
```bash
# Terminal visualization
# Output includes:
# - Sparklines (trend visualization)
# - ASCII histograms (distribution shape)
# - Statistical summary (mean, std_dev, p50/p95/p99)
# - Multi-benchmark comparison tables
```
### References
- MLPerf™ Inference Benchmark Suite. MLCommons. https://mlcommons.org/benchmarks/inference/
- Criterion.rs: Statistics-driven Microbenchmarking. https://bheisler.github.io/criterion.rs/book/
- Box, G. E. P., Hunter, J. S., & Hunter, W. G. (2005). Statistics for Experimenters. Wiley.
- Fleming, P. J., & Wallace, J. J. (1986). How not to lie with statistics. CACM, 29(3), 218-221.
## 📍 Roadmap
### Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE
Build from scratch:
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: `realizar serve --demo` (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage
Success criteria:
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)
### Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE
- ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
- ✅ Flash Attention (memory-efficient block-wise computation)
- ✅ Batch inference
- ✅ Streaming responses (SSE)
- ✅ Model caching/warming
- ✅ Benchmarks vs llama.cpp
### Phase 3: Advanced Models (Weeks 17-24)
- ✅ Multi-query attention (MQA)
- ✅ Grouped-query attention (GQA)
- ✅ RoPE position embeddings
- ✅ ALiBi position embeddings
- Vision models (LLaVA, Qwen-VL)
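RoPE, checked off above, encodes position by rotating each (even, odd) feature pair through a position-dependent angle, leaving vector norms unchanged. A one-pair sketch using the base-10000 frequency schedule from the original RoPE formulation (illustrative, not Realizar's implementation):

```rust
// Rotate one feature pair for token position `pos` and pair index `i`,
// where `head_dim` is the full per-head dimension (pairs * 2).
fn rope_pair(x0: f32, x1: f32, pos: usize, i: usize, head_dim: usize) -> (f32, f32) {
    // theta_i = pos * 10000^(-2i / head_dim)
    let theta = (pos as f32) * 10000f32.powf(-2.0 * i as f32 / head_dim as f32);
    let (sin, cos) = theta.sin_cos();
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos) // 2-D rotation
}
```

Because it is a pure rotation, relative positions fall out of dot products between rotated queries and keys, which is what makes RoPE compose cleanly with KV caching.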
### Phase 4: Production (Weeks 25-32) ✅ COMPLETE
- ✅ Multi-model serving (ModelRegistry with concurrent access)
- ✅ Request batching (batch tokenize & generate endpoints)
- ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
- ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
- ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)
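The Prometheus-compatible `/metrics` endpoint above serves the text exposition format: a `# HELP` line, a `# TYPE` line, then samples. A sketch of rendering one counter (the metric name is illustrative, not one of Realizar's actual metric names):

```rust
// Render a single counter in the Prometheus text exposition format.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n")
}
```

A real registry would hold atomics per metric and concatenate all families on each scrape; labels go in braces after the metric name.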
## 🛠️ Development
```bash
# Build
# Test
# Quality gates
# Run (when implemented)
```
## 📚 Documentation
Comprehensive documentation is available as an mdBook:
```bash
# Build and view the book
# Build only
# Live reload (for writing docs)
# Open in browser
```
The book covers:
- Core Architecture - Design philosophy, Trueno integration, feature flags
- Model Formats - GGUF and Safetensors parsing from scratch
- Quantization - Q4_0, Q8_0, and K-quant algorithms
- Transformer Architecture - Attention, RoPE, FFN, KV cache implementation
- Tokenization - BPE and SentencePiece without external libraries
- REST API & CLI - Production HTTP server and command-line interface
- GPU Acceleration - Trueno SIMD/GPU dispatch
- EXTREME TDD - Property-based testing, mutation testing methodology
- Development Phases - Phase 1-4 roadmap and implementation details
Note: Book structure is validated in `make quality-gates` to ensure documentation stays in sync with code.
## 📖 Learning Resources
We're building everything from scratch. Key papers:
- [11] TensorFlow - Model serving architecture
- [12] PyTorch - Imperative ML framework design
- [13] NumPy - N-dimensional array design
- [18] BLAS - Linear algebra API design
- [19] Strassen - Fast matrix multiplication
- [20] Kahan - Numerical stability
Full spec: docs/specifications/pure-rust-ml-library-research-spec.md
## 🔒 Security
- Pure Rust - Memory safe by design
- Zero unsafe in public API
- Minimal deps - axum + tokio only for HTTP
- `cargo audit` pre-commit
- `cargo-deny` license checks
## 🤝 Contributing
- Fork repo
- EXTREME TDD (tests first)
- `make quality-gates` passes
- All commits on `master`
## 📄 License
MIT License - see LICENSE
## 🙏 Acknowledgments
- Trueno - SIMD/GPU compute primitives (our ecosystem)
- Aprender - ML algorithms (Phase 2+)
- Renacer - Profiling
- paiml-mcp-agent-toolkit - Quality gates
- bashrs - Script enforcement
Developed by Pragmatic AI Labs
Built from SCRATCH with EXTREME TDD 🦀⚡