Realizar ⚡
Pure Rust Model Serving - Built from Scratch
Realizar - Production ML inference engine built 100% from scratch in pure Rust.
🚀 Quick Start
```bash
# Build the binary
cargo build --release

# Start the inference server (demo mode)
./target/release/realizar serve --demo

# Test the API
curl http://localhost:8080/health

# View help
./target/release/realizar --help
```
⚙️ Feature Flags
Realizar supports modular compilation through feature flags:
```toml
[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }
```
Available Features:
- `default = ["server", "cli", "gpu"]` - Full functionality
- `minimal = []` - Core inference engine only (no server, no CLI)
- `server` - REST API server (requires axum, tokio)
- `cli` - Command-line interface (requires clap)
- `gpu` - GPU acceleration via Trueno
- `full` - Alias for all features
Examples:
```bash
# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal

# Server without CLI
cargo build --no-default-features --features server

# Everything enabled
cargo build --features full
```
🎯 Philosophy
Total Control, Zero Compromise
Build everything ourselves except HTTP infrastructure:
- ✅ Transformer architecture - Our code, Trueno-backed
- ✅ Quantization - Q4_0, Q8_0, Q4_K from scratch
- ✅ Model parsing - GGUF, safetensors native readers
- ✅ Token encoding - BPE, SentencePiece in pure Rust
- ✅ Inference engine - Every optimization under our control
- 🔧 HTTP server - axum (swappable via trait)
🚀 Target API
```rust
use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("model.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve()?;
```
```bash
# CLI
realizar serve --demo

# REST API
curl -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello"}'
```
🏗️ Architecture
┌─────────────────────────────────────┐
│ HTTP Server (Swappable) │
│ - axum (default, trait-based) │
│ - hyper (future) │
│ - actix-web (future) │
└────────────┬────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Inference Engine (FROM SCRATCH) │
│ - Transformer (our code) │
│ - Attention (Trueno-backed) │
│ - Quantization (our algorithms) │
│ - KV cache (our management) │
└────────────┬────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Model Loader (FROM SCRATCH) │
│ - GGUF parser (pure Rust) │
│ - Safetensors reader (pure Rust) │
└────────────┬────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Trueno (Compute Primitives) │
│ - Matrix ops (SIMD/GPU) │
│ - Vector ops (AVX2/NEON) │
└─────────────────────────────────────┘
📦 Dependencies (Minimal)
```toml
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"

# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
```
🔧 What We Build from Scratch
1. Model Formats (Pure Rust Parsers)
- GGUF - Ollama/llama.cpp format
- Safetensors - HuggingFace format
- No external dependencies, complete control
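The GGUF header is a small fixed little-endian layout that a pure-Rust parser can read directly: 4 magic bytes, a `u32` version, then `u64` tensor and metadata counts. The sketch below is illustrative only; the struct and function names are hypothetical, not Realizar's actual API.

```rust
use std::io::Read;

/// Fixed GGUF header fields (all little-endian on disk).
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read the GGUF header: magic "GGUF", u32 version, u64 tensor count,
/// u64 metadata key-value count.
fn parse_gguf_header(r: &mut impl Read) -> std::io::Result<GgufHeader> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "not a GGUF file",
        ));
    }
    let mut buf4 = [0u8; 4];
    let mut buf8 = [0u8; 8];
    r.read_exact(&mut buf4)?;
    let version = u32::from_le_bytes(buf4);
    r.read_exact(&mut buf8)?;
    let tensor_count = u64::from_le_bytes(buf8);
    r.read_exact(&mut buf8)?;
    let metadata_kv_count = u64::from_le_bytes(buf8);
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}
```

Everything after the header (metadata key-value pairs, then tensor descriptors) follows the same little-endian convention, so the full parser is just more of the same.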
2. Transformer Architecture
3. Attention Mechanism
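At its core, the attention step is scaled dot-product attention over cached keys and values. A minimal single-query, single-head sketch (illustrative only; in Realizar the dot products and softmax would be the parts dispatched to Trueno's SIMD/GPU kernels):

```rust
/// Scaled dot-product attention for one query over cached keys/values:
/// softmax(q . K^T / sqrt(d)) . V
fn attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d = q.len() as f32;
    // Score each cached key against the query, scaled by sqrt(d).
    let mut scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
        .collect();
    // Numerically stable softmax: subtract the max before exponentiating.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for s in scores.iter_mut() {
        *s = (*s - max).exp();
        sum += *s;
    }
    for s in scores.iter_mut() {
        *s /= sum;
    }
    // Output is the attention-weighted sum of the cached values.
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in scores.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```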
4. Quantization
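To make the Q4_0 layout concrete: each block packs 32 weights into 16 bytes of 4-bit quants plus one scale, and dequantization maps each nibble `q` to `(q - 8) * scale`. A minimal sketch following the GGUF block convention (function name illustrative; the on-disk scale is f16 and is assumed already widened to f32 here):

```rust
/// Dequantize one Q4_0 block: 16 bytes of packed 4-bit quants expand
/// to 32 weights. Low nibble of byte i holds weight i; high nibble
/// holds weight i + 16. Each quant maps to (q - 8) * scale.
fn dequantize_q4_0(scale: f32, quants: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &byte) in quants.iter().enumerate() {
        out[i] = ((byte & 0x0F) as i32 - 8) as f32 * scale;
        out[i + 16] = ((byte >> 4) as i32 - 8) as f32 * scale;
    }
    out
}
```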
5. Token Encoding
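Byte-pair encoding is a greedy loop: split the input into single-character symbols, then repeatedly merge the adjacent pair with the best (lowest) rank in the merge table until no pair matches. A compact sketch (quadratic scan for clarity; a real tokenizer would use a priority queue, and the names here are illustrative):

```rust
use std::collections::HashMap;

/// Greedy BPE: merge the best-ranked adjacent symbol pair until no
/// pair appears in the merge table, then return the symbols.
fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the lowest-ranked adjacent pair, if any.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            if let Some(&rank) = ranks.get(&(symbols[i].clone(), symbols[i + 1].clone())) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge the pair in place.
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols[i] = merged;
                symbols.remove(i + 1);
            }
            None => return symbols,
        }
    }
}
```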
6. KV Cache
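The KV cache's job is to append each new token's key/value projections so that earlier positions are never recomputed during generation. A minimal per-head sketch (illustrative only; no preallocation, eviction, or multi-layer bookkeeping):

```rust
/// Minimal per-head KV cache: grows by one position per generated token.
struct KvCache {
    head_dim: usize,
    keys: Vec<f32>,   // flattened [seq_len, head_dim]
    values: Vec<f32>, // flattened [seq_len, head_dim]
}

impl KvCache {
    fn new(head_dim: usize) -> Self {
        Self { head_dim, keys: Vec::new(), values: Vec::new() }
    }

    /// Append one position's key and value vectors.
    fn push(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        self.keys.extend_from_slice(k);
        self.values.extend_from_slice(v);
    }

    /// Number of cached positions.
    fn len(&self) -> usize {
        self.keys.len() / self.head_dim
    }
}
```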
🔌 Swappable HTTP Server
```rust
// HTTP server trait (axum is default, can swap)
// Default: axum
// Future: hyper, actix-web, custom

// Usage
let server = Server::new(model)
    .with_backend(AxumServer::new()) // or HyperServer
    .serve()?;
```
📊 Roadmap
Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE
Build from scratch:
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: `realizar serve --demo` (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage
Success criteria:
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)
Phase 2: Optimization (Weeks 9-16)
- Advanced quantization (Q4_K, Q5_K, Q6_K)
- Flash Attention (Trueno-backed)
- Batch inference
- Streaming responses (SSE)
- Model caching/warming
- Benchmarks vs llama.cpp
Phase 3: Advanced Models (Weeks 17-24)
- Multi-query attention (MQA)
- Grouped-query attention (GQA)
- RoPE position embeddings
- ALiBi position embeddings
- Vision models (LLaVA, Qwen-VL)
Phase 4: Production (Weeks 25-32)
- Multi-model serving
- Request batching
- Monitoring/metrics
- Docker + GPU support
- Load testing
🛠️ Development
```bash
# Build
cargo build --release

# Test
cargo test

# Quality gates
make quality-gates

# Run (when implemented)
cargo run --release -- serve --demo
```
📚 Documentation
Comprehensive documentation is available as an mdBook:
```bash
# Build and view the book
mdbook build --open

# Build only
mdbook build

# Live reload (for writing docs)
mdbook serve

# Open in browser
mdbook serve --open
```
The book covers:
- Core Architecture - Design philosophy, Trueno integration, feature flags
- Model Formats - GGUF and Safetensors parsing from scratch
- Quantization - Q4_0, Q8_0, and K-quant algorithms
- Transformer Architecture - Attention, RoPE, FFN, KV cache implementation
- Tokenization - BPE and SentencePiece without external libraries
- REST API & CLI - Production HTTP server and command-line interface
- GPU Acceleration - Trueno SIMD/GPU dispatch
- EXTREME TDD - Property-based testing, mutation testing methodology
- Development Phases - Phase 1-4 roadmap and implementation details
Note: Book structure is validated in `make quality-gates` to ensure documentation stays in sync with code.
🎓 Learning Resources
We're building everything from scratch. Key papers:
- [11] TensorFlow - Model serving architecture
- [12] PyTorch - Imperative ML framework design
- [13] NumPy - N-dimensional array design
- [18] BLAS - Linear algebra API design
- [19] Strassen - Fast matrix multiplication
- [20] Kahan - Numerical stability
Full spec: docs/specifications/pure-rust-ml-library-research-spec.md
🔒 Security
- Pure Rust - Memory safe by design
- Zero unsafe in public API
- Minimal deps - axum + tokio only for HTTP
- `cargo audit` in pre-commit
- `cargo-deny` license checks
🤝 Contributing
- Fork repo
- EXTREME TDD (tests first)
- `make quality-gates` passes
- All commits on `master`
📄 License
MIT License - see LICENSE
🙏 Acknowledgments
- Trueno - SIMD/GPU compute primitives (our ecosystem)
- Aprender - ML algorithms (Phase 2+)
- Renacer - Profiling
- paiml-mcp-agent-toolkit - Quality gates
- bashrs - Script enforcement
Developed by Pragmatic AI Labs
Built from SCRATCH with EXTREME TDD 🦀⚡