Realizar ⚡
Pure Rust Model Serving - Built from Scratch
Realizar - Production ML inference engine built 100% from scratch in pure Rust.
🚀 Quick Start

```bash
# Build the binary
cargo build --release

# Start the inference server (demo mode)
./target/release/realizar serve --demo

# Test the API
curl http://127.0.0.1:3000/health

# View help
./target/release/realizar --help
```
⚙️ Feature Flags
Realizar supports modular compilation through feature flags:
```toml
[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }
```
Available Features:
- `default = ["server", "cli", "gpu"]` - Full functionality
- `minimal = []` - Core inference engine only (no server, no CLI)
- `server` - REST API server (requires axum, tokio)
- `cli` - Command-line interface (requires clap)
- `gpu` - GPU acceleration via Trueno
- `full` - Alias for all features
Examples:
```bash
# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal

# Server without CLI
cargo build --no-default-features --features server

# Everything enabled
cargo build --all-features
```
🎯 Philosophy
Total Control, Zero Compromise
Build everything ourselves except HTTP infrastructure:
- ✅ Transformer architecture - Our code, Trueno-backed
- ✅ Quantization - Q4_0, Q8_0, Q4_K from scratch
- ✅ Model parsing - GGUF, safetensors native readers
- ✅ Token encoding - BPE, SentencePiece in pure Rust
- ✅ Inference engine - Every optimization under our control
- 🔧 HTTP server - axum (swappable via trait)
📋 Target API
```rust
use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("model.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve("0.0.0.0:3000")?;
```
```bash
# CLI
realizar serve --demo

# REST API
curl -X POST http://127.0.0.1:3000/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello"}'

# Metrics (Prometheus format)
curl http://127.0.0.1:3000/metrics
```
🏛️ Architecture
```text
┌─────────────────────────────────────┐
│  HTTP Server (Swappable)            │
│  - axum (default, trait-based)      │
│  - hyper (future)                   │
│  - actix-web (future)               │
└─────────────────┬───────────────────┘
                  │
┌─────────────────┴───────────────────┐
│  Inference Engine (FROM SCRATCH)    │
│  - Transformer (our code)           │
│  - Attention (Trueno-backed)        │
│  - Quantization (our algorithms)    │
│  - KV cache (our management)        │
└─────────────────┬───────────────────┘
                  │
┌─────────────────┴───────────────────┐
│  Model Loader (FROM SCRATCH)        │
│  - GGUF parser (pure Rust)          │
│  - Safetensors reader (pure Rust)   │
└─────────────────┬───────────────────┘
                  │
┌─────────────────┴───────────────────┐
│  Trueno (Compute Primitives)        │
│  - Matrix ops (SIMD/GPU)            │
│  - Vector ops (AVX2/NEON)           │
└─────────────────────────────────────┘
```
📦 Dependencies (Minimal)
```toml
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"

# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
```
🔧 What We Build from Scratch
1. Model Formats (Pure Rust Parsers)
- GGUF - Ollama/llama.cpp format
- Safetensors - HuggingFace format
- No external dependencies, complete control
2. Transformer Architecture
3. Attention Mechanism
4. Quantization
5. Token Encoding
6. KV Cache
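As an illustration of what building quantization from scratch involves, here is a minimal Q8_0-style sketch: 32-element blocks, one f32 scale per block, values stored as i8. It follows the GGML Q8_0 layout but is illustrative only, not Realizar's actual code.

```rust
// Illustrative Q8_0-style block quantization (GGML layout; NOT Realizar's code).
// Each block of 32 f32 values becomes one f32 scale + 32 i8 quantized values.

const BLOCK: usize = 32;

fn quantize_q8_0(xs: &[f32; BLOCK]) -> (f32, [i8; BLOCK]) {
    // Scale maps the largest magnitude in the block onto the i8 range [-127, 127].
    let amax = xs.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut q = [0i8; BLOCK];
    for (qi, &x) in q.iter_mut().zip(xs) {
        *qi = (x * inv).round() as i8;
    }
    (scale, q)
}

fn dequantize_q8_0(scale: f32, q: &[i8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0f32; BLOCK];
    for (o, &v) in out.iter_mut().zip(q) {
        *o = v as f32 * scale;
    }
    out
}

fn main() {
    let mut xs = [0f32; BLOCK];
    for (i, x) in xs.iter_mut().enumerate() {
        *x = (i as f32 - 16.0) / 4.0;
    }
    let (scale, q) = quantize_q8_0(&xs);
    let ys = dequantize_q8_0(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    let max_err = xs
        .iter()
        .zip(&ys)
        .map(|(a, b)| (a - b).abs())
        .fold(0f32, f32::max);
    println!("scale = {scale:.4}, max reconstruction error = {max_err:.4}");
    assert!(max_err <= scale * 0.5 + 1e-6);
}
```

Q4_0 works the same way but packs two 4-bit values per byte, trading accuracy for a further 2x size reduction.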
🔄 Swappable HTTP Server
```rust
// HTTP server trait (axum is default, can swap)
// Default: axum
// Future: hyper, actix-web, custom

// Usage
let server = Server::new(model)
    .with_backend(AxumServer::default()) // or HyperServer
    .serve("0.0.0.0:3000")?;
```
💡 Examples
Realizar includes 6 comprehensive examples demonstrating all major features:
1. End-to-End Inference (inference.rs)
Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).
2. HTTP API Server (api_server.rs)
Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health
3. Tokenization (tokenization.rs)
Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.
4. SafeTensors Loading (safetensors_loading.rs)
Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.
5. Model Caching (model_cache.rs)
Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.
6. GGUF Format Loading (gguf_loading.rs)
Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.
See examples/README.md for detailed documentation.
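As background for the GGUF loading example, the GGUF container begins with a fixed little-endian header: a `GGUF` magic, a u32 version, a u64 tensor count, and a u64 metadata key-value count. The following is a minimal parsing sketch per the GGUF spec, not taken from Realizar's loader:

```rust
// Minimal GGUF header parser (per the GGUF spec; illustrative only).
// A real loader continues on to read metadata KV pairs and tensor infos.

fn read_u32(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes(b[off..off + 4].try_into().unwrap())
}

fn read_u64(b: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(b[off..off + 8].try_into().unwrap())
}

#[derive(Debug)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn parse_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("truncated header".into());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic".into());
    }
    Ok(GgufHeader {
        version: read_u32(bytes, 4),
        tensor_count: read_u64(bytes, 8),
        metadata_kv_count: read_u64(bytes, 16),
    })
}

fn main() {
    // Fabricated header bytes: version 3, 2 tensors, 5 metadata entries.
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GGUF");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(&2u64.to_le_bytes());
    buf.extend_from_slice(&5u64.to_le_bytes());
    let h = parse_header(&buf).unwrap();
    println!("{h:?}");
    assert_eq!(h.version, 3);
    assert_eq!(h.tensor_count, 2);
}
```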
⚡ Benchmarks
Realizar includes 4 comprehensive benchmark suites for performance measurement and regression detection:
1. Tensor Operations (tensor_ops)
Measures tensor creation and basic operations across different sizes (10, 100, 1K, 10K elements).
2. Inference Pipeline (inference)
End-to-end generation performance including forward pass, sampling strategies, and token generation latency.
3. Model Caching (cache)
Cache hit/miss latency, LRU eviction overhead, and concurrent access throughput.
4. Tokenization (tokenizer)
Encode/decode performance for Basic, BPE, and SentencePiece tokenizers across varying text lengths and vocabulary sizes.
Run benchmarks:

```bash
# All benchmarks
cargo bench

# Specific suite
cargo bench --bench inference

# View results
open target/criterion/report/index.html
```
Performance Targets:
- Inference latency: p50 <100ms, p95 <200ms for 1B models
- Cache hits: <1µs latency
- Tokenization: Sub-millisecond for typical prompts
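The sampling strategies exercised by the inference benchmarks (greedy, top-k, top-p) can be sketched as follows; this is an illustrative reimplementation over raw logits, not Realizar's sampler:

```rust
// Greedy vs. top-k sampling over raw logits (illustrative; NOT Realizar's code).
// Greedy picks the argmax; top-k zeroes out all but the k largest logits
// before normalizing, so low-probability tokens can never be drawn.

fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

/// Softmax restricted to the top-k logits; all other tokens get probability 0.
fn top_k_probs(logits: &[f32], k: usize) -> Vec<f32> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    let keep = &idx[..k.min(idx.len())];
    let max = logits[keep[0]]; // subtract max for numerical stability
    let mut probs = vec![0.0; logits.len()];
    let mut z = 0.0;
    for &i in keep {
        let e = (logits[i] - max).exp();
        probs[i] = e;
        z += e;
    }
    for p in probs.iter_mut() {
        *p /= z;
    }
    probs
}

fn main() {
    let logits = [1.0f32, 3.0, 0.5, 2.0];
    assert_eq!(argmax(&logits), 1);
    let p = top_k_probs(&logits, 2);
    // Only tokens 1 and 3 survive; their probabilities sum to 1.
    assert_eq!(p[0], 0.0);
    assert!((p[1] + p[3] - 1.0).abs() < 1e-6);
    assert!(p[1] > p[3]);
    println!("top-2 probs: {p:?}");
}
```

Top-p (nucleus) sampling is the same idea, except the kept set is the smallest prefix of the sorted tokens whose cumulative probability exceeds p.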
🗺️ Roadmap
Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE
Build from scratch:
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: `realizar serve --demo` (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage
Success criteria:
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)
Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE
- ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
- ✅ Flash Attention (memory-efficient block-wise computation)
- ✅ Batch inference
- ✅ Streaming responses (SSE)
- ✅ Model caching/warming
- ✅ Benchmarks vs llama.cpp
Phase 3: Advanced Models (Weeks 17-24)
- ✅ Multi-query attention (MQA)
- ✅ Grouped-query attention (GQA)
- ✅ RoPE position embeddings
- ✅ ALiBi position embeddings
- Vision models (LLaVA, Qwen-VL)
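RoPE, checked off above, encodes position by rotating each even/odd pair of a query or key vector by a position-dependent angle, so relative positions fall out of the dot product. A self-contained sketch (base 10000 as in the RoFormer paper; not Realizar's implementation):

```rust
// RoPE (rotary position embedding) on a single head vector (illustrative).
// Pair (x[2i], x[2i+1]) is rotated by angle pos * 10000^(-2i/d).

fn apply_rope(x: &mut [f32], pos: usize) {
    let d = x.len();
    for i in 0..d / 2 {
        let theta = (pos as f32) * 10000f32.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    // Rotation preserves each pair's norm, and position 0 is the identity.
    let mut x = [1.0f32, 2.0, 3.0, 4.0];
    let norm_before: f32 = x.iter().map(|v| v * v).sum();
    apply_rope(&mut x, 0);
    assert_eq!(x, [1.0, 2.0, 3.0, 4.0]); // identity at position 0
    apply_rope(&mut x, 7);
    let norm_after: f32 = x.iter().map(|v| v * v).sum();
    assert!((norm_before - norm_after).abs() < 1e-3);
    println!("rotated: {x:?}");
}
```

Because only angles differ between positions, the same code serves both queries and keys, and no learned position table is needed.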
Phase 4: Production (Weeks 25-32) ✅ COMPLETE
- ✅ Multi-model serving (ModelRegistry with concurrent access)
- ✅ Request batching (batch tokenize & generate endpoints)
- ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
- ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
- ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)
🛠️ Development

```bash
# Build
cargo build --release

# Test
cargo test

# Quality gates
make quality-gates

# Run (demo mode)
cargo run --release -- serve --demo
```
📚 Documentation

Comprehensive documentation is available as an mdBook:

```bash
# Build and view the book
mdbook serve book --open

# Build only
mdbook build book

# Live reload (for writing docs)
mdbook serve book

# Open in browser
open book/book/index.html
```
The book covers:
- Core Architecture - Design philosophy, Trueno integration, feature flags
- Model Formats - GGUF and Safetensors parsing from scratch
- Quantization - Q4_0, Q8_0, and K-quant algorithms
- Transformer Architecture - Attention, RoPE, FFN, KV cache implementation
- Tokenization - BPE and SentencePiece without external libraries
- REST API & CLI - Production HTTP server and command-line interface
- GPU Acceleration - Trueno SIMD/GPU dispatch
- EXTREME TDD - Property-based testing, mutation testing methodology
- Development Phases - Phase 1-4 roadmap and implementation details
Note: Book structure is validated in `make quality-gates` to ensure documentation stays in sync with code.
📖 Learning Resources
We're building everything from scratch. Key papers:
- [11] TensorFlow - Model serving architecture
- [12] PyTorch - Imperative ML framework design
- [13] NumPy - N-dimensional array design
- [18] BLAS - Linear algebra API design
- [19] Strassen - Fast matrix multiplication
- [20] Kahan - Numerical stability
Full spec: docs/specifications/pure-rust-ml-library-research-spec.md
🔒 Security
- Pure Rust - Memory safe by design
- Zero unsafe in public API
- Minimal deps - axum + tokio only for HTTP
- `cargo audit` runs pre-commit
- `cargo-deny` license checks
🤝 Contributing
- Fork repo
- EXTREME TDD (tests first)
- `make quality-gates` passes
- All commits on `master`
📄 License
MIT License - see LICENSE
🙏 Acknowledgments
- Trueno - SIMD/GPU compute primitives (our ecosystem)
- Aprender - ML algorithms (Phase 2+)
- Renacer - Profiling
- paiml-mcp-agent-toolkit - Quality gates
- bashrs - Script enforcement
Developed by Pragmatic AI Labs
Built from SCRATCH with EXTREME TDD 🦀⚡