# Realizar
> **Pure Rust Model Serving - Built from Scratch**
**Realizar** - Production ML inference engine built **100% from scratch** in pure Rust.
## Quick Start
```bash
# Build the binary
cargo build --release
# Start the inference server (demo mode)
./target/release/realizar serve --demo --port 8080
# Test the API
curl http://127.0.0.1:8080/health
curl -X POST http://127.0.0.1:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'
curl -X POST http://127.0.0.1:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 10, "strategy": "greedy"}'
# View help
./target/release/realizar --help
./target/release/realizar serve --help
```
## Feature Flags
Realizar supports modular compilation through feature flags:
```toml
[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }
```
**Available Features:**
- `default` = `["server", "cli", "gpu"]` - Full functionality
- `minimal` = `[]` - Core inference engine only (no server, no CLI)
- `server` - REST API server (requires axum, tokio)
- `cli` - Command-line interface (requires clap)
- `gpu` - GPU acceleration via Trueno
- `full` - Alias for all features
**Examples:**
```bash
# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal
# Server without CLI
cargo build --no-default-features --features server,gpu
# Everything enabled
cargo build --features full
```
## Philosophy
**Total Control, Zero Compromise**
Build everything ourselves except HTTP infrastructure:
- ✅ **Transformer architecture** - Our code, Trueno-backed
- ✅ **Quantization** - Q4_0, Q8_0, Q4_K from scratch
- ✅ **Model parsing** - GGUF, safetensors native readers
- ✅ **Token encoding** - BPE, SentencePiece in pure Rust
- ✅ **Inference engine** - Every optimization under our control
- 🔧 **HTTP server** - axum (swappable via trait)
## Target API
```rust
use realizar::{Model, Server};
// Load model (our loader, our format parsing)
let model = Model::from_gguf("models/llama-3.2-1b.gguf")?;
// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve("0.0.0.0:8080")?;
```
```bash
# CLI
realizar serve --model llama-3.2-1b.gguf --port 8080
# REST API
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 100}'
# Metrics (Prometheus format)
curl http://localhost:8080/metrics
```
## Architecture
```
┌─────────────────────────────────────┐
│  HTTP Server (Swappable)            │
│  - axum (default, trait-based)      │
│  - hyper (future)                   │
│  - actix-web (future)               │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Inference Engine (FROM SCRATCH)    │
│  - Transformer (our code)           │
│  - Attention (Trueno-backed)        │
│  - Quantization (our algorithms)    │
│  - KV cache (our management)        │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Model Loader (FROM SCRATCH)        │
│  - GGUF parser (pure Rust)          │
│  - Safetensors reader (pure Rust)   │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Trueno (Compute Primitives)        │
│  - Matrix ops (SIMD/GPU)            │
│  - Vector ops (AVX2/NEON)           │
└─────────────────────────────────────┘
```
## Dependencies (Minimal)
```toml
[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" } # SIMD/GPU compute primitives
# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }
# CLI
clap = { version = "4", features = ["derive"] }
# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"
# That's it. NO candle, NO llama-cpp-rs, NO hf-hub
```
## What We Build from Scratch
### 1. Model Formats (Pure Rust Parsers)
- **GGUF** - Ollama/llama.cpp format
- **Safetensors** - HuggingFace format
- No external dependencies, complete control
### 2. Transformer Architecture
```rust
pub struct Transformer {
    layers: Vec<TransformerLayer>,
    config: ModelConfig,
}

impl Transformer {
    pub fn forward(&self, tokens: &[u32]) -> Tensor {
        // Our implementation, Trueno ops
        let mut x = self.embed(tokens); // `mut`: reassigned by each layer below
        for layer in &self.layers {
            x = layer.forward(x); // We write this
        }
        self.lm_head(x)
    }
}
```
### 3. Attention Mechanism
```rust
pub fn attention(
    q: &Tensor, // Trueno tensor
    k: &Tensor,
    v: &Tensor,
) -> Tensor {
    // Our attention implementation
    // Uses Trueno for matrix ops (SIMD/GPU)
    // NOTE: simplified; standard scaled dot-product attention also divides
    // the scores by sqrt(d_k) before the softmax.
    let scores = q.matmul(&k.transpose());
    let weights = scores.softmax();
    weights.matmul(v)
}
```
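
To make the computation concrete, here is a minimal sketch of scaled dot-product attention on plain `Vec<f32>` rows instead of Trueno tensors. It is an illustration under stated assumptions, not Realizar's implementation: the helper names (`dot`, `softmax_in_place`) and the row-major `&[Vec<f32>]` layout are hypothetical, and it covers a single head with no masking.

```rust
/// Dot product of two equally sized vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Numerically stable softmax over one row of scores.
fn softmax_in_place(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// softmax(Q K^T / sqrt(d_k)) V for row-major matrices.
/// `q` is [n_q][d_k], `k` is [n_kv][d_k], `v` is [n_kv][d_v].
pub fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d_k = k.first().map_or(1, |row| row.len()).max(1) as f32;
    let scale = 1.0 / d_k.sqrt();
    let d_v = v.first().map_or(0, |row| row.len());

    q.iter()
        .map(|qi| {
            // One score per key, scaled before the softmax.
            let mut weights: Vec<f32> = k.iter().map(|kj| dot(qi, kj) * scale).collect();
            softmax_in_place(&mut weights);

            // Weighted sum of the value rows.
            let mut out = vec![0.0; d_v];
            for (w, vj) in weights.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += w * x;
                }
            }
            out
        })
        .collect()
}
```

With a single key/value pair the softmax weight is exactly 1, so `attention(&q, &k, &v)` returns the value row unchanged, which makes a convenient sanity check.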
### 4. Quantization
```rust
pub mod quantize {
    // Bodies are placeholders; `todo!()` keeps the signatures compilable.

    // Q4_0 - 4-bit quantization
    pub fn q4_0(weights: &[f32]) -> (Vec<u8>, Vec<f32>) { todo!() }
    // Q8_0 - 8-bit quantization
    pub fn q8_0(weights: &[f32]) -> (Vec<i8>, Vec<f32>) { todo!() }
    // Q4_K - k-quant 4-bit
    pub fn q4_k(weights: &[f32]) -> Vec<u8> { todo!() }
    // Dequantization for inference
    pub fn dequantize(data: &[u8], qtype: QuantType) -> Vec<f32> { todo!() }
}
```
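
As an illustration of what a block-wise 8-bit scheme involves, the sketch below quantizes weights in blocks of 32 with one scale per block, in the style of Q8_0. It is a simplified sketch, not Realizar's code: it keeps scales as `f32` to match the signature above (GGML stores them as `f16`), and `BLOCK` and `dequantize_q8_0` are hypothetical names.

```rust
/// Number of weights that share one scale (assumed block size).
pub const BLOCK: usize = 32;

/// Quantize to i8 with one scale per block: q = round(w / scale),
/// where scale = max(|w|) / 127 within the block.
pub fn q8_0(weights: &[f32]) -> (Vec<i8>, Vec<f32>) {
    let mut quants = Vec::with_capacity(weights.len());
    let mut scales = Vec::with_capacity((weights.len() + BLOCK - 1) / BLOCK);

    for block in weights.chunks(BLOCK) {
        let amax = block.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
        let scale = amax / 127.0;
        // An all-zero block gets scale 0; avoid dividing by it.
        let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
        scales.push(scale);
        quants.extend(block.iter().map(|&w| (w * inv).round() as i8));
    }
    (quants, scales)
}

/// Reverse the mapping: w ≈ q * scale, block by block.
pub fn dequantize_q8_0(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .chunks(BLOCK)
        .zip(scales)
        .flat_map(|(block, &s)| block.iter().map(move |&q| q as f32 * s))
        .collect()
}
```

Round-tripping through `q8_0` and `dequantize_q8_0` reproduces each weight to within about half of its block's scale, which is the rounding error of the integer grid.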
### 5. Token Encoding
```rust
use std::collections::HashMap;

pub struct Tokenizer {
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
}

impl Tokenizer {
    // BPE encoding (from scratch); body is a placeholder
    pub fn encode(&self, text: &str) -> Vec<u32> { todo!() }
    // Decoding; body is a placeholder
    pub fn decode(&self, tokens: &[u32]) -> String { todo!() }
}
```
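
The core of BPE encoding is the merge loop: start from single characters and repeatedly fuse the adjacent pair with the best (lowest) rank in the merge table. The sketch below shows that loop for one word. It assumes the position in `merges` is the rank, skips pre-tokenization and vocabulary lookup, and uses a hypothetical function name, `bpe_encode_word`.

```rust
use std::collections::HashMap;

/// Apply BPE merges to a single word and return the resulting symbols.
/// Earlier entries in `merges` have higher priority.
pub fn bpe_encode_word(word: &str, merges: &[(String, String)]) -> Vec<String> {
    // Map each mergeable pair to its rank (index in the merge table).
    let ranks: HashMap<(&str, &str), usize> = merges
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.as_str(), b.as_str()), rank))
        .collect();

    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair with the lowest rank (leftmost on ties).
        let best = symbols
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                ranks
                    .get(&(w[0].as_str(), w[1].as_str()))
                    .map(|&rank| (rank, i))
            })
            .min();

        let (_, i) = match best {
            Some(found) => found,
            None => break, // no mergeable pair left
        };

        // Fuse symbols[i] and symbols[i + 1] into one symbol.
        let right = symbols.remove(i + 1);
        symbols[i].push_str(&right);
    }
    symbols
}
```

With the merges `(l, o)`, `(lo, w)`, `(e, r)`, the word `lower` becomes `["low", "er"]`; a full `encode` would then map each symbol to its id through `vocab`.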
### 6. KV Cache
```rust
pub struct KVCache {
    keys: Vec<Tensor>, // Trueno tensors
    values: Vec<Tensor>,
}

impl KVCache {
    // Efficient cache management; bodies are placeholders
    pub fn update(&mut self, layer: usize, k: Tensor, v: Tensor) { todo!() }
    pub fn get(&self, layer: usize) -> (&Tensor, &Tensor) { todo!() }
}
```
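
One way to picture the cache is as a per-layer buffer that grows by one row of keys and values per generated token, so earlier positions are never recomputed. The sketch below uses flat `Vec<f32>` buffers rather than Trueno tensors; the `KvCache` type, its `head_dim` field, and `seq_len` are assumptions made for the example.

```rust
/// Per-layer key/value storage, flattened as [seq_len * head_dim].
pub struct KvCache {
    head_dim: usize,
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    pub fn new(num_layers: usize, head_dim: usize) -> Self {
        assert!(head_dim > 0, "head_dim must be non-zero");
        KvCache {
            head_dim,
            keys: vec![Vec::new(); num_layers],
            values: vec![Vec::new(); num_layers],
        }
    }

    /// Append the keys and values for one or more new positions.
    pub fn update(&mut self, layer: usize, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), v.len(), "key/value length mismatch");
        assert_eq!(k.len() % self.head_dim, 0, "length must be a multiple of head_dim");
        self.keys[layer].extend_from_slice(k);
        self.values[layer].extend_from_slice(v);
    }

    /// All cached keys and values for a layer.
    pub fn get(&self, layer: usize) -> (&[f32], &[f32]) {
        (self.keys[layer].as_slice(), self.values[layer].as_slice())
    }

    /// Number of cached positions for a layer.
    pub fn seq_len(&self, layer: usize) -> usize {
        self.keys[layer].len() / self.head_dim
    }
}
```

During generation, each decoding step would call `update` once per layer with the new token's key and value, then attend over everything returned by `get`.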
## Swappable HTTP Server
```rust
// HTTP server trait (axum is default, can swap)
pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<()>;
}

// Default: axum
pub struct AxumServer { /* ... */ }
impl HttpServer for AxumServer { /* ... */ }

// Future: hyper, actix-web, custom
pub struct HyperServer { /* ... */ }
impl HttpServer for HyperServer { /* ... */ }

// Usage
Server::new(model)
    .with_backend(AxumServer::new()) // or HyperServer
    .serve("0.0.0.0:8080")?;
```
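
A practical benefit of putting the backend behind a trait is that it can be replaced by a test double. The sketch below shows the idea with a backend that only records the addresses it was asked to serve. Everything here is hypothetical: `RecordingServer`, the generic `Server<B>`, and the `Result<(), String>` error type stand in for whatever the real crate defines.

```rust
use std::cell::RefCell;

/// Minimal stand-in for the backend trait shown above.
pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<(), String>;
}

/// Test double: records addresses instead of opening a socket.
#[derive(Default)]
pub struct RecordingServer {
    pub addrs: RefCell<Vec<String>>,
}

impl HttpServer for RecordingServer {
    fn serve(&self, addr: &str) -> Result<(), String> {
        self.addrs.borrow_mut().push(addr.to_string());
        Ok(())
    }
}

/// Front end that is generic over the backend it delegates to.
pub struct Server<B: HttpServer> {
    backend: B,
}

impl<B: HttpServer> Server<B> {
    pub fn with_backend(backend: B) -> Self {
        Server { backend }
    }

    pub fn backend(&self) -> &B {
        &self.backend
    }

    pub fn serve(&self, addr: &str) -> Result<(), String> {
        self.backend.serve(addr)
    }
}
```

A test can then construct `Server::with_backend(RecordingServer::default())`, call `serve`, and inspect `addrs` without binding a port.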
## Examples
Realizar includes **6 comprehensive examples** demonstrating all major features:
### 1. End-to-End Inference (`inference.rs`)
Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).
```bash
cargo run --example inference
```
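
For readers unfamiliar with the three strategies, the sketch below shows how each one narrows the candidate set from a slice of logits. It is not the code in `inference.rs`: the function names are hypothetical, logits are assumed finite, and the final random draw from the kept candidates is left out so the example stays deterministic.

```rust
/// Greedy: index of the largest logit (first one wins on ties).
pub fn greedy(logits: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, &l) in logits.iter().enumerate() {
        match best {
            Some(b) if logits[b] >= l => {}
            _ => best = Some(i),
        }
    }
    best
}

/// Numerically stable softmax: logits -> probabilities.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Token ids ordered from most to least probable.
fn sorted_desc(probs: &[f32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx
}

/// Top-k: keep only the k most probable token ids.
pub fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx = sorted_desc(&softmax(logits));
    idx.truncate(k);
    idx
}

/// Top-p (nucleus): keep the smallest set of most probable ids
/// whose cumulative probability reaches `p`.
pub fn top_p(logits: &[f32], p: f32) -> Vec<usize> {
    let probs = softmax(logits);
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for i in sorted_desc(&probs) {
        kept.push(i);
        cumulative += probs[i];
        if cumulative >= p {
            break;
        }
    }
    kept
}
```

A sampler would renormalize the probabilities of the kept ids and draw one of them at random; greedy decoding skips that step and always takes the top id.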
### 2. HTTP API Server (`api_server.rs`)
Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.
```bash
cargo run --example api_server
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health
```
### 3. Tokenization (`tokenization.rs`)
Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.
```bash
cargo run --example tokenization
```
### 4. SafeTensors Loading (`safetensors_loading.rs`)
Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.
```bash
cargo run --example safetensors_loading
```
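
The safetensors container is simple enough to split by hand: an 8-byte little-endian length, a JSON header of that length, then the raw tensor bytes. The sketch below performs only that split and leaves JSON parsing to the caller. The function name `split_safetensors` is hypothetical and this is not the example's actual code.

```rust
/// Split a safetensors buffer into (JSON header text, raw tensor bytes).
pub fn split_safetensors(buf: &[u8]) -> Result<(&str, &[u8]), String> {
    if buf.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }

    // First 8 bytes: header length as an unsigned little-endian integer.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&buf[..8]);
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    // Reject lengths that overflow or run past the end of the buffer.
    let end = 8usize
        .checked_add(header_len)
        .filter(|&e| e <= buf.len())
        .ok_or_else(|| "header length exceeds buffer size".to_string())?;

    // The header is UTF-8 JSON describing dtype, shape and data offsets.
    let header = std::str::from_utf8(&buf[8..end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;

    Ok((header, &buf[end..]))
}
```

The `data_offsets` in each header entry are relative to the start of the returned tensor bytes, so a reader can slice tensors out of that region without copying.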
### 5. Model Caching (`model_cache.rs`)
Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.
```bash
cargo run --example model_cache
```
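
To show what LRU eviction with hit/miss metrics looks like in miniature, here is a small sketch. It is not the `ModelCache` API: the `LruCache` type and its fields are hypothetical, keys are plain strings, and recency is tracked with a `VecDeque` (linear-time updates), which is fine for a handful of models but not for large caches.

```rust
use std::collections::{HashMap, VecDeque};

/// Tiny LRU cache with hit/miss counters.
pub struct LruCache<V> {
    capacity: usize,
    map: HashMap<String, V>,
    order: VecDeque<String>, // front = least recently used
    pub hits: u64,
    pub misses: u64,
}

impl<V> LruCache<V> {
    pub fn new(capacity: usize) -> Self {
        LruCache {
            capacity,
            map: HashMap::new(),
            order: VecDeque::new(),
            hits: 0,
            misses: 0,
        }
    }

    /// Look up a key, counting the access and refreshing its recency.
    pub fn get(&mut self, key: &str) -> Option<&V> {
        if self.map.contains_key(key) {
            self.hits += 1;
            self.touch(key);
            self.map.get(key)
        } else {
            self.misses += 1;
            None
        }
    }

    /// Insert or replace a value, evicting the least recently used entry if full.
    pub fn insert(&mut self, key: &str, value: V) {
        if self.capacity == 0 {
            return;
        }
        if self.map.insert(key.to_string(), value).is_some() {
            self.touch(key); // existing key: just refresh recency
            return;
        }
        self.order.push_back(key.to_string());
        if self.map.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
    }

    /// Move `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == key) {
            if let Some(k) = self.order.remove(pos) {
                self.order.push_back(k);
            }
        }
    }
}
```

With capacity 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, because `a` was used more recently.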
### 6. GGUF Format Loading (`gguf_loading.rs`)
Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.
```bash
cargo run --example gguf_loading
```
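
As a rough picture of the first parsing step, the sketch below reads the fixed-size GGUF header: the `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. It assumes the version 2 or later layout (version 1 used 32-bit counts) and stops before the metadata key/value section. `GgufHeader` and `parse_gguf_header` are hypothetical names, not the example's API.

```rust
/// Fixed-size fields at the start of a GGUF file (v2+ layout assumed).
pub struct GgufHeader {
    pub version: u32,
    pub tensor_count: u64,
    pub metadata_kv_count: u64,
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

/// Parse magic, version and counts from the first 24 bytes.
pub fn parse_gguf_header(buf: &[u8]) -> Result<GgufHeader, String> {
    if buf.len() < 24 {
        return Err("buffer too short for a GGUF header".to_string());
    }
    if !buf.starts_with(b"GGUF") {
        return Err("bad magic: expected \"GGUF\"".to_string());
    }
    let version = read_u32(buf, 4);
    if version < 2 {
        // Assumption: only the 64-bit-count layout is handled here.
        return Err(format!("unsupported GGUF version {}", version));
    }
    Ok(GgufHeader {
        version,
        tensor_count: read_u64(buf, 8),
        metadata_kv_count: read_u64(buf, 16),
    })
}
```

After this header, a full reader would walk `metadata_kv_count` typed key/value pairs and then `tensor_count` tensor descriptors before reaching the tensor data.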
See [`examples/README.md`](examples/README.md) for detailed documentation.
## ⥠Benchmarks
Realizar includes **4 comprehensive benchmark suites** for performance measurement and regression detection:
### 1. Tensor Operations (`tensor_ops`)
Measures tensor creation and basic operations across different sizes (10, 100, 1K, 10K elements).
### 2. Inference Pipeline (`inference`)
End-to-end generation performance including forward pass, sampling strategies, and token generation latency.
### 3. Model Caching (`cache`)
Cache hit/miss latency, LRU eviction overhead, and concurrent access throughput.
### 4. Tokenization (`tokenizer`)
Encode/decode performance for Basic, BPE, and SentencePiece tokenizers across varying text lengths and vocabulary sizes.
**Run benchmarks:**
```bash
# All benchmarks
cargo bench
# Specific suite
cargo bench --bench tokenizer
cargo bench --bench cache
# View results
open target/criterion/report/index.html
```
**Performance Targets:**
- Inference latency: p50 <100ms, p95 <200ms for 1B models
- Cache hits: <1µs latency
- Tokenization: Sub-millisecond for typical prompts
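
Checking the p50 and p95 targets means turning a set of measured latencies into percentiles. The sketch below uses the nearest-rank definition, which is one common convention among several; the function name `percentile` is hypothetical and is not part of the benchmark suites.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
/// Returns `None` for an empty input or `p` outside 0..=100.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // rank = ceil(p / 100 * n), clamped to at least 1 so p = 0 maps to the minimum.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

For ten samples of 10, 20, ..., 100 ms this gives a p50 of 50 ms and a p95 of 100 ms, so that run would meet both latency targets above.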
## Roadmap
### Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE
**Build from scratch:**
- ✅ GGUF parser (binary format reader)
- ✅ Safetensors parser (zero-copy reader)
- ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
- ✅ Quantization (Q4_0, Q8_0, dequantization)
- ✅ Tokenizer (BPE, SentencePiece)
- ✅ KV cache management
- ✅ Inference engine (generation loop, greedy/top-k/top-p)
- ✅ HTTP server with axum (REST API)
- ✅ CLI: `realizar serve --demo` (model loading in Phase 2)
- ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage
**Success criteria:**
- ✅ GGUF and Safetensors parsers working
- ✅ Quantization working (Q4_0, Q8_0)
- ✅ REST API with /health, /tokenize, /generate
- ✅ GPU acceleration via Trueno
- ✅ Zero external ML dependencies
- ✅ TDG Score: 93.9/100 (A)
### Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE
- ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
- ✅ Flash Attention (memory-efficient block-wise computation)
- ✅ Batch inference
- ✅ Streaming responses (SSE)
- ✅ Model caching/warming
- ✅ Benchmarks vs llama.cpp
### Phase 3: Advanced Models (Weeks 17-24)
- ✅ Multi-query attention (MQA)
- ✅ Grouped-query attention (GQA)
- ✅ RoPE position embeddings
- ✅ ALiBi position embeddings
- [ ] Vision models (LLaVA, Qwen-VL)
### Phase 4: Production (Weeks 25-32) ✅ COMPLETE
- ✅ Multi-model serving (ModelRegistry with concurrent access)
- ✅ Request batching (batch tokenize & generate endpoints)
- ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
- ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
- ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)
## Development
```bash
# Build
cargo build --release
# Test
cargo test
# Quality gates
make quality-gates
# Run (when implemented)
cargo run --release -- serve --model llama-3.2-1b.gguf --port 8080
```
## Documentation
Comprehensive documentation is available as an mdBook:
```bash
# Build and view the book
make book
# Build only
make book-build
# Live reload (for writing docs)
make book-serve
# Open in browser
make book-open
```
The book covers:
- **Core Architecture** - Design philosophy, Trueno integration, feature flags
- **Model Formats** - GGUF and Safetensors parsing from scratch
- **Quantization** - Q4_0, Q8_0, and K-quant algorithms
- **Transformer Architecture** - Attention, RoPE, FFN, KV cache implementation
- **Tokenization** - BPE and SentencePiece without external libraries
- **REST API & CLI** - Production HTTP server and command-line interface
- **GPU Acceleration** - Trueno SIMD/GPU dispatch
- **EXTREME TDD** - Property-based testing, mutation testing methodology
- **Development Phases** - Phase 1-4 roadmap and implementation details
**Note:** Book structure is validated in `make quality-gates` to ensure documentation stays in sync with code.
## Learning Resources
We're building everything from scratch. Key papers:
- **[11] TensorFlow** - Model serving architecture
- **[12] PyTorch** - Imperative ML framework design
- **[13] NumPy** - N-dimensional array design
- **[18] BLAS** - Linear algebra API design
- **[19] Strassen** - Fast matrix multiplication
- **[20] Kahan** - Numerical stability
Full spec: [docs/specifications/pure-rust-ml-library-research-spec.md](docs/specifications/pure-rust-ml-library-research-spec.md)
## Security
- **Pure Rust** - Memory safe by design
- **Zero unsafe** in public API
- **Minimal deps** - axum + tokio only for HTTP
- `cargo audit` pre-commit
- `cargo-deny` license checks
## Contributing
1. Fork repo
2. EXTREME TDD (tests first)
3. `make quality-gates` passes
4. All commits on `master`
## License
MIT License - see [LICENSE](LICENSE)
## Acknowledgments
- **[Trueno](https://github.com/paiml/trueno)** - SIMD/GPU compute primitives (our ecosystem)
- **[Aprender](https://github.com/paiml/aprender)** - ML algorithms (Phase 2+)
- **[Renacer](https://github.com/paiml/renacer)** - Profiling
- **[paiml-mcp-agent-toolkit](https://github.com/paiml/paiml-mcp-agent-toolkit)** - Quality gates
- **[bashrs](https://github.com/paiml/bashrs)** - Script enforcement
Developed by [Pragmatic AI Labs](https://paiml.com)
---
**Built from SCRATCH with EXTREME TDD**