Realizar ⚡

Pure Rust Model Serving - Built from Scratch

Realizar - Production ML inference engine built 100% from scratch in pure Rust.

🚀 Quick Start

# Build the binary
cargo build --release

# Start the inference server (demo mode)
./target/release/realizar serve --demo --port 8080

# Test the API
curl http://127.0.0.1:8080/health
curl -X POST http://127.0.0.1:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world"}'
curl -X POST http://127.0.0.1:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 10, "strategy": "greedy"}'

# View help
./target/release/realizar --help
./target/release/realizar serve --help

⚙ïļ Feature Flags

Realizar supports modular compilation through feature flags:

[dependencies]
realizar = { version = "0.1", default-features = false, features = ["minimal"] }

Available Features:

  • default = ["server", "cli", "gpu"] - Full functionality
  • minimal = [] - Core inference engine only (no server, no CLI)
  • server - REST API server (requires axum, tokio)
  • cli - Command-line interface (requires clap)
  • gpu - GPU acceleration via Trueno
  • full - Alias for all features

Examples:

# Core inference library only (minimal dependencies)
cargo build --no-default-features --features minimal

# Server without CLI
cargo build --no-default-features --features server,gpu

# Everything enabled
cargo build --features full
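
To give a sense of how the flags shape the code, here is a minimal sketch of feature-gated compilation in Rust; the module and function names are illustrative, not the crate's actual layout:

// Compiled only when the `server` feature is enabled
#[cfg(feature = "server")]
pub mod server {
    // Stand-in handler; real routes live behind the same gate.
    pub fn health() -> &'static str {
        "ok"
    }
}

// Code can also branch on a feature at compile time
pub fn startup_banner() -> &'static str {
    if cfg!(feature = "server") {
        "realizar: server enabled"
    } else {
        "realizar: core inference only"
    }
}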

🎯 Philosophy

Total Control, Zero Compromise

Build everything ourselves except HTTP infrastructure:

  • ✅ Transformer architecture - Our code, Trueno-backed
  • ✅ Quantization - Q4_0, Q8_0, Q4_K from scratch
  • ✅ Model parsing - GGUF, safetensors native readers
  • ✅ Token encoding - BPE, SentencePiece in pure Rust
  • ✅ Inference engine - Every optimization under our control
  • 🔧 HTTP server - axum (swappable via trait)

🚀 Target API

use realizar::{Model, Server};

// Load model (our loader, our format parsing)
let model = Model::from_gguf("models/llama-3.2-1b.gguf")?;

// Serve (swappable server backend)
Server::new(model)
    .with_gpu()
    .serve("0.0.0.0:8080")?;

# CLI
realizar serve --model llama-3.2-1b.gguf --port 8080

# REST API
curl -X POST http://localhost:8080/generate \
  -d '{"prompt": "Hello", "max_tokens": 100}'

# Metrics (Prometheus format)
curl http://localhost:8080/metrics

🏗ïļ Architecture

┌─────────────────────────────────────┐
│  HTTP Server (Swappable)           │
│  - axum (default, trait-based)     │
│  - hyper (future)                  │
│  - actix-web (future)              │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Inference Engine (FROM SCRATCH)   │
│  - Transformer (our code)          │
│  - Attention (Trueno-backed)       │
│  - Quantization (our algorithms)   │
│  - KV cache (our management)       │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Model Loader (FROM SCRATCH)       │
│  - GGUF parser (pure Rust)         │
│  - Safetensors reader (pure Rust)  │
└────────────┬────────────────────────┘
             ↓
┌─────────────────────────────────────┐
│  Trueno (Compute Primitives)       │
│  - Matrix ops (SIMD/GPU)           │
│  - Vector ops (AVX2/NEON)          │
└─────────────────────────────────────┘
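
In code, a generate request flows top-down through these layers: the HTTP handler tokenizes the prompt, runs the engine, and decodes the output. A rough, self-contained sketch of that call chain (every type and method here is a stand-in, not the real API):

// Stand-in types only; the real engine, tokenizer, and request types differ.
struct GenerateRequest { prompt: String, max_tokens: usize }
struct GenerateResponse { text: String }

struct Engine;
impl Engine {
    fn encode(&self, text: &str) -> Vec<u32> { text.bytes().map(u32::from).collect() }
    fn generate(&self, tokens: &[u32], max: usize) -> Vec<u32> {
        tokens.iter().copied().take(max).collect()
    }
    fn decode(&self, tokens: &[u32]) -> String {
        tokens.iter().map(|&t| t as u8 as char).collect()
    }
}

// HTTP layer: deserialize the request, delegate to the engine, return the response.
fn handle_generate(engine: &Engine, req: GenerateRequest) -> GenerateResponse {
    let prompt_tokens = engine.encode(&req.prompt);               // tokenizer
    let output = engine.generate(&prompt_tokens, req.max_tokens); // transformer + KV cache
    GenerateResponse { text: engine.decode(&output) }
}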

📦 Dependencies (Minimal)

[dependencies]
# OUR ecosystem - we control these
trueno = { path = "../trueno" }  # SIMD/GPU compute primitives

# HTTP server ONLY (swappable via trait)
axum = "0.7"
tokio = { version = "1", features = ["rt-multi-thread"] }

# CLI
clap = { version = "4", features = ["derive"] }

# Serialization (for API only, not ML)
serde = { version = "1", features = ["derive"] }
serde_json = "1"

# That's it. NO candle, NO llama-cpp-rs, NO hf-hub

🔧 What We Build from Scratch

1. Model Formats (Pure Rust Parsers)

  • GGUF - Ollama/llama.cpp format
  • Safetensors - HuggingFace format
  • No external dependencies, complete control
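
To make "pure Rust parser" concrete, here is a minimal sketch of reading the fixed-size GGUF header (magic, version, tensor count, metadata count) from a byte buffer; a real parser goes on to read the metadata key/value pairs and tensor descriptors that follow. This illustrates the approach, not the crate's actual code:

/// Fixed-size fields at the start of every GGUF file.
#[derive(Debug)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("buffer too small for GGUF header".into());
    }
    // Magic: the ASCII bytes "GGUF"
    if &bytes[0..4] != b"GGUF" {
        return Err("not a GGUF file".into());
    }
    let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(bytes[16..24].try_into().unwrap());
    // Metadata key/value pairs and tensor descriptors follow this header.
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}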

2. Transformer Architecture

pub struct Transformer {
    layers: Vec<TransformerLayer>,
    config: ModelConfig,
}

impl Transformer {
    pub fn forward(&self, tokens: &[u32]) -> Tensor {
        // Our implementation, Trueno ops
        let mut x = self.embed(tokens);
        for layer in &self.layers {
            x = layer.forward(x);  // We write this
        }
        self.lm_head(x)
    }
}

3. Attention Mechanism

pub fn attention(
    q: &Tensor,  // Trueno tensor
    k: &Tensor,
    v: &Tensor,
) -> Tensor {
    // Our scaled dot-product attention implementation
    // Uses Trueno for matrix ops (SIMD/GPU); a full implementation
    // also scales the scores by 1/sqrt(d_k) before the softmax
    let scores = q.matmul(&k.transpose());
    let weights = scores.softmax();
    weights.matmul(v)
}

4. Quantization

pub mod quantize {
    // Q4_0 - 4-bit quantization
    pub fn q4_0(weights: &[f32]) -> (Vec<u8>, Vec<f32>) { }

    // Q8_0 - 8-bit quantization
    pub fn q8_0(weights: &[f32]) -> (Vec<i8>, Vec<f32>) { }

    // Q4_K - k-quant 4-bit
    pub fn q4_k(weights: &[f32]) -> Vec<u8> { }

    // Dequantization for inference
    pub fn dequantize(data: &[u8], qtype: QuantType) -> Vec<f32> { }
}
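
For a feel of the math, Q8_0-style quantization works on fixed-size blocks: each block stores one scale (its absolute maximum divided by 127) plus one signed byte per value. A simplified, self-contained sketch of that scheme (not the crate's exact implementation):

const BLOCK: usize = 32; // GGML-style Q8_0 works on 32-element blocks

/// Quantize f32 weights to signed bytes plus one f32 scale per block.
fn q8_0_quantize(weights: &[f32]) -> (Vec<i8>, Vec<f32>) {
    let mut quants = Vec::with_capacity(weights.len());
    let mut scales = Vec::new();
    for block in weights.chunks(BLOCK) {
        let amax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
        let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
        scales.push(scale);
        quants.extend(block.iter().map(|&x| (x / scale).round() as i8));
    }
    (quants, scales)
}

/// Reverse the mapping for inference: value ≈ quant * scale.
fn q8_0_dequantize(quants: &[i8], scales: &[f32]) -> Vec<f32> {
    quants
        .chunks(BLOCK)
        .zip(scales)
        .flat_map(|(block, &scale)| block.iter().map(move |&q| q as f32 * scale))
        .collect()
}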

5. Token Encoding

pub struct Tokenizer {
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
}

impl Tokenizer {
    // BPE encoding (from scratch)
    pub fn encode(&self, text: &str) -> Vec<u32> { }

    // Decoding
    pub fn decode(&self, tokens: &[u32]) -> String { }
}
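
For intuition, the heart of BPE encoding is a loop that repeatedly merges the highest-priority adjacent symbol pair until no merge rule applies; the resulting symbols are then looked up in the vocabulary to produce token IDs. A simplified, dependency-free sketch of that merge loop (not the crate's actual tokenizer):

/// Apply BPE merges to a single word, highest-priority (earliest) rule first.
fn bpe_merge(word: &str, merges: &[(String, String)]) -> Vec<String> {
    // Start from individual characters.
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair whose merge rule appears earliest in `merges`.
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..parts.len().saturating_sub(1) {
            if let Some(rank) = merges
                .iter()
                .position(|(a, b)| *a == parts[i] && *b == parts[i + 1])
            {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge the pair into one symbol and keep going.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts.splice(i..=i + 1, [merged]);
            }
            None => return parts, // no applicable merge left
        }
    }
}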

6. KV Cache

pub struct KVCache {
    keys: Vec<Tensor>,    // Trueno tensors
    values: Vec<Tensor>,
}

impl KVCache {
    // Efficient cache management
    pub fn update(&mut self, layer: usize, k: Tensor, v: Tensor) { }
    pub fn get(&self, layer: usize) -> (&Tensor, &Tensor) { }
}
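
The point of the cache: during autoregressive decoding, the keys and values of tokens already processed never change, so each step only computes and appends the new token's K/V instead of recomputing the whole prefix. A toy sketch of that idea with stand-in types (not the real Trueno tensors or the crate's cache layout):

// Stand-in types only.
#[derive(Clone)]
struct Tensor;
struct Layer;
impl Layer {
    fn project_kv(&self, _x: &Tensor) -> (Tensor, Tensor) { (Tensor, Tensor) }
}

struct ToyKvCache { keys: Vec<Vec<Tensor>>, values: Vec<Vec<Tensor>> }

impl ToyKvCache {
    fn new(layers: usize) -> Self {
        Self { keys: vec![Vec::new(); layers], values: vec![Vec::new(); layers] }
    }
    // Append this step's key/value instead of recomputing the whole prefix.
    fn update(&mut self, layer: usize, k: Tensor, v: Tensor) {
        self.keys[layer].push(k);
        self.values[layer].push(v);
    }
}

// One decode step: every layer reuses cached K/V from earlier tokens.
fn decode_step(layers: &[Layer], cache: &mut ToyKvCache, x: &Tensor) {
    for (i, layer) in layers.iter().enumerate() {
        let (k, v) = layer.project_kv(x);
        cache.update(i, k, v);
        // attention would now run over cache.keys[i] / cache.values[i]
    }
}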

🔌 Swappable HTTP Server

// HTTP server trait (axum is default, can swap)
pub trait HttpServer {
    fn serve(&self, addr: &str) -> Result<()>;
}

// Default: axum
pub struct AxumServer { /* ... */ }
impl HttpServer for AxumServer { /* ... */ }

// Future: hyper, actix-web, custom
pub struct HyperServer { /* ... */ }
impl HttpServer for HyperServer { /* ... */ }

// Usage
Server::new(model)
    .with_backend(AxumServer::new())  // or HyperServer
    .serve("0.0.0.0:8080")?;

💡 Examples

Realizar includes 6 comprehensive examples demonstrating all major features:

1. End-to-End Inference (inference.rs)

Complete text generation pipeline with model initialization, forward pass, and multiple sampling strategies (greedy, top-k, top-p).

cargo run --example inference
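
The strategies differ only in how the next token is chosen from the model's output logits: greedy takes the arg-max every step, top-k samples among the k highest-scoring tokens, and top-p samples from the smallest set whose probability mass exceeds p. A dependency-free sketch of the first two (illustrative, not the example's actual code):

// Greedy: always pick the index of the largest logit.
fn greedy(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

// Top-k: keep only the k highest-scoring candidates; a sampler would then
// renormalize their probabilities and draw one at random.
fn top_k_candidates(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    idx
}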

2. HTTP API Server (api_server.rs)

Deploy Realizar as a REST API service with demo model, handling tokenization and generation requests.

cargo run --example api_server
# Server runs at http://127.0.0.1:3000
# Test: curl http://127.0.0.1:3000/health

3. Tokenization (tokenization.rs)

Compare different tokenization strategies: Basic (vocabulary-based), BPE (Byte Pair Encoding), and SentencePiece.

cargo run --example tokenization

4. SafeTensors Loading (safetensors_loading.rs)

Load and inspect SafeTensors files (aprender compatibility), extract tensor data, interoperate with aprender-trained models.

cargo run --example safetensors_loading

5. Model Caching (model_cache.rs)

Demonstrate ModelCache for efficient model reuse with LRU eviction, metrics tracking, and config-based cache keys.

cargo run --example model_cache

6. GGUF Format Loading (gguf_loading.rs)

Load and inspect GGUF files (llama.cpp/Ollama format), parse headers and metadata, extract tensor data with dequantization support.

cargo run --example gguf_loading

See examples/README.md for detailed documentation.

⚡ Benchmarks

Realizar includes 4 comprehensive benchmark suites for performance measurement and regression detection:

1. Tensor Operations (tensor_ops)

Measures tensor creation and basic operations across different sizes (10, 100, 1K, 10K elements).

2. Inference Pipeline (inference)

End-to-end generation performance including forward pass, sampling strategies, and token generation latency.

3. Model Caching (cache)

Cache hit/miss latency, LRU eviction overhead, and concurrent access throughput.

4. Tokenization (tokenizer)

Encode/decode performance for Basic, BPE, and SentencePiece tokenizers across varying text lengths and vocabulary sizes.
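
Each suite is an ordinary Criterion benchmark (Criterion benches conventionally live under benches/ with harness = false in Cargo.toml). A minimal sketch of the shape of one; the function being measured here is a stand-in, not the crate's API:

use criterion::{criterion_group, criterion_main, Criterion};

// Stand-in workload; the real suites bench tokenizers, caches, and inference.
fn toy_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

fn bench_encode(c: &mut Criterion) {
    c.bench_function("toy_encode/short_prompt", |b| {
        b.iter(|| toy_encode("Hello world"))
    });
}

criterion_group!(benches, bench_encode);
criterion_main!(benches);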

Run benchmarks:

# All benchmarks
cargo bench

# Specific suite
cargo bench --bench tokenizer
cargo bench --bench cache

# View results
open target/criterion/report/index.html

Performance Targets:

  • Inference latency: p50 <100ms, p95 <200ms for 1B models
  • Cache hits: <1µs latency
  • Tokenization: Sub-millisecond for typical prompts

📊 Roadmap

Phase 1: Core Inference (Weeks 1-8) ✅ COMPLETE

Build from scratch:

  • ✅ GGUF parser (binary format reader)
  • ✅ Safetensors parser (zero-copy reader)
  • ✅ Transformer architecture (attention, FFN, LayerNorm, RoPE)
  • ✅ Quantization (Q4_0, Q8_0, dequantization)
  • ✅ Tokenizer (BPE, SentencePiece)
  • ✅ KV cache management
  • ✅ Inference engine (generation loop, greedy/top-k/top-p)
  • ✅ HTTP server with axum (REST API)
  • ✅ CLI: realizar serve --demo (model loading in Phase 2)
  • ✅ 260 tests (211 unit + 42 property + 7 integration), 94.61% coverage

Success criteria:

  • ✅ GGUF and Safetensors parsers working
  • ✅ Quantization working (Q4_0, Q8_0)
  • ✅ REST API with /health, /tokenize, /generate
  • ✅ GPU acceleration via Trueno
  • ✅ Zero external ML dependencies
  • ✅ TDG Score: 93.9/100 (A)

Phase 2: Optimization (Weeks 9-16) ✅ COMPLETE

  • ✅ Advanced quantization (Q4_K, Q5_K, Q6_K)
  • ✅ Flash Attention (memory-efficient block-wise computation)
  • ✅ Batch inference
  • ✅ Streaming responses (SSE)
  • ✅ Model caching/warming
  • ✅ Benchmarks vs llama.cpp

Phase 3: Advanced Models (Weeks 17-24)

  • ✅ Multi-query attention (MQA)
  • ✅ Grouped-query attention (GQA)
  • ✅ RoPE position embeddings
  • ✅ ALiBi position embeddings
  • Vision models (LLaVA, Qwen-VL)

Phase 4: Production (Weeks 25-32) ✅ COMPLETE

  • ✅ Multi-model serving (ModelRegistry with concurrent access)
  • ✅ Request batching (batch tokenize & generate endpoints)
  • ✅ Monitoring/metrics (Prometheus-compatible /metrics endpoint)
  • ✅ Docker + GPU support (Dockerfile, docker-compose, Kubernetes, AWS ECS)
  • ✅ Load testing (Rust-based load test client, 7 scenarios, performance targets)

🛠ïļ Development

# Build
cargo build --release

# Test
cargo test

# Quality gates
make quality-gates

# Run (when implemented)
cargo run --release -- serve --model llama-3.2-1b.gguf --port 8080

📚 Documentation

Comprehensive documentation is available as an mdBook:

# Build and view the book
make book

# Build only
make book-build

# Live reload (for writing docs)
make book-serve

# Open in browser
make book-open

The book covers:

  • Core Architecture - Design philosophy, Trueno integration, feature flags
  • Model Formats - GGUF and Safetensors parsing from scratch
  • Quantization - Q4_0, Q8_0, and K-quant algorithms
  • Transformer Architecture - Attention, RoPE, FFN, KV cache implementation
  • Tokenization - BPE and SentencePiece without external libraries
  • REST API & CLI - Production HTTP server and command-line interface
  • GPU Acceleration - Trueno SIMD/GPU dispatch
  • EXTREME TDD - Property-based testing, mutation testing methodology
  • Development Phases - Phase 1-4 roadmap and implementation details

Note: Book structure is validated in make quality-gates to ensure documentation stays in sync with code.

🎓 Learning Resources

We're building everything from scratch. Key papers:

  • [11] TensorFlow - Model serving architecture
  • [12] PyTorch - Imperative ML framework design
  • [13] NumPy - N-dimensional array design
  • [18] BLAS - Linear algebra API design
  • [19] Strassen - Fast matrix multiplication
  • [20] Kahan - Numerical stability

Full spec: docs/specifications/pure-rust-ml-library-research-spec.md

🔒 Security

  • Pure Rust - Memory safe by design
  • Zero unsafe in public API
  • Minimal deps - axum + tokio only for HTTP
  • cargo audit pre-commit
  • cargo-deny license checks

ðŸĪ Contributing

  1. Fork repo
  2. EXTREME TDD (tests first)
  3. make quality-gates passes
  4. All commits on master

📄 License

MIT License - see LICENSE

🙏 Acknowledgments

Developed by Pragmatic AI Labs


Built from SCRATCH with EXTREME TDD 🦀⚡