Table of Contents
- What is realizar?
- Installation
- Usage
- Features
- Benchmarks
- Quality
- Sovereign AI Stack
- Documentation
- Contributing
- License
What is realizar?
realizar is a pure Rust LLM inference engine. It loads models in APR v2, GGUF, and SafeTensors formats, runs transformer inference with quantized kernels (Q4_K through Q8_0), and serves predictions over an OpenAI-compatible REST API.
Key design decisions:
- Row-major mandate -- All tensors are row-major internally. GGUF column-major data is transposed at import by aprender. This matches PyTorch/SafeTensors layout and simplifies kernel implementations.
- Pure Rust CUDA -- GPU acceleration via trueno-gpu generates PTX directly from Rust. No nvcc, no LLVM, no C++ dependencies.
- Cost-based dispatch -- Backend selection (GPU/SIMD/scalar) uses a 5x PCIe cost model to avoid GPU overhead on small workloads.
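The cost-based dispatch idea can be sketched as follows. This is a minimal illustration of the concept, not realizar's actual API; the function name, thresholds, and enum are assumptions, with only the 5x PCIe cost factor taken from the text above.

```rust
// Illustrative cost-based backend selection: PCIe transfer is modeled as
// roughly 5x the cost of local memory access, so the GPU is chosen only
// when the arithmetic work amortizes the transfer.
#[derive(Debug, PartialEq)]
enum Backend {
    Gpu,
    Simd,
    Scalar,
}

fn select_backend(flops: u64, bytes_moved: u64) -> Backend {
    const PCIE_COST_FACTOR: u64 = 5; // relative cost of PCIe vs. local memory
    let transfer_cost = bytes_moved * PCIE_COST_FACTOR;
    if flops > transfer_cost * 100 {
        Backend::Gpu // compute-bound: GPU launch + transfer overhead amortized
    } else if flops > 1_000 {
        Backend::Simd // medium workloads: vectorized CPU path
    } else {
        Backend::Scalar // tiny workloads: avoid all dispatch overhead
    }
}
```

The exact thresholds would be tuned per hardware; the point is that backend choice is a function of work size, not a global switch.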
Installation
CLI

```shell
# Install the CLI from crates.io (assumes the crate ships a binary target)
cargo install realizar
```

Library

Add to your Cargo.toml:

```toml
[dependencies]
realizar = "0.8"
```
Usage
Serving
```shell
# Start demo server (exact flags are illustrative; see `realizar --help`)
realizar serve --demo

# Health check (host/port are assumptions)
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```
OpenAI-Compatible API
```shell
# Chat completions (request shape follows the OpenAI API; host/port are assumptions)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming: same endpoint, with "stream": true added to the request body
```
Library API
```rust
// NOTE: the import path and call arguments were stripped from the original
// README; the placeholders below mark the elisions.
// use realizar::...;

let template = auto_detect_template(/* model metadata */);
let messages = vec![/* ChatMessage values */];
let formatted = template.format_conversation(&messages)?;
```
Tracing
Use the X-Trace-Level header for inference debugging:
```shell
# Header values below are inferred from the level names in the original docs.

# Brick-level: token-by-token timing
-H "X-Trace-Level: brick"

# Layer-level: per-layer timing breakdown
-H "X-Trace-Level: layer"
```
Features
Model Formats
| Format | Description |
|---|---|
| APR v2 | Native format with LZ4/ZSTD compression, zero-copy loading, Int4/Int8 quantization |
| GGUF | llama.cpp-compatible quantized models |
| SafeTensors | HuggingFace full-precision format |
GPU Kernels
| Kernel | Purpose |
|---|---|
| GemmKernel | Matrix multiplication (naive, tiled, tensor core) |
| AttentionKernel | FlashAttention-style tiled attention |
| SoftmaxKernel | Numerically stable with warp shuffle |
| LayerNormKernel | Fused layer normalization |
| QuantizeKernel | Q4_K dequantization fused with matmul |
| Q5KKernel | Q5_K dequantization |
| Q6KKernel | Q6_K dequantization |
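As a reference for what the SoftmaxKernel computes, here is a scalar sketch of numerically stable softmax; the GPU kernel performs the same max/sum reductions with warp shuffles. This is a plain CPU illustration, not the kernel code.

```rust
// Numerically stable softmax: subtracting the row max before
// exponentiating prevents overflow for large logits without
// changing the result (the max cancels in the normalization).
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```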
Quantization
Q4_0, Q8_0, Q4_K, Q5_K, Q6_K -- SIMD-accelerated on AVX2, AVX-512, and NEON. GPU dequantization fused with matrix operations to avoid memory round-trips.
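To make the block-quantization scheme concrete, here is an illustrative Q8_0-style round trip: 32 values per block, one scale plus 32 signed 8-bit codes. This mirrors the GGUF Q8_0 layout in spirit only; the real format stores the scale as f16 and the kernels use SIMD.

```rust
// Quantize one block of 32 f32 values to (scale, i8 codes),
// where scale = max_abs / 127 so the largest value maps to +/-127.
fn quantize_q8_0(block: &[f32; 32]) -> (f32, [i8; 32]) {
    let max_abs = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let mut q = [0i8; 32];
    for (i, &x) in block.iter().enumerate() {
        q[i] = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

// Dequantize: each code scales back to an f32 approximation.
fn dequantize_q8_0(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (i, &v) in q.iter().enumerate() {
        out[i] = v as f32 * scale;
    }
    out
}
```

The round-trip error per value is bounded by half a quantization step, which is why fusing dequantization into the matmul (instead of materializing f32 tensors) costs accuracy-neutral bandwidth savings.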
KV Cache
Autoregressive decoding with persistent key-value cache. Supports grouped-query attention (GQA) for models like Qwen2.5 and Llama 3.
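The GQA head mapping can be sketched in a few lines. This is an illustration of the standard grouping rule, not realizar's internal code: several query heads share one key/value head, shrinking the KV cache by the grouping factor.

```rust
// Map a query head to its shared KV head under grouped-query attention.
// With 12 query heads and 2 KV heads, heads 0..6 read KV head 0 and
// heads 6..12 read KV head 1 (a 6:1 grouping).
fn kv_head_for(query_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_q_heads % n_kv_heads == 0, "query heads must group evenly");
    query_head / (n_q_heads / n_kv_heads)
}
```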
Chat Templates
Automatic template detection from model metadata:
| Format | Models | System Prompt |
|---|---|---|
| ChatML | Qwen2, Yi, OpenHermes | Yes |
| Llama2 | TinyLlama, Vicuna, LLaMA 2 | Yes |
| Mistral | Mistral-7B, Mixtral | No |
| Phi | Phi-2, Phi-3 | Yes |
| Alpaca | Alpaca, Guanaco | Yes |
| Raw | Fallback | Passthrough |
| Custom | Any (Jinja2) | Configurable |
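As an example of what one of these templates produces, here is a minimal sketch of ChatML formatting (the convention used by Qwen2, Yi, and OpenHermes). The tag layout is the standard ChatML convention; the function itself is illustrative and not realizar's template API.

```rust
// Format (role, content) pairs in ChatML and append a generation
// prompt so the model continues as the assistant.
fn format_chatml(messages: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", role, content));
    }
    out.push_str("<|im_start|>assistant\n"); // generation prompt
    out
}
```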
Feature Flags
| Flag | Description |
|---|---|
| default | server + cli + gpu |
| cuda | NVIDIA CUDA support (pure Rust PTX, no nvcc) |
| minimal | Core inference only |
| bench-http | External server benchmarking |
Benchmarks
LLM Inference (GPU)
| Model | Size | Format | Backend | Throughput |
|---|---|---|---|---|
| Qwen2.5-Coder Q4_K_M | 1.5B | APR | RTX 4090 (CUDA) | 240 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | RTX 4090 (CUDA) | 276 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | llama.cpp CUDA | 256 tok/s |
| Phi-2 Q4_K_M | 2.7B | GGUF | Ollama CUDA | 228 tok/s |
realizar achieves 8--21% faster inference than llama.cpp/Ollama via pure Rust CUDA PTX generation.
Classical ML (APR Format)
| Model | Parameters | Latency | Throughput |
|---|---|---|---|
| Iris | 131 | 103ns | 9.6M inferences/sec |
| MNIST | 103K | 73us | 13.6K inferences/sec |
| Large NN | 1M | 410us | 2.4K inferences/sec |
Methodology follows Hoefler & Belli SC'15 (CV-based stopping, warmup iterations discarded).
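The CV-based stopping rule can be sketched as follows; the threshold and minimum sample count here are illustrative, not the values the benchmark harness actually uses.

```rust
// Coefficient of variation: relative spread of the timing samples.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}

// Stop collecting samples once timings have stabilized: enough
// samples gathered and their relative spread is below a threshold.
fn should_stop(samples: &[f64], threshold: f64) -> bool {
    samples.len() >= 10 && coefficient_of_variation(samples) < threshold
}
```

Warmup iterations are discarded before this check so that cold caches and JIT-like startup effects do not inflate the measured spread.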
Quality
- 15,000+ tests across unit, integration, and property-based suites
- 95%+ line coverage via cargo-llvm-cov
- Zero clippy warnings with -D warnings
- Mutation testing via cargo-mutants
- Provable contracts -- 1,725 bindings with AllImplemented policy
Sovereign AI Stack
realizar is the inference layer of the PAIML Sovereign AI Stack:
| Layer | Crate | Purpose |
|---|---|---|
| Compute | trueno | SIMD/GPU primitives (AVX2/AVX-512/NEON, wgpu) |
| ML | aprender | ML algorithms, APR v2 format |
| Training | entrenar | Autograd, LoRA/QLoRA, quantization |
| Inference | realizar | LLM inference, GPU kernels, model serving |
| Speech | whisper-apr | Pure Rust Whisper ASR |
| Distribution | repartir | Distributed compute (CPU/GPU/Remote) |
| Registry | pacha | Model registry with Ed25519 signatures |
| Orchestration | batuta | Stack coordination and CLI |
Documentation
- API docs: docs.rs/realizar
- Repository: github.com/paiml/realizar
- Cookbook: github.com/paiml/apr-cookbook
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines, or open an issue to discuss your idea first.
License
MIT -- Pragmatic AI Labs