realizar has moved to aprender-serve.
This crate re-exports aprender-serve for backward compatibility.
New code should depend on aprender-serve directly.
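Because the crate is now only a re-export shim, existing dependents keep compiling, but new projects can skip the indirection. A hedged `Cargo.toml` sketch (the version number below is a placeholder, not a known published release):

```toml
[dependencies]
# Depend on aprender-serve directly; replace the placeholder
# version with the latest release published on crates.io.
aprender-serve = "0.1"
```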
Modules
- api
- HTTP API for model inference
- apr
- Aprender .apr format support (PRIMARY inference format)
- apr_transformer
- APR Transformer format for WASM-compatible LLM inference
- arch_requirements
- Per-architecture required weight roles (GH-279).
- audit
- Audit trail and provenance logging
- bench
- Benchmark harness for model runner comparison
- bench_preflight
- Preflight validation protocol for deterministic benchmarking
- bench_viz
- Benchmark visualization for inference comparison (PAR-040)
- brick
- ComputeBrick architecture for token-centric, self-verifying inference
- cache
- Model caching and warming for reduced latency
- capability
- GH-280: Kernel capability gate — contract-driven GPU admission control.
- chat_template
- Chat template engine for model-specific message formatting
- cli
- CLI command implementations (extracted for testability)
- contract_gate
- GGUF to APR Transformer converter
- convert
- GGUF to APR Transformer Converter
- cuda
- CUDA PTX generation for NVIDIA GPUs
- error
- Error types for Realizar
- explain
- Model explainability (SHAP, Attention)
- format
- Unified model format detection (APR, GGUF, SafeTensors)
- generate
- Text generation and sampling strategies
- gguf
- GGUF (GPT-Generated Unified Format) parser
- gpu
- GPU acceleration module (Phase 4: ≥100 tok/s target)
- grammar
- Grammar-constrained generation for structured output
- infer
- High-level inference API for CLI tools
- inference
- SIMD-accelerated inference engine using trueno
- inference_trace
- Inference tracing for debugging LLM pipelines
- layers
- Neural network layers for transformer models
- memory
- Memory management for hot expert pinning
- metrics
- Metrics collection and reporting for production monitoring
- model_loader
- Unified model loader for APR, GGUF, and SafeTensors
- moe
- Mixture-of-Experts (MOE) routing with Capacity Factor load balancing
- observability
- Observability: metrics, tracing, and A/B testing
- paged_kv
- PagedAttention KV cache management
- parallel
- Multi-GPU and Distributed Inference
- ptx_parity
- PTX Parity Validation — GH-219
- quantize
- Quantization and dequantization for model weights
- registry
- Model registry for multi-model serving
- safetensors
- Safetensors parser
- safetensors_cuda
- SafeTensors CUDA inference (PMAT-116)
- safetensors_infer
- SafeTensors inference support (PAR-301)
- scheduler
- Continuous batching scheduler
- serve
- Aprender ML Model Serving API
- speculative
- Speculative decoding for LLM inference acceleration
- stats
- Statistical analysis for A/B testing with log-normal latency support
- target
- Multi-target deployment support (Lambda, Docker, WASM)
- tensor
- Tensor implementation
- tensor_names
- GH-311: Contract-driven tensor name resolution (tensor-names-v1.yaml codegen)
- tokenizer
- Tokenizer for text encoding and decoding
- tui
- TUI monitoring for inference performance
- uri
- Pacha URI scheme support for model loading
- viz
- Benchmark visualization using trueno-viz.
- warmup
- Model warm-up and pre-loading
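The `format` module above advertises unified detection of APR, GGUF, and SafeTensors containers. As an illustration of how such detection typically works (a standalone sketch, not the crate's actual implementation; the APR magic is not documented here, so only GGUF and SafeTensors are handled): GGUF files open with the ASCII magic `GGUF`, while SafeTensors files open with a little-endian `u64` JSON-header length followed by `{`.

```rust
/// Container formats this sketch can recognize.
#[derive(Debug, PartialEq)]
enum Format {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Heuristic format detection from a file's leading bytes.
fn detect_format(bytes: &[u8]) -> Format {
    // GGUF: 4-byte ASCII magic "GGUF" at offset 0.
    if bytes.len() >= 4 && &bytes[..4] == b"GGUF" {
        return Format::Gguf;
    }
    // SafeTensors: little-endian u64 header length, then a JSON object.
    if bytes.len() >= 9 {
        let header_len = u64::from_le_bytes(bytes[..8].try_into().unwrap());
        if header_len > 0 && bytes[8] == b'{' {
            return Format::SafeTensors;
        }
    }
    Format::Unknown
}

fn main() {
    // "GGUF" magic followed by a version field.
    println!("{:?}", detect_format(b"GGUF\x03\x00\x00\x00")); // prints "Gguf"
}
```

The order of the checks matters: the fixed 4-byte magic is unambiguous, so it is tested before the looser header-length heuristic.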
Macros
- profile_clear
- Clear the thread-local profiler
- profile_report
- Get a report from the thread-local profiler
- profile_start
- Start profiling an operation using the thread-local profiler
- profile_stop
- Stop profiling an operation using the thread-local profiler
- trace_cpu
- No-op version of trace_cpu when the "trace" feature is disabled.
- trace_gpu
- No-op version of trace_gpu when the "trace" feature is disabled.
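The four `profile_*` macros describe a thread-local profiler: start and stop timers per operation, then report or clear the results. A minimal standalone sketch of that pattern using plain functions (illustrative only; the `profile_start` etc. below are local helpers written for this example, not the crate's macros):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

thread_local! {
    // Timers currently running on this thread.
    static STARTED: RefCell<HashMap<String, Instant>> = RefCell::new(HashMap::new());
    // Completed (operation, elapsed) pairs.
    static FINISHED: RefCell<Vec<(String, Duration)>> = RefCell::new(Vec::new());
}

fn profile_start(op: &str) {
    STARTED.with(|s| {
        s.borrow_mut().insert(op.to_string(), Instant::now());
    });
}

fn profile_stop(op: &str) {
    let started = STARTED.with(|s| s.borrow_mut().remove(op));
    if let Some(t0) = started {
        FINISHED.with(|f| f.borrow_mut().push((op.to_string(), t0.elapsed())));
    }
}

fn profile_report() -> String {
    FINISHED.with(|f| {
        f.borrow()
            .iter()
            .map(|(op, d)| format!("{op}: {d:?}"))
            .collect::<Vec<_>>()
            .join("\n")
    })
}

fn profile_clear() {
    STARTED.with(|s| s.borrow_mut().clear());
    FINISHED.with(|f| f.borrow_mut().clear());
}

fn main() {
    profile_start("matmul");
    profile_stop("matmul");
    println!("{}", profile_report());
    profile_clear();
}
```

Thread-local storage is the natural fit here: each worker thread accumulates its own timings with no locking, at the cost of reports being per-thread rather than global.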
Structs
- BatchInferenceConfig
- Batch inference: load model once, process multiple prompts sequentially.
- BatchPrompt
- Single prompt in a batch
- BatchResult
- Single result from batch inference
- BatchStats
- Aggregate stats for the entire batch
- InferenceConfig
- Configuration for inference
- InferenceResult
- Result from inference
- InferenceTracer
- Inference tracer
- KernelDimensions
- Model dimensions needed to construct and validate kernels
- MappedSafeTensorsModel
- Zero-copy memory-mapped SafeTensors model container
- ModelInfo
- Model information for trace header
- PreparedTokens
- Tokenized input that has been processed through chat template formatting.
- PtxParityReport
- Full PTX parity validation report
- SafetensorsConfig
- Model configuration from config.json
- ShardedSafeTensorsModel
- Sharded SafeTensors model container (GH-213)
- Tensor
- N-dimensional tensor with automatic backend dispatch
- TraceConfig
- Trace configuration
- ValidatedAprTransformer
- Validated APR Transformer — compile-time guarantee of tensor data quality
Enums
- RealizarError
- Error type for all Realizar operations
- TraceStep
- Inference pipeline steps (State Machine states per AWS Step Functions model)
- WeightRole
- Weight roles that may be required for a transformer layer. Each architecture requires a subset of these.
Constants
- VERSION
- Library version
Functions
- required_roles
- Returns the required weight roles for a given architecture.
- rms_norm
- RMSNorm (Root Mean Square Layer Normalization)
- rms_norm_into
- RMSNorm to a pre-allocated buffer (zero-allocation path)
- run_batch_inference
- Run batch inference, auto-detecting model format (GGUF or APR).
- run_inference
- Run inference on a model
Type Aliases
- Result
- Result type alias for Realizar operations