Crate realizar

realizar has moved to aprender-serve.

This crate re-exports aprender-serve for backward compatibility. New code should depend on aprender-serve directly.
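To migrate, replace the `realizar` dependency with `aprender-serve` in your manifest. A minimal Cargo.toml sketch (the version number is illustrative; check crates.io for the current release):

```toml
[dependencies]
# Depend on aprender-serve directly instead of the realizar shim.
# "0.1" is a placeholder version, not the actual latest release.
aprender-serve = "0.1"
```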

Modules§

api
HTTP API for model inference
apr
Aprender .apr format support (PRIMARY inference format)
apr_transformer
APR Transformer format for WASM-compatible LLM inference
arch_requirements
Per-architecture required weight roles (GH-279).
audit
Audit trail and provenance logging
bench
Benchmark harness for model runner comparison
bench_preflight
Preflight validation protocol for deterministic benchmarking
bench_viz
Benchmark visualization for inference comparison (PAR-040)
brick
ComputeBrick architecture for token-centric, self-verifying inference
cache
Model caching and warming for reduced latency
capability
GH-280: Kernel capability gate — contract-driven GPU admission control.
chat_template
Chat template engine for model-specific message formatting
cli
CLI command implementations (extracted for testability)
contract_gate
GGUF to APR Transformer converter
convert
GGUF to APR Transformer Converter
cuda
CUDA PTX generation for NVIDIA GPUs
error
Error types for Realizar
explain
Model explainability (SHAP, Attention)
format
Unified model format detection (APR, GGUF, SafeTensors)
generate
Text generation and sampling strategies
gguf
GGUF (GPT-Generated Unified Format) parser
gpu
GPU acceleration module (Phase 4: ≥100 tok/s target)
grammar
Grammar-constrained generation for structured output
infer
High-level inference API for CLI tools
inference
SIMD-accelerated inference engine using trueno
inference_trace
Inference tracing for debugging LLM pipelines
layers
Neural network layers for transformer models
memory
Memory management for hot expert pinning
metrics
Metrics collection and reporting for production monitoring
model_loader
Unified model loader for APR, GGUF, and SafeTensors
moe
Mixture-of-Experts (MOE) routing with Capacity Factor load balancing
observability
Observability: metrics, tracing, and A/B testing
paged_kv
PagedAttention KV cache management
parallel
Multi-GPU and Distributed Inference
ptx_parity
PTX Parity Validation — GH-219
quantize
Quantization and dequantization for model weights
registry
Model registry for multi-model serving
safetensors
Safetensors parser
safetensors_cuda
SafeTensors CUDA inference (PMAT-116)
safetensors_infer
SafeTensors inference support (PAR-301)
scheduler
Continuous batching scheduler
serve
Aprender ML Model Serving API
speculative
Speculative decoding for LLM inference acceleration
stats
Statistical analysis for A/B testing with log-normal latency support
target
Multi-target deployment support (Lambda, Docker, WASM)
tensor
Tensor implementation
tensor_names
GH-311: Contract-driven tensor name resolution (tensor-names-v1.yaml codegen)
tokenizer
Tokenizer for text encoding and decoding
tui
TUI monitoring for inference performance
uri
Pacha URI scheme support for model loading
viz
Benchmark visualization using trueno-viz.
warmup
Model warm-up and pre-loading

Macros§

profile_clear
Clear the thread-local profiler
profile_report
Get a report from the thread-local profiler
profile_start
Start profiling an operation using the thread-local profiler
profile_stop
Stop profiling an operation using the thread-local profiler
trace_cpu
No-op version of trace_cpu when the "trace" feature is disabled.
trace_gpu
No-op version of trace_gpu when the "trace" feature is disabled.
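The profile_* macros share a thread-local profiler: profile_start/profile_stop bracket an operation, profile_report reads accumulated timings, and profile_clear resets state. A minimal standalone sketch of that pattern, written as plain functions; this is NOT the crate's implementation, and all names below are illustrative:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Sketch of a thread-local profiler: start times for in-flight
// operations, plus accumulated totals per operation name.
thread_local! {
    static ACTIVE: RefCell<HashMap<String, Instant>> = RefCell::new(HashMap::new());
    static TOTALS: RefCell<HashMap<String, Duration>> = RefCell::new(HashMap::new());
}

fn profile_start(op: &str) {
    ACTIVE.with(|a| { a.borrow_mut().insert(op.to_string(), Instant::now()); });
}

fn profile_stop(op: &str) {
    ACTIVE.with(|a| {
        if let Some(start) = a.borrow_mut().remove(op) {
            let elapsed = start.elapsed();
            TOTALS.with(|t| {
                *t.borrow_mut().entry(op.to_string()).or_insert(Duration::ZERO) += elapsed;
            });
        }
    });
}

fn profile_report() -> Vec<(String, Duration)> {
    TOTALS.with(|t| t.borrow().iter().map(|(k, v)| (k.clone(), *v)).collect())
}

fn profile_clear() {
    ACTIVE.with(|a| a.borrow_mut().clear());
    TOTALS.with(|t| t.borrow_mut().clear());
}

fn main() {
    profile_start("matmul");
    profile_stop("matmul");
    let report = profile_report();
    assert_eq!(report.len(), 1);
    assert_eq!(report[0].0, "matmul");
    profile_clear();
    assert!(profile_report().is_empty());
    println!("ok");
}
```

Thread-local storage avoids lock contention on the hot path, at the cost of per-thread rather than global reports.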

Structs§

BatchInferenceConfig
Batch inference: load model once, process multiple prompts sequentially.
BatchPrompt
Single prompt in a batch
BatchResult
Single result from batch inference
BatchStats
Aggregate stats for the entire batch
InferenceConfig
Configuration for inference
InferenceResult
Result from inference
InferenceTracer
Inference tracer
KernelDimensions
Model dimensions needed to construct and validate kernels
MappedSafeTensorsModel
Zero-copy memory-mapped SafeTensors model container
ModelInfo
Model information for trace header
PreparedTokens
Tokenized input that has been processed through chat template formatting.
PtxParityReport
Full PTX parity validation report
SafetensorsConfig
Model configuration from config.json
ShardedSafeTensorsModel
Sharded SafeTensors model container (GH-213)
Tensor
N-dimensional tensor with automatic backend dispatch
TraceConfig
Trace configuration
ValidatedAprTransformer
Validated APR Transformer — compile-time guarantee of tensor data quality

Enums§

RealizarError
Error type for all Realizar operations
TraceStep
Inference pipeline steps (State Machine states per AWS Step Functions model)
WeightRole
Weight roles that may be required for a transformer layer. Each architecture requires a subset of these.

Constants§

VERSION
Library version

Functions§

required_roles
Returns the required weight roles for a given architecture.
rms_norm
RMSNorm (Root Mean Square Layer Normalization)
rms_norm_into
RMSNorm to pre-allocated buffer (zero-allocation path)
run_batch_inference
Run batch inference, auto-detecting model format (GGUF or APR).
run_inference
Run inference on a model
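rms_norm applies Root Mean Square Layer Normalization: each element is scaled by the reciprocal RMS of the vector, then by a learned per-element gain, i.e. y[i] = x[i] * g[i] / sqrt(mean(x²) + eps). A self-contained sketch of the algorithm; the crate's actual signature and epsilon handling may differ:

```rust
/// Standalone RMSNorm sketch: y[i] = x[i] * gain[i] / sqrt(mean(x^2) + eps).
/// Illustrative only; not the crate's rms_norm signature.
fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), gain.len());
    // Mean of squared elements, stabilized by eps before the sqrt.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let x = [3.0_f32, 4.0];
    // mean(x^2) = (9 + 16) / 2 = 12.5
    let y = rms_norm(&x, &[1.0, 1.0], 0.0);
    assert!((y[0] - 3.0 / 12.5_f32.sqrt()).abs() < 1e-6);
    assert!((y[1] - 4.0 / 12.5_f32.sqrt()).abs() < 1e-6);
    println!("{y:?}");
}
```

The rms_norm_into variant listed above presumably writes into a caller-provided buffer to avoid the allocation of the returned Vec.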

Type Aliases§

Result
Result type alias for Realizar operations
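This alias follows the common crate-local Result convention: fixing the error type to RealizarError so signatures only name the success type. A sketch of the pattern with an invented stand-in error enum (DemoError and its variant are illustrative, not the crate's actual RealizarError):

```rust
// Illustrative crate-local Result alias pattern.
// DemoError stands in for RealizarError; its variants are invented.
#[derive(Debug)]
enum DemoError {
    ModelNotFound(String),
}

type Result<T> = std::result::Result<T, DemoError>;

// Functions return Result<T> and the error type is implied.
fn load(path: &str) -> Result<usize> {
    if path.is_empty() {
        return Err(DemoError::ModelNotFound(path.to_string()));
    }
    Ok(path.len())
}

fn main() {
    assert_eq!(load("model.apr").unwrap(), 9);
    assert!(load("").is_err());
    println!("ok");
}
```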