realizar has moved to aprender-serve.
This crate re-exports aprender-serve for backward compatibility.
New code should depend on aprender-serve directly.
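Because the crate is now only a re-export shim, existing dependents keep compiling, but new projects can skip the indirection. A hedged `Cargo.toml` sketch (the version number below is a placeholder, not a known published release):

```toml
[dependencies]
# Depend on aprender-serve directly; replace the placeholder
# version with the latest release published on crates.io.
aprender-serve = "0.1"
```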
Modules
- api
- HTTP API for model inference
- apr
- Aprender .apr format support (PRIMARY inference format)
- apr_transformer
- APR Transformer format for WASM-compatible LLM inference
- arch_requirements
- Per-architecture required weight roles (GH-279).
- audit
- Audit trail and provenance logging
- bench
- Benchmark harness for model runner comparison
- bench_preflight
- Preflight validation protocol for deterministic benchmarking
- bench_viz
- Benchmark visualization for inference comparison (PAR-040)
- brick
- ComputeBrick architecture for token-centric, self-verifying inference
- cache
- Model caching and warming for reduced latency
- capability
- GH-280: Kernel capability gate — contract-driven GPU admission control.
- chat_template
- Chat template engine for model-specific message formatting
- cli
- CLI command implementations (extracted for testability)
- contract_gate
- GGUF to APR Transformer converter
- convert
- GGUF to APR Transformer Converter
- cuda
- CUDA PTX generation for NVIDIA GPUs
- error
- Error types for Realizar
- explain
- Model explainability (SHAP, Attention)
- format
- Unified model format detection (APR, GGUF, SafeTensors)
- generate
- Text generation and sampling strategies
- gguf
- GGUF (GPT-Generated Unified Format) parser
- gpu
- GPU acceleration module (Phase 4: ≥100 tok/s target)
- grammar
- Grammar-constrained generation for structured output
- infer
- High-level inference API for CLI tools
- inference
- SIMD-accelerated inference engine using trueno
- inference_trace
- Inference tracing for debugging LLM pipelines
- layers
- Neural network layers for transformer models
- memory
- Memory management for hot expert pinning
- metrics
- Metrics collection and reporting for production monitoring
- model_loader
- Unified model loader for APR, GGUF, and SafeTensors
- moe
- Mixture-of-Experts (MOE) routing with Capacity Factor load balancing
- observability
- Observability: metrics, tracing, and A/B testing
- paged_kv
- PagedAttention KV cache management
- parallel
- Multi-GPU and Distributed Inference
- ptx_parity
- PTX Parity Validation — GH-219
- quantize
- Quantization and dequantization for model weights
- registry
- Model registry for multi-model serving
- safetensors
- Safetensors parser
- safetensors_cuda
- SafeTensors CUDA inference (PMAT-116)
- safetensors_infer
- SafeTensors inference support (PAR-301)
- scheduler
- Continuous batching scheduler
- serve
- Aprender ML Model Serving API
- speculative
- Speculative decoding for LLM inference acceleration
- stats
- Statistical analysis for A/B testing with log-normal latency support
- target
- Multi-target deployment support (Lambda, Docker, WASM)
- tensor
- Tensor implementation
- tensor_names
- GH-311: Contract-driven tensor name resolution (tensor-names-v1.yaml codegen)
- tokenizer
- Tokenizer for text encoding and decoding
- tui
- TUI monitoring for inference performance
- uri
- Pacha URI scheme support for model loading
- viz
- Benchmark visualization using trueno-viz.
- warmup
- Model warm-up and pre-loading
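The `format` module above advertises unified detection of APR, GGUF, and SafeTensors containers. As an illustration of how such detection typically works (a standalone sketch, not the crate's actual implementation; the APR magic is not documented here, so only GGUF and SafeTensors are handled): GGUF files open with the ASCII magic `GGUF`, while SafeTensors files open with a little-endian `u64` JSON-header length followed by `{`.

```rust
/// Container formats this sketch can recognize.
#[derive(Debug, PartialEq)]
enum Format {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Heuristic format detection from a file's leading bytes.
fn detect_format(bytes: &[u8]) -> Format {
    // GGUF: 4-byte ASCII magic "GGUF" at offset 0.
    if bytes.len() >= 4 && &bytes[..4] == b"GGUF" {
        return Format::Gguf;
    }
    // SafeTensors: little-endian u64 header length, then a JSON object.
    if bytes.len() >= 9 {
        let header_len = u64::from_le_bytes(bytes[..8].try_into().unwrap());
        if header_len > 0 && bytes[8] == b'{' {
            return Format::SafeTensors;
        }
    }
    Format::Unknown
}

fn main() {
    // "GGUF" magic followed by a version field.
    println!("{:?}", detect_format(b"GGUF\x03\x00\x00\x00")); // prints "Gguf"
}
```

The order of the checks matters: the fixed 4-byte magic is unambiguous, so it is tested before the looser header-length heuristic.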
Macros
- profile_clear
- Clear the thread-local profiler
- profile_report
- Get a report from the thread-local profiler
- profile_start
- Start profiling an operation using the thread-local profiler
- profile_stop
- Stop profiling an operation using the thread-local profiler
- trace_cpu
- No-op version of trace_cpu when the "trace" feature is disabled.
- trace_gpu
- No-op version of trace_gpu when the "trace" feature is disabled.
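The four `profile_*` macros describe a thread-local profiler: start and stop timers per operation, then report or clear the results. A minimal standalone sketch of that pattern using plain functions (illustrative only; the `profile_start` etc. below are local helpers written for this example, not the crate's macros):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

thread_local! {
    // Timers currently running on this thread.
    static STARTED: RefCell<HashMap<String, Instant>> = RefCell::new(HashMap::new());
    // Completed (operation, elapsed) pairs.
    static FINISHED: RefCell<Vec<(String, Duration)>> = RefCell::new(Vec::new());
}

fn profile_start(op: &str) {
    STARTED.with(|s| {
        s.borrow_mut().insert(op.to_string(), Instant::now());
    });
}

fn profile_stop(op: &str) {
    let started = STARTED.with(|s| s.borrow_mut().remove(op));
    if let Some(t0) = started {
        FINISHED.with(|f| f.borrow_mut().push((op.to_string(), t0.elapsed())));
    }
}

fn profile_report() -> String {
    FINISHED.with(|f| {
        f.borrow()
            .iter()
            .map(|(op, d)| format!("{op}: {d:?}"))
            .collect::<Vec<_>>()
            .join("\n")
    })
}

fn profile_clear() {
    STARTED.with(|s| s.borrow_mut().clear());
    FINISHED.with(|f| f.borrow_mut().clear());
}

fn main() {
    profile_start("matmul");
    profile_stop("matmul");
    println!("{}", profile_report());
    profile_clear();
}
```

Thread-local storage is the natural fit here: each worker thread accumulates its own timings with no locking, at the cost of reports being per-thread rather than global.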
Structs
- BatchInferenceConfig
- Batch inference: load model once, process multiple prompts sequentially.
- BatchPrompt
- Single prompt in a batch
- BatchResult
- Single result from batch inference
- BatchStats
- Aggregate stats for the entire batch
- InferenceConfig
- Configuration for inference
- InferenceResult
- Result from inference
- InferenceTracer
- Inference tracer
- KernelDimensions
- Model dimensions needed to construct and validate kernels
- MappedSafeTensorsModel
- Zero-copy memory-mapped SafeTensors model container
- ModelInfo
- Model information for trace header
- PreparedTokens
- Tokenized input that has been processed through chat template formatting.
- PtxParityReport
- Full PTX parity validation report
- SafetensorsConfig
- Model configuration from config.json
- ShardedSafeTensorsModel
- Sharded SafeTensors model container (GH-213)
- Tensor
- N-dimensional tensor with automatic backend dispatch
- TraceConfig
- Trace configuration
- ValidatedAprTransformer
- Validated APR Transformer — compile-time guarantee of tensor data quality
Enums
- RealizarError
- Error type for all Realizar operations
- TraceStep
- Inference pipeline steps (State Machine states per AWS Step Functions model)
- WeightRole
- Weight roles that may be required for a transformer layer. Each architecture requires a subset of these.
Constants
- VERSION
- Library version
Functions
- required_roles
- Returns the required weight roles for a given architecture.
- rms_norm
- RMSNorm (Root Mean Square Layer Normalization)
- rms_norm_into
- RMSNorm to a pre-allocated buffer (zero-allocation path)
- run_batch_inference
- Run batch inference, auto-detecting model format (GGUF or APR).
- run_inference
- Run inference on a model
Type Aliases
- Result
- Result type alias for Realizar operations