Llama-family transformer model for inference.
Composes trueno primitives (rms_norm, Q4K matmul, fused attention) into a complete transformer that loads GGUF weights and generates text.
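As an illustration of one of the primitives named above, here is a minimal standalone sketch of RMSNorm (the crate's actual `rms_norm` lives in trueno and is likely vectorized; this version only shows the math, `y[i] = x[i] * w[i] / sqrt(mean(x^2) + eps)`):

```rust
/// Standalone RMSNorm sketch: normalize `x` by its root-mean-square,
/// then scale elementwise by the learned weight vector `w`.
/// This is an illustrative reimplementation, not the trueno kernel.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Mean of squares over the hidden dimension.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Single reciprocal scale applied to every element.
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}
```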
Structs
- `ForwardArena` - Pre-allocated scratch buffers for the forward pass. Eliminates all per-token heap allocations (FALSIFY-ARENA-001). Contract: `contracts/cgp/cgp-inference-arena-v1.yaml`
- `KvCache` - KV cache for incremental decoding.
- `LayerWeights` - Weights for a single transformer layer.
- `LlamaModel` - Complete transformer model ready for inference.
- `ModelConfig` - Model hyperparameters extracted from GGUF metadata.
- `ModelWeights` - Full model weights.
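To make the KV-cache idea concrete, here is a minimal sketch of incremental decoding state (an assumed layout; the crate's `KvCache` fields and methods may differ). Each generated token appends its key/value vectors per layer, so attention at step *t* reuses the cached history instead of recomputing it:

```rust
/// Assumed KV-cache layout for illustration: one growing flat buffer of
/// keys and one of values per layer, each holding `seq_len * head_dim_total`
/// floats. Not the crate's actual definition.
struct KvCache {
    keys: Vec<Vec<f32>>,   // per-layer key buffers, appended to each step
    values: Vec<Vec<f32>>, // per-layer value buffers
    head_dim_total: usize, // n_kv_heads * head_dim, size of one token's K or V
}

impl KvCache {
    fn new(n_layers: usize, head_dim_total: usize) -> Self {
        Self {
            keys: vec![Vec::new(); n_layers],
            values: vec![Vec::new(); n_layers],
            head_dim_total,
        }
    }

    /// Append one token's K and V for `layer`; returns the cached
    /// sequence length after the append.
    fn append(&mut self, layer: usize, k: &[f32], v: &[f32]) -> usize {
        assert_eq!(k.len(), self.head_dim_total);
        assert_eq!(v.len(), self.head_dim_total);
        self.keys[layer].extend_from_slice(k);
        self.values[layer].extend_from_slice(v);
        self.keys[layer].len() / self.head_dim_total
    }
}
```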
Enums
- `WeightMatrix` - A weight matrix stored either as raw Q4K bytes or as F32 values dequantized at load time from any other quantization format.
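A hedged sketch of what such a two-variant enum could look like (variant names and fields are assumptions, not the crate's actual definition). Q4K super-blocks pack 256 weights into 144 bytes, so keeping them as raw bytes lets the fused Q4K matmul consume them directly, while every other quantization format is dequantized to F32 once at load:

```rust
/// Illustrative weight-matrix enum: Q4K stays as raw quantized bytes,
/// everything else is dequantized to F32 at load time. Assumed layout,
/// not the crate's real `WeightMatrix`.
enum WeightMatrix {
    /// Raw Q4K super-blocks (144 bytes per 256 weights), consumed
    /// directly by a fused Q4K matmul kernel.
    Q4K { data: Vec<u8>, rows: usize, cols: usize },
    /// Any other quant format, dequantized to F32 at load time.
    F32 { data: Vec<f32>, rows: usize, cols: usize },
}

impl WeightMatrix {
    /// In-memory storage footprint of this matrix in bytes.
    fn storage_bytes(&self) -> usize {
        match self {
            WeightMatrix::Q4K { data, .. } => data.len(),
            WeightMatrix::F32 { data, .. } => data.len() * 4,
        }
    }
}
```

Keeping Q4K un-dequantized trades a little per-matmul decode work for roughly an 8x smaller resident footprint versus F32.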