Llama-family transformer model for inference.
Composes trueno primitives (rms_norm, Q4K matmul, fused attention) into a complete transformer that loads GGUF weights and generates text.
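As an illustration of one of the primitives named above, here is a minimal standalone sketch of RMSNorm (the crate's actual `rms_norm` lives in trueno and is likely vectorized; this version only shows the math, `y[i] = x[i] * w[i] / sqrt(mean(x^2) + eps)`):

```rust
/// Standalone RMSNorm sketch: normalize `x` by its root-mean-square,
/// then scale elementwise by the learned weight vector `w`.
/// This is an illustrative reimplementation, not the trueno kernel.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Mean of squares over the hidden dimension.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Single reciprocal scale applied to every element.
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}
```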
Structs
- `ForwardArena` - Pre-allocated scratch buffers for the forward pass. Eliminates all per-token heap allocations (FALSIFY-ARENA-001). Contract: `contracts/cgp/cgp-inference-arena-v1.yaml`
- `KvCache` - KV cache for incremental decoding.
- `LayerWeights` - Weights for a single transformer layer.
- `LlamaModel` - Complete transformer model ready for inference.
- `ModelConfig` - Model hyperparameters extracted from GGUF metadata.
- `ModelWeights` - Full model weights.
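To make the KV-cache idea concrete, here is a minimal sketch of incremental decoding state (an assumed layout; the crate's `KvCache` fields and methods may differ). Each generated token appends its key/value vectors per layer, so attention at step *t* reuses the cached history instead of recomputing it:

```rust
/// Assumed KV-cache layout for illustration: one growing flat buffer of
/// keys and one of values per layer, each holding `seq_len * head_dim_total`
/// floats. Not the crate's actual definition.
struct KvCache {
    keys: Vec<Vec<f32>>,   // per-layer key buffers, appended to each step
    values: Vec<Vec<f32>>, // per-layer value buffers
    head_dim_total: usize, // n_kv_heads * head_dim, size of one token's K or V
}

impl KvCache {
    fn new(n_layers: usize, head_dim_total: usize) -> Self {
        Self {
            keys: vec![Vec::new(); n_layers],
            values: vec![Vec::new(); n_layers],
            head_dim_total,
        }
    }

    /// Append one token's K and V for `layer`; returns the cached
    /// sequence length after the append.
    fn append(&mut self, layer: usize, k: &[f32], v: &[f32]) -> usize {
        assert_eq!(k.len(), self.head_dim_total);
        assert_eq!(v.len(), self.head_dim_total);
        self.keys[layer].extend_from_slice(k);
        self.values[layer].extend_from_slice(v);
        self.keys[layer].len() / self.head_dim_total
    }
}
```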
Enums
- `WeightMatrix` - A weight matrix stored either as raw Q4K bytes or as F32 values dequantized at load time from any other quantization format.
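A hedged sketch of what such a two-variant enum could look like (variant names and fields are assumptions, not the crate's actual definition). Q4K super-blocks pack 256 weights into 144 bytes, so keeping them as raw bytes lets the fused Q4K matmul consume them directly, while every other quantization format is dequantized to F32 once at load:

```rust
/// Illustrative weight-matrix enum: Q4K stays as raw quantized bytes,
/// everything else is dequantized to F32 at load time. Assumed layout,
/// not the crate's real `WeightMatrix`.
enum WeightMatrix {
    /// Raw Q4K super-blocks (144 bytes per 256 weights), consumed
    /// directly by a fused Q4K matmul kernel.
    Q4K { data: Vec<u8>, rows: usize, cols: usize },
    /// Any other quant format, dequantized to F32 at load time.
    F32 { data: Vec<f32>, rows: usize, cols: usize },
}

impl WeightMatrix {
    /// In-memory storage footprint of this matrix in bytes.
    fn storage_bytes(&self) -> usize {
        match self {
            WeightMatrix::Q4K { data, .. } => data.len(),
            WeightMatrix::F32 { data, .. } => data.len() * 4,
        }
    }
}
```

Keeping Q4K un-dequantized trades a little per-matmul decode work for roughly an 8x smaller resident footprint versus F32.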