Module runner_weights

Generic CUDA decode runner weight loader.

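Loads transformer weights from safetensors files, fuses the separate Q/K/V projections into a single QKV tensor and the gate/up projections into a single gate_up tensor, then uploads the result to the GPU on a CUDA stream. Architecture-agnostic: works for Llama, Qwen2, Mistral, and any other model that uses the same standard tensor naming.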
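The fusion step can be illustrated with a minimal host-side sketch. It assumes each projection weight is stored row-major with shape `[out_features, in_features]`, so that fusing amounts to concatenating along the output dimension. The `Matrix` type and `fuse_rows` function are hypothetical names chosen for illustration and are not part of this module's API; the actual loader may use a different layout or dtype.

```rust
/// A row-major weight matrix of shape [rows, cols].
/// Hypothetical type for illustration only, not this module's API.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

/// Concatenate matrices along the row (output-feature) dimension.
///
/// Returns `None` if `parts` is empty, if the column counts differ,
/// or if any matrix's buffer length does not match its shape.
pub fn fuse_rows(parts: &[&Matrix]) -> Option<Matrix> {
    let cols = parts.first()?.cols;
    let mut rows = 0;
    let mut data = Vec::new();
    for m in parts {
        if m.cols != cols || m.data.len() != m.rows * m.cols {
            return None;
        }
        rows += m.rows;
        // Row-major storage: appending the buffers stacks the rows.
        data.extend_from_slice(&m.data);
    }
    Some(Matrix { rows, cols, data })
}
```

Under these assumptions, `fuse_rows(&[&q, &k, &v])` would produce the fused QKV weight and `fuse_rows(&[&gate, &up])` the fused gate_up weight. The parts only need to agree on the column count, so K and V may have fewer rows than Q, as they do in models with grouped-query attention.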