GGUF ↔ ferrum tensor-name translation.
Ferrum models address weights using HuggingFace-style names
(model.layers.0.self_attn.q_proj.weight). GGUF files use llama.cpp’s
shorthand (blk.0.attn_q.weight). This module is the single source of
truth for that mapping; both GgufLoader and any future tooling go
through ferrum_to_gguf.
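As an illustration of the mapping, here is a minimal sketch of the dense-model name translation. It is not the module's ferrum_to_gguf: the function name hf_to_gguf_sketch and its Option<String> return type are assumptions made for this example, and the table covers only the common llama.cpp tensor names for Llama-family models.

```rust
/// Hypothetical sketch (not the real `ferrum_to_gguf`): translate a
/// HuggingFace-style tensor name into llama.cpp's GGUF shorthand.
/// Returns `None` for any name outside this small table.
pub fn hf_to_gguf_sketch(name: &str) -> Option<String> {
    // Tensors that live outside the per-layer blocks.
    match name {
        "model.embed_tokens.weight" => return Some("token_embd.weight".to_string()),
        "model.norm.weight" => return Some("output_norm.weight".to_string()),
        "lm_head.weight" => return Some("output.weight".to_string()),
        _ => {}
    }

    // Per-layer tensors have the form "model.layers.{i}.{tail}".
    let rest = name.strip_prefix("model.layers.")?;
    let (idx, tail) = rest.split_once('.')?;
    let layer: usize = idx.parse().ok()?;

    let mapped = match tail {
        "self_attn.q_proj.weight" => "attn_q.weight",
        "self_attn.k_proj.weight" => "attn_k.weight",
        "self_attn.v_proj.weight" => "attn_v.weight",
        "self_attn.o_proj.weight" => "attn_output.weight",
        "mlp.gate_proj.weight" => "ffn_gate.weight",
        "mlp.up_proj.weight" => "ffn_up.weight",
        "mlp.down_proj.weight" => "ffn_down.weight",
        "input_layernorm.weight" => "attn_norm.weight",
        "post_attention_layernorm.weight" => "ffn_norm.weight",
        _ => return None,
    };
    Some(format!("blk.{layer}.{mapped}"))
}
```

Returning None for unknown names lets a caller decide whether a missing mapping is an error or simply a tensor the loader should skip.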
Scope: dense Llama-family models (Qwen3, Qwen2.x, Llama-3.x, Mistral,
TinyLlama) and Qwen-style MoE families (Qwen3-MoE, Mixtral, DeepSeek-V2).
The MoE families all use the same GGUF layout: a per-layer router tensor
ffn_gate_inp plus three stacked-expert tensors ffn_{gate,up,down}_exps,
each with shape [num_experts, ...].
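To make the MoE layout concrete, the sketch below lists the four GGUF tensor names a single MoE layer would carry under this scheme. The helper name moe_gguf_names_sketch is hypothetical, and the blk.{i} prefix and .weight suffix are assumed to follow the same pattern as the dense tensors.

```rust
/// Hypothetical helper: the GGUF names of the router and the three
/// stacked-expert tensors for one MoE layer, in the order
/// router, gate, up, down.
pub fn moe_gguf_names_sketch(layer: usize) -> [String; 4] {
    ["ffn_gate_inp", "ffn_gate_exps", "ffn_up_exps", "ffn_down_exps"]
        .map(|stem| format!("blk.{layer}.{stem}.weight"))
}
```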
§ferrum-side naming convention for MoE tensors
ferrum mirrors GGUF’s stacked layout rather than HuggingFace’s
experts.{e}.gate_proj per-expert layout. Reasons:
- The stacked form is what candle’s QMatMul::indexed_moe_forward expects; slicing per-expert is a runtime concern, not a storage concern.
- Loading per-expert from GGUF would require N reads + concat per layer (the dense path’s qkv-fusion shim works the other direction and only does 3, not N=128).
- If a future safetensors-MoE loader needs to consume per-expert tensors, it can do its own concat just like the dense Qwen2.5 path concatenates q/k/v.
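The concat mentioned in the last point can be sketched as follows. This is an illustration on plain f32 buffers, not ferrum or candle code: the function name stack_experts and the row-major Vec<f32> representation are assumptions, and a real loader would work on quantized or device tensors instead.

```rust
/// Hypothetical sketch: stack per-expert weight matrices (each row-major,
/// `rows * cols` elements) into a single buffer whose logical shape is
/// [num_experts, rows, cols].
pub fn stack_experts(
    experts: &[Vec<f32>],
    rows: usize,
    cols: usize,
) -> Result<(Vec<f32>, [usize; 3]), String> {
    let per_expert = rows * cols;
    let mut out = Vec::with_capacity(experts.len() * per_expert);
    for (e, w) in experts.iter().enumerate() {
        // Every expert must have the same shape, or the stacked
        // tensor would be ragged.
        if w.len() != per_expert {
            return Err(format!(
                "expert {e}: expected {per_expert} elements, got {}",
                w.len()
            ));
        }
        // With row-major storage, stacking along a new leading axis
        // is a plain append.
        out.extend_from_slice(w);
    }
    Ok((out, [experts.len(), rows, cols]))
}
```

Because the expert index becomes the leading axis, expert e occupies the contiguous range e * rows * cols .. (e + 1) * rows * cols of the stacked buffer.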
Functions§
- ferrum_to_gguf: Translate a ferrum tensor name to its GGUF equivalent.
- gate_up_split_parts: The two sub-tensor names that fuse into gate_up_proj, stacked along axis 0 (gate first, then up).
- qkv_split_parts: The three sub-tensor names that fuse into qkv_proj, in the order the model expects them stacked along axis 0 (rows = output neurons).
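The stacking order described for the fused tensors can be illustrated with a small sketch. Nothing here is the module's actual API: qkv_parts_sketch, gate_up_parts_sketch and concat_axis0 are hypothetical names, the HuggingFace-style part names are an assumption, and the buffers are plain row-major f32 slices.

```rust
/// Hypothetical: names of the three parts of a fused qkv_proj, in
/// stacking order (q, then k, then v).
pub fn qkv_parts_sketch(layer: usize) -> [String; 3] {
    ["q_proj", "k_proj", "v_proj"]
        .map(|p| format!("model.layers.{layer}.self_attn.{p}.weight"))
}

/// Hypothetical: names of the two parts of a fused gate_up_proj, in
/// stacking order (gate first, then up).
pub fn gate_up_parts_sketch(layer: usize) -> [String; 2] {
    ["gate_proj", "up_proj"].map(|p| format!("model.layers.{layer}.mlp.{p}.weight"))
}

/// Concatenate row-major matrices along axis 0. Each part is
/// `(data, rows)` and all parts share `cols`. Returns the fused buffer
/// and its total row count, or `None` on a shape mismatch.
pub fn concat_axis0(parts: &[(&[f32], usize)], cols: usize) -> Option<(Vec<f32>, usize)> {
    let mut out = Vec::new();
    let mut total_rows = 0;
    for (data, rows) in parts {
        if data.len() != rows * cols {
            return None;
        }
        // Rows are output neurons, so appending whole matrices keeps
        // each part's rows contiguous in the fused weight.
        out.extend_from_slice(data);
        total_rows += rows;
    }
    Some((out, total_rows))
}
```

With this layout, the first block of rows of the fused weight belongs to q (or gate), so a forward pass can split the fused output by row ranges without any reordering.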