
Module names


GGUF ↔ ferrum tensor-name translation.

Ferrum models address weights using HuggingFace-style names (model.layers.0.self_attn.q_proj.weight). GGUF files use llama.cpp’s shorthand (blk.0.attn_q.weight). This module is the single source of truth for that mapping; both GgufLoader and any future tooling go through ferrum_to_gguf.
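The mapping described above can be sketched as a small lookup. This is an illustrative approximation only: the function name `sketch_ferrum_to_gguf`, its `Option<String>` return type, and the exact set of covered names are assumptions, not the actual signature or table of `ferrum_to_gguf`. The GGUF names follow llama.cpp's usual conventions for dense Llama-family models.

```rust
/// Hypothetical sketch (not the real `ferrum_to_gguf`): translate a
/// HuggingFace-style tensor name into llama.cpp's GGUF shorthand.
/// Returns `None` for any name outside the small table below.
pub fn sketch_ferrum_to_gguf(name: &str) -> Option<String> {
    // Tensors that live outside the per-layer blocks map one-to-one.
    match name {
        "model.embed_tokens.weight" => return Some("token_embd.weight".to_string()),
        "model.norm.weight" => return Some("output_norm.weight".to_string()),
        "lm_head.weight" => return Some("output.weight".to_string()),
        _ => {}
    }

    // Per-layer tensors: "model.layers.{i}.{suffix}" becomes "blk.{i}.{short}".
    let rest = name.strip_prefix("model.layers.")?;
    let (idx, suffix) = rest.split_once('.')?;
    let layer: usize = idx.parse().ok()?;

    let short = match suffix {
        "self_attn.q_proj.weight" => "attn_q.weight",
        "self_attn.k_proj.weight" => "attn_k.weight",
        "self_attn.v_proj.weight" => "attn_v.weight",
        "self_attn.o_proj.weight" => "attn_output.weight",
        "input_layernorm.weight" => "attn_norm.weight",
        "post_attention_layernorm.weight" => "ffn_norm.weight",
        "mlp.gate_proj.weight" => "ffn_gate.weight",
        "mlp.up_proj.weight" => "ffn_up.weight",
        "mlp.down_proj.weight" => "ffn_down.weight",
        _ => return None,
    };
    Some(format!("blk.{}.{}", layer, short))
}
```

Returning `None` for unknown names, rather than passing them through unchanged, lets a caller distinguish "no GGUF equivalent" from a successful translation. Whether the real function behaves this way is not stated here.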

Scope: dense Llama-family models (Qwen3, Qwen2.x, Llama-3.x, Mistral, TinyLlama) and Qwen-style MoE families (Qwen3-MoE, Mixtral, DeepSeek-V2). The MoE families all use the same GGUF layout: a per-layer router tensor ffn_gate_inp plus three stacked-expert tensors ffn_{gate,up,down}_exps with shape [num_experts, ...].
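For illustration, the four MoE tensors of one layer can be enumerated as below. The helper name `sketch_moe_gguf_names` is hypothetical, and the `blk.{i}.` prefix and `.weight` suffix are assumed by analogy with `blk.0.attn_q.weight`; only the `ffn_gate_inp` and `ffn_{gate,up,down}_exps` stems come from the description above.

```rust
/// Hypothetical helper: the GGUF names of the MoE tensors in one layer,
/// in the order router, gate experts, up experts, down experts.
pub fn sketch_moe_gguf_names(layer: usize) -> [String; 4] {
    [
        // Router that scores experts for each token.
        format!("blk.{}.ffn_gate_inp.weight", layer),
        // Stacked expert weights, one tensor per projection,
        // each with a leading `num_experts` dimension.
        format!("blk.{}.ffn_gate_exps.weight", layer),
        format!("blk.{}.ffn_up_exps.weight", layer),
        format!("blk.{}.ffn_down_exps.weight", layer),
    ]
}
```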

ferrum-side naming convention for MoE tensors

ferrum mirrors GGUF’s stacked layout rather than HuggingFace’s experts.{e}.gate_proj per-expert layout. Reasons:

  1. The stacked form is what candle’s QMatMul::indexed_moe_forward expects — slicing per-expert is a runtime concern, not a storage concern.
  2. Loading per-expert tensors from GGUF would require N reads plus a concat per layer. The dense path’s qkv-fusion shim works in the other direction and handles only 3 tensors, not N=128.
  3. If a future safetensors-MoE loader needs to consume per-expert tensors, it can do its own concat just like the dense Qwen2.5 path concatenates q/k/v.
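The concat mentioned in the third point can be sketched on plain row-major buffers. This is a minimal sketch under stated assumptions: `stack_experts` is a hypothetical helper, it works on `Vec<f32>` rather than candle tensors or quantized blocks, and it assumes every expert matrix has the same `[rows, cols]` shape. For row-major data, stacking along a new leading axis is just appending the flat buffers in expert order.

```rust
/// Hypothetical helper: stack per-expert `[rows, cols]` matrices (row-major,
/// flat) into one `[num_experts, rows, cols]` buffer.
/// Returns the stacked data and its shape, or an error on a size mismatch.
pub fn stack_experts(
    experts: &[Vec<f32>],
    rows: usize,
    cols: usize,
) -> Result<(Vec<f32>, [usize; 3]), String> {
    let per_expert = rows * cols;
    let mut stacked = Vec::with_capacity(experts.len() * per_expert);

    for (e, weights) in experts.iter().enumerate() {
        // Every expert must contribute exactly rows * cols elements.
        if weights.len() != per_expert {
            return Err(format!(
                "expert {} has {} elements, expected {}",
                e,
                weights.len(),
                per_expert
            ));
        }
        // Row-major layout: appending expert e places it at index [e, .., ..].
        stacked.extend_from_slice(weights);
    }
    Ok((stacked, [experts.len(), rows, cols]))
}
```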

Functions

ferrum_to_gguf
Translate a ferrum tensor name to its GGUF equivalent.
gate_up_split_parts
The two sub-tensor names that fuse into gate_up_proj, stacked along axis 0 (gate first, then up).
qkv_split_parts
The three sub-tensor names that fuse into qkv_proj, in the order the model expects them stacked along axis 0 (rows = output neurons).
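As a rough illustration of what the two split helpers describe, the sketch below derives sub-tensor names from a fused name. The function names, signatures, and return types here are hypothetical and may differ from the real `qkv_split_parts` and `gate_up_split_parts`. The q, k, v ordering is an assumption; the gate-then-up ordering follows the description above.

```rust
/// Hypothetical sketch: names of the three tensors that fuse into `qkv_proj`,
/// assumed to be stacked along axis 0 in q, k, v order.
pub fn sketch_qkv_split_parts(fused: &str) -> Option<[String; 3]> {
    // Split around the fused component so the prefix and suffix are preserved.
    let (prefix, suffix) = fused.split_once("qkv_proj")?;
    Some(["q_proj", "k_proj", "v_proj"].map(|part| format!("{}{}{}", prefix, part, suffix)))
}

/// Hypothetical sketch: names of the two tensors that fuse into `gate_up_proj`,
/// stacked along axis 0 with gate first, then up.
pub fn sketch_gate_up_split_parts(fused: &str) -> Option<[String; 2]> {
    let (prefix, suffix) = fused.split_once("gate_up_proj")?;
    Some(["gate_proj", "up_proj"].map(|part| format!("{}{}{}", prefix, part, suffix)))
}
```

A loader could read each returned name in order and concatenate the results along axis 0 to build the fused tensor.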