Expert dispatch — load per-layer expert weights from a GGUF file and run the per-token MoE forward (top-K experts per token, weighted combine).
Phase 2 ships a CPU-only implementation (moe_forward_cpu). The algorithm is:

    for each token b in batch:
        route token b → (expert_ids[K], weights[K])
        out[b] = 0
        for each (expert_id, weight) pair:
            gate_up      = experts.gate_up[expert_id].forward(x[b])   # [2*ffn]
            silu_mul     = silu(gate_up[..ffn]) * gate_up[ffn..]      # [ffn]
            contribution = experts.down[expert_id].forward(silu_mul)  # [hidden]
            out[b] += weight * contribution

The fused gate || up per-expert layout means we can call Backend::fused_silu_mul_split directly on the projection’s output, the same kernel ferrum already uses for dense Llama-family models.
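The loop above can be sketched in plain Rust to make the shapes concrete. This is a minimal illustration, not the crate's implementation: `Dense`, `Experts`, and `moe_forward_sketch` are hypothetical stand-ins for the real `Linear` and `ExpertStack` types, weights are unquantised row-major `f32`, and routing is assumed to have already produced the `(expert_id, weight)` pairs for each token.

```rust
/// Hypothetical dense layer: row-major weights, `rows` outputs by `cols` inputs.
struct Dense {
    rows: usize,
    cols: usize,
    w: Vec<f32>,
}

impl Dense {
    /// Plain matrix-vector product: y[r] = sum_c w[r, c] * x[c].
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        (0..self.rows)
            .map(|r| {
                let row = &self.w[r * self.cols..(r + 1) * self.cols];
                row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>()
            })
            .collect()
    }
}

/// Hypothetical per-layer expert weights, one entry per expert.
struct Experts {
    gate_up: Vec<Dense>, // each maps [hidden] -> [2 * ffn], gate rows first
    down: Vec<Dense>,    // each maps [ffn] -> [hidden]
}

fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// `x[b]` is the hidden state of token b; `routes[b]` holds its
/// (expert_id, weight) pairs as produced by the router.
fn moe_forward_sketch(
    experts: &Experts,
    x: &[Vec<f32>],
    routes: &[Vec<(usize, f32)>],
    ffn: usize,
) -> Vec<Vec<f32>> {
    x.iter()
        .zip(routes)
        .map(|(xb, route)| {
            let mut out = vec![0.0f32; xb.len()];
            for &(expert_id, weight) in route {
                // Fused projection: first `ffn` values are gate, the rest are up.
                let gate_up = experts.gate_up[expert_id].forward(xb);
                let (gate, up) = gate_up.split_at(ffn);
                let silu_mul: Vec<f32> =
                    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect();
                let contribution = experts.down[expert_id].forward(&silu_mul);
                // Weighted combine into the token's output.
                for (o, c) in out.iter_mut().zip(&contribution) {
                    *o += weight * c;
                }
            }
            out
        })
        .collect()
}
```

Because gate and up sit in one contiguous `[2*ffn]` buffer, the split is a single `split_at(ffn)`. That contiguity is what lets a fused SiLU-multiply kernel consume the projection output directly.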
Structs§
- ExpertStack - Per-layer expert weights, materialised as [num_experts]-long vectors of Box<dyn Linear<B>>. Each entry runs the corresponding expert’s fused [gate; up] projection or its down projection.
Statics§
- MOE_COPY_CALLS
- MOE_COPY_US
- MOE_GEMV_DOWN_CALLS
- MOE_GEMV_DOWN_US
- MOE_GEMV_GATE_UP_CALLS
- MOE_GEMV_GATE_UP_US
- MOE_HOST_TOPK_CALLS
- MOE_HOST_TOPK_US
- MOE_SCALED_ADD_CALLS
- MOE_SCALED_ADD_US
- MOE_SILU_CALLS
- MOE_SILU_US
- MOE_SYNC_CALLS
- MOE_SYNC_US - MoE per-op timers. Public so the model wrapper can drain + print at end of decode. Times are in microseconds, atomically accumulated. Toggle via env FERRUM_MOE_PROFILE=1.
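One way such paired `*_CALLS` / `*_US` counters could be accumulated and drained is sketched below. The source does not show the statics' types or the helper functions, so everything here is an assumption for illustration: the `AtomicU64` type, the `SKETCH_*` names, and the helpers `profiling_enabled`, `timed`, and `drain`. Treating only the exact value `1` as "enabled" is also a guess.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Hypothetical counter pair mirroring the `*_CALLS` / `*_US` naming.
static SKETCH_SILU_CALLS: AtomicU64 = AtomicU64::new(0);
static SKETCH_SILU_US: AtomicU64 = AtomicU64::new(0);

/// Assumed toggle: profiling is on only when FERRUM_MOE_PROFILE is exactly "1".
fn profiling_enabled() -> bool {
    std::env::var("FERRUM_MOE_PROFILE")
        .map(|v| v == "1")
        .unwrap_or(false)
}

/// Run `op`, and if profiling is enabled add its wall time (in microseconds)
/// and one call to the given counters.
fn timed<T>(
    enabled: bool,
    calls: &AtomicU64,
    micros: &AtomicU64,
    op: impl FnOnce() -> T,
) -> T {
    if !enabled {
        return op();
    }
    let start = Instant::now();
    let out = op();
    micros.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    calls.fetch_add(1, Ordering::Relaxed);
    out
}

/// Read and reset a counter pair, returning (calls, total microseconds).
fn drain(calls: &AtomicU64, micros: &AtomicU64) -> (u64, u64) {
    (
        calls.swap(0, Ordering::Relaxed),
        micros.swap(0, Ordering::Relaxed),
    )
}
```

A caller would evaluate `profiling_enabled()` once, wrap each op as `timed(enabled, &SKETCH_SILU_CALLS, &SKETCH_SILU_US, || ...)`, and call `drain` at the end of decode to print per-op totals. Relaxed ordering is enough here because the counters are independent statistics and do not synchronise other memory.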
Functions§
- moe_forward - Backend-generic MoE forward.
- moe_forward_cpu - Run MoE forward on CPU.