Module dispatch


Expert dispatch: loads per-layer expert weights from a GGUF file and runs the per-token MoE forward pass (top-K experts per token, combined by routing weight).

Phase 2 ships a CPU-only implementation (moe_forward_cpu). The algorithm is:

for each token b in batch:
    route token b → (expert_ids[K], weights[K])
    out[b] = 0
    for each (expert_id, weight) pair:
        gate_up = experts.gate_up[expert_id].forward(x[b])     # [2*ffn]
        silu_mul = silu(gate_up[..ffn]) * gate_up[ffn..]       # [ffn]
        contribution = experts.down[expert_id].forward(silu_mul) # [hidden]
        out[b] += weight * contribution
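The loop above can be sketched in plain Rust. This is a minimal illustration rather than ferrum's implementation: the `Expert` struct, `matvec`, and `moe_forward_sketch` are hypothetical names, weights are assumed to be dense row-major `f32` matrices, and the router's output is passed in precomputed because the routing step is not specified here.

```rust
/// Hypothetical dense expert (not a ferrum type). Weights are row-major.
pub struct Expert {
    /// Fused projection, shape [2*ffn, hidden]: gate rows first, then up rows.
    pub gate_up: Vec<f32>,
    /// Down projection, shape [hidden, ffn].
    pub down: Vec<f32>,
}

/// y = W x for a row-major W with `rows` rows and `x.len()` columns.
fn matvec(w: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(w.len(), rows * cols, "weight shape mismatch");
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// SiLU activation: v * sigmoid(v).
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Per-token MoE forward. `routes[b]` holds the (expert_id, weight) pairs
/// that the router chose for token b (assumed to be computed elsewhere).
pub fn moe_forward_sketch(
    x: &[Vec<f32>],
    routes: &[Vec<(usize, f32)>],
    experts: &[Expert],
    ffn: usize,
) -> Vec<Vec<f32>> {
    assert_eq!(x.len(), routes.len(), "one route list per token");
    x.iter()
        .zip(routes)
        .map(|(xb, route)| {
            let hidden = xb.len();
            let mut out = vec![0.0f32; hidden];
            for &(expert_id, weight) in route {
                let e = &experts[expert_id];
                // Fused projection: [2*ffn] = gate || up.
                let gate_up = matvec(&e.gate_up, 2 * ffn, xb);
                let (gate, up) = gate_up.split_at(ffn);
                // silu(gate) * up, element-wise: [ffn].
                let silu_mul: Vec<f32> =
                    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect();
                // Down projection back to [hidden], then weighted accumulate.
                let contribution = matvec(&e.down, hidden, &silu_mul);
                for (o, c) in out.iter_mut().zip(&contribution) {
                    *o += weight * c;
                }
            }
            out
        })
        .collect()
}
```

The sketch allocates a fresh vector per projection for readability; a real kernel would more likely reuse scratch buffers across experts and tokens.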

Because each expert stores its gate and up projections fused (gate || up), Backend::fused_silu_mul_split can be called directly on the projection’s output. This is the same kernel ferrum already uses for dense Llama-family models.
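To illustrate what a split SiLU-multiply over a fused gate || up buffer computes, here is a small CPU sketch. The function name `silu_mul_split` and its signature are assumptions made for this example; the actual signature of Backend::fused_silu_mul_split is not shown on this page and may differ.

```rust
/// Hypothetical CPU reference for a split SiLU-multiply (not ferrum's API).
///
/// `gate_up` holds `rows` contiguous rows of length 2*ffn, each laid out as
/// gate[0..ffn] followed by up[0..ffn]. `out` receives `rows` rows of
/// length ffn, where out[i] = silu(gate[i]) * up[i].
pub fn silu_mul_split(gate_up: &[f32], ffn: usize, out: &mut [f32]) {
    assert!(ffn > 0, "ffn must be non-zero");
    assert_eq!(gate_up.len() % (2 * ffn), 0, "input is not a whole number of rows");
    let rows = gate_up.len() / (2 * ffn);
    assert_eq!(out.len(), rows * ffn, "output shape mismatch");

    for (row, out_row) in gate_up
        .chunks_exact(2 * ffn)
        .zip(out.chunks_exact_mut(ffn))
    {
        // Split the fused row into its gate half and its up half.
        let (gate, up) = row.split_at(ffn);
        for ((o, &g), &u) in out_row.iter_mut().zip(gate).zip(up) {
            // silu(g) = g * sigmoid(g) = g / (1 + e^-g)
            *o = g / (1.0 + (-g).exp()) * u;
        }
    }
}
```

Since the gate and up halves come out of a single projection, the activation can read both from one buffer without an intermediate copy, which is the appeal of the fused layout.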

Structs

ExpertStack: Per-layer expert weights, materialised as [num_experts]-long vectors of Box<dyn Linear<B>>. Each entry runs the corresponding expert’s fused [gate; up] projection or its down projection.
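As a rough picture of that layout, the sketch below stores two parallel vectors of boxed projections, indexed by expert id. It is an assumption-laden illustration: the `Linear` trait here is a simplified stand-in without the backend parameter `B`, and `DenseLinear` and `ExpertStackSketch` are hypothetical names, not ferrum's definitions.

```rust
/// Simplified stand-in for a linear layer trait (hypothetical; the real
/// `Linear<B>` is generic over a backend and its methods are not shown here).
pub trait Linear {
    fn forward(&self, x: &[f32]) -> Vec<f32>;
}

/// Dense row-major weight matrix of shape [rows, cols].
pub struct DenseLinear {
    pub weight: Vec<f32>,
    pub rows: usize,
    pub cols: usize,
}

impl Linear for DenseLinear {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols, "input width mismatch");
        assert_eq!(self.weight.len(), self.rows * self.cols, "weight shape mismatch");
        self.weight
            .chunks_exact(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect()
    }
}

/// Per-layer expert weights: entry `i` of each vector belongs to expert `i`.
pub struct ExpertStackSketch {
    /// Fused [gate; up] projection per expert, hidden -> 2*ffn.
    pub gate_up: Vec<Box<dyn Linear>>,
    /// Down projection per expert, ffn -> hidden.
    pub down: Vec<Box<dyn Linear>>,
}

impl ExpertStackSketch {
    pub fn new(gate_up: Vec<Box<dyn Linear>>, down: Vec<Box<dyn Linear>>) -> Self {
        assert_eq!(gate_up.len(), down.len(), "one gate_up and one down per expert");
        Self { gate_up, down }
    }

    pub fn num_experts(&self) -> usize {
        self.gate_up.len()
    }
}
```

Boxing each projection behind a trait object lets individual experts use different storage formats (for example, different quantisation schemes) while the dispatch loop stays unchanged.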

Statics

MOE_COPY_CALLS
MOE_COPY_US
MOE_GEMV_DOWN_CALLS
MOE_GEMV_DOWN_US
MOE_GEMV_GATE_UP_CALLS
MOE_GEMV_GATE_UP_US
MOE_HOST_TOPK_CALLS
MOE_HOST_TOPK_US
MOE_SCALED_ADD_CALLS
MOE_SCALED_ADD_US
MOE_SILU_CALLS
MOE_SILU_US
MOE_SYNC_CALLS
MOE_SYNC_US: MoE per-op timers. They are public so the model wrapper can drain and print them at the end of decode. Times are in microseconds and are accumulated atomically. Enable profiling by setting the environment variable FERRUM_MOE_PROFILE=1.
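One way such call and microsecond counters could be wired up is sketched below. The `DEMO_*` statics and the `profiling_enabled`, `timed`, and `drain` helpers are hypothetical names for illustration; only the FERRUM_MOE_PROFILE variable comes from the description above, and how ferrum parses it is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;
use std::time::Instant;

// Hypothetical counter pair mirroring the *_CALLS / *_US naming pattern.
pub static DEMO_SILU_CALLS: AtomicU64 = AtomicU64::new(0);
pub static DEMO_SILU_US: AtomicU64 = AtomicU64::new(0);

/// Reads FERRUM_MOE_PROFILE once and caches the answer.
/// Assumption: profiling is on only when the value is exactly "1".
pub fn profiling_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        std::env::var("FERRUM_MOE_PROFILE")
            .map(|v| v == "1")
            .unwrap_or(false)
    })
}

/// Runs `f`; when `enabled`, adds one call and the elapsed microseconds
/// to the given counters.
pub fn timed<T>(
    enabled: bool,
    calls: &AtomicU64,
    micros: &AtomicU64,
    f: impl FnOnce() -> T,
) -> T {
    if !enabled {
        return f();
    }
    let start = Instant::now();
    let result = f();
    micros.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    calls.fetch_add(1, Ordering::Relaxed);
    result
}

/// Returns the accumulated value and resets the counter to zero.
pub fn drain(counter: &AtomicU64) -> u64 {
    counter.swap(0, Ordering::Relaxed)
}
```

A call site would then look like `timed(profiling_enabled(), &DEMO_SILU_CALLS, &DEMO_SILU_US, || work())`, with the wrapper calling `drain` on each counter after decode. Relaxed ordering is enough here because the counters are independent statistics that do not guard other data.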

Functions

moe_forward: Backend-generic MoE forward.
moe_forward_cpu: Run MoE forward on CPU.
