Expert dispatch — load per-layer expert weights from a GGUF file and run the per-token MoE forward (top-K experts per token, weighted combine).
Phase 2 ships a CPU-only implementation (moe_forward_cpu). The
algorithm is:
    for each token b in batch:
        route token b → (expert_ids[K], weights[K])
        out[b] = 0
        for each (expert_id, weight) pair:
            gate_up      = experts.gate_up[expert_id].forward(x[b])   # [2*ffn]
            silu_mul     = silu(gate_up[..ffn]) * gate_up[ffn..]      # [ffn]
            contribution = experts.down[expert_id].forward(silu_mul)  # [hidden]
            out[b] += weight * contribution

The fused gate || up per-expert layout means we can call Backend::fused_silu_mul_split directly on the projection's output, the same kernel ferrum already uses for dense Llama-family models.
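The loop above can be sketched in plain Rust. This is a minimal illustration, not the crate's moe_forward_cpu: the Dense and Expert types, the moe_forward_sketch name, and the pre-computed routes argument are all assumptions made for the example, and weights are stored as unquantised row-major f32.

```rust
/// Hypothetical dense row-major weight matrix: `rows` outputs, `cols` inputs.
struct Dense {
    rows: usize,
    cols: usize,
    w: Vec<f32>,
}

impl Dense {
    /// Matrix-vector product y = W * x.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        (0..self.rows)
            .map(|r| {
                let row = &self.w[r * self.cols..(r + 1) * self.cols];
                row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>()
            })
            .collect()
    }
}

fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Hypothetical expert: fused [gate; up] projection ([2*ffn] x [hidden])
/// plus the down projection ([hidden] x [ffn]).
struct Expert {
    gate_up: Dense,
    down: Dense,
}

/// `routes[b]` holds the (expert_id, weight) pairs the router chose for token b.
fn moe_forward_sketch(
    x: &[Vec<f32>],
    routes: &[Vec<(usize, f32)>],
    experts: &[Expert],
    ffn: usize,
) -> Vec<Vec<f32>> {
    x.iter()
        .zip(routes)
        .map(|(xb, route)| {
            let mut out = vec![0.0f32; xb.len()];
            for &(expert_id, weight) in route {
                let e = &experts[expert_id];
                let gate_up = e.gate_up.forward(xb); // [2*ffn]
                // First half is the gate, second half is the up projection.
                let (gate, up) = gate_up.split_at(ffn);
                let silu_mul: Vec<f32> =
                    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect(); // [ffn]
                let contribution = e.down.forward(&silu_mul); // [hidden]
                for (o, c) in out.iter_mut().zip(&contribution) {
                    *o += weight * c;
                }
            }
            out
        })
        .collect()
}
```

The split_at(ffn) call is what the fused layout buys: gate and up come out of a single projection and are separated by offset rather than by a second matrix multiply.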
Structs
- DeviceRouteScratch - Bundle of pre-allocated device buffers for the graph-capturable device-routing path in moe_forward_bucketed. Pass Some to take the device path (under FERRUM_MOE_DEVICE_ROUTE=1); pass None for the legacy host-mediated path (used by tests + the non-vLLM CUDA bucketed path).
- ExpertStack - Per-layer expert weights, materialised as [num_experts]-long vectors of Box<dyn Linear<B>>. Each entry runs the corresponding expert's fused [gate; up] projection or its down projection.
- MoeBucketPlan - Bucket plan: per-expert lists of which (token, k_slot) pairs route through that expert. Built host-side from the router output and used by moe_forward_bucketed to issue ONE m=tokens_per_expert Marlin GEMM per active expert instead of batch * top_k m=1 GEMMs.
- MoeForwardBucketedParams - Bucketed MoE forward: gather → per-expert m=N Marlin GEMM → silu_mul → per-expert m=N Marlin GEMM → moe_combine.
- MoeForwardParams - Backend-generic MoE forward.
- MoeRouteScratch - Reusable host-side scratch for moe_forward_bucketed. Holds the router output, softmax scratch buffer, and bucket plan, all reused across layers so the inner MoE forward path is allocation-free.
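To illustrate the bucket plan described for MoeBucketPlan, here is a minimal host-side sketch. BucketPlanSketch, its buckets field, and both methods are hypothetical names for this example; the real struct's fields are not shown on this page and may differ (for instance, it may reuse buffers instead of allocating).

```rust
/// Hypothetical plan: `buckets[e]` lists the (token, k_slot) pairs routed to expert e.
struct BucketPlanSketch {
    buckets: Vec<Vec<(usize, usize)>>,
}

impl BucketPlanSketch {
    /// `expert_ids` is the router output, shaped [batch][top_k].
    fn build(expert_ids: &[Vec<usize>], num_experts: usize) -> Self {
        let mut buckets = vec![Vec::new(); num_experts];
        for (token, ids) in expert_ids.iter().enumerate() {
            for (k_slot, &expert) in ids.iter().enumerate() {
                buckets[expert].push((token, k_slot));
            }
        }
        Self { buckets }
    }

    /// Number of experts that received at least one token. The bucketed path
    /// would issue one GEMM per such expert, with m = bucket length.
    fn active_experts(&self) -> usize {
        self.buckets.iter().filter(|b| !b.is_empty()).count()
    }
}
```

With 3 tokens and top_k = 2 routed as [[0, 2], [2, 1], [0, 2]] over 4 experts, the plan has 3 active experts, so the per-pair path's batch * top_k = 6 GEMMs with m=1 become 3 GEMMs with m = 2, 1, and 3.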
Constants
- MOE_BLOCK_SIZE_MAX - Largest moe_block_size we'd ever pick. Drives Qwen3MoeScratch route_sorted_tokens_dev sizing (allocates t*top_k + n_exp*MAX).
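One plausible reading of the t*top_k + n_exp*MAX sizing is that each expert's run of sorted tokens is padded up to a multiple of the block size, which adds fewer than one block per expert. The sketch below checks that arithmetic. It is an assumption about why the bound is shaped this way, not a description of the actual kernel; BLOCK_SIZE_MAX_SKETCH is a placeholder value, and padded_len and upper_bound are hypothetical helpers.

```rust
/// Placeholder only: the real MOE_BLOCK_SIZE_MAX value is not shown on this page.
const BLOCK_SIZE_MAX_SKETCH: usize = 64;

/// Total length after rounding each expert's token count up to a multiple of `block`.
fn padded_len(counts: &[usize], block: usize) -> usize {
    counts
        .iter()
        .map(|&c| (c + block - 1) / block * block)
        .sum()
}

/// The allocation size quoted above: t*top_k + n_exp*MAX.
fn upper_bound(tokens: usize, top_k: usize, n_exp: usize) -> usize {
    tokens * top_k + n_exp * BLOCK_SIZE_MAX_SKETCH
}
```

Since the per-expert counts sum to t*top_k and padding adds at most block - 1 slots per expert, padded_len never exceeds upper_bound for any block size up to the maximum.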
Statics
- MOE_BUCKET_COMBINE_US
- MOE_BUCKET_D2H_US
- MOE_BUCKET_GATHER_US
- MOE_BUCKET_GEMM1_US
- MOE_BUCKET_GEMM3_US
- MOE_BUCKET_LAYER_CALLS
- MOE_BUCKET_PLAN_US
- MOE_BUCKET_ROUTE_US
- MOE_BUCKET_SILU_US
- MOE_BUCKET_SYNC_US
- MOE_COPY_CALLS
- MOE_COPY_US
- MOE_GEMV_DOWN_CALLS
- MOE_GEMV_DOWN_US
- MOE_GEMV_GATE_UP_CALLS
- MOE_GEMV_GATE_UP_US
- MOE_HOST_TOPK_CALLS
- MOE_HOST_TOPK_US
- MOE_SCALED_ADD_CALLS
- MOE_SCALED_ADD_US
- MOE_SILU_CALLS
- MOE_SILU_US
- MOE_SYNC_CALLS
- MOE_SYNC_US - MoE per-op timers. Public so the model wrapper can drain + print at end of decode. Times are in microseconds, atomically accumulated. Toggle via env FERRUM_MOE_PROFILE=1.
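The accumulate-then-drain pattern behind these timers can be sketched as follows. This assumes the statics are AtomicU64 counters, which the page does not confirm; SKETCH_GEMM_US, SKETCH_GEMM_CALLS, profiling_enabled, timed, and drain are hypothetical names for this example. Only the FERRUM_MOE_PROFILE variable comes from the docs above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Hypothetical stand-ins for one *_US / *_CALLS pair.
static SKETCH_GEMM_US: AtomicU64 = AtomicU64::new(0);
static SKETCH_GEMM_CALLS: AtomicU64 = AtomicU64::new(0);

/// True when FERRUM_MOE_PROFILE=1 is set in the environment.
fn profiling_enabled() -> bool {
    std::env::var("FERRUM_MOE_PROFILE")
        .map(|v| v == "1")
        .unwrap_or(false)
}

/// Run `f`; when `enabled`, add its wall time in microseconds to `us`
/// and bump `calls` by one.
fn timed<T>(us: &AtomicU64, calls: &AtomicU64, enabled: bool, f: impl FnOnce() -> T) -> T {
    if !enabled {
        return f();
    }
    let start = Instant::now();
    let out = f();
    us.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    calls.fetch_add(1, Ordering::Relaxed);
    out
}

/// Read a counter and reset it to zero, as a wrapper might do at end of decode.
fn drain(counter: &AtomicU64) -> u64 {
    counter.swap(0, Ordering::Relaxed)
}
```

A caller would typically read profiling_enabled() once and pass the result into timed, so the hot path avoids an environment lookup per op.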
Functions
- moe_forward
- moe_forward_bucketed
- moe_forward_cpu - Run MoE forward on CPU.