Module dispatch

Expert dispatch: loads per-layer expert weights from a GGUF file and runs the per-token MoE forward pass (top-K experts per token, weighted combine).

Phase 2 ships a CPU-only implementation (moe_forward_cpu). The algorithm is:

for each token b in batch:
    route token b → (expert_ids[K], weights[K])
    out[b] = 0
    for each (expert_id, weight) pair:
        gate_up = experts.gate_up[expert_id].forward(x[b])     # [2*ffn]
        silu_mul = silu(gate_up[..ffn]) * gate_up[ffn..]       # [ffn]
        contribution = experts.down[expert_id].forward(silu_mul) # [hidden]
        out[b] += weight * contribution

Because each expert stores its gate and up projections fused as gate || up, we can call Backend::fused_silu_mul_split directly on the projection’s output. This is the same kernel ferrum already uses for dense Llama-family models.
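The loop above can be sketched on plain `Vec<f32>` data. This is only an illustration under stated assumptions: `Dense`, `silu_mul_split`, and `moe_forward_toy` are hypothetical names invented here, `Dense` stands in for the crate's `Box<dyn Linear<B>>`, and the routing result is passed in precomputed rather than produced by a router. It does not reproduce the signature of `moe_forward_cpu` or `Backend::fused_silu_mul_split`.

```rust
/// Hypothetical stand-in for a linear layer: row-major `rows x cols` weights.
#[derive(Clone)]
struct Dense {
    rows: usize,
    cols: usize,
    w: Vec<f32>,
}

impl Dense {
    /// Computes `W * x`, returning a vector of length `rows`.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        (0..self.rows)
            .map(|r| {
                self.w[r * self.cols..(r + 1) * self.cols]
                    .iter()
                    .zip(x)
                    .map(|(wi, xi)| wi * xi)
                    .sum::<f32>()
            })
            .collect()
    }
}

fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

/// Splits a fused `[gate; up]` vector of length `2*ffn` and returns
/// `silu(gate) * up`, which has length `ffn`.
fn silu_mul_split(gate_up: &[f32]) -> Vec<f32> {
    assert!(gate_up.len() % 2 == 0);
    let (gate, up) = gate_up.split_at(gate_up.len() / 2);
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

/// Toy per-token MoE forward. `routes[b]` holds the (expert_id, weight)
/// pairs already chosen for token `b` (the router itself is not modelled).
fn moe_forward_toy(
    x: &[Vec<f32>],
    routes: &[Vec<(usize, f32)>],
    gate_up: &[Dense],
    down: &[Dense],
) -> Vec<Vec<f32>> {
    x.iter()
        .zip(routes)
        .map(|(xb, route)| {
            let mut out = vec![0.0f32; xb.len()]; // [hidden]
            for &(expert_id, weight) in route {
                let gu = gate_up[expert_id].forward(xb); // [2*ffn]
                let act = silu_mul_split(&gu); // [ffn]
                let contribution = down[expert_id].forward(&act); // [hidden]
                for (o, c) in out.iter_mut().zip(&contribution) {
                    *o += weight * c;
                }
            }
            out
        })
        .collect()
}
```

Calling `moe_forward_toy(&x, &routes, &gate_up, &down)` with one `Dense` per expert in each slice returns one `[hidden]` vector per token. Each `gate_up` entry is expected to have `2*ffn` rows and each `down` entry `ffn` columns.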

Structs§

DeviceRouteScratch: Bundle of pre-allocated device buffers for the graph-capturable device-routing path in moe_forward_bucketed. Pass Some to take the device path (under FERRUM_MOE_DEVICE_ROUTE=1); pass None for the legacy host-mediated path (used by tests and by the non-vLLM CUDA bucketed path).
ExpertStack: Per-layer expert weights, materialised as [num_experts]-long vectors of Box<dyn Linear<B>>. Each entry runs the corresponding expert’s fused [gate; up] projection or its down projection.
MoeBucketPlan: Bucket plan: per-expert lists of which (token, k_slot) pairs route through that expert. Built host-side from the router output and used by moe_forward_bucketed to issue ONE m=tokens_per_expert Marlin GEMM per active expert instead of batch * top_k m=1 GEMMs.
MoeForwardBucketedParams: Bucketed MoE forward: gather → per-expert m=N Marlin GEMM → silu_mul → per-expert m=N Marlin GEMM → moe_combine.
MoeForwardParams: Backend-generic MoE forward.
MoeRouteScratch: Reusable host-side scratch for moe_forward_bucketed. Holds the router output, softmax scratch buffer, and bucket plan, all reused across layers so the inner MoE forward path is allocation-free.
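To make the bucketing idea behind MoeBucketPlan concrete, here is a minimal host-side sketch. It assumes the router output is a flat `expert_ids` slice of length `batch * top_k`, laid out token-major. The function names and the `Vec<Vec<(usize, usize)>>` representation are hypothetical and chosen for readability; the real MoeBucketPlan layout is not shown on this page.

```rust
/// Hypothetical simplified bucket plan: `buckets[e]` lists the
/// (token, k_slot) pairs that route through expert `e`.
fn build_bucket_plan(
    expert_ids: &[u32], // assumed layout: expert_ids[token * top_k + k_slot]
    top_k: usize,
    num_experts: usize,
) -> Vec<Vec<(usize, usize)>> {
    assert!(top_k > 0 && expert_ids.len() % top_k == 0);
    let mut buckets = vec![Vec::new(); num_experts];
    for (i, &e) in expert_ids.iter().enumerate() {
        buckets[e as usize].push((i / top_k, i % top_k));
    }
    buckets
}

/// Number of experts that received at least one (token, k_slot) pair,
/// i.e. how many per-expert GEMMs a bucketed forward would issue.
fn active_experts(buckets: &[Vec<(usize, usize)>]) -> usize {
    buckets.iter().filter(|b| !b.is_empty()).count()
}
```

For three tokens with `top_k = 2` and `expert_ids = [0, 2, 2, 1, 0, 2]`, the plan has three active experts, so a bucketed forward would issue three GEMMs with m equal to each bucket's length (2, 1, and 3) rather than six m=1 GEMMs.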

Constants§

MOE_BLOCK_SIZE_MAX: Largest moe_block_size we’d ever pick. Drives Qwen3MoeScratch route_sorted_tokens_dev sizing (allocates t*top_k + n_exp*MAX).
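The sizing expression t*top_k + n_exp*MAX can be written out as a small helper. This is a sketch only: the function name is hypothetical, and `ASSUMED_BLOCK_SIZE_MAX` is a placeholder, since the actual value of MOE_BLOCK_SIZE_MAX does not appear on this page.

```rust
/// Placeholder value for illustration only; NOT the crate's real constant.
const ASSUMED_BLOCK_SIZE_MAX: usize = 64;

/// Element count given by the documented formula `t*top_k + n_exp*MAX`,
/// where `t` is the token count and `n_exp` the number of experts.
fn route_sorted_tokens_len(t: usize, top_k: usize, n_exp: usize, block_max: usize) -> usize {
    t * top_k + n_exp * block_max
}
```

With 16 tokens, `top_k = 8`, 128 experts, and the placeholder maximum of 64, `route_sorted_tokens_len(16, 8, 128, ASSUMED_BLOCK_SIZE_MAX)` gives 128 + 8192 = 8320 elements, so the per-expert term dominates at small batch sizes.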

Statics§

MOE_BUCKET_COMBINE_US
MOE_BUCKET_D2H_US
MOE_BUCKET_GATHER_US
MOE_BUCKET_GEMM1_US
MOE_BUCKET_GEMM3_US
MOE_BUCKET_LAYER_CALLS
MOE_BUCKET_PLAN_US
MOE_BUCKET_ROUTE_US
MOE_BUCKET_SILU_US
MOE_BUCKET_SYNC_US
MOE_COPY_CALLS
MOE_COPY_US
MOE_GEMV_DOWN_CALLS
MOE_GEMV_DOWN_US
MOE_GEMV_GATE_UP_CALLS
MOE_GEMV_GATE_UP_US
MOE_HOST_TOPK_CALLS
MOE_HOST_TOPK_US
MOE_SCALED_ADD_CALLS
MOE_SCALED_ADD_US
MOE_SILU_CALLS
MOE_SILU_US
MOE_SYNC_CALLS
MOE_SYNC_US: MoE per-op timers. Public so the model wrapper can drain + print at end of decode. Times are in microseconds, atomically accumulated. Toggle via env FERRUM_MOE_PROFILE=1.
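A pair of `*_US` / `*_CALLS` counters like the ones listed above could be driven roughly as follows. This sketch assumes the statics are `AtomicU64` values (their types are not shown here), and `DEMO_US`, `DEMO_CALLS`, `profiling_enabled`, `timed`, and `drain` are hypothetical names; only the FERRUM_MOE_PROFILE=1 toggle comes from the description.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Hypothetical counters standing in for one MOE_*_US / MOE_*_CALLS pair.
static DEMO_US: AtomicU64 = AtomicU64::new(0);
static DEMO_CALLS: AtomicU64 = AtomicU64::new(0);

/// Returns true when the environment variable is set to exactly "1".
fn profiling_enabled() -> bool {
    std::env::var("FERRUM_MOE_PROFILE")
        .map(|v| v == "1")
        .unwrap_or(false)
}

/// Runs `f`; when `enabled`, adds its wall time in microseconds to `us`
/// and bumps `calls`. When disabled, `f` runs with no timing overhead.
fn timed<T>(enabled: bool, us: &AtomicU64, calls: &AtomicU64, f: impl FnOnce() -> T) -> T {
    if !enabled {
        return f();
    }
    let start = Instant::now();
    let out = f();
    us.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    calls.fetch_add(1, Ordering::Relaxed);
    out
}

/// Reads a counter and resets it to zero, e.g. at the end of decode.
fn drain(counter: &AtomicU64) -> u64 {
    counter.swap(0, Ordering::Relaxed)
}
```

A caller would typically read `profiling_enabled()` once, wrap each op as `timed(enabled, &DEMO_US, &DEMO_CALLS, || op())`, and print `drain(&DEMO_US)` alongside `drain(&DEMO_CALLS)` when decode finishes. Relaxed ordering is enough here because the counters are independent totals, not synchronization points.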

Functions§

moe_forward
moe_forward_bucketed
moe_forward_cpu: Run MoE forward on CPU.
