//! GPU port of [`super::moe_router::moe_router_cpu`] — softmax →
//! selection-sort top-K → divide-by-sum normalize, all on the GPU
//! without a host bounce.
//!
//! Phase A of the graph-mode plan
//! ([`qwen_graph_mode_session6_plan.md`]): the routing host bounce
//! at every layer is what forces the per-layer commit boundary
//! today. Moving the router on-GPU is the *enabler* for Phase B's
//! graph-mode submission; on its own it's near-perf-neutral.
//!
//! This module only emits the encoder. The orchestrator still
//! decides whether to dispatch the GPU router or the CPU oracle
//! (a future flag toggles per call); Phase B will swap callsites.
//!
//! ## Diff target
//!
//! Slot order is the running-min replacement order from the CPU
//! oracle (see `moe_router_cpu`'s `cpu_topk` block), so per-slot
//! comparison is well-defined. Floating-point drift comes only from
//! the `exp` reduction order: the GPU softmax runs in threadgroup
//! memory with a tree-shaped reduction, while the CPU oracle uses a
//! left-to-right scan.
//! The diff battery asserts:
//!
//! - Indices: bit-exact as a sorted set. The magnitude separation
//!   between adjacent expert scores dominates the per-element ULP drift,
//!   so the drift does not change which experts are selected.
//! - Weights: cosine ≥ 0.9999. Values match within the softmax
//!   reduction-order tolerance.
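/*!
The pipeline and the two diff checks described above can be sketched on the
CPU as follows. This is a minimal illustration under stated assumptions, not
the oracle in `moe_router_cpu`: the helper names `softmax`,
`top_k_normalized`, `same_index_set` and `cosine` are hypothetical, and the
top-K here is a plain repeated-max scan that emits slots in descending order,
which may differ from the oracle's running-min replacement order.

```rust
/// Hypothetical helper: numerically stable softmax over one token's logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical helper: keep the k largest probabilities (repeated max scan),
/// then divide the kept weights by their sum. Slot order is descending here,
/// an assumption that is not taken from the CPU oracle.
fn top_k_normalized(probs: &[f32], k: usize) -> (Vec<i32>, Vec<f32>) {
    let k = k.min(probs.len());
    let mut taken = vec![false; probs.len()];
    let mut indices = Vec::with_capacity(k);
    let mut weights = Vec::with_capacity(k);
    for _ in 0..k {
        let mut best: Option<usize> = None;
        for (i, &p) in probs.iter().enumerate() {
            if !taken[i] && best.map_or(true, |b| p > probs[b]) {
                best = Some(i);
            }
        }
        let b = best.expect("k is clamped to probs.len()");
        taken[b] = true;
        indices.push(b as i32);
        weights.push(probs[b]);
    }
    let sum: f32 = weights.iter().sum();
    for w in &mut weights {
        *w /= sum;
    }
    (indices, weights)
}

/// Index check: equal once both sides are sorted, so slot order is ignored.
fn same_index_set(a: &[i32], b: &[i32]) -> bool {
    let (mut a, mut b) = (a.to_vec(), b.to_vec());
    a.sort_unstable();
    b.sort_unstable();
    a == b
}

/// Weight check: cosine similarity between two weight vectors that are
/// already aligned slot by slot.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}
```

A diff test in this style would run both routers on the same logits, then
require `same_index_set(..)` to hold and `cosine(..) >= 0.9999` per token.
*/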
use ;
use crate::pipeline_bundle;
pipeline_bundle!
/// Encode the full GPU router pipeline (softmax + top-K + normalize) into
/// `cmdbuf`. This function does not call `commit_and_wait`; the caller
/// controls the cmdbuf boundary.
///
/// `logits` is the gate-matvec output stacked across `n_tokens` rows of
/// `n_experts` f32 each (row-major). `indices_out` and `weights_out` are
/// caller-owned `n_tokens * k` element output buffers (i32 and f32
/// respectively).
///
/// The softmax dispatch uses `tg_size = 64` threads per token. A single
/// SIMD group covers 32 lanes for the parallel max/sum reductions; 64
/// threads keep per-token work well above the launch overhead without
/// leaving extra lanes idle on a 128-expert vector. The selection-sort tail
/// runs on lane 0 only, so its wall-clock cost is the same regardless of
/// `tg_size`.
///
/// `k` is bounded by the kernel-side `MAX_K = 16`; current models use 8.
/// `n_experts` is bounded by `MAX_EXPERTS = 512`; current models use ≤ 256.
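/**
The shape and bound requirements above could be checked before encoding with
a small helper along these lines. This is a sketch only: `check_router_args`
is a hypothetical name and not part of this module, the lengths are element
counts rather than byte sizes, and the `k <= n_experts` rule is an added
assumption. The two limits are the values quoted above.

```rust
// Limits quoted in the documentation above.
const MAX_K: usize = 16;
const MAX_EXPERTS: usize = 512;

/// Hypothetical pre-dispatch check; lengths are in elements, not bytes.
fn check_router_args(
    n_tokens: usize,
    n_experts: usize,
    k: usize,
    logits_len: usize,
    indices_len: usize,
    weights_len: usize,
) -> Result<(), String> {
    if k == 0 || k > MAX_K {
        return Err(format!("k = {} is outside 1..={}", k, MAX_K));
    }
    if n_experts == 0 || n_experts > MAX_EXPERTS {
        return Err(format!("n_experts = {} is outside 1..={}", n_experts, MAX_EXPERTS));
    }
    // Assumption: selecting more experts than exist is treated as an error.
    if k > n_experts {
        return Err(format!("k = {} exceeds n_experts = {}", k, n_experts));
    }
    // Row-major logits: n_tokens rows of n_experts values each.
    if logits_len != n_tokens * n_experts {
        return Err(format!("logits has {} elements, expected {}", logits_len, n_tokens * n_experts));
    }
    // Both outputs hold n_tokens * k elements.
    let out_len = n_tokens * k;
    if indices_len != out_len || weights_len != out_len {
        return Err(format!("output buffers must each hold {} elements", out_len));
    }
    Ok(())
}
```

For example, 4 tokens with 128 experts and `k = 8` need 512 logits and 32
elements in each output buffer.
*/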