fused_silu_mul_split op-diff harness.
Input layout (matches the kernel API):
- gate_up: tokens × (2 * intermediate)
- For each token row: [gate ‖ up] concatenated

Output:
- out: tokens × intermediate, where out[i,j] = silu(gate[i,j]) * up[i,j]
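
The layout above can be illustrated with a scalar reference of the kind an op-diff harness might compare a kernel against. This is only a sketch under stated assumptions, not the harness's actual code: the names `silu_mul_split_ref` and `max_abs_diff` are hypothetical, and the `f32` element type and row-major storage are assumed rather than taken from the kernel API.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Hypothetical scalar reference for the fused op.
/// Assumes `gate_up` is row-major with `tokens` rows of `2 * intermediate`
/// values, each row laid out as [gate ‖ up].
fn silu_mul_split_ref(gate_up: &[f32], tokens: usize, intermediate: usize) -> Vec<f32> {
    assert_eq!(
        gate_up.len(),
        tokens * 2 * intermediate,
        "gate_up must hold tokens × (2 * intermediate) elements"
    );
    // chunks_exact panics on a chunk size of zero, so handle that case first.
    if intermediate == 0 {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(tokens * intermediate);
    for row in gate_up.chunks_exact(2 * intermediate) {
        // First half of the row is gate, second half is up.
        let (gate, up) = row.split_at(intermediate);
        out.extend(gate.iter().zip(up).map(|(&g, &u)| silu(g) * u));
    }
    out
}

/// Hypothetical diff metric: largest absolute elementwise difference.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "outputs must have the same length");
    a.iter().zip(b).map(|(&x, &y)| (x - y).abs()).fold(0.0, f32::max)
}
```

With tokens = 1 and intermediate = 2, the row `[0.0, 1.0, 2.0, 3.0]` splits into gate = `[0.0, 1.0]` and up = `[2.0, 3.0]`, so the output holds `silu(0.0) * 2.0` and `silu(1.0) * 3.0`. A harness could then report `max_abs_diff(&kernel_out, &reference_out)` against a chosen tolerance.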