pub struct MoeForwardBucketedParams<'a, B: QuantLlmBackend + BackendMoeFused> {
pub ctx: &'a mut B::Context,
pub x: &'a B::Buffer,
pub router_logits: &'a B::Buffer,
pub out: &'a mut B::Buffer,
pub batch: usize,
pub hidden_size: usize,
pub expert_intermediate: usize,
pub num_experts: usize,
pub top_k: usize,
pub norm_topk_prob: bool,
pub experts: &'a ExpertStack<B>,
pub x_packed: &'a mut B::Buffer,
pub gate_up_packed: &'a mut B::Buffer,
pub silu_packed: &'a mut B::Buffer,
pub down_packed: &'a mut B::Buffer,
pub route_scratch: &'a mut MoeRouteScratch,
pub device_route: Option<DeviceRouteScratch<'a, B>>,
}
Bucketed MoE forward: gather → per-expert m=N Marlin GEMM → silu_mul → per-expert m=N Marlin GEMM → moe_combine.
Replaces the batch × top_k m=1 dispatch loop in moe_forward with
num_active_experts × 2 dispatches at m=tokens_per_expert. For prefill
(m=512+), this is a 30× reduction in GEMM launches, and each GEMM runs
at a much more efficient m than the m=1 path. For decode (m=1), the
number of dispatches is similar, but the path still benefits from the
gather/combine kernel pattern (one launch each instead of 2 per pair).
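The dispatch-count comparison above can be sketched on the CPU. This is a minimal illustration, not the crate's routing code: `tokens_per_expert`, `bucketed_dispatches`, and `per_pair_iterations` are hypothetical helper names, and the flattened `topk_ids` slice (one expert id per (token, expert) pair) is an assumed input layout.

```rust
/// Hypothetical helper: counts how many (token, expert) pairs land in each
/// expert's bucket. `topk_ids` is assumed to hold batch * top_k expert ids.
/// Panics if an id is >= `num_experts`.
fn tokens_per_expert(topk_ids: &[usize], num_experts: usize) -> Vec<usize> {
    let mut counts = vec![0usize; num_experts];
    for &expert in topk_ids {
        counts[expert] += 1;
    }
    counts
}

/// Bucketed path, per the description: two GEMM dispatches for each expert
/// that received at least one row (num_active_experts × 2).
fn bucketed_dispatches(counts: &[usize]) -> usize {
    let active = counts.iter().filter(|&&c| c > 0).count();
    active * 2
}

/// Per-pair path, per the description: one m=1 loop iteration for each
/// (token, expert) pair (batch × top_k).
fn per_pair_iterations(batch: usize, top_k: usize) -> usize {
    batch * top_k
}
```

For example, with 4 tokens, top_k = 2, and the ids `[0, 3, 3, 5, 0, 5, 3, 0]` over 8 experts, three experts are active, so the bucketed path issues 6 dispatches where the per-pair loop runs 8 iterations. The gap widens as the batch grows, because the bucketed count is bounded by 2 × num_experts regardless of batch size.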
Requires caller-provisioned scratch buffers: x_packed [total_pairs, hidden],
gate_up_packed [total_pairs, 2*expert_inter],
silu_packed [total_pairs, expert_inter], and
down_packed [total_pairs, hidden]. The
caller is responsible for sizing these to batch * top_k rows
(the worst case, with all top_k pairs alive).
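The sizing rule above can be written out as a small calculation. This is a sketch under stated assumptions: `ScratchElems` and `scratch_elems` are hypothetical names, `hidden` and `expert_inter` are taken to mean the `hidden_size` and `expert_intermediate` fields, and the results are element counts rather than bytes, since the element type of `B::Buffer` is not stated here.

```rust
/// Hypothetical container for the worst-case element counts of the four
/// scratch buffers (elements, not bytes).
#[derive(Debug, PartialEq, Eq)]
struct ScratchElems {
    x_packed: usize,
    gate_up_packed: usize,
    silu_packed: usize,
    down_packed: usize,
}

/// Computes worst-case sizes with total_pairs = batch * top_k rows.
/// Returns None if any product overflows usize.
fn scratch_elems(
    batch: usize,
    top_k: usize,
    hidden_size: usize,
    expert_intermediate: usize,
) -> Option<ScratchElems> {
    let rows = batch.checked_mul(top_k)?;
    Some(ScratchElems {
        // [total_pairs, hidden]
        x_packed: rows.checked_mul(hidden_size)?,
        // [total_pairs, 2*expert_inter]
        gate_up_packed: rows.checked_mul(expert_intermediate)?.checked_mul(2)?,
        // [total_pairs, expert_inter]
        silu_packed: rows.checked_mul(expert_intermediate)?,
        // [total_pairs, hidden]
        down_packed: rows.checked_mul(hidden_size)?,
    })
}
```

With batch = 4, top_k = 2, hidden_size = 16, and expert_intermediate = 8, this gives 8 rows and element counts of 128, 128, 64, and 128 respectively. Checked multiplication is used so that an oversized configuration surfaces as `None` instead of silently wrapping.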
Fields
ctx: &'a mut B::Context
x: &'a B::Buffer
router_logits: &'a B::Buffer
out: &'a mut B::Buffer
batch: usize
hidden_size: usize
expert_intermediate: usize
num_experts: usize
top_k: usize
norm_topk_prob: bool
experts: &'a ExpertStack<B>
x_packed: &'a mut B::Buffer
gate_up_packed: &'a mut B::Buffer
silu_packed: &'a mut B::Buffer
down_packed: &'a mut B::Buffer
route_scratch: &'a mut MoeRouteScratch
device_route: Option<DeviceRouteScratch<'a, B>>
Auto Trait Implementations
impl<'a, B> !RefUnwindSafe for MoeForwardBucketedParams<'a, B>
impl<'a, B> !UnwindSafe for MoeForwardBucketedParams<'a, B>
impl<'a, B> Freeze for MoeForwardBucketedParams<'a, B>
impl<'a, B> Send for MoeForwardBucketedParams<'a, B>
impl<'a, B> Sync for MoeForwardBucketedParams<'a, B>
impl<'a, B> Unpin for MoeForwardBucketedParams<'a, B>
impl<'a, B> UnsafeUnpin for MoeForwardBucketedParams<'a, B>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> ErasedDestructor for T
where
    T: 'static,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.