pub struct CpuMarlinExpertStack {
    pub store: Arc<CpuGptqStore>,
    pub num_experts: usize,
    pub n_per_expert: usize,
    pub k: usize,
}
Fields§
store: Arc<CpuGptqStore>
num_experts: usize
n_per_expert: usize
k: usize
Implementations§
Trait Implementations§
impl MarlinExpertStack<CpuBackend> for CpuMarlinExpertStack
fn n_per_expert(&self) -> usize
Per-expert output width (N tile cols).
fn num_experts(&self) -> usize
Number of experts packed into the tile.
fn as_any(&self) -> &dyn Any
Downcast hook — used at FFN dispatch boundaries where the
caller needs to reach into the concrete store to e.g. share
workspace memory across phases. Standard
dyn Any pattern.
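The downcast hook described for as_any can be illustrated with a minimal sketch of the standard `dyn Any` pattern. The trait `ExpertStack` and the types `CpuStack` and `OtherStack` are hypothetical stand-ins, not the crate's real definitions; only `std::any::Any` and `downcast_ref` come from the standard library.

```rust
use std::any::Any;

// Hypothetical stand-in for the real trait: only the downcast hook is modelled.
trait ExpertStack {
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical concrete stack; the real struct also holds an Arc<CpuGptqStore>.
struct CpuStack {
    num_experts: usize,
}

// A second hypothetical implementor, used to show a failed downcast.
struct OtherStack;

impl ExpertStack for CpuStack {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl ExpertStack for OtherStack {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Recover the concrete type behind the trait object, if it is a CpuStack.
fn expert_count_if_cpu(stack: &dyn ExpertStack) -> Option<usize> {
    stack
        .as_any()
        .downcast_ref::<CpuStack>()
        .map(|s| s.num_experts)
}

fn main() {
    let cpu: Box<dyn ExpertStack> = Box::new(CpuStack { num_experts: 8 });
    let other: Box<dyn ExpertStack> = Box::new(OtherStack);
    println!("{:?}", expert_count_if_cpu(cpu.as_ref()));
    println!("{:?}", expert_count_if_cpu(other.as_ref()));
}
```

A failed downcast yields `None` rather than panicking, so a dispatch boundary can fall back to a generic path when the stack is not the expected concrete type.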
fn zero_workspace(
    &self,
    _ctx: &mut <CpuBackend as Backend>::Context,
) -> Result<()>
Bulk-zero the per-expert Marlin workspace mutex slots. Call ONCE
before a batch of bucketed
gemm_phase_batched calls. This replaces
the per-call cuMemsetD32Async (one launch per call) with a single
launch for the whole batch. At c=32 with 128 active experts × 2 phases × 48 layers,
that is ~12k memset launches per token reduced to ~96.
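The launch counts quoted for zero_workspace follow from simple multiplication. The sketch below reproduces that arithmetic, assuming one bulk zero per phase per layer (an inference from the quoted ~96 figure, not something the source states outright). Both function names are hypothetical.

```rust
// Memset launches per token when every GEMM call zeroes its own workspace.
fn per_call_memsets(active_experts: usize, phases: usize, layers: usize) -> usize {
    active_experts * phases * layers
}

// Memset launches per token with one bulk zero per phase per layer (assumed).
fn bulk_memsets(phases: usize, layers: usize) -> usize {
    phases * layers
}

fn main() {
    let (experts, phases, layers) = (128, 2, 48);
    println!("per-call: {}", per_call_memsets(experts, phases, layers));
    println!("bulk:     {}", bulk_memsets(phases, layers));
}
```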
fn gemm_phase_batched(
    &self,
    ctx: &mut <CpuBackend as Backend>::Context,
    input: &<CpuBackend as Backend>::Buffer,
    dispatches: &[(usize, usize, usize, usize)],
    output: &mut <CpuBackend as Backend>::Buffer,
    k: usize,
) -> Result<()>
Batched per-expert offset GEMM.
dispatches[i] = (expert_idx, in_row_offset, out_row_offset, m). For each
dispatch, computes the expert’s (m × K) @ tile[expert] product, an m × n_per_expert slice.
The CUDA backend overlaps the dispatches via multi-stream round-robin.
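The dispatch tuple layout for gemm_phase_batched can be made concrete with a small reference loop. This is a sketch under stated assumptions, not the crate's implementation: it uses dense row-major `f32` slices in place of the GPTQ-packed store, assumes input rows have width `k` and output rows have width `n` (playing the role of n_per_expert), and runs the dispatches sequentially. The function name `gemm_phase_batched_ref` and the `weights` layout are hypothetical.

```rust
// Reference loop over dispatches = (expert_idx, in_row_offset, out_row_offset, m).
// Assumption: dense row-major f32 data; the real stack reads GPTQ-packed weights.
fn gemm_phase_batched_ref(
    input: &[f32],        // token rows, each of width k
    weights: &[Vec<f32>], // weights[e] is a k x n matrix, row-major (hypothetical)
    dispatches: &[(usize, usize, usize, usize)],
    output: &mut [f32],   // result rows, each of width n
    k: usize,
    n: usize,
) {
    for &(expert, in_row, out_row, m) in dispatches {
        let w = &weights[expert];
        for r in 0..m {
            // Row r of this dispatch: read at in_row + r, write at out_row + r.
            let x = &input[(in_row + r) * k..(in_row + r + 1) * k];
            let y = &mut output[(out_row + r) * n..(out_row + r + 1) * n];
            for (j, y_j) in y.iter_mut().enumerate() {
                *y_j = (0..k).map(|i| x[i] * w[i * n + j]).sum();
            }
        }
    }
}

fn main() {
    let (k, n) = (2, 2);
    // Expert 0 is the identity; expert 1 doubles each input element.
    let weights: Vec<Vec<f32>> = vec![vec![1.0, 0.0, 0.0, 1.0], vec![2.0, 0.0, 0.0, 2.0]];
    // Three token rows: rows 0..2 go to expert 0, row 2 goes to expert 1.
    let input: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let mut output = [0.0f32; 6];
    let dispatches = [(0, 0, 0, 2), (1, 2, 2, 1)];
    gemm_phase_batched_ref(&input, &weights, &dispatches, &mut output, k, n);
    println!("{:?}", output);
}
```

Because each dispatch carries its own input and output row offsets, the dispatches touch disjoint output rows when the caller buckets tokens by expert, which is what makes them independent units of work.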
fn make_expert_linear(
    self: Arc<Self>,
    expert_offset: usize,
    expert_n: usize,
    bias_host: Option<&[f32]>,
) -> Result<Box<dyn Linear<CpuBackend> + Send + Sync>>
Build a single-expert
Linear<B> view onto this stack’s
[expert_offset .. expert_offset + expert_n) column slice.
Used for per-expert dispatch outside the MoE phase batching
(e.g. shared-experts code paths). expert_offset and expert_n
MUST be multiples of the backend’s Marlin N tile (64 on CUDA).
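The alignment requirement on expert_offset and expert_n for make_expert_linear could be checked by a caller before requesting a slice. The following is a hedged sketch: the constant uses the CUDA tile width of 64 quoted in the doc (the CPU backend's value is not stated), the helper `check_expert_slice` is hypothetical, and the bound `total_n` assumes the stack spans num_experts × n_per_expert columns.

```rust
// Assumption: 64 is the CUDA tile width quoted in the doc; the CPU value is not stated.
const MARLIN_N_TILE: usize = 64;

// Hypothetical pre-check: both values tile-aligned and the slice inside the stack.
fn check_expert_slice(
    expert_offset: usize,
    expert_n: usize,
    total_n: usize,
) -> Result<(), String> {
    if expert_offset % MARLIN_N_TILE != 0 || expert_n % MARLIN_N_TILE != 0 {
        return Err(format!(
            "offset {} and width {} must be multiples of {}",
            expert_offset, expert_n, MARLIN_N_TILE
        ));
    }
    // checked_add guards against usize overflow before the bounds comparison.
    match expert_offset.checked_add(expert_n) {
        Some(end) if end <= total_n => Ok(()),
        _ => Err(format!(
            "slice starting at {} with width {} exceeds total width {}",
            expert_offset, expert_n, total_n
        )),
    }
}

fn main() {
    // Hypothetical shape: 8 experts of 64 columns each, so 512 columns in total.
    let total_n = 8 * 64;
    println!("{:?}", check_expert_slice(128, 64, total_n));
    println!("{:?}", check_expert_slice(100, 64, total_n));
}
```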
fn gemm_phase_vllm(
    &self,
    _ctx: &mut B::Context,
    _input: &B::Buffer,
    _sorted_token_ids: &B::Buffer,
    _expert_ids: &B::Buffer,
    _num_tokens_past_padded: &B::Buffer,
    _output: &mut B::Buffer,
    _prob_m: usize,
    _moe_block_size: usize,
    _top_k: usize,
) -> Result<()>
vLLM
marlin_moe_wna16 fused GEMM (single launch, per-block
expert routing inside the kernel). Caller responsibilities: Read more
Auto Trait Implementations§
impl Freeze for CpuMarlinExpertStack
impl RefUnwindSafe for CpuMarlinExpertStack
impl Send for CpuMarlinExpertStack
impl Sync for CpuMarlinExpertStack
impl Unpin for CpuMarlinExpertStack
impl UnsafeUnpin for CpuMarlinExpertStack
impl UnwindSafe for CpuMarlinExpertStack
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value. Read more