Struct CpuMarlinExpertStack 

pub struct CpuMarlinExpertStack {
    pub store: Arc<CpuGptqStore>,
    pub num_experts: usize,
    pub n_per_expert: usize,
    pub k: usize,
}

Fields§

store: Arc<CpuGptqStore>
num_experts: usize
n_per_expert: usize
k: usize

Implementations§

impl CpuMarlinExpertStack

pub fn new(
    store: Arc<CpuGptqStore>,
    num_experts: usize,
    n_per_expert: usize,
    k: usize,
) -> Self

Trait Implementations§

impl MarlinExpertStack<CpuBackend> for CpuMarlinExpertStack

fn n_per_expert(&self) -> usize

Per-expert output width (N tile cols).

fn k(&self) -> usize

Input width (K), common across experts.

fn num_experts(&self) -> usize

Number of experts packed into the tile.

fn as_any(&self) -> &dyn Any

Downcast hook, used at FFN dispatch boundaries where the caller needs to reach into the concrete store, e.g. to share workspace memory across phases. This is the standard dyn Any pattern.
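
The downcast pattern described above can be sketched in isolation. This is a minimal illustration only: ExpertStack, ToyCpuStack, OtherStack and workspace_len_of are hypothetical stand-ins invented for the example, not types from this crate. Only std::any::Any and downcast_ref come from the standard library.

```rust
use std::any::Any;

// Hypothetical stand-in for a trait-object interface that exposes a downcast hook.
trait ExpertStack {
    fn num_experts(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical concrete type; the real CpuMarlinExpertStack holds an Arc<CpuGptqStore>.
struct ToyCpuStack {
    num_experts: usize,
    workspace_len: usize,
}

// A second hypothetical implementor, used to show a failed downcast.
struct OtherStack;

impl ExpertStack for ToyCpuStack {
    fn num_experts(&self) -> usize {
        self.num_experts
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl ExpertStack for OtherStack {
    fn num_experts(&self) -> usize {
        0
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Reaches through the trait object into the concrete type.
// Returns None when the stack is not a ToyCpuStack.
fn workspace_len_of(stack: &dyn ExpertStack) -> Option<usize> {
    stack
        .as_any()
        .downcast_ref::<ToyCpuStack>()
        .map(|s| s.workspace_len)
}
```

A caller holding only a &dyn ExpertStack can use workspace_len_of to reach concrete state when the type matches and fall back gracefully when it does not.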

fn zero_workspace(
    &self,
    _ctx: &mut <CpuBackend as Backend>::Context,
) -> Result<()>

Bulk-zero the per-expert Marlin workspace mutex slots. Call ONCE before a batch of bucketed gemm_phase_batched calls: this replaces the per-call cuMemsetD32Async (one launch per call) with a single launch for the whole batch. At c=32 with 128 active experts × 2 phases × 48 layers, that is ~12k memset launches/token reduced to ~96.
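
The launch counts quoted above follow from simple multiplication. The sketch below reproduces them under one assumption that the text does not state outright: that the bulk variant is called once per phase per layer, which is what yields the ~96 figure. The function names are hypothetical.

```rust
/// Memset launches per token when every per-expert GEMM call zeroes its own workspace.
fn per_call_launches(active_experts: usize, phases: usize, layers: usize) -> usize {
    active_experts * phases * layers
}

/// Memset launches per token when the workspace is bulk-zeroed once per phase per layer
/// (assumed call frequency; see the lead-in).
fn bulk_launches(phases: usize, layers: usize) -> usize {
    phases * layers
}
```

With 128 active experts, 2 phases and 48 layers, per_call_launches gives 12,288 and bulk_launches gives 96.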

fn gemm_phase_batched(
    &self,
    ctx: &mut <CpuBackend as Backend>::Context,
    input: &<CpuBackend as Backend>::Buffer,
    dispatches: &[(usize, usize, usize, usize)],
    output: &mut <CpuBackend as Backend>::Buffer,
    k: usize,
) -> Result<()>

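Batched per-expert offset GEMM. dispatches[i] = (expert_idx, in_row_offset, out_row_offset, m). For each dispatch, the (m × K) block of input rows starting at in_row_offset is multiplied by tile[expert_idx], producing the (m × n_per_expert) output slice starting at out_row_offset. The CUDA backend overlaps the per-expert GEMMs via multi-stream round-robin.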
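
To make the dispatch tuple concrete, here is a reference sketch of the same indexing on plain f32 slices. It is not this crate's implementation: the real stack reads GPTQ-packed weights from CpuGptqStore, whereas this example assumes dense row-major weights laid out as [num_experts][k][n] and row-major input and output buffers. The name gemm_phase_batched_ref and the Dispatch alias are hypothetical.

```rust
/// One dispatch entry: (expert_idx, in_row_offset, out_row_offset, m).
type Dispatch = (usize, usize, usize, usize);

/// Reference per-expert offset GEMM on dense f32 data (assumed layout, see lead-in).
/// `input` is [rows x k], `weights` is [num_experts x k x n], `output` is [rows x n].
fn gemm_phase_batched_ref(
    input: &[f32],
    weights: &[f32],
    dispatches: &[Dispatch],
    output: &mut [f32],
    k: usize,
    n: usize,
) {
    for &(expert, in_off, out_off, m) in dispatches {
        // Dense (k x n) weight tile belonging to this expert.
        let w = &weights[expert * k * n..(expert + 1) * k * n];
        for r in 0..m {
            // Row r of this dispatch, read from the input and written to the output
            // at their respective row offsets.
            let x = &input[(in_off + r) * k..(in_off + r + 1) * k];
            let y = &mut output[(out_off + r) * n..(out_off + r + 1) * n];
            for (c, y_c) in y.iter_mut().enumerate() {
                // Dot product of the input row with column c of the tile.
                *y_c = (0..k).map(|i| x[i] * w[i * n + c]).sum();
            }
        }
    }
}
```

Each dispatch touches only its own m output rows, which is why independent dispatches can run concurrently on a backend that supports it.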

fn make_expert_linear(
    self: Arc<Self>,
    expert_offset: usize,
    expert_n: usize,
    bias_host: Option<&[f32]>,
) -> Result<Box<dyn Linear<CpuBackend> + Send + Sync>>

Build a single-expert Linear<B> view onto this stack’s [expert_offset .. expert_offset + expert_n) column slice. Used for per-expert dispatch outside the MoE phase batching (e.g. shared-experts code paths). expert_offset and expert_n MUST be multiples of the backend’s Marlin N tile (64 on CUDA).
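
The alignment requirement above can be checked before calling make_expert_linear. The sketch below is a hypothetical helper, not part of this crate. It hard-codes the 64-column tile quoted for CUDA (the tile width on other backends is not stated here), and the bounds check against total_n is an added assumption about what a valid column slice looks like.

```rust
/// Marlin N tile width quoted in the docs for CUDA; other backends may differ.
const MARLIN_N_TILE: usize = 64;

/// Validates a [expert_offset .. expert_offset + expert_n) column slice.
/// `total_n` is the assumed total column count of the stack.
fn check_expert_slice(
    expert_offset: usize,
    expert_n: usize,
    total_n: usize,
) -> Result<(), String> {
    if expert_offset % MARLIN_N_TILE != 0 {
        return Err(format!(
            "expert_offset {expert_offset} is not a multiple of {MARLIN_N_TILE}"
        ));
    }
    if expert_n % MARLIN_N_TILE != 0 {
        return Err(format!(
            "expert_n {expert_n} is not a multiple of {MARLIN_N_TILE}"
        ));
    }
    // checked_add guards against overflow on pathological inputs.
    match expert_offset.checked_add(expert_n) {
        Some(end) if end <= total_n => Ok(()),
        _ => Err(format!(
            "slice starting at {expert_offset} with width {expert_n} exceeds {total_n} columns"
        )),
    }
}
```

Failing early with a descriptive message is usually easier to debug than a misaligned read inside a tiled kernel.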

fn gemm_phase_vllm(
    &self,
    _ctx: &mut B::Context,
    _input: &B::Buffer,
    _sorted_token_ids: &B::Buffer,
    _expert_ids: &B::Buffer,
    _num_tokens_past_padded: &B::Buffer,
    _output: &mut B::Buffer,
    _prob_m: usize,
    _moe_block_size: usize,
    _top_k: usize,
) -> Result<()>

vLLM marlin_moe_wna16 fused GEMM (single launch, per-block expert routing inside the kernel). Caller responsibilities: Read more

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.