ferrum_kernels::quant_linear::cpu_marlin_stack

Struct CpuMarlinExpertStack

pub struct CpuMarlinExpertStack {
    pub store: Arc<CpuGptqStore>,
    pub num_experts: usize,
    pub n_per_expert: usize,
    pub k: usize,
}

Fields

store: Arc<CpuGptqStore>

num_experts: usize

n_per_expert: usize

k: usize

Implementations

impl CpuMarlinExpertStack

pub fn new(store: Arc<CpuGptqStore>, num_experts: usize, n_per_expert: usize, k: usize) -> Self

Trait Implementations

impl MarlinExpertStack<CpuBackend> for CpuMarlinExpertStack

fn n_per_expert(&self) -> usize

Per-expert output width (N tile cols).

fn k(&self) -> usize

Input width (K), common across experts.

fn num_experts(&self) -> usize

Number of experts packed into the tile.

fn as_any(&self) -> &dyn Any

Downcast hook, used at FFN dispatch boundaries where the caller needs to reach into the concrete store, e.g. to share workspace memory across phases. Standard dyn Any pattern.
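The dyn Any pattern mentioned above can be sketched with the standard library alone. The trait `ExpertStack`, the type `CpuStack`, and the helper `concrete_k` below are hypothetical stand-ins for illustration, not the crate's own types.

```rust
use std::any::Any;

// Hypothetical stand-in for the real trait; only the downcast hook is modeled.
trait ExpertStack {
    fn num_experts(&self) -> usize;
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical concrete type holding data the trait itself does not expose.
struct CpuStack {
    num_experts: usize,
    k: usize,
}

impl ExpertStack for CpuStack {
    fn num_experts(&self) -> usize {
        self.num_experts
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Recover the concrete type from a trait object.
// Returns None when the object is some other implementation.
fn concrete_k(stack: &dyn ExpertStack) -> Option<usize> {
    stack.as_any().downcast_ref::<CpuStack>().map(|s| s.k)
}

fn main() {
    let stack: Box<dyn ExpertStack> = Box::new(CpuStack { num_experts: 8, k: 4096 });
    println!("experts = {}, k = {:?}", stack.num_experts(), concrete_k(stack.as_ref()));
}
```

Returning `Option` keeps the caller honest: code that downcasts has to decide what to do when it is handed a different backend's stack.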

fn zero_workspace(&self, _ctx: &mut <CpuBackend as Backend>::Context) -> Result<()>

Bulk-zero the per-expert Marlin workspace mutex slots. Call ONCE before a batch of bucketed gemm_phase_batched calls; this replaces the per-call cuMemsetD32Async (one launch per call) with a single launch for the whole batch. At c=32 with 128 active experts × 2 phases × 48 layers, that is 128 × 2 × 48 = 12,288 (~12k) memset launches/token reduced to ~96 (2 × 48).
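The launch counts quoted above can be reproduced with a few lines of arithmetic. The function name is hypothetical, and the "after" figure assumes one bulk zero per phase per layer, which is what the ~96 figure implies.

```rust
// Returns (launches with one memset per expert GEMM, launches with bulk zeroing).
fn memset_launches_per_token(
    active_experts: usize,
    phases: usize,
    layers: usize,
) -> (usize, usize) {
    // One memset launch for every expert GEMM call.
    let per_call = active_experts * phases * layers;
    // Assumption: one bulk zero per phase per layer.
    let bulk = phases * layers;
    (per_call, bulk)
}

fn main() {
    let (before, after) = memset_launches_per_token(128, 2, 48);
    println!("{before} -> {after}"); // prints "12288 -> 96"
}
```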

fn gemm_phase_batched(&self, ctx: &mut <CpuBackend as Backend>::Context, input: &<CpuBackend as Backend>::Buffer, dispatches: &[(usize, usize, usize, usize)], output: &mut <CpuBackend as Backend>::Buffer, k: usize) -> Result<()>

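Batched per-expert offset GEMM. dispatches[i] = (expert_idx, in_row_offset, out_row_offset, m). For each dispatch, multiplies the (m × K) input rows starting at in_row_offset by tile[expert_idx] and writes the resulting (m × n_per_expert) slice at out_row_offset; the CUDA backend overlaps these via multi-stream round-robin.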
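To make the dispatch tuple concrete, here is a minimal reference sketch on plain f32 slices. It is not the crate's implementation: it assumes dense, unquantized, row-major weights (the real store is GPTQ-packed), and the name `gemm_phase_batched_ref` and its parameters are hypothetical.

```rust
/// Dispatch tuple: (expert_idx, in_row_offset, out_row_offset, m).
type Dispatch = (usize, usize, usize, usize);

/// Reference loop: for each dispatch, output rows = input rows @ weights[expert].
/// Assumed layouts: input is rows x k, weights[e] is k x n, output is rows x n,
/// all row-major.
fn gemm_phase_batched_ref(
    input: &[f32],
    weights: &[Vec<f32>],
    dispatches: &[Dispatch],
    output: &mut [f32],
    k: usize,
    n: usize,
) {
    for &(expert, in_off, out_off, m) in dispatches {
        let w = &weights[expert];
        for row in 0..m {
            let x = &input[(in_off + row) * k..(in_off + row + 1) * k];
            let y = &mut output[(out_off + row) * n..(out_off + row + 1) * n];
            for (col, out) in y.iter_mut().enumerate() {
                // Dot product of the input row with column `col` of the expert tile.
                *out = (0..k).map(|i| x[i] * w[i * n + col]).sum();
            }
        }
    }
}

fn main() {
    let (k, n) = (2, 2);
    // Expert 0 is the identity, expert 1 scales by 2.
    let weights = vec![vec![1.0, 0.0, 0.0, 1.0], vec![2.0, 0.0, 0.0, 2.0]];
    let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // three rows of width k
    let mut output = [0.0f32; 6];
    // Row 0 goes to expert 0; rows 1 and 2 go to expert 1.
    let dispatches = [(0, 0, 0, 1), (1, 1, 1, 2)];
    gemm_phase_batched_ref(&input, &weights, &dispatches, &mut output, k, n);
    println!("{output:?}");
}
```

Because each dispatch reads and writes a disjoint row range, the iterations are independent, which is what lets a GPU backend spread them across streams.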

fn make_expert_linear(self: Arc<Self>, expert_offset: usize, expert_n: usize, bias_host: Option<&[f32]>) -> Result<Box<dyn Linear<CpuBackend> + Send + Sync>>

Build a single-expert Linear view onto this stack’s [expert_offset .. expert_offset + expert_n) column slice. Used for per-expert dispatch outside the MoE phase batching (e.g. shared-experts code paths). expert_offset and expert_n MUST be multiples of the backend’s Marlin N tile (64 on CUDA).
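A caller might validate the slice before requesting a view. The sketch below is a hypothetical pre-check, not part of the crate: it assumes a tile width of 64 (the value stated for CUDA; the CPU backend's value is not given here) and a caller-supplied total column count `total_n`.

```rust
// Assumed tile width: 64 is the CUDA value quoted above; the CPU value may differ.
const MARLIN_N_TILE: usize = 64;

// Hypothetical pre-check for a [expert_offset .. expert_offset + expert_n) slice.
fn check_expert_slice(
    expert_offset: usize,
    expert_n: usize,
    total_n: usize,
) -> Result<(), String> {
    if expert_offset % MARLIN_N_TILE != 0 || expert_n % MARLIN_N_TILE != 0 {
        return Err(format!(
            "offset {expert_offset} and width {expert_n} must be multiples of {MARLIN_N_TILE}"
        ));
    }
    if expert_offset + expert_n > total_n {
        return Err(format!(
            "slice end {} exceeds total width {total_n}",
            expert_offset + expert_n
        ));
    }
    Ok(())
}

fn main() {
    println!("{:?}", check_expert_slice(128, 64, 256));
    println!("{:?}", check_expert_slice(100, 64, 256));
}
```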
