Module executor

PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.

Each TensorOp maps to exactly one kernel launch. The executor walks the compute graph in topological order and dispatches the matching kernel for each node. Combined with CUDA graph capture, this reduces 430 kernel launches to ~15 tensor-level dispatches, which are then replayed as a single CUDA graph.
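
The walk described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `Op`, `Node`, `topo_order`, and `walk` are hypothetical names, the graph is assumed to be a flat list of nodes that reference their inputs by index, and Kahn's algorithm is used as one common way to obtain a topological order. The `launch` closure stands in for a real kernel launch.

```rust
use std::collections::VecDeque;

/// Hypothetical op set, named after the rows of the Kernel Mapping table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    MulMat,
    RmsNorm,
    Add,
    Rope,
    Mul,
    SoftMax,
    Copy,
}

/// Hypothetical graph node: one op plus the indices of the nodes it reads.
pub struct Node {
    pub op: Op,
    pub inputs: Vec<usize>,
}

/// Returns node indices in topological order (Kahn's algorithm).
/// Returns `None` if an input index is out of range or the graph has a cycle.
pub fn topo_order(nodes: &[Node]) -> Option<Vec<usize>> {
    let n = nodes.len();
    let mut indegree = vec![0usize; n];
    let mut consumers: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (i, node) in nodes.iter().enumerate() {
        for &src in &node.inputs {
            if src >= n {
                return None;
            }
            indegree[i] += 1;
            consumers[src].push(i);
        }
    }

    // Nodes with no pending inputs are ready to run.
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(i) = ready.pop_front() {
        order.push(i);
        for &c in &consumers[i] {
            indegree[c] -= 1;
            if indegree[c] == 0 {
                ready.push_back(c);
            }
        }
    }

    // Any node left unvisited sits on a cycle.
    if order.len() == n {
        Some(order)
    } else {
        None
    }
}

/// Visits every node in topological order and calls `launch` exactly once
/// per node. Returns the number of launches, or `None` for an invalid graph.
pub fn walk<F: FnMut(usize, Op)>(nodes: &[Node], mut launch: F) -> Option<usize> {
    let order = topo_order(nodes)?;
    for &i in &order {
        launch(i, nodes[i].op);
    }
    Some(order.len())
}
```

Because each node triggers exactly one call to `launch`, the number of launches equals the number of nodes in the graph, which is the property the one-op-one-kernel rule relies on.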

§Kernel Mapping

| TensorOp | Kernel | Source |
|----------|--------|--------|
| MulMat | BatchedHwDp4aQ4KGemvKernel | trueno-gpu/kernels/quantize/q4k/ |
| RmsNorm | BatchedVectorizedRmsNormKernel | trueno-gpu/kernels/layernorm/ |
| Add | BatchedResidualAddKernel | trueno-gpu/kernels/ |
| Rope | BatchedRopeKernel | trueno-gpu/kernels/ |
| Mul | FusedGateUpSwigluKernel | trueno-gpu/kernels/quantize/q4k/ |
| SoftMax | (attention dispatch) | realizr attention module |
| Copy | cuMemcpyDtoDAsync | trueno driver |
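
The mapping above can be read as a single exhaustive `match`. The sketch below is an assumption-laden illustration: the `TensorOp` enum here is a stand-in that only mirrors the names in the table, and `Target` and `target_for` are hypothetical helpers, not items exported by this module. It separates the three kinds of destination the table lists: a named kernel, the attention dispatch path, and a driver-level copy.

```rust
/// Stand-in for the crate's `TensorOp`; the variants mirror the table rows only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TensorOp {
    MulMat,
    RmsNorm,
    Add,
    Rope,
    Mul,
    SoftMax,
    Copy,
}

/// Hypothetical description of where an op is sent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    /// A single named GPU kernel.
    Kernel(&'static str),
    /// Handed to the attention module; the table names no single kernel.
    AttentionDispatch,
    /// A driver call rather than a kernel.
    DriverCall(&'static str),
}

/// Maps each op to its destination, following the table row by row.
/// The `match` has no wildcard arm, so adding a variant to `TensorOp`
/// without a mapping becomes a compile error.
pub fn target_for(op: TensorOp) -> Target {
    match op {
        TensorOp::MulMat => Target::Kernel("BatchedHwDp4aQ4KGemvKernel"),
        TensorOp::RmsNorm => Target::Kernel("BatchedVectorizedRmsNormKernel"),
        TensorOp::Add => Target::Kernel("BatchedResidualAddKernel"),
        TensorOp::Rope => Target::Kernel("BatchedRopeKernel"),
        TensorOp::Mul => Target::Kernel("FusedGateUpSwigluKernel"),
        TensorOp::SoftMax => Target::AttentionDispatch,
        TensorOp::Copy => Target::DriverCall("cuMemcpyDtoDAsync"),
    }
}
```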

Structs§

GraphExecResult
Result of executing a compute graph.

Traits§

KernelDispatch
Trait for dispatching tensor operations to GPU kernels.

Functions§

execute_graph
Execute a compute graph using the provided kernel dispatcher.
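
Putting dispatch behind a trait lets the same graph walk drive either real GPU kernels or a test double. The sketch below illustrates that pattern only. The actual signatures of `KernelDispatch`, `GraphExecResult`, and `execute_graph` are not shown on this page, so the example deliberately uses different, hypothetical names (`SketchDispatch`, `SketchResult`, `run_ops`, `Recorder`) and assumes the ops arrive already sorted in topological order.

```rust
/// Hypothetical subset of tensor ops, enough for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchOp {
    MulMat,
    RmsNorm,
    Add,
}

/// Hypothetical dispatcher trait: one call per op, which may fail.
pub trait SketchDispatch {
    fn dispatch(&mut self, op: SketchOp) -> Result<(), String>;
}

/// Hypothetical result type: how many ops were dispatched.
#[derive(Debug, PartialEq, Eq)]
pub struct SketchResult {
    pub dispatched: usize,
}

/// Dispatches every op in order and stops at the first error.
pub fn run_ops<D: SketchDispatch>(
    ops: &[SketchOp],
    dispatcher: &mut D,
) -> Result<SketchResult, String> {
    let mut dispatched = 0;
    for &op in ops {
        dispatcher.dispatch(op)?;
        dispatched += 1;
    }
    Ok(SketchResult { dispatched })
}

/// Test double that records ops instead of launching kernels.
#[derive(Default)]
pub struct Recorder {
    pub log: Vec<SketchOp>,
}

impl SketchDispatch for Recorder {
    fn dispatch(&mut self, op: SketchOp) -> Result<(), String> {
        self.log.push(op);
        Ok(())
    }
}
```

A recording dispatcher like `Recorder` makes it possible to check dispatch order and count on a machine without a GPU; a production implementation would launch a kernel in `dispatch` instead of appending to a log.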