Module executor

PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.

Each TensorOp maps to exactly one kernel launch. The executor walks the compute graph in topological order and dispatches the matching kernel for each node. Combined with CUDA graph capture, this reduces 430 launches to ~15 tensor-level dispatches, which are then replayed as a single graph.
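
The topological walk described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `Node`, `DispatchSketch`, `RecordingDispatcher`, and `run_in_topological_order` are hypothetical stand-ins for the real graph type, the `KernelDispatch` trait, and `execute_graph`, whose signatures are not shown on this page. The ordering uses Kahn's algorithm, which is an assumption about how the order is computed.

```rust
use std::collections::VecDeque;

/// Operation kinds, named after the rows of the Kernel Mapping table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TensorOp { MulMat, RmsNorm, Add, Rope, Mul, SoftMax, Copy }

/// Hypothetical graph node: an op plus the indices of the nodes it reads.
pub struct Node {
    pub op: TensorOp,
    pub inputs: Vec<usize>,
}

/// Hypothetical stand-in for `KernelDispatch`: one call per node.
pub trait DispatchSketch {
    fn dispatch(&mut self, node_id: usize, op: TensorOp) -> Result<(), String>;
}

/// Example dispatcher that only records the order in which nodes were visited.
#[derive(Default)]
pub struct RecordingDispatcher {
    pub order: Vec<usize>,
}

impl DispatchSketch for RecordingDispatcher {
    fn dispatch(&mut self, node_id: usize, _op: TensorOp) -> Result<(), String> {
        self.order.push(node_id);
        Ok(())
    }
}

/// Visits every node after all of its inputs (Kahn's algorithm) and
/// issues exactly one dispatch per node. Returns the number of dispatches.
pub fn run_in_topological_order<D: DispatchSketch>(
    nodes: &[Node],
    dispatcher: &mut D,
) -> Result<usize, String> {
    let n = nodes.len();
    let mut indegree = vec![0usize; n];
    let mut consumers: Vec<Vec<usize>> = vec![Vec::new(); n];

    // Count unresolved inputs per node and record who consumes each node.
    for (id, node) in nodes.iter().enumerate() {
        for &input in &node.inputs {
            if input >= n {
                return Err(format!("node {id} reads unknown input {input}"));
            }
            indegree[id] += 1;
            consumers[input].push(id);
        }
    }

    // Nodes with no inputs are ready immediately.
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut dispatched = 0;

    while let Some(id) = ready.pop_front() {
        dispatcher.dispatch(id, nodes[id].op)?;
        dispatched += 1;
        for &consumer in &consumers[id] {
            indegree[consumer] -= 1;
            if indegree[consumer] == 0 {
                ready.push_back(consumer);
            }
        }
    }

    // Any node left undispatched is part of a cycle.
    if dispatched != n {
        return Err("compute graph contains a cycle".to_string());
    }
    Ok(dispatched)
}
```

Keeping the dispatcher behind a trait means the same walk can drive a real GPU backend or a recording stub like the one above, which is convenient for checking dispatch order without a device.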

§Kernel Mapping

TensorOp	Kernel	Source
MulMat	BatchedHwDp4aQ4KGemvKernel	trueno-gpu/kernels/quantize/q4k/
RmsNorm	BatchedVectorizedRmsNormKernel	trueno-gpu/kernels/layernorm/
Add	BatchedResidualAddKernel	trueno-gpu/kernels/
Rope	BatchedRopeKernel	trueno-gpu/kernels/
Mul	FusedGateUpSwigluKernel	trueno-gpu/kernels/quantize/q4k/
SoftMax	(attention dispatch)	realizr attention module
Copy	cuMemcpyDtoDAsync	trueno driver
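
The table above can be expressed as a single exhaustive `match`, so that adding a new op without a dispatch target fails to compile. This is a sketch only: the `TensorOp` variants are taken from the table, while `DispatchTarget` and `dispatch_target` are hypothetical names introduced for illustration, and the kernel names appear as plain strings rather than as the real kernel types.

```rust
/// Operation kinds, one per row of the Kernel Mapping table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TensorOp { MulMat, RmsNorm, Add, Rope, Mul, SoftMax, Copy }

/// Hypothetical description of where an op is sent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DispatchTarget {
    /// A named GPU kernel.
    Kernel(&'static str),
    /// Handed to the attention module instead of a single named kernel.
    AttentionDispatch,
    /// A driver call rather than a compute kernel.
    DriverCall(&'static str),
}

/// Maps each op to its target, mirroring the table row by row.
pub fn dispatch_target(op: TensorOp) -> DispatchTarget {
    match op {
        TensorOp::MulMat => DispatchTarget::Kernel("BatchedHwDp4aQ4KGemvKernel"),
        TensorOp::RmsNorm => DispatchTarget::Kernel("BatchedVectorizedRmsNormKernel"),
        TensorOp::Add => DispatchTarget::Kernel("BatchedResidualAddKernel"),
        TensorOp::Rope => DispatchTarget::Kernel("BatchedRopeKernel"),
        TensorOp::Mul => DispatchTarget::Kernel("FusedGateUpSwigluKernel"),
        TensorOp::SoftMax => DispatchTarget::AttentionDispatch,
        TensorOp::Copy => DispatchTarget::DriverCall("cuMemcpyDtoDAsync"),
    }
}
```

Separating `AttentionDispatch` and `DriverCall` from `Kernel` reflects that the SoftMax and Copy rows do not name a kernel type in the table.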

Structs§

GraphExecResult: Result of executing a compute graph.

Traits§

KernelDispatch: Trait for dispatching tensor operations to GPU kernels.

Functions§

execute_graph: Execute a compute graph using the provided kernel dispatcher.
