PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.
Each TensorOp maps to exactly one kernel launch. The executor walks the compute graph in topological order and dispatches the appropriate kernel for each node. Combined with CUDA graph capture, this reduces 430 kernel launches to ~15 tensor-level dispatches, which are then replayed as a single CUDA graph.
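The walk described above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `Node`, `topo_order`, and `walk` are hypothetical names, the graph is assumed to be a flat list of nodes whose inputs are referenced by index, and the ordering uses Kahn's algorithm, which the executor may or may not use.

```rust
use std::collections::VecDeque;

/// Hypothetical graph node: an op name plus the indices of the nodes it reads.
pub struct Node {
    pub op: &'static str,
    pub inputs: Vec<usize>,
}

/// Returns node indices in topological order (Kahn's algorithm).
/// Returns None if the graph has a cycle or an out-of-range input index.
pub fn topo_order(nodes: &[Node]) -> Option<Vec<usize>> {
    let n = nodes.len();
    let mut indegree = vec![0usize; n];
    let mut consumers: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (i, node) in nodes.iter().enumerate() {
        for &src in &node.inputs {
            if src >= n {
                return None;
            }
            indegree[i] += 1;
            consumers[src].push(i);
        }
    }
    // Nodes with no pending inputs are ready to run.
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(i) = ready.pop_front() {
        order.push(i);
        for &c in &consumers[i] {
            indegree[c] -= 1;
            if indegree[c] == 0 {
                ready.push_back(c);
            }
        }
    }
    if order.len() == n { Some(order) } else { None }
}

/// Calls `launch` exactly once per node, in topological order.
/// Returns the number of launches, or None if no valid order exists.
pub fn walk<F: FnMut(usize, &Node)>(nodes: &[Node], mut launch: F) -> Option<usize> {
    let order = topo_order(nodes)?;
    for &i in &order {
        launch(i, &nodes[i]);
    }
    Some(order.len())
}
```

In this sketch the `launch` callback stands in for a kernel launch, so a graph of three nodes produces exactly three calls, matching the one-launch-per-op rule.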
§Kernel Mapping
| TensorOp | Kernel | Source |
|---|---|---|
| MulMat | BatchedHwDp4aQ4KGemvKernel | trueno-gpu/kernels/quantize/q4k/ |
| RmsNorm | BatchedVectorizedRmsNormKernel | trueno-gpu/kernels/layernorm/ |
| Add | BatchedResidualAddKernel | trueno-gpu/kernels/ |
| Rope | BatchedRopeKernel | trueno-gpu/kernels/ |
| Mul | FusedGateUpSwigluKernel | trueno-gpu/kernels/quantize/q4k/ |
| SoftMax | (attention dispatch) | realizr attention module |
| Copy | cuMemcpyDtoDAsync | trueno driver |
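The table can be read as a total function from op to kernel. The sketch below encodes it as a `match`; the `TensorOp` enum here is a hypothetical reconstruction from the table's first column (the real enum may carry payloads or extra variants), and `kernel_name` is an illustrative helper, not part of the documented API.

```rust
/// Hypothetical reconstruction of the op set listed in the kernel mapping table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TensorOp {
    MulMat,
    RmsNorm,
    Add,
    Rope,
    Mul,
    SoftMax,
    Copy,
}

/// Returns the kernel (or driver call) the table associates with each op.
/// The match is exhaustive, so adding a variant without a mapping fails to compile.
pub fn kernel_name(op: TensorOp) -> &'static str {
    match op {
        TensorOp::MulMat => "BatchedHwDp4aQ4KGemvKernel",
        TensorOp::RmsNorm => "BatchedVectorizedRmsNormKernel",
        TensorOp::Add => "BatchedResidualAddKernel",
        TensorOp::Rope => "BatchedRopeKernel",
        TensorOp::Mul => "FusedGateUpSwigluKernel",
        // SoftMax is routed to the attention module rather than a single named kernel.
        TensorOp::SoftMax => "(attention dispatch)",
        // Copy is a driver-level device-to-device memcpy, not a compute kernel.
        TensorOp::Copy => "cuMemcpyDtoDAsync",
    }
}
```

An exhaustive `match` is used here instead of a lookup table so that the compiler enforces the one-op-to-one-kernel rule.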
Structs§
- GraphExecResult - Result of executing a compute graph.
Traits§
- KernelDispatch - Trait for dispatching tensor operations to GPU kernels.
Functions§
- execute_graph - Execute a compute graph using the provided kernel dispatcher.
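This page does not show the signatures of these items, so the following is only a sketch of how a dispatcher trait, an executor function, and a result struct could fit together. Every name in it (`DispatchSketch`, `ExecSummary`, `run_graph`, `Recorder`) is hypothetical and deliberately different from the real items; ops are assumed to arrive already in topological order and are represented as plain strings.

```rust
/// Hypothetical stand-in for a kernel dispatch trait: one call per op.
pub trait DispatchSketch {
    fn dispatch(&mut self, op: &str) -> Result<(), String>;
}

/// Hypothetical execution result: how many ops were dispatched.
#[derive(Debug, PartialEq, Eq)]
pub struct ExecSummary {
    pub dispatched: usize,
}

/// Dispatches each op in order and stops at the first failure.
pub fn run_graph<D: DispatchSketch>(
    ops: &[&str],
    dispatcher: &mut D,
) -> Result<ExecSummary, String> {
    let mut dispatched = 0;
    for &op in ops {
        dispatcher.dispatch(op)?;
        dispatched += 1;
    }
    Ok(ExecSummary { dispatched })
}

/// Test double that records op names instead of launching GPU kernels.
/// If `reject` is set, dispatching that op returns an error.
#[derive(Default)]
pub struct Recorder {
    pub log: Vec<String>,
    pub reject: Option<&'static str>,
}

impl DispatchSketch for Recorder {
    fn dispatch(&mut self, op: &str) -> Result<(), String> {
        if self.reject == Some(op) {
            return Err(format!("no kernel for op {op}"));
        }
        self.log.push(op.to_string());
        Ok(())
    }
}
```

Putting dispatch behind a trait lets the graph walk be exercised without a GPU: a recording implementation such as `Recorder` can check dispatch order and error propagation, while a CUDA-backed implementation would perform the actual launches.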