PMAT-291: Tensor Compute Graph for GPU Inference (reduces ~430 dispatches to ~15)
Inspired by ggml’s compute graph pattern (transpiled via decy for reference). Pure Rust, no FFI. Reduces ~430 individual cuLaunchKernel dispatches to ~15 tensor-level operations per decode step.
§Design (from cross-project analysis)
- ggml: C tensor graph with ~15 nodes, CUDA graph replay = 1 launch
- vLLM: PyTorch/inductor IR fusion + CUDA graphs (~80 nodes)
- realizr current: 430 individual kernel dispatches
- realizr target: ~15 tensor ops via this module + CUDA graph replay
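The reduction above comes from dispatching one tensor-level op per graph node instead of one launch per low-level kernel. The following is an illustrative sketch only: `ComputeGraph`, `TensorNode`, and `TensorOp` mirror the items declared below, but every field name and enum variant here is an assumption, not the crate's actual API.

```rust
// Illustrative sketch: field names and variants are assumptions,
// not the crate's real definitions.

/// Tensor operation types for decoder inference (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TensorOp {
    Embed,
    RmsNorm,
    MatMul,
    Rope,
    Attention,
    Residual,
}

/// A node in the compute graph (hypothetical fields).
struct TensorNode {
    op: TensorOp,
    /// Indices of input nodes within the graph.
    inputs: Vec<usize>,
}

/// Topologically sorted list of tensor operations.
struct ComputeGraph {
    nodes: Vec<TensorNode>,
}

impl ComputeGraph {
    fn new() -> Self {
        ComputeGraph { nodes: Vec::new() }
    }

    /// Append a node; callers add nodes in topological order,
    /// so each input index must refer to an earlier node.
    fn push(&mut self, op: TensorOp, inputs: Vec<usize>) -> usize {
        assert!(inputs.iter().all(|&i| i < self.nodes.len()));
        self.nodes.push(TensorNode { op, inputs });
        self.nodes.len() - 1
    }
}

/// Build a tiny decode-step graph: a handful of tensor-level ops
/// stands in for hundreds of individual kernel launches.
fn build_decode_graph() -> ComputeGraph {
    let mut g = ComputeGraph::new();
    let x = g.push(TensorOp::Embed, vec![]);
    let n = g.push(TensorOp::RmsNorm, vec![x]);
    let q = g.push(TensorOp::MatMul, vec![n]);
    let r = g.push(TensorOp::Rope, vec![q]);
    let a = g.push(TensorOp::Attention, vec![r]);
    let o = g.push(TensorOp::MatMul, vec![a]);
    g.push(TensorOp::Residual, vec![x, o]);
    g
}

fn main() {
    let g = build_decode_graph();
    // One dispatch per graph node instead of one per low-level kernel.
    println!("dispatches per decode step: {}", g.nodes.len());
}
```

In the ggml-style design, a graph like this is built once and then replayed as a CUDA graph, so the per-node dispatch count (here 7, well under the ~15 target) bounds the launch overhead per decode step.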
§Academic References
- [Kwon et al., SOSP 2023] PagedAttention (arxiv:2309.06180)
- [Yu et al., OSDI 2022] Orca iteration-level scheduling
- [Dao, NeurIPS 2022] FlashAttention (arxiv:2205.14135)
Re-exports§
pub use executor::execute_graph;
pub use executor::KernelDispatch;
Modules§
- executor
- PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.
Structs§
- ComputeGraph - Compute graph: topologically sorted list of tensor operations.
- OpParams
- Operation-specific parameters.
- TensorNode - A node in the compute graph.
Enums§
- TensorOp - Tensor operation types for decoder inference.
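To show how a topologically sorted node list maps onto kernel launches, here is a hypothetical single-pass executor. The names follow the re-exported `execute_graph` and `KernelDispatch`, but the signatures, fields, and op variants are assumed for illustration and are not the crate's actual API.

```rust
// Hypothetical executor sketch: names echo `execute_graph` /
// `KernelDispatch` above, but all signatures and fields are assumed.

#[derive(Debug, Clone, Copy, PartialEq)]
enum TensorOp {
    RmsNorm,
    MatMul,
    Softmax,
}

struct TensorNode {
    op: TensorOp,
    /// Indices of input nodes within the graph.
    inputs: Vec<usize>,
}

/// Record of one kernel launch produced for a graph node.
#[derive(Debug)]
struct KernelDispatch {
    node: usize,
    op: TensorOp,
}

/// Walk the topologically sorted node list once, emitting one
/// dispatch per tensor op rather than one per low-level kernel.
fn execute_graph(nodes: &[TensorNode]) -> Vec<KernelDispatch> {
    nodes
        .iter()
        .enumerate()
        .map(|(i, n)| {
            // Topological order guarantees inputs were dispatched earlier.
            debug_assert!(n.inputs.iter().all(|&j| j < i));
            KernelDispatch { node: i, op: n.op }
        })
        .collect()
}

fn main() {
    let nodes = vec![
        TensorNode { op: TensorOp::RmsNorm, inputs: vec![] },
        TensorNode { op: TensorOp::MatMul, inputs: vec![0] },
        TensorNode { op: TensorOp::Softmax, inputs: vec![1] },
    ];
    let dispatches = execute_graph(&nodes);
    println!("{} dispatches", dispatches.len());
}
```

Because the list is already topologically sorted, execution needs no scheduling at dispatch time; this is what makes the whole sequence amenable to CUDA graph capture and single-launch replay.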