Module graph


PMAT-291: Tensor Compute Graph for GPU Inference (reduces ~430 dispatches to ~15)

Inspired by ggml’s compute-graph pattern (transpiled via decy for reference). Pure Rust, no FFI. Reduces ~430 individual `cuLaunchKernel` dispatches to ~15 tensor-level operations per decode step.

§Design (from cross-project analysis)

  • ggml: C tensor graph with ~15 nodes, CUDA graph replay = 1 launch
  • vLLM: PyTorch/inductor IR fusion + CUDA graphs (~80 nodes)
  • realizr current: 430 individual kernel dispatches
  • realizr target: ~15 tensor ops via this module + CUDA graph replay

§Academic References

  • [Kwon et al., SOSP 2023] PagedAttention (arXiv:2309.06180)
  • [Yu et al., OSDI 2022] Orca iteration-level scheduling
  • [Dao, NeurIPS 2022] FlashAttention (arXiv:2205.14135)

Re-exports§

pub use executor::execute_graph;
pub use executor::KernelDispatch;

Modules§

executor
PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.

Structs§

ComputeGraph
Compute graph: topologically sorted list of tensor operations.
OpParams
Operation-specific parameters.
TensorNode
A node in the compute graph.

Enums§

TensorOp
Tensor operation types for decoder inference.