PMAT-291: Tensor Compute Graph for GPU Inference (reduces ~430 dispatches to ~15)
Inspired by ggml’s compute graph pattern (transpiled via decy for reference). Pure Rust, no FFI. Reduces ~430 individual cuLaunchKernel dispatches to ~15 tensor-level operations per decode step.
§Design (from cross-project analysis)
- ggml: C tensor graph with ~15 nodes, CUDA graph replay = 1 launch
- vLLM: PyTorch/inductor IR fusion + CUDA graphs (~80 nodes)
- realizr current: 430 individual kernel dispatches
- realizr target: ~15 tensor ops via this module + CUDA graph replay
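The reduction above comes from dispatching one tensor-level op per graph node instead of one launch per low-level kernel. The following is an illustrative sketch only: `ComputeGraph`, `TensorNode`, and `TensorOp` mirror the items declared below, but every field name and enum variant here is an assumption, not the crate's actual API.

```rust
// Illustrative sketch: field names and variants are assumptions,
// not the crate's real definitions.

/// Tensor operation types for decoder inference (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TensorOp {
    Embed,
    RmsNorm,
    MatMul,
    Rope,
    Attention,
    Residual,
}

/// A node in the compute graph (hypothetical fields).
struct TensorNode {
    op: TensorOp,
    /// Indices of input nodes within the graph.
    inputs: Vec<usize>,
}

/// Topologically sorted list of tensor operations.
struct ComputeGraph {
    nodes: Vec<TensorNode>,
}

impl ComputeGraph {
    fn new() -> Self {
        ComputeGraph { nodes: Vec::new() }
    }

    /// Append a node; callers add nodes in topological order,
    /// so each input index must refer to an earlier node.
    fn push(&mut self, op: TensorOp, inputs: Vec<usize>) -> usize {
        assert!(inputs.iter().all(|&i| i < self.nodes.len()));
        self.nodes.push(TensorNode { op, inputs });
        self.nodes.len() - 1
    }
}

/// Build a tiny decode-step graph: a handful of tensor-level ops
/// stands in for hundreds of individual kernel launches.
fn build_decode_graph() -> ComputeGraph {
    let mut g = ComputeGraph::new();
    let x = g.push(TensorOp::Embed, vec![]);
    let n = g.push(TensorOp::RmsNorm, vec![x]);
    let q = g.push(TensorOp::MatMul, vec![n]);
    let r = g.push(TensorOp::Rope, vec![q]);
    let a = g.push(TensorOp::Attention, vec![r]);
    let o = g.push(TensorOp::MatMul, vec![a]);
    g.push(TensorOp::Residual, vec![x, o]);
    g
}

fn main() {
    let g = build_decode_graph();
    // One dispatch per graph node instead of one per low-level kernel.
    println!("dispatches per decode step: {}", g.nodes.len());
}
```

In the ggml-style design, a graph like this is built once and then replayed as a CUDA graph, so the per-node dispatch count (here 7, well under the ~15 target) bounds the launch overhead per decode step.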
§Academic References
- [Kwon et al., SOSP 2023] PagedAttention (arxiv:2309.06180)
- [Yu et al., OSDI 2022] Orca iteration-level scheduling
- [Dao, NeurIPS 2022] FlashAttention (arxiv:2205.14135)
Re-exports§
pub use executor::execute_graph;
pub use executor::KernelDispatch;
Modules§
- executor
- PMAT-291: Graph executor – dispatches tensor operations to CUDA kernels.
Structs§
- ComputeGraph - Compute graph: topologically sorted list of tensor operations.
- OpParams
- Operation-specific parameters.
- TensorNode - A node in the compute graph.
Enums§
- TensorOp - Tensor operation types for decoder inference.
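To show how a topologically sorted node list maps onto kernel launches, here is a hypothetical single-pass executor. The names follow the re-exported `execute_graph` and `KernelDispatch`, but the signatures, fields, and op variants are assumed for illustration and are not the crate's actual API.

```rust
// Hypothetical executor sketch: names echo `execute_graph` /
// `KernelDispatch` above, but all signatures and fields are assumed.

#[derive(Debug, Clone, Copy, PartialEq)]
enum TensorOp {
    RmsNorm,
    MatMul,
    Softmax,
}

struct TensorNode {
    op: TensorOp,
    /// Indices of input nodes within the graph.
    inputs: Vec<usize>,
}

/// Record of one kernel launch produced for a graph node.
#[derive(Debug)]
struct KernelDispatch {
    node: usize,
    op: TensorOp,
}

/// Walk the topologically sorted node list once, emitting one
/// dispatch per tensor op rather than one per low-level kernel.
fn execute_graph(nodes: &[TensorNode]) -> Vec<KernelDispatch> {
    nodes
        .iter()
        .enumerate()
        .map(|(i, n)| {
            // Topological order guarantees inputs were dispatched earlier.
            debug_assert!(n.inputs.iter().all(|&j| j < i));
            KernelDispatch { node: i, op: n.op }
        })
        .collect()
}

fn main() {
    let nodes = vec![
        TensorNode { op: TensorOp::RmsNorm, inputs: vec![] },
        TensorNode { op: TensorOp::MatMul, inputs: vec![0] },
        TensorNode { op: TensorOp::Softmax, inputs: vec![1] },
    ];
    let dispatches = execute_graph(&nodes);
    println!("{} dispatches", dispatches.len());
}
```

Because the list is already topologically sorted, execution needs no scheduling at dispatch time; this is what makes the whole sequence amenable to CUDA graph capture and single-launch replay.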