pub struct ComputeGraph { /* private fields */ }
A recorded sequence of GPU compute dispatches and barriers.
Created by running a forward pass with the encoder in capture mode.
Can be replayed into a real CommandEncoder via encode_sequential(),
producing identical Metal dispatch behavior to the original direct path.
Future phases (4e.2, 4e.3) will add fusion and reorder passes that transform the graph before encoding.
Implementations

impl ComputeGraph
pub fn from_nodes(nodes: Vec<CapturedNode>) -> Self
Create a compute graph from a pre-built list of captured nodes.
pub fn record(&mut self, node: CapturedNode)
Record a captured node into the graph.
pub fn dispatch_count(&self) -> usize
Number of dispatch nodes (excludes barriers).
pub fn barrier_count(&self) -> usize
Number of barrier nodes.
pub fn nodes(&self) -> &[CapturedNode]
Borrow the node list.
pub fn unannotated_dispatch_count(&self) -> usize
Count dispatch nodes that have empty read/write range annotations.
Used for diagnostics: if >0, the reorder pass cannot guarantee correctness because it relies on complete annotations.
pub fn into_nodes(self) -> Vec<CapturedNode>
Take ownership of the node list, consuming the graph.
pub fn encode_sequential(&self, encoder: &mut CommandEncoder) -> u32
Encode all nodes sequentially into the given encoder.
Barrier sentinel nodes emit a Metal memory barrier. Dispatch nodes
are replayed through CommandEncoder::replay_dispatch().
This produces identical GPU behavior to the direct-dispatch path — same pipeline bindings, same dispatch dimensions, same barrier placement.
Returns the number of barriers emitted.
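A minimal, self-contained sketch of the replay loop may help. The types below (Node, Graph) are simplified stand-ins for CapturedNode and ComputeGraph, not the crate's real definitions: a Barrier sentinel stands in for a Metal memory barrier, a Dispatch for a replayed pipeline dispatch, and the method returns the barrier count as documented.

```rust
// Hypothetical model of encode_sequential: replay nodes in recorded order,
// counting barrier sentinels. Real code would call replay_dispatch() and
// emit an actual Metal memory barrier instead.
#[derive(Debug)]
enum Node {
    Dispatch { pipeline: &'static str },
    Barrier,
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    /// Replay every node in recorded order; returns the number of barriers.
    fn encode_sequential(&self) -> u32 {
        let mut barriers = 0;
        for node in &self.nodes {
            match node {
                Node::Dispatch { pipeline } => {
                    // real code: encoder.replay_dispatch(...) with the
                    // captured pipeline, bindings, and grid dimensions
                    let _ = pipeline;
                }
                Node::Barrier => barriers += 1,
            }
        }
        barriers
    }
}

fn main() {
    let g = Graph {
        nodes: vec![
            Node::Dispatch { pipeline: "rms_norm" },
            Node::Barrier,
            Node::Dispatch { pipeline: "elem_mul" },
        ],
    };
    assert_eq!(g.encode_sequential(), 1);
    println!("barriers emitted: {}", g.encode_sequential());
}
```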
pub fn encode_with_barriers(&self, encoder: &mut CommandEncoder) -> u32
Encode the graph into a Metal command buffer, computing barriers on the fly from each node’s read/write buffer ranges.
This is the correct encoding method for reordered graphs where barrier
sentinels have been stripped. Mirrors llama.cpp’s encode-time barrier
insertion via ggml_metal_op_concurrency_check.
Returns the number of barriers emitted.
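The conflict check can be sketched as follows. This is an assumption-laden simplification (Range, Dispatch, and the tracking logic are hypothetical names, and the real annotations may be richer), but it captures the core rule from llama.cpp's ggml_metal_op_concurrency_check: track the ranges written since the last barrier, and emit a barrier before any dispatch whose reads or writes overlap one of them.

```rust
// Sketch of encode-time barrier computation: keep the set of buffer ranges
// written since the last barrier; if the next dispatch touches an
// overlapping range, emit a barrier first and clear the set.
#[derive(Clone, Copy, PartialEq)]
struct Range { buffer: u32, start: u64, end: u64 }

fn overlaps(a: Range, b: Range) -> bool {
    a.buffer == b.buffer && a.start < b.end && b.start < a.end
}

struct Dispatch { reads: Vec<Range>, writes: Vec<Range> }

fn encode_with_barriers(dispatches: &[Dispatch]) -> u32 {
    let mut pending_writes: Vec<Range> = Vec::new();
    let mut barriers = 0;
    for d in dispatches {
        let conflict = d.reads.iter().chain(&d.writes)
            .any(|&r| pending_writes.iter().any(|&w| overlaps(r, w)));
        if conflict {
            barriers += 1;          // real code: emit a Metal memory barrier
            pending_writes.clear(); // all prior writes are now visible
        }
        pending_writes.extend_from_slice(&d.writes);
    }
    barriers
}

fn main() {
    let a = Range { buffer: 0, start: 0, end: 64 };
    let c = Range { buffer: 1, start: 0, end: 64 };
    let ds = vec![
        Dispatch { reads: vec![], writes: vec![a] },
        Dispatch { reads: vec![a], writes: vec![c] }, // reads the first write
        Dispatch { reads: vec![], writes: vec![Range { buffer: 2, start: 0, end: 64 }] },
    ];
    // Only the second dispatch needs a barrier; the third is independent.
    assert_eq!(encode_with_barriers(&ds), 1);
    println!("barriers: {}", encode_with_barriers(&ds));
}
```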
pub fn encode_dual_buffer(
    &self,
    encoder0: &mut CommandEncoder,
    encoder1: &mut CommandEncoder,
) -> (u32, u32)
Encode the graph using two command buffers for CPU/GPU overlap.
The first n0 dispatches are encoded into encoder0 and committed
immediately (GPU starts executing). The remaining dispatches are encoded
into encoder1. The caller is responsible for committing encoder1.
This matches llama.cpp’s dual command buffer pattern from
ggml_metal_graph_compute (ggml-metal-context.m:441-644):
n_nodes_0 = MAX(64, 0.1 * n_nodes) for the first buffer.
Command buffers submitted to the same MTLCommandQueue execute in
submission order, so committing encoder0 before encoder1 guarantees
encoder0's work finishes before encoder1's begins. The win: the GPU starts
executing encoder0's work while the CPU is still encoding encoder1.
Returns (barriers_buf0, barriers_buf1).
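The split point from the formula above can be sketched directly; split_point is a hypothetical helper name, but the arithmetic is the documented n_nodes_0 = MAX(64, 0.1 * n_nodes), clamped so a small graph goes entirely into the first buffer.

```rust
// Sketch of the dual-buffer split: the first max(64, n/10) dispatches go
// into command buffer 0 (committed immediately), the rest into buffer 1.
fn split_point(n_nodes: usize) -> usize {
    let n0 = std::cmp::max(64, n_nodes / 10);
    n0.min(n_nodes) // never exceed the graph size
}

fn main() {
    assert_eq!(split_point(40), 40);    // small graph: everything in buffer 0
    assert_eq!(split_point(500), 64);   // 10% of 500 is 50, below the 64 floor
    assert_eq!(split_point(1000), 100); // 10% of a large graph
    println!("{} {} {}", split_point(40), split_point(500), split_point(1000));
}
```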
pub fn fuse(
    &mut self,
    registry: &mut KernelRegistry,
    device: &DeviceRef,
) -> Result<u32>
Run the RMS norm + MUL fusion pass over the graph.
Scans for the pattern:
Dispatch(RmsNorm) → Barrier(s) → Dispatch(ElemMul)
where the MUL reads the norm’s output buffer, and replaces the
sequence with a single fused rms_norm_mul_* dispatch.
The fused dispatch:
- Reads the norm’s input (buffer 0) and weight (buffer 1)
- Reads the MUL’s second operand as the scale (buffer 2)
- Writes to the MUL’s output (buffer 3)
- Carries the norm’s params (buffer 4)
- Uses the norm’s threadgroup config and shared memory
Returns the number of fusions applied.
Arguments

- registry - Kernel registry for compiling the fused pipeline.
- device - Metal device for pipeline compilation.
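The pattern scan can be sketched with simplified stand-in types (Node here is a hypothetical reduction of CapturedNode, tracking only the buffer indices that matter for the match): find an RmsNorm dispatch followed, possibly across barrier sentinels, by an ElemMul that reads the norm's output, and splice the window down to one fused node.

```rust
// Hypothetical sketch of the RMS norm + MUL fusion scan. Real captured
// nodes carry full bindings and threadgroup config; here a single buffer
// id per edge is enough to show the pattern match.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    RmsNorm { out_buf: u32 },
    ElemMul { in_buf: u32, out_buf: u32 },
    FusedRmsNormMul { out_buf: u32 },
    Barrier,
}

fn fuse(nodes: &mut Vec<Node>) -> u32 {
    let mut fused = 0;
    let mut i = 0;
    while i < nodes.len() {
        if let Node::RmsNorm { out_buf } = nodes[i] {
            // Skip barrier sentinels between the norm and the mul.
            let mut j = i + 1;
            while j < nodes.len() && nodes[j] == Node::Barrier { j += 1; }
            if let Some(Node::ElemMul { in_buf, out_buf: mul_out }) = nodes.get(j).cloned() {
                if in_buf == out_buf {
                    // Replace the whole window with a single fused dispatch.
                    nodes.splice(i..=j, [Node::FusedRmsNormMul { out_buf: mul_out }]);
                    fused += 1;
                }
            }
        }
        i += 1;
    }
    fused
}

fn main() {
    let mut g = vec![
        Node::RmsNorm { out_buf: 7 },
        Node::Barrier,
        Node::ElemMul { in_buf: 7, out_buf: 9 },
    ];
    assert_eq!(fuse(&mut g), 1);
    assert_eq!(g, vec![Node::FusedRmsNormMul { out_buf: 9 }]);
    println!("fusions: 1, nodes left: {}", g.len());
}
```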
pub fn reorder(&mut self) -> u32
Run the reorder pass over the graph to improve GPU concurrency.
Port of llama.cpp’s ggml_metal_graph_optimize_reorder — a greedy
64-node lookahead that pulls independent dispatches forward to fill
larger concurrent groups between barriers.
Prerequisites: Call fuse() first if desired. The reorder pass
operates on the post-fusion graph. Barrier sentinel nodes are stripped
before reordering (they will be recomputed at encode time by the
ConflictTracker in encode_with_barriers).
Algorithm (matching llama.cpp exactly):

1. Strip all CapturedNode::Barrier nodes.
2. For each unprocessed node i0:
   - If it conflicts with the current concurrent group (mrs0):
     - Initialize mrs1 from i0's ranges (the skipped-over set)
     - Look ahead up to 64 nodes for candidates that:
       (a) are reorderable (CapturedOpKind::is_reorderable())
       (b) don't conflict with mrs0 (the current group)
       (c) don't conflict with mrs1 (the skipped-over nodes)
     - Pull qualifying candidates into the current group
     - Non-reorderable ops break the lookahead
     - Reset mrs0 (new concurrent group)
   - Add i0 to the new group
Returns the number of nodes that were moved to earlier positions.
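The steps above can be sketched with a deliberately reduced model: each node touches exactly one buffer id, two nodes conflict when they touch the same buffer, and every op is treated as reorderable (so the non-reorderable break is omitted). All names here (reorder, mrs0, mrs1, LOOKAHEAD) follow the description, not the crate's actual code.

```rust
// Heavily simplified sketch of the greedy 64-node lookahead reorder.
// A node is just the buffer id it touches; conflict = same buffer.
const LOOKAHEAD: usize = 64;

fn reorder(nodes: &mut Vec<u32>) -> u32 {
    let mut out: Vec<u32> = Vec::with_capacity(nodes.len());
    let mut used = vec![false; nodes.len()];
    let mut mrs0: Vec<u32> = Vec::new(); // buffers touched by current group
    let mut moved = 0;
    for i0 in 0..nodes.len() {
        if used[i0] { continue; }
        if mrs0.contains(&nodes[i0]) {
            // Conflict: before closing the group, pull forward independents.
            let mut mrs1 = vec![nodes[i0]]; // skipped-over set
            for i1 in i0 + 1..nodes.len().min(i0 + 1 + LOOKAHEAD) {
                if used[i1] { continue; }
                let b = nodes[i1];
                if !mrs0.contains(&b) && !mrs1.contains(&b) {
                    out.push(b);      // joins the current group early
                    used[i1] = true;
                    mrs0.push(b);
                    moved += 1;
                } else {
                    mrs1.push(b);     // must stay behind this point
                }
            }
            mrs0.clear();             // new concurrent group starts here
        }
        out.push(nodes[i0]);
        mrs0.push(nodes[i0]);
    }
    *nodes = out;
    moved
}

fn main() {
    // [A, A, B]: the second A conflicts, so B is pulled forward -> [A, B, A].
    let mut g = vec![0, 0, 1];
    let moved = reorder(&mut g);
    assert_eq!(g, vec![0, 1, 0]);
    assert_eq!(moved, 1);
    println!("moved {} node(s): {:?}", moved, g);
}
```

Pulling B in front of the second A lets A and B share one concurrent group between barriers, which is exactly the larger-group payoff the pass is after.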