Cross-grid synchronization: kernel-split fallback for backends
that lack a native cooperative-launch grid barrier. Splits a
Program at every Node::Barrier { ordering: GridSync } and
dispatches the segments in sequence - the kernel-launch boundary
itself is the grid-level fence.
Grid-sync kernel splitting.
Op id: vyre-driver::grid_sync. Soundness: Exact over the
cross-grid barrier contract.
§Why this lives in vyre-driver, not the backend
Every backend that lacks a native cooperative whole-grid launch
needs the same kernel-split semantics for
Node::Barrier { ordering: GridSync }: split the program at the
barrier, dispatch each segment as its own kernel launch, and
re-feed the prior segment’s outputs as inputs to the next. The
kernel-launch boundary itself is the grid-level fence - every
prior write becomes globally visible before the next launch reads.
Backends route through crate::grid_sync::dispatch_with_grid_sync_split when
VyreBackend::supports_grid_sync is false and the program
contains any Node::Barrier { ordering: GridSync }. Backends that
return true emit one kernel and satisfy the barrier device-side.
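The routing rule above can be sketched as a small decision function. This is an illustration only: the Backend trait is a hypothetical stand-in for VyreBackend, and Route, Native, and Fallback are invented names, not part of the crate.

```rust
/// Hypothetical stand-in for the backend capability query; the real
/// trait in the crate is `VyreBackend` and has more methods than this.
trait Backend {
    fn supports_grid_sync(&self) -> bool;
}

/// Invented for this sketch: which dispatch path a caller would take.
#[derive(Debug, PartialEq)]
enum Route {
    /// One kernel; the barrier is satisfied device-side.
    SingleKernel,
    /// Split at each grid barrier and launch the segments in sequence.
    HostSplit,
}

/// The split path is only needed when the program actually contains a
/// grid barrier AND the backend cannot satisfy it natively.
fn choose_route(backend: &dyn Backend, program_has_grid_sync: bool) -> Route {
    if program_has_grid_sync && !backend.supports_grid_sync() {
        Route::HostSplit
    } else {
        Route::SingleKernel
    }
}

/// Toy backend with a native cooperative launch.
struct Native;
/// Toy backend without one.
struct Fallback;

impl Backend for Native {
    fn supports_grid_sync(&self) -> bool {
        true
    }
}

impl Backend for Fallback {
    fn supports_grid_sync(&self) -> bool {
        false
    }
}
```

Under this sketch, only the combination of a barrier-bearing program and a backend that reports false takes the split path; every other combination stays on a single launch.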
§Algorithm
- Walk the program’s top-level entry sequence.
- Each prefix-suffix split at a Node::Barrier { GridSync } becomes one segment.
- For each segment, build a Program with a segment-local buffer table: buffers read or written by that segment plus passthrough read-write buffers that must preserve caller-visible storage.
- Dispatch segments in order, threading live buffers by buffer name rather than positional output slot. Segment read-only inputs are assembled from the caller’s original bytes or prior segment outputs; final host-visible output slots are reassembled in the original program’s output declaration order.
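The steps above can be modelled on the host with a toy IR. This is a minimal sketch under stated assumptions, not the crate's implementation: Node here is a two-variant stand-in (the real barrier is Node::Barrier { ordering: GridSync }), buffers are plain byte vectors, and split_toy, run_segment, and dispatch_split_toy are hypothetical names. The toy keeps empty segments and omits the segment-local buffer table; whether the real splitter keeps empty segments is not stated here.

```rust
use std::collections::BTreeMap;

/// Toy IR node; an assumption made for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// Toy op: add `delta` (wrapping) to every byte of the named buffer.
    AddToBuffer { buffer: &'static str, delta: u8 },
    /// Stand-in for `Node::Barrier { ordering: GridSync }`.
    GridSync,
}

/// Live buffers, keyed by buffer name rather than by output slot.
type Buffers = BTreeMap<String, Vec<u8>>;

/// Split the top-level entry sequence at every grid barrier. The barrier
/// node is dropped: the boundary between segments takes its place.
fn split_toy(entry: &[Node]) -> Vec<Vec<Node>> {
    let mut segments: Vec<Vec<Node>> = vec![Vec::new()];
    for node in entry {
        match node {
            Node::GridSync => segments.push(Vec::new()),
            other => segments
                .last_mut()
                .expect("segments always holds at least one entry")
                .push(other.clone()),
        }
    }
    segments
}

/// "Launch" one segment. Buffers the segment does not name pass through
/// untouched, so later segments still see them.
fn run_segment(segment: &[Node], live: &mut Buffers) {
    for node in segment {
        if let Node::AddToBuffer { buffer, delta } = node {
            if let Some(bytes) = live.get_mut(*buffer) {
                for b in bytes.iter_mut() {
                    *b = b.wrapping_add(*delta);
                }
            }
        }
    }
}

/// Run the segments in order, threading buffers by name, then reassemble
/// the outputs in the caller's declared output order.
fn dispatch_split_toy(entry: &[Node], inputs: Buffers, output_order: &[&str]) -> Vec<Vec<u8>> {
    let mut live = inputs;
    for segment in split_toy(entry) {
        run_segment(&segment, &mut live);
    }
    output_order
        .iter()
        .map(|name| live.get(*name).cloned().unwrap_or_default())
        .collect()
}
```

Keying the live set by name is what lets the final step emit outputs in declaration order even when the segments touch the buffers in a different order.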
§Device-resident variant
dispatch_with_grid_sync_split_into round-trips every live buffer
host↔device between each segment and on every fixpoint pass. For a fused
multi-rule program whose shared output accumulator is hundreds of MiB and
which splits into hundreds of segments, that transfer, not launch
latency, dominates wall time. dispatch_resident_grid_sync_fixpoint_into
is the device-resident counterpart: it uploads inputs into backend-resident
resources once, keeps them bound across every segment and fixpoint pass (so
the accumulator threads in place on-device, since resident dispatch never
clears a bound buffer between launches), and reads back only the final
outputs. It requires VyreBackend::supports_resident_dispatch; callers
route to it on resident-capable backends and to the host split otherwise.
Both paths are recall- and proof-identical (proven by a host/resident
differential gate); the choice is purely a host↔device-traffic optimization.
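A back-of-the-envelope model shows why the transfer dominates. The functions below are hypothetical and deliberately crude: they assume every live byte crosses the bus once in each direction per segment per fixpoint pass on the host-split path, and that the resident path moves inputs up once and outputs down once. Real traffic depends on which buffers are live in each segment.

```rust
/// Bytes per MiB, used to keep the example figures readable.
const MIB: u64 = 1024 * 1024;

/// Host-split model (an assumption of this sketch): every live byte is
/// uploaded and read back once per segment, on every fixpoint pass.
fn host_split_traffic_bytes(live_bytes: u64, segments: u64, passes: u64) -> u64 {
    2 * live_bytes * segments * passes
}

/// Resident model (an assumption of this sketch): inputs go up once and
/// only the final outputs come back.
fn resident_traffic_bytes(input_bytes: u64, output_bytes: u64) -> u64 {
    input_bytes + output_bytes
}

/// How many times more bytes the host-split model moves than the
/// resident model, rounded down.
fn traffic_ratio(live_bytes: u64, segments: u64, passes: u64) -> u64 {
    host_split_traffic_bytes(live_bytes, segments, passes)
        / resident_traffic_bytes(live_bytes, live_bytes)
}
```

With an illustrative 256 MiB accumulator, 200 segments, and 3 fixpoint passes, the host-split model moves 300 GiB against 512 MiB for the resident model, a factor of 600 (segments times passes). These figures are made up to show the scaling, not measurements.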
§Soundness
- Atomicity preserved: every atomic_or that fired in segment N has flushed to global memory by the time segment N+1 launches - backend launch APIs issue an implicit grid-level fence at submission boundaries.
- Ordering preserved: the original program’s host-visible output is byte-identical to the un-split version, modulo timing.
- No re-validation surprise: each split segment validates against the same backend supported-ops set as the original.
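The atomicity argument has a direct host-side analogue, sketched below with standard-library threads rather than GPU kernels. Each call to the hypothetical run_or_segment plays the role of one kernel launch; std::thread::scope joins every worker before it returns, which stands in for the launch boundary. This is an analogy for the reasoning, not a model of any backend's memory semantics.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// One toy "segment": each worker ORs its own bit into the shared word.
/// `thread::scope` joins all workers before returning, so every
/// `fetch_or` has completed and is visible once this function returns.
fn run_or_segment(word: &AtomicU32, bits: &[u32]) {
    thread::scope(|s| {
        for &bit in bits {
            s.spawn(move || {
                // Relaxed suffices here: the join at the end of the scope
                // provides the happens-before edge to the caller.
                word.fetch_or(1u32 << bit, Ordering::Relaxed);
            });
        }
    });
}

/// Run two segments back to back and report what the boundary between
/// them observed, plus the final value.
fn run_two_segments() -> (u32, u32) {
    let word = AtomicU32::new(0);

    // Segment N: four workers set bits 0..=3.
    run_or_segment(&word, &[0, 1, 2, 3]);
    // At the boundary, all of segment N's ORs are already visible.
    let seen_at_boundary = word.load(Ordering::Relaxed);

    // Segment N+1: two more workers set bits 4 and 5.
    run_or_segment(&word, &[4, 5]);
    (seen_at_boundary, word.load(Ordering::Relaxed))
}
```

Because the join completes before the next segment starts, the value read at the boundary always contains all four bits from the first segment, regardless of how the workers were scheduled.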
Functions§
- contains_grid_sync - Whether program contains any Node::Barrier { ordering: GridSync } in its dispatch-level entry sequence (peeled past any synthetic outer Region).
- dispatch_resident_grid_sync_fixpoint_into - Device-resident counterpart of dispatch_with_grid_sync_split_into.
- dispatch_resident_with_grid_sync_split_timed - Resident-resource variant of dispatch_with_grid_sync_split_timed.
- dispatch_with_grid_sync_split - Universal dispatch helper that satisfies Node::Barrier { ordering: GridSync } on any backend by splitting at the barrier and running each segment as its own kernel launch.
- dispatch_with_grid_sync_split_into - Variant of dispatch_with_grid_sync_split that writes final outputs into caller-owned storage.
- dispatch_with_grid_sync_split_timed - Timed variant of dispatch_with_grid_sync_split.
- plan_host_grid_sync_segment_programs - Diagnostics: the host-split segment programs (post buffer-rewrite) that the fallback dispatch path (dispatch_with_grid_sync_split*) validates and launches when the backend lacks native grid-sync. Exposed so tooling and tests can inspect or validate each segment without a live backend. The raw try_split_on_grid_sync output omits the per-segment buffer access/role rewrite, so it is not what the backend actually sees.
- split_on_grid_sync - Split program at every top-level Node::Barrier { GridSync }.
- try_split_on_grid_sync - Fallible variant of split_on_grid_sync for production dispatch paths.