Module persistent

G7: persistent-thread engine + host-side, device-visible work queue. Eliminates per-file kernel-launch overhead for streams of many small scan jobs.

§What this is

A single long-lived GPU dispatch owns a chunk of the device. Host workers push PersistentWorkItems into a device-visible ring buffer by advancing an atomic head counter; the device’s persistent threads poll for new items and claim them by advancing a tail counter. The host waits on per-item completion markers to gather results.
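The head/tail flow described above can be sketched on the host alone. This is a minimal illustration, not this module's implementation: `SketchRing`, `SketchFull`, `try_push`, and `try_pop` are hypothetical names, the payload is a bare `u64` standing in for a PersistentWorkItem, and the consumer runs on the CPU rather than in a GPU kernel. It also swaps in a per-slot sequence number (the well-known bounded-queue design by Dmitry Vyukov) to publish slots, which may differ from the claim protocol this module actually uses.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical stand-in for the module's `QueueFull` error.
#[derive(Debug, PartialEq)]
pub struct SketchFull;

struct Slot {
    seq: AtomicUsize,   // per-slot sequence number (assumption of this sketch)
    payload: AtomicU64, // stand-in for a work-item descriptor
}

/// Hypothetical bounded ring with an atomic head (producers) and tail (consumers).
pub struct SketchRing {
    slots: Vec<Slot>,
    mask: usize,
    head: AtomicUsize,
    tail: AtomicUsize,
}

impl SketchRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        let slots = (0..capacity)
            .map(|i| Slot { seq: AtomicUsize::new(i), payload: AtomicU64::new(0) })
            .collect();
        SketchRing { slots, mask: capacity - 1, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    /// Producer side: claim a slot with a CAS on `head`, then publish it.
    pub fn try_push(&self, item: u64) -> Result<(), SketchFull> {
        let mut pos = self.head.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let diff = slot.seq.load(Ordering::Acquire).wrapping_sub(pos) as isize;
            if diff == 0 {
                match self.head.compare_exchange_weak(
                    pos, pos.wrapping_add(1), Ordering::AcqRel, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        slot.payload.store(item, Ordering::Relaxed);
                        // Release store makes the payload visible to the consumer.
                        slot.seq.store(pos.wrapping_add(1), Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return Err(SketchFull); // slot not yet consumed: ring is full
            } else {
                pos = self.head.load(Ordering::Relaxed);
            }
        }
    }

    /// Consumer side: claim the next published slot with a CAS on `tail`.
    pub fn try_pop(&self) -> Option<u64> {
        let mut pos = self.tail.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let diff = slot.seq.load(Ordering::Acquire).wrapping_sub(pos.wrapping_add(1)) as isize;
            if diff == 0 {
                match self.tail.compare_exchange_weak(
                    pos, pos.wrapping_add(1), Ordering::AcqRel, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        let item = slot.payload.load(Ordering::Relaxed);
                        // Hand the slot back to producers for the next lap.
                        slot.seq.store(pos.wrapping_add(self.mask + 1), Ordering::Release);
                        return Some(item);
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return None; // nothing published yet
            } else {
                pos = self.tail.load(Ordering::Relaxed);
            }
        }
    }
}
```

In this sketch a full ring is reported to the caller instead of blocking, so a host worker can decide whether to retry, back off, or drop the job.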

Eliminates the per-file kernel-launch cost (~5–20 µs on today’s drivers) so a stream of 10 000 × 1 KiB scan jobs pays launch overhead once, not 10 000 times.
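As a rough worked example of that saving, the arithmetic can be written out directly. The function names below are hypothetical, and the model counts launch overhead only; it ignores queue polling, transfer, and scan time.

```rust
/// Launch overhead in microseconds when every job pays for its own kernel launch.
pub fn per_job_launch_overhead_us(jobs: u64, per_launch_us: u64) -> u64 {
    jobs * per_launch_us
}

/// Launch overhead in microseconds when one persistent dispatch serves all jobs.
pub fn persistent_launch_overhead_us(per_launch_us: u64) -> u64 {
    per_launch_us
}

/// Overhead avoided by launching once instead of once per job.
pub fn launch_overhead_saved_us(jobs: u64, per_launch_us: u64) -> u64 {
    per_job_launch_overhead_us(jobs, per_launch_us) - persistent_launch_overhead_us(per_launch_us)
}
```

For 10 000 jobs at 5 to 20 µs per launch, per-job launching costs 50 000 to 200 000 µs (50 to 200 ms) in launch overhead alone, against 5 to 20 µs for a single persistent dispatch.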

§Scope of this file

This module owns the host-side ring buffer: the atomic head/tail pair, the lock-free claim protocol, and exhaustive tests. The actual persistent GPU kernel that consumes the queue lives behind the persistent cargo feature and talks to the owning backend’s native queue API. The host queue is proven correct in isolation, so device integration only has to worry about the kernel side.

§Memory ordering

  • Producers use AcqRel on the head CAS; writes to the slot before the CAS happen-before the head increment.
  • Consumers use AcqRel on the tail CAS; after observing the incremented head, they see the producer’s slot writes.
  • A Release fence on the producer after the slot write and an Acquire fence on the consumer before the slot read guarantee visibility across the weakest memory models we need to support (x86, ARM, RISC-V GPU consumers).
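The fence pairing in the last bullet can be illustrated with two host threads. This is a sketch under stated assumptions, not the module's code: `fence_handoff` is a hypothetical name, a single `AtomicU64` stands in for a ring slot, an `AtomicBool` stands in for the head counter, and the consumer is a CPU thread rather than a GPU kernel.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Passes `value` from a producer thread to a consumer thread using
/// relaxed accesses ordered only by a Release/Acquire fence pair.
pub fn fence_handoff(value: u64) -> u64 {
    let slot = AtomicU64::new(0); // stand-in for one ring slot
    let ready = AtomicBool::new(false); // stand-in for the published head

    thread::scope(|s| {
        // Producer: write the slot, fence, then publish.
        s.spawn(|| {
            slot.store(value, Ordering::Relaxed);
            fence(Ordering::Release); // slot write is ordered before the publish
            ready.store(true, Ordering::Relaxed);
        });

        // Consumer: observe the publish, fence, then read the slot.
        let consumer = s.spawn(|| {
            while !ready.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            fence(Ordering::Acquire); // pairs with the producer's Release fence
            slot.load(Ordering::Relaxed)
        });

        consumer.join().unwrap()
    })
}
```

Because the consumer's relaxed load observes the store that follows the producer's Release fence, and the Acquire fence follows that load, the slot read is guaranteed to return the value the producer wrote.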

Structs§

PersistentEngine
Persistent-engine handle. Owns the host-side view of the ring buffer. The GPU kernel is a separate concern gated behind the persistent cargo feature.
PersistentWorkItem
One scan-unit descriptor.
QueueFull
Enqueue attempted but the ring is full.
RingAtomics
Shared atomics between host producers and device consumers.

Enums§

PersistentThreadMode
Caller-controlled persistent-thread dispatch policy.