Module persistent

G7: persistent-thread engine + host-side work queue. Eliminates per-file kernel-launch overhead for streams of many small scan jobs.

§What this is

A single long-lived GPU dispatch owns a chunk of the device. Host workers push PersistentWorkItems into a device-visible ring buffer by advancing an atomic head counter; the device’s persistent threads watch for the head to advance and claim items by advancing a tail counter. The host waits on per-item completion markers to gather results.
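The head/tail claim idea can be sketched on the host alone. The block below is not this module's implementation: it swaps in the well-known bounded queue design with a per-slot sequence number (Dmitry Vyukov's MPMC queue), so its orderings differ from the ones listed under Memory ordering. SketchRing, SketchFull, and the u64 payload standing in for a PersistentWorkItem are all hypothetical names chosen for illustration.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical stand-in for the module's "ring is full" error.
#[derive(Debug, PartialEq, Eq)]
struct SketchFull;

/// One ring slot: a sequence number plus a u64 payload (stand-in for a work item).
struct Slot {
    seq: AtomicUsize,
    value: AtomicU64,
}

/// Illustrative bounded ring; not the real PersistentEngine.
struct SketchRing {
    slots: Vec<Slot>,
    mask: usize,
    head: AtomicUsize, // next position a producer will claim
    tail: AtomicUsize, // next position a consumer will claim
}

impl SketchRing {
    fn new(capacity: usize) -> Self {
        assert!(capacity >= 2 && capacity.is_power_of_two());
        let slots = (0..capacity)
            .map(|i| Slot { seq: AtomicUsize::new(i), value: AtomicU64::new(0) })
            .collect();
        SketchRing {
            slots,
            mask: capacity - 1,
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Claim a slot with a CAS on `head`, write the payload, then publish it.
    fn try_push(&self, value: u64) -> Result<(), SketchFull> {
        let mut pos = self.head.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            let diff = seq.wrapping_sub(pos) as isize;
            if diff == 0 {
                // Slot is free for this lap: try to claim position `pos`.
                match self.head.compare_exchange_weak(
                    pos, pos.wrapping_add(1), Ordering::Relaxed, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        slot.value.store(value, Ordering::Relaxed);
                        // Release publishes the payload to whoever acquires `seq`.
                        slot.seq.store(pos.wrapping_add(1), Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return Err(SketchFull); // consumer has not freed this slot yet
            } else {
                pos = self.head.load(Ordering::Relaxed); // another producer got ahead
            }
        }
    }

    /// Claim a published slot with a CAS on `tail`, read it, then free it.
    fn try_pop(&self) -> Option<u64> {
        let mut pos = self.tail.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            let diff = seq.wrapping_sub(pos.wrapping_add(1)) as isize;
            if diff == 0 {
                match self.tail.compare_exchange_weak(
                    pos, pos.wrapping_add(1), Ordering::Relaxed, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        let value = slot.value.load(Ordering::Relaxed);
                        // Mark the slot free for the producers' next lap.
                        slot.seq.store(pos.wrapping_add(self.mask + 1), Ordering::Release);
                        return Some(value);
                    }
                    Err(current) => pos = current,
                }
            } else if diff < 0 {
                return None; // nothing published at this position yet
            } else {
                pos = self.tail.load(Ordering::Relaxed); // another consumer got ahead
            }
        }
    }
}
```

With a capacity of 4, a fifth try_push fails with SketchFull until a try_pop frees a slot, and items come back in the order they were pushed. The per-slot sequence number is what lets a producer claim a position first and fill it afterwards without a consumer ever reading a half-written slot.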

Eliminates the per-file kernel-launch cost (~5–20 µs on today’s drivers) so a stream of 10 000 × 1 KiB scan jobs pays launch overhead once, not 10 000 times.
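For scale, the figures above work out to 50–200 ms of pure launch overhead for 10 000 per-file launches, against a single 5–20 µs launch for the persistent dispatch. The sketch below just reproduces that arithmetic; the function names are hypothetical, and it assumes the cost of polling the queue is negligible next to a launch.

```rust
use std::time::Duration;

/// Total launch overhead when every job gets its own kernel launch.
fn per_job_launch_overhead(jobs: u32, per_launch: Duration) -> Duration {
    per_launch * jobs
}

/// Total launch overhead when one persistent dispatch serves every job.
/// Assumption: queue polling cost is ignored.
fn persistent_launch_overhead(per_launch: Duration) -> Duration {
    per_launch
}
```

At 5 µs per launch, 10 000 jobs cost 50 ms of overhead; at 20 µs they cost 200 ms. The persistent path pays the 5–20 µs once.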

§Scope of this file

This module owns the host-side ring buffer: the atomic head/tail pair, the lock-free claim protocol, and exhaustive tests. The actual persistent GPU kernel that consumes the queue lives behind the persistent cargo feature and talks to the owning backend’s native queue API. The host queue is tested in isolation so that device integration only has to worry about the kernel side.

§Memory ordering

Producers use AcqRel on the head CAS; writes to the slot that precede the CAS happen-before the head increment.
Consumers use AcqRel on the tail CAS; after observing the incremented head, they see the producer’s slot writes.
A Release fence on the producer after the slot write and an Acquire fence on the consumer before reading the slot guarantee visibility across the weakest memory models we need to support (x86, ARM, RISC-V GPU consumers).
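The fence pairing in the last rule can be illustrated with two host threads. This is a minimal sketch, not the module's code: Mailbox and publish_and_read are hypothetical names, a single u64 stands in for a slot, a boolean flag stands in for the head counter, and the consumer is an ordinary CPU thread rather than a GPU kernel.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical one-slot mailbox: `payload` plays the slot, `ready` plays the head counter.
struct Mailbox {
    payload: AtomicU64,
    ready: AtomicBool,
}

/// Publish `value` from one thread and read it from another using fence pairing.
fn publish_and_read(value: u64) -> u64 {
    let mailbox = Arc::new(Mailbox {
        payload: AtomicU64::new(0),
        ready: AtomicBool::new(false),
    });

    let producer = {
        let mb = Arc::clone(&mailbox);
        thread::spawn(move || {
            mb.payload.store(value, Ordering::Relaxed); // slot write
            fence(Ordering::Release); // orders the slot write before the flag store
            mb.ready.store(true, Ordering::Relaxed); // "head increment"
        })
    };

    let consumer = {
        let mb = Arc::clone(&mailbox);
        thread::spawn(move || {
            // Spin until the producer's flag store becomes visible.
            while !mb.ready.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            fence(Ordering::Acquire); // orders the flag load before the slot read
            mb.payload.load(Ordering::Relaxed)
        })
    };

    producer.join().unwrap();
    consumer.join().unwrap()
}
```

Because the Release fence precedes the flag store and the Acquire fence follows the flag load that observed it, the consumer is guaranteed to read the published payload rather than the initial zero, even though every individual atomic access is Relaxed.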

Structs§

PersistentEngine: Persistent-engine handle. Owns the host-side view of the ring buffer. The GPU kernel is a separate concern gated behind the persistent cargo feature.
PersistentWorkItem: One scan-unit descriptor.
QueueFull: Enqueue attempted but the ring is full.
RingAtomics: Shared atomics between host producers and device consumers.

Enums§

PersistentThreadMode: Caller-controlled persistent-thread dispatch policy.