G7: persistent-thread engine + host-side work queue. Eliminates per-file kernel-launch overhead for streams of many small scan jobs.
§What this is
A single long-lived GPU dispatch owns a chunk of the device.
Host workers push PersistentWorkItems into a device-visible ring buffer
via an atomic head counter; the device’s persistent threads
poll a tail counter and pick up items. The host waits on
per-item completion markers to gather results.
This eliminates the per-file kernel-launch cost (~5–20 µs on today’s drivers), so a stream of 10 000 × 1 KiB scan jobs pays launch overhead once, not 10 000 times. At those figures, the per-file approach would spend roughly 50–200 ms on launches alone.
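The push/claim flow described above can be sketched as a bounded lock-free ring. This is a minimal illustration, not this module's actual code: it uses per-slot sequence stamps in the style of Dmitry Vyukov's bounded MPMC queue, which may differ from the claim protocol implemented here. The names Ring, Slot and Full are hypothetical, the payload is reduced to a single u64 instead of a full PersistentWorkItem, and both producer and consumer run on the host.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical stand-in for the crate's QueueFull error.
#[derive(Debug, PartialEq, Eq)]
pub struct Full;

struct Slot {
    /// Stamp: `pos` means free for the producer holding ticket `pos`;
    /// `pos + 1` means filled and ready for the consumer of ticket `pos`.
    seq: AtomicUsize,
    /// Assumed payload; a real work-item descriptor would be wider.
    item: AtomicU64,
}

pub struct Ring {
    slots: Vec<Slot>,
    mask: usize,
    head: AtomicUsize, // next ticket a producer may claim
    tail: AtomicUsize, // next ticket a consumer may claim
}

impl Ring {
    pub fn new(capacity: usize) -> Self {
        // Capacity 1 would make "filled" and "free for the next lap" collide.
        assert!(capacity >= 2 && capacity.is_power_of_two());
        let slots: Vec<Slot> = (0..capacity)
            .map(|i| Slot { seq: AtomicUsize::new(i), item: AtomicU64::new(0) })
            .collect();
        Ring { slots, mask: capacity - 1, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    /// Producer side: claim a ticket with a CAS on `head`, then publish the slot.
    pub fn push(&self, item: u64) -> Result<(), Full> {
        let mut pos = self.head.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos {
                match self.head.compare_exchange_weak(
                    pos, pos.wrapping_add(1), Ordering::AcqRel, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        slot.item.store(item, Ordering::Relaxed);
                        // Release: the payload write is visible before the stamp.
                        slot.seq.store(pos.wrapping_add(1), Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current,
                }
            } else if (seq.wrapping_sub(pos) as isize) < 0 {
                return Err(Full); // slot still holds an unconsumed item
            } else {
                pos = self.head.load(Ordering::Relaxed); // lost a race, retry
            }
        }
    }

    /// Consumer side: claim a ticket with a CAS on `tail`, then read the slot.
    pub fn pop(&self) -> Option<u64> {
        let mut pos = self.tail.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            let ready = pos.wrapping_add(1);
            if seq == ready {
                match self.tail.compare_exchange_weak(
                    pos, ready, Ordering::AcqRel, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        let item = slot.item.load(Ordering::Relaxed);
                        // Hand the slot back to producers for the next lap.
                        slot.seq.store(pos.wrapping_add(self.mask + 1), Ordering::Release);
                        return Some(item);
                    }
                    Err(current) => pos = current,
                }
            } else if (seq.wrapping_sub(ready) as isize) < 0 {
                return None; // nothing published at this ticket yet
            } else {
                pos = self.tail.load(Ordering::Relaxed); // lost a race, retry
            }
        }
    }
}
```

In this sketch a producer would call push and back off on Err(Full), while a polling consumer would loop on pop. The per-slot stamp is a design choice of the sketch: it lets a consumer tell "claimed but not yet written" apart from "ready", which a bare head/tail pair cannot express on its own.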
§Scope of this file
This module owns the host-side ring buffer: the atomic
head/tail pair, the lock-free claim protocol, and exhaustive
tests. The actual persistent GPU kernel that consumes the queue
lives behind the persistent cargo feature and talks to the owning
backend’s native queue API. The host queue is proven correct in isolation
so device integration only worries about the kernel side.
§Memory ordering
- Producers: AcqRel on the head CAS; writes to the slot before the CAS happen-before the head increment.
- Consumers: AcqRel on the tail CAS; after observing the incremented head, they see the producer’s slot writes.
- A Release fence on the producer after the slot write and an Acquire fence on the consumer before reading the slot guarantee visibility across the weakest memory models we need to support (x86, ARM, RISC-V GPU consumers).
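The fence pairing in the last bullet can be illustrated with a single slot. This is a hedged sketch under stated assumptions: Mailbox, publish, consume and roundtrip are hypothetical names, both sides run as host threads, and a real device consumer would need the equivalent device-side fence rather than std::sync::atomic::fence. Every access is Relaxed, so the two fences alone carry the happens-before edge.

```rust
use std::sync::atomic::{fence, AtomicU64, AtomicUsize, Ordering};
use std::thread;

/// Hypothetical one-slot mailbox: `slot` is the payload, `head` the publish flag.
pub struct Mailbox {
    slot: AtomicU64,
    head: AtomicUsize,
}

impl Mailbox {
    pub fn new() -> Self {
        Mailbox { slot: AtomicU64::new(0), head: AtomicUsize::new(0) }
    }

    /// Producer: write the slot, fence, then bump the head.
    pub fn publish(&self, value: u64) {
        self.slot.store(value, Ordering::Relaxed);
        // Release fence: the slot write above cannot be reordered
        // after the head store below.
        fence(Ordering::Release);
        self.head.store(1, Ordering::Relaxed);
    }

    /// Consumer: wait for the head, fence, then read the slot.
    pub fn consume(&self) -> u64 {
        while self.head.load(Ordering::Relaxed) == 0 {
            std::hint::spin_loop();
        }
        // Acquire fence: pairs with the producer's Release fence, so the
        // slot read below observes the value written before publication.
        fence(Ordering::Acquire);
        self.slot.load(Ordering::Relaxed)
    }
}

/// Runs one producer/consumer exchange and returns what the consumer saw.
pub fn roundtrip(value: u64) -> u64 {
    let mailbox = Mailbox::new();
    thread::scope(|s| {
        let consumer = s.spawn(|| mailbox.consume());
        mailbox.publish(value);
        consumer.join().unwrap()
    })
}
```

Under these assumptions, roundtrip(v) returns v for any v: once the consumer's Relaxed load observes the head store, the two fences synchronize and the slot write is guaranteed to be visible.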
Structs§
- PersistentEngine: Persistent-engine handle. Owns the host-side view of the ring buffer. The GPU kernel is a separate concern gated behind the persistent cargo feature.
- PersistentWorkItem: One scan-unit descriptor.
- QueueFull: Enqueue attempted but the ring is full.
- RingAtomics: Shared atomics between host producers and device consumers.
Enums§
- PersistentThreadMode: Caller-controlled persistent-thread dispatch policy.