Module scheduling


Dynamic Actor Scheduling — Work Stealing Protocol

Provides load balancing for persistent GPU actors via a work stealing protocol. Without dynamic scheduling, each actor (thread block) processes only its own message queue. If one actor’s workload spikes while neighbors are idle, the busy actor becomes a bottleneck.

§Scheduler Warp Pattern

Within each thread block of the persistent kernel:

Warp 0: Scheduler warp — monitors queue depth, steals work from overloaded neighbors, redistributes messages
Warps 1-N: Compute warps — process messages from the local work queue

┌─── Block (Actor) ───────────────────────────────────┐
│ Warp 0 [SCHEDULER]                                   │
│ ├─ Monitor local queue depth                         │
│ ├─ If depth < steal_threshold:                       │
│ │   └─ Steal from busiest neighbor via K2K           │
│ ├─ If depth > share_threshold:                       │
│ │   └─ Offer work to least-busy neighbor             │
│ └─ Update load metrics in shared memory              │
│                                                      │
│ Warps 1-7 [COMPUTE]                                  │
│ ├─ Dequeue message from local work queue             │
│ ├─ Process message (user handler)                    │
│ └─ Enqueue response to output queue                  │
└──────────────────────────────────────────────────────┘
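The scheduler warp's per-pass decision can be sketched on the host as a pure function. This is a minimal illustration, not the crate's implementation: `Action`, `decide`, and the `(block_id, depth)` neighbor tuples are hypothetical names, and the extra requirement that a steal victim exceed `share_threshold` is taken from the Work Stealing Protocol section rather than from the diagram.

```rust
/// Hypothetical action chosen by the scheduler warp for one monitoring pass.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Local queue is nearly empty: pull work from this neighbor block.
    StealFrom(usize),
    /// Local queue is overloaded: push work to this neighbor block.
    OfferTo(usize),
    /// Nothing to do on this pass.
    Idle,
}

/// Chooses an action from the local depth and the neighbors' published depths.
/// `neighbors` holds `(block_id, queue_depth)` pairs (an assumed representation).
pub fn decide(
    local_depth: u32,
    neighbors: &[(usize, u32)],
    steal_threshold: u32,
    share_threshold: u32,
) -> Action {
    if local_depth < steal_threshold {
        // Starved: look at the busiest neighbor and steal only if it is overloaded.
        if let Some(&(id, depth)) = neighbors.iter().max_by_key(|n| n.1) {
            if depth > share_threshold {
                return Action::StealFrom(id);
            }
        }
    } else if local_depth > share_threshold {
        // Overloaded: offer work to the least-busy neighbor, if there is one.
        if let Some(&(id, _)) = neighbors.iter().min_by_key(|n| n.1) {
            return Action::OfferTo(id);
        }
    }
    Action::Idle
}
```

Keeping the decision free of side effects makes it easy to test on the CPU before the same logic is emitted into the persistent kernel.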

§Work Stealing Protocol

1. Each block publishes its queue depth to a shared load table (global or DSMEM).
2. The scheduler warp compares the local depth with neighbor depths.
3. If the local depth < steal_threshold and a neighbor has depth > share_threshold:
   a. The scheduler warp atomically reserves N messages from the neighbor’s queue.
   b. Messages are copied via a K2K channel (DSMEM within a cluster, global memory across clusters).
   c. Both blocks update their queue depths.
4. Grid sync (or cluster sync) ensures load table consistency.
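The atomic reservation step can be modeled on the host with a compare-and-swap loop over the victim's published depth. This is a sketch under stated assumptions: `reserve` and its `floor` parameter are hypothetical, `std::sync::atomic` stands in for the device-side atomics, and the choice to never drain a victim below `floor` (for example, `share_threshold`) is an illustrative policy rather than documented behavior.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Tries to reserve up to `want` messages from a neighbor's published queue depth.
/// Returns the number of messages actually reserved (possibly 0).
/// The neighbor's depth is never reduced below `floor` (assumed policy).
pub fn reserve(neighbor_depth: &AtomicU32, want: u32, floor: u32) -> u32 {
    let mut current = neighbor_depth.load(Ordering::Acquire);
    loop {
        // Only the portion above the floor is eligible for stealing.
        let available = current.saturating_sub(floor);
        let take = want.min(available);
        if take == 0 {
            return 0;
        }
        // Publish the reduced depth only if no other thief changed it meanwhile.
        match neighbor_depth.compare_exchange_weak(
            current,
            current - take,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return take,
            // Lost the race (or spurious failure): retry with the observed value.
            Err(observed) => current = observed,
        }
    }
}
```

Because the reservation succeeds only when the depth is unchanged since it was read, two scheduler warps targeting the same victim cannot both claim the same messages; the loser recomputes its share from the updated depth.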

§Load Table Layout (in mapped/global memory)

load_table[block_id] = {
    queue_depth: u32,    // Current input queue depth
    capacity: u32,       // Queue capacity
    messages_processed: u64,  // Throughput indicator
    steal_requests: u32, // Pending steal requests
    offer_count: u32,    // Messages offered for stealing
}
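For a host-side view of this layout, the entry could be mirrored as a `#[repr(C)]` struct so that field order and offsets are predictable when the table lives in mapped memory. This is a hedged sketch: `LoadEntrySketch` and `utilization` are hypothetical names chosen to avoid implying anything about the crate's actual `LoadEntry` definition, which may differ (for example, by using atomic fields or padding).

```rust
use std::mem::size_of;

/// Hypothetical host-side mirror of one load table entry.
/// `#[repr(C)]` keeps the fields in declaration order with C layout rules.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default)]
pub struct LoadEntrySketch {
    /// Current input queue depth.
    pub queue_depth: u32,
    /// Queue capacity.
    pub capacity: u32,
    /// Throughput indicator.
    pub messages_processed: u64,
    /// Pending steal requests.
    pub steal_requests: u32,
    /// Messages offered for stealing.
    pub offer_count: u32,
}

impl LoadEntrySketch {
    /// Fraction of the queue currently in use, in the range 0.0..=1.0
    /// for well-formed entries; returns 0.0 when capacity is zero.
    pub fn utilization(&self) -> f32 {
        if self.capacity == 0 {
            0.0
        } else {
            self.queue_depth as f32 / self.capacity as f32
        }
    }
}

/// Size of one entry in bytes under the assumed layout.
pub fn entry_size() -> usize {
    size_of::<LoadEntrySketch>()
}
```

With the two `u32` fields placed ahead of the `u64`, the 64-bit counter starts at offset 8 and the struct needs no interior padding, so each entry occupies 24 bytes.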

Structs§

LoadEntry: Per-actor load entry in the shared load table.
LoadTable: The load table containing entries for all actors.
SchedulerConfig: Configuration for dynamic actor scheduling.
SchedulerWarpConfig: Configuration for the scheduler warp pattern in CUDA codegen.
StealOp: A single work-stealing operation.
WorkItem: Work item for the scheduler.

Enums§

SchedulingStrategy: Scheduling strategy for persistent actors.
