Module scheduling

Dynamic Actor Scheduling — Work Stealing Protocol

Provides load balancing for persistent GPU actors via a work stealing protocol. Without dynamic scheduling, each actor (thread block) processes only its own message queue. If one actor’s workload spikes while neighbors are idle, the busy actor becomes a bottleneck.

§Scheduler Warp Pattern

Within each thread block of the persistent kernel:

  • Warp 0: Scheduler warp — monitors queue depth, steals work from overloaded neighbors, redistributes messages
  • Warps 1-N: Compute warps — process messages from the local work queue
┌─── Block (Actor) ───────────────────────────────────┐
│ Warp 0 [SCHEDULER]                                   │
│ ├─ Monitor local queue depth                         │
│ ├─ If depth < steal_threshold:                       │
│ │   └─ Steal from busiest neighbor via K2K           │
│ ├─ If depth > share_threshold:                       │
│ │   └─ Offer work to least-busy neighbor             │
│ └─ Update load metrics in shared memory              │
│                                                      │
│ Warps 1-7 [COMPUTE]                                  │
│ ├─ Dequeue message from local work queue             │
│ ├─ Process message (user handler)                    │
│ └─ Enqueue response to output queue                  │
└──────────────────────────────────────────────────────┘
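The scheduler warp's branch logic in the diagram can be sketched on the host side as a pure function. This is only an illustration, not this module's API: `SchedulerAction` and `decide` are hypothetical names, the `(block_id, queue_depth)` neighbor slice is an assumed input shape, and the extra check that a steal victim sits above `share_threshold` is borrowed from the protocol description rather than from the diagram.

```rust
/// Hypothetical outcome of one scheduler-warp polling pass.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchedulerAction {
    /// Local queue is running dry: pull work from this neighbor.
    StealFrom(usize),
    /// Local queue is overloaded: offer work to this neighbor.
    OfferTo(usize),
    /// Queue depth is within bounds, or no suitable neighbor exists.
    Idle,
}

/// Decide one scheduling step from the local queue depth and the
/// neighbors' depths. `neighbors` holds `(block_id, queue_depth)` pairs.
/// Assumes `steal_threshold <= share_threshold`.
pub fn decide(
    local_depth: u32,
    neighbors: &[(usize, u32)],
    steal_threshold: u32,
    share_threshold: u32,
) -> SchedulerAction {
    if local_depth < steal_threshold {
        // Pick the busiest neighbor, but only steal if it is overloaded.
        if let Some(&(id, depth)) = neighbors.iter().max_by_key(|&&(_, d)| d) {
            if depth > share_threshold {
                return SchedulerAction::StealFrom(id);
            }
        }
    } else if local_depth > share_threshold {
        // Pick the least-busy neighbor, but only offer if it has less work.
        if let Some(&(id, depth)) = neighbors.iter().min_by_key(|&&(_, d)| d) {
            if depth < local_depth {
                return SchedulerAction::OfferTo(id);
            }
        }
    }
    SchedulerAction::Idle
}
```

With `steal_threshold = 4` and `share_threshold = 16`, a block at depth 1 whose neighbors sit at depths 3 and 20 would steal from the neighbor at depth 20, while a block at depth 8 would stay idle.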

§Work Stealing Protocol

  1. Each block publishes its queue depth to a shared load table (in global memory or DSMEM, i.e. distributed shared memory)
  2. Scheduler warp compares local depth with neighbor depths
  3. If local depth < steal_threshold and a neighbor has depth > share_threshold:
     a. Scheduler warp atomically reserves N messages from neighbor’s queue
     b. Messages are copied via K2K channel (DSMEM for cluster, global for cross-cluster)
     c. Both blocks update their queue depths
  4. Grid sync (or cluster sync) ensures load table consistency
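The atomic reservation in step 3a can be modeled on the CPU with a compare-and-swap loop over a depth counter. This is a minimal sketch under stated assumptions, not the crate's implementation: `reserve` and `steal` are hypothetical helpers, the `keep` floor (how many messages the victim retains) is an assumed parameter, and the actual message copy over the K2K channel is represented only by a comment.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Try to reserve up to `want` messages from a victim's depth counter
/// without taking the victim below `keep`. Returns the number reserved.
pub fn reserve(victim_depth: &AtomicU32, want: u32, keep: u32) -> u32 {
    let mut current = victim_depth.load(Ordering::Acquire);
    loop {
        // Only the surplus above `keep` is available to thieves.
        let take = want.min(current.saturating_sub(keep));
        if take == 0 {
            return 0;
        }
        match victim_depth.compare_exchange_weak(
            current,
            current - take,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return take,
            // Another thief or the owner changed the depth: retry.
            Err(observed) => current = observed,
        }
    }
}

/// Reserve from the victim, then credit the thief's depth (step 3c).
pub fn steal(victim: &AtomicU32, thief: &AtomicU32, want: u32, keep: u32) -> u32 {
    let taken = reserve(victim, want, keep);
    // Step 3b, copying the messages over the K2K channel, would go here.
    thief.fetch_add(taken, Ordering::AcqRel);
    taken
}
```

Reserving before copying means two thieves racing for the same victim can never claim the same messages: whichever compare-and-swap loses simply retries against the updated depth.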

§Load Table Layout (in mapped/global memory)

load_table[block_id] = {
    queue_depth: u32,    // Current input queue depth
    capacity: u32,       // Queue capacity
    messages_processed: u64,  // Throughput indicator
    steal_requests: u32, // Pending steal requests
    offer_count: u32,    // Messages offered to neighbors for stealing
}
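As an illustration of how the layout above might be mirrored on the host, the following sketch declares a `#[repr(C)]` struct with the same fields in the same order. `LoadEntrySketch`, `utilization`, and `busiest` are hypothetical names chosen for this example; the crate's actual `LoadEntry` and `LoadTable` types may be defined differently.

```rust
/// Hypothetical host-side mirror of one load table entry.
/// `#[repr(C)]` fixes the field order so host and device agree on layout.
#[repr(C)]
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct LoadEntrySketch {
    pub queue_depth: u32,        // Current input queue depth
    pub capacity: u32,           // Queue capacity
    pub messages_processed: u64, // Throughput indicator
    pub steal_requests: u32,     // Pending steal requests
    pub offer_count: u32,        // Messages offered for stealing
}

impl LoadEntrySketch {
    /// Fraction of the queue capacity currently in use.
    pub fn utilization(&self) -> f32 {
        if self.capacity == 0 {
            0.0
        } else {
            self.queue_depth as f32 / self.capacity as f32
        }
    }
}

/// Index of the entry with the deepest queue, or `None` for an empty table.
pub fn busiest(table: &[LoadEntrySketch]) -> Option<usize> {
    table
        .iter()
        .enumerate()
        .max_by_key(|(_, entry)| entry.queue_depth)
        .map(|(index, _)| index)
}
```

With this field order the two leading `u32` values fill the first 8 bytes, so the `u64` is naturally aligned and the struct needs no padding: each entry occupies 24 bytes.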

Structs§

LoadEntry
Per-actor load entry in the shared load table.
LoadTable
The load table containing entries for all actors.
SchedulerConfig
Configuration for dynamic actor scheduling.
SchedulerWarpConfig
Configuration for the scheduler warp pattern in CUDA codegen.
StealOp
A single work-stealing operation.
WorkItem
Work item for the scheduler.

Enums§

SchedulingStrategy
Scheduling strategy for persistent actors.