Expand description
Persistent megakernel - the vyre Program that runs forever on the GPU, decoding host-fed ring opcodes from a host-fed ring buffer. Persistent megakernel - the GPU becomes a VIR0 bytecode interpreter.
One dispatch compiles the program; the kernel loops forever, pulling packed bytecode slots from a host-fed ring buffer and executing each. The host never re-dispatches - it only writes new slots and observes atomic counters in the control buffer.
§Layout
protocol- ring-buffer slot layout, control words, opcodes.handlers- built-in opcode handlers + extension mechanism.builder- IRProgramconstruction (interpreted + JIT).execution- compiled persistent-kernel handle and dispatch path.resident- host mirrors for GPU-resident runtime buffers.readback- strict output-buffer decoding after dispatch.recovery- device-loss classification and pipeline rebuild.
§Coordination protocol
- Read
control[SHUTDOWN]; if non-zero,Node::Return. - Read this slot’s
status; skip idle slots without tenant metadata loads. - If PUBLISHED, read
tenant_id; authorize via tenant-mask table. - CAS
ring_buffer[status]from PUBLISHED → CLAIMED. - Dispatch on opcode through If-tree (or JIT fused body).
atomic_add(control[DONE_COUNT], 1).- Store DONE into the status word.
Re-exports§
pub use builder::build_program;pub use builder::build_program_jit;pub use builder::build_program_jit_slots;pub use builder::build_program_priority;pub use builder::build_program_priority_slots;pub use builder::build_program_sharded;pub use builder::build_program_sharded_no_io;pub use builder::build_program_sharded_once_slots;pub use builder::build_program_sharded_slots;pub use builder::build_program_sharded_with_io_polling;pub use builder::build_program_sharded_with_workspace_adapter;pub use builder::build_program_with_self_loading_miss_handler;pub use builder::persistent_body;pub use builder::persistent_body_jit;pub use builder::persistent_body_priority;pub use builder::persistent_body_priority_slots;pub use descriptor::BatchDescriptor;pub use descriptor::BuiltinOpcode;pub use descriptor::PackedOpDescriptor;pub use descriptor::SlotDescriptor;pub use descriptor::SlotOpcode;pub use descriptor::WindowClass;pub use descriptor::WindowDescriptor;pub use execution::Megakernel;pub use execution::MegakernelDispatchOutput;pub use execution::MegakernelDispatchStats;pub use execution::MegakernelResidentHandles;pub use handlers::OpcodeHandler;pub use io::IoCompletion;pub use io::IoRequest;pub use io::MegakernelIoQueue;pub use io::IO_SLOT_COUNT;pub use io::IO_SLOT_WORDS;pub use planner::build_bellman_tn_order_program;pub use planner::build_kfac_autotune_step_program;pub use planner::build_persistent_fixpoint_program;pub use planner::build_scallop_lineage_with_scratch;pub use planner::build_scallop_provenance_wide_program;pub use planner::build_sinkhorn_clustering_program;pub use planner::build_sinkhorn_full_clustering_program;pub use planner::build_scallop_lineage_with_program_and_scratch;pub use planner::default_worker_groups_from_limits;pub use planner::dispatch_grid_for;pub use planner::padded_slot_count;pub use planner::plan_compact_fusion_into;pub use planner::prune_redundant_work_items_into;pub use planner::prune_redundant_work_items_with_scratch_into;pub use planner::select_fused_subset;pub use planner::select_fused_subset_compact;pub use planner::select_fused_subset_compact_into;pub use planner::select_fused_subset_into;pub use planner::select_fused_subset_with_rate;pub use planner::select_optimal_fused_subset;pub use planner::try_detect_cross_arm_redundancy;pub use planner::try_prune_redundant_work_items_into;pub use planner::try_prune_redundant_work_items_with_scratch_into;pub use planner::worker_workgroup_size;pub use planner::CompactFusionPlanningScratch;pub use planner::CrossArmRedundancy;pub use planner::FusionSelectionScratch;pub use planner::MegakernelCaps;pub use planner::MegakernelConfig;pub use planner::MegakernelGridLimits;pub use planner::MegakernelGridPlan;pub use planner::MegakernelGridRequest;pub use planner::MegakernelLaunchGeometry;pub use planner::MegakernelReport;pub use planner::MegakernelSizingPolicy;pub use planner::MegakernelTelemetry;pub use planner::MegakernelWorkItem;pub use planner::MegakernelWorkloadHints;pub use planner::RedundantWorkItemPruneScratch;pub use policy::diffuse_priority_across_siblings;pub use policy::diffuse_priority_across_siblings_into;pub use policy::MegakernelDispatchTopology;pub use policy::MegakernelExecutionMode;pub use policy::MegakernelLaunchCacheStats;pub use policy::MegakernelLaunchPolicy;pub use policy::MegakernelLaunchRecommendation;pub use policy::MegakernelLaunchRequest;pub use policy::MegakernelQueuePressure;pub use policy::PriorityRequeueAccounting;pub use protocol::control;pub use protocol::control_byte_len;pub use protocol::count_done_ring_slots;pub use protocol::debug;pub use protocol::debug_log_byte_len;pub use protocol::encode_control;pub use protocol::encode_empty_debug_log;pub use protocol::encode_empty_ring;pub use protocol::opcode;pub use protocol::read_debug_log;pub use protocol::read_done_count;pub use protocol::read_epoch;pub use protocol::read_metrics;pub use protocol::read_observable;pub use protocol::ring_byte_len;pub use protocol::slot;pub use protocol::try_count_done_ring_slots;pub use protocol::try_encode_control;pub use protocol::try_encode_control_into;pub use protocol::try_encode_empty_debug_log;pub use protocol::try_encode_empty_debug_log_into;pub use protocol::try_encode_empty_ring;pub use protocol::try_encode_empty_ring_into;pub use protocol::try_read_debug_log;pub use protocol::try_read_done_count;pub use protocol::try_read_epoch;pub use protocol::try_read_metrics;pub use protocol::try_read_observable;pub use protocol::DebugRecord;pub use protocol::ProtocolError;pub use protocol::ARG0_WORD;pub use protocol::ARGS_PER_SLOT;pub use protocol::CONTROL_MIN_WORDS;pub use protocol::OPCODE_WORD;pub use protocol::PRIORITY_WORD;pub use protocol::SLOT_WORDS;pub use protocol::STATUS_WORD;pub use protocol::TENANT_WORD;pub use readback::MegakernelReadback;pub use readback::MegakernelReadbackCounters;pub use recovery::backend_error_indicates_device_loss;pub use recovery::MegakernelRecoveryDecision;pub use recovery::MegakernelRecoveryPolicy;pub use resident::MegakernelResidentBuffers;pub use resident::MegakernelResidentDispatchScratch;pub use rule_catalog::BatchRuleProgram;pub use rule_catalog::BatchRuleRejection;pub use scheduler::default_priority_offsets;pub use scheduler::priority_partition_active_lane_count;pub use scheduler::priority_partition_probe_budget;pub use scheduler::priority_partition_probe_count;pub use scheduler::priority_scan_body;pub use scheduler::priority_scan_body_with_stride;pub use scheduler::try_default_priority_offsets;pub use scheduler::write_default_priority_offsets;pub use speculation::PairedSpeculationSample;pub use speculation::PairedSpeculationUpdate;pub use speculation::PairedSpeculationWindow;pub use task::TaskPriority;pub use task::TaskQueueSnapshot;pub use task::TaskState;pub use task::TaskWorkItem;pub use telemetry::ControlSnapshot;pub use telemetry::CountMinSketch;pub use telemetry::MegakernelRuntimeCounters;pub use telemetry::RingOccupancy;pub use telemetry::RingSlotSnapshot;pub use telemetry::RingStatus;pub use telemetry::RingTelemetry;pub use telemetry::SketchTelemetry;pub use telemetry::WindowTelemetry;pub use workspace_adapter::MegakernelWorkspaceAdapter;pub use workspace_layout::build_workspace_regions;pub use workspace_layout::first_workspace_region;pub use workspace_layout::next_record_workspace_region;pub use workspace_layout::next_workspace_region;pub use workspace_layout::workspace_record_words;pub use workspace_layout::MegakernelWorkspaceLayoutError;pub use workspace_layout::MegakernelWorkspaceRegion;pub use workspace_layout::MegakernelWorkspaceRegionSpec;
Modules§
- advanced
- Advanced Tier-2 LEGO building blocks for the persistent megakernel.
- builder
- IR program builders - construct the megakernel
Programfrom vyre IR. - descriptor
- Typed host-side descriptors for publishing work into the megakernel ring.
- execution
- Compiled persistent-megakernel handle and dispatch path.
- handlers
- Built-in opcode handler bodies - STORE_U32, ATOMIC_ADD, PRINTF, SHUTDOWN.
- io
- IO subsystem - GPU↔runtime DMA request queue for the persistent megakernel.
- ir_util
- Shared IR fragments used by megakernel builders and schedulers.
- planner
- Megakernel planning, fusion selection, sizing, and work-item contracts.
- policy
- Resident megakernel launch policy and queue-pressure decisions.
- protocol
- Ring-buffer protocol constants - slot layout, control words, opcodes, debug log.
- readback
- Typed host readback view for persistent megakernel outputs.
- recovery
- Device-loss classification and persistent-pipeline rebuild policy.
- resident
- Host mirrors for megakernel GPU-resident runtime buffers.
- ring
- Ring producer / consumer traits for the megakernel host protocol.
- rule_
catalog - DFA rule catalog packing for batched megakernel dispatch.
- scaling
- Occupancy-aware grid scaling for megakernels.
- scheduler
- Work scheduler - priority-aware slot scanning for the persistent megakernel.
- speculation
- Runtime-side paired speculation races for megakernel dispatch.
- task
- Resident task queue ABI for pause, resume, requeue, and priority aging.
- telemetry
- Host-side telemetry decoders for the megakernel ring and control buffers.
- workspace_
adapter - Consumer-neutral resident workspace adapter for megakernel programs.
- workspace_
layout - Generic resident workspace region layout for megakernel adapters.
Traits§
- Megakernel
Dispatch - Backend-neutral megakernel dispatch contract.