Struct MegakernelLaunchPolicy

pub struct MegakernelLaunchPolicy {
    pub sizing: MegakernelSizingPolicy,
    pub min_hit_capacity: u32,
    pub hit_capacity_multiplier: u32,
    pub saturated_waves: u32,
    pub hot_opcode_threshold: u32,
    pub hot_window_threshold: u32,
    pub jit_queue_len_threshold: u32,
    pub priority_age_threshold: u32,
    pub sparse_frontier_threshold_bps: u16,
    pub dense_frontier_threshold_bps: u16,
    pub memory_pressure_threshold_bps: u16,
    pub fusion_edge_threshold: u32,
    pub scratch_bytes_per_hit: u32,
}

Single policy surface for megakernel launch sizing and telemetry-driven routing.

Fields§

§sizing: MegakernelSizingPolicy

Sizing policy for worker counts and grid geometry.

§min_hit_capacity: u32

Minimum capacity for sparse-hit results.

§hit_capacity_multiplier: u32

Multiplier applied to the expected hit count to determine hit capacity.

§saturated_waves: u32

Number of waves that define a saturated queue.

§hot_opcode_threshold: u32

Threshold for promoting hot opcodes to JIT.

§hot_window_threshold: u32

Threshold for promoting hot windows to JIT.

§jit_queue_len_threshold: u32

Queue length threshold to prefer JIT over interpreter.

§priority_age_threshold: u32

Priority age threshold to trigger aging promotions.

§sparse_frontier_threshold_bps: u16

Frontier density at or below this value uses sparse expansion.

§dense_frontier_threshold_bps: u16

Frontier density at or above this value uses dense propagation.

§memory_pressure_threshold_bps: u16

Memory pressure at or above this value uses the memory-constrained path.

§fusion_edge_threshold: u32

Minimum graph edge count before dense hot work is eligible for fusion.

§scratch_bytes_per_hit: u32

Conservative resident scratch bytes needed per sparse-hit entry.
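
The two frontier thresholds above can be read as a three-way split on frontier density. The following is a minimal sketch, not the crate's implementation. It assumes that the `_bps` suffix means basis points (0 to 10,000, where 10,000 is 100%) and that densities between the two thresholds fall into some intermediate route. `FrontierRoute`, `density_bps`, and `classify_frontier` are hypothetical names introduced only for illustration.

```rust
/// Hypothetical route labels; the crate's real topology type is not shown on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FrontierRoute {
    Sparse,
    Dense,
    /// Assumed middle case for densities strictly between the two thresholds.
    Balanced,
}

/// Convert an active/total count into basis points in the range 0..=10_000.
/// Assumption: `bps` means basis points.
fn density_bps(active: u64, total: u64) -> u16 {
    if total == 0 {
        return 0;
    }
    // Clamp so that `active > total` still yields a valid value; widen to avoid overflow.
    let bps = (active.min(total) as u128 * 10_000) / total as u128;
    bps as u16
}

/// Apply the "at or below" / "at or above" wording from the field docs.
fn classify_frontier(density: u16, sparse_bps: u16, dense_bps: u16) -> FrontierRoute {
    if density <= sparse_bps {
        FrontierRoute::Sparse
    } else if density >= dense_bps {
        FrontierRoute::Dense
    } else {
        FrontierRoute::Balanced
    }
}
```

The comparisons are inclusive on both ends to match the "at or below" and "at or above" wording; the threshold values passed in would come from the policy fields.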

Implementations§

impl MegakernelLaunchPolicy

pub const fn standard() -> Self

Standard launch policy used by VYRE megakernel dispatchers.

pub fn launch_cache_stats() -> MegakernelLaunchCacheStats

Return launch recommendation cache telemetry for the current thread.

pub fn reset_launch_cache_for_thread()

Clear launch recommendation cache entries and counters for this thread.

Recommend geometry, hit capacity, and interpreter/JIT route.

§Errors

Returns BackendError when required adapter limits are zero or derived launch values cannot fit the u32 ring protocol.

pub fn recommend_with_topology_evidence( &self, request: MegakernelLaunchRequest, ) -> Result<(MegakernelLaunchRecommendation, MegakernelTopologyEvidence), BackendError>

Recommend a launch and emit topology evidence for parity benches.

§Errors

Returns BackendError when the underlying recommendation cannot be built from the request or adapter limits.

pub fn recommend_with_promotion_evidence( &self, request: MegakernelLaunchRequest, ) -> Result<(MegakernelLaunchRecommendation, MegakernelPromotionEvidence), BackendError>

Recommend a launch and emit hot opcode/window promotion evidence.

§Errors

Returns BackendError when the underlying recommendation cannot be built from the request or adapter limits.

pub fn recommend_with_previous_topology( &self, request: MegakernelLaunchRequest, previous_topology: MegakernelDispatchTopology, ) -> Result<MegakernelLaunchRecommendation, BackendError>

Recommend a launch while preserving the previous topology inside a narrow hysteresis band.

CUDA resident graphs and long-running dataflow streams should use this entry point when they can track the last successful topology. It prevents borderline frontier-density or memory-pressure telemetry from repeatedly switching kernel variants, invalidating launch plans, and disturbing cache locality at scale.

§Errors

Returns BackendError when required adapter limits are zero or derived launch values cannot fit the u32 ring protocol.
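
The hysteresis idea described for recommend_with_previous_topology can be illustrated with a small standalone sketch. This is an assumption-laden illustration, not the crate's logic: the `Topology` enum, the `choose_with_hysteresis` function, and the `band_bps` parameter are all hypothetical, and the width of the real band is not documented here.

```rust
/// Hypothetical two-variant topology, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Topology {
    Sparse,
    Dense,
}

/// Choose a topology from a density reading, keeping `previous` while the
/// reading stays within `band_bps` of `threshold` on either side.
fn choose_with_hysteresis(
    density: u16,
    threshold: u16,
    band_bps: u16,
    previous: Topology,
) -> Topology {
    // Saturating arithmetic keeps the band edges inside the u16 range.
    let lower = threshold.saturating_sub(band_bps);
    let upper = threshold.saturating_add(band_bps);
    if density < lower {
        Topology::Sparse
    } else if density > upper {
        Topology::Dense
    } else {
        // Borderline reading: do not switch kernel variants.
        previous
    }
}
```

With a threshold of 5,000 and a band of 200, readings that wobble between 4,800 and 5,200 leave the previous choice in place, which is the behavior the description credits with avoiding repeated variant switches.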

pub fn autotune_hit_capacity_multiplier( &self, candidate_multipliers: &[u32], costs: &[f64], ) -> u32

Select the best hit_capacity_multiplier from a candidate set.

candidate_multipliers are the multipliers to try; costs[i] is the observed dispatch latency (or any other metric to be minimized) when candidate_multipliers[i] was used. The multiplier with the lowest observed cost is selected.

Returns the chosen multiplier. If candidate_multipliers is empty, returns the policy’s existing hit_capacity_multiplier.
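
The selection rule above amounts to an argmin over paired slices with a fallback. A minimal sketch follows, assuming a hypothetical free function `pick_min_cost`. How the real method treats NaN costs or slices of different lengths is not stated on this page, so the handling below (skip NaN, truncate to the shorter slice) is an assumption.

```rust
/// Return the candidate paired with the lowest cost, or `fallback` when no
/// usable pair exists. Hypothetical helper, not part of the documented API.
fn pick_min_cost(candidates: &[u32], costs: &[f64], fallback: u32) -> u32 {
    candidates
        .iter()
        // `zip` stops at the shorter slice, so unpaired entries are ignored.
        .zip(costs.iter())
        // Assumption: NaN costs are skipped so they can never win.
        .filter(|(_, cost)| !cost.is_nan())
        // `total_cmp` gives a total order on f64; ties keep the first candidate.
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(candidate, _)| *candidate)
        .unwrap_or(fallback)
}
```

Passing the policy's current hit_capacity_multiplier as `fallback` reproduces the documented empty-input behavior.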

pub fn autotune_workgroup_size( &self, candidate_sizes: &[u32], costs: &[f64], current_size: u32, ) -> u32

Select the best workgroup size from a candidate set.

candidate_sizes[i] is paired with costs[i] (lower is better). Returns the chosen size or the policy’s sizing.default_workgroup_size_x() fallback.

pub fn natural_gradient_autotune_step( m_inv_sqrt: &[f64], grad: &[f64], n: u32, learning_rate: f64, ) -> Vec<f64>

Compute the next-step parameter delta for a continuous autotune knob using a Fisher-preconditioned natural-gradient step.

m_inv_sqrt: inverse-square-root of the Fisher block (n×n row-major). Passing an identity matrix reduces the natural gradient to plain gradient descent.

grad: plain gradient ∂latency/∂param (length n).

Returns the parameter delta -learning_rate · m_inv_sqrt · grad, a vector of length n.

P-DRIVER-8: every continuous autotune knob (workgroup size, hit-capacity, fixpoint iteration count, …) should follow the natural-gradient direction by default. Fisher-preconditioned descent converges 5-10× faster than plain gradient descent on the elongated-valley latency surfaces typical of GPU autotuning.
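
The documented return value is a scaled matrix-vector product over a row-major n×n matrix. The sketch below shows that arithmetic in isolation. `natural_gradient_delta` is a hypothetical stand-in, it takes `n` as `usize` rather than `u32` for indexing convenience, and it panics on mismatched lengths, which may differ from how the real function validates its inputs.

```rust
/// Compute `-learning_rate * M * grad` for a row-major n×n matrix `M`.
/// Hypothetical helper illustrating the formula, not the crate's function.
fn natural_gradient_delta(
    m_inv_sqrt: &[f64],
    grad: &[f64],
    n: usize,
    learning_rate: f64,
) -> Vec<f64> {
    // Assumption: malformed input is a programming error in this sketch.
    assert_eq!(m_inv_sqrt.len(), n * n, "matrix must be n*n, row-major");
    assert_eq!(grad.len(), n, "gradient must have length n");

    (0..n)
        .map(|row| {
            // Dot product of one matrix row with the gradient.
            let dot: f64 = m_inv_sqrt[row * n..(row + 1) * n]
                .iter()
                .zip(grad)
                .map(|(m, g)| m * g)
                .sum();
            -learning_rate * dot
        })
        .collect()
}
```

With an identity matrix the result is just `-learning_rate * grad`, which matches the note that an identity preconditioner reduces the step to plain gradient descent.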
