Module thunk

Thunks — pre-compiled kernel dispatch with zero per-call overhead.

At compile time, the graph is lowered into a flat Vec<Thunk> where each thunk holds pre-computed arena offsets, dimensions, and kernel type. At runtime, the executor just iterates thunks and calls kernels directly. No match dispatch, no HashMap lookup, no dimension computation.
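As a rough illustration of the lowering described above, the sketch below assigns each node of a tiny graph a slot in a flat f32 arena and emits thunks that carry only resolved offsets and lengths. Every name here (`Node`, the `Add` and `Scale` variants, `lower`, `run`) is hypothetical and not taken from this crate. The sketch also still branches on the thunk variant, so it shows the pre-resolved arguments only, not the dispatch-free execution the module claims.

```rust
/// Hypothetical kernel call with every argument already resolved.
/// Offsets index into a flat f32 arena; lengths are fixed at lowering time.
#[derive(Debug, Clone, PartialEq)]
pub enum Thunk {
    /// arena[out + i] = arena[a + i] + arena[b + i] for i in 0..len
    Add { a: usize, b: usize, out: usize, len: usize },
    /// arena[out + i] = arena[x + i] * factor for i in 0..len
    Scale { x: usize, out: usize, len: usize, factor: f32 },
}

/// Hypothetical graph node: operands are named by node index, not by offset.
pub enum Node {
    Input { len: usize },
    Add { lhs: usize, rhs: usize },
    Scale { src: usize, factor: f32 },
}

/// Give each node a contiguous arena slot and emit one thunk per compute node.
/// Returns the thunks, the per-node arena offsets, and the total arena length.
pub fn lower(graph: &[Node]) -> (Vec<Thunk>, Vec<usize>, usize) {
    let mut offsets: Vec<usize> = Vec::with_capacity(graph.len());
    let mut lens: Vec<usize> = Vec::with_capacity(graph.len());
    let mut thunks = Vec::new();
    let mut next = 0usize;
    for node in graph {
        // Output length is known at lowering time, never at run time.
        let len = match node {
            Node::Input { len } => *len,
            Node::Add { lhs, .. } => lens[*lhs],
            Node::Scale { src, .. } => lens[*src],
        };
        let out = next;
        next += len;
        match node {
            Node::Input { .. } => {} // inputs occupy a slot but need no kernel
            Node::Add { lhs, rhs } => thunks.push(Thunk::Add {
                a: offsets[*lhs],
                b: offsets[*rhs],
                out,
                len,
            }),
            Node::Scale { src, factor } => thunks.push(Thunk::Scale {
                x: offsets[*src],
                out,
                len,
                factor: *factor,
            }),
        }
        offsets.push(out);
        lens.push(len);
    }
    (thunks, offsets, next)
}

/// Run the schedule: no name lookups and no shape computation remain.
pub fn run(thunks: &[Thunk], arena: &mut [f32]) {
    for t in thunks {
        match *t {
            Thunk::Add { a, b, out, len } => {
                for i in 0..len {
                    arena[out + i] = arena[a + i] + arena[b + i];
                }
            }
            Thunk::Scale { x, out, len, factor } => {
                for i in 0..len {
                    arena[out + i] = arena[x + i] * factor;
                }
            }
        }
    }
}
```

A caller would run `lower` once, allocate an arena of the returned length, write the inputs at their offsets, and then call `run` as many times as needed.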

Structs§

ThunkSchedule: Compiled thunk schedule — the runtime hot path. Nop thunks are filtered out at compile time for zero iteration overhead.
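
The Nop filtering mentioned for ThunkSchedule can be pictured with the minimal sketch below. The field layout, the `Fill` variant, and the `new`, `thunk_count`, and `run` methods are assumptions made for illustration; the real struct's fields and constructor are not shown on this page.

```rust
/// Hypothetical thunk set with an explicit no-op variant.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Thunk {
    Nop,
    /// arena[out..out + len] = value
    Fill { out: usize, len: usize, value: f32 },
}

/// Hypothetical schedule that never stores a Nop.
pub struct ThunkSchedule {
    thunks: Vec<Thunk>,
}

impl ThunkSchedule {
    /// Drop Nop thunks once, at build time, so the hot loop never visits them.
    pub fn new(lowered: Vec<Thunk>) -> Self {
        let thunks = lowered
            .into_iter()
            .filter(|t| !matches!(t, Thunk::Nop))
            .collect();
        ThunkSchedule { thunks }
    }

    /// Number of thunks that will actually execute.
    pub fn thunk_count(&self) -> usize {
        self.thunks.len()
    }

    /// Execute every remaining thunk in order on the arena.
    pub fn run(&self, arena: &mut [f32]) {
        for t in &self.thunks {
            if let Thunk::Fill { out, len, value } = *t {
                arena[out..out + len].fill(value);
            }
        }
    }
}
```

Filtering once at build time moves the cost of skipping no-ops out of the per-call loop, which is the trade the description above points to.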

Enums§

Thunk: A pre-compiled kernel call with all args resolved to arena offsets.

Functions§

compile_thunks: Compile graph into thunk schedule.
dequant_matmul_nvfp4
execute_axial_rope2d_f32^⚠: Host axial 2-D RoPE for Metal (and other) fallbacks on unified memory.
execute_compiled: Execute a thunk schedule on a raw arena buffer. Fastest executor: call pre-compiled closures sequentially. Zero match dispatch — each closure is a direct kernel call.
execute_conv_transpose2d_nchw_f32^⚠: Host fallback for NCHW ConvTranspose2d.
execute_cumsum_backward_f32^⚠
execute_dequant_grouped_matmul_gguf_f32^⚠: Host-fallback entry for GGUF Op::DequantGroupedMatMul (MoE expert stack).
execute_dequant_matmul_fp8_f32^⚠: Host-fallback entry for FP8 Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_gguf_f32^⚠: Host-fallback entry for GGUF Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_int4_f32^⚠: Host-fallback entry for Int4 Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_nvfp4_f32^⚠: Host-fallback entry for NVFP4 Op::DequantMatMul (Metal unified memory).
execute_fft1d^⚠: Dtype-dispatching host entry for Op::Fft (shared by GPU host fallbacks).
execute_fft1d_c64^⚠: C64 interleaved layout: each complex element is [re: f32, im: f32].
execute_fft1d_f32^⚠: f32 counterpart of execute_fft1d_f64, with the same public host-fallback role. Same 2N-real-block layout (first N real, then N imag per row) and same unnormalized convention; only the element width differs. Twiddle factors are computed in f64 and cast to f32 to keep large-N error closer to the f64 path (the savings from f32 are in memory bandwidth, not in twiddle precision).
execute_fft1d_f64^⚠: Execute a batched 1D FFT in the f64 2N-real-block layout. Each “row” is 2N f64 elements: first N real, then N imag. The outer rows are independent and processed sequentially.
execute_gated_delta_net_f16^⚠: Host-fallback entry for f16 Op::GatedDeltaNet tensors on Metal.
execute_gated_delta_net_f32^⚠: Host-fallback entry for Op::GatedDeltaNet (Metal / unified memory). When state == 0, uses a zero-initialized scratch state per batch item.
execute_gather_backward_f32^⚠
execute_group_norm_nchw_f32^⚠: Host fallback for NCHW group norm (Metal unified-memory arena).
execute_layer_norm2d_nchw_f32^⚠: Host fallback for NCHW LayerNorm2d (SAM / candle semantics).
execute_resize_nearest_2x_f32^⚠: Host fallback for nearest 2× upsample on NCHW.
execute_rms_norm_backward_beta_f32^⚠
execute_rms_norm_backward_gamma_f32^⚠
execute_rms_norm_backward_input_f32^⚠: Host-fallback: Op::RmsNormBackwardInput (GPU unified-memory / D2H arenas).
execute_rope_backward_f32^⚠
execute_thunks
execute_thunks_active: Active-extent execution stub. The runtime calls this when it has an active-extent hint set. The CPU backend does not implement per-thunk active-extent scaling yet, so it returns false and the caller falls back to the full execute_thunks path.
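
The 2N-real-block layout described for execute_fft1d_f64 above can be made concrete with the sketch below. It uses a naive O(N²) DFT rather than a fast transform, purely to show how the rows are indexed. The function name `dft_rows_f64` and the forward sign convention exp(-2πi·kj/N) are assumptions; the page only states that the transform is unnormalized and that rows are processed sequentially.

```rust
use std::f64::consts::PI;

/// Naive unnormalized DFT over independent rows.
/// Each row holds 2 * n values: re[0..n] followed by im[0..n].
/// Hypothetical helper: a stand-in for a real FFT, not this crate's code.
pub fn dft_rows_f64(data: &mut [f64], n: usize) {
    assert!(n > 0 && data.len() % (2 * n) == 0, "length must be a multiple of 2N");
    let mut out = vec![0.0f64; 2 * n];
    // Rows are independent and handled one after another.
    for row in data.chunks_exact_mut(2 * n) {
        for k in 0..n {
            let mut re = 0.0f64;
            let mut im = 0.0f64;
            for j in 0..n {
                // Twiddle factor exp(-2*pi*i*k*j/n); reduce k*j mod n first
                // to keep the angle small.
                let angle = -2.0 * PI * ((k * j) % n) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                let (x_re, x_im) = (row[j], row[n + j]);
                re += x_re * c - x_im * s;
                im += x_re * s + x_im * c;
            }
            out[k] = re; // real block occupies the first n slots
            out[n + k] = im; // imaginary block occupies the last n slots
        }
        row.copy_from_slice(&out);
    }
}
```

With N = 4, an impulse row transforms to four ones in the real block, and a constant row of ones transforms to a single 4 in the first real slot, which is a quick way to check both the layout and the missing 1/N factor.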
