Module thunk

Thunks — pre-compiled kernel dispatch with zero per-call overhead.

At compile time, the graph is lowered into a flat Vec<Thunk> where each thunk holds pre-computed arena offsets, dimensions, and kernel type. At runtime, the executor just iterates thunks and calls kernels directly. No match dispatch, no HashMap lookup, no dimension computation.
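The lowering described above can be illustrated with a minimal sketch. Everything named here (`Node`, `CompiledThunk`, `lower`, `run`, the layout map) is hypothetical and is not this module's API; the sketch assumes an f32 arena and uses boxed closures, in the spirit of the closure-based executor that `execute_compiled` describes. The point is only that name lookups and size computations happen once, during lowering, and the runtime loop does nothing but call each step.

```rust
use std::collections::HashMap;

/// A compiled step: a closure that already captured its arena offsets and length.
/// (Hypothetical type, not the module's `Thunk`.)
type CompiledThunk = Box<dyn Fn(&mut [f32])>;

/// Hypothetical graph node: tensors are referred to by name before lowering.
enum Node {
    Add { a: &'static str, b: &'static str, out: &'static str },
    Scale { x: &'static str, out: &'static str, factor: f32 },
}

/// Lower nodes into closures. The name -> (offset, len) map is consulted
/// here, once per node, and is not needed again at runtime.
fn lower(nodes: &[Node], layout: &HashMap<&'static str, (usize, usize)>) -> Vec<CompiledThunk> {
    nodes
        .iter()
        .map(|node| -> CompiledThunk {
            match *node {
                Node::Add { a, b, out } => {
                    let (a_off, len) = layout[a];
                    let (b_off, _) = layout[b];
                    let (out_off, _) = layout[out];
                    Box::new(move |arena: &mut [f32]| {
                        for i in 0..len {
                            arena[out_off + i] = arena[a_off + i] + arena[b_off + i];
                        }
                    })
                }
                Node::Scale { x, out, factor } => {
                    let (x_off, len) = layout[x];
                    let (out_off, _) = layout[out];
                    Box::new(move |arena: &mut [f32]| {
                        for i in 0..len {
                            arena[out_off + i] = arena[x_off + i] * factor;
                        }
                    })
                }
            }
        })
        .collect()
}

/// Runtime hot path: iterate the flat schedule and call each step.
fn run(schedule: &[CompiledThunk], arena: &mut [f32]) {
    for thunk in schedule {
        thunk(arena);
    }
}
```

With a layout that places `a`, `b`, `sum`, and `out` at offsets 0, 2, 4, and 6 (two elements each), lowering an `Add` followed by a `Scale` yields a two-entry schedule that `run` executes against the arena in order.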

Structs§

ThunkSchedule
Compiled thunk schedule — the runtime hot path. Nop thunks are filtered out at compile time for zero iteration overhead.

Enums§

Thunk
A pre-compiled kernel call with all args resolved to arena offsets.
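As a rough illustration of what "all args resolved to arena offsets" and compile-time Nop filtering could look like, here is a small sketch. The variant names, fields, and the `SketchThunk`, `SketchSchedule`, and `build_schedule` identifiers are all assumptions made for the example; the real `Thunk` variants and `ThunkSchedule` fields are not shown on this page.

```rust
/// Hypothetical thunk variants. Every operand is a plain offset into the
/// arena plus a length, so nothing has to be looked up or computed at runtime.
#[derive(Debug, Clone, PartialEq)]
enum SketchThunk {
    /// A node that lowered to nothing (for example, a view or reshape).
    Nop,
    Add { a_off: usize, b_off: usize, out_off: usize, len: usize },
    Relu { x_off: usize, out_off: usize, len: usize },
}

/// Hypothetical schedule: holds only thunks that do real work.
struct SketchSchedule {
    thunks: Vec<SketchThunk>,
}

/// Drop Nop entries while building the schedule, so the runtime loop
/// never visits them.
fn build_schedule(lowered: Vec<SketchThunk>) -> SketchSchedule {
    let thunks = lowered
        .into_iter()
        .filter(|t| !matches!(t, SketchThunk::Nop))
        .collect();
    SketchSchedule { thunks }
}
```

Filtering once at build time trades a small amount of compile work for a shorter loop on every execution, which is the trade-off the ThunkSchedule description points at.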

Functions§

compile_thunks
Compile graph into thunk schedule.
dequant_matmul_nvfp4
execute_axial_rope2d_f32
Host axial 2-D RoPE for Metal (and other) fallbacks on unified memory.
execute_compiled
Execute a thunk schedule on a raw arena buffer. Fastest executor: call pre-compiled closures sequentially. Zero match dispatch — each closure is a direct kernel call.
execute_conv_transpose2d_nchw_f32
Host fallback for NCHW ConvTranspose2d.
execute_cumsum_backward_f32
execute_dequant_grouped_matmul_gguf_f32
Host-fallback entry for GGUF Op::DequantGroupedMatMul (MoE expert stack).
execute_dequant_matmul_fp8_f32
Host-fallback entry for FP8 Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_gguf_f32
Host-fallback entry for GGUF Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_int4_f32
Host-fallback entry for Int4 Op::DequantMatMul (Metal unified memory).
execute_dequant_matmul_nvfp4_f32
Host-fallback entry for NVFP4 Op::DequantMatMul (Metal unified memory).
execute_fft1d
Dtype-dispatching host entry for Op::Fft (shared by GPU host fallbacks).
execute_fft1d_c64
C64 interleaved layout: each complex element is [re: f32, im: f32].
execute_fft1d_f32
f32 counterpart of execute_fft1d_f64, with the same public host-fallback role. Same 2N-real-block layout (first N real, second N imag per row), same unnormalized convention; only the element width differs. Twiddle factors are computed in f64 and cast to f32 to keep large-N error closer to the f64 path (the savings from f32 are in memory bandwidth, not in twiddle precision).
execute_fft1d_f64
Execute a batched 1D FFT in the f64 2N-real-block layout. Each “row” is 2N f64 elements: first N real, then N imag. The outer rows are independent and processed sequentially.
execute_gated_delta_net_f16
Host-fallback entry for f16 Op::GatedDeltaNet tensors on Metal.
execute_gated_delta_net_f32
Host-fallback entry for Op::GatedDeltaNet (Metal / unified memory). When state == 0, uses a zero-initialized scratch state per batch item.
execute_gather_backward_f32
execute_group_norm_nchw_f32
Host fallback for NCHW group norm (Metal unified-memory arena).
execute_layer_norm2d_nchw_f32
Host fallback for NCHW LayerNorm2d (SAM / candle semantics).
execute_resize_nearest_2x_f32
Host fallback for nearest 2× upsample on NCHW.
execute_rms_norm_backward_beta_f32
execute_rms_norm_backward_gamma_f32
execute_rms_norm_backward_input_f32
Host-fallback: Op::RmsNormBackwardInput (GPU unified-memory / D2H arenas).
execute_rope_backward_f32
execute_thunks
execute_thunks_active
Active-extent execution stub. The runtime calls this when an active-extent hint is set. The CPU backend doesn’t implement per-thunk active-extent scaling yet, so this returns false and the caller falls back to the full execute_thunks path.
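The 2N-real-block layout used by the execute_fft1d_* entries in the list above can be easier to follow with a concrete sketch. The function below is hypothetical (`dft_rows_f32` is not part of this module) and deliberately swaps in a naive O(N²) DFT for the FFT so the layout stays visible. It assumes the forward sign convention exp(-2πi·jk/N), which this page does not state. It does follow the documented points: each row is N real values followed by N imaginary values, rows are processed sequentially, the result is unnormalized, and twiddle factors are computed in f64 and then cast to f32.

```rust
use std::f64::consts::PI;

/// Unnormalized forward DFT over rows in the 2N-real-block layout:
/// each row holds `2 * n` f32 values, the first `n` real, the next `n` imaginary.
/// Naive O(n^2) transform, for illustration only.
fn dft_rows_f32(data: &mut [f32], n: usize) {
    assert!(n > 0 && data.len() % (2 * n) == 0, "data must hold whole rows of 2n");

    // Twiddle factors: computed in f64, then narrowed to f32.
    let twiddles: Vec<(f32, f32)> = (0..n)
        .map(|k| {
            let angle = -2.0 * PI * (k as f64) / (n as f64);
            (angle.cos() as f32, angle.sin() as f32)
        })
        .collect();

    let mut out_re = vec![0.0f32; n];
    let mut out_im = vec![0.0f32; n];

    // Rows are independent; process them one after another.
    for row in data.chunks_exact_mut(2 * n) {
        let (re, im) = row.split_at_mut(n);
        for k in 0..n {
            let (mut acc_re, mut acc_im) = (0.0f32, 0.0f32);
            for j in 0..n {
                // (re + i*im) * (c + i*s)
                let (c, s) = twiddles[(j * k) % n];
                acc_re += re[j] * c - im[j] * s;
                acc_im += re[j] * s + im[j] * c;
            }
            out_re[k] = acc_re;
            out_im[k] = acc_im;
        }
        re.copy_from_slice(&out_re);
        im.copy_from_slice(&out_im);
    }
}
```

For N = 4, a row holding a unit impulse transforms to four real ones with zero imaginary parts, and a row of four real ones transforms to a single real 4 in the first bin, up to f32 rounding.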