Thunks — pre-compiled kernel dispatch with zero per-call overhead.
At compile time, the graph is lowered into a flat Vec<Thunk> where each
thunk holds pre-computed arena offsets, dimensions, and kernel type.
At runtime, the executor just iterates thunks and calls kernels directly.
No match dispatch, no HashMap lookup, no dimension computation.
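As a rough illustration of the idea above, the following sketch lowers a toy graph into a flat schedule and then runs it over an f32 arena. Everything here is hypothetical: `Node`, the closure-based `Thunk` alias, `add_kernel`, `scale_kernel`, `compile`, `execute`, and `run_example` are invented for this example and are not the crate's API. The crate's own `Thunk` is an enum whose fields are not shown on this page; boxed closures are used here only because the `execute_compiled` description mentions them.

```rust
// Hypothetical sketch: none of these names or signatures come from the crate.

/// A toy graph node. Offsets and lengths index into a flat f32 arena.
enum Node {
    Add { a: usize, b: usize, out: usize, len: usize },
    Scale { src: usize, out: usize, len: usize, factor: f32 },
    Nop,
}

/// A compiled thunk: a closure that already captured its offsets and sizes.
type Thunk = Box<dyn Fn(&mut [f32])>;

fn add_kernel(arena: &mut [f32], a: usize, b: usize, out: usize, len: usize) {
    for i in 0..len {
        arena[out + i] = arena[a + i] + arena[b + i];
    }
}

fn scale_kernel(arena: &mut [f32], src: usize, out: usize, len: usize, factor: f32) {
    for i in 0..len {
        arena[out + i] = arena[src + i] * factor;
    }
}

/// "Compile time": the only place that matches on the node kind.
/// Nop nodes emit nothing, so the executor never sees them.
fn compile(nodes: &[Node]) -> Vec<Thunk> {
    let mut schedule: Vec<Thunk> = Vec::new();
    for node in nodes {
        match *node {
            Node::Add { a, b, out, len } => {
                schedule.push(Box::new(move |arena: &mut [f32]| {
                    add_kernel(arena, a, b, out, len)
                }));
            }
            Node::Scale { src, out, len, factor } => {
                schedule.push(Box::new(move |arena: &mut [f32]| {
                    scale_kernel(arena, src, out, len, factor)
                }));
            }
            Node::Nop => {}
        }
    }
    schedule
}

/// "Runtime": iterate the schedule and call each thunk on the arena.
fn execute(schedule: &[Thunk], arena: &mut [f32]) {
    for thunk in schedule {
        thunk(arena);
    }
}

/// Builds a three-node graph, runs it, and returns the schedule length
/// together with the output region of the arena.
fn run_example() -> (usize, Vec<f32>) {
    // Arena layout: input A at 0..4, input B at 4..8, output at 8..12.
    let mut arena = vec![
        1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 0.0, 0.0, 0.0, 0.0,
    ];
    let nodes = [
        Node::Add { a: 0, b: 4, out: 8, len: 4 },
        Node::Nop,
        Node::Scale { src: 8, out: 8, len: 4, factor: 0.5 },
    ];
    let schedule = compile(&nodes);
    execute(&schedule, &mut arena);
    (schedule.len(), arena[8..12].to_vec())
}
```

Calling `run_example()` yields a schedule of two thunks, because the `Nop` node is dropped during compilation, and an output region holding `(A + B) * 0.5`. In this sketch the cost of the `match` is paid once per node in `compile`, not once per call in `execute`.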
Structs§
- ThunkSchedule - Compiled thunk schedule: the runtime hot path. Nop thunks are filtered out at compile time for zero iteration overhead.
Enums§
- Thunk - A pre-compiled kernel call with all args resolved to arena offsets.
Functions§
- compile_thunks - Compile graph into thunk schedule.
- dequant_matmul_nvfp4
- execute_axial_rope2d_f32 ⚠ - Host axial 2-D RoPE for Metal (and other) fallbacks on unified memory.
- execute_compiled - Execute a thunk schedule on a raw arena buffer. Fastest executor: call pre-compiled closures sequentially. Zero match dispatch: each closure is a direct kernel call.
- execute_conv_transpose2d_nchw_f32 ⚠ - Host fallback for NCHW ConvTranspose2d.
- execute_cumsum_backward_f32 ⚠
- execute_dequant_grouped_matmul_gguf_f32 ⚠ - Host-fallback entry for GGUF Op::DequantGroupedMatMul (MoE expert stack).
- execute_dequant_matmul_fp8_f32 ⚠ - Host-fallback entry for FP8 Op::DequantMatMul (Metal unified memory).
- execute_dequant_matmul_gguf_f32 ⚠ - Host-fallback entry for GGUF Op::DequantMatMul (Metal unified memory).
- execute_dequant_matmul_int4_f32 ⚠ - Host-fallback entry for Int4 Op::DequantMatMul (Metal unified memory).
- execute_dequant_matmul_nvfp4_f32 ⚠ - Host-fallback entry for NVFP4 Op::DequantMatMul (Metal unified memory).
- execute_fft1d ⚠ - Dtype-dispatching host entry for Op::Fft (shared by GPU host fallbacks).
- execute_fft1d_c64 ⚠ - C64 interleaved layout: each complex element is [re: f32, im: f32].
- execute_fft1d_f32 ⚠ - f32 mirror of execute_fft1d_f64. Same public-host-fallback role. Same 2N-real-block layout (first N real, second N imag per row), same unnormalized convention; only the element width differs. Twiddle factors are computed in f64 and cast to f32 to keep large-N error closer to the f64 path (the savings from f32 are in memory bandwidth, not in twiddle precision).
- execute_fft1d_f64 ⚠ - Execute a batched 1D FFT in the f64 2N-real-block layout. Each "row" is 2N f64 elements: first N real, then N imag. The outer rows are independent and processed sequentially.
- execute_gated_delta_net_f16 ⚠ - Host-fallback entry for f16 Op::GatedDeltaNet tensors on Metal.
- execute_gated_delta_net_f32 ⚠ - Host-fallback entry for Op::GatedDeltaNet (Metal / unified memory). When state == 0, uses a zero-initialized scratch state per batch item.
- execute_gather_backward_f32 ⚠
- execute_group_norm_nchw_f32 ⚠ - Host fallback for NCHW group norm (Metal unified-memory arena).
- execute_layer_norm2d_nchw_f32 ⚠ - Host fallback for NCHW LayerNorm2d (SAM / candle semantics).
- execute_resize_nearest_2x_f32 ⚠ - Host fallback for nearest 2× upsample on NCHW.
- execute_rms_norm_backward_beta_f32 ⚠
- execute_rms_norm_backward_gamma_f32 ⚠
- execute_rms_norm_backward_input_f32 ⚠ - Host fallback for Op::RmsNormBackwardInput (GPU unified-memory / D2H arenas).
- execute_rope_backward_f32 ⚠
- execute_thunks
- execute_thunks_active - Active-extent execution stub. The runtime calls this when it has an active-extent hint set. CPU doesn't implement per-thunk active-extent scaling yet, so it returns false and the caller falls back to the full execute_thunks path.
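The two FFT data layouts named in the function list (the 2N-real-block layout and the C64 interleaved layout) can be made concrete with a small sketch. The code below is not the crate's implementation: `dft_block_rows` and `block_to_interleaved` are hypothetical helpers, the transform is a naive O(N²) DFT instead of an FFT, and the forward sign convention exp(-2πi·k·j/N) is an assumption, since this page only states that the result is unnormalized.

```rust
use std::f64::consts::PI;

/// Naive unnormalized DFT over independent rows in the 2N-real-block layout.
/// Each row is 2N values: N real parts followed by N imaginary parts.
/// Assumption: forward transform with exp(-2*pi*i*k*j/N); the sign is not
/// stated in the source documentation.
fn dft_block_rows(data: &[f64], n: usize) -> Vec<f64> {
    assert!(n > 0 && data.len() % (2 * n) == 0);
    let mut out = vec![0.0; data.len()];
    let rows_in = data.chunks_exact(2 * n);
    let rows_out = out.chunks_exact_mut(2 * n);
    // Rows are independent and handled one after another.
    for (row_in, row_out) in rows_in.zip(rows_out) {
        let (re_in, im_in) = row_in.split_at(n);
        let (re_out, im_out) = row_out.split_at_mut(n);
        for k in 0..n {
            let (mut acc_re, mut acc_im) = (0.0, 0.0);
            for j in 0..n {
                // Twiddle factor computed in f64; reduce k*j mod N first.
                let angle = -2.0 * PI * ((k * j) % n) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                // (re + i*im) * (c + i*s)
                acc_re += re_in[j] * c - im_in[j] * s;
                acc_im += re_in[j] * s + im_in[j] * c;
            }
            // No 1/N scaling: the output is unnormalized.
            re_out[k] = acc_re;
            im_out[k] = acc_im;
        }
    }
    out
}

/// Convert one 2N-real-block row into interleaved [re, im] pairs,
/// which is the element order the C64 layout describes.
fn block_to_interleaved(row: &[f32]) -> Vec<f32> {
    assert!(row.len() % 2 == 0);
    let (re, im) = row.split_at(row.len() / 2);
    re.iter().zip(im).flat_map(|(&r, &i)| [r, i]).collect()
}
```

With N = 4, a unit impulse row transforms to four real ones with zero imaginary parts, and a constant row of ones transforms to a single value of 4 at index 0, which reflects the missing 1/N factor. The interleaving helper only reorders elements and does no arithmetic.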