#[non_exhaustive]
pub enum BackendKind {
    Bespoke,
    Cutlass,
    Cublas,
    Cudnn,
    Cufft,
    Cusparse,
    Cusolver,
    Curand,
    Cutensor,
    Npp,
    Cvcuda,
    FlashAttentionV2,
    FlashInfer,
    Ozaki {
        slices: u8,
    },
}
Which underlying compute backend served a kernel SKU.
Surfaced through KernelSku::backend for telemetry, autotuner
cache keys, and selector debugging.
This enum is #[non_exhaustive]: new backends (TensorRT, custom
JIT-emitted kernels via baracuda-nvrtc, …) may land in future phases,
so every match on it must include a _ => catch-all arm.
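The catch-all requirement can be sketched as follows. This is a minimal illustration, not the crate's own code: the enum below is a local stand-in with only a few of the variants, and backend_label is a hypothetical helper invented for the example.

```rust
// Local stand-in for illustration only; the real BackendKind has more variants.
#[non_exhaustive]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum BackendKind {
    Bespoke,
    Cublas,
    FlashInfer,
    Ozaki { slices: u8 },
}

/// Hypothetical helper: maps a backend to a short telemetry label.
pub fn backend_label(kind: BackendKind) -> &'static str {
    match kind {
        BackendKind::Bespoke => "bespoke",
        BackendKind::Cublas => "cublas",
        // `{ .. }` ignores the packed `slices` byte.
        BackendKind::Ozaki { .. } => "ozaki",
        // Downstream crates need this arm: it covers every variant not
        // listed above, including ones added in later releases.
        _ => "other",
    }
}

fn main() {
    println!("{}", backend_label(BackendKind::Ozaki { slices: 8 }));
    println!("{}", backend_label(BackendKind::FlashInfer));
}
```

Inside the defining crate the compiler does not enforce the wildcard arm; the requirement applies to code that matches on the enum from another crate.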
Variants (Non-exhaustive)
This enum is marked as non-exhaustive
Bespoke
Hand-rolled kernel in baracuda-kernels-sys.
Cutlass
CUTLASS template instantiation in baracuda-cutlass-kernels-sys.
Cublas
baracuda-cublas wrapper of cuBLAS / cuBLASLt.
Cudnn
baracuda-cudnn wrapper of cuDNN (graph or legacy API).
Cufft
baracuda-cufft wrapper of cuFFT.
Cusparse
baracuda-cusparse wrapper of cuSPARSE / cuSPARSELt.
Cusolver
baracuda-cusolver wrapper of cuSOLVER.
Curand
baracuda-curand wrapper of cuRAND.
Cutensor
baracuda-cutensor wrapper of cuTENSOR.
Npp
baracuda-npp wrapper of NPP.
Cvcuda
baracuda-cvcuda wrapper of CV-CUDA.
FlashAttentionV2
Vendored Dao-AILab FlashAttention v2 (BSD-3-Clause). Phase 42
added this as a backend choice on FlashSdpaPlan for the long-
context regime where FA2’s tiling wins over the bespoke kernel.
FlashInfer
Vendored FlashInfer (Apache-2.0). Phase 46 added three plan
families backed by FlashInfer cherry-picked headers:
BatchPagedDecodePlan (batched paged-KV decode for vLLM-style
serving), TopKTopPSamplingPlan (sort-free combined top-K /
top-P / min-P sampling), and CascadeAttentionPlan (LSE-merge
for prefix-cache sharing across requests).
Ozaki
Vendored ozIMMU (MIT). Phase 44 backend choice on FP64 GemmPlan
that splits each operand into S int8 slices, where S is the slice
count carried in the slices field, and runs S² tensor-core matmuls
(the Ozaki scheme) to synthesize a DGEMM on hardware that has no
FP64 tensor cores (RTX 4070, L4, etc.). Opt-in: NOT bit-equivalent
to native DGEMM; slices = 8 is the upstream-recommended sweet spot
for well-conditioned inputs.
Slice-count + variant discriminant encoding (Phase 44c)
The slices byte is split into two bit-fields:
- Low 5 bits (slices & 0x1F): slice count S.
  - 0 = auto (fp64_int8_auto, runtime selection based on mantissa-loss histogram).
  - 3..=18 = fixed slice count (fp64_int8_3..fp64_int8_18).
- High 3 bits (slices >> 5): Phase 44c variant flag.
  - 0 = Base (original ozIMMU; default for back-compat with Phase 44/44b callers).
  - 1 = EF (group-wise error-free summation; ~5–15% faster at the same accuracy).
  - 2 = RN (nearest-rounding split; ~2 extra effective bits per slice).
  - 3 = H (EF + RN combined).
Use the ozaki_slices helper constructors for ergonomic
construction (ozaki_slices::ef(8) → 40 = EF variant at
S=8). Values with any other bit pattern are rejected at
plan-select time.
n-blocking (chunk large-N int8 GEMMs into 8192-wide pieces) is applied automatically by the C++ shim regardless of the variant flag.
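The bit-field layout above can be sketched as a small encoder/decoder pair. This is an illustration under stated assumptions, not the crate's ozaki_slices module: Variant, encode, and decode are hypothetical names, and the validity check (count 0 or 3..=18, flag 0..=3) is one reading of "any other bit pattern is rejected".

```rust
/// Variant flag stored in the high 3 bits (values taken from the list above).
/// Hypothetical type, introduced for this sketch only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Variant {
    Base = 0,
    Ef = 1,
    Rn = 2,
    H = 3,
}

/// Slice counts the layout allows: 0 (auto) or a fixed count in 3..=18.
fn count_is_valid(count: u8) -> bool {
    count == 0 || (3..=18).contains(&count)
}

/// Hypothetical encoder: packs the variant flag and slice count into one byte.
pub fn encode(variant: Variant, count: u8) -> Option<u8> {
    if count_is_valid(count) {
        Some(((variant as u8) << 5) | count)
    } else {
        None
    }
}

/// Hypothetical decoder: splits the byte back into (variant, slice count).
/// Returns None for bit patterns outside the documented layout.
pub fn decode(slices: u8) -> Option<(Variant, u8)> {
    let count = slices & 0x1F; // low 5 bits
    let variant = match slices >> 5 {
        // high 3 bits
        0 => Variant::Base,
        1 => Variant::Ef,
        2 => Variant::Rn,
        3 => Variant::H,
        _ => return None,
    };
    if count_is_valid(count) {
        Some((variant, count))
    } else {
        None
    }
}

fn main() {
    // EF at S = 8: (1 << 5) | 8 = 40, matching the documented example.
    assert_eq!(encode(Variant::Ef, 8), Some(40));
    assert_eq!(decode(40), Some((Variant::Ef, 8)));
    // A bare 8 decodes as the Base variant, the back-compat case.
    assert_eq!(decode(8), Some((Variant::Base, 8)));
    // Slice count 2 and variant flag 4 fall outside the layout.
    assert_eq!(decode(2), None);
    assert_eq!(decode((4 << 5) | 8), None);
}
```

Because Base is flag 0, a raw slice count written by a Phase 44/44b caller decodes unchanged, which is what keeps the encoding backward compatible.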
Fields
slices: u8
Slice count + variant discriminant. See the
BackendKind::Ozaki doc-comment for the bit-field
layout. Prefer the ozaki_slices helpers over raw
integer construction.
Trait Implementations
impl Clone for BackendKind
fn clone(&self) -> BackendKind
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Copy for BackendKind
impl Debug for BackendKind
impl Eq for BackendKind
impl Hash for BackendKind
impl PartialEq for BackendKind
fn eq(&self, other: &BackendKind) -> bool
Tests for self and other values to be equal, and is used by ==.
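The Eq and Hash implementations are what let a BackendKind serve as a cache key, as the description mentions for the autotuner. A minimal sketch, assuming a local stand-in enum with only three variants and a hypothetical fastest helper; the timing map is invented for the example and is not the crate's cache type.

```rust
use std::collections::HashMap;

// Local stand-in for illustration only; the real BackendKind has more variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum BackendKind {
    Cublas,
    Cutlass,
    Ozaki { slices: u8 },
}

/// Hypothetical helper: returns the backend with the lowest recorded time.
pub fn fastest(timings: &HashMap<BackendKind, f64>) -> Option<BackendKind> {
    timings
        .iter()
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(kind, _)| *kind) // Copy lets us return the key by value
}

fn main() {
    let mut timings: HashMap<BackendKind, f64> = HashMap::new();
    timings.insert(BackendKind::Cublas, 1.8);
    timings.insert(BackendKind::Cutlass, 1.5);
    // Different `slices` bytes hash and compare as different keys.
    timings.insert(BackendKind::Ozaki { slices: 8 }, 2.4);
    timings.insert(BackendKind::Ozaki { slices: 40 }, 2.1);

    assert_eq!(timings.len(), 4);
    println!("{:?}", fastest(&timings));
}
```

Since the derived equality compares the whole slices byte, two Ozaki entries with the same slice count but different variant flags occupy separate cache slots.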