Enum BackendKind

#[non_exhaustive]
pub enum BackendKind {
    Bespoke,
    Cutlass,
    Cublas,
    Cudnn,
    Cufft,
    Cusparse,
    Cusolver,
    Curand,
    Cutensor,
    Npp,
    Cvcuda,
    FlashAttentionV2,
    FlashInfer,
    Ozaki {
        slices: u8,
    },
}

Which underlying compute backend served a kernel SKU.

Surfaced through KernelSku::backend for telemetry, autotuner cache keys, and selector debugging.
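Because the enum implements Copy, Eq, and Hash (see the trait implementations listed further down), it can be embedded directly in a hash-map key. The following is a minimal sketch of an autotuner cache keyed on the backend plus a GEMM shape. The local BackendKind is a stand-in that mirrors a few documented variants so the snippet compiles on its own; TuneKey and TuneCache are hypothetical names and are not part of the crate.

```rust
use std::collections::HashMap;

// Stand-in mirroring a few documented variants; the real type lives in the crate.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum BackendKind {
    Bespoke,
    Cutlass,
    Cublas,
    Ozaki { slices: u8 },
}

// Hypothetical cache key: which backend ran, and for which GEMM shape.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct TuneKey {
    backend: BackendKind,
    m: u32,
    n: u32,
    k: u32,
}

// Hypothetical cache that remembers the best observed time per key.
struct TuneCache {
    best_micros: HashMap<TuneKey, f64>,
}

impl TuneCache {
    fn new() -> Self {
        TuneCache { best_micros: HashMap::new() }
    }

    // Keep the minimum time seen so far for this key.
    fn record(&mut self, key: TuneKey, micros: f64) {
        let slot = self.best_micros.entry(key).or_insert(micros);
        if micros < *slot {
            *slot = micros;
        }
    }

    fn lookup(&self, key: &TuneKey) -> Option<f64> {
        self.best_micros.get(key).copied()
    }
}
```

Note that Ozaki { slices: 8 } and Ozaki { slices: 40 } hash and compare as different keys, so each slice-count and variant combination gets its own cache entry.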

Marked #[non_exhaustive]: new backends (TensorRT, custom JIT-emitted kernels via baracuda-nvrtc, …) may land in future phases, so match expressions in downstream crates must include a _ => catch-all arm.

Variants (Non-exhaustive)

This enum is marked as non-exhaustive

Non-exhaustive enums could have additional variants added in future. Therefore, when matching against variants of non-exhaustive enums, an extra wildcard arm must be added to account for any future variants.
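A match with the required wildcard arm might look like the sketch below. The enum here is a local stand-in with a subset of the documented variants so that the snippet is self-contained; inside the defining crate the attribute has no effect, and the wildcard only becomes mandatory when matching from another crate. The function name backend_family and the labels it returns are illustrative assumptions.

```rust
// Local stand-in for illustration; the real enum is defined in the crate.
#[non_exhaustive]
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum BackendKind {
    Bespoke,
    Cutlass,
    Cublas,
    Cudnn,
    Cufft,
    FlashAttentionV2,
    FlashInfer,
    Ozaki { slices: u8 },
}

// Hypothetical helper that buckets backends for a log line.
fn backend_family(kind: BackendKind) -> &'static str {
    match kind {
        BackendKind::Bespoke => "hand-rolled",
        BackendKind::Cutlass => "cutlass",
        BackendKind::Cublas | BackendKind::Cudnn => "vendor-library",
        BackendKind::FlashAttentionV2
        | BackendKind::FlashInfer
        | BackendKind::Ozaki { .. } => "vendored",
        // Catch-all: covers variants not listed above and any added later.
        _ => "other",
    }
}
```

Using Ozaki { .. } rather than binding slices keeps the arm valid even if further fields are added to that variant.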

Bespoke

Hand-rolled kernel in baracuda-kernels-sys.

Cutlass

CUTLASS template instantiation in baracuda-cutlass-kernels-sys.

Cublas

baracuda-cublas wrapper of cuBLAS / cuBLASLt.

Cudnn

baracuda-cudnn wrapper of cuDNN (graph or legacy API).

Cufft

baracuda-cufft wrapper of cuFFT.

Cusparse

baracuda-cusparse wrapper of cuSPARSE / cuSPARSELt.

Cusolver

baracuda-cusolver wrapper of cuSOLVER.

Curand

baracuda-curand wrapper of cuRAND.

Cutensor

baracuda-cutensor wrapper of cuTENSOR.

Npp

baracuda-npp wrapper of NPP.

Cvcuda

baracuda-cvcuda wrapper of CV-CUDA.

FlashAttentionV2

Vendored Dao-AILab FlashAttention v2 (BSD-3-Clause). Phase 42 added this as a backend choice on FlashSdpaPlan for the long-context regime where FA2’s tiling wins over the bespoke kernel.

FlashInfer

Vendored FlashInfer (Apache-2.0). Phase 46 added three plan families backed by FlashInfer cherry-picked headers: BatchPagedDecodePlan (batched paged-KV decode for vLLM-style serving), TopKTopPSamplingPlan (sort-free combined top-K / top-P / min-P sampling), and CascadeAttentionPlan (LSE-merge for prefix-cache sharing across requests).
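The LSE-merge mentioned for CascadeAttentionPlan refers to combining two partial attention results using their log-sum-exp values. The sketch below shows the standard merge arithmetic on the CPU for a single query row. It is not the crate's or FlashInfer's API: merge_states is a hypothetical name, and natural-log LSE values are an assumption (a given kernel may use a different log base).

```rust
// Merge two partial attention outputs for one query row.
// o1/o2 are outputs already normalised over their own key sets;
// lse1/lse2 are the natural-log sum-of-exponentials of the scores in each set.
fn merge_states(o1: &[f32], lse1: f32, o2: &[f32], lse2: f32) -> (Vec<f32>, f32) {
    // Subtract the larger LSE before exponentiating for numerical stability.
    let m = lse1.max(lse2);
    let w1 = (lse1 - m).exp();
    let w2 = (lse2 - m).exp();
    let denom = w1 + w2;

    // Weighted average of the two outputs, weights proportional to exp(lse).
    let merged: Vec<f32> = o1
        .iter()
        .zip(o2)
        .map(|(a, b)| (a * w1 + b * w2) / denom)
        .collect();

    // LSE over the union of both key sets.
    (merged, m + denom.ln())
}
```

With this identity, attention over a shared prefix can be computed once and merged with each request's own suffix result, which is the kind of reuse that prefix-cache sharing relies on.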

Ozaki

Vendored ozIMMU (MIT). Phase 44 backend choice on FP64 GemmPlan that splits each operand into S int8 slices and runs S² tensor-core matmuls (the Ozaki scheme) to synthesize a DGEMM on hardware that has no FP64 tensor cores (RTX 4070, L4, etc.). S is the slice count carried in the slices field. Opt-in: the result is NOT bit-equivalent to native DGEMM. slices = 8 is the upstream-recommended sweet spot for well-conditioned inputs.
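To give a feel for why the cost grows with the square of the slice count, here is a toy scalar analogue of the slicing idea. It is only an illustration under loud assumptions: inputs are restricted to |x| < 1, each slice carries 6 bits of magnitude, and the names split and sliced_product are hypothetical. The real backend operates on matrices with its own scaling and slice width, which this page does not describe.

```rust
// Split x (assumed |x| < 1) into `s` small signed integers.
// Each slice holds the next 6 bits of magnitude, so it fits in an i8.
fn split(x: f64, s: usize) -> Vec<i8> {
    let mut r = x;
    let mut out = Vec::with_capacity(s);
    for _ in 0..s {
        r *= 64.0;          // shift the next 6 bits above the binary point
        let d = r.trunc();  // integer part, in -63..=63
        out.push(d as i8);
        r -= d;             // keep the remainder for the next slice
    }
    out
}

// Approximate x * y from integer slice products only.
// The double loop performs s * s small-integer multiplications, which is the
// scalar counterpart of running one int8 matmul per pair of slices.
fn sliced_product(x: f64, y: f64, s: usize) -> f64 {
    let xs = split(x, s);
    let ys = split(y, s);
    let mut acc = 0.0;
    for (i, &a) in xs.iter().enumerate() {
        for (j, &b) in ys.iter().enumerate() {
            let p = i32::from(a) * i32::from(b); // exact integer product
            let scale = 64f64.powi(-((i + j + 2) as i32));
            acc += f64::from(p) * scale;
        }
    }
    acc
}
```

In this toy, three slices recover roughly 18 bits of each input and eight slices roughly 48 bits, which illustrates why the result approaches, but is not bit-identical to, the native double-precision product.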

Slice-count + variant discriminant encoding (Phase 44c)

The slices byte is split into two bit-fields:

Low 5 bits (slices & 0x1F): slice count S.
- 0 = auto (fp64_int8_auto, runtime selection based on mantissa-loss histogram).
- 3..=18 = fixed slice count (fp64_int8_3 .. fp64_int8_18).

High 3 bits (slices >> 5): Phase 44c variant flag.
- 0 = Base (original ozIMMU; default for back-compat with Phase 44/44b callers).
- 1 = EF (group-wise error-free summation; ~5–15% faster at the same accuracy).
- 2 = RN (nearest-rounding split; ~2 extra effective bits per slice).
- 3 = H (EF + RN combined).

Use the ozaki_slices helper constructors for ergonomic construction; for example, ozaki_slices::ef(8) returns 40, that is (1 << 5) | 8, the EF variant at S = 8. Values whose bit pattern falls outside the ranges listed above are rejected at plan-select time.
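The bit-field layout can be sketched as a pair of encode and decode functions. These are illustrative stand-ins, not the ozaki_slices helpers themselves, whose exact signatures are not shown on this page; OzakiVariant, encode, and decode are hypothetical names, and the validation simply mirrors the ranges documented above.

```rust
// Hypothetical mirror of the Phase 44c variant flag (high 3 bits).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OzakiVariant {
    Base = 0,
    Ef = 1,
    Rn = 2,
    H = 3,
}

// Slice count must be 0 (auto) or within 3..=18.
fn valid_slice_count(s: u8) -> bool {
    s == 0 || (3..=18).contains(&s)
}

// Pack a variant and slice count into the single `slices` byte.
fn encode(variant: OzakiVariant, s: u8) -> Option<u8> {
    if valid_slice_count(s) {
        Some(((variant as u8) << 5) | s)
    } else {
        None
    }
}

// Unpack the byte, returning None for bit patterns outside the documented ranges.
fn decode(slices: u8) -> Option<(OzakiVariant, u8)> {
    let s = slices & 0x1F;
    let variant = match slices >> 5 {
        0 => OzakiVariant::Base,
        1 => OzakiVariant::Ef,
        2 => OzakiVariant::Rn,
        3 => OzakiVariant::H,
        _ => return None, // flags 4..=7 are not assigned
    };
    if valid_slice_count(s) {
        Some((variant, s))
    } else {
        None
    }
}
```

Because Base is flag 0, a plain slice count such as 8 decodes as the Base variant, which is consistent with the back-compat note for Phase 44/44b callers.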

n-blocking (chunking large-N int8 GEMMs into 8192-wide pieces) is applied automatically by the C++ shim regardless of the variant flag.
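The chunking arithmetic behind n-blocking can be illustrated in a few lines. The actual logic lives in the C++ shim and is not reproduced here; this Rust sketch only shows how an N dimension might be cut into 8192-wide column ranges. The function name n_blocks and the treatment of the final, narrower chunk are assumptions.

```rust
use std::ops::Range;

// Chunk width stated in the documentation.
const N_BLOCK: usize = 8192;

// Split the column range 0..n into consecutive pieces of at most N_BLOCK columns.
// Assumption: the last piece is simply narrower when n is not a multiple of N_BLOCK.
fn n_blocks(n: usize) -> Vec<Range<usize>> {
    (0..n)
        .step_by(N_BLOCK)
        .map(|start| start..(start + N_BLOCK).min(n))
        .collect()
}
```

For example, n = 20000 yields the ranges 0..8192, 8192..16384, and 16384..20000.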

Fields

slices: u8

Slice count + variant discriminant — see the BackendKind::Ozaki doc-comment for the bit-field layout. Prefer the ozaki_slices helpers over raw integer construction.

Trait Implementations

impl Clone for BackendKind

fn clone(&self) -> BackendKind

fn clone_from(&mut self, source: &Self)

impl Copy for BackendKind

impl Debug for BackendKind

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

impl Eq for BackendKind

impl Hash for BackendKind

fn hash<__H>(&self, state: &mut __H) where __H: Hasher,

fn hash_slice<H>(data: &[Self], state: &mut H) where H: Hasher, Self: Sized,

impl PartialEq for BackendKind

fn eq(&self, other: &BackendKind) -> bool

fn ne(&self, other: &Rhs) -> bool

impl StructuralPartialEq for BackendKind

Auto Trait Implementations

impl Freeze for BackendKind

impl RefUnwindSafe for BackendKind

impl Send for BackendKind

impl Sync for BackendKind

impl Unpin for BackendKind

impl UnsafeUnpin for BackendKind

impl UnwindSafe for BackendKind

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> ToOwned for T where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>
