baracuda-kernels-types
Shared type vocabulary for the baracuda ML kernel facade.
This crate has no GPU code and no behavior of its own. It ships
pure-data types — traits, enums, structs — that are the contracts the
safe Rust layer (baracuda-kernels) and the raw FFI layer
(baracuda-kernels-sys) both agree on. The same vocabulary is also
consumed by every per-library wrapper crate
(baracuda-cublas, baracuda-cudnn, baracuda-cusolver, …) so the
whole workspace speaks one dtype + tag language instead of each crate
re-declaring its own.
The types were previously defined in baracuda-cutlass::types; they
were lifted out in Phase 0 so the facade could share them without
forcing every consumer to pull in CUTLASS. baracuda-cutlass keeps the
old names as re-exports for back-compat.
What's in here
src/
element.rs KernelDtype (umbrella marker) + Element / IntElement / FpElement /
BinElement / BiasElement sibling traits
+ ElementKind / MathPrecision / BiasElementKind tag enums
+ scalar wrappers: S8, U8, S4, U4, Bin, F32Strict, Fp8E4M3, Fp8E5M2,
Bool, Complex32, Complex64
+ ScalarType (alpha / beta projection)
layout.rs LayoutSku (Rcr, Rrr), ArchSku (Sm80, Sm89, Sm90a),
EpilogueKind, ActivationKind
matrix.rs MatrixRef<T> + MatrixMut<T> + VectorRef<T> — 2-D matrix views for GEMM
tensor.rs TensorRef<T, N> + TensorMut<T, N> — rank-N tensor views for
everything else (const-generic rank, element strides, broadcast convention)
plan.rs PlanPreference, PrecisionGuarantee, Workspace
sku.rs OpCategory, BackendKind, KernelSku — the structural identity tuple
returned by every plan's sku() accessor
ops.rs Per-category op-discriminant enums: BinaryKind, UnaryKind,
ReduceKind, ScanKind, SoftmaxKind, NormalizationKind, LossKind,
AttentionKind, IndexingKind, EmbeddingKind, ShapeLayoutKind,
SortKind, QuantizeKind, RandomKind, SegmentKind, ImageKind,
FftKind, LinalgKind, MoeKind, GgufBlockFormat, … (and the
supporting tag enums: PadMode, FillMode, LossReduction,
CrossEntropyTargetKind, …)
Element vs KernelDtype — which to bound on
[KernelDtype] is the umbrella marker every kernel-usable dtype
implements — including the sub-byte / FP8 / packed-bit newtypes
(S4, U4, S8, U8, Fp8E4M3, Fp8E5M2, Bin) that have their
own kernel families. Element, IntElement, FpElement, and
BinElement all use KernelDtype as a supertrait, so a function
bounded by <T: KernelDtype> accepts any kernel-usable type.
The op-shaped sub-traits are what plans actually parameterize on:
| Bound | Accepts | When to use |
|---|---|---|
<T: Element> |
f16, bf16, f32, F32Strict, f64, i32, i64, Bool, Complex32, Complex64 |
The elementwise / reduce / scan / norm / loss / shape-layout plans — they consume BinaryPlan<T, N> / UnaryPlan<T, N> shape with a type Scalar projection for α/β. |
<T: IntElement> |
S8, U8, S4, U4 |
The int-GEMM plan family. |
<T: FpElement> |
Fp8E4M3, Fp8E5M2 |
The FP8 GEMM plan family (sm_89+). |
<T: BinElement> |
Bin |
The 1-bit binary GEMM plan family (XOR + popcount). |
<T: KernelDtype> |
union of the four above | Generic utility code that wants to accept any dtype — telemetry helpers, dtype-size queries, downstream framework wrappers. |
KernelDtype::KIND: ElementKind is the single source of truth for the
runtime dtype tag — pre-Phase-28 code that wrote
<T as Element>::KIND should switch to plain T::KIND (works under
any of the sub-trait bounds via supertrait inheritance) or
<T as KernelDtype>::KIND for the fully-qualified form.
#[non_exhaustive] and forward-compat
Phase 28 marked the op-family discriminant enums plus several tag
enums #[non_exhaustive]. Downstream code that matches on them
must include a _ => catch-all — adding new variants then no longer
breaks the build. The covered enums:
- Op-family:
BinaryKind,UnaryKind,TernaryKind,GatedActivationKind,PadMode,ShapeLayoutKind,ArgReduceKind,ReduceKind,SoftmaxKind,ScanKind,BinaryCmpKind,NormalizationKind,LossKind,RandomKind,LinalgKind,FftKind,ConvKind,PoolKind,AttentionKind,IndexingKind,SegmentKind,EmbeddingKind,QuantizeKind,GgufBlockFormat,MoeKind,SortKind,ImageKind. - Auxiliary tags:
OpCategory,BackendKind,IndexElementKind,IndexOutputKind.
Intentionally LEFT exhaustive (deliberate breaking-change events on new variants):
ElementKind— every kernel dtype is enumerated; a new dtype is a workspace-wide event that should surface as a build break across every match.LayoutSku,ArchSku,EpilogueKind,ActivationKind,BiasElementKind— these are the keys cutlass GEMM and int-GEMM dispatchers exhaustively match on to pick per-arch / per-fused-epilogue / per-bias-dtype kernel SKUs; adding a variant deserves to surface at every match site so each can wire or reject.Workspace<'a>— hot-path-matched by every plan'srunmethod; theNone/Borrowedsplit has been stable through every alpha.EmbeddingBagMode,FillMode,LossReduction,CrossEntropyTargetKind,BatchedOrmqrSide,BatchedOrmqrOp— closed mathematical / convention sets (Sum/Mean for the bag, Lower/Upper for triangular fill, the LAPACK Left/Right and N/T/C ops, the PyTorch reduction modes).
Phase 32 — descriptor builder retrofit
Phase 32 propagated the same #[non_exhaustive] marker to the
descriptor structs that have been amended in recent phases
(Conv{1,2,3}dDescriptor, ConvTranspose{1,2,3}dDescriptor,
Pool{1,2,3}dDescriptor, AdaptivePool{1,2,3}dDescriptor,
LpPool{1,2}dDescriptor, FractionalMaxPool{2,3}dDescriptor,
InterpolateDescriptor, InterpolateBackwardDescriptor). These
structs live in baracuda-kernels, not in this crate — see the
"Phase 32 builder migration" section in
baracuda-kernels/README.md for the full
list of new ::new(...) constructors and .with_* setters.
The marker means downstream callers MUST use the new builder
(Conv2dDescriptor::new(...).with_padding(...).with_stride(...))
instead of a struct literal. Adding optional fields in future phases
then no longer breaks downstream builds.
Why this crate is split out
A few load-bearing reasons:
- Zero CUDA dependency at the type level. Downstream crates that
only want the dtype vocabulary (e.g. a tensor library that needs
Element::KINDto identify dtypes for printing) don't have to pull in CUDA, CUTLASS, or any*-syscrate. The runtime dependency surface isbaracuda-driver(forDeviceSlice) +baracuda-types(forDeviceRepr/Half/BFloat16) +half+float8. - One vocabulary per scalar dtype. Without this crate, each
per-library wrapper would re-derive its own
enum Dtype { F32, F16, … }from the underlying NVIDIA library's tag enum (cudaDataType_t,cudnnDataType_t,cufftType, …) and the safe facade would spend its life translating between them. Centralizing the vocabulary here means the translation table lives in one place per wrapper. KernelSkuis the autotuner cache key. It needs to beCopy + Eq + Hashand stable across versions. Defining it here, away from any one library wrapper, keeps it neutral.
Dependencies
[]
= "...features = [\"half-crate\", \"f8-crate\"]"
= "..."
= "2"
= "0.7"
baracuda-driver is the source of the lifetimed device-slice types
(DeviceSlice / DeviceSliceMut) that back TensorRef / TensorMut.
The half and float8 crates supply the precise IEEE half-precision
and 8-bit-float wrappers (half::f16, half::bf16, float8::F8E4M3,
float8::F8E5M2) that the FpElement impls re-export.
Usage
You generally don't depend on this crate directly — depend on
baracuda-kernels (which re-exports the entire surface) and import
from there:
use ;
Depending on baracuda-kernels-types directly is only useful if you're
writing a sibling wrapper crate that needs the vocabulary but doesn't
want the rest of the facade (baracuda-cublas / baracuda-cudnn etc.
do exactly this).
See also
baracuda-kernels— the safe facade that re-exports everything in this crate.baracuda-kernels-sys— the raw FFI layer this vocabulary describes the contracts for.ARCHITECTURE.md— the layered design, the Plan / Descriptor / Args triple, theKernelSkutaxonomy, the workspace contract.