Expand description
§atomr-accel-cutlass
CUTLASS kernel-template instantiation for atomr-accel-cuda.
Phase 6 of the CUDA-coverage roadmap. This crate exposes a thin actor-friendly facade over CUTLASS GEMM, grouped-GEMM, and implicit-GEMM convolution templates. Two compilation strategies are supported:
§Strategy A (default — NVRTC at runtime)
User code constructs a gemm::GemmRequest<T> (or grouped /
conv equivalent) and forwards it to a actor::CutlassActor.
The actor builds a small .cu translation unit that
#includes the vendored CUTLASS headers (under
crates/atomr-accel-cutlass/cutlass/include/) and instantiates
the requested template, then hands the source to
atomr_accel_cuda::kernel::NvrtcActor for compilation.
Compilation cost is amortized via the per-arch disk cache that
NvrtcActor already maintains and via the in-process
plan_cache::PlanCache keyed by
(template_id, shape, dtype, arch).
§Strategy B (cutlass-prebuilt feature — nvcc at build time)
build.rs walks a generator that emits a static library of
pre-instantiated kernels for a fixed (op × dtype × arch) matrix.
When the feature is OFF the build.rs is a no-op probe so the
crate still builds on hosts without nvcc. The contract for
Strategy B is documented in [build.rs] and in the crate
README — full implementation is a follow-up; we wire the toggle
and the empty hooks here.
§Coverage
| op | dtype | arch |
|---|---|---|
| GEMM | fp32, fp16, bf16 | sm_80, sm_86, sm_89 |
| GEMM | fp8 e4m3 / e5m2 | sm_89, sm_90a, sm_100 |
| GEMM | fp4 e2m1 | sm_100, sm_120 |
| grouped GEMM | fp16, bf16, fp8 | sm_90a, sm_100 |
| conv2d fwd / dgrad / wgrad (implicit-GEMM) | fp16, bf16, fp32, fp8 | sm_80, sm_86, sm_89, sm_90a |
Compute targets: Ampere (sm_80, fp16/bf16/fp32), Hopper (sm_90a, fp8 e4m3/e5m2 + fp16/bf16, persistent kernels), Blackwell (sm_100, fp8/fp4 + EVT).
§Crate layout
crates/atomr-accel-cutlass/
├── Cargo.toml
├── cutlass/ # vendored CUTLASS headers (BSD-3-Clause)
├── include/ # local template adapters
├── build.rs # gated on cutlass-prebuilt feature
├── examples/
│ ├── cutlass_gemm_fp8.rs
│ └── cutlass_grouped_gemm.rs
└── src/
├── lib.rs # CutlassActor, CutlassMsg, props
├── actor.rs # actor surface
├── gemm.rs # GemmRequest<T>, CutlassGemmDispatch
├── grouped_gemm.rs # GroupedGemmRequest<T>
├── conv.rs # ConvFwd / Dgrad / Wgrad requests
├── evt.rs # EpilogueVisitorTree builder
├── plan_cache.rs # template plan cache
└── kernels/ # generated .cu sources at runtimeRe-exports§
pub use actor::CutlassActor;pub use actor::CutlassInner;pub use actor::CutlassMsg;pub use conv::ConvDgradRequest;pub use conv::ConvFwdRequest;pub use conv::ConvLayout;pub use conv::ConvShape;pub use conv::ConvWgradRequest;pub use conv::CutlassConvDispatch;pub use dtype::is_fp4_supported;pub use dtype::is_fp8_supported;pub use dtype::is_supported_for;pub use dtype::CutlassDtype;pub use dtype::GemmSupported;pub use dtype::SmArch;pub use gemm::CutlassGemmDispatch;pub use gemm::GemmEpilogue;pub use gemm::GemmLayout;pub use gemm::GemmRequest;pub use gemm::GemmShape;pub use gemm::RefitMsg;pub use plan_cache::PlanCache;pub use plan_cache::PlanKey;
Modules§
- actor
CutlassActor— host-side dispatcher for CUTLASS template instantiations.- conv
- Implicit-GEMM convolution requests.
- dtype
- Local dtype enum used by CUTLASS template messages.
- gemm
- GEMM template request and dispatch surface.
- plan_
cache - Plan cache: in-process LRU keyed by template instantiation
identity. Backs the
CutlassActorso that repeated GEMM messages with identical(template_id, shape, dtype, arch)skip both the Rust-side template render and the NVRTC compile.
Functions§
- props
- Convenience: build a
CutlassActorprops value. Mirrors thepropsconstructors used by the rest ofatomr-accel-cudaso downstream code can writesystem.actor_of(atomr_accel_cutlass::props(...), "cutlass"). - version
- Entry point used by the
cutlasscargo feature onatomr-accel-cuda. Returns the human-readable crate version string. Exposed for completeness so that downstream re-exports can version-pin against the cutlass crate without an extra cargo metadata call.