1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
//! Hopper / Blackwell primitives (Phase 5).
//!
//! Hopper (sm_90 / sm_90a) introduced four kernel-side primitives that
//! materially change how high-throughput CUDA kernels are written:
//!
//! 1. **Tensor Memory Accelerator (TMA)** — bulk asynchronous tensor
//! copies between global and shared memory described by an opaque
//! `CUtensorMap` (built host-side via `cuTensorMapEncodeTiled`).
//! See [`tma`].
//! 2. **WGMMA** — warp-group matrix multiply accumulate, the
//! successor to MMA. Issued from a warpgroup of 128 threads via
//! `wgmma.mma_async.sync`. See [`wgmma`].
//! 3. **`cp.async`** — already on Ampere, but Hopper adds the
//! bulk-asynchronous TMA-driven `cp.async.bulk` that fences with
//! barrier objects. See [`cp_async`].
//! 4. **Thread-block clusters** — a new launch dimension above grid /
//! block that exposes Distributed Shared Memory (DSM) and the
//! `cluster.sync` barrier. See [`cluster`].
//!
//! Blackwell (sm_100 / sm_120) adds the second-generation TMA, larger
//! cluster sizes, the new fp4 / fp6 / mxfp variants, and tensor memory
//! (TMEM) that backs `tcgen05.mma`. The `blackwell` cargo feature gates
//! the additional intrinsics; the host-side wrappers are shared with
//! Hopper through this module.
//!
//! ## Layout
//!
//! * [`tma`] — `TensorMapDescriptor` builder + the safe wrapper around
//! `cuTensorMapEncodeTiled`.
//! * [`wgmma`] — public re-exports of the macro-defined `wgmma_*`
//! intrinsics (definitions live in `include/atomr_hopper.cuh`).
//! * [`cp_async`] — `cp.async` pipeline macro shims.
//! * [`cluster`] — [`LaunchSpec`] and the safe wrapper around
//! `cudaLaunchKernelExC` for cluster-dim launches; DSM helpers.
pub use ;
pub use ;
/// Path to the vendored hopper header (`atomr_hopper.cuh`) shipped
/// alongside the crate. NVRTC kernels can `--include-path` this and
/// `#include "atomr_hopper.cuh"` to pick up the wgmma / cp.async /
/// cluster macro shims.
pub const ATOMR_HOPPER_HEADER_REL_PATH: &str = "include/atomr_hopper.cuh";
/// Returns the absolute filesystem path to `atomr_hopper.cuh` if it
/// exists alongside the crate sources. Returns `None` for installations
/// that strip the `include/` directory (e.g. crates.io binary
/// publication of just the compiled lib).