pub enum MemoryOp {
LdParam {
dst: Register,
param_name: String,
ty: PtxType,
},
LdGlobal {
dst: Register,
addr: Register,
ty: PtxType,
},
LdGlobalPred {
dst: Register,
addr: Register,
ty: PtxType,
pred: Register,
negate: bool,
},
LdGlobalB128 {
dsts: [Register; 4],
addr: Register,
},
StGlobal {
addr: Register,
src: Register,
ty: PtxType,
},
StGlobalPred {
addr: Register,
src: Register,
ty: PtxType,
pred: Register,
negate: bool,
},
LdShared {
dst: Register,
addr: Register,
ty: PtxType,
},
StShared {
addr: Register,
src: Register,
ty: PtxType,
},
CvtaToGlobal {
dst: Register,
src: Register,
},
CpAsyncCaSharedGlobal {
dst_shared: Register,
src_global: Register,
size_bytes: u8,
},
CpAsyncCommitGroup,
CpAsyncWaitGroup {
n: u8,
},
}Expand description
Memory PTX instruction variants.
Operand conventions:
Variants§
LdParam
Load kernel parameter: ld.param{ty} dst, [param_name];
References the parameter by name from the kernel signature.
Example: ld.param.u64 %rd1, [vector_add_param_0];
Fields
LdGlobal
Load from global memory: ld.global{ty} dst, [addr];
The addr register holds the computed memory address.
Example: ld.global.f32 %f1, [%rd8];
Fields
LdGlobalPred
Predicated load from global memory: @[!]{pred} ld.global{ty} dst, [addr];
Skips the load when the predicate evaluates false (or true when
negate is set). Used for edge-tile bounds checking — the OOB
thread’s dst register is left unchanged, so callers typically
pre-initialize dst to zero with mov.b32 dst, 0 and then
conditionally overwrite with a predicated load.
Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user.
Example: @%p1 ld.global.u32 %r5, [%rd9];
Fields
LdGlobalB128
128-bit vectorized load from global memory:
ld.global.v4.b32 {%r_i, %r_j, %r_k, %r_l}, [addr];
Single-instruction 128-bit transfer into 4 independent b32
destination registers. Halves (or more) the global-load
instruction count vs scalar b32 loads for bandwidth-bound
kernels. Requires the addr register to hold a 16-byte
aligned global-space address — unaligned access will fault
at runtime; PTX does not catch this statically.
Destinations are NOT required to be consecutive registers in
the allocator — PTX ld.global.v4.b32 accepts any 4 b32 regs
in the vector brace list. In practice, allocating 4 regs in
sequence produces consecutive indices, which is what callers
typically do.
No predicate variant in Sprint 6.7b — edge tiles stay on the
existing LdGlobalPred scalar path. A
future LdGlobalB128Pred would be additive.
Sprint 6.7b (multi-warp matmul_tc Tile B fast path) is the
first user. Construct via
MemoryOp::new_ld_global_b128,
which validates that all 4 destinations are b32-class registers.
Example: ld.global.v4.b32 {%r0, %r1, %r2, %r3}, [%rd8];
Fields
StGlobal
Store to global memory: st.global{ty} [addr], src;
Operand order is reversed in PTX — address comes first,
value second. This matches PTX convention but is opposite to
loads and arithmetic where dst is first.
Example: st.global.f32 [%rd10], %f3;
Fields
StGlobalPred
Predicated store to global memory: @[!]{pred} st.global{ty} [addr], src;
Skips the store when the predicate evaluates false (or true when
negate is set). Used for edge-tile bounds checking on output
writes — out-of-bounds threads simply don’t store, leaving the
destination memory untouched.
Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user.
Example: @%p1 st.global.f32 [%rd11], %f4;
Fields
Load from shared memory: ld.shared{ty} dst, [addr];
Shared memory is block-scoped SRAM. The addr register holds the
offset into the declared shared allocation.
Example: ld.shared.f32 %f0, [%r0];
Store to shared memory: st.shared{ty} [addr], src;
Operand order is reversed in PTX — address first, value second
(same convention as StGlobal).
Example: st.shared.f32 [%r0], %f1;
CvtaToGlobal
Convert generic address to global: cvta.to.global.u64 dst, src;
Always .u64 (64-bit address space, matching .address_size 64).
Required because ld.param returns generic-space pointers —
ld.global needs global-space addresses.
Fields
Asynchronous global→shared copy, cache-at-all-levels variant:
cp.async.ca.shared.global [dst_shared], [src_global], size_bytes;
Issues a non-blocking transfer from global memory into shared
memory without tying up registers. The copy is in-flight after
this instruction; use CpAsyncCommitGroup
to delimit a batch and CpAsyncWaitGroup
to synchronize. Requires SM 8.0+ (Ampere).
size_bytes must be one of 4, 8, or 16 (validated at construction
via MemoryOp::new_cp_async_ca).
Example: cp.async.ca.shared.global [%r0], [%rd3], 16;
Placement note: cp.async lives in MemoryOp for Sprint 6.2
because semantically it is a memory op. The commit/wait variants
are pipeline-state operations and may relocate to a dedicated
PipelineOp category in Sprint 6.4 once double-buffering patterns
exercise the state machine.
CpAsyncCommitGroup
Commit all pending cp.async operations into a new async group:
cp.async.commit_group;
Groups are numbered implicitly from 0 (most-recently committed)
upward. Used in conjunction with
CpAsyncWaitGroup to block until a
specific group completes. Requires SM 8.0+.
CpAsyncWaitGroup
Wait until at most n async copy groups remain in-flight:
cp.async.wait_group n;
wait_group 0 waits for all outstanding groups to complete
(the common one-stage-pipeline case). For double-buffered
kernels, wait_group 1 is used to block on the N-1’th group
while issuing the N’th. Requires SM 8.0+.
Implementations§
Source§impl MemoryOp
impl MemoryOp
Sourcepub fn new_cp_async_ca(
dst_shared: Register,
src_global: Register,
size_bytes: u8,
) -> Self
pub fn new_cp_async_ca( dst_shared: Register, src_global: Register, size_bytes: u8, ) -> Self
Construct a CpAsyncCaSharedGlobal,
validating the size byte count.
§Panics
Panics if size_bytes is not one of 4, 8, or 16 — the
only sizes PTX accepts for cp.async.ca. PTX won’t catch this
until ptxas runs, and the error there is cryptic, so we fail
loudly at construction time.
Sourcepub fn new_ld_global_b128(dsts: [Register; 4], addr: Register) -> Self
pub fn new_ld_global_b128(dsts: [Register; 4], addr: Register) -> Self
Construct an LdGlobalB128, validating
that all 4 destinations are b32-class registers.
§Panics
Panics if any destination register is not crate::types::RegKind::R (b32).
ld.global.v4.b32 requires 4× 32-bit-wide integer-class
destinations; .f / .rd / .h / .hb / .p registers are
invalid and ptxas’s error message is cryptic. Fail loudly at
construction.