Enum MemoryOp

Source

pub enum MemoryOp {
    LdParam {
        dst: Register,
        param_name: String,
        ty: PtxType,
    },
    LdGlobal {
        dst: Register,
        addr: Register,
        ty: PtxType,
    },
    LdGlobalPred {
        dst: Register,
        addr: Register,
        ty: PtxType,
        pred: Register,
        negate: bool,
    },
    LdGlobalB128 {
        dsts: [Register; 4],
        addr: Register,
    },
    StGlobal {
        addr: Register,
        src: Register,
        ty: PtxType,
    },
    StGlobalPred {
        addr: Register,
        src: Register,
        ty: PtxType,
        pred: Register,
        negate: bool,
    },
    LdShared {
        dst: Register,
        addr: Register,
        ty: PtxType,
    },
    StShared {
        addr: Register,
        src: Register,
        ty: PtxType,
    },
    CvtaToGlobal {
        dst: Register,
        src: Register,
    },
    CpAsyncCaSharedGlobal {
        dst_shared: Register,
        src_global: Register,
        size_bytes: u8,
    },
    CpAsyncCommitGroup,
    CpAsyncWaitGroup {
        n: u8,
    },
}

Expand description

Memory PTX instruction variants.

Operand conventions:

All addresses and values are Registers (not Operand). You can’t ld.global from an immediate address or st.global an immediate value in PTX — those go through mov first.
LdParam is the exception: it references a kernel parameter by name (a String), not by register.

Variants§

§

LdParam

Load kernel parameter: ld.param{ty} dst, [param_name];

References the parameter by name from the kernel signature. Example: ld.param.u64 %rd1, [vector_add_param_0];

Fields

§dst: Register

Destination register.

§param_name: String

Parameter name from the kernel signature.

§ty: PtxType

PTX type of the parameter value.

§

LdGlobal

Load from global memory: ld.global{ty} dst, [addr];

The addr register holds the computed memory address. Example: ld.global.f32 %f1, [%rd8];

Fields

§dst: Register

Destination register.

§addr: Register

Register holding the memory address.

§ty: PtxType

PTX type of the loaded value.

§

LdGlobalPred

Predicated load from global memory: @[!]{pred} ld.global{ty} dst, [addr];

Skips the load when the predicate evaluates false (or true when negate is set). Used for edge-tile bounds checking — the OOB thread’s dst register is left unchanged, so callers typically pre-initialize dst to zero with mov.b32 dst, 0 and then conditionally overwrite with a predicated load.

Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user. Example: @%p1 ld.global.u32 %r5, [%rd9];

Fields

§dst: Register

Destination register (unchanged when predicate is false).

§addr: Register

Register holding the memory address.

§ty: PtxType

PTX type of the loaded value.

§pred: Register

Predicate register controlling the load.

§negate: bool

When true, negate the predicate (@!pred).

§

LdGlobalB128

128-bit vectorized load from global memory: ld.global.v4.b32 {%r_i, %r_j, %r_k, %r_l}, [addr];

Single-instruction 128-bit transfer into 4 independent b32 destination registers. Halves (or more) the global-load instruction count vs scalar b32 loads for bandwidth-bound kernels. Requires the addr register to hold a 16-byte aligned global-space address — unaligned access will fault at runtime; PTX does not catch this statically.

Destinations are NOT required to be consecutive registers in the allocator — PTX ld.global.v4.b32 accepts any 4 b32 regs in the vector brace list. In practice, allocating 4 regs in sequence produces consecutive indices, which is what callers typically do.

No predicate variant in Sprint 6.7b — edge tiles stay on the existing LdGlobalPred scalar path. A future LdGlobalB128Pred would be additive.

Sprint 6.7b (multi-warp matmul_tc Tile B fast path) is the first user. Construct via MemoryOp::new_ld_global_b128, which validates that all 4 destinations are b32-class registers.

Example: ld.global.v4.b32 {%r0, %r1, %r2, %r3}, [%rd8];

Fields

§dsts: [Register; 4]

Four b32 destination registers — receive bytes 0-3, 4-7, 8-11, 12-15 of the loaded 128-bit value respectively.

§addr: Register

Register holding a 16-B aligned global-space address.

§

StGlobal

Store to global memory: st.global{ty} [addr], src;

Operand order is reversed in PTX — address comes first, value second. This matches PTX convention but is opposite to loads and arithmetic where dst is first.

Example: st.global.f32 [%rd10], %f3;

Fields

§addr: Register

Register holding the memory address.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§

StGlobalPred

Predicated store to global memory: @[!]{pred} st.global{ty} [addr], src;

Skips the store when the predicate evaluates false (or true when negate is set). Used for edge-tile bounds checking on output writes — out-of-bounds threads simply don’t store, leaving the destination memory untouched.

Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user. Example: @%p1 st.global.f32 [%rd11], %f4;

Fields

§addr: Register

Register holding the memory address.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§pred: Register

Predicate register controlling the store.

§negate: bool

When true, negate the predicate (@!pred).

§

LdShared

Load from shared memory: ld.shared{ty} dst, [addr];

Shared memory is block-scoped SRAM. The addr register holds the offset into the declared shared allocation. Example: ld.shared.f32 %f0, [%r0];

Fields

§dst: Register

Destination register.

§addr: Register

Register holding the shared memory offset.

§ty: PtxType

PTX type of the loaded value.

§

StShared

Store to shared memory: st.shared{ty} [addr], src;

Operand order is reversed in PTX — address first, value second (same convention as StGlobal). Example: st.shared.f32 [%r0], %f1;

Fields

§addr: Register

Register holding the shared memory offset.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§

CvtaToGlobal

Convert generic address to global: cvta.to.global.u64 dst, src;

Always .u64 (64-bit address space, matching .address_size 64). Required because ld.param returns generic-space pointers — ld.global needs global-space addresses.

Fields

§dst: Register

Destination register (global-space address).

§src: Register

Source register (generic-space address from ld.param).

§

CpAsyncCaSharedGlobal

Asynchronous global→shared copy, cache-at-all-levels variant: cp.async.ca.shared.global [dst_shared], [src_global], size_bytes;

Issues a non-blocking transfer from global memory into shared memory without tying up registers. The copy is in-flight after this instruction; use CpAsyncCommitGroup to delimit a batch and CpAsyncWaitGroup to synchronize. Requires SM 8.0+ (Ampere).

size_bytes must be one of 4, 8, or 16 (validated at construction via MemoryOp::new_cp_async_ca).

Example: cp.async.ca.shared.global [%r0], [%rd3], 16;

Placement note: cp.async lives in MemoryOp for Sprint 6.2 because semantically it is a memory op. The commit/wait variants are pipeline-state operations and may relocate to a dedicated PipelineOp category in Sprint 6.4 once double-buffering patterns exercise the state machine.

Fields

§dst_shared: Register

Register holding the shared-memory destination offset.

§src_global: Register

Register holding the global-memory source address (.to.global).

§size_bytes: u8

Copy size in bytes: must be 4, 8, or 16.

§

CpAsyncCommitGroup

Commit all pending cp.async operations into a new async group: cp.async.commit_group;

Groups are numbered implicitly from 0 (most-recently committed) upward. Used in conjunction with CpAsyncWaitGroup to block until a specific group completes. Requires SM 8.0+.

§

CpAsyncWaitGroup

Wait until at most n async copy groups remain in-flight: cp.async.wait_group n;

wait_group 0 waits for all outstanding groups to complete (the common one-stage-pipeline case). For double-buffered kernels, wait_group 1 is used to block on the N-1’th group while issuing the N’th. Requires SM 8.0+.

Fields

§n: u8

Number of outstanding groups still permitted after this wait.

Enum MemoryOp Copy item path

Variants§

LdParam

Fields

LdGlobal

Fields

LdGlobalPred

Fields

LdGlobalB128

Fields

StGlobal

Fields

StGlobalPred

Fields

LdShared

Fields

StShared

Fields

CvtaToGlobal

Fields

CpAsyncCaSharedGlobal

Fields

CpAsyncCommitGroup

CpAsyncWaitGroup

Fields

Implementations§

impl MemoryOp

pub fn new_cp_async_ca( dst_shared: Register, src_global: Register, size_bytes: u8, ) -> Self

§Panics

pub fn new_ld_global_b128(dsts: [Register; 4], addr: Register) -> Self

§Panics

Trait Implementations§

impl Clone for MemoryOp

fn clone(&self) -> MemoryOp

fn clone_from(&mut self, source: &Self)

impl Debug for MemoryOp

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Emit for MemoryOp

fn emit(&self, w: &mut PtxWriter) -> Result

Auto Trait Implementations§

impl Freeze for MemoryOp

impl RefUnwindSafe for MemoryOp

impl Send for MemoryOp

impl Sync for MemoryOp

impl Unpin for MemoryOp

impl UnsafeUnpin for MemoryOp

impl UnwindSafe for MemoryOp

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for Twhere T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T> ToOwned for Twhere T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Enum MemoryOp

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T> CloneToUninit for T
where T: Clone,

impl<T, U> Into<U> for T
where U: From<T>,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,