Skip to main content

MemoryOp

Enum MemoryOp 

Source
pub enum MemoryOp {
    LdParam {
        dst: Register,
        param_name: String,
        ty: PtxType,
    },
    LdGlobal {
        dst: Register,
        addr: Register,
        ty: PtxType,
    },
    LdGlobalPred {
        dst: Register,
        addr: Register,
        ty: PtxType,
        pred: Register,
        negate: bool,
    },
    LdGlobalB128 {
        dsts: [Register; 4],
        addr: Register,
    },
    StGlobal {
        addr: Register,
        src: Register,
        ty: PtxType,
    },
    StGlobalPred {
        addr: Register,
        src: Register,
        ty: PtxType,
        pred: Register,
        negate: bool,
    },
    LdShared {
        dst: Register,
        addr: Register,
        ty: PtxType,
    },
    StShared {
        addr: Register,
        src: Register,
        ty: PtxType,
    },
    CvtaToGlobal {
        dst: Register,
        src: Register,
    },
    CpAsyncCaSharedGlobal {
        dst_shared: Register,
        src_global: Register,
        size_bytes: u8,
    },
    CpAsyncCommitGroup,
    CpAsyncWaitGroup {
        n: u8,
    },
}
Expand description

Memory PTX instruction variants.

Operand conventions:

  • All addresses and values are Registers (not Operand). You can’t ld.global from an immediate address or st.global an immediate value in PTX — those go through mov first.
  • LdParam is the exception: it references a kernel parameter by name (a String), not by register.

Variants§

§

LdParam

Load kernel parameter: ld.param{ty} dst, [param_name];

References the parameter by name from the kernel signature. Example: ld.param.u64 %rd1, [vector_add_param_0];

Fields

§dst: Register

Destination register.

§param_name: String

Parameter name from the kernel signature.

§ty: PtxType

PTX type of the parameter value.

§

LdGlobal

Load from global memory: ld.global{ty} dst, [addr];

The addr register holds the computed memory address. Example: ld.global.f32 %f1, [%rd8];

Fields

§dst: Register

Destination register.

§addr: Register

Register holding the memory address.

§ty: PtxType

PTX type of the loaded value.

§

LdGlobalPred

Predicated load from global memory: @[!]{pred} ld.global{ty} dst, [addr];

Skips the load when the predicate evaluates false (or true when negate is set). Used for edge-tile bounds checking — the OOB thread’s dst register is left unchanged, so callers typically pre-initialize dst to zero with mov.b32 dst, 0 and then conditionally overwrite with a predicated load.

Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user. Example: @%p1 ld.global.u32 %r5, [%rd9];

Fields

§dst: Register

Destination register (unchanged when predicate is false).

§addr: Register

Register holding the memory address.

§ty: PtxType

PTX type of the loaded value.

§pred: Register

Predicate register controlling the load.

§negate: bool

When true, negate the predicate (@!pred).

§

LdGlobalB128

128-bit vectorized load from global memory: ld.global.v4.b32 {%r_i, %r_j, %r_k, %r_l}, [addr];

Single-instruction 128-bit transfer into 4 independent b32 destination registers. Halves (or more) the global-load instruction count vs scalar b32 loads for bandwidth-bound kernels. Requires the addr register to hold a 16-byte aligned global-space address — unaligned access will fault at runtime; PTX does not catch this statically.

Destinations are NOT required to be consecutive registers in the allocator — PTX ld.global.v4.b32 accepts any 4 b32 regs in the vector brace list. In practice, allocating 4 regs in sequence produces consecutive indices, which is what callers typically do.

No predicate variant in Sprint 6.7b — edge tiles stay on the existing LdGlobalPred scalar path. A future LdGlobalB128Pred would be additive.

Sprint 6.7b (multi-warp matmul_tc Tile B fast path) is the first user. Construct via MemoryOp::new_ld_global_b128, which validates that all 4 destinations are b32-class registers.

Example: ld.global.v4.b32 {%r0, %r1, %r2, %r3}, [%rd8];

Fields

§dsts: [Register; 4]

Four b32 destination registers — receive bytes 0-3, 4-7, 8-11, 12-15 of the loaded 128-bit value respectively.

§addr: Register

Register holding a 16-B aligned global-space address.

§

StGlobal

Store to global memory: st.global{ty} [addr], src;

Operand order is reversed in PTX — address comes first, value second. This matches PTX convention but is opposite to loads and arithmetic where dst is first.

Example: st.global.f32 [%rd10], %f3;

Fields

§addr: Register

Register holding the memory address.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§

StGlobalPred

Predicated store to global memory: @[!]{pred} st.global{ty} [addr], src;

Skips the store when the predicate evaluates false (or true when negate is set). Used for edge-tile bounds checking on output writes — out-of-bounds threads simply don’t store, leaving the destination memory untouched.

Sprint 6.7 (multi-warp matmul_tc edge tiles) is the first user. Example: @%p1 st.global.f32 [%rd11], %f4;

Fields

§addr: Register

Register holding the memory address.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§pred: Register

Predicate register controlling the store.

§negate: bool

When true, negate the predicate (@!pred).

§

LdShared

Load from shared memory: ld.shared{ty} dst, [addr];

Shared memory is block-scoped SRAM. The addr register holds the offset into the declared shared allocation. Example: ld.shared.f32 %f0, [%r0];

Fields

§dst: Register

Destination register.

§addr: Register

Register holding the shared memory offset.

§ty: PtxType

PTX type of the loaded value.

§

StShared

Store to shared memory: st.shared{ty} [addr], src;

Operand order is reversed in PTX — address first, value second (same convention as StGlobal). Example: st.shared.f32 [%r0], %f1;

Fields

§addr: Register

Register holding the shared memory offset.

§src: Register

Source register (value to store).

§ty: PtxType

PTX type of the stored value.

§

CvtaToGlobal

Convert generic address to global: cvta.to.global.u64 dst, src;

Always .u64 (64-bit address space, matching .address_size 64). Required because ld.param returns generic-space pointers — ld.global needs global-space addresses.

Fields

§dst: Register

Destination register (global-space address).

§src: Register

Source register (generic-space address from ld.param).

§

CpAsyncCaSharedGlobal

Asynchronous global→shared copy, cache-at-all-levels variant: cp.async.ca.shared.global [dst_shared], [src_global], size_bytes;

Issues a non-blocking transfer from global memory into shared memory without tying up registers. The copy is in-flight after this instruction; use CpAsyncCommitGroup to delimit a batch and CpAsyncWaitGroup to synchronize. Requires SM 8.0+ (Ampere).

size_bytes must be one of 4, 8, or 16 (validated at construction via MemoryOp::new_cp_async_ca).

Example: cp.async.ca.shared.global [%r0], [%rd3], 16;

Placement note: cp.async lives in MemoryOp for Sprint 6.2 because semantically it is a memory op. The commit/wait variants are pipeline-state operations and may relocate to a dedicated PipelineOp category in Sprint 6.4 once double-buffering patterns exercise the state machine.

Fields

§dst_shared: Register

Register holding the shared-memory destination offset.

§src_global: Register

Register holding the global-memory source address (.to.global).

§size_bytes: u8

Copy size in bytes: must be 4, 8, or 16.

§

CpAsyncCommitGroup

Commit all pending cp.async operations into a new async group: cp.async.commit_group;

Groups are numbered implicitly from 0 (most-recently committed) upward. Used in conjunction with CpAsyncWaitGroup to block until a specific group completes. Requires SM 8.0+.

§

CpAsyncWaitGroup

Wait until at most n async copy groups remain in-flight: cp.async.wait_group n;

wait_group 0 waits for all outstanding groups to complete (the common one-stage-pipeline case). For double-buffered kernels, wait_group 1 is used to block on the N-1’th group while issuing the N’th. Requires SM 8.0+.

Fields

§n: u8

Number of outstanding groups still permitted after this wait.

Implementations§

Source§

impl MemoryOp

Source

pub fn new_cp_async_ca( dst_shared: Register, src_global: Register, size_bytes: u8, ) -> Self

Construct a CpAsyncCaSharedGlobal, validating the size byte count.

§Panics

Panics if size_bytes is not one of 4, 8, or 16 — the only sizes PTX accepts for cp.async.ca. PTX won’t catch this until ptxas runs, and the error there is cryptic, so we fail loudly at construction time.

Source

pub fn new_ld_global_b128(dsts: [Register; 4], addr: Register) -> Self

Construct an LdGlobalB128, validating that all 4 destinations are b32-class registers.

§Panics

Panics if any destination register is not crate::types::RegKind::R (b32). ld.global.v4.b32 requires 4× 32-bit-wide integer-class destinations; .f / .rd / .h / .hb / .p registers are invalid and ptxas’s error message is cryptic. Fail loudly at construction.

Trait Implementations§

Source§

impl Clone for MemoryOp

Source§

fn clone(&self) -> MemoryOp

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for MemoryOp

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl Emit for MemoryOp

Source§

fn emit(&self, w: &mut PtxWriter) -> Result

Write this node’s PTX representation to the writer.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.