§baracuda-cutlass-kernels-sys
Raw extern "C" entry points for compiled CUTLASS template
instantiations. You almost certainly want baracuda-cutlass
instead — that crate wraps these unsafe calls with typed plans,
lifetime-checked device buffers, and a proper Rust API.
Functions in this crate take raw void* pointers, integer dimensions,
and a cudaStream_t cast as *mut c_void. They are unsafe because:
- They dereference the pointer arguments without bounds-checking.
- They assume the pointers are valid device addresses.
- They assume the workspace pointer (when non-null) points to at least
workspace_bytes of writable device memory.
- They assume the stream is a valid CUDA stream owned by the calling thread’s current context.
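Some of the assumptions above can be checked on the host before an unsafe call is made. The following is a minimal sketch of such a check, not code from this crate: `preflight`, `PreflightError`, and the parameter layout are hypothetical names chosen for illustration, and the check cannot prove that a pointer is a valid device address or that the stream belongs to the current context.

```rust
use std::ffi::c_void;

/// Hypothetical error type for host-side checks (not part of this crate).
#[derive(Debug, PartialEq, Eq)]
pub enum PreflightError {
    /// M, N, or K was zero or negative.
    NonPositiveDimension,
    /// One of the operand pointers handed in by the caller was null.
    NullOperand,
    /// A workspace is required but the pointer was null.
    WorkspaceNull { required: usize },
    /// A workspace was supplied but is smaller than required.
    WorkspaceTooSmall { required: usize, provided: usize },
}

/// Host-side sanity checks that a wrapper could run before an unsafe call.
/// This only rejects obviously bad arguments; it never dereferences a pointer.
pub fn preflight(
    dims: (i32, i32, i32),
    operands: &[*const c_void],
    workspace: *mut c_void,
    workspace_bytes: usize,
    required_bytes: usize,
) -> Result<(), PreflightError> {
    let (m, n, k) = dims;
    if m <= 0 || n <= 0 || k <= 0 {
        return Err(PreflightError::NonPositiveDimension);
    }
    if operands.iter().any(|p| p.is_null()) {
        return Err(PreflightError::NullOperand);
    }
    // A workspace only matters when the kernel reports a non-zero requirement.
    if required_bytes > 0 {
        if workspace.is_null() {
            return Err(PreflightError::WorkspaceNull { required: required_bytes });
        }
        if workspace_bytes < required_bytes {
            return Err(PreflightError::WorkspaceTooSmall {
                required: required_bytes,
                provided: workspace_bytes,
            });
        }
    }
    Ok(())
}
```

In this sketch `required_bytes` would come from the matching `*_workspace_size` query. Passing the check does not make the subsequent call safe; the remaining obligations still rest on the caller.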
§Status codes
All *_run and *_can_implement functions return an i32 status:
- 0: success.
- 1: misaligned operand.
- 2: invalid problem (e.g. M, N, or K is non-positive).
- 3: not supported (this kernel doesn’t implement the requested shape).
- 4: workspace too small or null when required.
- 5: internal CUTLASS error (typically a kernel launch failure).
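Callers that use these entry points directly may find it convenient to turn the raw i32 into a `Result`. The sketch below shows one way to do that; `KernelError` and `check_status` are illustrative names and are not claimed to exist in this crate or in baracuda-cutlass. Values outside the documented range are preserved rather than discarded.

```rust
/// Hypothetical typed view of the documented non-zero status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelError {
    /// Status 1: misaligned operand.
    MisalignedOperand,
    /// Status 2: invalid problem (e.g. non-positive M, N, or K).
    InvalidProblem,
    /// Status 3: the kernel does not implement the requested shape.
    NotSupported,
    /// Status 4: workspace too small, or null when one is required.
    WorkspaceInsufficient,
    /// Status 5: internal CUTLASS error (typically a launch failure).
    Internal,
    /// Any value outside the documented 0..=5 range.
    Unknown(i32),
}

/// Map a raw status code to `Ok(())` for 0 and a typed error otherwise.
pub fn check_status(code: i32) -> Result<(), KernelError> {
    match code {
        0 => Ok(()),
        1 => Err(KernelError::MisalignedOperand),
        2 => Err(KernelError::InvalidProblem),
        3 => Err(KernelError::NotSupported),
        4 => Err(KernelError::WorkspaceInsufficient),
        5 => Err(KernelError::Internal),
        other => Err(KernelError::Unknown(other)),
    }
}
```

With a helper like this, the return value of a `*_can_implement` or `*_run` call can be propagated with `?` instead of being compared against magic numbers.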
§Functions
- baracuda_cutlass_gemm_batched_bf16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for bf16 batched RCR sm_80.
- baracuda_cutlass_gemm_batched_bf16_rcr_sm80_run ⚠ - bf16 batched GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_batched_bf16_rcr_sm80_workspace_size ⚠ - Workspace bytes needed by the bf16 batched RCR sm_80 GEMM.
- baracuda_cutlass_gemm_batched_f16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for f16 batched RCR sm_80.
- baracuda_cutlass_gemm_batched_f16_rcr_sm80_run ⚠ - f16 batched GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_batched_f16_rcr_sm80_workspace_size ⚠ - Workspace bytes needed by the f16 batched RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bf16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for bf16 RCR sm_80.
- baracuda_cutlass_gemm_bf16_rcr_sm80_run ⚠ - bf16 GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bf16_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for bf16 RCR sm_80 GEMM at the given problem size.
- baracuda_cutlass_gemm_bf16_rrr_sm80_can_implement ⚠ - Pre-launch implementability check for bf16 RRR sm_80.
- baracuda_cutlass_gemm_bf16_rrr_sm80_run ⚠ - bf16 GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bf16_rrr_sm80_workspace_size ⚠ - Workspace size in bytes for bf16 RRR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_bf16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for bf16 bias RCR sm_80.
- baracuda_cutlass_gemm_bias_bf16_rcr_sm80_run ⚠ - bf16 bias-fused GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_bf16_rcr_sm80_workspace_size ⚠ - Workspace bytes needed by the bf16 bias-fused RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_bf16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_bf16_rrr_sm80_run ⚠ - bf16 bias-fused GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_bf16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_bf16_rrr_sm80).
- baracuda_cutlass_gemm_bias_f16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for f16 bias RCR sm_80.
- baracuda_cutlass_gemm_bias_f16_rcr_sm80_run ⚠ - f16 bias-fused GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_f16_rcr_sm80_workspace_size ⚠ - Workspace bytes needed by the f16 bias-fused RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_f16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_f16_rrr_sm80_run ⚠ - f16 bias-fused GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_f16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f16_rrr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_f64_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_f64_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_f64_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_f64_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_f64_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_f64_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_can_implement ⚠ - Pre-launch check for bf16 bias+GELU RCR sm_80.
- baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_run ⚠ - bf16 bias + GELU activation GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_workspace_size ⚠ - Workspace bytes for bf16 bias+GELU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_run ⚠ - bf16 bias+GELU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_bf16_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_can_implement ⚠ - Pre-launch check for f16 bias+GELU RCR sm_80.
- baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_run ⚠ - f16 bias + GELU activation GEMM, RCR layout, sm_80. Computes D = gelu(alpha*AB + beta*C + bias_broadcast(N)) using the exact (erf-based) GELU formulation, matching PyTorch’s default nn.GELU().
- baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_workspace_size ⚠ - Workspace bytes for f16 bias+GELU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_run ⚠ - f16 bias+GELU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f16_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_run ⚠ - f32 (TF32) bias+GELU GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rcr_sm80).
- baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_run ⚠ - f32 (TF32) bias+GELU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rrr_sm80).
- baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_can_implement ⚠ - Pre-launch check for bf16 bias+ReLU RCR sm_80.
- baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_run ⚠ - bf16 bias + ReLU activation GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_workspace_size ⚠ - Workspace bytes for bf16 bias+ReLU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_run ⚠ - bf16 bias+ReLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_bf16_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_can_implement ⚠ - Pre-launch check for f16 bias+ReLU RCR sm_80.
- baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_run ⚠ - f16 bias + ReLU activation GEMM, RCR layout, sm_80. Computes D = max(alpha*AB + beta*C + bias_broadcast(N), 0).
- baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_workspace_size ⚠ - Workspace bytes for f16 bias+ReLU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_run ⚠ - f16 bias+ReLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f16_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_run ⚠ - f32 (TF32) bias+ReLU GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rcr_sm80).
- baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_run ⚠ - f32 (TF32) bias+ReLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_can_implement ⚠ - Pre-launch check for bf16 bias+SiLU RCR sm_80.
- baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_run ⚠ - bf16 bias + SiLU activation GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_workspace_size ⚠ - Workspace bytes for bf16 bias+SiLU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_run ⚠ - bf16 bias+SiLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_bf16_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_can_implement ⚠ - Pre-launch check for f16 bias+SiLU RCR sm_80.
- baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_run ⚠ - f16 bias + SiLU activation GEMM, RCR layout, sm_80. Computes D = silu(alpha*AB + beta*C + bias_broadcast(N)) where silu(x) = x * sigmoid(x). Also known as Swish.
- baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_workspace_size ⚠ - Workspace bytes for f16 bias+SiLU RCR sm_80 GEMM.
- baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_run ⚠ - f16 bias+SiLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f16_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_can_implement ⚠ - CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_run ⚠ - CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rrr_sm80).
- baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_run ⚠ - int8 bias-fused GEMM with optional fused activation.
- baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the corresponding _run entry point.
- baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_run ⚠ - f32 (TF32) bias+SiLU GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rcr_sm80).
- baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_run ⚠ - f32 (TF32) bias+SiLU GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rrr_sm80).
- baracuda_cutlass_gemm_bias_tf32_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_tf32_rcr_sm80_run ⚠ - f32 (TF32) bias-fused GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_bias_tf32_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rcr_sm80).
- baracuda_cutlass_gemm_bias_tf32_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_bias_tf32_rrr_sm80_run ⚠ - f32 (TF32) bias-fused GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_bias_tf32_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rrr_sm80).
- baracuda_cutlass_gemm_f16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for f16 RCR sm_80.
- baracuda_cutlass_gemm_f16_rcr_sm80_run ⚠ - f16 GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_f16_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for f16 RCR sm_80 GEMM at the given problem size.
- baracuda_cutlass_gemm_f16_rrr_sm80_can_implement ⚠ - Pre-launch implementability check for f16 RRR sm_80.
- baracuda_cutlass_gemm_f16_rrr_sm80_run ⚠ - f16 GEMM, RRR layout, sm_80.
- baracuda_cutlass_gemm_f16_rrr_sm80_workspace_size ⚠ - Workspace size in bytes for f16 RRR sm_80 GEMM.
- baracuda_cutlass_gemm_f32_simt_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_f32_simt_rcr_sm80_run ⚠ - f32 GEMM via SIMT (CUDA cores), RCR layout, sm_80. Full-precision counterpart to the TF32 RCR kernel.
- baracuda_cutlass_gemm_f32_simt_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for f32_simt RCR sm_80 GEMM.
- baracuda_cutlass_gemm_f32_simt_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_f32_simt_rrr_sm80_run ⚠ - f32 GEMM via SIMT (CUDA cores), RRR layout, sm_80.
- baracuda_cutlass_gemm_f32_simt_rrr_sm80_workspace_size ⚠ - Workspace size in bytes for f32_simt RRR sm_80 GEMM.
- baracuda_cutlass_gemm_f64_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_f64_rcr_sm80_run ⚠ - f64 GEMM via Ampere FP64 tensor cores, RCR layout, sm_80.
- baracuda_cutlass_gemm_f64_rcr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rcr_sm80).
- baracuda_cutlass_gemm_f64_rrr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_gemm_f64_rrr_sm80_run ⚠ - f64 GEMM via Ampere FP64 tensor cores, RRR layout, sm_80.
- baracuda_cutlass_gemm_f64_rrr_sm80_workspace_size ⚠ - CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rrr_sm80).
- baracuda_cutlass_gemm_s8_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for the s8 RCR sm_80 GEMM.
- baracuda_cutlass_gemm_s8_rcr_sm80_run ⚠ - Signed-int8 GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_s8_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the s8 RCR sm_80 GEMM.
- baracuda_cutlass_gemm_tf32_rcr_sm80_can_implement ⚠ - Pre-launch implementability check for tf32 RCR sm_80.
- baracuda_cutlass_gemm_tf32_rcr_sm80_run ⚠ - f32 GEMM via TF32 tensor cores, RCR layout, sm_80.
- baracuda_cutlass_gemm_tf32_rcr_sm80_workspace_size ⚠ - Workspace size in bytes for the tf32 RCR sm_80 GEMM.
- baracuda_cutlass_gemm_tf32_rrr_sm80_can_implement ⚠ - Pre-launch implementability check for tf32 RRR sm_80.
- baracuda_cutlass_gemm_tf32_rrr_sm80_run ⚠ - f32 GEMM via TF32 tensor cores, RRR layout, sm_80.
- baracuda_cutlass_gemm_tf32_rrr_sm80_workspace_size ⚠ - Workspace size in bytes for the tf32 RRR sm_80 GEMM.
- baracuda_cutlass_gemm_u8_rcr_sm80_can_implement ⚠ - Pre-launch check for u8 RCR sm_80 GEMM.
- baracuda_cutlass_gemm_u8_rcr_sm80_run ⚠ - Unsigned-int8 GEMM, RCR layout, sm_80.
- baracuda_cutlass_gemm_u8_rcr_sm80_workspace_size ⚠ - Workspace size for u8 RCR sm_80 GEMM.
- baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_can_implement ⚠ - Safety
- baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_run ⚠ - Safety
- baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_scratch_bytes ⚠ - Safety
- baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_sufficient ⚠ - bf16 grouped GEMM; see the f16 counterpart for documentation.
- baracuda_cutlass_grouped_gemm_f16_rcr_sm80_can_implement ⚠ - Pre-launch implementability check (host-only, no CUDA traffic).
- baracuda_cutlass_grouped_gemm_f16_rcr_sm80_run ⚠ - Launch the grouped GEMM.
- baracuda_cutlass_grouped_gemm_f16_rcr_sm80_scratch_bytes ⚠ - CUTLASS-internal scratch bytes needed for the launch.
- baracuda_cutlass_grouped_gemm_f16_rcr_sm80_sufficient ⚠ - Compute the number of threadblocks to launch for an f16 grouped GEMM with the given per-group (M, N, K) shapes. CUTLASS chooses based on device SM count vs total tile count.
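
The fused-epilogue formulas quoted in the bias+ReLU and bias+SiLU entries, D = act(alpha*AB + beta*C + bias_broadcast(N)), can be cross-checked against a small CPU reference. The sketch below is not part of this crate and makes several assumptions: RCR is read in the usual CUTLASS sense (A row-major, B column-major, C and D row-major), everything is computed in f32 rather than f16 or bf16, and `Activation` and `gemm_bias_act_rcr_ref` are hypothetical names. GELU is left out because the exact formulation needs an erf implementation from outside the standard library.

```rust
/// Hypothetical activation selector for the reference epilogue.
#[derive(Debug, Clone, Copy)]
pub enum Activation {
    Identity,
    Relu,
    Silu,
}

fn apply(act: Activation, x: f32) -> f32 {
    match act {
        Activation::Identity => x,
        Activation::Relu => x.max(0.0),
        // silu(x) = x * sigmoid(x) = x / (1 + e^(-x))
        Activation::Silu => x / (1.0 + (-x).exp()),
    }
}

/// CPU reference for D = act(alpha * A*B + beta * C + bias), assuming:
/// A is M x K row-major, B is K x N column-major, C and D are M x N
/// row-major, and `bias` has length N and is broadcast across rows.
pub fn gemm_bias_act_rcr_ref(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &[f32],
    bias: &[f32],
    act: Activation,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    assert_eq!(bias.len(), n);

    let mut d = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Row i of A is contiguous; column j of B is contiguous.
                acc += a[i * k + p] * b[j * k + p];
            }
            d[i * n + j] = apply(act, alpha * acc + beta * c[i * n + j] + bias[j]);
        }
    }
    d
}
```

A reference like this is only useful for small shapes and loose tolerances, since the device kernels accumulate in a different order and at a different precision.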