Crate baracuda_cutlass_kernels_sys

§baracuda-cutlass-kernels-sys

Raw extern "C" entry points for compiled CUTLASS template instantiations. You almost certainly want baracuda-cutlass instead — that crate wraps these unsafe calls with typed plans, lifetime-checked device buffers, and a proper Rust API.

Functions in this crate take raw void* pointers, integer dimensions, and a cudaStream_t cast to *mut c_void. They are unsafe because:

  • They dereference the pointer arguments without bounds-checking.
  • They assume the pointers are valid device addresses.
  • They assume the workspace pointer (when non-null) points to at least workspace_bytes of writable device memory.
  • They assume the stream is a valid CUDA stream owned by the calling thread’s current context.
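
To illustrate the workspace precondition, here is a minimal sketch of a host-side check a caller might run before an unsafe call. The `Workspace` struct and `check_workspace` function are hypothetical and not part of this crate. A check like this can only catch a null or undersized buffer; it cannot prove that the pointer is a valid device address or that the stream belongs to the current context.

```rust
use std::ffi::c_void;

/// Hypothetical host-side record of a workspace allocation.
/// (Illustrative only; this type is not defined by the crate.)
pub struct Workspace {
    pub ptr: *mut c_void,
    pub len_bytes: usize,
}

/// Returns `Ok(())` if `ws` can satisfy a request for `required` bytes.
/// On failure it returns 4, the status code this crate documents for
/// "workspace too small or null when required".
pub fn check_workspace(ws: &Workspace, required: usize) -> Result<(), i32> {
    if required == 0 {
        // No workspace is needed, so a null pointer is acceptable.
        return Ok(());
    }
    if ws.ptr.is_null() || ws.len_bytes < required {
        return Err(4);
    }
    Ok(())
}
```

The remaining preconditions in the list above still rest entirely on the caller, which is why the typed wrapper crate is the safer entry point.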

§Status codes

All *_run and *_can_implement functions return an i32 status:

  • 0: success.
  • 1: misaligned operand.
  • 2: invalid problem (e.g. M, N, or K is non-positive).
  • 3: not supported (this kernel doesn’t implement the requested shape).
  • 4: workspace too small or null when required.
  • 5: internal CUTLASS error (typically a kernel launch failure).
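
The status values above can be mapped onto a Rust enum so that callers match on names rather than bare integers. This is a minimal sketch: the `Status` type, its variant names, and the `from_raw` and `into_result` helpers are assumptions made for illustration and are not exported by this crate.

```rust
/// Hypothetical typed view of the i32 status codes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    Internal,
    /// Any value outside the documented 0..=5 range.
    Unknown(i32),
}

impl Status {
    /// Convert a raw return value into a `Status`.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => Status::Success,
            1 => Status::MisalignedOperand,
            2 => Status::InvalidProblem,
            3 => Status::NotSupported,
            4 => Status::WorkspaceTooSmall,
            5 => Status::Internal,
            other => Status::Unknown(other),
        }
    }

    /// Treat `Success` as `Ok(())` and every other status as an error.
    pub fn into_result(self) -> Result<(), Status> {
        match self {
            Status::Success => Ok(()),
            err => Err(err),
        }
    }
}
```

A caller could then write `Status::from_raw(raw).into_result()?` after a `*_run` or `*_can_implement` call, where `raw` is the returned i32.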

Functions§

baracuda_cutlass_gemm_batched_bf16_rcr_sm80_can_implement
Pre-launch implementability check for bf16 batched RCR sm_80.
baracuda_cutlass_gemm_batched_bf16_rcr_sm80_run
bf16 batched GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_batched_bf16_rcr_sm80_workspace_size
Workspace bytes needed by the bf16 batched RCR sm_80 GEMM.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_can_implement
Pre-launch implementability check for f16 batched RCR sm_80.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_run
f16 batched GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_workspace_size
Workspace bytes needed by the f16 batched RCR sm_80 GEMM.
baracuda_cutlass_gemm_bf16_rcr_sm80_can_implement
Pre-launch implementability check for bf16 RCR sm_80.
baracuda_cutlass_gemm_bf16_rcr_sm80_run
bf16 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bf16_rcr_sm80_workspace_size
Workspace size in bytes for bf16 RCR sm_80 GEMM at the given problem size.
baracuda_cutlass_gemm_bf16_rrr_sm80_can_implement
Pre-launch implementability check for bf16 RRR sm_80.
baracuda_cutlass_gemm_bf16_rrr_sm80_run
bf16 GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bf16_rrr_sm80_workspace_size
Workspace size in bytes for bf16 RRR sm_80 GEMM.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_can_implement
Pre-launch implementability check for bf16 bias RCR sm_80.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_run
bf16 bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_workspace_size
Workspace bytes needed by the bf16 bias-fused RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_can_implement
Pre-launch implementability check for bf16 bias RRR sm_80.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_run
bf16 bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_f16_rcr_sm80_can_implement
Pre-launch implementability check for f16 bias RCR sm_80.
baracuda_cutlass_gemm_bias_f16_rcr_sm80_run
f16 bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_f16_rcr_sm80_workspace_size
Workspace bytes needed by the f16 bias-fused RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_can_implement
Pre-launch implementability check for f16 bias RRR sm_80.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_run
f16 bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f64_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_can_implement
Pre-launch check for bf16 bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_run
bf16 bias + GELU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_workspace_size
Workspace bytes for bf16 bias+GELU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_can_implement
Pre-launch check for bf16 bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_run
bf16 bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_can_implement
Pre-launch check for f16 bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_run
f16 bias + GELU activation GEMM, RCR layout, sm_80. Computes D = gelu(alpha*AB + beta*C + bias_broadcast(N)) using the exact (erf-based) GELU formulation, matching PyTorch’s default nn.GELU().
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_workspace_size
Workspace bytes for f16 bias+GELU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_can_implement
Pre-launch check for f16 bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_run
f16 bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_run
f32 (TF32) bias+GELU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_run
f32 (TF32) bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_can_implement
Pre-launch check for bf16 bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_run
bf16 bias + ReLU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_workspace_size
Workspace bytes for bf16 bias+ReLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_can_implement
Pre-launch check for bf16 bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_run
bf16 bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_can_implement
Pre-launch check for f16 bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_run
f16 bias + ReLU activation GEMM, RCR layout, sm_80. Computes D = max(alpha*AB + beta*C + bias_broadcast(N), 0).
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_workspace_size
Workspace bytes for f16 bias+ReLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_can_implement
Pre-launch check for f16 bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_run
f16 bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_run
f32 (TF32) bias+ReLU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_run
f32 (TF32) bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_can_implement
Pre-launch check for bf16 bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_run
bf16 bias + SiLU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_workspace_size
Workspace bytes for bf16 bias+SiLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_can_implement
Pre-launch check for bf16 bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_run
bf16 bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_can_implement
Pre-launch check for f16 bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_run
f16 bias + SiLU activation GEMM, RCR layout, sm_80. Computes D = silu(alpha*AB + beta*C + bias_broadcast(N)) where silu(x) = x * sigmoid(x). Also known as Swish.
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_workspace_size
Workspace bytes for f16 bias+SiLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_can_implement
Pre-launch check for f16 bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_run
f16 bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_can_implement
CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_run
CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_run
int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_workspace_size
Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_run
f32 (TF32) bias+SiLU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_can_implement
Pre-launch check for f32 (TF32) bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_run
f32 (TF32) bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_can_implement
Pre-launch implementability check for f32 (TF32) bias RCR sm_80.
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_run
f32 (TF32) bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_can_implement
Pre-launch implementability check for f32 (TF32) bias RRR sm_80.
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_run
f32 (TF32) bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rrr_sm80).
baracuda_cutlass_gemm_f16_rcr_sm80_can_implement
Pre-launch implementability check for f16 RCR sm_80.
baracuda_cutlass_gemm_f16_rcr_sm80_run
f16 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_f16_rcr_sm80_workspace_size
Workspace size in bytes for f16 RCR sm_80 GEMM at the given problem size.
baracuda_cutlass_gemm_f16_rrr_sm80_can_implement
Pre-launch implementability check for f16 RRR sm_80.
baracuda_cutlass_gemm_f16_rrr_sm80_run
f16 GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_f16_rrr_sm80_workspace_size
Workspace size in bytes for f16 RRR sm_80 GEMM.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check for f32_simt RCR sm_80.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_run
f32 GEMM via SIMT (CUDA cores), RCR layout, sm_80. Full-precision counterpart to the TF32 RCR kernel.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_workspace_size
Workspace size in bytes for f32_simt RCR sm_80 GEMM.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check for f32_simt RRR sm_80.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_run
f32 GEMM via SIMT (CUDA cores), RRR layout, sm_80.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_workspace_size
Workspace size in bytes for f32_simt RRR sm_80 GEMM.
baracuda_cutlass_gemm_f64_rcr_sm80_can_implement
Pre-launch implementability check for f64 RCR sm_80.
baracuda_cutlass_gemm_f64_rcr_sm80_run
f64 GEMM via Ampere FP64 tensor cores, RCR layout, sm_80.
baracuda_cutlass_gemm_f64_rcr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rcr_sm80).
baracuda_cutlass_gemm_f64_rrr_sm80_can_implement
Pre-launch implementability check for f64 RRR sm_80.
baracuda_cutlass_gemm_f64_rrr_sm80_run
f64 GEMM via Ampere FP64 tensor cores, RRR layout, sm_80.
baracuda_cutlass_gemm_f64_rrr_sm80_workspace_size
CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rrr_sm80).
baracuda_cutlass_gemm_s8_rcr_sm80_can_implement
Pre-launch implementability check for the s8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_s8_rcr_sm80_run
Signed-int8 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_s8_rcr_sm80_workspace_size
Workspace size in bytes for the s8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_tf32_rcr_sm80_can_implement
Pre-launch implementability check for tf32 RCR sm_80.
baracuda_cutlass_gemm_tf32_rcr_sm80_run
f32 GEMM via TF32 tensor cores, RCR layout, sm_80.
baracuda_cutlass_gemm_tf32_rcr_sm80_workspace_size
Workspace size in bytes for the tf32 RCR sm_80 GEMM.
baracuda_cutlass_gemm_tf32_rrr_sm80_can_implement
Pre-launch implementability check for tf32 RRR sm_80.
baracuda_cutlass_gemm_tf32_rrr_sm80_run
f32 GEMM via TF32 tensor cores, RRR layout, sm_80.
baracuda_cutlass_gemm_tf32_rrr_sm80_workspace_size
Workspace size in bytes for the tf32 RRR sm_80 GEMM.
baracuda_cutlass_gemm_u8_rcr_sm80_can_implement
Pre-launch check for u8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_u8_rcr_sm80_run
Unsigned-int8 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_u8_rcr_sm80_workspace_size
Workspace size in bytes for the u8 RCR sm_80 GEMM.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_can_implement
Pre-launch implementability check for the bf16 grouped GEMM; see the f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_run
Launch the bf16 grouped GEMM; see the f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_scratch_bytes
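CUTLASS-internal scratch bytes needed for the bf16 grouped GEMM launch; see the f16 counterpart for documentation.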
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_sufficient
bf16 grouped GEMM — see f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_can_implement
Pre-launch implementability check (host-only, no CUDA traffic).
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_run
Launch the grouped GEMM.
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_scratch_bytes
CUTLASS-internal scratch bytes needed for the launch.
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_sufficient
Compute the number of threadblocks to launch for an f16 grouped GEMM with the given per-group (M, N, K) shapes. CUTLASS chooses based on device SM count vs total tile count.