Crate baracuda_cutlass_kernels_sys

§baracuda-cutlass-kernels-sys

Raw extern "C" entry points for compiled CUTLASS template instantiations. You almost certainly want baracuda-cutlass instead: that crate wraps these unsafe calls with typed plans, lifetime-checked device buffers, and a proper Rust API.

Functions in this crate take raw void* pointers, integer dimensions, and a cudaStream_t cast as *mut c_void. They are unsafe because:

They dereference the pointer arguments without bounds-checking.
They assume the pointers are valid device addresses.
They assume the workspace pointer (when non-null) points to at least workspace_bytes of writable device memory.
They assume the stream is a valid CUDA stream owned by the calling thread’s current context.
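Some of these preconditions can be partly checked on the host before a call. The following is a minimal sketch, not part of this crate: `workspace_args_ok` and `operands_non_null` are hypothetical helper names, and neither can prove that a pointer is a valid device address or that the stream is valid. They only catch the null and too-small cases.

```rust
use std::ffi::c_void;

/// Hypothetical guard (not part of this crate): checks the workspace
/// precondition as far as the host can see it. A null workspace is
/// accepted only when the kernel needs zero workspace bytes.
pub fn workspace_args_ok(
    workspace: *mut c_void,
    workspace_bytes: usize,
    required_bytes: usize,
) -> bool {
    if required_bytes == 0 {
        return true;
    }
    !workspace.is_null() && workspace_bytes >= required_bytes
}

/// Hypothetical guard: rejects null operand pointers. This does NOT
/// establish that the pointers are valid device addresses.
pub fn operands_non_null(ptrs: &[*const c_void]) -> bool {
    ptrs.iter().all(|p| !p.is_null())
}
```

A caller would run these checks immediately before the unsafe call; everything they cannot verify remains the caller's responsibility.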

§Status codes

All *_run and *_can_implement functions return an i32 status:

0: success.
1: misaligned operand.
2: invalid problem (e.g. M, N, or K is non-positive).
3: not supported (this kernel doesn’t implement the requested shape).
4: workspace too small or null when required.
5: internal CUTLASS error (typically a kernel launch failure).
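The raw i32 can be mapped to a typed value at the call site. The sketch below assumes nothing beyond the table above; the `Status` enum, its variant names, and the `Unknown` catch-all are illustrative and are not defined by this crate.

```rust
/// Illustrative typed view of the i32 status codes listed above.
/// This enum is an assumption for the example, not a crate item.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value outside the documented 0..=5 range.
    Unknown(i32),
}

impl Status {
    /// Convert a raw return value into a `Status`.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => Status::Success,
            1 => Status::MisalignedOperand,
            2 => Status::InvalidProblem,
            3 => Status::NotSupported,
            4 => Status::WorkspaceTooSmall,
            5 => Status::InternalError,
            other => Status::Unknown(other),
        }
    }

    /// Treat `Success` as `Ok(())` and everything else as an error.
    pub fn into_result(self) -> Result<(), Status> {
        match self {
            Status::Success => Ok(()),
            err => Err(err),
        }
    }
}
```

With this in place, a call site can write `Status::from_raw(code).into_result()?` instead of comparing against bare integers.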

Functions§

baracuda_cutlass_gemm_batched_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for bf16 batched RCR sm_80.
baracuda_cutlass_gemm_batched_bf16_rcr_sm80_run^⚠: bf16 batched GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_batched_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes needed by the bf16 batched RCR sm_80 GEMM.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f16 batched RCR sm_80.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_run^⚠: f16 batched GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_batched_f16_rcr_sm80_workspace_size^⚠: Workspace bytes needed by the f16 batched RCR sm_80 GEMM.
baracuda_cutlass_gemm_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for bf16 RCR sm_80.
baracuda_cutlass_gemm_bf16_rcr_sm80_run^⚠: bf16 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bf16_rcr_sm80_workspace_size^⚠: Workspace size in bytes for bf16 RCR sm_80 GEMM at the given problem size.
baracuda_cutlass_gemm_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for bf16 RRR sm_80.
baracuda_cutlass_gemm_bf16_rrr_sm80_run^⚠: bf16 GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bf16_rrr_sm80_workspace_size^⚠: Workspace size in bytes for bf16 RRR sm_80 GEMM.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for bf16 bias RCR sm_80.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_run^⚠: bf16 bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes needed by the bf16 bias-fused RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for bf16 bias RRR sm_80.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_run^⚠: bf16 bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_bf16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f16 bias RCR sm_80.
baracuda_cutlass_gemm_bias_f16_rcr_sm80_run^⚠: f16 bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_f16_rcr_sm80_workspace_size^⚠: Workspace bytes needed by the f16 bias-fused RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for f16 bias RRR sm_80.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_run^⚠: f16 bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_f16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32_simt_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_f64_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_f64_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_run^⚠: bf16 bias + GELU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes for bf16 bias+GELU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_run^⚠: bf16 bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_bf16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_can_implement^⚠: Pre-launch check for f16 bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_run^⚠: f16 bias + GELU activation GEMM, RCR layout, sm_80. Computes D = gelu(alpha*AB + beta*C + bias_broadcast(N)) using the exact (erf-based) GELU formulation, matching PyTorch’s default nn.GELU().
baracuda_cutlass_gemm_bias_gelu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes for f16 bias+GELU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_can_implement^⚠: Pre-launch check for f16 bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_run^⚠: f16 bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_f16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32_simt_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_f64_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_gelu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+GELU RCR sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_run^⚠: f32 (TF32) bias+GELU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+GELU RRR sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_run^⚠: f32 (TF32) bias+GELU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_gelu_tf32_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_gelu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_run^⚠: bf16 bias + ReLU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes for bf16 bias+ReLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_run^⚠: bf16 bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_bf16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_can_implement^⚠: Pre-launch check for f16 bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_run^⚠: f16 bias + ReLU activation GEMM, RCR layout, sm_80. Computes D = max(alpha*AB + beta*C + bias_broadcast(N), 0).
baracuda_cutlass_gemm_bias_relu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes for f16 bias+ReLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_can_implement^⚠: Pre-launch check for f16 bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_run^⚠: f16 bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_f16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32_simt_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_f64_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_relu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+ReLU RCR sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_run^⚠: f32 (TF32) bias+ReLU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+ReLU RRR sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_run^⚠: f32 (TF32) bias+ReLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_relu_tf32_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_relu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_run^⚠: bf16 bias + SiLU activation GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes for bf16 bias+SiLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_can_implement^⚠: Pre-launch check for bf16 bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_run^⚠: bf16 bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_bf16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_bf16_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_can_implement^⚠: Pre-launch check for f16 bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_run^⚠: f16 bias + SiLU activation GEMM, RCR layout, sm_80. Computes D = silu(alpha*AB + beta*C + bias_broadcast(N)) where silu(x) = x * sigmoid(x). Also known as Swish.
baracuda_cutlass_gemm_bias_silu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes for f16 bias+SiLU RCR sm_80 GEMM.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_can_implement^⚠: Pre-launch check for f16 bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_run^⚠: f16 bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_f16_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f16_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32_simt_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f32_simt_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_can_implement^⚠: CUTLASS GEMM trampoline (implementability check for gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_run^⚠: CUTLASS GEMM trampoline (launch gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_f64_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_f64_rrr_sm80).
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_run^⚠: int8 bias-fused GEMM with optional fused activation.
baracuda_cutlass_gemm_bias_silu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the corresponding _run entry point.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+SiLU RCR sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_run^⚠: f32 (TF32) bias+SiLU GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_can_implement^⚠: Pre-launch check for f32 (TF32) bias+SiLU RRR sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_run^⚠: f32 (TF32) bias+SiLU GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_silu_tf32_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_silu_tf32_rrr_sm80).
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f32 (TF32) bias RCR sm_80.
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_run^⚠: f32 (TF32) bias-fused GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_bias_tf32_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rcr_sm80).
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check for f32 (TF32) bias RRR sm_80.
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_run^⚠: f32 (TF32) bias-fused GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_bias_tf32_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_bias_tf32_rrr_sm80).
baracuda_cutlass_gemm_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f16 RCR sm_80.
baracuda_cutlass_gemm_f16_rcr_sm80_run^⚠: f16 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_f16_rcr_sm80_workspace_size^⚠: Workspace size in bytes for f16 RCR sm_80 GEMM at the given problem size.
baracuda_cutlass_gemm_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for f16 RRR sm_80.
baracuda_cutlass_gemm_f16_rrr_sm80_run^⚠: f16 GEMM, RRR layout, sm_80.
baracuda_cutlass_gemm_f16_rrr_sm80_workspace_size^⚠: Workspace size in bytes for f16 RRR sm_80 GEMM.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f32_simt RCR sm_80.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_run^⚠: f32 GEMM via SIMT (CUDA cores), RCR layout, sm_80. Full-precision counterpart to the TF32 RCR kernel.
baracuda_cutlass_gemm_f32_simt_rcr_sm80_workspace_size^⚠: Workspace size in bytes for f32_simt RCR sm_80 GEMM.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check for f32_simt RRR sm_80.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_run^⚠: f32 GEMM via SIMT (CUDA cores), RRR layout, sm_80.
baracuda_cutlass_gemm_f32_simt_rrr_sm80_workspace_size^⚠: Workspace size in bytes for f32_simt RRR sm_80 GEMM.
baracuda_cutlass_gemm_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check for f64 RCR sm_80.
baracuda_cutlass_gemm_f64_rcr_sm80_run^⚠: f64 GEMM via Ampere FP64 tensor cores, RCR layout, sm_80.
baracuda_cutlass_gemm_f64_rcr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rcr_sm80).
baracuda_cutlass_gemm_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check for f64 RRR sm_80.
baracuda_cutlass_gemm_f64_rrr_sm80_run^⚠: f64 GEMM via Ampere FP64 tensor cores, RRR layout, sm_80.
baracuda_cutlass_gemm_f64_rrr_sm80_workspace_size^⚠: CUTLASS GEMM trampoline (workspace-bytes query for gemm_f64_rrr_sm80).
baracuda_cutlass_gemm_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the s8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_s8_rcr_sm80_run^⚠: Signed-int8 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_s8_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the s8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check for tf32 RCR sm_80.
baracuda_cutlass_gemm_tf32_rcr_sm80_run^⚠: f32 GEMM via TF32 tensor cores, RCR layout, sm_80.
baracuda_cutlass_gemm_tf32_rcr_sm80_workspace_size^⚠: Workspace size in bytes for the tf32 RCR sm_80 GEMM.
baracuda_cutlass_gemm_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check for tf32 RRR sm_80.
baracuda_cutlass_gemm_tf32_rrr_sm80_run^⚠: f32 GEMM via TF32 tensor cores, RRR layout, sm_80.
baracuda_cutlass_gemm_tf32_rrr_sm80_workspace_size^⚠: Workspace size in bytes for the tf32 RRR sm_80 GEMM.
baracuda_cutlass_gemm_u8_rcr_sm80_can_implement^⚠: Pre-launch check for u8 RCR sm_80 GEMM.
baracuda_cutlass_gemm_u8_rcr_sm80_run^⚠: Unsigned-int8 GEMM, RCR layout, sm_80.
baracuda_cutlass_gemm_u8_rcr_sm80_workspace_size^⚠: Workspace size for u8 RCR sm_80 GEMM.
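baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for the bf16 grouped GEMM; see the f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_run^⚠: Launch the bf16 grouped GEMM; see the f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_scratch_bytes^⚠: CUTLASS-internal scratch bytes needed for the bf16 grouped GEMM launch; see the f16 counterpart for documentation.
baracuda_cutlass_grouped_gemm_bf16_rcr_sm80_sufficient^⚠: bf16 grouped GEMM; see the f16 counterpart for documentation.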
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (host-only, no CUDA traffic).
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_run^⚠: Launch the grouped GEMM.
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_scratch_bytes^⚠: CUTLASS-internal scratch bytes needed for the launch.
baracuda_cutlass_grouped_gemm_f16_rcr_sm80_sufficient^⚠: Compute the number of threadblocks to launch for an f16 grouped GEMM with the given per-group (M, N, K) shapes. CUTLASS chooses based on device SM count vs total tile count.
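
The fused-epilogue formulas quoted in the summaries above (D = act(alpha*AB + beta*C + bias_broadcast(N)) with ReLU, SiLU, or erf-based GELU) can be checked against a small host-side reference. The sketch below is not part of this crate and rests on several assumptions: RCR is read in the usual CUTLASS sense (A row-major, B column-major, C and D row-major), the bias is taken to be a length-N vector added to every row, all data is f64 on the host, and erf is a polynomial approximation (Abramowitz and Stegun 7.1.26, absolute error about 1.5e-7) because stable Rust has no built-in erf. Every name in it is hypothetical.

```rust
/// Activation applied in the epilogue (illustrative, not a crate item).
#[derive(Clone, Copy)]
pub enum Activation {
    Identity,
    Relu,
    Silu,
    Gelu,
}

/// erf approximation, Abramowitz & Stegun 7.1.26 (abs. error ~1.5e-7).
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t * (0.254829592
        + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

/// Scalar activation: ReLU, SiLU (x * sigmoid(x)), or exact-form GELU.
pub fn activate(act: Activation, x: f64) -> f64 {
    match act {
        Activation::Identity => x,
        Activation::Relu => x.max(0.0),
        Activation::Silu => x / (1.0 + (-x).exp()),
        Activation::Gelu => 0.5 * x * (1.0 + erf(x / std::f64::consts::SQRT_2)),
    }
}

/// Host reference for D = act(alpha*A*B + beta*C + bias), assumed RCR:
/// A is M x K row-major, B is K x N column-major, C and D are M x N row-major.
#[allow(clippy::too_many_arguments)]
pub fn gemm_bias_act_rcr_ref(
    m: usize, n: usize, k: usize,
    alpha: f64, a: &[f64], b: &[f64],
    beta: f64, c: &[f64], bias: &[f64],
    act: Activation,
) -> Vec<f64> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    assert_eq!(bias.len(), n);
    let mut d = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for p in 0..k {
                // A(i, p) at i*k + p (row-major); B(p, j) at p + j*k (column-major).
                acc += a[i * k + p] * b[p + j * k];
            }
            d[i * n + j] = activate(act, alpha * acc + beta * c[i * n + j] + bias[j]);
        }
    }
    d
}
```

Comparing device output against such a reference needs a tolerance suited to the kernel's element type; f16, bf16, and TF32 results will not match an f64 reference bit for bit.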
